Reinforcement Learning

Basic Concepts

  • State: The status of the agent with respect to the environment.

  • State Space: The set of all states, denoted $\mathcal{S} = \{s_1, s_2, \dots, s_n\}$.

  • Action: one of the moves the agent can take at a state, denoted $a$.

  • Action Space of a State: the set of all possible actions at a state, denoted $\mathcal{A}(s)$.

  • State Transition: when taking an action, the agent moves from one state to another, e.g. $s_1 \xrightarrow{a_2} s_2$.

  • Using probability to describe a state transition, for example $p(s_2 \mid s_1, a_2) = 1$ (it could be stochastic).

  • Policy: tells the agent what action to take at a state. We can use conditional probability to represent it: $\pi(a \mid s)$ is the probability of taking action $a$ at state $s$, and $\pi$ represents a policy.

  • Reward: after taking an action, we get a real number $r$. Math: $p(r \mid s, a)$ (it could be stochastic).

  • Trajectory: a state-action-reward chain, e.g. $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} s_4 \xrightarrow{a_4} s_5$ with rewards $0, 0, 0, 1$; the return of it is the sum of all the rewards along the chain ($0 + 0 + 0 + 1 = 1$).

  • Discounted return: if a trajectory is infinite, its plain return can also be infinite. Hence, we need to introduce a discount rate $\gamma \in [0, 1)$ and use the discounted return $G = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$ instead (see the numeric sketch after this list).

  • Episode: if the agent stops at some terminal state, the resulting trajectory is called an episode (or a trial).
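
As a quick numeric illustration of return vs. discounted return, here is a minimal Python sketch; the reward sequence $0, 0, 0, 1$ comes from the trajectory above, while the discount rate $\gamma = 0.9$ is an assumed value:

```python
# Minimal sketch: plain return vs. discounted return of a finite trajectory.
rewards = [0, 0, 0, 1]   # rewards collected along the example trajectory
gamma = 0.9              # assumed discount rate, any value in [0, 1)

plain_return = sum(rewards)
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))

print(plain_return)       # 1
print(discounted_return)  # 0.9**3 * 1 = 0.729
```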

Markov Decision Process

There are some key elements in an MDP:

  • Sets (state set $\mathcal{S}$; action set $\mathcal{A}(s)$, $s \in \mathcal{S}$; reward set $\mathcal{R}(s, a)$)

  • Probability Distributions

    • State transition probability: $p(s' \mid s, a)$
    • Reward probability: $p(r \mid s, a)$
  • Policy: at state $s$, the probability of choosing action $a$ is $\pi(a \mid s)$

  • Markov Property: the memoryless property, i.e. $p(s_{t+1} \mid s_t, a_t, \dots, s_0, a_0) = p(s_{t+1} \mid s_t, a_t)$, and similarly for rewards. A small data-structure sketch of these elements follows this list.
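
To make the elements above concrete, here is a minimal Python sketch of a tiny two-state MDP; the states, actions, and numbers are assumptions chosen for illustration, not something from these notes:

```python
# State set and action set of an assumed two-state MDP.
states = ["s1", "s2"]
actions = ["a1", "a2"]

# State transition probability p(s' | s, a): one distribution per (s, a) pair.
transition = {
    ("s1", "a1"): {"s1": 1.0, "s2": 0.0},
    ("s1", "a2"): {"s1": 0.0, "s2": 1.0},
    ("s2", "a1"): {"s1": 1.0, "s2": 0.0},
    ("s2", "a2"): {"s1": 0.0, "s2": 1.0},
}

# Reward probability p(r | s, a): here each (s, a) yields one reward with probability 1.
reward = {
    ("s1", "a1"): {0.0: 1.0},
    ("s1", "a2"): {1.0: 1.0},
    ("s2", "a1"): {0.0: 1.0},
    ("s2", "a2"): {0.0: 1.0},
}

# Policy pi(a | s): a stochastic (here uniform) policy.
policy = {
    "s1": {"a1": 0.5, "a2": 0.5},
    "s2": {"a1": 0.5, "a2": 0.5},
}
```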

State Value

  • Consider a single-step process: $S_t \xrightarrow{A_t} S_{t+1}, R_{t+1}$.
    For this process, there are some probability distributions:

    1. $S_t \to A_t$ is governed by the policy $\pi(A_t = a \mid S_t = s)$
    2. $S_t, A_t \to R_{t+1}$ is governed by the reward probability $p(R_{t+1} = r \mid S_t = s, A_t = a)$
    3. $S_t, A_t \to S_{t+1}$ is governed by the state transition probability $p(S_{t+1} = s' \mid S_t = s, A_t = a)$
  • Consider a multi-step process: $S_t \xrightarrow{A_t} S_{t+1}, R_{t+1} \xrightarrow{A_{t+1}} S_{t+2}, R_{t+2} \xrightarrow{A_{t+2}} \cdots$.
    The discounted return is $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, where $\gamma \in [0, 1)$ is the discount rate.

  • Now, we can introduce the state value. It is defined as the expectation of $G_t$:

    $$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s].$$

    Here, we should note that if the trajectory starting from $s$ is deterministic, the state value would be equal to the return. A Monte Carlo estimation sketch follows this list.
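
Since $v_\pi(s)$ is an expectation, it can be approximated by averaging sampled discounted returns. Below is a minimal Python sketch; the two-state MDP, the uniform policy, and the truncation horizon are all assumptions for illustration:

```python
import random

gamma = 0.9  # assumed discount rate

# An assumed tiny MDP with deterministic transitions and rewards.
transition = {("s1", "a1"): "s1", ("s1", "a2"): "s2",
              ("s2", "a1"): "s1", ("s2", "a2"): "s2"}
reward = {("s1", "a1"): 0.0, ("s1", "a2"): 1.0,
          ("s2", "a1"): 0.0, ("s2", "a2"): 0.0}
policy = {"s1": {"a1": 0.5, "a2": 0.5},   # pi(a | s), uniform here
          "s2": {"a1": 0.5, "a2": 0.5}}

def sample_return(s, horizon=200):
    """Sample one trajectory from state s and compute its truncated discounted return."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        acts = list(policy[s])
        a = random.choices(acts, weights=[policy[s][x] for x in acts])[0]
        g += discount * reward[(s, a)]
        discount *= gamma
        s = transition[(s, a)]
    return g

# v_pi("s1") is approximately the average of many sampled returns.
samples = [sample_return("s1") for _ in range(2000)]
print(sum(samples) / len(samples))
```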

Bellman Equation

  • It describes the relationship among the values of all states.
    I will also use the above multi-step process as an example. The return can be written as

    $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) = R_{t+1} + \gamma G_{t+1}.$$

Then, substitute it into the definition of the state value:

$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\,\mathbb{E}[G_{t+1} \mid S_t = s].$$

Now, let’s calculate the first term, which is the expected immediate reward after taking an action from state $s$:

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\,\mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r.$$

When it comes to the second term, it represents the expectation of the future rewards. Using the Markov property, $G_{t+1}$ depends only on the next state $S_{t+1}$:

$$\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) = \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s).$$

Now we can combine the first term with the second term to obtain the Bellman equation:

$$v_\pi(s) = \sum_a \pi(a \mid s)\left[\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')\right], \quad \forall s \in \mathcal{S}.$$

We can also convert the Bellman equation to its matrix form.
Let’s define $r_\pi(s) = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$ and $p_\pi(s' \mid s) = \sum_a \pi(a \mid s)\, p(s' \mid s, a)$. Then, we can rewrite the equation as

$$v_\pi = r_\pi + \gamma P_\pi v_\pi,$$

where $v_\pi = [v_\pi(s_1), \dots, v_\pi(s_n)]^\top$, $r_\pi = [r_\pi(s_1), \dots, r_\pi(s_n)]^\top$, and $[P_\pi]_{ij} = p_\pi(s_j \mid s_i)$.

So far, we have used the Bellman equation to represent the relationship between the values of different states. How do we solve for the state values?

  • Method 1: Directly calculate the closed-form solution $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, but it needs a matrix inverse.
  • Method 2: We can also use the iterative method $v_{k+1} = r_\pi + \gamma P_\pi v_k$, $k = 0, 1, 2, \dots$ (a code sketch of both methods follows this list). Here, we can briefly prove the feasibility of the iterative method.
    Define the error $\delta_k = v_k - v_\pi$. Assume $v_{k+1} = \delta_{k+1} + v_\pi$ and $v_k = \delta_k + v_\pi$, substituting them into the iteration equation and using $v_\pi = r_\pi + \gamma P_\pi v_\pi$:

    $$\delta_{k+1} + v_\pi = r_\pi + \gamma P_\pi(\delta_k + v_\pi) \;\Longrightarrow\; \delta_{k+1} = \gamma P_\pi \delta_k = \gamma^2 P_\pi^2 \delta_{k-1} = \cdots = \gamma^{k+1} P_\pi^{k+1} \delta_0.$$

    Because every entry of $P_\pi^{k+1}$ lies in $[0, 1]$ and $\gamma < 1$, we have $\gamma^{k+1} P_\pi^{k+1} \delta_0 \to 0$. Hence, $v_k \to v_\pi$ as $k \to \infty$.
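
Below is a minimal Python sketch comparing the two methods; the policy-averaged quantities $r_\pi$ and $P_\pi$ are assumed numbers for a two-state example:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5],   # p_pi(s' | s); each row sums to 1
                 [0.5, 0.5]])
r_pi = np.array([0.5, 0.0])    # expected immediate reward under pi

# Method 1: closed-form solution v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_closed = np.linalg.inv(np.eye(2) - gamma * P_pi) @ r_pi

# Method 2: fixed-point iteration v_{k+1} = r_pi + gamma * P_pi v_k.
v = np.zeros(2)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(v_closed, v)  # the two results agree up to numerical precision
```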

Action Value

  • We use $q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$ to represent the action value. Two equations describe the relationship between state value and action value:

    $$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a), \qquad q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s').$$
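
As an illustration of the second equation, here is a minimal Python sketch that computes $q_\pi(s, a)$ from given state values; the MDP arrays and the state values are assumed numbers, not something from these notes:

```python
import numpy as np

# q(s, a) = sum_r p(r|s,a) * r + gamma * sum_{s'} p(s'|s,a) * v(s')
gamma = 0.9
r_sa = np.array([[0.0, 1.0],        # expected reward for each (state, action)
                 [0.0, 0.0]])
P = np.array([[[1.0, 0.0],          # P[s, a, s'] = p(s' | s, a)
               [0.0, 1.0]],
              [[1.0, 0.0],
               [0.0, 1.0]]])
v = np.array([2.0, 1.0])            # assumed state values v_pi(s)

q = r_sa + gamma * np.einsum("sap,p->sa", P, v)
print(q)   # q[s, a] for every state-action pair
```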

Bellman Optimality Equation

  • It is the Bellman equation with a maximization over policies, satisfied by the optimal state value:

    $$v(s) = \max_{\pi} \sum_a \pi(a \mid s)\left[\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v(s')\right], \quad \forall s \in \mathcal{S}.$$

  • We can also write the matrix form: $v = \max_{\pi}\left(r_\pi + \gamma P_\pi v\right)$.
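
Because the maximization couples all states, the equation is usually solved by a fixed-point iteration (value iteration). Here is a minimal Python sketch on an assumed two-state MDP; the arrays are illustrative numbers, not something from these notes:

```python
import numpy as np

# Value iteration: v_{k+1}(s) = max_a [ r(s,a) + gamma * sum_{s'} p(s'|s,a) v_k(s') ].
gamma = 0.9
r_sa = np.array([[0.0, 1.0],        # expected reward for each (state, action)
                 [0.0, 0.0]])
P = np.array([[[1.0, 0.0],          # P[s, a, s'] = p(s' | s, a)
               [0.0, 1.0]],
              [[1.0, 0.0],
               [0.0, 1.0]]])

v = np.zeros(2)
for _ in range(1000):
    q = r_sa + gamma * np.einsum("sap,p->sa", P, v)  # q_k(s, a)
    v = q.max(axis=1)                                # v_{k+1}(s) = max_a q_k(s, a)

greedy_policy = q.argmax(axis=1)   # the maximizing action at each state
print(v, greedy_policy)
```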