Basic Concepts
State: the status of the agent with respect to the environment, denoted $s$.
State Space: the set of all states, $\mathcal{S} = \{s_i\}$.
Action: a move the agent can take at a state, denoted $a$.
Action Space of a State: the set of all possible actions at state $s$, $\mathcal{A}(s) = \{a_i\}$.
State Transition: by taking an action, the agent moves from one state to another, e.g., $s_1 \xrightarrow{a_2} s_2$. We can use probability to describe state transitions, for example $p(s_2 \mid s_1, a_2) = 1$ (it could be stochastic).
Policy: tells the agent what action to take at a state. We can use a conditional probability, $\pi(a \mid s)$, to represent a policy.
Reward: after taking an action, the agent receives a real number $r$. Mathematically, it is described by $p(r \mid s, a)$ (it could be stochastic).
Trajectory: a state-action-reward chain, e.g., $s_1 \xrightarrow[r=0]{a_1} s_2 \xrightarrow[r=0]{a_2} s_3 \xrightarrow[r=0]{a_3} s_4 \xrightarrow[r=1]{a_4} s_5$. The return of a trajectory is the sum of all its rewards ($0 + 0 + 0 + 1$ here).
Discounted return: if a trajectory is infinite, the plain return is also infinite. Hence, we introduce a discount rate $\gamma \in [0, 1)$ and define the discounted return $G = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$, which stays finite as long as the rewards are bounded and there is a $\gamma < 1$.
Episode: if the agent stops at some terminal state, the resulting (finite) trajectory is called an episode (or a trial).
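To make the return and discounted return concrete, here is a minimal Python sketch that computes both for one sampled reward sequence. The rewards match the $0 + 0 + 0 + 1$ example above, and $\gamma = 0.9$ is an arbitrary illustrative choice.

```python
# Minimal sketch: return vs. discounted return of one sampled trajectory.
# The reward sequence and gamma are illustrative values, not from a real agent.
rewards = [0, 0, 0, 1]   # rewards collected along the trajectory above
gamma = 0.9              # discount rate in [0, 1)

plain_return = sum(rewards)
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))

print(plain_return)       # 1
print(discounted_return)  # 0.9**3 = 0.729
```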
Markov Decision Process
There are some key elements in an MDP:
Sets: the state set $\mathcal{S}$; the action set $\mathcal{A}(s)$ for each state $s$; the reward set $\mathcal{R}(s, a)$.
Probability distributions:
- State transition probability: $p(s' \mid s, a)$
- Reward probability: $p(r \mid s, a)$
Policy: at state $s$, the probability of choosing action $a$ is $\pi(a \mid s)$.
Markov Property: the memoryless property, i.e., $p(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = p(s_{t+1} \mid s_t, a_t)$, and similarly for the reward.
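As a rough illustration of these elements, the following Python sketch writes out a tiny, made-up MDP with two states and two actions. Every name and number here is invented purely for illustration; later sketches reuse this dictionary layout.

```python
# A toy, hypothetical MDP with two states and two actions, written out explicitly.
states = ["s1", "s2"]
actions = ["stay", "move"]

# State transition probability p(s' | s, a)
transition = {
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "move"): {"s2": 1.0},
    ("s2", "stay"): {"s2": 1.0},
    ("s2", "move"): {"s1": 1.0},
}

# Reward model: for simplicity, one deterministic reward per (s, a)
# instead of a full distribution p(r | s, a).
reward = {
    ("s1", "stay"): 0.0,
    ("s1", "move"): 1.0,
    ("s2", "stay"): 0.0,
    ("s2", "move"): 0.0,
}

# Policy pi(a | s)
policy = {
    "s1": {"stay": 0.5, "move": 0.5},
    "s2": {"stay": 1.0, "move": 0.0},
}
```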
State Value
Consider a single-step process: $S_t \xrightarrow{A_t} R_{t+1}, S_{t+1}$.
For this process, there are some probability distributions: the action $A_t$ taken at $S_t$ is governed by the policy $\pi(A_t = a \mid S_t = s)$; the reward $R_{t+1}$ is governed by $p(R_{t+1} = r \mid S_t = s, A_t = a)$; and the next state $S_{t+1}$ is governed by $p(S_{t+1} = s' \mid S_t = s, A_t = a)$.
Consider a multi-step process: $S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} R_{t+3}, \ldots$
The discounted return is
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$
Now, we can introduce the state value. It is defined as the expectation of $G_t$:
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s].$$
Here, we should note that if the trajectory starting from $s$ is deterministic, the state value is equal to the return.
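Since the state value is defined as an expectation of the return, one way to approximate it is to average sampled discounted returns. The sketch below assumes the dictionary layout of the toy MDP sketched earlier (`policy`, `transition`, `reward` with a deterministic reward per state-action pair) and truncates trajectories at a fixed horizon, so it is only an approximation.

```python
import random

# Sketch: estimate v_pi(s) = E[G_t | S_t = s] by averaging sampled discounted returns.
# Assumes the toy-MDP dict layout from the earlier sketch; trajectories are truncated.
def sample_return(s, policy, transition, reward, gamma=0.9, horizon=200):
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        acts = list(policy[s])
        a = random.choices(acts, weights=[policy[s][x] for x in acts])[0]
        g += discount * reward[(s, a)]   # deterministic reward stands in for p(r|s,a)
        discount *= gamma
        dist = transition[(s, a)]
        s = random.choices(list(dist), weights=list(dist.values()))[0]
    return g

def estimate_state_value(s, policy, transition, reward, gamma=0.9, n_samples=5000):
    samples = [sample_return(s, policy, transition, reward, gamma) for _ in range(n_samples)]
    return sum(samples) / len(samples)
```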
Bellman Equation
- It describes the relationship among the values of all states.
We again use the multi-step process above as an example. The return can be written as
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1},$$
then, substituting it into the definition of the state value,
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}[G_{t+1} \mid S_t = s].$$
Now, let's calculate the first term, which is the expected immediate reward after taking an action from state $s$:
$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r.$$
When it comes to the second term, which represents the expectation of future rewards, the Markov property lets the future return depend only on the next state:
$$\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) = \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s).$$
Now we can combine the first term with the second term:
$$v_\pi(s) = \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right], \quad \forall s \in \mathcal{S}.$$
This is the Bellman equation, one equation for every state.
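As a sanity check on the elementwise equation, the sketch below evaluates its right-hand side for a single state, given a current estimate of the state values. It assumes the same toy-MDP dictionary layout as the earlier sketches, with a deterministic reward per state-action pair.

```python
# Sketch: right-hand side of the Bellman equation for a single state s,
#   sum_a pi(a|s) [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ].
# `v` maps each state to its (estimated) state value; dicts follow the toy-MDP sketch.
def bellman_rhs(s, v, policy, transition, reward, gamma=0.9):
    total = 0.0
    for a, pi_a in policy[s].items():
        immediate = reward[(s, a)]  # deterministic reward stands in for sum_r p(r|s,a) r
        future = sum(p * v[s_next] for s_next, p in transition[(s, a)].items())
        total += pi_a * (immediate + gamma * future)
    return total
```

If `v` already equals $v_\pi$, `bellman_rhs(s, v, ...)` returns `v[s]` for every state; otherwise the gap shows how far `v` is from satisfying the equation.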
We can convert the Bellman equation to its matrix-vector form:
$$v_\pi = r_\pi + \gamma P_\pi v_\pi.$$
Let's assume there are $n$ states. Then $v_\pi, r_\pi \in \mathbb{R}^n$ with $[v_\pi]_i = v_\pi(s_i)$ and $[r_\pi]_i = \sum_a \pi(a \mid s_i) \sum_r p(r \mid s_i, a)\, r$, and $P_\pi \in \mathbb{R}^{n \times n}$ is the state transition matrix with $[P_\pi]_{ij} = \sum_a \pi(a \mid s_i)\, p(s_j \mid s_i, a)$.
So far, we have used the Bellman equation to express the relationship between the values of different states. How do we solve for the state values?
- Method 1: Directly calculate the closed-form solution $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, but this requires computing a matrix inverse.
- Method 2: We can also use the iterative method $v_{k+1} = r_\pi + \gamma P_\pi v_k$, $k = 0, 1, 2, \ldots$, starting from an arbitrary $v_0$. Here, we can briefly prove the feasibility of the iterative method.
Define the error $\delta_k = v_k - v_\pi$. Assume $v_{k+1} = \delta_{k+1} + v_\pi$ and $v_k = \delta_k + v_\pi$, and substitute them into the iteration equation:
$$\delta_{k+1} + v_\pi = r_\pi + \gamma P_\pi (\delta_k + v_\pi) \;\Rightarrow\; \delta_{k+1} = \gamma P_\pi \delta_k = \gamma^2 P_\pi^2 \delta_{k-1} = \cdots = \gamma^{k+1} P_\pi^{k+1} \delta_0.$$
Because $0 \le \gamma < 1$ and every entry of $P_\pi^{k+1}$ lies in $[0, 1]$ (since $P_\pi$ is a stochastic matrix), we have $\gamma^{k+1} P_\pi^{k+1} \delta_0 \to 0$. Hence, $\delta_k \to 0$ as $k \to \infty$, i.e., $v_k \to v_\pi$.
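The two methods are easy to compare numerically. The following sketch uses a made-up 3-state chain ($r_\pi$ and $P_\pi$ are invented numbers) and checks that the closed-form solution and the iterative method agree; `np.linalg.solve` stands in for the explicit inverse $(I - \gamma P_\pi)^{-1}$.

```python
import numpy as np

# Made-up 3-state example of v_pi = r_pi + gamma * P_pi v_pi.
gamma = 0.9
r_pi = np.array([0.0, 0.0, 1.0])
P_pi = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
])

# Method 1: closed-form solution (I - gamma * P_pi)^{-1} r_pi.
v_closed = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Method 2: iterative method v_{k+1} = r_pi + gamma * P_pi v_k, starting from v_0 = 0.
v_iter = np.zeros(3)
for _ in range(1000):
    v_iter = r_pi + gamma * P_pi @ v_iter

print(np.allclose(v_iter, v_closed))  # True: both methods give the same v_pi
```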
Action Value
- We use two equations to represent the relationship between state value and action value. Here $q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$ represents the action value, i.e., the expected return after taking action $a$ at state $s$. The two equations are
$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a),$$
$$q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s').$$
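In code, the second equation turns $v_\pi$ into $q_\pi$, and the first recovers $v_\pi$ by averaging over the policy. The sketch below again assumes the toy-MDP dictionary layout used earlier, with a deterministic reward per state-action pair.

```python
# Sketch: q_pi(s, a) = r(s, a) + gamma * sum_s' p(s'|s, a) v_pi(s'),
# and v_pi(s) = sum_a pi(a|s) q_pi(s, a). Dicts follow the toy-MDP sketch.
def action_value(s, a, v, transition, reward, gamma=0.9):
    future = sum(p * v[s_next] for s_next, p in transition[(s, a)].items())
    return reward[(s, a)] + gamma * future

def state_value_from_q(s, v, policy, transition, reward, gamma=0.9):
    return sum(pi_a * action_value(s, a, v, transition, reward, gamma)
               for a, pi_a in policy[s].items())
```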
Bellman Optimality Equation
The Bellman optimality equation is the Bellman equation with the policy chosen to maximize the right-hand side:
$$v(s) = \max_\pi \sum_a \pi(a \mid s) \left[ \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v(s') \right], \quad \forall s \in \mathcal{S}.$$
We can also write the matrix-vector form:
$$v = \max_\pi \left( r_\pi + \gamma P_\pi v \right).$$
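A standard way to solve the Bellman optimality equation numerically is value iteration, i.e., repeatedly applying the maximization on the right-hand side. The sketch below does this for a made-up 2-state, 2-action MDP; all transition probabilities and rewards are invented for illustration.

```python
import numpy as np

# Sketch: value iteration v_{k+1}(s) = max_a [ r(s, a) + gamma * sum_s' p(s'|s, a) v_k(s') ].
gamma = 0.9
P = np.array([                 # P[a, s, s2] = p(s2 | s, a)
    [[1.0, 0.0], [0.0, 1.0]],  # action 0 ("stay")
    [[0.0, 1.0], [1.0, 0.0]],  # action 1 ("move")
])
R = np.array([[0.0, 1.0],      # R[s, a] = expected immediate reward
              [0.0, 0.0]])

v = np.zeros(2)
for _ in range(1000):
    q = R + gamma * np.einsum("asx,x->sa", P, v)  # q[s, a]
    v = q.max(axis=1)                             # maximize over actions

greedy_policy = q.argmax(axis=1)  # greedy (optimal) action index at each state
print(v, greedy_policy)
```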