Reinforcement Learning

Basic Concepts

  • State: The status of the agent with respect to the environment.

  • State Space: The set of all states, denoted $\mathcal{S} = \{s_1, s_2, \dots, s_n\}$.

  • Action: one of the moves the agent can take at a state, denoted $a$.

  • Action Space of a State: the set of all possible actions at a state, denoted $\mathcal{A}(s)$.

  • State Transition: when taking an action, the agent moves from one state to another, e.g. $s_1 \xrightarrow{a_2} s_2$.

  • Using probability to describe a state transition, for example $p(s_2 \mid s_1, a_2) = 1$ (it could be stochastic).

  • Policy: tells the agent what action to take at a state. We can use conditional probability to represent it: $\pi(a \mid s)$ is the probability of taking action $a$ at state $s$, and $\pi$ represents a policy.

  • Reward: after taking an action, we get a real number $r$. Math: $p(r \mid s, a)$ (it could be stochastic).

  • Trajectory: a state-action-reward chain, e.g. $s_1 \xrightarrow{a_1} s_2 \xrightarrow{a_2} s_3 \xrightarrow{a_3} s_4 \xrightarrow{a_4} s_5$ with rewards $0, 0, 0, 1$; the return of it is the sum of all the rewards along the chain ($0 + 0 + 0 + 1 = 1$).

  • Discounted return: if a trajectory is infinite, its plain return can also be infinite. Hence, we need to introduce a discount rate $\gamma \in [0, 1)$ and use the discounted return $G = r_1 + \gamma r_2 + \gamma^2 r_3 + \cdots$ instead (see the numeric sketch after this list).

  • Episode: if the agent stops at some terminal state, the resulting trajectory is called an episode (or a trial).
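
As a quick numeric illustration of return vs. discounted return, here is a minimal Python sketch; the reward sequence $0, 0, 0, 1$ comes from the trajectory above, while the discount rate $\gamma = 0.9$ is an assumed value:

```python
# Minimal sketch: plain return vs. discounted return of a finite trajectory.
rewards = [0, 0, 0, 1]   # rewards collected along the example trajectory
gamma = 0.9              # assumed discount rate, any value in [0, 1)

plain_return = sum(rewards)
discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))

print(plain_return)       # 1
print(discounted_return)  # 0.9**3 * 1 = 0.729
```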

Markov Decision Process

There are some key elements in an MDP:

  • Sets (state set $\mathcal{S}$; action set $\mathcal{A}(s)$, $s \in \mathcal{S}$; reward set $\mathcal{R}(s, a)$)

  • Probability Distributions

    • State transition probability: $p(s' \mid s, a)$
    • Reward probability: $p(r \mid s, a)$
  • Policy: at state $s$, the probability of choosing action $a$ is $\pi(a \mid s)$

  • Markov Property: the memoryless property, i.e. $p(s_{t+1} \mid s_t, a_t, \dots, s_0, a_0) = p(s_{t+1} \mid s_t, a_t)$, and similarly for rewards. A small data-structure sketch of these elements follows this list.
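
To make the elements above concrete, here is a minimal Python sketch of a tiny two-state MDP; the states, actions, and numbers are assumptions chosen for illustration, not something from these notes:

```python
# State set and action set of an assumed two-state MDP.
states = ["s1", "s2"]
actions = ["a1", "a2"]

# State transition probability p(s' | s, a): one distribution per (s, a) pair.
transition = {
    ("s1", "a1"): {"s1": 1.0, "s2": 0.0},
    ("s1", "a2"): {"s1": 0.0, "s2": 1.0},
    ("s2", "a1"): {"s1": 1.0, "s2": 0.0},
    ("s2", "a2"): {"s1": 0.0, "s2": 1.0},
}

# Reward probability p(r | s, a): here each (s, a) yields one reward with probability 1.
reward = {
    ("s1", "a1"): {0.0: 1.0},
    ("s1", "a2"): {1.0: 1.0},
    ("s2", "a1"): {0.0: 1.0},
    ("s2", "a2"): {0.0: 1.0},
}

# Policy pi(a | s): a stochastic (here uniform) policy.
policy = {
    "s1": {"a1": 0.5, "a2": 0.5},
    "s2": {"a1": 0.5, "a2": 0.5},
}
```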

State Value

  • Consider a single-step process: $S_t \xrightarrow{A_t} S_{t+1}, R_{t+1}$.
    For this process, there are some probability distributions:

    1. $S_t \to A_t$ is governed by the policy $\pi(A_t = a \mid S_t = s)$
    2. $S_t, A_t \to R_{t+1}$ is governed by the reward probability $p(R_{t+1} = r \mid S_t = s, A_t = a)$
    3. $S_t, A_t \to S_{t+1}$ is governed by the state transition probability $p(S_{t+1} = s' \mid S_t = s, A_t = a)$
  • Consider a multi-step process: $S_t \xrightarrow{A_t} S_{t+1}, R_{t+1} \xrightarrow{A_{t+1}} S_{t+2}, R_{t+2} \xrightarrow{A_{t+2}} \cdots$.
    The discounted return is $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$, where $\gamma \in [0, 1)$ is the discount rate.

  • Now, we can introduce the state value. It is defined as the expectation of $G_t$:

    $$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s].$$

    Here, we should note that if the trajectory starting from $s$ is deterministic, the state value would be equal to the return. A Monte Carlo estimation sketch follows this list.
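
Since $v_\pi(s)$ is an expectation, it can be approximated by averaging sampled discounted returns. Below is a minimal Python sketch; the two-state MDP, the uniform policy, and the truncation horizon are all assumptions for illustration:

```python
import random

gamma = 0.9  # assumed discount rate

# An assumed tiny MDP with deterministic transitions and rewards.
transition = {("s1", "a1"): "s1", ("s1", "a2"): "s2",
              ("s2", "a1"): "s1", ("s2", "a2"): "s2"}
reward = {("s1", "a1"): 0.0, ("s1", "a2"): 1.0,
          ("s2", "a1"): 0.0, ("s2", "a2"): 0.0}
policy = {"s1": {"a1": 0.5, "a2": 0.5},   # pi(a | s), uniform here
          "s2": {"a1": 0.5, "a2": 0.5}}

def sample_return(s, horizon=200):
    """Sample one trajectory from state s and compute its truncated discounted return."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        acts = list(policy[s])
        a = random.choices(acts, weights=[policy[s][x] for x in acts])[0]
        g += discount * reward[(s, a)]
        discount *= gamma
        s = transition[(s, a)]
    return g

# v_pi("s1") is approximately the average of many sampled returns.
samples = [sample_return("s1") for _ in range(2000)]
print(sum(samples) / len(samples))
```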

Bellman Equation

  • It describes the relationship among the values of all states.
    I will also use the above multi-step process as an example. The return can be written as

    $$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma (R_{t+2} + \gamma R_{t+3} + \cdots) = R_{t+1} + \gamma G_{t+1}.$$

Then, substitute it into the definition of the state value:

$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s] = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}[R_{t+1} \mid S_t = s] + \gamma\,\mathbb{E}[G_{t+1} \mid S_t = s].$$

Now, let’s calculate the first term, which is the expected immediate reward after taking an action from state $s$:

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s)\,\mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r.$$

When it comes to the second term, it represents the expectation of the future rewards. Using the Markov property, $G_{t+1}$ depends only on the next state $S_{t+1}$:

$$\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) = \sum_{s'} v_\pi(s') \sum_a p(s' \mid s, a)\, \pi(a \mid s).$$

Now we can combine the first term with the second term to obtain the Bellman equation:

$$v_\pi(s) = \sum_a \pi(a \mid s)\left[\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')\right], \quad \forall s \in \mathcal{S}.$$

We can also convert the Bellman equation to its matrix form.
Let’s define $r_\pi(s) = \sum_a \pi(a \mid s) \sum_r p(r \mid s, a)\, r$ and $p_\pi(s' \mid s) = \sum_a \pi(a \mid s)\, p(s' \mid s, a)$. Then, we can rewrite the equation as

$$v_\pi = r_\pi + \gamma P_\pi v_\pi,$$

where $v_\pi = [v_\pi(s_1), \dots, v_\pi(s_n)]^\top$, $r_\pi = [r_\pi(s_1), \dots, r_\pi(s_n)]^\top$, and $[P_\pi]_{ij} = p_\pi(s_j \mid s_i)$.

So far, we have used the Bellman equation to represent the relationship between the values of different states. How do we solve for the state values?

  • Method 1: Directly calculate the closed-form solution $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$, but it needs a matrix inverse.
  • Method 2: We can also use the iterative method $v_{k+1} = r_\pi + \gamma P_\pi v_k$, $k = 0, 1, 2, \dots$ (a code sketch of both methods follows this list). Here, we can briefly prove the feasibility of the iterative method.
    Define the error $\delta_k = v_k - v_\pi$. Assume $v_{k+1} = \delta_{k+1} + v_\pi$ and $v_k = \delta_k + v_\pi$, substituting them into the iteration equation and using $v_\pi = r_\pi + \gamma P_\pi v_\pi$:

    $$\delta_{k+1} + v_\pi = r_\pi + \gamma P_\pi(\delta_k + v_\pi) \;\Longrightarrow\; \delta_{k+1} = \gamma P_\pi \delta_k = \gamma^2 P_\pi^2 \delta_{k-1} = \cdots = \gamma^{k+1} P_\pi^{k+1} \delta_0.$$

    Because every entry of $P_\pi^{k+1}$ lies in $[0, 1]$ and $\gamma < 1$, we have $\gamma^{k+1} P_\pi^{k+1} \delta_0 \to 0$. Hence, $v_k \to v_\pi$ as $k \to \infty$.
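
Below is a minimal Python sketch comparing the two methods; the policy-averaged quantities $r_\pi$ and $P_\pi$ are assumed numbers for a two-state example:

```python
import numpy as np

gamma = 0.9
P_pi = np.array([[0.5, 0.5],   # p_pi(s' | s); each row sums to 1
                 [0.5, 0.5]])
r_pi = np.array([0.5, 0.0])    # expected immediate reward under pi

# Method 1: closed-form solution v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_closed = np.linalg.inv(np.eye(2) - gamma * P_pi) @ r_pi

# Method 2: fixed-point iteration v_{k+1} = r_pi + gamma * P_pi v_k.
v = np.zeros(2)
for _ in range(1000):
    v = r_pi + gamma * P_pi @ v

print(v_closed, v)  # the two results agree up to numerical precision
```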

Action Value

  • We use $q_\pi(s, a) = \mathbb{E}[G_t \mid S_t = s, A_t = a]$ to represent the action value. Two equations describe the relationship between state value and action value:

    $$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a), \qquad q_\pi(s, a) = \sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s').$$
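
As an illustration of the second equation, here is a minimal Python sketch that computes $q_\pi(s, a)$ from given state values; the MDP arrays and the state values are assumed numbers, not something from these notes:

```python
import numpy as np

# q(s, a) = sum_r p(r|s,a) * r + gamma * sum_{s'} p(s'|s,a) * v(s')
gamma = 0.9
r_sa = np.array([[0.0, 1.0],        # expected reward for each (state, action)
                 [0.0, 0.0]])
P = np.array([[[1.0, 0.0],          # P[s, a, s'] = p(s' | s, a)
               [0.0, 1.0]],
              [[1.0, 0.0],
               [0.0, 1.0]]])
v = np.array([2.0, 1.0])            # assumed state values v_pi(s)

q = r_sa + gamma * np.einsum("sap,p->sa", P, v)
print(q)   # q[s, a] for every state-action pair
```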

Bellman Optimality Equation

  • It is the Bellman equation with a maximization over policies, satisfied by the optimal state value:

    $$v(s) = \max_{\pi} \sum_a \pi(a \mid s)\left[\sum_r p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v(s')\right], \quad \forall s \in \mathcal{S}.$$

  • We can also write the matrix form: $v = \max_{\pi}\left(r_\pi + \gamma P_\pi v\right)$.
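
Because the maximization couples all states, the equation is usually solved by a fixed-point iteration (value iteration). Here is a minimal Python sketch on an assumed two-state MDP; the arrays are illustrative numbers, not something from these notes:

```python
import numpy as np

# Value iteration: v_{k+1}(s) = max_a [ r(s,a) + gamma * sum_{s'} p(s'|s,a) v_k(s') ].
gamma = 0.9
r_sa = np.array([[0.0, 1.0],        # expected reward for each (state, action)
                 [0.0, 0.0]])
P = np.array([[[1.0, 0.0],          # P[s, a, s'] = p(s' | s, a)
               [0.0, 1.0]],
              [[1.0, 0.0],
               [0.0, 1.0]]])

v = np.zeros(2)
for _ in range(1000):
    q = r_sa + gamma * np.einsum("sap,p->sa", P, v)  # q_k(s, a)
    v = q.max(axis=1)                                # v_{k+1}(s) = max_a q_k(s, a)

greedy_policy = q.argmax(axis=1)   # the maximizing action at each state
print(v, greedy_policy)
```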