Policy Gradients — Paper Note

Shamane Siriwardhana
10 min read · Mar 14, 2018

This is a somewhat hard thing to understand, since there is a lot of mathematical intuition and many different terms involved. I really found it difficult even though I had followed David Silver’s RL course. There are many resources for understanding Deep Q-Learning, which was the “BOOM” in Deep Reinforcement Learning. It is mainly an action-value (Q) function approximation method.

The flow of deep Q-learning

This is easy to understand: it is basically policy iteration, where the policy evaluation step is carried out by a neural net. Anyway, in order to learn what policy gradient methods are you need to go a bit deeper, and there’s this nice paper, Policy Gradient Methods for Reinforcement Learning with Function Approximation.

Let’s summarize that paper

Mainly in the Abstract they mention why policy approximation is essential in reinforcement learning. Value function approximation is popular, but it has drawbacks which are explicitly described in the paper, such as:

  1. In value function approximation we do not directly approximate the policy. We estimate state-action values and take the argmax over them. This can add noise, and for some problems it can even diverge; there is no guarantee that value function approximation will converge the RL problem to an optimal policy.
  2. Next there is the problem of continuous-action RL problems. Think of the movement of a robot arm: taking the argmax over the Q values is infeasible there (see the small sketch right after this list).
  3. A change in the value function can send the whole policy off in another direction. This is not desirable, even though there are some smoothing techniques in use.
  4. Value approximation methods are not fully theoretically convergent algorithms (but practically they have given good results, e.g. the Atari games from DeepMind).
  5. Also there may not be much exploration. Since we use ε-greedy methods we might drift into very deterministic policies, whereas for some real-world problems a stochastic policy is the desirable kind of function approximation.
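To make points 2 and 5 concrete, here is a tiny sketch (the numbers are made up, nothing here comes from the paper) contrasting the deterministic argmax over Q values with sampling from a stochastic policy:

```python
import numpy as np

# Made-up Q values and policy probabilities for a single 3-action state,
# purely to contrast the two ways of picking an action.
q_values = np.array([1.2, 1.5, 0.9])       # from a value-function approximator
policy_probs = np.array([0.2, 0.6, 0.2])   # from a policy approximator (sums to 1)

greedy_action = np.argmax(q_values)        # deterministic: always picks action 1
sampled_action = np.random.choice(len(policy_probs), p=policy_probs)  # stochastic

print(greedy_action, sampled_action)
```

With a continuous action (say a joint torque of a robot arm) there is no finite set of Q values to argmax over, which is exactly point 2 above.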

In the first part they clearly mention value function approximation methods like

  1. Sarsa
  2. Q Learning
  3. Dynamic Programming

Next the paper describes the approach. Here there is a neural net which takes the states as inputs and predicts action probabilities as outputs.

Remember, this is the discrete-action scenario. If the action space is continuous, the neural network outputs a mean and a variance, and we can sample the action from that distribution.
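As a rough sketch of what such a policy network could look like (this is my own toy example, not code from the paper; the layer sizes and the tanh hidden layer are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Toy one-hidden-layer policy network; theta is the set (W1, b1, W2, b2).
state_dim, hidden, n_actions = 4, 16, 3
W1, b1 = 0.1 * rng.normal(size=(hidden, state_dim)), np.zeros(hidden)
W2, b2 = 0.1 * rng.normal(size=(n_actions, hidden)), np.zeros(n_actions)

def discrete_policy(state):
    h = np.tanh(W1 @ state + b1)
    return softmax(W2 @ h + b2)            # action probabilities pi(a | s; theta)

probs = discrete_policy(rng.normal(size=state_dim))
action = rng.choice(n_actions, p=probs)    # sample an action from the policy

# Continuous case: the output layer would instead produce a mean and a
# (log) variance, and the action would be sampled from N(mean, sigma^2).
```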

Then it is important to understand the notation :D

θ = the vector of policy parameters (the neural network parameters).

ρ = the performance measure of our policy. This is the function we need to maximize.

Here we use the gradient of this function to update the parameters of the neural network.

α = the step size of the gradient update for the parameters.
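Putting the three symbols together, the update the paper is building towards is (roughly) plain gradient ascent on ρ:

```latex
\Delta\theta \;\approx\; \alpha \,\frac{\partial \rho}{\partial \theta}
\qquad\Longrightarrow\qquad
\theta_{k+1} \;=\; \theta_k + \alpha \left.\frac{\partial \rho}{\partial \theta}\right|_{\theta_k}
```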

Before going into the details they mention two use cases of policy gradient methods.

They are

  1. REINFORCE METHOD (Monte Carlo Policy Gradient Method)
  2. Actor Critic Methods

REINFORCE METHOD (Monte Carlo Policy Gradient Method)

This is a policy approximation method, but without a value function approximation. The paper presents it somewhat naively, assuming that we already know about policy approximation functions. So don’t worry if you don’t know what the policy gradient algorithm is; for those who do, I will put a screenshot from David Silver’s 7th lecture as a recap.

Here, basically, in the objective function we calculate the long-term reward for performing each action with the Monte Carlo method, which is really slow and adds a lot of variance to the system.
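A minimal REINFORCE sketch, assuming a linear softmax policy so that the gradient of the log-probability can be written by hand (the feature choice, step size, and episode format are my own assumptions, not the paper’s):

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Assumed linear softmax policy: pi(a | s) = softmax(theta @ s)[a]
n_actions, state_dim, alpha = 2, 4, 0.01
theta = np.zeros((n_actions, state_dim))

def grad_log_pi(state, action):
    probs = softmax(theta @ state)
    grad = -np.outer(probs, state)   # -pi(b|s) * s for every action b
    grad[action] += state            # +s for the action actually taken
    return grad

def reinforce_update(states, actions, rewards, gamma=0.99):
    """One update from a single episode collected under the current policy."""
    global theta
    G, returns = 0.0, []
    for r in reversed(rewards):                    # Monte Carlo return G_t
        G = r + gamma * G
        returns.append(G)
    returns.reverse()
    for s, a, G_t in zip(states, actions, returns):
        theta += alpha * G_t * grad_log_pi(s, a)   # theta += alpha * G_t * grad log pi
```

Because G_t is a full-episode Monte Carlo return, it only arrives at the end of the episode and has high variance, which is exactly the complaint above.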

Actor Critic Methods

This is again an RL method, one which uses policy evaluation together with policy gradients. Here, unlike the Monte Carlo method above, we get the value function from an approximator, as in Q learning. Actually, the point of this paper is to prove that this kind of actor critic / policy iteration method with an approximated policy converges (to a locally optimal policy).

This part lays out the notation for the rewards, the state transition probabilities and the policy, and what each of these quantities depends on.

They also mention that we assume the policy is differentiable with respect to its parameters.

The next part is all about describing the objective function.

It is important to understand how the objective functions are formulated. They simply state that we need a policy that will maximize the average reward over the time steps.

The first method is to take the average reward under the current policy.

Here d(s) is the probability of visiting each state under the current policy: the stationary state distribution, which is assumed to be independent of the initial state. It describes how often each state gets visited.
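Reconstructing the missing formula from the paper, the average-reward objective and its Q function look like this (d^π is the stationary state distribution mentioned above, R^a_s the expected reward for taking action a in state s):

```latex
\rho(\pi)
  = \lim_{n\to\infty}\frac{1}{n}\,E\{\,r_1 + r_2 + \dots + r_n \mid \pi\,\}
  = \sum_s d^\pi(s) \sum_a \pi(s,a)\,\mathcal{R}^a_s ,
\qquad
Q^\pi(s,a) = \sum_{t=1}^{\infty} E\{\, r_t - \rho(\pi) \mid s_0 = s,\ a_0 = a,\ \pi \,\}
```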

The second objective function they describe starts from some kind of designated start state. From that state you collect time-discounted rewards until the episode terminates, and then you start again from the same state. Just think of a game of chess: you start from the beginning every time, play until a checkmate, and collect the rewards.

Here γ is the discount factor. You can see the state-action value function (the Q function) is also defined given each state and each action.
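Again reconstructing from the paper, the start-state formulation is:

```latex
\rho(\pi) = E\Big\{ \sum_{t=1}^{\infty} \gamma^{\,t-1} r_t \ \Big|\ s_0,\ \pi \Big\},
\qquad
Q^\pi(s,a) = E\Big\{ \sum_{k=1}^{\infty} \gamma^{\,k-1} r_{t+k} \ \Big|\ s_t = s,\ a_t = a,\ \pi \Big\}
```

In this formulation d^π(s) is redefined as a discounted weighting of how often s is visited when starting from s_0.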

So these are the objective functions we need to work with, and we need to maximize them, because the end result should be the maximum reward under our parameterized policy.

Now the paper moves on to the mathematical part.

The above function is for a single time step; it applies at each time step.

This is an important equation. You will find it more and more as you go through policy gradients. In the paper this is proven, and the proof is very clear.
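The equation referred to here is the policy gradient theorem (Theorem 1 of the paper); writing it out for reference:

```latex
\frac{\partial \rho}{\partial \theta}
  = \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s,a)}{\partial \theta}\, Q^\pi(s,a)
```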

Please go to that part of the paper.

Here the first relation is simply the value function under a given policy; it’s the Bellman expectation, I might say. What it does is calculate the value function by taking an expectation of the Q values over the action space.
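In symbols, that first relation is just:

```latex
V^\pi(s) = \sum_a \pi(s,a)\, Q^\pi(s,a)
```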

They have proved it for both the average-reward formulation and the start-state formulation (the chess case mentioned earlier).

So up to now, the main thing you need to know is how the objective function that maximizes average reward is formulated, and what it consists of.

So here you can see that we take the gradient of our policy, multiply it by the state-action values, and then take the expectation.
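This is the same theorem rewritten as an expectation, using the identity ∂π/∂θ = π ∇_θ log π; this is the form you will usually see in David Silver’s slides:

```latex
\nabla_\theta\, \rho
  = \sum_s d^\pi(s) \sum_a \pi(s,a)\, \nabla_\theta \log \pi(s,a)\, Q^\pi(s,a)
  = E_{\,s \sim d^\pi,\; a \sim \pi}\big[\, \nabla_\theta \log \pi(s,a)\, Q^\pi(s,a) \,\big]
```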

Now the paper starts to ask how to get Q(s,a). It is not known, and we need it in order to push our policy algorithm towards an optimum.

One way is to use the actual returns, the rewards collected at each time step. This is the REINFORCE algorithm above, but it is slow.

The second part is about whether we can approximate this Q function as well. This is really, really important, and modern deep learning can really help here. Now we are slowly moving into actor critic models.

But this approximation should be good. Now we are going to directly approximate both

  • Policy
  • Q Function

And this Q function approximator should have a direct connection with the policy approximator. That makes sense; otherwise the approximation can bias the gradient estimate.

Then comes the following part.

This is very important. Here they take a function of the state and action, f_w(s,a), as the approximator of Q(s,a). It has its own parameter vector, called w.

In order to learn this approximation there should be some kind of target. Here we can use the mean squared error.

The best way to understand this equation is like this: your agent follows actions according to a stochastic policy whose parameters we can tune. Now our goal is to make the learning task easier by introducing the action-value approximator.

So the agent follows some action, and we need an estimate of the Q value for that state-action pair, and it should be an unbiased one. Our neural net will also predict some value, and what we want is that, every time we predict Q values, they do not differ hugely from the values we got by following the policy.

Equation 3 is all about converging the function approximator to a local optimum.

The sum terms again mean that we take expected values: for each state and each action under the given policy, this is how the loss behaves.
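Reconstructing equation (3) from this description, it is the condition that the critic’s mean-squared error has reached a local optimum:

```latex
\sum_s d^\pi(s) \sum_a \pi(s,a)\,
  \big[\, Q^\pi(s,a) - f_w(s,a) \,\big]\,
  \frac{\partial f_w(s,a)}{\partial w} \;=\; 0
```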

Then they come to Theorem 2.

Remember, Theorem 1 was all about setting up a direct objective function over the rewards we get when executing the policy.

If you look carefully, the 4th equation is all about making the connection between the Q values and the policy approximator.

Equation 4 specifically states the connection between the two sets of parameters and how we can justify the optimization of the objective function.

This is the main point: it shows how to substitute the approximated values directly into the objective function.

I will put up the steps

We should start from Eq. (3).

Here this can be simplified into the following.

Again this is useful, because it says that the gradient of the Q value function approximator does not work directly with the gradient of the policy; it works with the gradient of the log probability of the policy.
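That compatibility condition (equation 4), as far as I remember the paper, can be written as:

```latex
\frac{\partial f_w(s,a)}{\partial w}
  \;=\; \frac{\partial \pi(s,a)}{\partial \theta}\,\frac{1}{\pi(s,a)}
  \;=\; \nabla_\theta \log \pi(s,a)
```

So f_w has to be linear in the same features that the score function ∇_θ log π of the policy uses.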

Then comes some more math, from Eq. 6.

It’s simple: since the error in the estimate is zero, the whole expression is always zero, regardless of the gradient of the policy parameterization π(s,a;θ).

Now this part is again easy. You can subtract the 6th equation from anything, since it is zero.

So finally the policy gradient is based directly on the derivatives of the policy π(s,a;θ) and the function approximator for the Q values, which is parameterized by w.
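Which is Theorem 2 of the paper, written out:

```latex
\frac{\partial \rho}{\partial \theta}
  = \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s,a)}{\partial \theta}\, f_w(s,a)
```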

Anyway, we need to optimize the policy parameters as well as the value approximator’s parameters.

But how can we calculate the derivatives with respect to w?

Here the target Q value can be our true return, or we can use the experience replay methods that we use in deep Q learning (read the Atari paper by DeepMind).
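The corresponding critic update the paper suggests (with Q̂^π being any unbiased sample of Q^π, for example the actual return R_t) is, roughly:

```latex
\Delta w_t \;\propto\;
  \big[\, \hat{Q}^\pi(s_t, a_t) - f_w(s_t, a_t) \,\big]\,
  \frac{\partial f_w(s_t, a_t)}{\partial w}
```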

Up to now, we first had an objective function, and then we discussed how we multiply the gradient of the policy by the Q value function.

Then we discussed how to approximate this Q value function. Monte Carlo (REINFORCE) methods are one option, but why do we need approximators? Because we need to update the parameters in a more sample-efficient and online manner.

So in the next part the paper discusses those methods and how to apply these, I might say, actor critic methods to a problem.

If your policy is like this (for the discrete case):
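From the paper’s example, the discrete (Gibbs / softmax) policy and its compatible value approximator look like this, with φ_sa the feature vector of the state-action pair:

```latex
\pi(s,a) = \frac{e^{\theta^\top \phi_{sa}}}{\sum_b e^{\theta^\top \phi_{sb}}},
\qquad
f_w(s,a) = w^\top \Big[ \phi_{sa} - \sum_b \pi(s,b)\, \phi_{sb} \Big]
```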

So by simplifying this you will see that the compatible value approximator ends up being linear in the same features as the policy.

Then the paper discusses baseline functions, which can reduce the variance but do not affect the bias. Then they introduce the well-known advantage function.

There is an easy proof showing that subtracting some baseline function from the Q approximation does not affect the bias.
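The proof is essentially one line: for any baseline b(s) that depends only on the state,

```latex
\sum_a \frac{\partial \pi(s,a)}{\partial \theta}\, b(s)
  = b(s)\, \frac{\partial}{\partial \theta} \sum_a \pi(s,a)
  = b(s)\, \frac{\partial}{\partial \theta}\, 1
  = 0,
\qquad
A^\pi(s,a) = Q^\pi(s,a) - V^\pi(s)
```

so subtracting b(s) from Q^π inside the policy gradient changes nothing, and the particular choice b(s) = V^π(s) gives the advantage function A^π.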

Then how to choose an advantage function

This part goes a bit deeper.

It says the best choice for the baseline function is the state value function.

So basically you are measuring how far a given action takes you from the current state value.

If an action’s value is less than the current state value, your policy should learn that taking this kind of action will not give good results and that it should change.

So the logic is: we need a neural net to predict the state values, and the predicted values should be close to the one-step lookahead values.

From the David Silver’s lecture 7

Here it is easy to understand how we can use the TD error as the advantage function.

More precisely, we take an action, check the state value we end up with, and use a one-step lookahead as in TD learning.

This one-step lookahead can be varied, and you can also use eligibility traces when making these updates.
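A minimal one-step actor-critic sketch that uses the TD error as the advantage estimate (again my own toy code with a linear actor and a linear critic; none of the specifics come from the paper):

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

# Assumed linear actor (theta) and linear critic (w) on the raw state vector.
n_actions, state_dim = 2, 4
theta = np.zeros((n_actions, state_dim))      # policy parameters
w = np.zeros(state_dim)                       # state-value parameters, V(s) ~ w . s
alpha_actor, alpha_critic, gamma = 0.01, 0.05, 0.99

def grad_log_pi(state, action):
    probs = softmax(theta @ state)
    grad = -np.outer(probs, state)
    grad[action] += state
    return grad

def actor_critic_step(s, a, r, s_next, done):
    """Online update from one transition, using the TD error as the advantage."""
    global theta, w
    v = w @ s
    v_next = 0.0 if done else w @ s_next
    td_error = r + gamma * v_next - v                      # delta = r + gamma*V(s') - V(s)
    w += alpha_critic * td_error * s                       # critic: move V(s) toward the TD target
    theta += alpha_actor * td_error * grad_log_pi(s, a)    # actor: policy gradient step
```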

In the last part the paper proves convergence, with full math. You really need to go through the paper for that: carefully work through the equations, the upper bounds, etc.


Shamane Siriwardhana

Ph.D. Candidate — The University of Auckland, New Zealand