Policy gradient, an explanation.

NOTE: Most blogs that try to explain RL approaches either cover the concepts at a very low level and/or jump straight into complicated Deep examples. Research papers usually explain concepts at a high level, with at most some pseudocode that is not straightforward to implement. Code implementations on GitHub are usually made to be self-contained, so it takes a lot of time to dig out the basic examples. These are the reasons why I decided to write a basic step-by-step introduction, as well as a reminder for myself. I also needed to practise converting EQUATIONS to CODE.

The full Python implementation of the notions presented here, integrated and tested in OpenAI's gym environment, can be found on my GitHub: https://github.com/nemanja-rakicevic/rl_implementations/tree/master/policy_gradient_basic.

Policy gradient approach intuition

The goal of the policy gradient approach, and of policy search methods in general, is to find a policy $\pi$: a function that maps a state observation to the next action the agent should perform.

A policy can be:

  • deterministic    $\large a = \pi(s)$       : always returns the same action for a given input state observation $s$.
  • stochastic        $\large a\sim\pi( a | s )$    : defines a probability distribution over the possible actions given the input state, from which the action is sampled (a minimal sketch of both follows right after this list).
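
As a toy sketch of the difference (the linear mapping and the Gaussian noise below are placeholders chosen purely for illustration, not part of any particular method):

In [ ]:
import numpy as np

state = np.array([0.1, -0.3, 0.5])      # an example state observation s
w = np.array([0.4, 0.2, -0.7])          # placeholder policy parameters

# Deterministic policy: the same state always gives the same action
def deterministic_policy(s):
    return w.dot(s)                     # a = pi(s)

# Stochastic policy: the state defines a distribution from which the action is sampled
def stochastic_policy(s):
    return np.random.normal(loc=w.dot(s), scale=0.5)    # a ~ pi(a|s)

print(deterministic_policy(state), stochastic_policy(state))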

Since the policy is a function, there are many ways to choose the type of this function and the parameters $\large \theta$ that shape it. Therefore, in order to define our policy function $\large \pi_{\theta}(a|s)$, which will select the appropriate action in every state, we need to select the function family and find its optimal parameters.

One way to measure the efficacy of a policy is through a cost (objective) function, which we then optimise:

$$ \Large J(\pi_{\theta}) = \mathbb{E}[R(\tau|\pi_{\theta})] $$

In this case, the cost function is defined as the expected value of the return along a trajectory obtained by following the policy. A trajectory is defined as $\large \tau = (s_0, a_0, r_0, s_1, a_1, r_1, ... , s_T)$, a sequence of state, action and reward tuples, where $s_T$ is the terminal state.

The return function $\large R(\tau)$ takes the sequence of rewards from a trajectory $\tau$ and reduces it to a single value. This can be either the plain cumulative sum or, more commonly, the sum of discounted rewards $ \large R(\tau) = \sum_{i=0}^{T-1} \gamma^{i}r_i $, where $\large\gamma$ is a discount factor that determines the influence of rewards occurring further in the future.
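
Converting this equation to code, a minimal sketch of the discounted return (the reward values are made up):

In [ ]:
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_i gamma^i * r_i over the trajectory's reward sequence."""
    return sum(gamma**i * r for i, r in enumerate(rewards))

rewards = [1.0, 1.0, 0.0, 5.0]          # made-up rewards r_0, ..., r_{T-1}
print(discounted_return(rewards))       # 1 + 0.99*1 + 0 + 0.99**3 * 5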

Obtaining the optimal parameters

We have defined what we need to do; now we need to maximise the expected return of our policy by tweaking the policy parameters: $$ \Large \theta^* = \arg\max_{\theta} J(\pi_{\theta}) $$

This is done iteratively, by updating the parameter values $\large\theta \leftarrow \theta + \Delta\theta$, where usually $\large\Delta\theta = \alpha \frac{\partial J(\pi_{\theta})}{\partial \theta}$ (gradient ascent, since we want to maximise $J$), with $\alpha$ being the learning rate and $\frac{\partial J(\pi_{\theta})}{\partial \theta}$ the derivative of the cost function with respect to the parameters.
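
As a minimal illustration of the iterative update itself, here is gradient ascent on a toy objective $J(\theta) = -(\theta - 3)^2$ standing in for the real expected return:

In [ ]:
def grad_J(theta):
    # gradient of the toy objective J(theta) = -(theta - 3)^2
    return -2.0 * (theta - 3.0)

theta, alpha = 0.0, 0.1                 # initial parameter value and learning rate
for _ in range(100):
    theta = theta + alpha * grad_J(theta)    # theta <- theta + alpha * dJ/dtheta

print(theta)                            # approaches the maximiser theta* = 3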

Deriving the cost function

Here is the full form of the cost function, according to the definition of expectation: $$ \Large J(\pi_{\theta}) = \mathbb{E}[R(\tau|\pi_{\theta})] = \int p(\tau|\pi_\theta)R(\tau|\pi_\theta)\,d\tau $$ where $p(\tau|\pi_\theta)$ is the probability of observing a trajectory $\tau$ when following the policy $\pi_\theta$, and it is calculated by multiplying the probabilities of each step:

$ \large p(\tau|\pi_\theta) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_t,a_t)\, \pi_\theta(a_t|s_t) $

where $p(s_{t+1}|s_t,a_t)$ is the transition probability (i.e. the MDP model).
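
As a small equation-to-code sketch with made-up per-step probabilities, $p(\tau|\pi_\theta)$ is just a product of the initial-state, transition and policy probabilities along the trajectory (in practice we will work with its logarithm, as derived below):

In [ ]:
import numpy as np

# made-up per-step probabilities for a 3-step trajectory
p_s0 = 0.5                              # p(s_0)
p_trans = [0.9, 0.8, 0.95]              # p(s_{t+1} | s_t, a_t)
pi_probs = [0.6, 0.3, 0.7]              # pi_theta(a_t | s_t)

p_tau = p_s0 * np.prod(np.multiply(p_trans, pi_probs))
log_p_tau = np.log(p_s0) + np.sum(np.log(p_trans)) + np.sum(np.log(pi_probs))
print(p_tau, np.exp(log_p_tau))         # both give the same value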

The derivative of the cost function w.r.t. the parameters is then:

$ \Large \frac{\partial J(\pi_{\theta})}{\partial \theta} = \nabla_\theta J(\pi_{\theta}) = \nabla_\theta \int p(\tau|\pi_\theta)R(\tau|\pi_\theta)d\tau = \int \nabla_\theta p(\tau|\pi_\theta)R(\tau|\pi_\theta)d\tau $

Since differentiating this directly would require differentiating a very long product, the log-derivative trick is applied: $ \large \nabla \log f(x) = \frac{1}{f(x)}\nabla f(x) $. In order to obtain this right-hand-side form in the equation above, we multiply and divide it by $\large p(\tau|\pi_\theta)$.

Thus we have:

$ \Large \nabla_\theta J(\pi_{\theta}) = \int p(\tau|\pi_\theta) \frac{\nabla_\theta p(\tau|\pi_\theta)}{p(\tau|\pi_\theta)} R(\tau|\pi_\theta)d\tau = \int p(\tau|\pi_\theta) \underline{ \nabla_\theta \log p(\tau|\pi_\theta) R(\tau|\pi_\theta)} d\tau $


What this actually tells us is that:

  • the gradient of the expected return is the expectation of the gradient of the log-probability of the trajectory, multiplied by the return (the underlined part).

$$ \Large \nabla_\theta J(\pi_{\theta}) = \mathbb{E}[ \nabla_\theta \log p(\tau|\pi_\theta) R(\tau|\pi_{\theta})] $$

  • When we now expand $p(\tau|\pi_\theta)$ inside the logarithm, the product turns into a sum, which is much easier to differentiate. The initial-state and transition terms do not depend on $\theta$, so they can be grouped into a constant: $ \large p(\tau|\pi_\theta) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_t,a_t)\, \pi_\theta(a_t|s_t) \\ \large \log p(\tau|\pi_\theta) = \sum_{t=0}^{T-1} \log \pi_\theta(a_t|s_t) + const $
  • And its derivative becomes: $ \large \nabla_\theta \log p(\tau|\pi_\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t|s_t) $

Finally, the gradient of the cost function can be approximated by sampling $K$ trajectories and averaging: $$ \Large \nabla_\theta J(\pi_{\theta}) \approx \frac{1}{K} \sum_{i=1}^{K} \left [ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a^{(i)}_t|s^{(i)}_t) \right ] R(\tau^{(i)}|\pi_{\theta}) $$
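
A minimal sketch of this sampled estimator, assuming we already have, for each of the $K$ trajectories, the per-step gradients $\nabla_\theta \log \pi_\theta(a_t|s_t)$ and the return $R(\tau)$ (the function and variable names are illustrative):

In [ ]:
import numpy as np

def policy_gradient_estimate(grad_log_pi_per_traj, returns):
    """grad_log_pi_per_traj: list of K arrays of shape (T_i, num_params),
       one row per step, containing grad_theta log pi_theta(a_t|s_t);
       returns: list of the K corresponding returns R(tau_i)."""
    K = len(returns)
    per_traj = [g.sum(axis=0) * R        # [sum_t grad log pi(a_t|s_t)] * R(tau_i)
                for g, R in zip(grad_log_pi_per_traj, returns)]
    return sum(per_traj) / K             # average over the K sampled trajectories

# toy example: 2 trajectories of 3 steps each, 4 policy parameters
grads = [np.random.randn(3, 4), np.random.randn(3, 4)]
returns = [10.0, -2.0]
print(policy_gradient_estimate(grads, returns))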

Deriving the policy. Characteristic eligibility

The remaining piece of the puzzle is $\large \nabla_\theta \log \pi_\theta(a_t|s_t)$, also called the characteristic eligibility. It depends on the particular function type we select for our policy. Below are a couple of examples, with their derivations and code implementations.


Implementing the Gradient Updates

Softmax policy

The Softmax policy parameterisation as presented in the original REINFORCE paper by Williams, is derived here.

This policy gives a probability for each of the discrete actions, so that an action can be sampled:

$\boxed{ \Large a \sim \pi(a|s) = \frac{e^{\theta_a^T s}}{\sum_{a'}e^{\theta_{a'}^T s}} }$
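
As a toy example of evaluating and sampling from this policy (the parameter matrix $\theta$, with one column per action, is made up):

In [ ]:
import numpy as np

state = np.array([0.2, -1.0, 0.4])      # state observation s
theta = np.random.randn(3, 2)           # one parameter column theta_a per action

logits = theta.T.dot(state)             # theta_a^T s for every action a
probs = np.exp(logits - logits.max())   # subtract the max for numerical stability
probs /= probs.sum()                    # softmax probabilities pi(a|s)

action = np.random.choice(len(probs), p=probs)    # a ~ pi(a|s)
print(probs, action)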

In order to get the optimal policy we need to find the optimal parameters $\large \theta$ through the gradient of the cost function, so we need to calculate the characteristic eligibility. Each column of the parameter matrix, $\large \theta_x$, corresponds to the action $\large x$, and it is easier if we treat the columns separately here:

Using the property that the logarithm of a fraction is the logarithm of the numerator minus the logarithm of the denominator:

$ \Large \nabla_{\theta_x} \log \left [\frac{e^{\theta_a^T s}}{\sum_{a'}e^{\theta_{a'}^T s}}\right ] \\ \Large = \nabla_{\theta_x} \log e^{\theta_a^T s} - \nabla_{\theta_x} \log \sum_{a'}e^{\theta_{a'}^T s} $

Now, the first term is non-zero only if $\large x = a$. Since we are calculating the parameters corresponding to each action separately, this first term contributes only to the parameter column corresponding to the action $a$ that was actually taken.

$ \Large \nabla_{\theta_x} \log e^{\theta_a^T s} = \begin{cases} s, & \text{if } a = x\\ 0, & \text{otherwise} \end{cases} $

While the second part is further derived:

$ \Large \nabla_{\theta_x} \log \sum_{a'}e^{\theta_{a'}^T s} = \frac{1}{\sum_{a'}e^{\theta_{a'}^T s}} \nabla_{\theta_x} \sum_{a'}e^{\theta_{a'}^T s} = \frac{ s \cdot e^{\theta_{x}^T s}}{\sum_{a'}e^{\theta_{a'}^T s}} = s \cdot \pi_{\theta}(x | s) $

where only the term of the sum with $\large a' = x$ has a non-zero gradient with respect to $\theta_x$.

Therefore, combining the two parts, the characteristic eligibility (and hence the update direction) becomes:

$ \Large \nabla_{\theta_x} \log \pi_{\theta}(a|s) = \begin{cases} s\,(1 - \pi_{\theta}(a | s)), & \text{if } x = a\\ -s\, \pi_{\theta}(x | s), & \text{otherwise} \end{cases} $

In Python, this looks as follows; the gradient updates are implemented in matrix form, with a bias term (a constant 1) appended to the state vector:

In [ ]:
# Calculate characteristic eligibility
# state (with appended bias term) weighted by the action probabilities: s * pi(a'|s), one row per action a'
common_deriv = np.tile(np.append(self.states[i], 1.), (self.num_action,1)) * self.action_probs[i].reshape(-1,1)
# one-hot encoding of the action that was taken (x == a)
onehot = np.zeros((self.num_action,1))
onehot[self.actions[i]] = 1
# final matrix of derivatives: s * (onehot - pi), one row per action
deriv = np.tile(np.append(self.states[i], 1.), (self.num_action,1)) * onehot - common_deriv
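
A quick way to sanity-check this equation-to-code conversion is to compare the analytic derivative matrix against a numerical gradient of $\log \pi_\theta(a|s)$. Below is a self-contained sketch; the parameter shape (one row per action, with a bias column, matching the snippet above) and all values are made up:

In [ ]:
import numpy as np

num_action, state_dim = 3, 4
state = np.append(np.random.randn(state_dim), 1.)   # state with appended bias term
pp = np.random.randn(num_action, state_dim + 1)     # parameter matrix, one row per action

def log_pi(params, s, a):
    logits = params.dot(s)
    return logits[a] - np.log(np.sum(np.exp(logits)))    # log softmax of the chosen action

action = 1
probs = np.exp(pp.dot(state))
probs /= probs.sum()

# analytic eligibility, computed as in the snippet above
onehot = np.zeros((num_action, 1))
onehot[action] = 1
deriv = np.tile(state, (num_action, 1)) * onehot - np.tile(state, (num_action, 1)) * probs.reshape(-1, 1)

# numerical gradient of log pi(a|s) w.r.t. every parameter, by central finite differences
eps = 1e-6
num_grad = np.zeros_like(pp)
for idx in np.ndindex(pp.shape):
    pp_plus, pp_minus = pp.copy(), pp.copy()
    pp_plus[idx] += eps
    pp_minus[idx] -= eps
    num_grad[idx] = (log_pi(pp_plus, state, action) - log_pi(pp_minus, state, action)) / (2 * eps)

print(np.allclose(deriv, num_grad, atol=1e-5))       # should print True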

Putting the whole parameter update together in python:

In [ ]:
def updatePolicy_episodic(self):
    
    grad_pp = np.zeros_like(self.pp)  
    
    # Calculate episodic gradient update
    num_iter = len(self.states) # episode length
    for i in range(num_iter):
        
        # Calculate reward contribution with baseline
        total_return = sum([self.gamma**k * rew for k, rew in enumerate(self.rewards[i:])])
        advantage    = total_return - np.mean(self.rewards[i:])
        
        # Calculate characteristic eligibility
        current_state = np.append(self.states[i], 1)
        common_deriv = np.tile(current_state, (self.num_action,1)) * self.action_probs[i].reshape(-1,1)
        onehot = np.zeros((self.num_action,1))
        onehot[self.actions[i]] = 1
        deriv = np.tile(current_state, (self.num_action,1)) * onehot - common_deriv
        
        # Calculate gradient
        grad_pp += deriv * advantage
    
    # Normalise gradient updates over the episode
    grad_pp /= num_iter
    
    # Update policy parameters (gradient ascent on the expected return)
    self.pp += self.lr * grad_pp
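
For context, below is a heavily simplified usage sketch of how such an episodic update could be driven by a gym environment. The agent class, its `selectAction` helper and the old gym API (4-tuple `step` return) are assumptions of mine, modelled on the snippet above rather than taken from the repository:

In [ ]:
import gym
import numpy as np

class SoftmaxAgent:
    """Minimal container exposing the attributes used by updatePolicy_episodic above."""
    def __init__(self, state_dim, num_action, lr=0.01, gamma=0.99):
        self.pp = np.zeros((num_action, state_dim + 1))   # parameters, one row per action (+ bias)
        self.num_action, self.lr, self.gamma = num_action, lr, gamma

    def selectAction(self, state):
        logits = self.pp.dot(np.append(state, 1.))
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                              # softmax probabilities pi(a|s)
        return np.random.choice(self.num_action, p=probs), probs

SoftmaxAgent.updatePolicy_episodic = updatePolicy_episodic   # reuse the update defined above

env = gym.make('CartPole-v0')
agent = SoftmaxAgent(env.observation_space.shape[0], env.action_space.n)

for episode in range(300):
    state, done = env.reset(), False
    agent.states, agent.actions, agent.rewards, agent.action_probs = [], [], [], []
    while not done:
        action, probs = agent.selectAction(state)
        agent.states.append(state)
        agent.actions.append(action)
        agent.action_probs.append(probs)
        state, reward, done, _ = env.step(action)
        agent.rewards.append(reward)
    agent.updatePolicy_episodic()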

Gaussian Policy

The Gaussian policy parameterisation as presented in the original REINFORCE paper by Williams, is derived here.

The idea is to have a Gaussian function from which (continuous) actions can be sampled:

$\boxed{ \Large a \sim \pi(a|s) =\mathcal{N}(\mu, \sigma)}$

Usually, the $\mu$ and $\sigma$ parameters of the policy can be obtained directly as 2 regression outputs of a Neural Network. However, here another approach is applied, where 3 sets of parameters are learned:

  • Linear transformation parameters (like with softmax) - $\large \theta$
  • Gaussian's mean - $\large \mu$
  • Gaussian's covariance - $\large \sigma$

The policy equation is given in the paper as: $$ \Large \pi(a | s,\mu,\sigma,\theta) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} $$
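
Translating this density directly into code (a small sketch with made-up values), which will also be handy for checking the derivatives below:

In [ ]:
import numpy as np

def log_pi(theta, mu, sigma, s):
    """Log of the Gaussian policy density above, evaluated at y = theta^T s."""
    y = theta.dot(s)
    return -np.log(sigma * np.sqrt(2 * np.pi)) - (y - mu)**2 / (2 * sigma**2)

s = np.array([0.3, -0.2, 1.0])          # state observation
theta = np.array([0.5, 0.1, -0.4])      # linear transformation parameters
mu, sigma = 0.0, 1.0                    # Gaussian mean and standard deviation
print(log_pi(theta, mu, sigma, s))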

Therefore we need to derive the update characteristic eligibility equations for each of the parameters in order to get the parameter updates at each step.

Parameter $\theta$

Using the property that the logarithm of a fraction is the logarithm of the numerator minus the logarithm of the denominator:

$ \Large \nabla_{\theta} \log \left [\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} \right ] \\ \Large = \nabla_{\theta} \log e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} - \nabla_{\theta} \log \sigma\sqrt{2\pi} $

The $\log$ and $\exp$ cancel each other out, while the second part does not contain $\theta$, so its gradient is 0.

$ \Large = \nabla_{\theta} \left ( -\frac{(\theta^T s - \mu)^2}{2\sigma^2} \right )- 0 = -\frac{1}{2\sigma^2} 2 (\theta^T s - \mu)\, s = \boxed{\frac{\mu - \theta^T s}{\sigma^2}s} $

Parameter $\mu$

Same property:

$ \Large \nabla_{\mu} \log \left [\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} \right ] \\ \Large = \nabla_{\mu} \log e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} - \nabla_{\mu} \log \sigma\sqrt{2\pi} $

Again the $\log$ and $\exp$ cancel each other out, while the second part does not contain $\mu$, so its gradient is 0; then we just apply the chain rule.

$ \Large = \nabla_{\mu} \left ( -\frac{(\theta^T s - \mu)^2}{2\sigma^2} \right )- 0 = -\frac{1}{2\sigma^2} 2 (\theta^T s - \mu) (-1) = \boxed{\frac{\theta^T s - \mu}{\sigma^2}} $

Parameter $\sigma$

Again:

$ \Large \nabla_{\sigma} \log \left [\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} \right ] \\ \Large = \nabla_{\sigma} \log e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} - \nabla_{\sigma} \log \sigma\sqrt{2\pi} $

Again the $\log$ and $\exp$ cancel each other out. We use the chain rule, the quotient rule $\left(\frac{g}{h}\right)' = \frac{g'h - gh'}{h^2}$, and put everything over a common denominator. The gradient of any expression that does not contain $\sigma$ is 0.

$ \Large = \nabla_{\sigma} \left ( -\frac{(\theta^T s - \mu)^2}{2\sigma^2} \right ) - \frac{1}{\sigma\sqrt{2\pi}} \sqrt{2\pi} \\ \Large = -\frac{ \nabla_{\sigma}\left[(\theta^T s - \mu)^2\right] 2\sigma^2 - (\theta^T s - \mu)^2 \nabla_{\sigma}(2\sigma^2)}{4\sigma^4} - \frac{1}{\sigma} = -\frac{ 0 - (\theta^T s - \mu)^2\, 4\sigma}{4\sigma^4} - \frac{1}{\sigma} = \boxed{\frac{(\theta^T s - \mu)^2 - \sigma^2}{\sigma^3}} $
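
Since converting these equations to code is error-prone, a useful sanity check is to compare the three boxed eligibilities against numerical gradients of $\log \pi$; here is a self-contained sketch with made-up values:

In [ ]:
import numpy as np

def log_pi(theta, mu, sigma, s):
    y = theta.dot(s)
    return -np.log(sigma * np.sqrt(2 * np.pi)) - (y - mu)**2 / (2 * sigma**2)

s = np.array([0.3, -0.2, 1.0])
theta, mu, sigma = np.array([0.5, 0.1, -0.4]), 0.2, 0.8
y = theta.dot(s)

# analytic characteristic eligibilities (the boxed results above)
grad_theta = (mu - y) / sigma**2 * s
grad_mu    = (y - mu) / sigma**2
grad_sigma = ((y - mu)**2 - sigma**2) / sigma**3

# numerical gradients by central finite differences
eps = 1e-6
num_theta = np.array([(log_pi(theta + eps*e, mu, sigma, s) - log_pi(theta - eps*e, mu, sigma, s)) / (2*eps)
                      for e in np.eye(len(theta))])
num_mu    = (log_pi(theta, mu + eps, sigma, s) - log_pi(theta, mu - eps, sigma, s)) / (2*eps)
num_sigma = (log_pi(theta, mu, sigma + eps, s) - log_pi(theta, mu, sigma - eps, s)) / (2*eps)

print(np.allclose(grad_theta, num_theta), np.isclose(grad_mu, num_mu), np.isclose(grad_sigma, num_sigma))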

Code. Putting it all together

Below is the python implementation of the equations above.

The parameters are updated after each episode, and the for loop is over every step in the episode.

In [ ]:
def updatePolicy_episodic(self):
    
    grad_pp0 = np.zeros_like(self.pp[0]) # theta
    grad_pp1 = np.zeros_like(self.pp[1]) # mu 
    grad_pp2 = np.zeros_like(self.pp[2]) # sigma
    
    # Calculate episodic gradient update
    num_iter = len(self.states) # episode length
    for i in range(num_iter):
        
        # Calculate reward contribution with baseline
        total_return = sum([self.gamma**k * rew for k, rew in enumerate(self.rewards[i:])])
        advantage    = total_return - np.mean(self.rewards[i:])

        # Calculate characteristic eligibility 
        current_state = np.append(self.states[i], 1)
        y = np.dot(self.pp[0].T, current_state)

        # parameter linear
        eligib_0 = ((self.pp[1] - y)/self.pp[2]**2) * np.tile(current_state.reshape(-1,1), self.num_action)
        grad_pp0 += eligib_0 * advantage
        # parameter mean
        eligib1 = (y - self.pp[1])/self.pp[2]**2
        grad_pp1 += eligib1 * advantage
        # parameter sigma
        eligib2 = ((y - self.pp[1])**2 - self.pp[2]**2)/self.pp[2]**3
        grad_pp2 += eligib2 * advantage
        
    # Normalise gradient updates over the episode
    grad_pp0 /= num_iter
    grad_pp1 /= num_iter
    grad_pp2 /= num_iter
    
    # Update parameters (gradient ascent; note the sigma^2 scaling of the learning rate for mu and sigma)
    self.pp[0] += self.lr * grad_pp0
    self.pp[1] += self.lr*self.pp[2]**2 * grad_pp1
    self.pp[2] += self.lr*self.pp[2]**2 * grad_pp2

Multi-layer Perceptron Policy