NOTE: Most blogs that try to explain RL approaches either explain the concepts at a very low level and/or jump straight into complicated deep-learning examples. Research papers usually explain concepts at a high level, with maybe some pseudocode, which is not straightforward to implement. Code implementations on GitHub are usually made to be self-contained and are complicated, requiring a lot of time to actually dig out the basic examples. These are the reasons why I decided to write a basic step-by-step introduction, as well as a reminder for myself. Also, I needed to practice converting EQUATIONS to CODE.
The full Python implementation of the notions presented here, integrated and tested in OpenAI's gym environment, can be found on my GitHub: https://github.com/nemanja-rakicevic/rl_implementations/tree/master/policy_gradient_basic.
The goal of the policy gradient approach, and of policy search methods in general, is to find a policy $\pi$, which is basically a function that maps some state observation to the next action the agent should perform.
A policy can be either deterministic, directly mapping a state to an action, or stochastic, defining a probability distribution over actions from which the next action is sampled.
Since the policy is a function, there are many ways to choose the type of this function and the parameters $\large \theta$ which shape it. Therefore, in order to define our policy function $\large \pi_{\theta}(a|s)$, which will select the appropriate action in every state, we need to select the function family and find its optimal parameters.
One way to measure the efficacy of a policy is through a cost function which we then optimise:
$$ \Large J(\pi_{\theta}) = \mathbb{E}[R(\tau|\pi_{\theta})] $$
In this case, the cost function is defined as the expected value of the returns along some trajectory obtained by following the policy. A trajectory is defined as $\large \tau = (s_0, a_0, r_0, s_1, a_1, r_1, ... , s_T)$, a sequence of state, action and reward tuples, where $s_T$ is the terminal state.
The return function $\large R(\tau)$ takes the sequence of rewards from a trajectory $\tau$ and returns a single value. This can be either a cumulative sum or, more usually, the sum of discounted rewards $ \large R(\tau) = \sum_{i=0}^{T} \gamma^{i}r_i $, where $\large\gamma$ is a discount factor that determines the influence of rewards that occur further in the future.
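To make the equation-to-code step explicit, here is a minimal sketch of computing the discounted return from a list of per-step rewards (the reward values and $\gamma$ below are just made-up placeholders):

def discounted_return(rewards, gamma=0.9):
    # R(tau) = sum_i gamma^i * r_i
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# e.g. rewards [1, 0, 2] with gamma = 0.9 give 1 + 0.9*0 + 0.81*2 = 2.62
print(discounted_return([1.0, 0.0, 2.0]))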
We have defined what we need to do; now we need to maximise the expected returns of our policy by tweaking the policy parameters: $$ \Large \theta^* = \arg\max_{\theta} J(\pi_{\theta}) $$
This is done iteratively by updating the parameter values $\large\theta \leftarrow \theta + \Delta\theta$, where usually $\large\Delta\theta = \alpha \frac{\partial J(\pi_{\theta})}{\partial \theta}$ (gradient ascent, since we are maximising), with $\alpha$ being the learning rate and $\frac{\partial J(\pi_{\theta})}{\partial \theta}$ the derivative of the cost function with respect to the parameters.
Here's the full form of the cost function, according to the definition of expectation: $$ \Large J(\pi_{\theta}) = \mathbb{E}[R(\tau|\pi_{\theta})] = \int p(\tau|\pi_\theta)R(\tau|\pi_\theta)d\tau $$ where $p(\tau|\pi_\theta)$ is basically the probability of seeing a trajectory $\tau$ conditioned on following the policy $\pi_\theta$, and it is calculated by multiplying the probabilities of each step:
$ \large p(\tau|\pi_\theta) = p(s_0) \prod_{t=0}^{T-1} p(s_{t+1}|s_t,a_t) \pi(a_t|s_t) $
where $p(s_{t+1}|s_t,a_t)$ is the transition probability (i.e. the MDP model).
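To see how this probability translates to code, below is a small sketch that evaluates $\log p(\tau|\pi_\theta)$ for a toy tabular MDP; the transition table, policy table and trajectory are made-up placeholders, purely for illustration:

import numpy as np

# Made-up 2-state, 2-action MDP: P[s, a, s'] are transition probabilities,
# pi[s, a] is a tabular stochastic policy, p_s0 the initial state distribution
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])
pi = np.array([[0.6, 0.4],
               [0.2, 0.8]])
p_s0 = np.array([1.0, 0.0])

def log_prob_trajectory(states, actions):
    # log p(tau) = log p(s0) + sum_t [ log pi(a_t|s_t) + log P(s_{t+1}|s_t, a_t) ]
    logp = np.log(p_s0[states[0]])
    for t in range(len(actions)):
        logp += np.log(pi[states[t], actions[t]]) + np.log(P[states[t], actions[t], states[t + 1]])
    return logp

print(log_prob_trajectory(states=[0, 1, 0], actions=[1, 0]))

Working with the logarithm of this product, rather than the product itself, also hints at why the logarithm will be convenient in the derivation that follows.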
The derivative of the cost function w.r.t. the parameters is then:
$ \Large \frac{\partial J(\pi_{\theta})}{\partial \theta} = \nabla_\theta J(\pi_{\theta}) = \nabla_\theta \int p(\tau|\pi_\theta)R(\tau|\pi_\theta)d\tau = \int \nabla_\theta p(\tau|\pi_\theta)R(\tau|\pi_\theta)d\tau $
Since this would require differentiating a very long product, the logarithm derivative rule $ \large \nabla \log f(x) = \frac{1}{f(x)}\nabla f(x) $ is applied. In order to obtain the right-hand side of this identity in the equation above, we multiply it by $\large \frac{p(\tau|\pi_\theta)}{p(\tau|\pi_\theta)}$.
Thus we have:
$ \Large \nabla_\theta J(\pi_{\theta}) = \int p(\tau|\pi_\theta) \frac{\nabla_\theta p(\tau|\pi_\theta)}{p(\tau|\pi_\theta)} R(\tau|\pi_\theta)d\tau = \int p(\tau|\pi_\theta) \underline{ \nabla_\theta \log p(\tau|\pi_\theta) R(\tau|\pi_\theta)} d\tau $
What this actually tells us is that:
$$ \Large \nabla_\theta J(\pi_{\theta}) = \mathbb{E}[ \nabla_\theta \log p(\tau|\pi_\theta) R(\tau|\pi_{\theta})] $$
Since the transition probabilities do not depend on $\theta$, we have $\large \nabla_\theta \log p(\tau|\pi_\theta) = \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t|s_t)$. Finally, the gradient of the cost function can be approximated by sampling, averaging over K trajectories: $$ \Large \nabla_\theta J(\pi_{\theta}) \approx \frac{1}{K} \sum_{i=1}^K \left [ \sum_{t=0}^{T-1} \nabla_\theta \log \pi(a_t|s_t) \right ] R(\tau_i|\pi_{\theta}) $$
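To convince ourselves that this sampled estimate works, we can test it on a made-up single-step problem (a two-armed "bandit" with fixed rewards, so there is no state dependence) where the exact gradient is available in closed form; this sketch is purely illustrative and not part of the original post:

import numpy as np

np.random.seed(0)
theta = np.array([0.5, -0.3])     # one logit per action
rewards = np.array([1.0, 2.0])    # fixed reward of each action

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Exact gradient of J(theta) = sum_a pi(a) R(a) for a softmax distribution
pi = softmax(theta)
J = np.dot(pi, rewards)
exact_grad = pi * (rewards - J)

# REINFORCE estimate: average of grad_theta log pi(a) * R(a) over sampled actions
K = 200000
actions = np.random.choice(2, size=K, p=pi)
grad_log_pi = np.eye(2)[actions] - pi          # grad of log softmax at each sampled action
estimate = (grad_log_pi * rewards[actions, None]).mean(axis=0)

print(exact_grad, estimate)                    # the two should closely agree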
The remaining piece of the puzzle is $\large \nabla_\theta \log \pi(a_t|s_t)$, also called the characteristic eligibility. This depends on the particular function type we select for our policy. Below are a couple of examples, their derivations and code implementations.
The softmax policy parameterisation, as presented in the original REINFORCE paper by Williams, is derived here.
This policy gives the probability for each of the discrete action dimensions, so that the actions can be sampled:
$\boxed{ \Large a \sim \pi(a|s) = \frac{e^{\theta_a^T s}}{\sum_{a'}e^{\theta_{a'}^T s}} }$
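For example, if the columns of the parameter matrix hold one set of weights per action, sampling from this policy could look like the following sketch (the shapes and variable names here are assumptions for illustration):

import numpy as np

def sample_softmax_action(theta, s):
    # theta: (state_dim, num_actions), one column of weights per action; s: (state_dim,)
    logits = theta.T.dot(s)                   # theta_a^T s for every action a
    probs = np.exp(logits - logits.max())     # subtract the max for numerical stability
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs), probs

theta = np.random.randn(4, 2)                 # e.g. a 4-dimensional state, 2 discrete actions
action, probs = sample_softmax_action(theta, np.array([0.1, -0.2, 0.3, 0.05]))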
In order to get the optimal policy we need to find the optimal parameters $\large \theta$ through the gradient of the cost function, so we need to calculate the characteristic eligibility. Each column of the parameter matrix, $\large \theta_x$, corresponds to the appropriate action $\large x$, and it is easier if we treat them separately here:
Using the property that the logarithm of a fraction is the logarithm of the numerator minus the logarithm of the denominator:
$ \Large \nabla_{\theta_x} \log \left [\frac{e^{\theta_a^T s}}{\sum_{a'}e^{\theta_{a'}^T s}}\right ] \\ \Large = \nabla_{\theta_x} \log e^{\theta_a^T s} - \nabla_{\theta_x} \log \sum_{a'}e^{\theta_{a'}^T s} $
Now, the first term is non-zero only if $\large \theta_x = \theta_a$, i.e. if $x = a$. Since we are calculating the parameters corresponding to each action separately, this first term contributes only to the parameter column corresponding to action $a$.
$ \Large \nabla_{\theta_x} \log e^{\theta_a^T s} = \begin{cases} s, & \text{if } a = x\\ 0, & \text{otherwise} \end{cases} $
While the second part is further derived:
$ \Large \nabla_{\theta_x} \log \sum_{a'}e^{\theta_{a'}^T s} = \frac{1}{\sum_{a'}e^{\theta_{a'}^T s}} \nabla_{\theta_x} \sum_{a'}e^{\theta_{a'}^T s} = \frac{ s \cdot e^{\theta_{x}^T s}}{\sum_{a'}e^{\theta_{a'}^T s}} = s \cdot \pi_{\theta}(a_x | s) $
where only the term with $\large a' = x$ in the sum has a non-zero gradient with respect to $\theta_x$.
Therefore, combining the two cases, the characteristic eligibility for each parameter column $\large \theta_x$ is:
$ \Large \nabla_{\theta_x} \log \pi(a|s) = \begin{cases} s\,(1 - \pi_{\theta}(a_x | s)), & \text{if } x = a\\ -s\, \pi_{\theta}(a_x | s), & \text{if } x \neq a \end{cases} $
In Python, it would look like this, where we implement the gradient updates in matrix form:
# Calculate characteristic eligibility (a bias term of 1 is appended to the state)
common_deriv = np.tile(np.append(self.states[i], 1.), (self.num_action, 1)) * self.action_probs[i].reshape(-1, 1)
# one-hot vector encoding the action that was actually taken (x == a)
onehot = np.zeros((self.num_action, 1))
onehot[self.actions[i]] = 1
# final matrix of derivatives: one row per action, s * (1[x == a] - pi(a_x|s))
deriv = np.tile(np.append(self.states[i], 1.), (self.num_action, 1)) * onehot - common_deriv
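As a sanity check on the derivation, the closed-form eligibility $s\,(\mathbb{1}[x=a] - \pi_{\theta}(a_x|s))$ can be compared against a finite-difference gradient of $\log\pi_\theta(a|s)$; the standalone sketch below is not from the repo, it just verifies the formula:

import numpy as np

def log_pi(theta, s, a):
    # log of the softmax policy probability of action a in state s
    logits = theta.T.dot(s)
    return logits[a] - np.log(np.sum(np.exp(logits)))

s = np.array([0.3, -1.2, 0.7])
theta = np.random.randn(3, 2)      # one column of parameters per action
a = 1

# Closed-form eligibility: column x gets s * (1[x == a] - pi(a_x|s))
probs = np.exp(theta.T.dot(s)) / np.sum(np.exp(theta.T.dot(s)))
analytic = np.outer(s, (np.arange(2) == a) - probs)

# Central finite differences w.r.t. every entry of theta
eps = 1e-6
numeric = np.zeros_like(theta)
for idx in np.ndindex(theta.shape):
    d = np.zeros_like(theta)
    d[idx] = eps
    numeric[idx] = (log_pi(theta + d, s, a) - log_pi(theta - d, s, a)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-5))   # expected: True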
Putting the whole parameter update together in Python:
def updatePolicy_episodic(self):
    grad_pp = np.zeros_like(self.pp)
    # Calculate episodic gradient update
    num_iter = len(self.states)  # episode length
    for i in range(num_iter):
        # Calculate reward contribution with baseline
        total_return = sum([self.gamma ** k * rew for k, rew in enumerate(self.rewards[i:])])
        advantage = total_return - np.mean(self.rewards[i:])
        # Calculate characteristic eligibility
        current_state = np.append(self.states[i], 1)  # state with appended bias term
        common_deriv = np.tile(current_state, (self.num_action, 1)) * self.action_probs[i].reshape(-1, 1)
        onehot = np.zeros((self.num_action, 1))
        onehot[self.actions[i]] = 1
        deriv = np.tile(current_state, (self.num_action, 1)) * onehot - common_deriv
        # Calculate gradient
        grad_pp += deriv * advantage
    # Normalise gradient updates over the episode
    grad_pp /= num_iter
    # Update policy parameters (gradient ascent on the expected return)
    self.pp += self.lr * grad_pp
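For context, here is a rough sketch of the kind of episode loop that could populate the buffers used above (self.states, self.actions, self.action_probs, self.rewards); the environment interface (classic gym reset/step) and the shape of self.pp are assumptions, not the exact code from the repo:

import numpy as np

def run_episode(self, env):
    # Hypothetical driver method: collect one episode with the softmax policy, then update
    state = env.reset()
    self.states, self.actions, self.action_probs, self.rewards = [], [], [], []
    done = False
    while not done:
        # Assuming self.pp has one row of parameters per action, with a trailing bias column
        logits = self.pp.dot(np.append(state, 1.))
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        action = np.random.choice(self.num_action, p=probs)
        state_next, reward, done, _ = env.step(action)
        # Store the transition for the episodic update
        self.states.append(state)
        self.actions.append(action)
        self.action_probs.append(probs)
        self.rewards.append(reward)
        state = state_next
    # Once the episode terminates, apply the REINFORCE update derived above
    self.updatePolicy_episodic()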
The Gaussian policy parameterisation, as presented in the original REINFORCE paper by Williams, is derived here.
The idea is to have a Gaussian function from which (continuous) actions can be sampled:
$\boxed{ \Large a \sim \pi(a|s) =\mathcal{N}(\mu, \sigma)}$
Usually, the $\mu$ and $\sigma$ parameters of the policy can be obtained directly as two regression outputs of a neural network. However, here another approach is applied, where three sets of parameters are learned: $\theta$ (a linear mapping applied to the state), $\mu$ (the mean) and $\sigma$ (the standard deviation).
The equation of the policy, as given in the paper, is: $$ \Large \pi(a | s,\mu,\sigma,\theta) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} $$
Therefore, we need to derive the characteristic eligibility equations for each of the parameters, in order to get their updates at each step.
Using the property that the logarithm of a fraction is the logarithm of the numerator minus the logarithm of the denominator:
$ \Large \nabla_{\theta} \log \left [\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} \right ] \\ \Large = \nabla_{\theta} \log e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} - \nabla_{\theta} \log \sigma\sqrt{2\pi} $
The log and exp cancel each other out, while the second part does not contain $\theta$, so its gradient is 0.
$ \Large = \nabla_{\theta} \left ( -\frac{(\theta^T s - \mu)^2}{2\sigma^2} \right )- 0 = -\frac{1}{2\sigma^2} 2 (\theta^T s - \mu)\, s = \boxed{\frac{\mu - \theta^T s}{\sigma^2}\, s} $
Using the same property, now taking the gradient with respect to $\mu$:
$ \Large \nabla_{\mu} \log \left [\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} \right ] \\ \Large = \nabla_{\mu} \log e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} - \nabla_{\mu} \log \sigma\sqrt{2\pi} $
Again the log and exp cancel each other out, while the second part does not contain $\mu$, so its gradient is 0; then we just apply the chain rule.
$ \Large = \nabla_{\mu} \left ( -\frac{(\theta^T s - \mu)^2}{2\sigma^2} \right )- 0 = -\frac{1}{2\sigma^2} 2 (\theta^T s - \mu) (-1) = \boxed{\frac{\theta^T s - \mu}{\sigma^2}} $
Again, taking the gradient with respect to $\sigma$:
$ \Large \nabla_{\sigma} \log \left [\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} \right ] \\ \Large = \nabla_{\sigma} \log e^{-\frac{(\theta^T s - \mu)^2}{2\sigma^2}} - \nabla_{\sigma} \log \sigma\sqrt{2\pi} $
Again the log and exp cancel each other out. We then use the chain rule, the derivative of a fraction $\left(\frac{g}{h}\right)' = \frac{g'h - gh'}{h^2}$, and put everything over a common denominator. The gradient of the term $(\theta^T s - \mu)^2$, which doesn't contain $\sigma$, is 0, while this time the second part does depend on $\sigma$.
$ \Large = \nabla_{\sigma} \left ( -\frac{(\theta^T s - \mu)^2}{2\sigma^2} \right ) - \frac{1}{\sigma\sqrt{2\pi}} \sqrt{2\pi} \\ \Large = -\frac{ \nabla_{\sigma}(\theta^T s - \mu)^2 \cdot 2\sigma^2 - (\theta^T s - \mu)^2 \nabla_{\sigma}(2\sigma^2)}{4\sigma^4} - \frac{1}{\sigma} = -\frac{ 0 - (\theta^T s - \mu)^2 \cdot 4\sigma}{4\sigma^4} - \frac{1}{\sigma} = \boxed{\frac{(\theta^T s - \mu)^2 - \sigma^2}{\sigma^3}} $
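As with the softmax case, the boxed $\mu$ and $\sigma$ eligibilities can be sanity-checked against finite differences of the Gaussian log-density; in this standalone sketch $y$ plays the role of $\theta^T s$, and the values are arbitrary:

import numpy as np

def log_gauss(y, mu, sigma):
    # log of the Gaussian density evaluated at y
    return -np.log(sigma * np.sqrt(2 * np.pi)) - (y - mu) ** 2 / (2 * sigma ** 2)

y, mu, sigma, eps = 1.3, 0.4, 0.8, 1e-6

# Closed-form eligibilities from the derivations above
d_mu = (y - mu) / sigma ** 2
d_sigma = ((y - mu) ** 2 - sigma ** 2) / sigma ** 3

# Central finite differences w.r.t. mu and sigma
num_mu = (log_gauss(y, mu + eps, sigma) - log_gauss(y, mu - eps, sigma)) / (2 * eps)
num_sigma = (log_gauss(y, mu, sigma + eps) - log_gauss(y, mu, sigma - eps)) / (2 * eps)

print(np.allclose([d_mu, d_sigma], [num_mu, num_sigma], atol=1e-5))   # expected: True

The $\theta$ eligibility is the same quantity as the $\mu$ one, with a sign flip and an extra factor of $s$ coming from the chain rule through $y = \theta^T s$.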
Below is the Python implementation of the equations above.
The parameters are updated after each episode, and the for loop is over every step in the episode.
def updatePolicy_episodic(self):
    grad_pp0 = np.zeros_like(self.pp[0])  # theta
    grad_pp1 = np.zeros_like(self.pp[1])  # mu
    grad_pp2 = np.zeros_like(self.pp[2])  # sigma
    # Calculate episodic gradient update
    num_iter = len(self.states)  # episode length
    for i in range(num_iter):
        # Calculate reward contribution with baseline
        total_return = sum([self.gamma ** k * rew for k, rew in enumerate(self.rewards[i:])])
        advantage = total_return - np.mean(self.rewards[i:])
        # Calculate characteristic eligibility
        current_state = np.append(self.states[i], 1)  # state with appended bias term
        y = np.dot(self.pp[0].T, current_state)
        # linear mapping parameter (theta)
        eligib_0 = ((self.pp[1] - y) / self.pp[2] ** 2) * np.tile(current_state.reshape(-1, 1), self.num_action)
        grad_pp0 += eligib_0 * advantage
        # mean parameter (mu)
        eligib_1 = (y - self.pp[1]) / self.pp[2] ** 2
        grad_pp1 += eligib_1 * advantage
        # standard deviation parameter (sigma)
        eligib_2 = ((y - self.pp[1]) ** 2 - self.pp[2] ** 2) / self.pp[2] ** 3
        grad_pp2 += eligib_2 * advantage
    # Normalise gradient updates over the episode
    grad_pp0 /= num_iter
    grad_pp1 /= num_iter
    grad_pp2 /= num_iter
    # Update parameters (gradient ascent; note the sigma^2-scaled learning rates for mu and sigma)
    self.pp[0] += self.lr * grad_pp0
    self.pp[1] += self.lr * self.pp[2] ** 2 * grad_pp1
    self.pp[2] += self.lr * self.pp[2] ** 2 * grad_pp2