The Proximal Policy Optimization (PPO) algorithm is arguably the default choice in modern reinforcement learning (RL) libraries.
In this post we show how to derive PPO from first principles. First, we brush up on the underlying Markov Decision Process (MDP) model.
1. Preliminaries on Markov Decision Process (MDP)
In an MDP, an agent (say, a robot navigating a maze) interacts with an environment that evolves over time. If at step $t$ the agent is in state $s_t$ (the robot’s position) and takes action $a_t$ (the robot’s velocity vector), then
- the agent receives a reward $r(s_t, a_t)$ (the robot’s distance from the goal);
- the environment transitions to state $s_{t+1}$ at the next step $t+1$, with probability $P(s_{t+1} \mid s_t, a_t)$ (the robot’s new position).
Note that the Markov property is assumed: the knowledge of the current state is sufficient to evaluate the transition probability to the next state.
The agent’s behavior is described by a policy $\pi$, where $\pi(a \mid s)$ defines the probability of taking action $a$ in state $s$. The initial state $s_0$ is distributed according to an arbitrary distribution $\rho_0$. The long-term reward is defined as the expected $\gamma$-discounted sum of rewards:

$$\eta(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t \, r(s_t, a_t) \right],$$

where $\gamma \in [0, 1)$ is the discount factor and the notation $\tau \sim \pi$ means that the trajectory $\tau = (s_0, a_0, s_1, a_1, \dots)$ of states and actions is obtained by following strategy $\pi$.
The agent optimizes its strategy to maximize the long-term reward $\eta(\pi)$:

$$\pi^\star \in \arg\max_{\pi} \, \eta(\pi).$$
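To make the notation concrete, here is a minimal Python sketch (not taken from any library) that estimates $\eta(\pi)$ by Monte Carlo on a made-up two-state, two-action MDP; the transition tensor `P`, the reward table `R`, the uniform policy and all hyper-parameters are illustrative assumptions only.

```python
import numpy as np

def discounted_return(rewards, gamma=0.95):
    """Discounted sum of rewards along one trajectory: sum_t gamma^t * r_t."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# Toy 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s']: transition probabilities
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],                  # R[s, a]: immediate rewards
              [0.0, 2.0]])

def rollout(policy, horizon=200, rng=np.random.default_rng(0)):
    """Sample one trajectory of rewards by following `policy[s] = distribution over actions`."""
    s, rewards = 0, []
    for _ in range(horizon):
        a = rng.choice(2, p=policy[s])
        rewards.append(R[s, a])
        s = rng.choice(2, p=P[s, a])
    return rewards

uniform_policy = np.array([[0.5, 0.5], [0.5, 0.5]])
# Monte Carlo estimate of eta(pi): average the (truncated) discounted return over many roll-outs.
returns = [discounted_return(rollout(uniform_policy)) for _ in range(100)]
print("estimated eta(pi) ~", np.mean(returns))
```

The same toy model is reused in the sketches below.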
1.1. Advantage function
To improve the performance of a policy, it is important to learn whether a deviation from the base policy brings any benefit. This is formalized by the advantage function. We call $A_\pi(s, a)$ the advantage of i) choosing action $a$ in state $s$ and then following policy $\pi$, over ii) adopting $\pi$ from state $s$ onwards:

$$A_\pi(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)} \big[ r(s, a) + \gamma \, V_\pi(s') \big] - V_\pi(s),$$

where $V_\pi(s)$ is the long-term reward when the initial state is $s$ and policy $\pi$ is followed. In other words, if $A_\pi(s, a) > 0$, then taking action $a$ in state $s$ is more advantageous than adopting strategy $\pi$.
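Sticking to the same toy model (again, purely illustrative), the advantage function can be computed exactly from the definition above, since $V_\pi$ solves a small linear Bellman system:

```python
import numpy as np

# Same illustrative toy MDP as in the previous sketch.
P = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]])  # P[s, a, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])                              # R[s, a]
gamma = 0.95
pi = np.array([[0.5, 0.5], [0.5, 0.5]])                             # pi[s, a] = pi(a|s)

# V_pi solves the Bellman linear system V = R_pi + gamma * P_pi V.
R_pi = (pi * R).sum(axis=1)                      # expected reward per state under pi
P_pi = np.einsum("sa,sap->sp", pi, P)            # state-to-state transition matrix under pi
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

# A_pi(s, a) = r(s, a) + gamma * E_{s'}[V_pi(s')] - V_pi(s)
Q = R + gamma * P @ V
A = Q - V[:, None]
print(A)   # A[s, a] > 0 means action a in state s beats simply following pi
```

As a sanity check, note that $\sum_a \pi(a \mid s) \, A_\pi(s, a) = 0$ in every state: on average, following $\pi$ brings no advantage over itself.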
1.2. Policy parametrization
When the number of states is large, we typically avoid defining the agent’s strategy in each state individually. Instead, it is convenient to define the policy as a certain function of (a relatively small number of) parameters $\theta$, which we call $\pi_\theta$. Note that $\theta$ usually consists of the weights and biases of a neural network.
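As a concrete sketch (a deliberately simplified stand-in, not the post's implementation), a linear-softmax model can play the role of the neural network; the feature vector, its dimension and the initialization scale below are arbitrary illustrative choices.

```python
import numpy as np

def softmax(z):
    z = z - z.max()               # for numerical stability
    e = np.exp(z)
    return e / e.sum()

class SoftmaxPolicy:
    """pi_theta(a|s) = softmax(W @ phi(s) + b); here theta = (W, b) plays the role of the
    weights and biases of a (very small) neural network."""
    def __init__(self, n_features, n_actions, rng=np.random.default_rng(0)):
        self.W = 0.01 * rng.standard_normal((n_actions, n_features))
        self.b = np.zeros(n_actions)

    def probs(self, phi_s):
        """Action distribution pi_theta(. | s) for a state described by features phi_s."""
        return softmax(self.W @ phi_s + self.b)

policy = SoftmaxPolicy(n_features=4, n_actions=3)
phi_s = np.array([1.0, 0.0, -1.0, 0.5])          # hypothetical state features
print(policy.probs(phi_s))                        # a valid probability distribution over 3 actions
rng = np.random.default_rng(1)
a = rng.choice(3, p=policy.probs(phi_s))          # sampling an action from pi_theta(. | s)
```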
2. Performance gain
PPO is an algorithm that iteratively improves the performance of an initial policy. It is then useful to evaluate the performance gain $\Delta(\pi', \pi)$ achieved by a policy $\pi'$ with respect to a baseline policy $\pi$:

$$\Delta(\pi', \pi) := \eta(\pi') - \eta(\pi).$$

As shown in [1], [2], $\Delta(\pi', \pi)$ can be expressed as:

$$\Delta(\pi', \pi) = \mathbb{E}_{\tau \sim \pi'} \left[ \sum_{t=0}^{\infty} \gamma^t \, A_{\pi}(s_t, a_t) \right]. \tag{4}$$
The right-hand side of (4) reads as follows:
Follow the new strategy $\pi'$ and observe a trajectory of actions $a_0, a_1, a_2, \dots$ and states $s_0, s_1, s_2, \dots$. Along this trajectory, evaluate the (discounted) advantages of taking such actions with respect to the old policy $\pi$ and sum them up. Finally, take the expectation across all possible trajectories.
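Before the visual illustration, here is a small numerical sanity check of identity (4) on the toy model introduced earlier (the two policies and all numbers are made up): the left-hand side is computed exactly, while the right-hand side is estimated by averaging discounted advantages along roll-outs of the new policy.

```python
import numpy as np

# Toy MDP and exact quantities (same illustrative model as in Section 1).
P = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]])  # P[s, a, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])                              # R[s, a]
gamma, rho0 = 0.95, np.array([1.0, 0.0])                            # always start in state 0

def value(pi):
    """Exact V_pi via the Bellman linear system."""
    R_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum("sa,sap->sp", pi, P)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)

pi_old = np.array([[0.5, 0.5], [0.5, 0.5]])
pi_new = np.array([[0.2, 0.8], [0.1, 0.9]])

V_old = value(pi_old)
A_old = R + gamma * P @ V_old - V_old[:, None]          # A_{pi_old}(s, a)

# Left-hand side of (4): eta(pi_new) - eta(pi_old), computed exactly.
lhs = rho0 @ (value(pi_new) - V_old)

# Right-hand side of (4): discounted sum of A_{pi_old} along trajectories generated by pi_new.
rng, totals = np.random.default_rng(0), []
for _ in range(1000):
    s, total = 0, 0.0
    for t in range(200):
        a = rng.choice(2, p=pi_new[s])
        total += gamma**t * A_old[s, a]
        s = rng.choice(2, p=P[s, a])
    totals.append(total)

print(lhs, np.mean(totals))   # the two numbers should roughly agree (up to Monte Carlo noise)
```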
Visually:

2.1. A more explicit formula for $\Delta(\pi', \pi)$
It is convenient to formulate $\Delta(\pi', \pi)$ more explicitly. We start by rewriting (4) as:

$$\Delta(\pi', \pi) = \sum_{t=0}^{\infty} \gamma^t \sum_{s} P(s_t = s \mid \pi') \sum_{a} \pi'(a \mid s) \, A_\pi(s, a).$$

We then exchange the order of summation:

$$\Delta(\pi', \pi) = \sum_{s} \rho_{\pi'}(s) \sum_{a} \pi'(a \mid s) \, A_\pi(s, a), \tag{5}$$

where $\rho_{\pi'}(s)$ is the (discounted) frequency at which the MDP visits state $s$ over time:

$$\rho_{\pi'}(s) := \sum_{t=0}^{\infty} \gamma^t \, P(s_t = s \mid \pi').$$
Notice that in expression (5), the effect of policy $\pi'$ on the state distribution (in the outer sum) is separated from its effect on the actions selected in each state (in the inner sum).
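On the toy model, $\rho_{\pi'}$ can be computed in closed form, since the sum over time is a matrix geometric series; a brief sketch (illustrative numbers only):

```python
import numpy as np

# Same illustrative toy MDP as above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]])  # P[s, a, s']
gamma, rho0 = 0.95, np.array([1.0, 0.0])
pi_new = np.array([[0.2, 0.8], [0.1, 0.9]])

# Discounted visitation frequency: rho_pi'(s) = sum_t gamma^t P(s_t = s | pi')
#                                             = [rho0^T (I - gamma * P_pi')^{-1}](s).
P_pi = np.einsum("sa,sap->sp", pi_new, P)      # state-to-state transitions under pi_new
rho = rho0 @ np.linalg.inv(np.eye(2) - gamma * P_pi)
print(rho)                                     # un-normalized: the entries sum to 1 / (1 - gamma)
print(rho.sum(), 1 / (1 - gamma))
```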
3. How to make $\Delta(\pi', \pi)$ positive?
To improve upon our baseline policy $\pi$, we want to find a second policy $\pi'$ such that

$$\Delta(\pi', \pi) = \eta(\pi') - \eta(\pi) > 0.$$
Idea 1: Maximizing $\Delta(\pi', \pi)$ directly. This is unfeasible, since it would amount to finding the optimal policy, which is our ultimate goal!
Idea 2: Policy improvement. Let us look back at expression (5). The policy $\pi'$ influences both the state frequency $\rho_{\pi'}$, in the outer sum, and the action choice, in the inner sum. Unfortunately, the former dependency, between $\pi'$ and $\rho_{\pi'}$, is difficult to characterize. However, notice that the outer terms $\rho_{\pi'}(s)$ are already non-negative. Then, if we can ensure that each inner sum (i.e., the average advantage in each state) is positive, we no longer need to worry about $\rho_{\pi'}$!
In other words, to obtain $\Delta(\pi', \pi) \geq 0$ it suffices that

$$\sum_{a} \pi'(a \mid s) \, A_\pi(s, a) \geq 0, \quad \text{for all states } s.$$

Such an approach generalizes the classic policy improvement algorithm, which maximizes the advantage function in each state:

$$\pi'(s) \in \arg\max_{a} \, A_\pi(s, a), \quad \text{for all states } s.$$
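On the toy model, where the advantage function can be computed exactly, the classic policy improvement step looks as follows (a sketch with illustrative numbers, not a general implementation):

```python
import numpy as np

# Same illustrative toy MDP as above.
P = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]])  # P[s, a, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])                              # R[s, a]
gamma = 0.95
pi = np.array([[0.5, 0.5], [0.5, 0.5]])                             # base policy

# Exact advantage of the base policy (possible here only because the toy model is known).
R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum("sa,sap->sp", pi, P)
V = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
A = R + gamma * P @ V - V[:, None]

# Classic policy improvement: in every state, put all the mass on the highest-advantage action.
pi_improved = np.eye(2)[A.argmax(axis=1)]
print(pi_improved)   # deterministic policy; its expected advantage in each state is >= 0 by construction
```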
However, when the number of states is large, such a method presents several issues:
- i) Enumerating all states may be infeasible in the first place;
- ii) The advantage function $A_\pi$ is not known precisely; thus, policy improvement may take wrong decisions, possibly resulting in $\Delta(\pi', \pi) < 0$;
- iii) When the policy is a parametric function $\pi_\theta$, its degrees of freedom are reduced; hence, optimizing the policy independently in each state becomes impossible.
Idea 3: Maximize a lower bound. PPO, as well as its ancestor TRPO, overcomes the limitations above by maximizing a convenient lower bound for $\Delta(\pi', \pi)$. Once the lower bound becomes positive, $\Delta(\pi', \pi)$ is guaranteed to be positive as well. Next, we derive such a lower bound.
4. A lower bound for the performance gain
We derive a lower bound for the performance gain in three steps.
Step 1: Approximating $\Delta(\pi', \pi)$. As already mentioned, the performance gain $\Delta(\pi', \pi)$ can be decomposed into two components:
- (inner sum) the expected action advantage $\sum_a \pi'(a \mid s) \, A_\pi(s, a)$ in each state $s$;
- (outer sum) the weighted average of the inner terms with respect to the state visitation frequency $\rho_{\pi'}$.
It turns out that when a policy $\pi'$ is “close” to the base policy $\pi$, the short-term factor (the inner sum) provides the dominant contribution. Thus, we can neglect the dependence of $\Delta(\pi', \pi)$ on the state visitation frequency $\rho_{\pi'}$, replacing it with the frequency $\rho_{\pi}$ induced by the base policy $\pi$. This yields the approximate performance gain $L_\pi(\pi')$:

$$L_\pi(\pi') := \sum_{s} \rho_{\pi}(s) \sum_{a} \pi'(a \mid s) \, A_\pi(s, a) \;\approx\; \Delta(\pi', \pi). \tag{10}$$
Step 2: Quantify the “closeness” between strategies $\pi$ and $\pi'$. This is done by computing the maximum Kullback-Leibler (KL) divergence between the action distributions $\pi(\cdot \mid s)$ and $\pi'(\cdot \mid s)$ across states $s$:

$$D_{\mathrm{KL}}^{\max}(\pi, \pi') := \max_{s} \, D_{\mathrm{KL}}\big( \pi(\cdot \mid s) \,\|\, \pi'(\cdot \mid s) \big).$$

Observe that $D_{\mathrm{KL}}^{\max}(\pi, \pi') = 0$ means that $\pi$ and $\pi'$ behave identically in every state, while higher values of $D_{\mathrm{KL}}^{\max}(\pi, \pi')$ indicate larger behavior discrepancies between the two policies in at least one state.
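A minimal sketch of this quantity for tabular policies (the helper functions `kl` and `max_kl` are names introduced here purely for illustration):

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def max_kl(pi_old, pi_new):
    """D_KL^max(pi, pi'): the largest per-state KL divergence between the action distributions."""
    return max(kl(pi_old[s], pi_new[s]) for s in range(len(pi_old)))

pi_old = np.array([[0.5, 0.5], [0.5, 0.5]])
pi_new = np.array([[0.2, 0.8], [0.1, 0.9]])
print(max_kl(pi_old, pi_old))   # 0.0: identical behavior in every state
print(max_kl(pi_old, pi_new))   # > 0: the policies disagree in at least one state
```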
Step 3: Deriving the lower bound for $\Delta(\pi', \pi)$. We can lower bound the performance gain $\Delta(\pi', \pi)$ by the approximate gain $L_\pi(\pi')$ minus (a scaled version of) the divergence between the two policies [2]:

$$\Delta(\pi', \pi) \;\geq\; L_\pi(\pi') - C \, D_{\mathrm{KL}}^{\max}(\pi, \pi'), \tag{12}$$

where $C = \frac{4 \gamma \max_{s,a} |A_\pi(s,a)|}{(1-\gamma)^2}$ [2]. Next, we discuss why maximizing the lower bound is a good idea, and how to do it.
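Before that, here is a quick numerical illustration of the right-hand side of (12) for a tabular problem; the visitation frequencies, advantages and policies below are made up. Note how large the constant $C$ is: the bound is very conservative, which is one reason why, in practice, the penalty is later replaced by a constraint (Section 5.1).

```python
import numpy as np

def surrogate_gain(rho_old, pi_new, A_old):
    """L_pi(pi'): approximate gain (10), using the OLD state-visitation frequency."""
    return float(np.sum(rho_old[:, None] * pi_new * A_old))

def lower_bound(rho_old, pi_old, pi_new, A_old, gamma):
    """Right-hand side of (12): L_pi(pi') - C * D_KL^max(pi, pi'), C = 4 * gamma * max|A| / (1 - gamma)^2."""
    C = 4 * gamma * np.abs(A_old).max() / (1 - gamma) ** 2
    per_state_kl = np.sum(pi_old * (np.log(pi_old + 1e-12) - np.log(pi_new + 1e-12)), axis=1)
    return surrogate_gain(rho_old, pi_new, A_old) - C * per_state_kl.max()

# Made-up quantities for a 2-state, 2-action problem (illustration only).
rho_old = np.array([12.0, 8.0])                  # discounted visitation frequencies under pi
A_old = np.array([[0.3, -0.3], [-0.1, 0.1]])     # advantage (estimates) A_pi(s, a)
pi_old = np.array([[0.5, 0.5], [0.5, 0.5]])
pi_new = np.array([[0.6, 0.4], [0.45, 0.55]])
print(lower_bound(rho_old, pi_old, pi_new, A_old, gamma=0.95))   # typically negative unless pi' is very close to pi
```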
5. TRPO: Maximize the lower bound
Trust Region Policy Optimization (TRPO) [2], PPO’s direct ancestor, maximizes the performance gain lower bound (12):

$$\max_{\pi'} \; L_\pi(\pi') - C \, D_{\mathrm{KL}}^{\max}(\pi, \pi').$$

In this case, we strive to jointly
- select advantageous actions, to increase the approximate gain $L_\pi(\pi')$;
- remain close to the base policy $\pi$, to reduce the KL divergence.
Observe that, as a side-product, the latter point is of practical importance to ensure stability.
As promised, the resulting policy $\pi'$ satisfies $\Delta(\pi', \pi) \geq 0$. This stems from its lower bound being non-negative: the choice $\pi' = \pi$ already achieves a value of zero in (12), so the maximized lower bound cannot be negative, as illustrated below.

5.1. Towards practice (1): Turn penalty into constraint
In practice, the coefficient $C$ penalizing the divergence in TRPO forces tiny policy updates and is hard to tune. A better option is to replace the penalty on $D_{\mathrm{KL}}^{\max}(\pi, \pi')$ with a hard constraint, in a way reminiscent of Lagrangian duality. Thus, we maximize the approximate gain while remaining “close” to $\pi$:

$$\max_{\pi'} \; L_\pi(\pi') \quad \text{s.t.} \quad D_{\mathrm{KL}}^{\max}(\pi, \pi') \leq \delta. \tag{14}$$
5.2. Towards practice (2): Estimating $L_\pi(\pi')$ via sample averages
Computing the analytical expression of $L_\pi(\pi')$ in (10) is challenging due to the presence of the state frequency $\rho_\pi$.
Estimating $L_\pi(\pi')$ turns out to be easier. In fact, one can
- express it as an expectation with respect to the base policy $\pi$;
- compute empirical averages across multiple trajectories obtained by following $\pi$.
In detail, we first multiply and divide each term of the inner sum of expression (10) by $\pi(a \mid s)$ (assuming it is positive):

$$L_\pi(\pi') = \sum_{s} \rho_{\pi}(s) \sum_{a} \pi(a \mid s) \, \frac{\pi'(a \mid s)}{\pi(a \mid s)} \, A_\pi(s, a).$$

Then, we can estimate $L_\pi(\pi')$ as the empirical average $\hat{L}_\pi(\pi')$ across trajectory roll-outs $\tau^1, \dots, \tau^N$ generated by the old strategy $\pi$:

$$\hat{L}_\pi(\pi') = \frac{1}{N} \sum_{i=1}^{N} \sum_{t \geq 0} \gamma^t \, \frac{\pi'(a_t^i \mid s_t^i)}{\pi(a_t^i \mid s_t^i)} \, \hat{A}_\pi(s_t^i, a_t^i),$$

where $(s_t^i, a_t^i)$ are the states and actions of roll-out $\tau^i$ and $\hat{A}_\pi$ is an estimate of the advantage function (see Section 5.3).
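As a sketch (not from the post), the estimator above can be computed from recorded roll-outs of the old policy as follows; the states, actions and advantage estimates are fabricated for illustration, and in practice $\hat{A}_\pi$ would come from an estimator such as GAE [4].

```python
import numpy as np

def estimate_surrogate(trajectories, pi_old, pi_new, gamma=0.95):
    """Empirical estimate of L_pi(pi') from roll-outs collected under the OLD policy pi.

    Each trajectory is a list of (s, a, adv_hat) tuples, where adv_hat estimates A_pi(s, a)."""
    total = 0.0
    for traj in trajectories:
        for t, (s, a, adv_hat) in enumerate(traj):
            ratio = pi_new[s, a] / pi_old[s, a]        # importance weight pi'(a|s) / pi(a|s)
            total += gamma**t * ratio * adv_hat
    return total / len(trajectories)

# Illustrative call with two tiny fabricated roll-outs (states, actions and advantages are made up).
pi_old = np.array([[0.5, 0.5], [0.5, 0.5]])
pi_new = np.array([[0.6, 0.4], [0.45, 0.55]])
trajs = [[(0, 1, 0.2), (1, 0, -0.1), (0, 0, 0.05)],
         [(1, 1, 0.3), (0, 1, 0.0)]]
print(estimate_surrogate(trajs, pi_old, pi_new))
```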
Visually:

5.3. Towards practice (3): Estimating the advantage function
In this post we do not discuss how to estimate the advantage function. For this, the reader is encouraged to consult the excellent reference [4].
5.4. Addressing the issues of policy improvement
As promised, TRPO solves (or at least, mitigates) the issues with policy improvement (PI):
- i) Unlike PI, TRPO does not need to enumerate all states;
- ii) TRPO is more robust to uncertainties in the advantage function than PI, since TRPO maximizes an aggregate advantage function rather than each individual advantage term $A_\pi(s, a)$;
- iii) For the same reason, TRPO is more suitable than PI for parametric policies, for which actions cannot be optimized independently in each state.
5.5. All that glitters is not (yet) gold
Finding the constrained TRPO policy by solving the optimization problem (14) is still hard, due to the constraint on the KL divergence. This is where PPO comes to the rescue, as we will see next.
6. Proximal Policy Optimization (PPO)
PPO does away with TRPO’s hard-to-handle constraint on the KL divergence by re-incorporating it into the objective function in a friendly way, i.e., one amenable to gradient-based optimization. To understand how, we follow four steps.
Step 1. Let us first recall informally TRPO’s goal:
Maximize the approximate performance gain $L_\pi(\pi')$ while keeping the new policy $\pi'$ close to the old policy $\pi$.
Step 2. What goes wrong when we omit the constraint on $\pi'$ being close to $\pi$? In this case, we would simply maximize $\hat{L}_\pi(\pi')$:

$$\max_{\pi'} \; \frac{1}{N} \sum_{i=1}^{N} \sum_{t \geq 0} \gamma^t \, \frac{\pi'(a_t^i \mid s_t^i)}{\pi(a_t^i \mid s_t^i)} \, \hat{A}_\pi(s_t^i, a_t^i).$$

The optimal policy would tend to assign high probability to actions with a large ratio of (estimated) advantage to probability of being chosen by the old policy, i.e., with large $\hat{A}_\pi(s, a) / \pi(a \mid s)$.
Visually:

Thus, actions that were chosen very infrequently under $\pi$ but have even a slightly positive advantage would be heavily favored by the new policy. This is risky: the less frequently an action is selected, the less accurate its advantage estimate. Therefore, the risk of overcommitting to suboptimal actions is high.
On the other hand, actions with an even slightly negative advantage would be completely ruled out. Again, this is not wise, due to the advantage estimation inaccuracies.
Step 3. PPO prevents the issues above by artificially limiting the reward for choosing policies that deviate too far from the old policy $\pi$. In this way, the new policy naturally remains in a close neighborhood of $\pi$.
In detail, PPO maximizes a new objective function

$$L^{\mathrm{CLIP}}_{\pi}(\pi') = \frac{1}{N} \sum_{i=1}^{N} \sum_{t \geq 0} \gamma^t \, \ell_t^i(\pi'),$$

where the individual terms are defined as

$$\ell_t^i(\pi') = \min\!\left( \frac{\pi'(a_t^i \mid s_t^i)}{\pi(a_t^i \mid s_t^i)} \, \hat{A}_\pi(s_t^i, a_t^i), \;\; \mathrm{clip}\!\left( \frac{\pi'(a_t^i \mid s_t^i)}{\pi(a_t^i \mid s_t^i)}, \, 1 - \epsilon, \, 1 + \epsilon \right) \hat{A}_\pi(s_t^i, a_t^i) \right),$$

with $\epsilon$ a small constant (e.g., $\epsilon = 0.2$ [3]).
The equation above is best understood via an illustration:

This shows that even when an action $a$ exhibits a large ratio $\hat{A}_\pi(s, a) / \pi(a \mid s)$, PPO has no incentive to place all its weight on $a$, since the associated reward plateaus once the probability ratio $\pi'(a \mid s) / \pi(a \mid s)$ exceeds $1 + \epsilon$. Instead, PPO will increase the action's probability of being selected only by a small (relative) amount with respect to $\pi(a \mid s)$, on the order of $\epsilon$. Similarly, when an action has a negative advantage, PPO will benefit only from slightly decreasing its probability of being chosen.
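A minimal sketch of the clipped term (the function name `ppo_clip_term` and the sample ratios are illustrative choices) makes this plateau behavior explicit:

```python
import numpy as np

def ppo_clip_term(ratio, adv_hat, eps=0.2):
    """One clipped term: min(ratio * A_hat, clip(ratio, 1 - eps, 1 + eps) * A_hat)."""
    return np.minimum(ratio * adv_hat, np.clip(ratio, 1 - eps, 1 + eps) * adv_hat)

# For a positive advantage, the reward for increasing the ratio plateaus at 1 + eps:
print(ppo_clip_term(np.array([0.8, 1.0, 1.2, 2.0]), adv_hat=1.0))   # [0.8, 1.0, 1.2, 1.2]
# For a negative advantage, the reward for decreasing the ratio plateaus at 1 - eps:
print(ppo_clip_term(np.array([0.5, 0.8, 1.0, 2.0]), adv_hat=-1.0))  # [-0.8, -0.8, -1.0, -2.0]
```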
Step 4. The PPO objective can now be maximized via gradient-based methods. The policy to be optimized is chosen to be a parametric one, $\pi_\theta$, where $\theta$ is the set of parameters. Importantly, $\pi_\theta(a \mid s)$ must be differentiable with respect to $\theta$, for every state $s$ and action $a$. This ensures that PPO’s objective function is differentiable, enabling the parameters $\theta$ to be optimized using gradient-based techniques, which are well understood and efficient.
6.1. PPO’s main steps
To sum up, here are the main steps of the PPO algorithm:
- a) Start from an arbitrary randomized policy $\pi_{\theta_0}$.
- b) Repeat for $k = 0, 1, 2, \dots$ until convergence (a code sketch follows this list):
  - 1) Run policy $\pi_{\theta_k}$ and collect trajectories $\tau^1, \dots, \tau^N$;
  - 2) Estimate the advantage function $\hat{A}_{\pi_{\theta_k}}(s_t^i, a_t^i)$, for all steps $t$ of each trajectory $\tau^i$;
  - 3) Solve via gradient-based methods: $\theta_{k+1} \in \arg\max_{\theta} L^{\mathrm{CLIP}}_{\pi_{\theta_k}}(\pi_\theta)$;
  - 4) Replace $\pi_{\theta_k} \leftarrow \pi_{\theta_{k+1}}$.
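Below is a compact, self-contained sketch of this loop on the same toy tabular MDP used in the earlier snippets. Everything here is a simplifying assumption made for illustration: a tabular softmax policy stands in for a neural network, Monte Carlo returns-to-go minus a per-state baseline stand in for a proper advantage estimator such as GAE [4], and a hand-derived gradient with plain gradient ascent stands in for automatic differentiation and Adam.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-state, 2-action MDP (illustrative numbers, as in the earlier sketches).
P = np.array([[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.1, 0.9]]])  # P[s, a, s']
R = np.array([[1.0, 0.0], [0.0, 2.0]])                              # R[s, a]
n_states, n_actions, gamma, horizon = 2, 2, 0.95, 50

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def rollout(pi):
    """One trajectory [(s, a, r), ...] obtained by following the tabular policy pi[s, a]."""
    s, traj = 0, []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        traj.append((s, a, R[s, a]))
        s = rng.choice(n_states, p=P[s, a])
    return traj

def mc_advantages(trajs):
    """Crude advantage estimates: discounted return-to-go minus a per-state baseline (stand-in for GAE [4])."""
    data = []                                           # tuples (s, a, t, return-to-go)
    for traj in trajs:
        g = 0.0
        for t, (s, a, r) in reversed(list(enumerate(traj))):
            g = r + gamma * g
            data.append((s, a, t, g))
    baseline = np.zeros(n_states)
    for s in range(n_states):
        returns = [g for (s_, _, _, g) in data if s_ == s]
        baseline[s] = np.mean(returns) if returns else 0.0
    return [(s, a, t, g - baseline[s]) for (s, a, t, g) in data]

# --- PPO main loop ---------------------------------------------------------
theta = np.zeros((n_states, n_actions))                 # a) arbitrary initial (here: uniform) policy
eps, lr, n_traj = 0.2, 0.05, 20

for _ in range(50):                                     # b) repeat until convergence
    pi_old = softmax(theta)
    trajs = [rollout(pi_old) for _ in range(n_traj)]    # 1) run pi_old and collect trajectories
    samples = mc_advantages(trajs)                      # 2) estimate the advantage function

    for _ in range(10):                                 # 3) a few gradient-ascent steps on L^CLIP
        pi_new = softmax(theta)
        grad = np.zeros_like(theta)
        for (s, a, t, adv) in samples:
            ratio = pi_new[s, a] / pi_old[s, a]
            if (ratio > 1 + eps and adv > 0) or (ratio < 1 - eps and adv < 0):
                continue                                # clipped region: this sample contributes no gradient
            # d(ratio)/d(theta[s, :]) for a tabular softmax policy
            dratio = pi_new[s, a] * ((np.arange(n_actions) == a) - pi_new[s]) / pi_old[s, a]
            grad[s] += gamma**t * adv * dratio
        theta += lr * grad / n_traj
    # 4) pi_old is replaced by softmax(theta) at the top of the next iteration

print("learned policy:\n", softmax(theta).round(3))     # should gradually drift toward the rewarding actions
```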
References
[1] Kakade, S., Langford, J. (2002, July). Approximately optimal approximate reinforcement learning. In Proceedings of the nineteenth international conference on machine learning (pp. 267-274).
[2] Schulman, J., Levine, S., Abbeel, P., Jordan, M., Moritz, P. (2015, June). Trust region policy optimization. In International conference on machine learning (pp. 1889-1897). PMLR.
[3] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
[4] Schulman, J., Moritz, P., Levine, S., Jordan, M., Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438.