Policy gradient for black-box optimization

Policy gradient methods are widely used in Reinforcement Learning. In this post we build policy gradient from the ground up, starting from the simpler static scenario, where we maximize a reward function {r} that depends solely on our control variable {x}. In subsequent posts we will turn our attention to the contextual bandit setting, where the reward also depends on a "state" that evolves exogenously, and finally to the "full-blown" Reinforcement Learning scenario, where the state evolves endogenously, as a function of the control variable.

1. Main procedure

Suppose that we want to maximize a function {r:\mathbb R^d \rightarrow \mathbb R}:

\displaystyle \max_{x\in \mathbb R^d} r(x) \ \ \ \ \ (1)

The function {r} is initially unknown and can only be observed via noisy measurements, which we denote by {\widetilde{r}(x)=r(x)+\epsilon}. The noise {\epsilon} is assumed i.i.d. across consecutive samplings and has zero mean, so that {\mathbb E[\widetilde{r}(x)]=r(x)} for all {x}.

The literature offers several methods to maximize {r} in this setting, called derivative-free (since the gradient of {r} is not directly accessible) or black-box (since we only learn about {r} through its input-output behavior), such as Bayesian optimization [1] or the Nelder-Mead algorithm [2]. In policy gradient, instead of optimizing the input variable {x} directly, we consider a stochastic policy {\pi_\theta(x)}, parametrized by {\theta}, that defines the probability of selecting {x}; we then optimize {\theta} via gradient ascent.

When the policy {\pi_\theta} is used, we call {J(\theta)} the corresponding average reward:

\displaystyle J(\theta) = \int r(x) \pi_\theta(x) dx \ \ \ \ \ (2)

Our goal is to optimize {\theta} by performing gradient ascent on {J(\theta)}. The gradient of {J(\theta)} with respect to the parameters {\theta} can be expressed as:

\displaystyle \nabla_\theta J(\theta) = \int r(x) \nabla_\theta \pi_\theta(x) dx \ \ \ \ \ (3)

\displaystyle \qquad = \int r(x) \frac{\pi_\theta(x)}{\pi_\theta(x)} \nabla_\theta \pi_\theta(x) dx \ \ \ \ \ (4)

\displaystyle \qquad = \int r(x) \pi_\theta(x) \nabla_\theta \log \pi_\theta(x) dx \ \ \ \ \ (5)

where in (5) we exploit the relation {\nabla_\theta \log\pi_\theta(x)=\frac{\nabla_\theta \pi_\theta(x)}{\pi_\theta(x)}}. Note that (5) is the expectation, taken with respect to the observation noise and to the control variable {x} drawn from {\pi_\theta}, of the noisy measurement {\widetilde{r}(x)} multiplied by the gradient of {\log \pi_\theta(x)}, i.e.,

\displaystyle \nabla_\theta J(\theta) = \mathbb E_{\epsilon,x\sim \pi_\theta(x)} \, \big[ \widetilde{r}(x) \nabla_\theta \log\pi_\theta(x) \big]. \ \ \ \ \ (6)

Hence, via a simple trick, we have transformed the problem of evaluating the integral (3), which would require knowing the function {r} everywhere (clearly at odds with our assumptions!), into the different problem of computing an expectation.
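As a quick sanity check of (6), the following Python snippet (a minimal sketch, assuming numpy, a toy reward {r(x)=-(x-2)^2}, and a univariate Gaussian policy, whose score {(x-m)/\sigma^2} we derive in Section 1.1 below) compares the Monte Carlo average of {\widetilde{r}(x)\nabla_\theta \log\pi_\theta(x)} with the gradient of {J} computed in closed form:

import numpy as np

rng = np.random.default_rng(0)
m, sigma = 0.5, 1.0                        # univariate Gaussian policy N(m, sigma^2)
r = lambda x: -(x - 2.0) ** 2              # toy reward, unknown to the algorithm
xs = rng.normal(m, sigma, size=200_000)    # x ~ pi_theta
rewards = r(xs) + 0.1 * rng.standard_normal(xs.shape)   # noisy observations of r
score_m = (xs - m) / sigma ** 2            # d/dm log pi_theta(x), cf. (13) below
print(np.mean(rewards * score_m))          # Monte Carlo estimate of dJ/dm
# For this reward J(m, sigma^2) = -((m-2)^2 + sigma^2), hence dJ/dm = -2(m-2) = 3,
# and the printed value should be close to 3.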

The good news is that such an expectation can be approximated by randomly sampling the function {r} at points drawn from the distribution {\pi_\theta}. Yet, before claiming victory, we still need to check whether we are allowed to plug this approximation of (6) into gradient ascent and still converge to a (local) optimum of {J(\theta)}. Fortunately, this is the case, as the theory of Stochastic Gradient Ascent [3] teaches us. Thus, one can draw samples {x_1,\dots,x_n} from {\pi_\theta}, compute {\widetilde{r}(x_i) \nabla_\theta \log \pi_\theta(x_i)} for {i=1,\dots,n}, and update {\theta} as:

\displaystyle \theta_{t+1} = \theta_t + \alpha_t \sum_{i=1}^n \widetilde{r}(x_i) \nabla_\theta \log \pi_\theta(x_i)\big|_{\theta=\theta_t} \ \ \ \ \ (7)

where {\alpha_t} is the learning rate [3], which must converge to 0 but "not too fast" ({\sum_t \alpha_t =\infty} and {\sum_t \alpha_t^2<\infty}, e.g., {\alpha_t \propto 1/t}).

To recap, the (basic) policy gradient procedure is the following (a minimal code sketch is given right after the list):

  1. Define policy {\pi_\theta(x), \forall \, x} and initialize {\theta:=\theta_0}, {t:=0}
  2. Draw {x_1,\dots,x_n} from {\pi_{\theta_t}(.)}
  3. Observe the corresponding samples {\{\widetilde{r}(x_i)\}_{i=1,\dots,n}}
  4. Update {\theta_{t+1} = \theta_t + \alpha_t \sum_{i=1}^n \widetilde{r}(x_i) \nabla_\theta \log \pi_\theta(x_i)}
  5. Increment {t} and return to 2).
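Translated into code, the loop above looks like the following Python sketch (the functions sample_policy, grad_log_policy, and noisy_reward are hypothetical placeholders to be supplied by the user; the learning rate follows the {\alpha_t \propto 1/t} schedule mentioned earlier):

import numpy as np

def policy_gradient(sample_policy, grad_log_policy, noisy_reward,
                    theta0, n_samples=20, n_iters=500):
    # Basic policy gradient loop, steps 1-5 above.
    theta = np.asarray(theta0, dtype=float)
    for t in range(1, n_iters + 1):
        alpha_t = 1.0 / t                        # step size, satisfies the conditions on alpha_t
        xs = sample_policy(theta, n_samples)     # step 2: draw x_1, ..., x_n from pi_theta
        grad = np.zeros_like(theta)
        for x in xs:
            r_tilde = noisy_reward(x)            # step 3: observe the noisy reward
            grad += r_tilde * grad_log_policy(theta, x)
        theta = theta + alpha_t * grad           # step 4: update (7)
    return theta

The next section shows what grad_log_policy looks like when the policy is Gaussian.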

1.1. A concrete example: Gaussian policy

A natural choice for the parametrized policy {\pi_\theta(x)} is the multivariate Gaussian distribution {\mathcal N (\mu,\Sigma)}, where the parameters {\theta} consist of the mean vector {\mu} and the covariance matrix {\Sigma}. More explicitly,

\displaystyle \pi_\theta(x) = \, (2\pi)^{-d/2} \det(\Sigma)^{-1/2}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu) \right) \ \ \ \ \ (8)

\displaystyle \log \pi_\theta(x) = -\frac{d}{2}\log 2\pi+\frac{1}{2} \log \det \Sigma^{-1} - \frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu) \ \ \ \ \ (9)

To compute the gradient of {\log \pi_\theta(x)} we exploit the following matrix calculus identities:

\displaystyle \nabla_y y^T A y = (A+A^T)y \ \ \ \ \ (10)

\displaystyle \nabla_A \log \det A^{-1} = -A^{-1} \ \ \ \ \ (11)

\displaystyle \nabla_A x^T A^{-1} y = -\left(A^{-1}\right)^T x y^T \left(A^{-1}\right)^T \ \ \ \ \ (12)

along with the fact that the inverse of a symmetric matrix is also symmetric. Then, we can compute:

\displaystyle \nabla_\mu \log \pi_\theta(x) = \Sigma^{-1} (x-\mu) \ \ \ \ \ (13)

\displaystyle \nabla_\Sigma \log \pi_\theta(x) = -\frac{1}{2} \Sigma^{-1} \left( I - (x-\mu) (x-\mu)^T \Sigma^{-1} \right). \ \ \ \ \ (14)
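In code, (13) and (14) translate directly; here is a minimal numpy sketch (the function name gaussian_score is chosen here for illustration):

import numpy as np

def gaussian_score(x, mu, Sigma):
    # Gradients (13) and (14) of log pi_theta(x) for a Gaussian policy.
    Sigma_inv = np.linalg.inv(Sigma)
    diff = x - mu
    grad_mu = Sigma_inv @ diff                 # equation (13)
    grad_Sigma = -0.5 * Sigma_inv @ (np.eye(len(mu)) - np.outer(diff, diff) @ Sigma_inv)   # equation (14)
    return grad_mu, grad_Sigma

A common practical refinement (not needed for what follows) is to parametrize a Cholesky factor of {\Sigma} rather than {\Sigma} itself, so that the covariance stays positive definite along the gradient ascent.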

The univariate case. To gain intuition on how policy gradient works in practice, we investigate the one-dimensional case ({d=1}). Here, {\mu} and {\Sigma} boil down to the scalar mean {m} and variance {\sigma^2}, respectively, thus {\theta:=[m,\sigma^2]}. Expressions (13)-(14) simplify, and the parameter update in (7) becomes:

\displaystyle m_{t+1} = m_t + \frac{\alpha_t}{\sigma_t^2} \sum_{i=1}^n (x_i-m_t) \widetilde{r}(x_i) \ \ \ \ \ (15)

\displaystyle \sigma^2_{t+1} = \sigma^2_t - \frac{\alpha_t}{2\sigma_t^2} \sum_{i=1}^n \left[1-\left(\tfrac{x_i-m_t}{\sigma_t}\right)^2\right] \widetilde{r}(x_i) \ \ \ \ \ (16)

Let us have a look at (15) first. We observe that the mean moves by a quantity proportional to the imbalance between the (weighted) sampled rewards to the right of {m_t} and those to its left:

\displaystyle m_{t+1}-m_t \propto \sum_{i:x_i>m_t}|x_i-m_t|\widetilde{r}(x_i) - \sum_{i:x_i<m_t}|x_i-m_t|\widetilde{r}(x_i). \ \ \ \ \ (17)

Second, we notice that each reward is weighted by a coefficient {(x_i-m_t)} whose magnitude is the distance of {x_i} from the current mean {m_t}. This compensates for the fact that, as {x_i} moves away from {m_t}, the probability of it being selected under {\pi_\theta} vanishes.

Let us now turn to the variance update formula (16). The variance is updated by a quantity proportional to

\displaystyle \sigma^2_{t+1}-\sigma^2_t \propto \sum_{i:|x_i-m_t|>\sigma_t} a_i \widetilde{r}(x_i) - \sum_{i:|x_i-m_t|<\sigma_t} a_i\widetilde{r}(x_i) \ \ \ \ \ (18)

where {a_i:=| (\tfrac{x_i-m_t}{\sigma_t})^2 - 1 |}. Hence, the variance increases if the rewards collected "far" from the current mean {m_t} (i.e., more than {\sigma_t} away) exceed those collected "near" it (i.e., less than {\sigma_t} away). Each reward is weighted by a coefficient {a_i} that amplifies rewards collected far from the "boundary" {m_t\pm \sigma_t}.
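The following Python sketch puts the updates (15)-(16) to work on the same toy reward {r(x)=-(x-2)^2} used above (the step size, the number of samples per iteration, and the floor that keeps the variance positive are ad hoc choices made for illustration):

import numpy as np

rng = np.random.default_rng(0)
noisy_r = lambda x: -(x - 2.0) ** 2 + 0.1 * rng.standard_normal()   # maximum at x = 2

m, var = 0.0, 1.0                      # theta_0 = [m_0, sigma_0^2]
n = 10                                 # samples per iteration
for t in range(1, 1001):
    alpha = 0.05 / t                   # decaying learning rate
    sigma = np.sqrt(var)
    xs = rng.normal(m, sigma, size=n)  # draw x_1, ..., x_n from the current policy
    rs = np.array([noisy_r(x) for x in xs])
    m_new = m + alpha / var * np.sum((xs - m) * rs)                                   # update (15)
    var_new = var - alpha / (2 * var) * np.sum((1 - ((xs - m) / sigma) ** 2) * rs)    # update (16)
    m, var = m_new, max(var_new, 1e-2) # the floor keeps the variance positive
print(m, var)                          # m should drift toward the maximizer x = 2

Because the raw rewards enter with a large and fluctuating scale, this vanilla version can be jumpy; the baseline trick of the next section is the standard remedy.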

2. Variance reduction: The baseline trick

Although {\widetilde{r}(x) \nabla_\theta \log\pi_\theta(x)} is an unbiased estimator of {\nabla_\theta J(\theta)} (hence, under the conditions above, the parameter update in (7) converges almost surely to a (local) optimum {\theta^*}), its variance may still be large, which leads to an erratic evolution of the parameters {\theta_t} across iterations {t}.

The following crucial observation serves us well: by subtracting a constant baseline {b} from the collected rewards {\widetilde{r}}, the resulting estimator:

\displaystyle (\widetilde{r}(x)-b) \nabla_\theta \log \pi_\theta(x) \ \ \ \ \ (19)

is still unbiased and, if {b} is well chosen, has considerably lower variance than the original estimator.

Let us first show that (19) is unbiased:

\displaystyle \mathbb E_{\epsilon,x\sim \pi_\theta(x)} \, \big[ (\widetilde{r}(x)-b) \nabla_\theta \log\pi_\theta(x) \big] = \int (r(x)-b) \nabla_\theta \pi_\theta(x) dx \ \ \ \ \ (20)

\displaystyle \qquad \quad = \int r(x) \nabla_\theta \pi_\theta(x) dx - b \nabla_\theta \int \pi_\theta(x)dx \ \ \ \ \ (21)

\displaystyle \qquad \quad = \int r(x) \nabla_\theta \pi_\theta(x) dx - b \nabla_\theta 1 \ \ \ \ \ (22)

\displaystyle \qquad \quad = \int r(x) \nabla_\theta \pi_\theta(x) dx = \nabla_\theta J(\theta). \ \ \ \ \ (23)

which proves that, in expectation, the new estimator (19) still equals the gradient of the expected reward. As for its variance (written here for a single component of {\theta}, i.e., treating {\nabla_\theta \log\pi_\theta(x)} as a scalar), we have:

\displaystyle \mathrm{Var} \big[ (\widetilde{r}(x)-b) \nabla_\theta \log\pi_\theta(x) \big] = \mathbb{E} \left[ (\widetilde{r}(x)-b)^2 (\nabla_\theta \log\pi_\theta(x))^2 \right] -\left(\mathbb{E} \left[ (\widetilde{r}(x)-b) (\nabla_\theta \log\pi_\theta(x)) \right]\right)^2. \ \ \ \ \ (24)

Note that the second term in (24) does not depend on {b}: by the unbiasedness just shown, it equals {(\nabla_\theta J(\theta))^2}. By computing the derivative of (24) with respect to the baseline {b} and setting it to zero, we obtain the value {b^*} of the baseline that minimizes the variance of the new estimator (19):

\displaystyle b^*=\frac{\int r(x) \pi_\theta(x) \left( \nabla_\theta \log \pi_\theta(x) \right)^2 dx}{\int \pi_\theta(x) \left( \nabla_\theta \log \pi_\theta(x) \right)^2 dx} \ \ \ \ \ (25)

Once again, one can estimate {b^*} by sampling. In practice, even a rough estimate of {b^*} helps stabilize the gradient ascent and, as we saw above, does not jeopardize unbiasedness.
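As an illustration, here is a minimal numpy sketch of one update of (7) with the baseline subtracted (grad_log_policy is the same hypothetical placeholder as before, returning a vector of length {\dim \theta}; for vector-valued {\theta} we weight rewards by the squared norm of the score, one common way of applying (25), while a simpler and popular alternative is to take {b} equal to the average observed reward):

import numpy as np

def pg_update_with_baseline(theta, xs, rewards, grad_log_policy, alpha):
    # One gradient-ascent step using the estimator (19) with a sample estimate of b*.
    rewards = np.asarray(rewards, dtype=float)
    scores = np.array([grad_log_policy(theta, x) for x in xs])   # shape (n, dim theta)
    weights = np.sum(scores ** 2, axis=1)                        # ||score||^2 per sample
    b = np.sum(rewards * weights) / np.sum(weights)              # sample version of (25)
    return theta + alpha * np.sum((rewards - b)[:, None] * scores, axis=0)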

For the interested reader, [4] offers a great introduction to policy gradient algorithms.

References

[1] Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., & De Freitas, N. (2015). Taking the human out of the loop: A review of Bayesian optimization. Proceedings of the IEEE, 104(1), 148-175.

[2] Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. The Computer Journal, 7(4), 308-313.

[3] Bottou, L. (1998). Online learning and stochastic approximations. In Online Learning in Neural Networks. Cambridge University Press.

[4] Sutton, R. S., & Barto, A. G. (2018). Reinforcement learning: An introduction. MIT Press.

