Backpropagation

Chain rule to the rescue!

To train a deep neural network (DNN), one needs to repeatedly compute the gradient of the empirical loss with respect to all the DNN parameters {\theta}. Since their number {N^{\theta}} can be huge, it is crucial that the cost of computing the gradient grows “slowly” with {N^{\theta}}. Back-propagation achieves this by shrewdly exploiting the chain rule of differentiation, as we discuss in this post.

Supervised learning scenario. In supervised learning, the goal is to predict variable {Y} from variable {X} by training a model on a dataset {\mathcal D} containing pairs of realizations {(x,y)} of the aforementioned variables. Classically, the model is defined by a family of functions {f(.,\theta)} parametrized by {\theta}. Then, the empirical loss is minimized:

\displaystyle \min_{\theta} \sum_{(x,y)\in \mathcal D} \mathcal L(x,y,\theta), \qquad \mathcal L(x,y,\theta):=\frac{1}{2} \| f(x,\theta)-y\|^2. \ \ \ \ \ (1)

Feed-forward DNN. We here consider that the model {f(.,\theta)} takes the form of a feed-forward DNN with {L} fully connected layers, where each neuron is connected to all the neurons of the next layer.

The input is the (column) vector {x:=z^0=[z_i^0]_{i=1,\dots,M^0}}. It is multiplied by the {M^1\times M^0} weight matrix {W^1} and added to the bias vector {b^1} to obtain the vector {a^1}. Then, the non-linear activation function {h^1} (e.g., a sigmoid), applied individually to each element of {a^1}, produces the intermediate variable {z^1}. The same procedure is iterated over all {L} layers as follows:

\displaystyle a^{\ell} = W^{\ell} z^{\ell-1} + b^{\ell}, \qquad \forall\, \ell=1,\dots,L \ \ \ \ \ (2)

\displaystyle z^{\ell} = h^{\ell} (a^{\ell}), \qquad \forall\, \ell=1,\dots,L \ \ \ \ \ (3)

\displaystyle z^{0} = x \ \ \ \ \ (4)

where {a^{\ell}\in \mathbb{R}^{M^\ell\times 1}} for all {\ell}. We call {f(.,\theta)} the function resulting from applying equations (2),(3) iteratively over all layers, where {\theta=\{W^\ell,b^\ell\}_{1\le\ell\le L}} is the collection of all weights and biases. Hence, the DNN output produced by injecting {x} as input is {z^L=f(x,\theta)}.
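To make the forward computation concrete, here is a minimal NumPy sketch of (2)-(4) (the layer sizes, the sigmoid choice for {h^\ell} and all variable names are illustrative assumptions, not prescriptions):

    import numpy as np

    def sigmoid(a):
        # elementwise activation h^l (sigmoid assumed for illustration)
        return 1.0 / (1.0 + np.exp(-a))

    def forward(x, weights, biases):
        # evaluate z^L = f(x, theta) by iterating (2)-(3); weights[l-1] = W^l, biases[l-1] = b^l
        z = x                              # z^0 = x, equation (4)
        cache = [(None, z)]                # store (a^l, z^l) of each layer for later reuse
        for W, b in zip(weights, biases):
            a = W @ z + b                  # equation (2)
            z = sigmoid(a)                 # equation (3)
            cache.append((a, z))
        return z, cache

    # toy network: L = 2 layers with sizes M^0 = 3, M^1 = 4, M^2 = 2
    rng = np.random.default_rng(0)
    sizes = [3, 4, 2]
    weights = [rng.standard_normal((m1, m0)) for m0, m1 in zip(sizes[:-1], sizes[1:])]
    biases = [rng.standard_normal(m1) for m1 in sizes[1:]]
    x = rng.standard_normal(sizes[0])
    zL, cache = forward(x, weights, biases)    # zL = f(x, theta)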

Gradient of the loss. To minimize the empirical loss in (1) via (any variant of) gradient descent, we must compute, for each {(x,y)} pair, the gradient {\nabla_{\theta}\mathcal L(x,y,\theta)}.

Finite difference? A (naive) option to compute the gradient {\nabla_{\theta}\mathcal L(x,y,\theta)} is via finite difference:

\displaystyle \nabla_{\theta} \mathcal L (x,y,\theta) \approx \left[\frac{\| f(x,\theta + \epsilon \delta^i)-y\|^2 - \| f(x,\theta)-y\|^2}{2\epsilon} \right]_{1\le i \le N^\theta} \ \ \ \ \ (5)

where {N^{\theta}} is the number of DNN parameters and {\delta^i} is zero everywhere except for its {i}-th element being 1.
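For illustration, a naive implementation of (5) could look as follows; it reuses the forward routine sketched above, defines the per-sample loss of (1), and perturbs one parameter at a time (all names are again assumptions made for the example):

    def loss(x, y, weights, biases):
        # per-sample loss (1/2) * || f(x, theta) - y ||^2, as in (1)
        zL, _ = forward(x, weights, biases)
        return 0.5 * np.sum((zL - y) ** 2)

    def fd_gradient(x, y, weights, biases, eps=1e-6):
        # finite-difference approximation (5): one extra forward pass per parameter
        base = loss(x, y, weights, biases)
        grads_W = [np.zeros_like(W) for W in weights]
        grads_b = [np.zeros_like(b) for b in biases]
        for param, grad in list(zip(weights, grads_W)) + list(zip(biases, grads_b)):
            for idx in np.ndindex(param.shape):
                param[idx] += eps          # theta + eps * delta^i
                grad[idx] = (loss(x, y, weights, biases) - base) / eps
                param[idx] -= eps          # restore theta
        return grads_W, grads_b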

Let us investigate the computational complexity of (5). Evaluating {f(x,\theta)} requires iterating (2) and (3) across all layers, which call for about {N^{\theta}/L} and {\sqrt{N^{\theta}/L}} basic operations per layer, respectively (i.e., products, sums and evaluations of {h}). To compute the finite difference, such an evaluation is then repeated {2 N^{\theta}} times. Thus, the total complexity is in the order of

\displaystyle (N^{\theta})^2 + N^{\theta}\sqrt{L N^{\theta}}.

Since the number of parameters can be huge, a complexity quadratic in {N^{\theta}} can be prohibitive. The natural question is: Can we do better? The answer is yes, as we show next.

Preliminaries: Jacobian matrix. Define the function {\phi:\mathbb{R}^n\rightarrow \mathbb{R}^m}. The Jacobian of {\phi} evaluated at {x} is called {J_\phi(x)} and is defined as:

\displaystyle J_\phi(x) = \begin{bmatrix} \frac{\partial \phi_1}{\partial x_1}\big|_x & \dots & \frac{\partial \phi_1}{\partial x_n}\big|_x \\ \vdots & \ddots & \vdots \\ \frac{\partial \phi_m}{\partial x_1}\big|_x & \dots & \frac{\partial \phi_m}{\partial x_n}\big|_x \end{bmatrix} \ \ \ \ \ (6)

where {\frac{\partial \phi_j}{\partial x_i}\big|_x} denotes the partial derivative of {\phi_j} with respect to its {i}-th input and evaluated at the point {x}.

Observe that if {\phi} is affine, i.e., {\phi(x)=Wx+b}, where matrix {W\in \mathbb R^{m\times n}} and column vector {b\in \mathbb{R}^{m\times 1}}, then {J_\phi(x)=W}.

If {\phi} is separable across the elements of {x}, i.e., {\phi_i(x)=\phi_i(x_i)} for all {i}, then {J_\phi(x)} is the diagonal matrix {\mathrm{diag}(\phi'_i|_{x_i})_i}.
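As a quick illustration of these two facts (the specific matrix and functions below are chosen only for the example):

\displaystyle \phi(x)=\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}x+\begin{bmatrix} 5 \\ 6 \end{bmatrix} \ \Rightarrow\ J_\phi(x)=\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \qquad \phi(x)=\begin{bmatrix} \sin x_1 \\ x_2^2 \end{bmatrix} \ \Rightarrow\ J_\phi(x)=\begin{bmatrix} \cos x_1 & 0 \\ 0 & 2x_2 \end{bmatrix}.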

Preliminaries: Chain rule for derivative. Define two functions {g:\mathbb{R}^n\rightarrow \mathbb{R}^m} and {z:\mathbb{R}^m\rightarrow \mathbb{R}^l} and their composition {\phi=z \circ g}, such that {\phi(x)=z(g(x))} for all {x\in \mathbb{R}^n}. If {g} and {z} are both unidimensional ({n=m=l=1}), then

\displaystyle \phi'|_x = z'|_{g(x)} g'|_x \ \ \ \ \ (7)

In the general case, the derivative of the {j}-th component of the composite function {\phi=z\circ g} with respect to the {i}-th input variable {x_i} equals the sum over all intermediate functions {g_k} of the individual contributions:

\displaystyle \frac{\partial \phi_j}{\partial x_i}\Big|_x = \sum_{k=1}^{m} \frac{\partial z_j}{\partial g_k}\Big |_{g(x)} \frac{\partial g_k}{\partial x_i}\Big|_x , \quad \forall\, i,j. \ \ \ \ \ (8)

More compactly, (8) can be rewritten as the product of the Jacobians of the individual functions:

\displaystyle J_{\phi}(x):=J_{z \circ g}(x) = J_{z}(g(x)) \cdot J_{g}(x). \ \ \ \ \ (9)
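The identity (9) is easy to verify numerically. The sketch below (sizes and names are illustrative assumptions) composes an affine map {g} with an elementwise sigmoid {z}, and compares a finite-difference Jacobian of the composition with the product of the two analytic Jacobians given by the facts above:

    import numpy as np

    rng = np.random.default_rng(1)
    n, m = 3, 4
    W = rng.standard_normal((m, n))
    b = rng.standard_normal(m)

    g = lambda x: W @ x + b                    # affine map: J_g(x) = W
    z = lambda u: 1.0 / (1.0 + np.exp(-u))     # elementwise sigmoid
    phi = lambda x: z(g(x))                    # composition phi = z o g

    x0 = rng.standard_normal(n)
    s = z(g(x0))
    J_g = W                                    # Jacobian of the affine map
    J_z = np.diag(s * (1 - s))                 # diagonal Jacobian of the sigmoid at g(x0)

    # finite-difference Jacobian of phi at x0, built column by column
    eps = 1e-6
    J_fd = np.column_stack([(phi(x0 + eps * e) - phi(x0)) / eps for e in np.eye(n)])

    print(np.allclose(J_fd, J_z @ J_g, atol=1e-4))   # True: (9) holds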

Back-propagation (BP) rule. Our goal is to compute the derivative of the loss function with respect to any weight {W_{i,j}^{\ell}} or any bias coefficient {b_i^{\ell}} as efficiently as possible. We start by noticing that if {W_{i,j}^{\ell}} changes then only {a^{\ell}_i} changes at layer {\ell}. By chain reaction, this modifies {z^{\ell},a^{\ell+1},z^{\ell+1},\dots,z^L} and, finally, the loss {\mathcal L}.

Hence, by chain rule:

\displaystyle \frac{\partial \mathcal L}{\partial W_{i,j}^{\ell}}\Big|_{x,y,\theta} = \frac{\partial \mathcal L}{\partial a_{i}^{\ell}} \, \frac{\partial a_{i}^{\ell}}{\partial W_{i,j}^{\ell}}. \ \ \ \ \ (10)

It stems from (2) that the second term is simply {\frac{\partial a_{i}^{\ell}}{\partial W_{i,j}^{\ell}}=z_j^{\ell-1}}. Similarly, the derivative with respect to the bias term reads:

\displaystyle \frac{\partial \mathcal L}{\partial b_{i}^{\ell}}\Big|_{x,y,\theta} = \frac{\partial \mathcal L}{\partial a_{i}^{\ell}} \, \frac{\partial a_{i}^{\ell}}{\partial b_{i}^{\ell}}. \ \ \ \ \ (11)

where the second term {\frac{\partial a_{i}^{\ell}}{\partial b_{i}^{\ell}}=1}.
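Both facts follow by expanding (2) componentwise: since {a_i^{\ell} = \sum_{k} W_{i,k}^{\ell} z_k^{\ell-1} + b_i^{\ell}}, only the term {k=j} depends on {W_{i,j}^{\ell}}, so that

\displaystyle \frac{\partial a_{i}^{\ell}}{\partial W_{i,j}^{\ell}} = z_j^{\ell-1}, \qquad \frac{\partial a_{i}^{\ell}}{\partial b_{i}^{\ell}} = 1.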

Thus, to compute (10) and (11) we are left with evaluating {\frac{\partial \mathcal L}{\partial a_{i}^{\ell}}}. By the chain rule (9), the row vector of partial derivatives {\delta^\ell:=\left[\frac{\partial \mathcal L}{\partial a_{i}^{\ell}}\right]_i}, commonly referred to as the error term, can be computed as:

\displaystyle \delta^\ell = J_{\mathcal L}|_{z^L} \cdot J_{h^L}|_{a^L} \cdot J_{W^L z^{L-1}+b^L}|_{z^{L-1}} \cdot J_{h^{L-1}}|_{a^{L-1}} \cdots J_{W^{\ell+1} z^{\ell}+b^{\ell+1}}|_{z^{\ell}} \cdot J_{h^\ell}|_{a^\ell} \ \ \ \ \ (12)

where each term is calculated as follows:

\displaystyle J_{\mathcal L}|_{z^L}=[(z^L_i-y_i)]_i.

\displaystyle J_{h^l}|_{a^{l}} = \mathrm{diag}(h'^l(a^l)), \qquad \forall\, l=\ell,\dots,L

\displaystyle J_{W^{l}z^{l-1}+b^{l}}|_{z^{l-1}}=W^l, \qquad \forall\, l=\ell+1,\dots,L

The key observation is that the error term {\delta^\ell} does not need to be recomputed from scratch at every layer. Rather, one can use an iterative computation with small incremental complexity, in a backward fashion from higher to lower layers.

Back-propagation procedure.

  • Forward: Compute the value of all the intermediate variables {\{a^\ell,z^\ell\}_{1\le \ell\le L}} given the input {x:=z^0} by evaluating (2),(3) at all layers.
  • Backward: For {\ell=L,L-1,\dots,1}:
    • If {\ell=L}, then compute {\delta^L = [(z^L_i-y_i)]_i \cdot \mathrm{diag}(h'^L(a^L))}.
      Else, compute incrementally {\delta^{\ell} = \delta^{\ell+1} \cdot W^{\ell+1} \cdot \mathrm{diag}(h'^\ell(a^\ell))}
    • Compute {\frac{\partial \mathcal L}{\partial W_{i,j}^{\ell}}=\delta_i^{\ell} z_j^{\ell-1}}
    • Set {\frac{\partial \mathcal L}{\partial b_{i}^{\ell}}=\delta_i^{\ell}}
  • Return {\left\{\frac{\partial \mathcal L}{\partial W_{i,j}^{\ell}},\frac{\partial \mathcal L}{\partial b_{i}^{\ell}}\right\}_{i,j,\ell}}.
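The procedure above translates almost line by line into code. Below is a minimal NumPy sketch that reuses the forward and fd_gradient routines sketched earlier (the sigmoid derivative and all names are again illustrative assumptions), followed by a finite-difference check on the toy network defined above:

    def sigmoid_prime(a):
        # h'(a) for the sigmoid activation assumed in the earlier sketches
        s = 1.0 / (1.0 + np.exp(-a))
        return s * (1.0 - s)

    def backprop(x, y, weights, biases):
        # returns dL/dW^l and dL/db^l for one (x, y) pair via the BP rule
        zL, cache = forward(x, weights, biases)     # forward pass: all (a^l, z^l)
        L = len(weights)
        grads_W, grads_b = [None] * L, [None] * L
        delta = None
        for l in range(L, 0, -1):                   # backward pass, l = L, ..., 1
            a_l, _ = cache[l]
            if l == L:
                delta = (zL - y) * sigmoid_prime(a_l)               # delta^L
            else:
                delta = (delta @ weights[l]) * sigmoid_prime(a_l)   # delta^l = delta^{l+1} W^{l+1} diag(h'^l)
            z_prev = cache[l - 1][1]                  # z^{l-1}
            grads_W[l - 1] = np.outer(delta, z_prev)  # dL/dW_{i,j}^l = delta_i^l z_j^{l-1}
            grads_b[l - 1] = delta                    # dL/db_i^l = delta_i^l
        return grads_W, grads_b

    # sanity check: BP matches the finite-difference gradient on the toy network
    y = rng.standard_normal(sizes[-1])
    gW, gb = backprop(x, y, weights, biases)
    fW, fb = fd_gradient(x, y, weights, biases)
    print(all(np.allclose(a, b, atol=1e-4) for a, b in zip(gW, fW)))   # True
    print(all(np.allclose(a, b, atol=1e-4) for a, b in zip(gb, fb)))   # True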

Complexity. As shown above, evaluating the forward pass requires on the order of {N^{\theta} + \sqrt{L N^{\theta}}} operations. Remarkably, the backward pass requires the same (order of) number of operations. This leads to an overall complexity that is linear in the number of parameters {N^{\theta}}, in contrast to the finite-difference approach, whose complexity is quadratic in {N^{\theta}}.


