2023-06-23
Machine learning as data-driven programming: collect a training set with known labels and feed it into a machine learning algorithm, which will (if done well) automatically produce a program that solves the task. Every machine learning algorithm consists of three different elements: the hypothesis class, the loss function, and the optimization method.
In the k-class classification setting:
Training data: $x^{(i)} \in \mathbb{R}^n,\ y^{(i)} \in \{1,\dots,k\}$ for $i=1,\dots,m$.
A hypothesis function maps inputs $x \in \mathbb{R}^n$ to $k$-dimensional vectors
$$ h:\mathbb{R}^n \rightarrow \mathbb{R}^k $$
where $h_i(x)$ indicates some measure of “belief” in how likely the label is to be class $i$ (i.e., the “most likely” prediction is the coordinate $i$ with the largest $h_i(x)$).
A linear hypothesis function uses a linear operator (i.e. matrix multiplication) for this transformation
$$h_{\theta}(x)=\theta^Tx$$
for parameters $\theta \in \mathbb{R}^{n\times k}$
It is often more convenient (for efficient coding) to write the data and operations in matrix batch form
$$ X \in \mathbb{R}^{m \times n}= \begin{bmatrix}-{x^{(1)}}^T- \\ \vdots \\ -{x^{(m)}}^T-\end{bmatrix},\quad y \in \{1,\dots,k\}^m=\begin{bmatrix}y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} $$
Then the linear hypothesis applied to this batch can be written as
$$ h_{\theta}(X)= \begin{bmatrix}-h_{\theta}(x^{(1)})^T- \\ \vdots \\ -h_{\theta}(x^{(m)})^T-\end{bmatrix}=\begin{bmatrix}-x^{{(1)}^T}\theta- \\ \vdots \\ -x^{{(m)}^T}\theta- \end{bmatrix}= X\theta $$
The simplest loss function is the classification error:
$$ l_{error}(h(x),y)= \begin{cases} 0 & \text{if } \arg\max_{i}h_{i}(x)=y\\ 1 & \text{otherwise} \end{cases} $$
We typically use this loss function to assess the quality of classifiers. However, it is a bad loss function to use for optimization (i.e., selecting the best parameters) because it is not differentiable.
Instead, we use the softmax (cross-entropy) loss: convert the hypothesis outputs into probabilities via the softmax
$$ z_i=p(\text{label}=i)=\frac{\exp(h_i(x))}{\sum_{j=1}^k \exp(h_j(x))} \Longleftrightarrow z \equiv \mathrm{normalize}(\exp(h(x))) $$
and take the negative log probability of the true class
$$ l_{ce}(h(x),y)=-\log p(\text{label}=y)=-h_y(x)+\log\sum_{j=1}^k\exp(h_j(x)) $$
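As a concrete sketch (the function name `softmax_ce_loss` and the 0-based labels are my own conventions, not from any particular library), the cross-entropy loss can be computed in NumPy directly from the log-sum-exp form above:

```python
import numpy as np

def softmax_ce_loss(h, y):
    """Average cross-entropy loss.

    h: (m, k) array of hypothesis outputs, one row per example
    y: (m,)   array of integer labels in {0, ..., k-1}
    """
    # Subtracting the row-wise max does not change -h_y + log sum_j exp(h_j),
    # but keeps exp() from overflowing.
    h = h - h.max(axis=1, keepdims=True)
    log_sum_exp = np.log(np.exp(h).sum(axis=1))
    return np.mean(log_sum_exp - h[np.arange(h.shape[0]), y])
```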
Solving the associated optimization problem
$$ minimize_{\theta}\frac{1}{m}\sum_{i=1}^ml(h_{\theta}(x^{(i)}),y^{(i)}) $$
Softmax regression (linear hypothesis class and softmax loss):
$$ minimize_{\theta}\frac{1}{m}\sum_{i=1}^ml_{ce}(\theta^Tx^{(i)},y^{(i)}) $$
For a matrix-input, scalar output function $f: \mathbb{R}^{n \times k} \rightarrow \mathbb{R}$, the gradient is defined as the matrix of partial derivatives
$$ \nabla_{\theta}f(\theta) \in \mathbb{R}^{n \times k}=\begin{bmatrix}\frac{\partial f(\theta)}{\partial \theta_{11}} & \dots & \frac{\partial f(\theta)}{\partial \theta_{1k}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(\theta)}{\partial \theta_{n1}} & \dots & \frac{\partial f(\theta)}{\partial \theta_{nk}} \end{bmatrix} $$
Gradient points in the direction that most increases $f$ (locally). To minimize a function, the gradient descent algorithm proceeds by iteratively taking steps in the direction of the negative gradient
$$ \theta := \theta -\alpha\nabla_{\theta}f(\theta) $$
where $\alpha > 0$ is a step size or learning rate.
If our objective (as is the case in machine learning) is the sum of individual losses, we don’t want to compute the gradient over all examples just to make a single update to the parameters. Instead, we take many gradient steps, each based on a minibatch (a small partition of the data), so that a single “pass” over the data yields many parameter updates. This is stochastic gradient descent (SGD).
Repeat: sample a minibatch $X \in \mathbb{R}^{B \times n},\ y \in \{1,\dots,k\}^B$ and update
$$ \theta := \theta - \frac{\alpha}{B}\sum_{i=1}^{B}\nabla_{\theta}l(h_{\theta}(x^{(i)}),y^{(i)}) $$
To apply SGD we need the gradient of the softmax objective, $\nabla_{\theta}l_{ce}(\theta^Tx,y)$. Start with the gradient with respect to the hypothesis output: for a vector $h \in \mathbb{R}^k$,
$$ \begin{aligned} \frac{\partial l_{ce}(h,y)}{\partial h_i}&=\frac{\partial}{\partial h_i}\left(-h_y+\log \sum_{j=1}^{k}\exp h_j\right) \\ &=-\mathbf{1}\{i=y\}+\frac{\exp h_i}{\sum_{j=1}^k \exp h_j} \end{aligned} $$
In vector form: $\nabla_hl_{ce}(h,y)=z-e_y$, where $z=\mathrm{normalize}(\exp(h))$ is the softmax probability vector ($\exp(h)$ scaled so that its entries sum to one) and $e_y$ is the $y$-th standard basis (one-hot) vector.
The gradient with respect to $\theta$ follows from the chain rule of multivariate calculus. One trick is to pretend everything is a scalar, apply the ordinary chain rule, and then rearrange and transpose terms so the dimensions of the result come out right.
$$ \frac{\partial}{\partial \theta}l_{ce}(\theta^Tx,y)=\frac{\partial l_{ce}(\theta^Tx,y)}{\partial(\theta^Tx)}\cdot\frac{\partial(\theta^Tx)}{\partial\theta}=(z-e_y)(x) $$
$$ \nabla_{\theta}l_{ce}(\theta^Tx,y)\in \mathbb{R}^{n \times k}=x(z-e_y)^T $$
The same process works if we use the “matrix batch” form of the loss:
$$ \begin{aligned} \nabla_{\theta}\ell_{\rm ce}(X\theta, y) \in \mathbb{R}^{n \times k} &= X^{\mathsf{T}}(Z - I_y),\hspace{3ex} \\ Z &= {\rm normalize}(\exp(X\theta)) \end{aligned} $$
Putting it together, softmax regression trains by repeating until the parameters/loss converge: iterate over minibatches $X \in \mathbb{R}^{B \times n},\ y \in \{1,\dots,k\}^B$ and update $\theta := \theta - \frac{\alpha}{B}X^T(Z-I_y)$.
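A minimal sketch of one such epoch in NumPy, assuming 0-based integer labels and sequential (non-shuffled) minibatches; the function name is illustrative:

```python
import numpy as np

def softmax_regression_epoch(X, y, theta, lr=0.1, batch=100):
    """One epoch of minibatch SGD for softmax regression.

    X:     (m, n) data matrix
    y:     (m,)   integer labels in {0, ..., k-1}
    theta: (n, k) parameters, updated in place
    """
    for start in range(0, X.shape[0], batch):
        Xb, yb = X[start:start + batch], y[start:start + batch]
        # Z = normalize(exp(X theta)), row-wise softmax probabilities
        logits = Xb @ theta
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        Z = np.exp(logits)
        Z /= Z.sum(axis=1, keepdims=True)
        # I_y: one-hot encoding of the labels
        Iy = np.zeros_like(Z)
        Iy[np.arange(Xb.shape[0]), yb] = 1
        # Batch gradient X^T (Z - I_y) / B, then the gradient step
        theta -= lr * Xb.T @ (Z - Iy) / Xb.shape[0]
    return theta
```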
A linear hypothesis function forms $k$ linear functions of the input and predicts the class with the largest value; this is equivalent to partitioning the input space into $k$ linear regions, one per class.
For data that is not linearly separable, we want some way to separate the classes via a nonlinear set of class boundaries. One idea is to apply a linear classifier to some (potentially higher-dimensional) features of the data
$$ \begin{aligned} h_{\theta}(x)&=\theta^T\phi(x) \\ \theta \in \mathbb{R}^{d\times k}&,\phi:\mathbb{R}^n \rightarrow \mathbb{R}^d \end{aligned} $$
How do we choose the feature function $\phi$? A linear feature function $\phi(x)=W^Tx$ doesn't work, because it is just equivalent to another linear classifier:
$$ h_{\theta}(x)=\theta^T\phi(x)=\theta^TW^Tx=\tilde{\theta}^Tx $$
What does work is essentially any nonlinear function of linear features, $\phi(x)=\sigma(W^Tx)$, where $W \in \mathbb{R}^{n \times d}$ and $\sigma: \mathbb{R}^d \rightarrow \mathbb{R}^d$ is any nonlinear function.
A neural network refers to a particular type of hypothesis class, consisting of multiple parameterized, differentiable functions (a.k.a. “layers”) composed together in any manner to form the output.
Two-layer neural network:
$$ \begin{aligned} h_{\theta}(x) &= W_2^T\sigma(W_1^Tx)\\ \theta &= \{W_1\in \mathbb{R}^{n \times d},W_2 \in \mathbb{R}^{d \times k}\} \end{aligned} $$
where $\sigma: \mathbb{R} \rightarrow \mathbb{R}$ is a nonlinear function applied elementwise to the vector (e.g. sigmoid, ReLU). Written in batch matrix form:
$$ h_{\theta}(X)=\sigma(XW_1)W_2 $$
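A small sketch of this two-layer network's forward pass in batch form, assuming ReLU for $\sigma$; the function names are illustrative:

```python
import numpy as np

def relu(x):
    # Elementwise nonlinearity sigma
    return np.maximum(x, 0)

def two_layer_forward(X, W1, W2):
    """h_theta(X) = sigma(X W1) W2 in batch form.

    X: (m, n) inputs, W1: (n, d), W2: (d, k)
    """
    return relu(X @ W1) @ W2

# Example: n=5 inputs, d=16 hidden units, k=3 classes
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(5, 16)), rng.normal(size=(16, 3))
logits = two_layer_forward(rng.normal(size=(10, 5)), W1, W2)  # shape (10, 3)
```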
Theorem (1D case):
Given any smooth function $f:\mathbb{R}\rightarrow \mathbb{R}$, closed region $D \subset \mathbb{R}$, and $\epsilon > 0$, we can construct a one-hidden-layer neural network $\hat{f}$ such that
$$ \max_{x\in D}\mid f(x)-\hat{f}(x)\mid \leq \epsilon $$
Proof
Select some dense sampling of points $(x^{(i)},f(x^{(i)}))$ over $D$. Create a neural network that passes exactly through these points. Because the neural network function is piecewise linear, and the function f is smooth, by choosing the $x^{(i)}$ close enough together, we can approximate the function arbitrarily closely.
A more generic form: an $L$-layer neural network, a.k.a. “multi-layer perceptron” (MLP), feedforward network, or fully-connected network, in batch form:
$$ \begin{aligned} Z_{i+1}&=\sigma_{i}(Z_iW_i),i=1,\dots,L\\ Z_1&=X,\\ h_{\theta}(X)&=Z_{L+1}\\ [Z_i\in \mathbb{R}^{m \times n_i}&, W_i\in \mathbb{R}^{n_i \times n_{i+1}}] \end{aligned} $$
with nonlinearities $\sigma_i:\mathbb{R} \rightarrow \mathbb{R}$ applied elementwise, and parameters
$$ \theta = \{W_1,\dots,W_L\} $$
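A sketch of the corresponding batch forward pass, assuming ReLU nonlinearities and taking $\sigma_L$ to be the identity (so the last layer is linear, matching the two-layer example above); the function name is illustrative:

```python
import numpy as np

def mlp_forward(X, weights, sigma=lambda x: np.maximum(x, 0)):
    """Compute Z_1 = X, Z_{i+1} = sigma_i(Z_i W_i), return h_theta(X) = Z_{L+1}.

    weights: list [W_1, ..., W_L] with W_i of shape (n_i, n_{i+1})
    """
    Z = X
    for i, W in enumerate(weights):
        Z = Z @ W
        if i < len(weights) - 1:  # sigma_L = identity: no nonlinearity at the end
            Z = sigma(Z)
    return Z
```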
Solve the optimization problem using SGD, just with $h_{\theta}(x)$ now being a neural network.
The gradient(s) of a two-layer network
$$ \nabla_{{W_1,W_2}}l_{ce}(\sigma(XW_1)W_2,y) $$
The gradient w.r.t. $W_2$ looks identical to the softmax regression case:
$$ \begin{aligned} \frac{\partial \ell_{\rm ce}(\sigma(XW_1)W_2, y)}{\partial W_2} &= \frac{\partial \ell_{\rm ce}(\sigma(XW_1)W_2, y)}{\partial \sigma(XW_1)W_2} \cdot \frac{\partial \sigma(XW_1)W_2}{\partial W_2} \\ &= (S-I_y)\cdot\sigma(XW_1) \\ S &= {\rm normalize}(\exp(\sigma(XW_1)W_2)) \end{aligned} $$
so (matching sizes $d\times k$) the gradient is
$$ \nabla_{W_2}\ell_{\rm ce}(\sigma(XW_1)W_2, y) = \sigma(XW_1)^\mathsf{T}(S-I_y) $$
The gradient w.r.t. $W_1$
$$ \begin{aligned} \frac{\partial \ell_{\rm ce}(\sigma(XW_1)W_2, y)}{\partial W_1} &= \frac{\partial \ell_{\rm ce}(\sigma(XW_1)W_2, y)}{\partial \sigma(XW_1)W_2} \cdot \frac{\partial \sigma(XW_1)W_2}{\partial \sigma(XW_1)} \cdot \frac{\partial \sigma(XW_1)}{\partial XW_1}\cdot \frac{\partial XW_1}{\partial W_1} \\ &= (S-I_y)\cdot W_2\cdot(\sigma'(XW_1)) \cdot X \end{aligned} $$
and so the gradient is
$$ \nabla_{W_1}\ell_{\rm ce}(\sigma(XW_1)W_2, y) = X^{\mathsf{T}}\left((S-I_y)W_2^{\mathsf{T}}\circ\sigma'(XW_1)\right) $$
where $\circ$ denotes elementwise multiplication.
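A sketch of these two gradients in NumPy, assuming $\sigma$ is ReLU (so $\sigma'(a)=\mathbf{1}\{a>0\}$), 0-based labels, and gradients averaged over the batch; the function name is illustrative:

```python
import numpy as np

def two_layer_gradients(X, y, W1, W2):
    """Gradients of the softmax loss for h(X) = relu(X W1) W2.

    Implements grad_W2 = sigma(XW1)^T (S - I_y) and
    grad_W1 = X^T ((S - I_y) W2^T o sigma'(XW1)), averaged over the batch.
    """
    m = X.shape[0]
    A = X @ W1                       # pre-activations of the hidden layer
    H = np.maximum(A, 0)             # sigma(X W1) with sigma = ReLU
    logits = H @ W2
    logits -= logits.max(axis=1, keepdims=True)
    S = np.exp(logits)
    S /= S.sum(axis=1, keepdims=True)        # S = normalize(exp(sigma(XW1) W2))
    S[np.arange(m), y] -= 1                  # S - I_y
    grad_W2 = H.T @ S / m
    grad_W1 = X.T @ ((S @ W2.T) * (A > 0)) / m   # sigma'(A) = 1{A > 0}
    return grad_W1, grad_W2
```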
Backpropagation in general
Consider our fully-connected network:
$Z_{i+1}=\sigma_i(Z_iW_i),i=1,\dots,L$
Then the gradient of the loss w.r.t. the parameters $W_i$ is
$$ \frac{\partial \ell(Z_{L+1}, y)}{\partial W_i} = \frac{\partial \ell}{\partial Z_{L+1}} \cdot \frac{\partial Z_{L+1}}{\partial Z_L} \cdot \frac{\partial Z_{L}}{\partial Z_{L-1}} \cdot \frac{\partial Z_{L-1}}{\partial Z_{L-2}} \cdot \cdots \cdot \frac{\partial Z_{i+2}}{\partial Z_{i+1}}\cdot \frac{\partial Z_{i+1}}{\partial W_i} $$
Let $G_{i+1}=\frac{\partial l(Z_{L+1}, y)}{\partial Z_{i+1}}$
Then we have a simple backward iteration to compute the $G_i$'s
$$ G_{i} = G_{i+1} \cdot \frac{\partial Z_{i+1}}{\partial Z_i} = G_{i+1} \cdot \frac{\partial \sigma_i(Z_iW_i)}{\partial Z_iW_i} \cdot \frac{\partial Z_iW_i}{\partial Z_i} = G_{i+1}\cdot \sigma'(Z_iW_i)\cdot W_i $$
so matching the size, $G_{i}=\frac{\partial l(Z_{L+1}, y)}{\partial Z_{i}}=\nabla_{Z_i}l(Z_{L+1},y) \in \mathbb{R}^{m\times n_i}$
$$ G_{i} = G_{i+1} \cdot \sigma'(Z_iW_i)\cdot W_i=\left(G_{i+1}\circ \sigma'(Z_iW_i)\right) W_i^T $$
Similar formula for the gradient w.r.t. $W_i$, $\nabla_{W_i}l(Z_{L+1},y) \in \mathbb{R}^{n_i\times n_{i+1}}$
$$ \begin{aligned} \frac{\partial \ell(Z_{L+1}, y)}{\partial W_i} &= G_{i+1} \cdot \frac{\partial \sigma_i(Z_iW_i)}{\partial Z_iW_i} \cdot \frac{\partial Z_iW_i}{\partial W_i} = G_{i+1}\cdot \sigma'(Z_iW_i)\cdot Z_i \\ \Longrightarrow \nabla_{W_i}\ell(Z_{L+1}, y) &= Z_i^{\mathsf{T}}\left(G_{i+1} \circ \sigma'(Z_iW_i)\right) \end{aligned} $$
Putting it all together, we can efficiently compute all the gradients we need for a neural network: run a forward pass that computes (and caches) $Z_1,\dots,Z_{L+1}$, then a backward pass that computes $G_{L+1},G_L,\dots,G_2$ (for the cross-entropy loss, $G_{L+1}=S-I_y$ with $S=\mathrm{normalize}(\exp(Z_{L+1}))$), obtaining along the way all the needed parameter gradients
$$ \nabla_{W_i}\ell(Z_{L+1}, y) = Z_i^{\mathsf{T}}\left(G_{i+1} \circ \sigma'(Z_iW_i)\right) $$
“Backpropagation” is just chain rule + intelligent caching of intermediate results
Each layer needs to be able to multiply the “incoming backward” gradient $G_{i+1}$ by its derivatives, $\frac{\partial Z_{i+1}}{\partial W_i}$, an operation called the “vector Jacobian product”. This process can be generalized to arbitrary computation graphs.
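A sketch of the full forward/backward procedure for an $L$-layer ReLU MLP with softmax loss, under the same assumptions as above (last layer linear, 0-based labels, gradients averaged over the batch); the function name is illustrative:

```python
import numpy as np

def mlp_gradients(X, y, weights):
    """Backpropagation for h(X) = relu(... relu(X W_1) ...) W_L with softmax loss.

    weights: [W_1, ..., W_L]; returns one gradient array per W_i.
    """
    m = X.shape[0]
    # Forward pass: cache Z_1, ..., Z_L (the input to each layer)
    Zs = [X]
    for W in weights[:-1]:
        Zs.append(np.maximum(Zs[-1] @ W, 0))
    logits = Zs[-1] @ weights[-1]            # Z_{L+1}
    logits -= logits.max(axis=1, keepdims=True)
    S = np.exp(logits)
    S /= S.sum(axis=1, keepdims=True)
    S[np.arange(m), y] -= 1                  # G_{L+1} = S - I_y
    # Backward pass: G_i = (G_{i+1} o sigma'(Z_i W_i)) W_i^T,
    #                grad_W_i = Z_i^T (G_{i+1} o sigma'(Z_i W_i))
    G = S
    grads = [None] * len(weights)
    for i in reversed(range(len(weights))):
        local = G if i == len(weights) - 1 else G * (Zs[i + 1] > 0)
        grads[i] = Zs[i].T @ local / m
        G = local @ weights[i].T
    return grads
```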
Numerical differentiation: directly compute each partial derivative by its definition
$$ \frac{\partial f(\theta)}{\partial \theta_i} = \lim_{\epsilon \rightarrow 0}\frac{f(\theta + \epsilon e_i) - f(\theta)}{\epsilon} $$
A more numerically accurate way to approximate the gradient is
(Recall the Taylor expansion $f(\theta + \delta)=f(\theta)+f'(\theta)\delta + \frac{1}{2}f''(\theta)\delta^2+O(\delta^3)$.)
$$ \frac{\partial f(\theta)}{\partial \theta_i} = \frac{f(\theta+\epsilon e_i) - f(\theta-\epsilon e_i)}{2\epsilon} + O(\epsilon^2) $$
Both approximations suffer from numerical error and are less efficient to compute than analytic gradients. However, numerical differentiation is a powerful tool for checking an implementation of an automatic differentiation algorithm in unit test cases.
Checking every $\theta_i$ this way is still very costly. Instead:
$$ \delta^{\mathsf{T}}\nabla_{\theta}f(\theta) = \frac{f(\theta+\epsilon\delta) - f(\theta-\epsilon\delta)}{2\epsilon} + O(\epsilon^2) $$
Pick a few random $\delta$ from the unit ball and check that this identity holds.
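A sketch of this gradient check as a unit-test helper; `check_gradient` and its arguments are hypothetical names, not any library's API:

```python
import numpy as np

def check_gradient(f, grad_f, theta, eps=1e-5, tol=1e-4, trials=10):
    """Compare an analytic gradient against central differences.

    f:      scalar-valued function of a parameter array theta
    grad_f: claimed gradient of f, returning an array shaped like theta
    """
    rng = np.random.default_rng(0)
    g = grad_f(theta)
    for _ in range(trials):
        delta = rng.normal(size=theta.shape)
        delta /= np.linalg.norm(delta)            # direction from the unit ball
        numeric = (f(theta + eps * delta) - f(theta - eps * delta)) / (2 * eps)
        analytic = np.sum(g * delta)              # delta^T grad_theta f(theta)
        assert abs(numeric - analytic) <= tol * (1 + abs(analytic))

# e.g. check_gradient(lambda t: np.sum(t ** 2), lambda t: 2 * t, np.ones((3, 4)))
```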
Symbolic differentiation derives gradients by applying the sum, product, and chain rules:
$$ \begin{aligned} \frac{\partial(f(\theta)+g(\theta))}{\partial \theta}&=\frac{\partial f(\theta)}{\partial \theta}+\frac{\partial g(\theta)}{\partial \theta} \\ \frac{\partial(f(\theta)g(\theta))}{\partial \theta}&=g(\theta)\frac{\partial f(\theta)}{\partial \theta}+f(\theta)\frac{\partial g(\theta)}{\partial \theta} \\ \frac{\partial f(g(\theta))}{\partial \theta} &= \frac{\partial f(g(\theta))}{\partial g(\theta)}\frac{\partial g(\theta)}{\partial \theta} \end{aligned} $$
Doing so naively can result in wasted computation, because intermediate results are not reused.
Computational graph: a directed acyclic graph (DAG) in which each node represents an (intermediate) value in the computation and each edge represents an input-output relation.
Forward mode automatic differentiation (AD): define $\dot{v}_i = \frac{\partial v_i}{\partial x_1}$. We can then compute the $\dot{v}_i$ iteratively, following the forward topological order of the computational graph.
For $f:\mathbb{R}^n \rightarrow \mathbb{R}^k$, we need $n$ forward AD passes to get the gradient with respect to every input; this is fine when $n$ is small and $k$ is large, but we mostly care about the opposite case, $k=1$ (a scalar loss) with large $n$.
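A toy sketch of forward mode AD via “dual numbers” that carry a value together with its derivative; the `Dual` class and its operator set are purely illustrative:

```python
class Dual:
    """A value together with dot_v = dv/dx for one chosen input x."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # Product rule: d(uv) = u dv + v du
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

# Example: y = x1 * x2 + x1, derivative w.r.t. x1 at (x1, x2) = (2, 3)
x1 = Dual(2.0, dot=1.0)   # seed dx1/dx1 = 1
x2 = Dual(3.0, dot=0.0)
y = x1 * x2 + x1
print(y.val, y.dot)        # 8.0 and dy/dx1 = x2 + 1 = 4.0
```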
Reverse mode AD: define the adjoint $\overline{v_i} = \frac{\partial y}{\partial v_i}$. We can then compute the $\overline{v_i}$ iteratively, following the reverse topological order of the computational graph.
When $v_1$ is used along multiple pathways (say through $v_2$ and $v_3$), $y$ can be written in the form $y=f(v_2,v_3)$, so
$\overline{v_1}=\overline{v_2}\frac{\partial v_2}{\partial v_1}+\overline{v_3}\frac{\partial v_3}{\partial v_1}$
Define the partial adjoint $\overline{v_{i\rightarrow j}} = \overline{v_j}\frac{\partial v_j}{\partial v_i}$ for each input-output node pair $i$ and $j$. We can compute the partial adjoints separately and then sum them together:
$$ \overline{v_i}=\sum_{j\in next(i)}\overline{v_{i\rightarrow j}} $$
```python
def gradient(out):
    # (reverse_topo_order, inputs, partial_grad, adjoint_input are
    #  pseudocode hooks into the graph API, as in the original sketch)
    # Dictionary recording a list of partial adjoints for each node
    node_to_grad = {out: [1]}
    for i in reverse_topo_order(out):
        # Sum up partial adjoints: v_bar_i = sum_j v_bar_{i->j}
        adjoint_i = sum(node_to_grad[i])
        for k in inputs(i):
            # Compute v_bar_{k->i} = v_bar_i * (partial v_i / partial v_k)
            partial_adjoint_k_i = adjoint_i * partial_grad(i, k)
            # "Propagate" the partial adjoint to the input node:
            # append v_bar_{k->i} to node_to_grad[k]
            node_to_grad.setdefault(k, []).append(partial_adjoint_k_i)
    # Return the (summed) adjoint of the input node
    return adjoint_input
```
Two advantages of reverse mode AD over backpropagation: the adjoint computation is itself constructed as an extension of the computational graph (so we can, for example, take gradients of gradients by running the same procedure on the extended graph), and this explicit gradient graph can be analyzed and optimized before execution.
Define the adjoint for a tensor $\overline{Z}=\begin{bmatrix} \frac{\partial y}{\partial Z_{1,1}} & \dots & \frac{\partial y}{\partial Z_{1,n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y}{\partial Z_{m,1}} & \dots & \frac{\partial y}{\partial Z_{m,n}} \\ \end{bmatrix}$
Forward evaluation trace
$$ Z_{ij}=\sum_k X_{ik}W_{kj} \\ v=f(Z) $$
Reverse evaluation in scalar form
$$ \overline{X_{i,k}}=\sum_j \frac{\partial Z_{i,j}}{\partial X_{i,k}}\overline{Z_{i,j}}=\sum_j W_{k,j}\overline{Z_{i,j}} $$
Forward matrix form
$$ Z=XW\\ v=f(Z) $$
Reverse matrix form
$$ \overline{X}=\overline{Z}W^{\mathsf{T}} $$
Define the adjoint value to be of the same data type as the forward value (e.g., a matrix adjoint for a matrix value), together with an adjoint propagation rule for each operation; then the same algorithm works.
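A quick numerical sanity check of the matrix-form rule $\overline{X}=\overline{Z}W^{\mathsf{T}}$ (a sketch; the downstream scalar $v=f(Z)=\sum Z^2$ is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)
X, W = rng.normal(size=(4, 3)), rng.normal(size=(3, 5))
Z = X @ W
# Downstream scalar v = f(Z) = sum(Z^2), so Z_bar = dv/dZ = 2 Z
Z_bar = 2 * Z
# Reverse matrix form: X_bar = Z_bar W^T (symmetrically, W_bar = X^T Z_bar)
X_bar = Z_bar @ W.T
# Compare one entry against a central difference
eps, (i, k) = 1e-6, (1, 2)
Xp, Xm = X.copy(), X.copy()
Xp[i, k] += eps
Xm[i, k] -= eps
numeric = (np.sum((Xp @ W) ** 2) - np.sum((Xm @ W) ** 2)) / (2 * eps)
print(np.isclose(X_bar[i, k], numeric))   # True
```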