Chufan Chen's Homepage

Deep Learning System

2023-06-23

Lec1

Machine learning as data-driven programming: collect a training set with known labels and feed it into a machine learning algorithm, which will (if done well) automatically produce a program that solves the task. Every machine learning algorithm consists of three different elements:

  1. The hypothesis class: the “program structure”, parameterized via a set of parameters, that describes how we map inputs (e.g., images of digits) to outputs (e.g., class labels, or probabilities of different class labels)
  2. The loss function: a function that specifies how “well” a given hypothesis (i.e., a choice of parameters) performs on the task of interest
  3. An optimization method: a procedure for determining a set of parameters that (approximately) minimizes the sum of the losses over the training set

Multi-class classification setting

In the k-class classification setting:

Training data: $x^{(i)} \in \mathbb{R}^n, y^{(i)} \in \{1,\dots,k\}$ for $i=1,\dots,m$.

A linear hypothesis function maps inputs $x \in \mathbb{R}^n$ to $k$-dimensional vectors

$$ h:\mathbb{R}^n \rightarrow \mathbb{R}^k $$

where $h_i(x)$ indicates some measure of “belief” in how likely the label is to be class $i$ (i.e., the “most likely” prediction is the coordinate $i$ with the largest $h_i(x)$).

A linear hypothesis function uses a linear operator (i.e. matrix multiplication) for this transformation

$$h_{\theta}(x)=\theta^Tx$$

for parameters $\theta \in \mathbb{R}^{n\times k}$

It is often more convenient (for efficient coding) to write the data and operations in matrix “batch” form

$$ X \in \mathbb{R}^{m \times n}= \begin{bmatrix}-x^{{(1)}^T}- \\ \vdots \\ -x^{{(m)}^T}-\end{bmatrix},\quad y \in \{1,\dots,k\}^m=\begin{bmatrix}y^{(1)} \\ \vdots \\ y^{(m)} \end{bmatrix} $$

Then the linear hypothesis applied to this batch can be written as

$$ h_{\theta}(X)= \begin{bmatrix}-h_{\theta}(x^{(1)})^T- \\ \vdots \\ -h_{\theta}(x^{(m)})^T-\end{bmatrix}=\begin{bmatrix}-x^{{(1)}^T}\theta- \\ \vdots \\ -x^{{(m)}^T}\theta- \end{bmatrix}= X\theta $$
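As a quick illustration, here is a minimal NumPy sketch of the batch linear hypothesis (the shapes and names are illustrative, not from the lecture):

import numpy as np

m, n, k = 5, 4, 3          # batch size, input dimension, number of classes (illustrative)
X = np.random.randn(m, n)  # each row is one example x^{(i)T}
theta = np.zeros((n, k))   # parameters of the linear hypothesis

H = X @ theta              # batch hypothesis, shape (m, k); row i is h_theta(x^{(i)})^T
pred = H.argmax(axis=1)    # "most likely" class for each example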

Classification error

$$ l_{error}(h(x),y)= \begin{cases} 0 & \text{if } \operatorname{argmax}_{i}h_{i}(x)=y\\ 1 & \text{otherwise} \end{cases} $$

We typically use this loss function to assess the quality of classifiers. However, this error is a bad loss function for optimization (i.e., for selecting the best parameters), because it is not differentiable.

Softmax/cross-entropy loss

$$ z_i=p(\text{label}=i)=\frac{\exp(h_i(x))}{\sum_{j=1}^k \exp(h_j(x))} \Longleftrightarrow z \equiv \operatorname{normalize}(\exp(h(x))) $$ $$ l_{ce}(h(x),y)=-\log p(\text{label}=y)=-h_y(x)+\log\sum_{j=1}^k\exp(h_j(x)) $$
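A minimal NumPy sketch of the softmax probabilities and the cross-entropy loss, averaged over a batch (a sketch of the formulas above; subtracting the row-wise max is a standard numerical-stability trick, not part of the math):

import numpy as np

def softmax(H):
    # H: (m, k) hypothesis values; returns z = normalize(exp(H)) row-wise
    E = np.exp(H - H.max(axis=1, keepdims=True))  # subtract max for numerical stability
    return E / E.sum(axis=1, keepdims=True)

def cross_entropy_loss(H, y):
    # l_ce(h(x), y) = -h_y(x) + log sum_j exp(h_j(x)), averaged over the batch
    # (labels y are assumed 0-indexed here)
    m = H.shape[0]
    Z = softmax(H)
    return -np.log(Z[np.arange(m), y]).mean()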

Solving the associated optimization problem

$$ \mathop{\text{minimize}}_{\theta}\;\frac{1}{m}\sum_{i=1}^m l(h_{\theta}(x^{(i)}),y^{(i)}) $$

Softmax regression (linear hypothesis class and softmax loss):

$$ \mathop{\text{minimize}}_{\theta}\;\frac{1}{m}\sum_{i=1}^m l_{ce}(\theta^Tx^{(i)},y^{(i)}) $$

Gradient descent

For a matrix-input, scalar output function $f: \mathbb{R}^{n \times k} \rightarrow \mathbb{R}$, the gradient is defined as the matrix of partial derivatives

$$ \nabla_{\theta}f(\theta) \in \mathbb{R}^{n \times k}=\begin{bmatrix}\frac{\partial f(\theta)}{\partial \theta_{11}} & \dots & \frac{\partial f(\theta)}{\partial \theta_{1k}} \\ \vdots & \ddots & \vdots \\ \frac{\partial f(\theta)}{\partial \theta_{n1}} & \dots & \frac{\partial f(\theta)}{\partial \theta_{nk}} \end{bmatrix} $$

Gradient points in the direction that most increases $f$ (locally). To minimize a function, the gradient descent algorithm proceeds by iteratively taking steps in the direction of the negative gradient

$$ \theta := \theta -\alpha\nabla_{\theta}f(\theta) $$

where $\alpha > 0$ is a step size or learning rate.
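For example, a minimal sketch of gradient descent on a simple least-squares objective (my own toy example, not from the lecture):

import numpy as np

# Minimize f(theta) = 0.5 * ||A theta - b||^2 with plain gradient descent
A = np.random.randn(10, 3)
b = np.random.randn(10)
theta = np.zeros(3)
alpha = 0.01                      # step size / learning rate

for _ in range(500):
    grad = A.T @ (A @ theta - b)  # gradient of f at the current theta
    theta = theta - alpha * grad  # step in the direction of the negative gradient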

Stochastic gradient descent

If our objective (as is the case in machine learning) is a sum of individual losses, we don't want to compute the gradient over all examples just to make a single update to the parameters. Instead, we take many gradient steps, each based on a minibatch (a small partition of the data), so that we make many parameter updates in a single “pass” over the data.

To apply (stochastic) gradient descent to softmax regression, we need the gradient of the softmax objective, $\nabla_{\theta}l_{ce}(\theta^Tx,y)$. Start by differentiating the loss with respect to the hypothesis output $h$.

For vector $h \in \mathbb{R}^k$

$$ \begin{aligned} \frac{\partial l_{ce}(h,y)}{\partial h_i}&=\frac{\partial}{\partial h_i}\left(-h_y+\log \sum_{j=1}^{k}\exp h_j\right) \\ &=-1\{i=y\}+\frac{\exp h_i}{\sum_{j=1}^k \exp h_j} \end{aligned} $$

In vector form: $\nabla_hl_{ce}(h,y)=z-e_y$, where $z=\operatorname{normalize}(\exp(h))$ is the softmax of $h$ (i.e., $\exp(h)$ scaled so that its entries sum to one) and $e_y$ is the $y$-th standard basis vector.

Chain rule of multivariate calculus

One trick to calculate this is to treat everything as if it were a scalar, apply the single-variable chain rule, and then adjust the notation (transposes and ordering) so that the dimensions of the result match.

$$ \frac{\partial}{\partial \theta}l_{ce}(\theta^Tx,y)=\frac{\partial l_{ce}(\theta^Tx,y)}{\partial(\theta^Tx)}\frac{\partial(\theta^Tx)}{\partial\theta}=(z-e_y)(x) $$

$$ \nabla_{\theta}l_{ce}(\theta^Tx,y)\in \mathbb{R}^{n \times k}=x(z-e_y)^T $$

The same process works if we use the “matrix batch” form of the loss

$$ \begin{aligned} \nabla_{\theta}\ell_{\rm ce}(X\theta, y) \in \mathbb{R}^{n \times k} &= X^{\mathsf{T}}(Z - I_y),\hspace{3ex} \\ Z &= {\rm normalize}(\exp(X\theta)) \end{aligned} $$

Repeat until the parameters/loss converge:

  1. Iterate over minibatches $X \in \mathbb{R}^{B\times n},y \in \{1,\dots,k\}^B$ of the training set
  2. Update the parameters $\theta:=\theta-\frac{\alpha}{B}X^T(Z-I_y)$
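Putting the pieces together, here is a minimal NumPy sketch of one pass of this SGD loop for softmax regression (assuming a `softmax` helper like the one sketched earlier; labels are 0-indexed and names are illustrative):

import numpy as np

def softmax_regression_epoch(X, y, theta, lr=0.1, batch=100):
    # X: (m, n) data, y: (m,) labels in {0, ..., k-1}, theta: (n, k) parameters
    m, k = X.shape[0], theta.shape[1]
    for start in range(0, m, batch):
        Xb, yb = X[start:start+batch], y[start:start+batch]
        B = Xb.shape[0]
        Z = softmax(Xb @ theta)          # predicted class probabilities, shape (B, k)
        Iy = np.zeros((B, k))
        Iy[np.arange(B), yb] = 1         # one-hot labels, i.e. I_y
        grad = Xb.T @ (Z - Iy) / B       # gradient of the minibatch-average loss
        theta -= lr * grad               # SGD update
    return theta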

Lec2

A linear hypothesis function forms $k$ linear functions of the input and predicts the class with the largest value; this is equivalent to partitioning the input space into $k$ linear regions corresponding to each class.

We want some way to separate these points via a nonlinear set of class boundaries. One idea is to apply a linear classifier to some (potentially higher-dimensional) features of the data

$$ \begin{aligned} h_{\theta}(x)&=\theta^T\phi(x) \\ \theta \in \mathbb{R}^{d\times k}&,\phi:\mathbb{R}^n \rightarrow \mathbb{R}^d \end{aligned} $$

Feature function $\phi$

  1. Manual engineering of features relevant to the problem
  2. Learned from data

A linear feature function $\phi(x)=W^Tx$ does not help, because it is equivalent to just another linear classifier:

$$ h_{\theta}(x)=\theta^T\phi(x)=\theta^TW^Tx=\tilde{\theta}^Tx $$

Instead, we can use essentially any nonlinear function of linear features, $\phi(x)=\sigma(W^Tx)$, where $W \in \mathbb{R}^{n \times d}$ and $\sigma: \mathbb{R}^d \rightarrow \mathbb{R}^d$ is any nonlinear function.

A neural network refers to a particular type of hypothesis class, consisting of multiple, parameterized, differentiable functions (a.k.a. “layers”) composed together in any manner to form the output.

Two-layer neural network:

$$ \begin{aligned} h_{\theta}(x) &= W_2^T\sigma(W_1^Tx)\\ \theta &= \{W_1\in \mathbb{R}^{n \times d},W_2 \in \mathbb{R}^{d \times k}\} \end{aligned} $$

where $\sigma: \mathbb{R} \rightarrow \mathbb{R}$ is a nonlinear function applied elementwise to the vector (e.g. sigmoid, ReLU). Written in batch matrix form:

$$ h_{\theta}(X)=\sigma(XW_1)W_2 $$
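A minimal NumPy sketch of the two-layer batch forward pass, using ReLU as the elementwise nonlinearity (dimensions are illustrative):

import numpy as np

def relu(x):
    return np.maximum(x, 0)

m, n, d, k = 5, 4, 16, 3            # batch size, input dim, hidden dim, classes
X = np.random.randn(m, n)
W1 = np.random.randn(n, d) * 0.1    # first-layer weights
W2 = np.random.randn(d, k) * 0.1    # second-layer weights

H = relu(X @ W1) @ W2               # h_theta(X) = sigma(X W1) W2, shape (m, k)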

Universal function approximation

Theorem(1D case):

Given any smooth function $f:\mathbb{R}\rightarrow \mathbb{R}$, closed region $D \subset \mathbb{R}$, and $\epsilon > 0$, we can construct a one-hidden-layer neural network $\hat{f}$ such that

$$ \max_{x\in D}\mid f(x)-\hat{f}(x)\mid \leq \epsilon $$

Proof

Select a dense sampling of points $(x^{(i)},f(x^{(i)}))$ over $D$, and create a neural network that passes exactly through these points. Because this neural network function is piecewise linear and $f$ is smooth, by choosing the $x^{(i)}$ close enough together we can approximate the function arbitrarily closely.
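Here is a minimal NumPy sketch of this construction on $D=[0,1]$, assuming a ReLU nonlinearity: the piecewise-linear interpolant of the sampled points is written as a one-hidden-layer network with one hidden unit per sample point.

import numpy as np

def relu(x):
    return np.maximum(x, 0)

f = np.sin                                # the smooth function to approximate
xs = np.linspace(0, 1, 50)                # dense sampling of D = [0, 1]
ys = f(xs)

# Interpolant as a one-hidden-layer ReLU network:
# fhat(x) = y_0 + sum_i c_i * relu(x - x_i), where c_i is the change in slope at knot x_i
slopes = np.diff(ys) / np.diff(xs)
c = np.diff(slopes, prepend=0.0)

def fhat(x):
    x = np.atleast_1d(x)
    return ys[0] + relu(x[:, None] - xs[:-1]) @ c    # one ReLU unit per knot

x_test = np.linspace(0, 1, 1000)
print(np.max(np.abs(f(x_test) - fhat(x_test))))      # max error shrinks as the sampling gets denser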

Fully-connected deep networks

A more generic form is an $L$-layer neural network, a.k.a. a “multi-layer perceptron” (MLP), feedforward network, or fully-connected network, written in batch form:

$$ \begin{aligned} Z_{i+1}&=\sigma_{i}(Z_iW_i),\quad i=1,\dots,L\\ Z_1&=X\\ h_{\theta}(X)&=Z_{L+1}\\ [Z_i &\in \mathbb{R}^{m \times n_i},\quad W_i\in \mathbb{R}^{n_i \times n_{i+1}}] \end{aligned} $$

with nonlinearities $\sigma_i:\mathbb{R} \rightarrow \mathbb{R}$ applied elementwise, and parameters

$$ \theta = \{W_1,\dots,W_L\} $$

Solve the optimization problem using SGD, just with $h_{\theta}(x)$ now being a neural network.

The gradient(s) of a two-layer network

$$ \nabla_{\{W_1,W_2\}}l_{ce}(\sigma(XW_1)W_2,y) $$

The gradient w.r.t. $W_2$ looks identical to the softmax regression case:

$$ \begin{aligned} \frac{\partial \ell_{\rm ce}(\sigma(XW_1)W_2, y)}{\partial W_2} &= \frac{\partial \ell_{\rm ce}(\sigma(XW_1)W_2, y)}{\partial \sigma(XW_1)W_2} \cdot \frac{\partial \sigma(XW_1)W_2}{\partial W_2} \\ &= (S-I_y)\cdot\sigma(XW_1) \\ S &= {\rm normalize}(\exp(\sigma(XW_1)W_2)) \end{aligned} $$

so (matching sizes $d\times k$) the gradient is

$$ \nabla_{W_2}\ell_{\rm ce}(\sigma(XW_1)W_2, y) = \sigma(XW_1)^\mathsf{T}(S-I_y) $$

The gradient w.r.t. $W_1$

$$ \begin{aligned} \frac{\partial \ell_{\rm ce}(\sigma(XW_1)W_2, y)}{\partial W_1} &= \frac{\partial \ell_{\rm ce}(\sigma(XW_1)W_2, y)}{\partial \sigma(XW_1)W_2} \cdot \frac{\partial \sigma(XW_1)W_2}{\partial \sigma(XW_1)} \cdot \frac{\partial \sigma(XW_1)}{\partial XW_1}\cdot \frac{\partial XW_1}{\partial W_1} \\ &= (S-I_y)\cdot W_2\cdot(\sigma'(XW_1)) \cdot X \end{aligned} $$

and so the gradient is

$$ \nabla_{W_1}\ell_{\rm ce}(\sigma(XW_1)W_2, y) = X^{\mathsf{T}}\left((S-I_y)W_2^{\mathsf{T}}\circ\sigma'(XW_1)\right) $$

where $\circ$ denotes elementwise multiplication.
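A minimal NumPy sketch of these two manual gradients for a ReLU network, reusing the `relu` and `softmax` helpers sketched above (divide by the batch size if you optimize the average rather than the summed loss):

import numpy as np

def nn_gradients(X, y, W1, W2):
    # Manual gradients of l_ce(sigma(X W1) W2, y) for a two-layer ReLU network
    m, k = X.shape[0], W2.shape[1]
    Z1 = relu(X @ W1)                                   # sigma(X W1)
    S = softmax(Z1 @ W2)                                # normalize(exp(sigma(X W1) W2))
    Iy = np.zeros((m, k))
    Iy[np.arange(m), y] = 1                             # one-hot labels I_y
    grad_W2 = Z1.T @ (S - Iy)                           # sigma(X W1)^T (S - I_y)
    grad_W1 = X.T @ (((S - Iy) @ W2.T) * (X @ W1 > 0))  # X^T((S - I_y) W2^T o sigma'(X W1))
    return grad_W1, grad_W2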

Backpropagation in general

Consider our fully-connected network:

$Z_{i+1}=\sigma_i(Z_iW_i),i=1,\dots,L$

Then the gradient of the loss w.r.t. the parameters $W_i$ is

$$ \frac{\partial \ell(Z_{L+1}, y)}{\partial W_i} = \frac{\partial \ell}{\partial Z_{L+1}} \cdot \frac{\partial Z_{L+1}}{\partial Z_L} \cdot \frac{\partial Z_{L}}{\partial Z_{L-1}} \cdot \frac{\partial Z_{L-1}}{\partial Z_{L-2}} \cdot \cdots \cdot \frac{\partial Z_{i+2}}{\partial Z_{i+1}}\cdot \frac{\partial Z_{i+1}}{\partial W_i} $$

Let $G_{i+1}=\frac{\partial l(Z_{L+1}, y)}{\partial Z_{i+1}}$

Then we have a simple backward iteration to compute the $G_i$'s

$$ G_{i} = G_{i+1} \cdot \frac{\partial Z_{i+1}}{\partial Z_i} = G_{i+1} \cdot \frac{\partial \sigma_i(Z_iW_i)}{\partial Z_iW_i} \cdot \frac{\partial Z_iW_i}{\partial Z_i} = G_{i+1}\cdot \sigma'(Z_iW_i)\cdot W_i $$

so matching the size, $G_{i}=\frac{\partial l(Z_{L+1}, y)}{\partial Z_{i}}=\nabla_{Z_i}l(Z_{L+1},y) \in \mathbb{R}^{m\times n_i}$

$$ G_{i} = G_{i+1} \cdot \sigma'(Z_iW_i)\cdot W_i=\left(G_{i+1}\circ \sigma'(Z_iW_i)\right) W_i^T $$

A similar formula holds for the gradient w.r.t. $W_i$, $\nabla_{W_i}l(Z_{L+1},y) \in \mathbb{R}^{n_i\times n_{i+1}}$:

$$ \begin{aligned} \frac{\partial \ell(Z_{L+1}, y)}{\partial W_i} &= G_{i+1} \cdot \frac{\partial \sigma_i(Z_iW_i)}{\partial Z_iW_i} \cdot \frac{\partial Z_iW_i}{\partial W_i} = G_{i+1}\cdot \sigma'(Z_iW_i)\cdot Z_i \\ \Longrightarrow \quad \nabla_{W_i}\ell(Z_{L+1}, y) &= Z_i^{\mathsf{T}}\left(G_{i+1} \circ \sigma'(Z_iW_i)\right) \end{aligned} $$

Putting it all together, we can efficiently compute all the gradients we need for a neural network by following the procedure below

  1. Initialize $Z_1=X$ and iterate $Z_{i+1}=\sigma_i(Z_iW_i),\ i=1,\dots,L$
  2. Initialize $G_{L+1}=\nabla_{Z_{L+1}}\ell(Z_{L+1},y)=S-I_y$ and iterate $G_{i}=(G_{i+1}\circ \sigma_i'(Z_i W_i))W_i^T,\ i=L,\dots,1$

And we can compute all the needed gradients along the way

$$ \nabla_{W_i}\ell(Z_{L+1}, y) = Z_i^{\mathsf{T}}\left(G_{i+1} \circ \sigma_i'(Z_iW_i)\right) $$
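A minimal NumPy sketch of this forward/backward procedure for an MLP with ReLU hidden layers, a linear last layer, and the softmax loss (a sketch assuming the `relu` and `softmax` helpers above; caching every $Z_i$ during the forward pass is exactly the “intelligent caching” mentioned below):

import numpy as np

def mlp_gradients(X, y, Ws):
    # Ws = [W_1, ..., W_L]; ReLU on hidden layers, identity on the last layer
    Zs = [X]
    for W in Ws[:-1]:
        Zs.append(relu(Zs[-1] @ W))          # forward pass: Z_{i+1} = sigma_i(Z_i W_i)
    Zs.append(Zs[-1] @ Ws[-1])               # Z_{L+1}, left linear for the softmax loss

    m, k = X.shape[0], Ws[-1].shape[1]
    Iy = np.zeros((m, k))
    Iy[np.arange(m), y] = 1
    G = softmax(Zs[-1]) - Iy                 # G_{L+1} = S - I_y (last nonlinearity is identity)
    grads = [None] * len(Ws)
    for i in reversed(range(len(Ws))):
        grads[i] = Zs[i].T @ G               # grad of the loss w.r.t. this layer's weights
        if i > 0:
            G = (G @ Ws[i].T) * (Zs[i] > 0)  # backward: multiply by W^T, then apply the ReLU mask below
    return grads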

“Backpropagation” is just chain rule + intelligent caching of intermediate results

Each layer needs to be able to multiply the “incoming backward” gradient $G_{i+1}$ by its derivatives $\frac{\partial Z_{i+1}}{\partial W_i}$, an operation called the “vector-Jacobian product”. This process can be generalized to arbitrary computational graphs.

Lec3

Numerical differentiation

Directly compute the partial gradient by definition

$$ \frac{\partial f(\theta)}{\partial \theta_i} = \lim_{\epsilon \rightarrow 0}\frac{f(\theta + \epsilon e_i) - f(\theta)}{\epsilon} $$

A more numerically accurate way to approximate the gradient is

(Recall the Taylor expansion $f(\theta + \delta)=f(\theta)+f'(\theta)\delta + \frac{1}{2}f''(\theta)\delta^2+O(\delta^3)$.)

$$ \frac{\partial f(\theta)}{\partial \theta_i} = \frac{f(\theta+\epsilon e_i) - f(\theta-\epsilon e_i)}{2\epsilon} + o(\epsilon^2) $$

Numerical differentiation suffers from numerical error and is less efficient to compute.

However, numerical differentiation is a powerful tool for checking an implementation of an automatic differentiation algorithm in unit tests.

If we check every $\theta_i$ individually, the numerical gradient is still very costly to compute. Instead:

$$ \delta^{\mathsf{T}}\nabla_{\theta}f(\theta) = \frac{f(\theta+\epsilon\delta) - f(\theta-\epsilon\delta)}{2\epsilon} + o(\epsilon^2) $$

Pick $\delta$ from the unit ball and check that the above identity holds (approximately).
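A minimal sketch of such a check (the tolerance, number of trials, and function names are illustrative):

import numpy as np

def check_gradient(f, theta, grad, eps=1e-5, trials=5):
    # f: scalar-valued function of theta; grad: candidate gradient of f at theta
    for _ in range(trials):
        delta = np.random.randn(*theta.shape)
        delta /= np.linalg.norm(delta)               # pick delta from the unit ball
        numeric = (f(theta + eps * delta) - f(theta - eps * delta)) / (2 * eps)
        analytic = np.sum(grad * delta)              # delta^T grad (elementwise sum for matrices)
        assert abs(numeric - analytic) < 1e-4, (numeric, analytic)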

Symbolic differentiation

$$ \begin{aligned} \frac{\partial(f(\theta)+g(\theta))}{\partial \theta}&=\frac{\partial f(\theta)}{\partial \theta}+\frac{\partial g(\theta)}{\partial \theta} \\ \frac{\partial(f(\theta)g(\theta))}{\partial \theta}&=g(\theta)\frac{\partial f(\theta)}{\partial \theta}+f(\theta)\frac{\partial g(\theta)}{\partial \theta} \\ \frac{\partial(f(g(\theta))}{\partial \theta} &= \frac{\partial(f(g(\theta))}{\partial g(\theta)}\frac{\partial g(\theta)}{\partial \theta} \end{aligned} $$

Doing this naively can result in wasted computation, because intermediate results are not reused.

Computational graph

A computational graph is a directed acyclic graph (DAG). Each node represents an (intermediate) value in the computation; edges represent input-output relations.

Forward mode automatic differentiation (AD)

Define $\dot{v}_i = \frac{\partial v_i}{\partial x_1}$.

We can then compute the $\dot{v}_i$ iteratively in the forward topological order of the computational graph.

For $f:\mathbb{R}^n \rightarrow \mathbb{R}^k$, we need $n$ forward AD passes to get the gradient with respect to each input. This is efficient when $n$ is small and $k$ is large.

In machine learning we mostly care about the opposite case: $k=1$ (a scalar loss) and large $n$.
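To make the idea concrete, here is a minimal forward-mode AD sketch using “dual numbers” (my own illustration, not the lecture's code): each value carries $\dot v = \partial v/\partial x_1$ and every operation propagates it.

class Dual:
    # A value v together with dot_v = dv/dx1 for one chosen input x1
    def __init__(self, v, dot=0.0):
        self.v, self.dot = v, dot

    def __add__(self, other):
        return Dual(self.v + other.v, self.dot + other.dot)

    def __mul__(self, other):
        return Dual(self.v * other.v, self.dot * other.v + self.v * other.dot)

# d/dx1 of f(x1, x2) = x1*x2 + x1 at (x1, x2) = (2, 3): seed dot = 1 on x1 only
x1, x2 = Dual(2.0, 1.0), Dual(3.0, 0.0)
y = x1 * x2 + x1
print(y.v, y.dot)   # 8.0 4.0, since df/dx1 = x2 + 1 = 4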

Reverse mode automatic differentiation(AD)

Define adjoint $\overline{v_i} = \frac{\partial y}{\partial v_i}$

We can then compute the $\overline{v_i}$ iteratively in the reverse topological order of the computational graph.

Derivation for the multiple pathway case

Suppose $v_1$ is used along multiple pathways (through $v_2$ and $v_3$). Then $y$ can be written as $y=f(v_2,v_3)$, and

$\overline{v_1}=\overline{v_2}\frac{\partial v_2}{\partial v_1}+\overline{v_3}\frac{\partial v_3}{\partial v_1}$

Define the partial adjoint $\overline{v_{i\rightarrow j}} = \overline{v_j}\frac{\partial v_j}{\partial v_i}$ for each input-output node pair $i$ and $j$. We can compute the partial adjoints separately and then sum them together:

$$ \overline{v_i}=\sum_{j\in next(i)}\overline{v_{i\rightarrow j}} $$

def gradient(out, input_node):
    # Dictionary that records the list of partial adjoints accumulated for each node
    node_to_grad = {out: [1]}

    for i in reverse_topo_order(out):
        # Sum up partial adjoints: \overline{v_i} = \sum_j \overline{v_{i->j}}
        adjoint_i = sum(node_to_grad[i])
        for k in inputs(i):
            # Compute \overline{v_{k->i}} = \overline{v_i} * (\partial v_i / \partial v_k)
            # (partial_grad(i, k) is a placeholder for the local partial derivative)
            partial_adjoint_k_i = adjoint_i * partial_grad(i, k)
            # "Propagate" the partial adjoint to the input node k
            node_to_grad.setdefault(k, []).append(partial_adjoint_k_i)

    # Adjoint (gradient) of the desired input node
    return sum(node_to_grad[input_node])

Reverse mode AD by extending the computational graph

Instead of computing adjoint values directly, reverse mode AD can construct new nodes in the computational graph that compute the adjoints; the output of the procedure is then itself a computational graph.

Reverse mode AD vs Backpropagation

Two advantages of reverse mode AD over backpropagation:

  1. It can handle gradients of gradients, since the adjoint computation is itself part of the graph and can be differentiated again
  2. It exposes optimization opportunities, since the extended graph is just another forward computation

Reverse mode AD on tensors

Define the adjoint for a tensor $\overline{Z}=\begin{bmatrix} \frac{\partial y}{\partial Z_{1,1}} & \dots & \frac{\partial y}{\partial Z_{1,n}} \\ \vdots & \ddots & \vdots \\ \frac{\partial y}{\partial Z_{m,1}} & \dots & \frac{\partial y}{\partial Z_{m,n}} \\ \end{bmatrix}$

Forward evaluation trace

$$ Z_{ij}=\sum_k X_{ik}W_{kj} \\ v=f(Z) $$

Reverse evaluation in scalar form

$$ \overline{X_{i,k}}=\sum_j \frac{\partial Z_{i,j}}{\partial X_{i,k}}\overline{Z_{i,j}}=\sum_j W_{k,j}\overline{Z_{i,j}} $$

Forward matrix form

$$ Z=XW\\ v=f(Z) $$

Reverse matrix form

$$ \overline{X}=\overline{Z}W^{\mathsf{T}} $$
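A quick NumPy sanity check of the reverse matrix form (my own example: take $v=f(Z)=\sum_{i,j} Z_{ij}^2$ so that $\overline{Z}=2Z$, and compare one entry of $\overline{X}$ against a central difference):

import numpy as np

X = np.random.randn(4, 3)
W = np.random.randn(3, 2)

def v(X):
    return np.sum((X @ W) ** 2)       # scalar function of Z = X W, so Zbar = 2 Z

Zbar = 2 * (X @ W)
Xbar = Zbar @ W.T                     # reverse matrix form: Xbar = Zbar W^T

eps, (i, k) = 1e-5, (1, 2)
E = np.zeros_like(X)
E[i, k] = eps
print(Xbar[i, k], (v(X + E) - v(X - E)) / (2 * eps))   # the two numbers should agree closely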

Reverse mode AD on data structures

Define the adjoint value to be (usually) of the same data type as the forward value, together with an adjoint propagation rule for that type; then the same algorithm works.