2023-07-01
Given $D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$, learn $f_{\theta}(x)\approx y$ (e.g. linear regression, image classification).
How do we represent $f_{\theta}(x)$ (input $x$, output a guess of $y$)?
$$ f_{\theta}(x)=\theta_1x_1+\theta_2x_2+\theta_3\\ f_{\theta}(x)=\theta_1x+\theta_2x^2+\theta_3x^3\\ $$
or a neural network...
How do we measure difference between $f_{\theta}(x_i)$ and $y_i$?
$$ ||f_{\theta}(x_i)-y_i||^2\\ \delta(f_{\theta}(x_i)\neq y_i) $$
or probability...
How do we find the best setting of $\theta$?
gradient descent, random search, least squares...
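As a concrete illustration of the last option, here is a minimal numpy sketch (the data-generating function and noise level are made up for illustration) that fits the cubic model above by least squares:

```python
import numpy as np

# A minimal sketch (made-up data): fit the cubic model
# f_theta(x) = theta_1*x + theta_2*x^2 + theta_3*x^3 by least squares.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 0.5 * x - 2.0 * x**3 + 0.1 * rng.normal(size=50)  # noisy ground truth

Phi = np.stack([x, x**2, x**3], axis=1)          # feature matrix
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares solve
print(theta)  # close to [0.5, 0.0, -2.0]
```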
Unlike in supervised learning, where everything is clear and well defined, here it is not obvious what "learn representations from unlabeled data" means. One way to formulate unsupervised learning is as a problem of generative modeling (e.g. GANs, VAEs, PixelRNN, etc.)
Another type of unsupervised learning is self-supervised learning, for example language models (BERT, GPT-2, GPT-3, etc.)
Choose $f_{\theta}(s_t)=a_t$ to maximize $\sum_{t=1}^H r(s_t,a_t)$. This actually subsumes (generalizes) supervised learning.
Supervised learning: get $f_{\theta}(s_t)$ to match $y_i$.
Reinforcement learning: get $f_{\theta}(s_t)$ to maximize reward (which could be anything).
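To make the objective concrete, here is a toy sketch. Everything in it (the dynamics, the reward, the hand-written policy) is invented for illustration, not from the notes:

```python
# Made-up toy for the RL objective: a 1-D chain where the action is a
# step and the (invented) reward prefers states near the origin.
H = 20

def reward(s, a):
    return -abs(s)             # made-up r(s_t, a_t)

def policy(s):
    return -1 if s > 0 else 1  # a hand-written f_theta(s_t) = a_t

s, total = 5, 0.0
for t in range(H):
    a = policy(s)
    total += reward(s, a)      # accumulate sum_{t=1}^H r(s_t, a_t)
    s = s + a                  # made-up dynamics
print(total)
```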
Predict $y$ based on $x$.
Instead of predicting $y$ directly (learning $f_{\theta}(x) \approx y$), we predict $p(y\mid x)$.
$x$ random variable representing the input
$y$ random variable representing the output
$p(x,y)=p(x)p(y\mid x)$ chain rule
$p(y\mid x) = \frac{p(x,y)}{p(x)}$ definition of conditional probability
How about: $f_{dog}(x)=x^T\theta_{dog}, f_{cat}(x)=x^T\theta_{cat}, \vec{\theta}=\{\theta_{dog},\theta_{cat}\}$. This violates the rule that all the probabilities should be nonnegative and sum to 1, unless $\theta$ is carefully designed.
$$ p(y\mid x)=\text{softmax}(f_{dog}(x),f_{cat}(x)) $$
"softmax" could be any function (ideally one that is one-to-one and onto) that takes these inputs and outputs probabilities that are positive and sum to 1.
Why any function?
In machine learning, you produce the right answer by choosing the right parameters, not by choosing a special function. The function just needs to be general enough (i.e., it shouldn't lose too much information) so that there exist parameters ($\vec{\theta}$) that can represent the right answer.
That said though, let's choose a good function now.
How do we make a number $z$ positive?
$\exp(z)$ (convenient because it is one-to-one and onto) maps the entire real line to the entire set of positive reals.
How do we make a bunch of numbers sum to 1?
$\frac{z_j}{\sum_{i=1}^n z_i}$ (divide each one by the sum)
N possible labels
$p(y\mid x)$ - vector with N elements
$f_{\theta}(x)$ - vector-valued function with N outputs
$$ p(y=i\mid x) = \text{softmax}(f_{\theta}(x))[i]=\frac{\exp(f_{\theta,i}(x))}{\sum_{j=1}^N \exp(f_{\theta,j}(x))} $$
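A minimal numpy version of this softmax. Subtracting the max is a standard numerical-stability trick, not part of the formula above; it doesn't change the output, because adding a constant to every $f_{\theta,j}(x)$ cancels in the ratio:

```python
import numpy as np

# Minimal numpy softmax. Subtracting the max prevents overflow in exp
# without changing the result (softmax is shift-invariant).
def softmax(f):
    z = np.exp(f - np.max(f))
    return z / np.sum(z)

f = np.array([2.0, -1.0, 0.5])  # example logits f_theta(x), N = 3
p = softmax(f)
print(p, p.sum())               # positive entries that sum to 1
```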
A recipe for solving any problem ever:
$p(x)$: probability distribution over photos
"dog" $\sim p(y\mid x)$: conditional probability distribution over labels
result: $(x,y) \sim p(x,y)$
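A sketch of this generative story, with made-up stand-ins for $p(x)$ and $p(y\mid x)$: sample $x$, then sample $y$ given $x$, and the pair is a draw from $p(x,y)$:

```python
import numpy as np

# Sketch of the generative story: x ~ p(x), then y ~ p(y|x), so the
# pair (x, y) is a draw from p(x, y). Distributions are made up.
rng = np.random.default_rng(0)

x = rng.normal()                               # stand-in for "a photo"
logits = np.array([x, -x])                     # made-up f_theta(x)
p_y_given_x = np.exp(logits) / np.exp(logits).sum()
y = rng.choice(["dog", "cat"], p=p_y_given_x)  # "dog" ~ p(y|x)
print(x, y)                                    # (x, y) ~ p(x, y)
```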
$(x,y) \sim p(x,y)$
Training set: $D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$
What is $p(D)$?
Key assumption: the data are independent and identically distributed (i.i.d.): every $(x_i,y_i)$ is independent of every other $(x_j,y_j)$.
When i.i.d.: $p(D)=\prod_{i}p(x_i,y_i)=\prod_{i}p(x_i)p(y_i\mid x_i)$
We are learning $p_{\theta}(y\mid x)$; it is a model of the true $p(y\mid x)$.
A good model should make the data more probable.
idea: choose $\theta$ such that
$$ \max_{\theta} p(D)=\max_{\theta}\prod_{i}p(x_i)p_{\theta}(y_i\mid x_i) $$
This is numerically difficult because it multiplies together many numbers less than 1, which underflows floating-point arithmetic. Taking the log fixes this (the $\log p(x_i)$ terms don't depend on $\theta$, so they are constant):
$$ \log p(D)=\sum_{i}\log\big(p(x_i)p_{\theta}(y_i\mid x_i)\big)=\text{const}+\sum_{i}\log p_{\theta}(y_i\mid x_i) $$
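The numerical point is easy to demonstrate. With made-up per-example probabilities, the raw product underflows to zero while the sum of logs stays finite:

```python
import numpy as np

# A product of many probabilities underflows to 0.0 in floating point,
# while the sum of logs is fine. Probabilities here are made up.
rng = np.random.default_rng(0)
p = rng.uniform(0.3, 0.9, size=2000)  # stand-ins for p_theta(y_i | x_i)

print(np.prod(p))         # 0.0 -- underflow
print(np.sum(np.log(p)))  # finite log-likelihood
```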
Maximum Likelihood Estimation (MLE):
$$ \theta^*=\arg\max_{\theta}\sum_{i}\log p_{\theta}(y_i\mid x_i) \\ \theta^*=\arg\min_{\theta}-\sum_{i}\log p_{\theta}(y_i\mid x_i) $$
Negative log-likelihood (NLL) is our loss function.
In general: the loss function quantifies how bad $\theta$ is; we want the least bad (best) $\theta$.
Examples:
aside: cross-entropy
How similar are two distributions, $p_{\theta}$ and $p$?
$H(p,p_{\theta})=-\sum_y p(y\mid x_i)\log p_{\theta}(y\mid x_i)$
assume $y_i \sim p(y\mid x_i)$
$H(p,p_{\theta}) \approx -\log p_{\theta}(y_i\mid x_i)$
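One way to see the connection: if the "true" $p(y\mid x_i)$ is taken to be one-hot on the observed label $y_i$, the cross-entropy reduces exactly to the negative log-likelihood. A small sketch with made-up numbers:

```python
import numpy as np

# If the "true" p(y|x_i) puts all its mass on the observed label y_i,
# cross-entropy H(p, p_theta) equals -log p_theta(y_i|x_i) exactly.
p_theta = np.array([0.7, 0.2, 0.1])  # model's p_theta(y | x_i), made up
y_i = 0                              # observed label
p = np.zeros(3)
p[y_i] = 1.0                         # one-hot "true" distribution

H = -np.sum(p * np.log(p_theta))     # cross-entropy H(p, p_theta)
print(H, -np.log(p_theta[y_i]))      # identical values
```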
$\theta^* \leftarrow \arg\min_{\theta} L(\theta)$
Let's say $\theta$ is 2D.
An algorithm:
Which way does $L(\theta)$ decrease?
1D: negative slope = go to the right, positive slope = go to the left
In general: for each dimension, go in the direction opposite the slope along that dimension.
$v_1=-\frac{\partial L(\theta)}{\partial \theta_1},\ v_2=-\frac{\partial L(\theta)}{\partial \theta_2}$, etc.
This is not the only direction that decreases $L$; in fact, there is an entire half-space of such vectors. It is just the steepest one.
gradient:
$$ \nabla_{\theta}L(\theta)= \begin{bmatrix} \frac{\partial L(\theta)}{\partial \theta_1} \\ \frac{\partial L(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial L(\theta)}{\partial \theta_n} \end{bmatrix} $$
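The gradient can be sanity-checked numerically. Below is a finite-difference sketch on a made-up 2-D loss, which also verifies the claim that stepping along $-\nabla_{\theta}L$ decreases $L$:

```python
import numpy as np

# Finite-difference gradient on a made-up 2-D loss, plus a check that
# stepping along -gradient actually decreases L.
def L(theta):
    return (theta[0] - 1.0)**2 + 2.0 * (theta[1] + 0.5)**2  # made up

def numerical_grad(L, theta, eps=1e-6):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (L(theta + d) - L(theta - d)) / (2 * eps)  # central diff
    return g

theta = np.array([2.0, 2.0])
g = numerical_grad(L, theta)
print(g)                               # close to [2, 10]
print(L(theta - 0.01 * g) < L(theta))  # True: -gradient is a descent direction
```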
$f_{\theta}(x)=\begin{bmatrix} x^T\theta_{y_1} \\ x^T\theta_{y_2} \\ \vdots \\ x^T\theta_{y_m}\end{bmatrix}=x^T\theta$
$p_{\theta}(y=i\mid x)=\text{softmax}(f_{\theta}(x))[i]=\frac{\exp(f_{\theta,i}(x))}{\sum_{j=1}^N \exp(f_{\theta,j}(x))}$
Gradient descent: $\theta \leftarrow \theta - \alpha\,\nabla_{\theta}L(\theta)$, repeated until convergence, with
$L(\theta)=-\sum_{i=1}^n \log p_{\theta}(y_i\mid x_i)$
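Putting the pieces together: a minimal gradient-descent loop for this linear softmax model on synthetic data. The analytic gradient $X^T(P - Y_{\text{onehot}})/n$ of the average NLL is the standard result for this model (it isn't derived in these notes); the data, learning rate, and step count are made up:

```python
import numpy as np

# Minimal gradient descent for the linear softmax model on synthetic
# data. X.T @ (P - Y) / n is the standard analytic gradient of the
# average NLL for this model.
rng = np.random.default_rng(0)
n, d, N = 200, 3, 2                                 # examples, features, labels
X = rng.normal(size=(n, d))
y = np.argmax(X @ rng.normal(size=(d, N)), axis=1)  # made-up labels
Y = np.eye(N)[y]                                    # one-hot labels

theta = np.zeros((d, N))
for step in range(500):
    F = X @ theta                                   # logits f_theta(x_i)
    F = F - F.max(axis=1, keepdims=True)            # stability trick
    P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)
    nll = -np.log(P[np.arange(n), y]).mean()        # average NLL
    theta -= 0.5 * (X.T @ (P - Y) / n)              # gradient step
print(nll)                                          # decreases over training
```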
$P(y_1\mid x) = \frac{e^{\theta_{y_1}^T x}}{e^{\theta_{y_1}^T x}+e^{\theta_{y_2}^T x}}$ This is a bit redundant (overparameterized in $\theta$) because $P(y_1\mid x)+P(y_2\mid x)=1$.
$P(y_1\mid x) = \frac{e^{\theta_{y_1}^T x}}{e^{\theta_{y_1}^T x}+e^{\theta_{y_2}^T x}} = \frac{e^{\theta_{y_1}^T x} e^{-\theta_{y_1}^T x} }{e^{\theta_{y_1}^T x}e^{-\theta_{y_1}^T x}+e^{\theta_{y_2}^T x}e^{-\theta_{y_1}^T x}}=\frac{1}{1+e^{-\theta^T_+ x}}$
where $\theta_+=\theta_{y_1}-\theta_{y_2}$.
In general, the softmax is always overparameterized (it has one redundant parameter).
$\frac{1}{1+e^{-\theta^T_+ x}}$ is called the logistic function, also referred to as the sigmoid.
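A quick numerical check of the derivation, with made-up parameter vectors: the two-class softmax and the sigmoid of $\theta_+^T x$ agree.

```python
import numpy as np

# Check: two-class softmax equals sigmoid of theta_plus^T x,
# where theta_plus = theta_y1 - theta_y2. Parameters are made up.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta_y1 = np.array([1.0, -2.0])
theta_y2 = np.array([0.5, 0.3])
x = np.array([0.7, 1.1])

p_softmax = np.exp(theta_y1 @ x) / (np.exp(theta_y1 @ x) + np.exp(theta_y2 @ x))
p_sigmoid = sigmoid((theta_y1 - theta_y2) @ x)
print(p_softmax, p_sigmoid)  # equal
```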
zero-one loss: $\sum_i \delta(f_{\theta}(x_i) \neq y_i)$
Risk: the probability that you will get it wrong (for zero-one loss); in general, the expected value of the loss quantifies it.
$Risk=E_{x\sim p(x),y\sim p(y\mid x)}[L(x,y,\theta)]$
During training, we can't sample $x \sim p(x)$, we just have $D$.
$\text{Empirical risk}=\frac{1}{n}\sum_{i=1}^n L(x_i,y_i,\theta) \approx E_{x\sim p(x),y\sim p(y\mid x)}[L(x,y,\theta)]$
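A sketch of this approximation with a made-up 1-D squared-error setup: the empirical risk on a small training set is a noisy estimate of the expectation, which we stand in for here with a large fresh sample:

```python
import numpy as np

# Empirical risk as a sample average of the loss; the "true" risk is
# approximated with a large fresh sample. The setup is made up.
rng = np.random.default_rng(0)

def loss(x, y, theta):
    return (theta * x - y)**2  # squared-error loss for a 1-D model

theta = 1.5
x_train = rng.normal(size=20)
y_train = 2.0 * x_train                        # made-up true relationship
x_big = rng.normal(size=100_000)
y_big = 2.0 * x_big

print(np.mean(loss(x_train, y_train, theta)))  # empirical risk (n = 20)
print(np.mean(loss(x_big, y_big, theta)))      # ~ true risk E[L]
```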
Supervised learning is (usually) empirical risk minimization.
Is this the same as true risk minimization?
Overfitting: when the empirical risk is low, but the true risk is high
Underfitting: when the empirical risk is high, and the true risk is also high
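Both failure modes can be demonstrated in a few lines. In this made-up example, polynomials of degree 0, 3, and 15 are fit to 16 noisy points; comparing training error (empirical risk) against error on fresh data (a stand-in for true risk) shows underfitting, a reasonable fit, and overfitting, respectively:

```python
import numpy as np

# Fit polynomials of increasing degree to 16 noisy points and compare
# training error with error on fresh data. Setup is made up.
rng = np.random.default_rng(0)

def f_true(x):
    return np.sin(3 * x)

x_tr = rng.uniform(-1, 1, size=16)
y_tr = f_true(x_tr) + 0.1 * rng.normal(size=16)
x_te = rng.uniform(-1, 1, size=1000)
y_te = f_true(x_te) + 0.1 * rng.normal(size=1000)

for degree in [0, 3, 15]:
    coeffs = np.polyfit(x_tr, y_tr, degree)
    train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr)**2)
    test_mse = np.mean((np.polyval(coeffs, x_te) - y_te)**2)
    print(degree, train_mse, test_mse)
# degree 0: both errors high (underfitting); degree 15: train ~0, test
# high (overfitting); degree 3: reasonable on both.
```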