2023-07-01
Given $D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$, learn $f_{\theta}(x)\approx y$ (e.g. linear regression, image classification).
How do we represent $f_{\theta}(x)$ (input $x$, output a guess of $y$)?
$$ f_{\theta}(x)=\theta_1x_1+\theta_2x_2+\theta_3\\ f_{\theta}(x)=\theta_1x+\theta_2x^2+\theta_3x^3\\ $$
or a neural network...
How do we measure difference between $f_{\theta}(x_i)$ and $y_i$?
$$ ||f_{\theta}(x_i)-y_i||^2\\ \delta(f_{\theta}(x_i)\neq y_i) $$
or probability...
How do we find the best setting of $\theta$?
gradient descent, random search, least squares...
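As a concrete illustration of the last option, here is a minimal numpy sketch (the data-generating function and noise level are made up for illustration) that fits the cubic model above by least squares:

```python
import numpy as np

# A minimal sketch (made-up data): fit the cubic model
# f_theta(x) = theta_1*x + theta_2*x^2 + theta_3*x^3 by least squares.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y = 0.5 * x - 2.0 * x**3 + 0.1 * rng.normal(size=50)  # noisy ground truth

Phi = np.stack([x, x**2, x**3], axis=1)          # feature matrix
theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares solve
print(theta)  # close to [0.5, 0.0, -2.0]
```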
Unlike in supervised learning, where everything is clear and well defined, here it is not obvious what "learn representations from unlabeled data" means. One way to formulate unsupervised learning is as a problem of generative modeling (e.g. GANs, VAEs, PixelRNN, etc.)
Another type of unsupervised learning is self-supervised learning, for example language models (BERT, GPT-2, GPT-3, etc.)
Choose $f_{\theta}(s_t)=a_t$ to maximize $\sum_{t=1}^H r(s_t,a_t)$. This actually subsumes (generalizes) supervised learning.
Supervised learning: get $f_{\theta}(s_t)$ to match $y_i$.
Reinforcement learning: get $f_{\theta}(s_t)$ to maximize reward (which could be anything).
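To make the objective concrete, here is a toy sketch. Everything in it (the dynamics, the reward, the hand-written policy) is invented for illustration, not from the notes:

```python
# Made-up toy for the RL objective: a 1-D chain where the action is a
# step and the (invented) reward prefers states near the origin.
H = 20

def reward(s, a):
    return -abs(s)             # made-up r(s_t, a_t)

def policy(s):
    return -1 if s > 0 else 1  # a hand-written f_theta(s_t) = a_t

s, total = 5, 0.0
for t in range(H):
    a = policy(s)
    total += reward(s, a)      # accumulate sum_{t=1}^H r(s_t, a_t)
    s = s + a                  # made-up dynamics
print(total)
```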
Predict $y$ based on $x$.
Instead of predicting $y$ directly (learning $f_{\theta}(x) \approx y$), we predict $p(y\mid x)$.
$x$ random variable representing the input
$y$ random variable representing the output
$p(x,y)=p(x)p(y\mid x)$ chain rule
$p(y\mid x) = \frac{p(x,y)}{p(x)}$ definition of conditional probability
How about: $f_{dog}(x)=x^T\theta_{dog}, f_{cat}(x)=x^T\theta_{cat}, \vec{\theta}=\{\theta_{dog},\theta_{cat}\}$. This violates the rule that all the probabilities should be nonnegative and sum to 1, unless $\theta$ is carefully designed.
$$ p(y\mid x)=\text{softmax}(f_{dog}(x),f_{cat}(x)) $$
"softmax" could be any function (ideally one that is one-to-one and onto) that takes these inputs and outputs probabilities that are positive and sum to 1.
Why any function?
In machine learning, you produce the right answer by choosing the right parameters, not by choosing a special function. The function just needs to be general enough (i.e., it shouldn't lose too much information) so that there exist parameters ($\vec{\theta}$) that can represent the right answer.
That said though, let's choose a good function now.
How do we make a number $z$ positive?
$\exp(z)$ (convenient because it is one-to-one and onto) maps the entire real line to the entire set of positive reals.
How do we make a bunch of numbers sum to 1?
$\frac{z_j}{\sum_{i=1}^n z_i}$ (divide each one by the sum)
N possible labels
$p(y\mid x)$ - vector with N elements
$f_{\theta}(x)$ - vector-valued function with N outputs
$$ p(y=i\mid x) = \text{softmax}(f_{\theta}(x))[i]=\frac{\exp(f_{\theta,i}(x))}{\sum_{j=1}^N \exp(f_{\theta,j}(x))} $$
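A minimal numpy version of this softmax. Subtracting the max is a standard numerical-stability trick, not part of the formula above; it doesn't change the output, because adding a constant to every $f_{\theta,j}(x)$ cancels in the ratio:

```python
import numpy as np

# Minimal numpy softmax. Subtracting the max prevents overflow in exp
# without changing the result (softmax is shift-invariant).
def softmax(f):
    z = np.exp(f - np.max(f))
    return z / np.sum(z)

f = np.array([2.0, -1.0, 0.5])  # example logits f_theta(x), N = 3
p = softmax(f)
print(p, p.sum())               # positive entries that sum to 1
```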
A recipe for solving any problem ever:
$p(x)$: probability distribution over photos
"dog" $\sim p(y\mid x)$: conditional probability distribution over labels
result: $(x,y) \sim p(x,y)$
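A sketch of this generative story, with made-up stand-ins for $p(x)$ and $p(y\mid x)$: sample $x$, then sample $y$ given $x$, and the pair is a draw from $p(x,y)$:

```python
import numpy as np

# Sketch of the generative story: x ~ p(x), then y ~ p(y|x), so the
# pair (x, y) is a draw from p(x, y). Distributions are made up.
rng = np.random.default_rng(0)

x = rng.normal()                               # stand-in for "a photo"
logits = np.array([x, -x])                     # made-up f_theta(x)
p_y_given_x = np.exp(logits) / np.exp(logits).sum()
y = rng.choice(["dog", "cat"], p=p_y_given_x)  # "dog" ~ p(y|x)
print(x, y)                                    # (x, y) ~ p(x, y)
```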
$(x,y) \sim p(x,y)$
Training set: $D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$
What is $p(D)$?
Key assumption: the data are independent and identically distributed (i.i.d.): every $(x_i,y_i)$ is independent of every other $(x_j,y_j)$.
When i.i.d.: $p(D)=\prod_{i}p(x_i,y_i)=\prod_{i}p(x_i)p(y_i\mid x_i)$
We are learning $p_{\theta}(y\mid x)$; it is a model of the true $p(y\mid x)$.
A good model should make the data more probable.
idea: choose $\theta$ such that
$$ \max_{\theta} p(D)=\max_{\theta}\prod_{i}p(x_i)p_{\theta}(y_i\mid x_i) $$
This is numerically difficult because it multiplies together many numbers less than 1, which underflows floating-point arithmetic. Taking the log fixes this (the $\log p(x_i)$ terms don't depend on $\theta$, so they are constant):
$$ \log p(D)=\sum_{i}\log\big(p(x_i)p_{\theta}(y_i\mid x_i)\big)=\text{const}+\sum_{i}\log p_{\theta}(y_i\mid x_i) $$
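The numerical point is easy to demonstrate. With made-up per-example probabilities, the raw product underflows to zero while the sum of logs stays finite:

```python
import numpy as np

# A product of many probabilities underflows to 0.0 in floating point,
# while the sum of logs is fine. Probabilities here are made up.
rng = np.random.default_rng(0)
p = rng.uniform(0.3, 0.9, size=2000)  # stand-ins for p_theta(y_i | x_i)

print(np.prod(p))         # 0.0 -- underflow
print(np.sum(np.log(p)))  # finite log-likelihood
```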
Maximum Likelihood Estimation (MLE):
$$ \theta^*=\arg\max_{\theta}\sum_{i}\log p_{\theta}(y_i\mid x_i) \\ \theta^*=\arg\min_{\theta}-\sum_{i}\log p_{\theta}(y_i\mid x_i) $$
Negative log-likelihood (NLL) is our loss function.
In general: the loss function quantifies how bad $\theta$ is; we want the least bad (best) $\theta$.
Examples:
aside: cross-entropy
How similar are two distributions, $p_{\theta}$ and $p$?
$H(p,p_{\theta})=-\sum_y p(y\mid x_i)\log p_{\theta}(y\mid x_i)$
assume $y_i \sim p(y\mid x_i)$
$H(p,p_{\theta}) \approx -\log p_{\theta}(y_i\mid x_i)$
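One way to see the connection: if the "true" $p(y\mid x_i)$ is taken to be one-hot on the observed label $y_i$, the cross-entropy reduces exactly to the negative log-likelihood. A small sketch with made-up numbers:

```python
import numpy as np

# If the "true" p(y|x_i) puts all its mass on the observed label y_i,
# cross-entropy H(p, p_theta) equals -log p_theta(y_i|x_i) exactly.
p_theta = np.array([0.7, 0.2, 0.1])  # model's p_theta(y | x_i), made up
y_i = 0                              # observed label
p = np.zeros(3)
p[y_i] = 1.0                         # one-hot "true" distribution

H = -np.sum(p * np.log(p_theta))     # cross-entropy H(p, p_theta)
print(H, -np.log(p_theta[y_i]))      # identical values
```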
$\theta^* \leftarrow \arg\min_{\theta} L(\theta)$
Let's say $\theta$ is 2D.
An algorithm:
Which way does $L(\theta)$ decrease?
1D: negative slope = go to the right, positive slope = go to the left
In general: for each dimension, go in the direction opposite the slope along that dimension.
$v_1=-\frac{\partial L(\theta)}{\partial \theta_1},\ v_2=-\frac{\partial L(\theta)}{\partial \theta_2}$, etc.
This is not the only direction that decreases $L$; in fact, there is an entire half-space of such vectors. It is just the steepest one.
gradient:
$$ \nabla_{\theta}L(\theta)= \begin{bmatrix} \frac{\partial L(\theta)}{\partial \theta_1} \\ \frac{\partial L(\theta)}{\partial \theta_2} \\ \vdots \\ \frac{\partial L(\theta)}{\partial \theta_n} \end{bmatrix} $$
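The gradient can be sanity-checked numerically. Below is a finite-difference sketch on a made-up 2-D loss, which also verifies the claim that stepping along $-\nabla_{\theta}L$ decreases $L$:

```python
import numpy as np

# Finite-difference gradient on a made-up 2-D loss, plus a check that
# stepping along -gradient actually decreases L.
def L(theta):
    return (theta[0] - 1.0)**2 + 2.0 * (theta[1] + 0.5)**2  # made up

def numerical_grad(L, theta, eps=1e-6):
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        d = np.zeros_like(theta)
        d[i] = eps
        g[i] = (L(theta + d) - L(theta - d)) / (2 * eps)  # central diff
    return g

theta = np.array([2.0, 2.0])
g = numerical_grad(L, theta)
print(g)                               # close to [2, 10]
print(L(theta - 0.01 * g) < L(theta))  # True: -gradient is a descent direction
```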
$f_{\theta}(x)=\begin{bmatrix} x^T\theta_{y_1} \\ x^T\theta_{y_2} \\ \vdots \\ x^T\theta_{y_m}\end{bmatrix}=x^T\theta$
$p_{\theta}(y=i\mid x)=\text{softmax}(f_{\theta}(x))[i]=\frac{\exp(f_{\theta,i}(x))}{\sum_{j=1}^N \exp(f_{\theta,j}(x))}$
Gradient descent: $\theta \leftarrow \theta - \alpha\,\nabla_{\theta}L(\theta)$, repeated until convergence, with
$L(\theta)=-\sum_{i=1}^n \log p_{\theta}(y_i\mid x_i)$
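Putting the pieces together: a minimal gradient-descent loop for this linear softmax model on synthetic data. The analytic gradient $X^T(P - Y_{\text{onehot}})/n$ of the average NLL is the standard result for this model (it isn't derived in these notes); the data, learning rate, and step count are made up:

```python
import numpy as np

# Minimal gradient descent for the linear softmax model on synthetic
# data. X.T @ (P - Y) / n is the standard analytic gradient of the
# average NLL for this model.
rng = np.random.default_rng(0)
n, d, N = 200, 3, 2                                 # examples, features, labels
X = rng.normal(size=(n, d))
y = np.argmax(X @ rng.normal(size=(d, N)), axis=1)  # made-up labels
Y = np.eye(N)[y]                                    # one-hot labels

theta = np.zeros((d, N))
for step in range(500):
    F = X @ theta                                   # logits f_theta(x_i)
    F = F - F.max(axis=1, keepdims=True)            # stability trick
    P = np.exp(F) / np.exp(F).sum(axis=1, keepdims=True)
    nll = -np.log(P[np.arange(n), y]).mean()        # average NLL
    theta -= 0.5 * (X.T @ (P - Y) / n)              # gradient step
print(nll)                                          # decreases over training
```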
$P(y_1\mid x) = \frac{e^{\theta_{y_1}^T x}}{e^{\theta_{y_1}^T x}+e^{\theta_{y_2}^T x}}$ This is a bit redundant (overparameterized in $\theta$) because $P(y_1\mid x)+P(y_2\mid x)=1$.
$P(y_1\mid x) = \frac{e^{\theta_{y_1}^T x}}{e^{\theta_{y_1}^T x}+e^{\theta_{y_2}^T x}} = \frac{e^{\theta_{y_1}^T x} e^{-\theta_{y_1}^T x} }{e^{\theta_{y_1}^T x}e^{-\theta_{y_1}^T x}+e^{\theta_{y_2}^T x}e^{-\theta_{y_1}^T x}}=\frac{1}{1+e^{-\theta^T_+ x}}$
where $\theta_+=\theta_{y_1}-\theta_{y_2}$.
In general, the softmax is always overparameterized (it has one redundant parameter).
$\frac{1}{1+e^{-\theta^T_+ x}}$ is called the logistic function, also referred to as the sigmoid.
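A quick numerical check of the derivation, with made-up parameter vectors: the two-class softmax and the sigmoid of $\theta_+^T x$ agree.

```python
import numpy as np

# Check: two-class softmax equals sigmoid of theta_plus^T x,
# where theta_plus = theta_y1 - theta_y2. Parameters are made up.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta_y1 = np.array([1.0, -2.0])
theta_y2 = np.array([0.5, 0.3])
x = np.array([0.7, 1.1])

p_softmax = np.exp(theta_y1 @ x) / (np.exp(theta_y1 @ x) + np.exp(theta_y2 @ x))
p_sigmoid = sigmoid((theta_y1 - theta_y2) @ x)
print(p_softmax, p_sigmoid)  # equal
```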
zero-one loss: $\sum_i \delta(f_{\theta}(x_i) \neq y_i)$
Risk: the probability that you will get it wrong (for zero-one loss); in general, the expected value of the loss quantifies it.
$Risk=E_{x\sim p(x),y\sim p(y\mid x)}[L(x,y,\theta)]$
During training, we can't sample $x \sim p(x)$, we just have $D$.
$\text{Empirical risk}=\frac{1}{n}\sum_{i=1}^n L(x_i,y_i,\theta) \approx E_{x\sim p(x),y\sim p(y\mid x)}[L(x,y,\theta)]$
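A sketch of this approximation with a made-up 1-D squared-error setup: the empirical risk on a small training set is a noisy estimate of the expectation, which we stand in for here with a large fresh sample:

```python
import numpy as np

# Empirical risk as a sample average of the loss; the "true" risk is
# approximated with a large fresh sample. The setup is made up.
rng = np.random.default_rng(0)

def loss(x, y, theta):
    return (theta * x - y)**2  # squared-error loss for a 1-D model

theta = 1.5
x_train = rng.normal(size=20)
y_train = 2.0 * x_train                        # made-up true relationship
x_big = rng.normal(size=100_000)
y_big = 2.0 * x_big

print(np.mean(loss(x_train, y_train, theta)))  # empirical risk (n = 20)
print(np.mean(loss(x_big, y_big, theta)))      # ~ true risk E[L]
```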
Supervised learning is (usually) empirical risk minimization.
Is this the same as true risk minimization?
Overfitting: when the empirical risk is low, but the true risk is high
Underfitting: when the empirical risk is high, and the true risk is also high
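Both failure modes can be demonstrated in a few lines. In this made-up example, polynomials of degree 0, 3, and 15 are fit to 16 noisy points; comparing training error (empirical risk) against error on fresh data (a stand-in for true risk) shows underfitting, a reasonable fit, and overfitting, respectively:

```python
import numpy as np

# Fit polynomials of increasing degree to 16 noisy points and compare
# training error with error on fresh data. Setup is made up.
rng = np.random.default_rng(0)

def f_true(x):
    return np.sin(3 * x)

x_tr = rng.uniform(-1, 1, size=16)
y_tr = f_true(x_tr) + 0.1 * rng.normal(size=16)
x_te = rng.uniform(-1, 1, size=1000)
y_te = f_true(x_te) + 0.1 * rng.normal(size=1000)

for degree in [0, 3, 15]:
    coeffs = np.polyfit(x_tr, y_tr, degree)
    train_mse = np.mean((np.polyval(coeffs, x_tr) - y_tr)**2)
    test_mse = np.mean((np.polyval(coeffs, x_te) - y_te)**2)
    print(degree, train_mse, test_mse)
# degree 0: both errors high (underfitting); degree 15: train ~0, test
# high (overfitting); degree 3: reasonable on both.
```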