Chufan Chen's Homepage

Deep Reinforcement Learning and Control

2022-04-20T11:10:00

Given a model of the world

After this section, the reader should be able to solve the "MDP control" problem in both the infinite horizon and the finite horizon setting.

Markov Process or Markov Chain \((S,P)\)

Markov Reward Process is a Markov Chain + reward

$R$ is a reward function $R(s_t=s)=E[r_t\mid s_t=s]$

We assume stationary rewards: $$ r_i=r_j \quad \text{whenever } s_i=s_j, \quad \forall i,j =0,1,\dots $$ In the case of stochastic rewards, we require that the cumulative distribution functions (cdf) of the rewards conditioned on the current state be time independent: $$ F(r_i\mid s_i=s)=F(r_j\mid s_j=s), \quad \forall s \in S,\ \forall i,j=0,1,\dots $$ where $F(r_i\mid s_i=s)$ denotes the cdf of $r_i$ conditioned on the state $s_i=s$.

With the two equations above, we have the following result about the expected rewards: $$ R(s)=E[r_i\mid s_i=s],\quad \forall i=0,1,\dots $$

Horizon $H$

Markov Process

Two assumptions:

$P$ is a row-stochastic matrix. 1 is an eigenvalue of any row-stochastic matrix, and any eigenvalue of a row-stochastic matrix has absolute value at most 1.

The max-norm or infinity-norm of a vector $x \in R^n$ is denoted by $\Vert x\Vert_\infty$ and defined as $\Vert x\Vert_\infty=\max_i |x_i|$. For any matrix $A\in R^{m\times n}$, define the following quantity (called the induced infinity norm of a matrix): $$ \Vert \boldsymbol{A}\Vert_\infty=\sup_{x \in R^n, x \neq 0} \frac{\Vert \boldsymbol{A}x\Vert_\infty}{\Vert x\Vert_\infty} $$

  1. $\Vert \boldsymbol{A}\Vert_\infty$ satisfies all the properties of a norm
  2. $\Vert \boldsymbol{A}\Vert_\infty = \max_{i=1,...,m}(\sum_{j=1}^n|\boldsymbol{A}_{ij}|)$
  3. Conclude that if $A$ is row-stochastic, then $\Vert \boldsymbol{A}\Vert_\infty=1$ (checked numerically below)
  4. For every $x\in R^n$, $\Vert \boldsymbol{A}x\Vert_\infty \leq \Vert \boldsymbol{A}\Vert_\infty\Vert x\Vert_\infty$
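A quick numerical check of properties 2 and 3 (a minimal sketch using numpy, not part of the original notes):

```python
import numpy as np

# A small row-stochastic matrix: each row is a probability distribution.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

# Property 2: the induced infinity norm is the maximum absolute row sum.
inf_norm = np.max(np.sum(np.abs(P), axis=1))
print(inf_norm)  # 1.0, as claimed in property 3 for row-stochastic matrices
```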

Returns $G_t$

General form: $G_t=r_t+\gamma r_{t+1}+\gamma^2r_{t+2}+\dots+\gamma^{H-1}r_{t+H-1}$

$G_t=R_{t+1}+R_{t+2}+\dots + R_{T}$ for episodic tasks, where $T$ is a final step at which a terminal state is reached, ending an episode.

$G_t=R_{t+1}+\gamma R_{t+2}+\dots =\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$ for continuing tasks with temporal discounting, where $\gamma$ is the discount factor.

Discount factor

Consider a finite horizon Markov reward process with bounded rewards. Specifically, assume that $\exists M \in (0,\infty)$ such that $\mid r_i\mid \leq M\ \forall i$ and across all episodes (realizations). The return $G_t$ of any episode is bounded: $C(M,\gamma,t,H)=\frac{M(1-\gamma^{H-t})}{1-\gamma}$ is a bound, i.e. $\mid G_t\mid \leq C$ for any episode. (Hint: geometric progression.)
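A short derivation of the bound via the geometric series, writing the return from time $t$ up to the horizon as $G_t=\sum_{k=0}^{H-t-1}\gamma^k r_{t+k}$ so that it matches the constant above: $$ |G_t|\le \sum_{k=0}^{H-t-1}\gamma^k |r_{t+k}| \le M\sum_{k=0}^{H-t-1}\gamma^k = \frac{M(1-\gamma^{H-t})}{1-\gamma}=C(M,\gamma,t,H) $$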

Now consider an infinite horizon Markov reward process with bounded rewards and $\gamma <1$. The return $G_t$ of any episode is still bounded: consider the partial sums $S_N=\sum_{i=t}^N \gamma^{i-t}r_i$ for $N\geq t$; the sequence $\{S_N\}_{N\geq t}$ is a Cauchy sequence and hence converges.

Value Functions are Expected Return

$V_{t}(s)=E[G_t\mid S_t=s]$ is called state-value function.

When the horizon $H$ is infinite, the stationarity assumptions on the rewards and transition probabilities imply that $V_i(s)=V_j(s)$ for all $i,j=0,1,\dots$, and thus in this case we define: $$ V(s)=V_0(s) $$

Computing the value function of a Markov Reward Process

Monte Carlo simulation

procedure Monte Carlo Evaluation(M,s,t,N)
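The procedure body is not expanded in these notes. A minimal sketch in numpy, assuming the MRP $M$ is given explicitly by $(P,R,\gamma)$ and rollouts are truncated at a horizon $H$ (the interface is my own):

```python
import numpy as np

def monte_carlo_evaluation(P, R, gamma, s, H, N, rng=None):
    """Estimate V(s) of a finite MRP by averaging returns of N rollouts of
    horizon H starting from state s. P is the |S|x|S| row-stochastic
    transition matrix and R the reward vector (hypothetical inputs)."""
    rng = np.random.default_rng() if rng is None else rng
    n_states = P.shape[0]
    total = 0.0
    for _ in range(N):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(H):
            ret += discount * R[state]                 # reward of the current state
            discount *= gamma
            state = rng.choice(n_states, p=P[state])   # sample the next state
        total += ret
    return total / N
```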

Analytic solution

This method works only for an infinite horizon MRP with $\gamma < 1$.

The MRP value function satisfies $V(s)=R(s)+\gamma \sum_{s' \in S}P(s'\mid s)V(s')$.

For finite state MRP($|S|<\infty$), we can express $V(s)$ using a matrix equation $$ \begin{bmatrix}V(s_1)\\ \vdots \\ V(s_N)\end{bmatrix}=\begin{bmatrix}R(s_1)\\ \vdots \\ R(s_N)\end{bmatrix}+\gamma \begin{bmatrix}P(s_1\mid s_1) & \dots & P(s_N\mid s_1)\\ P(s_1\mid s_2) & \dots & P(s_N\mid s_2)\\ \vdots & \ddots & \vdots \\ P(s_1\mid s_N) & \dots & P(s_N\mid s_N)\end{bmatrix}\begin{bmatrix}V(s_1)\\ \vdots \\ V(s_N)\end{bmatrix} $$ \begin{equation} V=R+\gamma PV \end{equation}

\begin{equation} V-\gamma PV=R\newline (I-\gamma P) V=R\newline V=(I-\gamma P)^{-1}R \end{equation}

Solving directly requires taking a matrix inverse, $\sim O(N^3)$. Note that $(I-\gamma P)$ is invertible for $\gamma < 1$: every eigenvalue of $P$ has absolute value at most 1 (see the appendix), so every eigenvalue of $I-\gamma P$ is nonzero.
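A minimal sketch of the analytic solution (the function name and interface are my own; it assumes $P$ is the $N\times N$ transition matrix and $R$ the reward vector):

```python
import numpy as np

def mrp_value_analytic(P, R, gamma):
    """Exact value of a finite MRP with gamma < 1: solve (I - gamma P) V = R.
    Solving the linear system avoids forming the inverse explicitly, though
    the asymptotic cost is still on the order of N^3."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)
```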

Iterative Algorithm for Computing the Value of an MRP

Computational complexity: $O(|S|^2)$ per iteration ($|S|=N$)
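A sketch of the corresponding iterative computation, under the same assumptions as the analytic sketch above:

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Repeatedly apply V <- R + gamma * P V until the change is below tol.
    Each sweep is one matrix-vector product, i.e. O(|S|^2)."""
    V = np.zeros_like(R, dtype=float)
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```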

Solving MDPs

Markov Decision Process is Markov Reward Process + actions.

MDP is a tuple: $(S,A,P,R,\gamma)$

Dynamic programming assumes full knowledge of the MDP. It is used for planning in an MDP.

We can recover an optimal policy by acting greedily with respect to the optimal value functions:

$$ \pi_{*}(a|s)= \begin{cases} 1,& \text{if } a=\arg\max_{a\in A}q_{*}(s,a)\\ 0,& \text{otherwise} \end{cases} $$

If we know $q_*(s,a)$, we do not need the dynamics to find the optimal policy.

$$ \pi_{*}(a|s)= \begin{cases} 1,& \text{if } a=\arg\max_{a\in A}\big(\sum_{s',r}p(s',r|s,a)(r+\gamma v_{*}(s'))\big)\\ 0,& \text{otherwise} \end{cases} $$

If we know $v_*(s)$, we need the dynamics to do one step lookahead.

Solving Linear Equations

$$ G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots \\ =R_{t+1}+\gamma (R_{t+2}+\gamma R_{t+3} + \gamma^2 R_{t+4}+\dots) \\ =R_{t+1}+\gamma G_{t+1} $$

By conditioning on a state and taking expectations: $$ v_{\pi}(s)=E_\pi[R_{t+1}+\gamma v_{\pi}(S_{t+1})\mid S_t=s]\\ =\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')] $$

By conditioning on a state and an action and taking expectations: $$ q_{\pi}(s, a)=E[R_{t+1}+\gamma q_{\pi}(S_{t+1}, A_{t+1})\mid S_t=s, A_t=a]\\ =\sum_{s',r}p(s',r|s,a)[r+\gamma \sum_{a'}\pi(a'|s') q_{\pi}(s',a')] $$

Bellman Expectation Equations

\begin{equation} v_{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')] \end{equation} is a set of linear equations, one for each state. The state value function for $\pi$ is its unique solution.

\begin{equation} q_{\pi}(s, a)=\sum_{s',r}p(s',r|s,a)[r+\gamma \sum_{a'}\pi(a'|s') q_{\pi}(s',a')] \end{equation} is a set of linear equations, one for each state-action pair. The state-action value function for $\pi$ is its unique solution.

Bellman optimality Equations

For the state-value function: $$ v_{*}(s)=\max_{a\in A}\Big(\sum_{s',r}p(s',r \mid s,a)(r+\gamma v_{*}(s'))\Big) $$

$v^*$ is the unique solution of this system of nonlinear equations

$$ q_{*}(s, a)=E[R_{t+1}+\gamma \max_{a' \in A}q_{*}(S_{t+1},a')\mid S_t=s, A_t=a]\\ = \sum_{s',r}p(s',r|s,a)[r+\gamma \max_{a'}q_{*}(s',a')] $$

$q^*$ is the unique solution of this system of nonlinear equations

$$ v_{\pi}(s)=\sum_{a\in A}\pi(a\mid s)q_{\pi}(s,a) \\ v_{*}(s)=\max_a q_{*}(s,a) \\ q_{*}(s,a)=\sum_{s',r}p(s',r|s,a)[r+\gamma v_{*}(s')] $$

MDP + π(a|s) = Markov Reward Process

An MDP together with a fixed policy $\pi$ is an MRP $(S,P^\pi,R^\pi,\gamma)$ where

$$ R^\pi(s)=\sum_{a\in A}\pi(a\mid s)R(s,a) $$ $$ P^\pi(s'\mid s)=\sum_{a\in A}\pi(a\mid s)P(s'\mid s,a) $$

This implies we can use the same techniques to evaluate the value of a policy for an MDP as we use to compute the value of an MRP, by defining an MRP with $R^\pi$ and $P^\pi$.
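A sketch of this reduction, assuming the MDP is stored as arrays $P[a,s,s']$ and $R[s,a]$ and the policy as a matrix $\pi[s,a]$ (these shapes are my own convention, not from the notes):

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Collapse an MDP and a stochastic policy into the induced MRP,
    assuming P has shape (|A|, |S|, |S|), R has shape (|S|, |A|) and
    pi has shape (|S|, |A|) with rows summing to 1."""
    R_pi = np.sum(pi * R, axis=1)            # R^pi(s) = sum_a pi(a|s) R(s, a)
    P_pi = np.einsum('sa,asn->sn', pi, P)    # P^pi(s'|s) = sum_a pi(a|s) P(s'|s, a)
    return P_pi, R_pi

# The induced MRP can then be fed to the MRP sketches above, e.g.
# V_pi = mrp_value_analytic(P_pi, R_pi, gamma)
```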

MDP Policy Evaluation, Iterative Algorithm

This is a Bellman backup for a particular policy: $$ V_k^{\pi}(s)=\sum_{a\in A}\pi(a\mid s)\Big[r(s,a)+\gamma \sum_{s' \in S}p(s'\mid s,a)V_{k-1}^\pi(s')\Big] $$

If the policy is deterministic then the above update simplifies to
$$ V_k^{\pi}(s)=r(s,\pi(s))+\gamma \sum_{s' \in S}p(s'\mid s,\pi(s))V_{k-1}^\pi(s') $$
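A sketch of the resulting iterative policy evaluation for a deterministic policy, under the same array conventions as the reduction sketch above:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative evaluation of a deterministic policy pi (an array mapping
    each state to an action), assuming P[a, s, s'] and R[s, a] as above."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        V_new = np.array([R[s, pi[s]] + gamma * P[pi[s], s] @ V
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```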

MDP Control

Compute the optimal policy $$ \pi^*(s)=arg\max_{\pi}V^{\pi}(s) $$

There exists a unique optimal value function.

The optimal policy for an MDP in an infinite horizon problem (the agent acts forever) is deterministic, stationary (it does not depend on the time step), and not necessarily unique (there can be ties in the state-action values).

Number of deterministic policies is $|A|^{|S|}$. Policy iteration is generally more efficient than enumeration.

MDP Policy Iteration(PI)

Policy Improvement

We first get $V^{\pi_i}$ by policy evaluation, compute the state-action value $Q^{\pi_i}(s,a)=R(s,a)+\gamma\sum_{s'\in S}p(s'\mid s,a)V^{\pi_i}(s')$, and set $\pi_{i+1}(s)=\arg\max_a Q^{\pi_i}(s,a)$.

By definition of the maximum, $\max_{a}Q^{\pi_i}(s,a) \geq Q^{\pi_i}(s,\pi_i(s))$, hence

$$ \max_a Q^{\pi_i}(s,a) \geq R(s,\pi_i(s))+\gamma \sum_{s'\in S}p(s'\mid s,\pi_i(s))V^{\pi_i}(s')=V^{\pi_i}(s) $$

The subtle point: this says that if we take $\pi_{i+1}(s)$ for one action and then follow $\pi_i$ forever, our expected sum of rewards is at least as good as if we had always followed $\pi_i$. That is weaker than what we need, since the new proposal is to follow $\pi_{i+1}$ at every step; the monotonic improvement result below closes this gap.

Monotonic Improvement in Policy

Definition:

$$V^{\pi_1} \ge V^{\pi_2}:V^{\pi_1}(s) \ge V^{\pi_2}(s), \forall s \in S$$

Proposition: $V^{\pi_{i+1}} \ge V^{\pi_i}$, with strict inequality if $\pi_i$ is suboptimal, where $\pi_{i+1}$ is the new policy we get from policy improvement on $\pi_i$.

Proof:

\begin{equation} V^{\pi_i}(s) \le \max_a Q^{\pi_i}(s,a)\newline =\max_a \Big[R(s,a)+\gamma \sum_{s'\in S}p(s'\mid s,a)V^{\pi_i}(s')\Big]\newline =R(s,\pi_{i+1}(s))+\gamma \sum_{s'\in S}p(s'\mid s,\pi_{i+1}(s))V^{\pi_i}(s') \text{ by definition of } \pi_{i+1}\newline \le R(s,\pi_{i+1}(s))+\gamma \sum_{s'\in S}p(s'\mid s,\pi_{i+1}(s))\max_{a'} Q^{\pi_i}(s',a') \text{ since } V^{\pi_i}(s')\le \max_{a'}Q^{\pi_i}(s',a')\newline = R(s,\pi_{i+1}(s))+\gamma \sum_{s'\in S}p(s'\mid s,\pi_{i+1}(s))\Big(R(s',\pi_{i+1}(s'))+\gamma \sum_{s''\in S}p(s''\mid s',\pi_{i+1}(s'))V^{\pi_i}(s'')\Big)\newline \dots\newline = V^{\pi_{i+1}}(s) \end{equation}

If the policy does not change, it can never change again.

There is a maximum number of iterations of policy iteration: at most $|A|^{|S|}$, the number of deterministic policies.

Bellman Backup Operators

Value Iteration (VI)

Policy Iteration as Bellman Operations

Bellman backup operator $B^\pi$ for a particular policy is defined as $$ B^{\pi} V(s)=R^{\pi} (s)+\gamma \sum_{s' \in S}P^\pi(s'\mid s)V(s') $$

Policy evaluation amounts to computing the fixed point of $B^{\pi}$

To do policy evaluation, repeatedly apply the operator until $V$ stops changing: $V^\pi = B^\pi B^\pi \dots B^\pi V$

To do policy improvement $$ \pi_{k+1}(s)= arg\max_a[R(s,a)+\gamma \sum_{s'\in S}P(s'\mid s,a)V^{\pi_k}(s')] $$
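A sketch of this improvement step, with the same array conventions as before:

```python
import numpy as np

def policy_improvement(P, R, V, gamma):
    """Greedy improvement step, assuming P[a, s, s'] and R[s, a]:
    pi_{k+1}(s) = argmax_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]."""
    Q = R + gamma * np.einsum('asn,n->sa', P, V)
    return Q.argmax(axis=1)

# Policy iteration alternates policy_evaluation (see the sketch above) and
# policy_improvement until the policy stops changing.
```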

Contraction Operator

Let $O$ be an operator and $\Vert x\Vert$ denote any norm of $x$.

If $\Vert OV-OV'\Vert \le \alpha \Vert V-V'\Vert$ for some $\alpha < 1$ and all $V,V'$, then $O$ is a contraction operator.

Will Value Iteration Converge?

Yes, if the discount factor $\gamma < 1$, or if the process ends up in a terminal state with probability 1.

Bellman backup is a contraction if discount factor, $\gamma < 1$.

If we apply it to two different value functions, the distance between them shrinks after applying the Bellman backup to each.

Proof: Bellman Backup is a Contraction on $V$ for $\gamma < 1$

Let $\vert V-V'\vert=\max_s |V(s)-V'(s)|$ be the infinity norm

$$ \vert BV_k-BV_j\vert \le \max_{s,a}\ \gamma \sum_{s'\in S}P(s'\mid s, a)\,|V_k(s')-V_j(s')| \le \gamma\,\vert V_k-V_j\vert \max_{s,a}\sum_{s'\in S}P(s'\mid s,a) =\gamma\,\vert V_k-V_j\vert $$

Even if all inequalities are equalities, this is still a contraction if $\gamma < 1$

MDP: Computing Optimal Policy and Optimal Value

Policy iteration computes infinite horizon value of a policy and then improves that policy.

Value iteration maintains optimal value of starting in a state s if have a finite number of steps k left in the episode. Iterate to consider longer and longer episodes.
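A sketch of value iteration under the same array conventions as the earlier sketches (the stopping rule and tolerance are my own choices):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Value iteration, assuming P[a, s, s'] and R[s, a]. Each sweep applies
    the Bellman optimality backup B V = max_a [ R + gamma * P V ]."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum('asn,n->sa', P, V)   # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)             # value and greedy policy
        V = V_new
```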

Consider an infinite horizon discounted MDP. A policy $\pi$ in this MDP induces a value function $V^{\pi}$ (let's refer to this as $V^\pi_{old}$). Now suppose we have a new MDP where the only difference is that all rewards have a constant $c$ added to them. Prove that $V_{new}^\pi(s)=V_{old}^\pi(s)+\frac{c}{1-\gamma}$.

\begin{equation} V^\pi_{old}(s)=E_\pi \Big[\sum_{T=0}^{\infty}\gamma^T r_{t+T}\mid s_t=s\Big]\newline V^\pi_{new}(s)=E_\pi \Big[\sum_{T=0}^{\infty}\gamma^T(r_{t+T}+c)\mid s_t=s\Big]\newline =E_\pi \Big[\sum_{T=0}^{\infty}\gamma^T r_{t+T}\mid s_t=s\Big]+c\sum_{T=0}^{\infty}\gamma^T\newline =V_{old}^\pi(s)+\frac{c}{1-\gamma} \end{equation}

Summary

Given the dynamics/transition model $P$ and the reward model $R$, we can do dynamic programming to evaluate how good a policy is.

$V_k^\pi(s)$ is the exact $k$-horizon value of state $s$ under policy $\pi$.

$V_k^\pi(s)$ is also an estimate of the infinite horizon value of state $s$ under policy $\pi$.

Dynamic programming computes $V_k^\pi(s)=r(s,\pi(s))+\gamma \sum_{s'} P(s'\mid s,\pi(s))V_{k-1}^\pi(s')$, bootstrapping the rest of the expected return with the value estimate $V_{k-1}$.

If we know the model $P(s' \mid s,a)$, we can compute the immediate reward and the expectation over next states exactly, and then bootstrap using the current estimate $V_{k-1}$.

Dynamic Programming:

Model-Free Policy Evaluation: Policy Evaluation Without Knowing How the World Works

Monte Carlo(MC) Policy Evaluation

In general, we get the Monte Carlo estimate of some quantity by observing many iterations of how that quantity is generated either in real life or via simulation and then averaging over the observed quantities. By the law of large numbers, this average converges to the expectation of the quantity.

  1. Execute a rollout of policy $\pi$ until termination many times
  2. Record the returns $G_t$ that we observe when starting at state s
  3. Take an average of the values we get for $G_t$ to estimate $V^\pi(s)$.

First-Visit Monte Carlo (MC) On Policy Evaluation

Initialize $N(s)=0, G(s)=0\ \forall s\in S$, where $N(s)$ counts the number of (first) visits to state $s$ and $G(s)$ accumulates the corresponding returns.

Loop
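The loop body is not written out in these notes. A minimal sketch, assuming episodes are provided as lists of (state, reward) pairs collected by running $\pi$ to termination:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """First-visit MC policy evaluation. `episodes` is assumed to be a list of
    trajectories generated by pi, each a list of (state, reward) pairs, with
    the convention G_t = r_t + gamma * r_{t+1} + ... used in these notes."""
    N = defaultdict(int)      # N(s): number of first visits to s
    G = defaultdict(float)    # G(s): total return accumulated at first visits
    for episode in episodes:
        # Compute every return G_t for this episode, working backwards.
        G_t, ret = [0.0] * len(episode), 0.0
        for t in reversed(range(len(episode))):
            ret = episode[t][1] + gamma * ret
            G_t[t] = ret
        # Credit only the first visit to each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                N[s] += 1
                G[s] += G_t[t]
    return {s: G[s] / N[s] for s in N}   # estimate of V^pi(s)
```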

Every-Visit Monte Carlo (MC) On Policy Evaluation

Initialize $N(s)=0,G(s)=0\forall s \in S$

Loop
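The every-visit variant differs only in that every occurrence of a state contributes its return; a sketch under the same assumptions as the first-visit version:

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma):
    """Every-visit MC policy evaluation: identical to the first-visit sketch
    above, except that every occurrence of a state contributes a return."""
    N, G = defaultdict(int), defaultdict(float)
    for episode in episodes:
        G_t, ret = [0.0] * len(episode), 0.0
        for t in reversed(range(len(episode))):
            ret = episode[t][1] + gamma * ret
            G_t[t] = ret
        for t, (s, _) in enumerate(episode):
            N[s] += 1
            G[s] += G_t[t]
    return {s: G[s] / N[s] for s in N}
```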

Evaluation of the Quality of a Policy Estimation Approach: Bias, Variance and MSE

Consider a statistical model that is parameterized by θ and that determines a probability distribution over observed data P(x|θ)

Consider a statistic $\hat{\theta}$ that provides an estimate of θ and is a function of observed data x

Definition: the bias of an estimator $\hat{\theta}$ is: $$ Bias_{\theta}(\hat{\theta})=E_{x\mid \theta}[\hat{\theta}(x)]-\theta $$

Definition: the variance of an estimator $\hat{\theta}$ is: $$ Var(\hat{\theta})=E_{x\mid \theta}[(\hat{\theta}(x)-E[\hat{\theta}(x)])^2] $$

Definition: the mean squared error (MSE) of an estimator $\hat{\theta}$ is $MSE(\hat{\theta})=E_{x\mid\theta}[(\hat{\theta}(x)-\theta)^2]$, which decomposes as: $$ MSE(\hat{\theta})=Var(\hat{\theta})+Bias_{\theta}(\hat{\theta})^2 $$

Let n be the number of data points x used to estimate the parameter θ and call the resulting estimate of θ using that data $\hat{\theta}_n$

Then the estimator $\hat{\theta}_n$ is consistent if, for all $\epsilon > 0$

$$ \lim_{n \rightarrow \infty} Pr(|\hat{\theta}_n-\theta|>\epsilon)=0 $$

First-Visit Monte Carlo (MC) On Policy Evaluation

$V^\pi$ estimator is unbiased and consistent

Every-Visit Monte Carlo (MC) On Policy Evaluation

$V^\pi$ estimator is biased and consistent, but has better MSE

Temporal Difference Learning

Temporal Difference [TD(0)] Learning

$$ V^\pi(s_t) \leftarrow V^\pi(s_t)+\alpha\,[r_t+\gamma V^\pi(s_{t+1})-V^\pi(s_t)] $$

TD error: $\delta_t=r_t+\gamma V^\pi(s_{t+1})-V^\pi(s_t)$

Algorithm

Input: $\alpha$

Initialize $V^\pi(s) = 0, \forall s \in S$

Loop
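The loop body is again not written out. A minimal sketch of the TD(0) backup applied to a stream of observed transitions (the interface is hypothetical):

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) backup: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    delta = r + gamma * V[s_next] - V[s]   # TD error delta_t
    V[s] += alpha * delta
    return V

# Usage sketch over a hypothetical stream of (s, r, s') transitions from pi:
# V = defaultdict(float)
# for s, r, s_next in transitions:
#     td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99)
```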

Model Free Control

Maximization Bias

Value Function Approximation

So far we have represented the value function by a lookup table where each state has a corresponding entry, $V(s)$, or each state-action pair has an entry, $Q(s, a)$. However, this approach may not generalize well to problems with very large state and/or action spaces, or we may prefer to learn approximations quickly rather than wait for converged values of each state. A popular approach to address this is function approximation:

$$ v_\pi(s)\approxeq \hat{v}(s,\boldsymbol{w}) \text{ or } q_\pi(s,a)\approxeq \hat{q}(s,a,\boldsymbol{w}) $$

Here $\boldsymbol{w}$ is usually referred to as the parameters or weights of our function approximator. Popular choices of function approximator include linear combinations of features and neural networks, both of which are covered below.

Linear Feature Representations

In linear function representations, we use a feature vector to represent a state: $$ x(s)=[x_1(s),x_2(s),\ldots,x_n(s)] $$

We then approximate our value functions using a linear combination of features: $$ \hat{v}(s,\boldsymbol{w})=x(s)^T\boldsymbol{w}=\sum_{j=1}^n w_j x_j(s) $$

We define the (quadratic) objective (also known as the loss) function to be: $$ J(\boldsymbol{w})=E_\pi[(v_\pi(s)-\hat{v}(s,\boldsymbol{w}))^2] $$

Gradient descent

A common technique to minimize the above objective function is called Gradient Descent.

Mathematically, this can be summarized as:

compute the gradient: $$ \nabla_\boldsymbol{w}J(\boldsymbol{w})=[\frac{\partial J(\boldsymbol{w})}{\partial w_1},\frac{\partial J(\boldsymbol{w})}{\partial w_2}, \ldots, \frac{\partial J(\boldsymbol{w})}{\partial w_n}] $$

compute an update step using gradient descent: $$ \Delta \boldsymbol{w} = -\frac{1}{2} \alpha \nabla_\boldsymbol{w}J(\boldsymbol{w}) $$

take a step towards the local minimum $$ \boldsymbol{w} = \boldsymbol{w} + \Delta \boldsymbol{w} $$

Stochastic gradient descent

In practice, Gradient Descent is not considered a sample efficient optimizer and stochastic gradient descent (SGD) is used more often. Although the original SGD algorithm referred to updating the weights using a single sample, due to the convenience of vectorization, people often refer to gradient descent on a minibatch of samples as SGD as well. In (minibatch) SGD, we sample a minibatch of past experiences, compute our objective function on that minibatch, and update our parameters using gradient descent on the minibatch.
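A sketch of a single (semi-)gradient step for the linear VFA of the previous section; the target stands in for $v_\pi(s)$, e.g. the return $G_t$ in Monte Carlo or $r+\gamma \hat{v}(s',\boldsymbol{w})$ in TD(0), as described in the next two subsections:

```python
import numpy as np

def linear_vfa_update(w, x_s, target, alpha):
    """One SGD step on the quadratic objective for a linear VFA.
    For v_hat(s, w) = x(s)^T w, the sample gradient of the squared error is
    -(target - v_hat) * x(s), so the update is
        w <- w + alpha * (target - x(s)^T w) * x(s).
    In TD(0) the target r + gamma * x(s')^T w is treated as a constant,
    i.e. this is a semi-gradient step."""
    return w + alpha * (target - x_s @ w) * x_s
```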

Monte Carlo with linear VFA

Temporal Difference (TD(0)) with linear VFA

Convergence Guarantees for Linear VFA for Policy Evaluation

| Algorithm | Tabular | Linear VFA | Nonlinear VFA |
| --- | --- | --- | --- |
| Monte-Carlo Control | Yes | (Yes) | No |
| SARSA | Yes | (Yes) | No |
| Q-learning | Yes | No | No |

(Yes) means the result chatters around the near-optimal value function.

Control using VFA

Neural Networks

Although linear VFAs often work well given the right set of features, it can be difficult to handcraft such a feature set. Neural networks provide a much richer function approximation class that is able to work directly from states without requiring an explicit specification of features.

Value-Based Deep Reinforcement Learning

All three neural architectures are able to learn successful policies directly from high-dimensional inputs, e.g. preprocessed pixels from video games, by using end-to-end reinforcement learning, and they all achieved a level of performance comparable to a professional human games tester across a set of 49 games on the Atari 2600.

Convolutional Neural Networks (CNNs) are used in these architectures for feature extraction from pixel inputs. Understanding the mechanisms behind feature extraction via CNNs can help better understand how DQN works.

Generalization: Deep Q-Network (DQN)

The performance of linear function approximators depends heavily on the quality of the features. In general, handcrafting an appropriate set of features can be difficult and time-consuming. To scale up to making decisions in really large domains (e.g. huge state spaces) and enable automatic feature extraction, deep neural networks (DNNs) are used as function approximators.

DQN Architecture

Preprocessing Raw Pixels

The raw Atari 2600 frames are of size (210 × 160 × 3), where the last dimension corresponds to the RGB channels. The preprocessing step adopted in Mnih et al. (2015) aims at reducing the input dimensionality and dealing with some artifacts of the game emulator. We summarize the preprocessing as follows:

The above preprocessing is applied to the 4 most recent raw RGB frames, and the encoded frames are stacked together to produce the input (of shape (84 × 84 × 4)) to the Q-network. Stacking the recent frames into the game state is also a way to transform the game environment into an (almost) Markovian world.

Training Algorithm for DQN
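The training algorithm itself is not reproduced here. As a minimal sketch of its core regression target (names and interface are hypothetical), assuming a replay buffer of transitions and a periodically frozen target network:

```python
import numpy as np

def dqn_targets(q_target, batch, gamma):
    """DQN regression targets, assuming `q_target(s)` returns the Q-values of
    all actions under the frozen target network and `batch` is a list of
    (s, a, r, s_next, done) transitions sampled from a replay buffer."""
    ys = []
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * np.max(q_target(s_next))
        ys.append(y)   # the online network is then regressed toward y at (s, a)
    return np.array(ys)
```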

Reducing Bias: Double Deep Q-Network (DDQN)

Decoupling Value and Advantage: Dueling Network

Stochastic policies

Policy optimization

Policy objective functions

Once we have defined a policy $\pi_\theta(a|s)$, we need to be able to measure how it is performing in order to optimize it. In an episodic environment, a natural measure is the start value of the policy, which is the expected value of the start state:

$$ J_1(\theta)=V^{\pi_\theta}(s_1)=E_{\pi_\theta}[v_1] $$

In continuing environments we can use the average value of the policy, where $d^{\pi_\theta}(s)$ is the stationary distribution of $\pi_\theta$:

$$ J_{avV}(\theta)=\sum_s d^{\pi_\theta}(s)V^{\pi_\theta}(s) $$

or alternatively we can use the average reward per time-step: $$ J_{avR}(\theta)=\sum_s d^{\pi_\theta}(s)\sum_a \pi_\theta(a\mid s)R(s,a) $$

Gradient-free optimization methods

With an objective function, we can treat our policy-based reinforcement learning problem as an optimization problem. We focus on gradient-based methods, because recently they have been the most common choice for policy-based RL. However, it is worth considering gradient-free optimization methods as well.

These methods have the advantage over gradient-based methods in that they do not have to compute a gradient of the objective function. This allows the policy parameterization to be non-differentiable, and these methods are also often easy to parallelize. Gradient-free methods are often a useful baseline to try, and sometimes they can work embarrassingly well.

However, these methods are usually not very sample efficient because they ignore the temporal structure of the rewards: updates only take into account the total reward over the entire episode and do not break it up into the rewards obtained at each state in the trajectory.

Policy gradient

Let us define $V^{\pi_\theta}=V(s_0, \theta)$ to be the objective function we wish to maximize over $\theta$. Policy gradient methods search for a local maximum of $V(\theta)$ by ascending the gradient of the policy objective w.r.t. the parameters $\theta$:

$$ \Delta \theta = \alpha \nabla_\theta V(s_0, \theta) $$

where $\alpha$ is a step-size parameter and $\nabla_\theta V(\theta)$ is the policy gradient.

$$ \nabla_\theta V(\theta)=\begin{bmatrix}\frac{\partial V(\theta)}{\partial \theta_1}\\ \vdots \\ \frac{\partial V(\theta)}{\partial \theta_n}\end{bmatrix} $$

Compute Gradients by Finite Differences

For each dimension $k \in [1,n]$

$$ \frac{\partial V(s_0, \theta)}{\partial \theta_k} \approxeq \frac{V(s_0,\theta+\epsilon u_k)-V(s_0,\theta)}{\epsilon} $$ where $u_k$ is a unit vector with 1 in the $k$th component and 0 elsewhere.

Finite difference policy gradient:
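A minimal sketch of this estimator, assuming $\theta$ is a 1-D parameter vector and `evaluate_policy(theta)` returns a rollout-based estimate of $V(s_0,\theta)$ (both are my own assumptions):

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate the policy gradient by perturbing each parameter in turn."""
    grad = np.zeros_like(theta, dtype=float)
    base = evaluate_policy(theta)
    for k in range(theta.size):
        u_k = np.zeros_like(theta, dtype=float)
        u_k[k] = 1.0                              # unit vector along dimension k
        grad[k] = (evaluate_policy(theta + eps * u_k) - base) / eps
    return grad
```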

Computing the gradient analytically

Appendix

Stochastic Matrices

A stochastic matrix is a square matrix of non-negative real numbers that lists the transition probabilities of a finite Markov chain. It is also called a probability matrix, probability transition matrix, transition matrix, substitution matrix, or Markov matrix. These matrices come in two forms, depending on whether the rows or the columns sum to 1.

In the context of a Markov process, the transition probability matrix $P$ has size $\mid S\mid\times \mid S\mid$, with $(i,j)$ entry given by $P_{ij}=P(j\mid i)$, where $i,j$ refer to the states of $S$ ordered arbitrarily. The matrix $P$ is a non-negative row-stochastic matrix.

Let A be a column-stochastic matrix (each column sums to 1); the proof below is written for this case, and the row-stochastic case follows by applying the result to $A^T$, which has the same eigenvalues. Then:

  1. 1 is an eigenvalue of A.
  2. If $\lambda$ is a (real or complex) eigenvalue of A, then $\mid \lambda \mid \leq 1$.

Proof:

$$ A^T \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1\end{bmatrix}=\begin{bmatrix}1 \\ 1 \\ \vdots \\ 1\end{bmatrix} $$

Therefore, 1 is an eigenvalue of $A^T$. But A and $A^T$ have the same characteristic polynomial:

$$ \det(A-\lambda I_n)=\det((A-\lambda I_n)^T)=\det(A^T-\lambda I_n) $$

Therefore, 1 is an eigenvalue of A.

Now let $\lambda$ be any eigenvalue of A, so it is also an eigenvalue of $A^T$. Let $x=(x_1,x_2,...,x_n)$ be an eigenvector of $A^T$ with eigenvalue $\lambda$. So $A^T x = \lambda x$. The jth entry of this vector equation is $\lambda x_j=\sum_{i=1}^n a_{ij}x_i$.

Choose $x_j$ with the largest absolute value, so $|x_i| \leq |x_j|$ for all $i$. Then $$ |\lambda x_j|=\Big|\sum_{i=1}^n a_{ij}x_i\Big|\leq \sum_{i=1}^n a_{ij} |x_i|\leq \sum_{i=1}^n a_{ij} |x_j| = |x_j| $$

where the last equality holds because $\sum_{i=1}^n a_{ij} =1$. Since $x_j \neq 0$, this implies $\mid \lambda \mid \leq 1$.

Cauchy sequence

A sequence is called a Cauchy sequence if the terms of the sequence eventually all become arbitrarily close to one another. That is, given $\epsilon > 0$ there exists N such that if $m, n > N$ then $|a_m- a_n| < \epsilon$.

  1. Any Cauchy sequence is bounded.
  2. Any convergent sequence is a Cauchy sequence.
  3. A real Cauchy sequence is convergent.

Policy Gradient Theory

Let $\theta$ denote the vector of policy parameters and $\rho$ the performance of the corresponding policy. Then, in the policy gradient approach, the policy parameters are updated approximately proportionally to the gradient: $$ \Delta \theta \approxeq \alpha \frac{\partial \rho}{\partial \theta} $$

where $\alpha$ is a positive-definite step size.

With function approximation, two ways of formulating the agent's objective are useful. One is the average reward formulation, in which policies are ranked according to their long-term expected reward per step, $\rho(\pi)$: $$ \rho(\pi)=\lim_{n \rightarrow \infty} \frac{1}{n}E[r_1 + r_2 + \dots + r_n \mid \pi] = \sum_{s}d^{\pi}(s)\sum_a \pi(s,a)R_s^a $$

where $d^{\pi}(s)=\lim_{t\rightarrow \infty}\Pr\{s_t=s \mid s_0,\pi\}$ is the stationary distribution of states under $\pi$, which we assume exists and is independent of $s_0$ for all policies. In the average reward formulation, the value of a state-action pair given a policy is defined as $$ Q^{\pi}(s,a)=\sum_{t=1}^{\infty}E[r_t-\rho(\pi) \mid s_0=s,a_0=a,\pi] $$

The second formulation is that in which there is a designated start state $s_0$, and we care only about the long-term reward obtained from it: $$ \rho(\pi)=E\Big[\sum_{t=1}^\infty \gamma^{t-1}r_t \mid s_0,\pi\Big] $$

$$ Q^{\pi}(s,a)=E\Big[\sum_{k=1}^\infty \gamma^{k-1}r_{t+k} \mid s_t=s,a_t=a,\pi\Big] $$

where $\gamma \in [0,1]$ is a discount rate. In this formulation, we define $d^\pi(s)$ as a discounted weighting of states encountered starting at $s_0$ and then following $\pi$: $d^\pi(s)=\sum_{t=0}^{\infty} \gamma^t \Pr\{s_t=s\mid s_0,\pi\}$.

Score Function

A score function is the derivative of the log of a parameterized probability/likelihood. For example, let $p(s;\theta)$ be the probability of state $s$ under parameter $\theta$; then the score function is $\nabla_\theta \log{p(s;\theta)}$.
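For a concrete example (my own, not from the notes), the score function of a softmax policy $\pi_\theta(a\mid s)\propto \exp(\theta^T\phi(s,a))$ has the closed form $\phi(s,a)-\sum_b \pi_\theta(b\mid s)\phi(s,b)$; a sketch:

```python
import numpy as np

def softmax_score(theta, features, a):
    """Score function grad_theta log pi(a|s) for a softmax policy
    pi(a|s) proportional to exp(theta^T phi(s, a)), where features[b] = phi(s, b)."""
    logits = features @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return features[a] - probs @ features   # phi(s, a) - E_{b ~ pi}[phi(s, b)]
```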