Chufan Chen's Homepage

Deep Reinforcement Learning and Control

2022-04-20T11:10:00

Given a model of the world

After this section, the reader should be able to solve the "MDP control" problem in both the infinite horizon and the finite horizon setting.

Markov Process or Markov Chain \((S,P)\)

Markov Reward Process is a Markov Chain + reward

$R$ is a reward function $R(s_t=s)=E[r_t\mid s_t=s]$

We assume stationary rewards: $$ r_i=r_j \quad \text{whenever } s_i=s_j, \quad \forall i,j =0,1,\dots $$ In the case of stochastic rewards, we require that the cumulative distribution functions (cdf) of the rewards conditioned on the current state be time independent: $$ F(r_i\mid s_i=s)=F(r_j\mid s_j=s), \quad \forall s \in S,\ \forall i,j=0,1,\dots $$ where $F(r_i\mid s_i=s)$ denotes the cdf of $r_i$ conditioned on the state $s_i=s$.

With the two equations above, we have the following result about the expected rewards: $$ R(s)=E[r_i\mid s_i=s],\quad \forall i=0,1,\dots $$

Horizon $H$

Markov Process

Two assumptions:

$P$ is a row-stochastic matrix. 1 is an eigenvalue of any row-stochastic matrix, and any eigenvalue of a row-stochastic matrix has absolute value at most 1.

The max-norm or infinity-norm of a vector $x \in R^n$ is denoted by $\Vert x\Vert_\infty$ and defined as $\Vert x\Vert_\infty=\max_i |x_i|$. For any matrix $A\in R^{m\times n}$, define the following quantity (called the induced infinity norm of a matrix): $$ \Vert \boldsymbol{A}\Vert_\infty=\sup_{x \in R^n, x \neq 0} \frac{\Vert \boldsymbol{A}x\Vert_\infty}{\Vert x\Vert_\infty} $$

  1. $\Vert \boldsymbol{A}\Vert_\infty$ satisfies all the properties of a norm
  2. $\Vert \boldsymbol{A}\Vert_\infty = \max_{i=1,...,m}(\sum_{j=1}^n|\boldsymbol{A}_{ij}|)$
  3. Conclude that if $A$ is row-stochastic, then $\Vert \boldsymbol{A}\Vert_\infty=1$ (checked numerically below)
  4. For every $x\in R^n$, $\Vert \boldsymbol{A}x\Vert_\infty \leq \Vert \boldsymbol{A}\Vert_\infty\Vert x\Vert_\infty$
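A quick numerical check of properties 2 and 3 (a minimal sketch using numpy, not part of the original notes):

```python
import numpy as np

# A small row-stochastic matrix: each row is a probability distribution.
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.5, 0.3],
              [0.0, 0.4, 0.6]])

# Property 2: the induced infinity norm is the maximum absolute row sum.
inf_norm = np.max(np.sum(np.abs(P), axis=1))
print(inf_norm)  # 1.0, as claimed in property 3 for row-stochastic matrices
```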

Returns $G_t$

General form: $G_t=r_t+\gamma r_{t+1}+\gamma^2r_{t+2}+\dots+\gamma^{H-1}r_{t+H-1}$

$G_t=R_{t+1}+R_{t+2}+\dots + R_{T}$ for episodic tasks, where $T$ is a final step at which a terminal state is reached, ending an episode.

$G_t=R_{t+1}+\gamma R_{t+2}+\dots =\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}$ for continuing tasks with temporal discounting, where $\gamma$ is the discount factor.

Discount factor

Consider a finite horizon Markov reward process with bounded rewards. Specifically, assume that $\exists M \in (0,\infty)$ such that $\mid r_i\mid \leq M\ \forall i$ and across all episodes (realizations). The return $G_t$ of any episode is bounded: $C(M,\gamma,t,H)=\frac{M(1-\gamma^{H-t})}{1-\gamma}$ is a bound, i.e. $\mid G_t\mid \leq C$ for any episode. (Hint: geometric progression.)
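A short derivation of the bound via the geometric series, writing the return from time $t$ up to the horizon as $G_t=\sum_{k=0}^{H-t-1}\gamma^k r_{t+k}$ so that it matches the constant above: $$ |G_t|\le \sum_{k=0}^{H-t-1}\gamma^k |r_{t+k}| \le M\sum_{k=0}^{H-t-1}\gamma^k = \frac{M(1-\gamma^{H-t})}{1-\gamma}=C(M,\gamma,t,H) $$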

Now consider an infinite horizon Markov reward process with bounded rewards and $\gamma <1$. The return $G_t$ of any episode is still bounded: consider the partial sums $S_N=\sum_{i=t}^N \gamma^{i-t}r_i$ for $N\geq t$; the sequence $\{S_N\}_{N\geq t}$ is a Cauchy sequence and hence converges.

Value Functions are Expected Return

$V_{t}(s)=E[G_t\mid S_t=s]$ is called state-value function.

When the horizon $H$ is infinite, the stationarity assumptions on the rewards and transition probabilities imply that $V_i(s)=V_j(s)$ for all $i,j=0,1,\dots$, and thus in this case we define: $$ V(s)=V_0(s) $$

Computing the value function of a Markov Reward Process

Monte Carlo simulation

procedure Monte Carlo Evaluation(M,s,t,N)
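The procedure body is not expanded in these notes. A minimal sketch in numpy, assuming the MRP $M$ is given explicitly by $(P,R,\gamma)$ and rollouts are truncated at a horizon $H$ (the interface is my own):

```python
import numpy as np

def monte_carlo_evaluation(P, R, gamma, s, H, N, rng=None):
    """Estimate V(s) of a finite MRP by averaging returns of N rollouts of
    horizon H starting from state s. P is the |S|x|S| row-stochastic
    transition matrix and R the reward vector (hypothetical inputs)."""
    rng = np.random.default_rng() if rng is None else rng
    n_states = P.shape[0]
    total = 0.0
    for _ in range(N):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(H):
            ret += discount * R[state]                 # reward of the current state
            discount *= gamma
            state = rng.choice(n_states, p=P[state])   # sample the next state
        total += ret
    return total / N
```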

Analytic solution

This method works only for an infinite horizon MRP with $\gamma < 1$.

The MRP value function satisfies $V(s)=R(s)+\gamma \sum_{s' \in S}P(s'\mid s)V(s')$.

For finite state MRP($|S|<\infty$), we can express $V(s)$ using a matrix equation $$ \begin{bmatrix}V(s_1)\\ \vdots \\ V(s_N)\end{bmatrix}=\begin{bmatrix}R(s_1)\\ \vdots \\ R(s_N)\end{bmatrix}+\gamma \begin{bmatrix}P(s_1\mid s_1) & \dots & P(s_N\mid s_1)\\ P(s_1\mid s_2) & \dots & P(s_N\mid s_2)\\ \vdots & \ddots & \vdots \\ P(s_1\mid s_N) & \dots & P(s_N\mid s_N)\end{bmatrix}\begin{bmatrix}V(s_1)\\ \vdots \\ V(s_N)\end{bmatrix} $$ \begin{equation} V=R+\gamma PV \end{equation}

\begin{equation} V-\gamma PV=R\newline (I-\gamma P) V=R\newline V=(I-\gamma P)^{-1}R \end{equation}

Solving directly requires taking a matrix inverse, $\sim O(N^3)$. Note that $(I-\gamma P)$ is invertible for $\gamma < 1$: every eigenvalue of $P$ has absolute value at most 1 (see the appendix), so every eigenvalue of $I-\gamma P$ is nonzero.
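A minimal sketch of the analytic solution (the function name and interface are my own; it assumes $P$ is the $N\times N$ transition matrix and $R$ the reward vector):

```python
import numpy as np

def mrp_value_analytic(P, R, gamma):
    """Exact value of a finite MRP with gamma < 1: solve (I - gamma P) V = R.
    Solving the linear system avoids forming the inverse explicitly, though
    the asymptotic cost is still on the order of N^3."""
    n = P.shape[0]
    return np.linalg.solve(np.eye(n) - gamma * P, R)
```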

Iterative Algorithm for Computing the Value of an MRP

Computational complexity: $O(|S|^2)$ per iteration ($|S|=N$)
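A sketch of the corresponding iterative computation, under the same assumptions as the analytic sketch above:

```python
import numpy as np

def mrp_value_iterative(P, R, gamma, tol=1e-8):
    """Repeatedly apply V <- R + gamma * P V until the change is below tol.
    Each sweep is one matrix-vector product, i.e. O(|S|^2)."""
    V = np.zeros_like(R, dtype=float)
    while True:
        V_new = R + gamma * P @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```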

Solving MDPs

Markov Decision Process is Markov Reward Process + actions.

MDP is a tuple: $(S,A,P,R,\gamma)$

Dynamic programming assumes full knowledge of the MDP. It is used for planning in an MDP.

We can recover an optimal policy by acting greedily with respect to the optimal value functions:

$$ \pi_{*}(a|s)= \begin{cases} 1,& \text{if } a=\arg\max_{a\in A}q_{*}(s,a)\\ 0,& \text{otherwise} \end{cases} $$

If we know $q_*(s,a)$, we do not need the dynamics to find the optimal policy.

$$ \pi_{*}(a|s)= \begin{cases} 1,& \text{if } a=\arg\max_{a\in A}\big(\sum_{s',r}p(s',r|s,a)(r+\gamma v_{*}(s'))\big)\\ 0,& \text{otherwise} \end{cases} $$

If we know $v_*(s)$, we need the dynamics to do one step lookahead.

Solving Linear Equations

$$ G_t=R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\dots \\ =R_{t+1}+\gamma (R_{t+2}+\gamma R_{t+3} + \gamma^2 R_{t+4}+\dots) \\ =R_{t+1}+\gamma G_{t+1} $$

By conditioning on a state and taking expectations: $$ v_{\pi}(s)=E_\pi[R_{t+1}+\gamma v_{\pi}(S_{t+1})\mid S_t=s]\\ =\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')] $$

By conditioning on a state and an action and taking expectations: $$ q_{\pi}(s, a)=E[R_{t+1}+\gamma q_{\pi}(S_{t+1}, A_{t+1})\mid S_t=s, A_t=a]\\ =\sum_{s',r}p(s',r|s,a)[r+\gamma \sum_{a'}\pi(a'|s') q_{\pi}(s',a')] $$

Bellman Expectation Equations

\begin{equation} v_{\pi}(s)=\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_{\pi}(s')] \end{equation} is a set of linear equations, one for each state. The state value function for $\pi$ is its unique solution.

\begin{equation} q_{\pi}(s, a)=\sum_{s',r}p(s',r|s,a)[r+\gamma \sum_{a'}\pi(a'|s') q_{\pi}(s',a')] \end{equation} is a set of linear equations, one for each state-action pair. The state-action value function for $\pi$ is its unique solution.

Bellman optimality Equations

For the state-value function: $$ v_{*}(s)=\max_{a\in A}\Big(\sum_{s',r}p(s',r \mid s,a)(r+\gamma v_{*}(s'))\Big) $$

$v^*$ is the unique solution of this system of nonlinear equations

$$ q_{*}(s, a)=E[R_{t+1}+\gamma \max_{a' \in A}q_{*}(S_{t+1},a')\mid S_t=s, A_t=a]\\ = \sum_{s',r}p(s',r|s,a)[r+\gamma \max_{a'}q_{*}(s',a')] $$

$q^*$ is the unique solution of this system of nonlinear equations

$$ v_{\pi}(s)=\sum_{a\in A}\pi(a\mid s)q_{\pi}(s,a) \\ v_{*}(s)=\max_a q_{*}(s,a) \\ q_{*}(s,a)=\sum_{s',r}p(s',r|s,a)[r+\gamma v_{*}(s')] $$

MDP + π(a|s) = Markov Reward Process

An MDP together with a fixed policy $\pi$ is an MRP $(S,P^\pi,R^\pi,\gamma)$ where

$$ R^\pi(s)=\sum_{a\in A}\pi(a\mid s)R(s,a) $$ $$ P^\pi(s'\mid s)=\sum_{a\in A}\pi(a\mid s)P(s'\mid s,a) $$

This implies we can use the same techniques to evaluate the value of a policy for an MDP as we use to compute the value of an MRP, by defining an MRP with $R^\pi$ and $P^\pi$.
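A sketch of this reduction, assuming the MDP is stored as arrays $P[a,s,s']$ and $R[s,a]$ and the policy as a matrix $\pi[s,a]$ (these shapes are my own convention, not from the notes):

```python
import numpy as np

def mdp_to_mrp(P, R, pi):
    """Collapse an MDP and a stochastic policy into the induced MRP,
    assuming P has shape (|A|, |S|, |S|), R has shape (|S|, |A|) and
    pi has shape (|S|, |A|) with rows summing to 1."""
    R_pi = np.sum(pi * R, axis=1)            # R^pi(s) = sum_a pi(a|s) R(s, a)
    P_pi = np.einsum('sa,asn->sn', pi, P)    # P^pi(s'|s) = sum_a pi(a|s) P(s'|s, a)
    return P_pi, R_pi

# The induced MRP can then be fed to the MRP sketches above, e.g.
# V_pi = mrp_value_analytic(P_pi, R_pi, gamma)
```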

MDP Policy Evaluation, Iterative Algorithm

This is a Bellman backup for a particular policy: $$ V_k^{\pi}(s)=\sum_{a\in A}\pi(a\mid s)\Big[r(s,a)+\gamma \sum_{s' \in S}p(s'\mid s,a)V_{k-1}^\pi(s')\Big] $$

If the policy is deterministic then the above update simplifies to
$$ V_k^{\pi}(s)=r(s,\pi(s))+\gamma \sum_{s' \in S}p(s'\mid s,\pi(s))V_{k-1}^\pi(s') $$
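A sketch of the resulting iterative policy evaluation for a deterministic policy, under the same array conventions as the reduction sketch above:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    """Iterative evaluation of a deterministic policy pi (an array mapping
    each state to an action), assuming P[a, s, s'] and R[s, a] as above."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        V_new = np.array([R[s, pi[s]] + gamma * P[pi[s], s] @ V
                          for s in range(n_states)])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```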

MDP Control

Compute the optimal policy $$ \pi^*(s)=arg\max_{\pi}V^{\pi}(s) $$

There exists a unique optimal value function.

The optimal policy for an MDP in an infinite horizon problem (the agent acts forever) is deterministic, stationary (it does not depend on the time step), and not necessarily unique (there can be ties in the state-action values).

Number of deterministic policies is $|A|^{|S|}$. Policy iteration is generally more efficient than enumeration.

MDP Policy Iteration(PI)

Policy Improvement

We first get $V^{\pi_i}$ by policy evaluation, compute the state-action value $Q^{\pi_i}(s,a)=R(s,a)+\gamma\sum_{s'\in S}p(s'\mid s,a)V^{\pi_i}(s')$, and set $\pi_{i+1}(s)=\arg\max_a Q^{\pi_i}(s,a)$.

By definition of the maximum, $\max_{a}Q^{\pi_i}(s,a) \geq Q^{\pi_i}(s,\pi_i(s))$, hence

$$ \max_a Q^{\pi_i}(s,a) \geq R(s,\pi_i(s))+\gamma \sum_{s'\in S}p(s'\mid s,\pi_i(s))V^{\pi_i}(s')=V^{\pi_i}(s) $$

The subtle point: this says that if we take $\pi_{i+1}(s)$ for one action and then follow $\pi_i$ forever, our expected sum of rewards is at least as good as if we had always followed $\pi_i$. That is weaker than what we need, since the new proposal is to follow $\pi_{i+1}$ at every step; the monotonic improvement result below closes this gap.

Monotonic Improvement in Policy

Definition:

$$V^{\pi_1} \ge V^{\pi_2}:V^{\pi_1}(s) \ge V^{\pi_2}(s), \forall s \in S$$

Proposition: $V^{\pi_{i+1}} \ge V^{\pi_i}$, with strict inequality if $\pi_i$ is suboptimal, where $\pi_{i+1}$ is the new policy we get from policy improvement on $\pi_i$.

Proof:

\begin{equation} V^{\pi_i}(s) \le \max_a Q^{\pi_i}(s,a)\newline =\max_a \Big[R(s,a)+\gamma \sum_{s'\in S}p(s'\mid s,a)V^{\pi_i}(s')\Big]\newline =R(s,\pi_{i+1}(s))+\gamma \sum_{s'\in S}p(s'\mid s,\pi_{i+1}(s))V^{\pi_i}(s') \text{ by definition of } \pi_{i+1}\newline \le R(s,\pi_{i+1}(s))+\gamma \sum_{s'\in S}p(s'\mid s,\pi_{i+1}(s))\max_{a'} Q^{\pi_i}(s',a') \text{ since } V^{\pi_i}(s')\le \max_{a'}Q^{\pi_i}(s',a')\newline = R(s,\pi_{i+1}(s))+\gamma \sum_{s'\in S}p(s'\mid s,\pi_{i+1}(s))\Big(R(s',\pi_{i+1}(s'))+\gamma \sum_{s''\in S}p(s''\mid s',\pi_{i+1}(s'))V^{\pi_i}(s'')\Big)\newline \dots\newline = V^{\pi_{i+1}}(s) \end{equation}

If the policy does not change, it can never change again.

There is a maximum number of iterations of policy iteration: at most $|A|^{|S|}$, the number of deterministic policies.

Bellman Backup Operators

Value Iteration (VI)

Policy Iteration as Bellman Operations

Bellman backup operator $B^\pi$ for a particular policy is defined as $$ B^{\pi} V(s)=R^{\pi} (s)+\gamma \sum_{s' \in S}P^\pi(s'\mid s)V(s') $$

Policy evaluation amounts to computing the fixed point of $B^{\pi}$

To do policy evaluation, repeatedly apply the operator until $V$ stops changing: $V^\pi = B^\pi B^\pi \dots B^\pi V$

To do policy improvement $$ \pi_{k+1}(s)= arg\max_a[R(s,a)+\gamma \sum_{s'\in S}P(s'\mid s,a)V^{\pi_k}(s')] $$
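A sketch of this improvement step, with the same array conventions as before:

```python
import numpy as np

def policy_improvement(P, R, V, gamma):
    """Greedy improvement step, assuming P[a, s, s'] and R[s, a]:
    pi_{k+1}(s) = argmax_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V(s') ]."""
    Q = R + gamma * np.einsum('asn,n->sa', P, V)
    return Q.argmax(axis=1)

# Policy iteration alternates policy_evaluation (see the sketch above) and
# policy_improvement until the policy stops changing.
```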

Contraction Operator

Let $O$ be an operator and $\Vert x\Vert$ denote any norm of $x$.

If $\Vert OV-OV'\Vert \le \alpha \Vert V-V'\Vert$ for some $\alpha < 1$ and all $V,V'$, then $O$ is a contraction operator.

Will Value Iteration Converge?

Yes, if the discount factor $\gamma < 1$, or if the process ends up in a terminal state with probability 1.

Bellman backup is a contraction if discount factor, $\gamma < 1$.

If we apply it to two different value functions, the distance between them shrinks after applying the Bellman backup to each.

Proof: Bellman Backup is a Contraction on $V$ for $\gamma < 1$

Let $\vert V-V'\vert=\max_s |V(s)-V'(s)|$ be the infinity norm

$$ \vert BV_k-BV_j\vert \le \max_{s,a}\ \gamma \sum_{s'\in S}P(s'\mid s, a)\,|V_k(s')-V_j(s')| \le \gamma\,\vert V_k-V_j\vert \max_{s,a}\sum_{s'\in S}P(s'\mid s,a) =\gamma\,\vert V_k-V_j\vert $$

Even if all inequalities are equalities, this is still a contraction if $\gamma < 1$

MDP: Computing Optimal Policy and Optimal Value

Policy iteration computes infinite horizon value of a policy and then improves that policy.

Value iteration maintains optimal value of starting in a state s if have a finite number of steps k left in the episode. Iterate to consider longer and longer episodes.
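A sketch of value iteration under the same array conventions as the earlier sketches (the stopping rule and tolerance are my own choices):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Value iteration, assuming P[a, s, s'] and R[s, a]. Each sweep applies
    the Bellman optimality backup B V = max_a [ R + gamma * P V ]."""
    n_states = P.shape[1]
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * np.einsum('asn,n->sa', P, V)   # Q[s, a]
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)             # value and greedy policy
        V = V_new
```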

Consider an infinite horizon discounted MDP. A policy $\pi$ in this MDP induces a value function $V^{\pi}$ (let's refer to this as $V^\pi_{old}$). Now suppose we have a new MDP where the only difference is that all rewards have a constant $c$ added to them. Prove that $V_{new}^\pi(s)=V_{old}^\pi(s)+\frac{c}{1-\gamma}$.

\begin{equation} V^\pi_{old}(s)=E_\pi \Big[\sum_{T=0}^{\infty}\gamma^T r_{t+T}\mid s_t=s\Big]\newline V^\pi_{new}(s)=E_\pi \Big[\sum_{T=0}^{\infty}\gamma^T(r_{t+T}+c)\mid s_t=s\Big]\newline =E_\pi \Big[\sum_{T=0}^{\infty}\gamma^T r_{t+T}\mid s_t=s\Big]+c\sum_{T=0}^{\infty}\gamma^T\newline =V_{old}^\pi(s)+\frac{c}{1-\gamma} \end{equation}

Summary

Given the dynamics/transition model $P$ and the reward model $R$, we can do dynamic programming to evaluate how good a policy is.

$V_k^\pi(s)$ is the exact $k$-horizon value of state $s$ under policy $\pi$.

$V_k^\pi(s)$ is also an estimate of the infinite horizon value of state $s$ under policy $\pi$.

Dynamic programming computes $V_k^\pi(s)=r(s,\pi(s))+\gamma \sum_{s'} P(s'\mid s,\pi(s))V_{k-1}^\pi(s')$, bootstrapping the rest of the expected return with the value estimate $V_{k-1}$.

If we know the model $P(s' \mid s,a)$, we can compute the immediate reward and the expectation over next states exactly, and then bootstrap using the current estimate $V_{k-1}$.

Dynamic Programming:

Model-Free Policy Evaluation: Policy Evaluation Without Knowing How the World Works

Monte Carlo(MC) Policy Evaluation

In general, we get the Monte Carlo estimate of some quantity by observing many iterations of how that quantity is generated either in real life or via simulation and then averaging over the observed quantities. By the law of large numbers, this average converges to the expectation of the quantity.

  1. Execute a rollout of policy $\pi$ until termination many times
  2. Record the returns $G_t$ that we observe when starting at state s
  3. Take an average of the values we get for $G_t$ to estimate $V^\pi(s)$.

First-Visit Monte Carlo (MC) On Policy Evaluation

Initialize $N(s)=0, G(s)=0\ \forall s\in S$, where $N(s)$ counts the number of (first) visits to state $s$ and $G(s)$ accumulates the corresponding returns.

Loop
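The loop body is not written out in these notes. A minimal sketch, assuming episodes are provided as lists of (state, reward) pairs collected by running $\pi$ to termination:

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """First-visit MC policy evaluation. `episodes` is assumed to be a list of
    trajectories generated by pi, each a list of (state, reward) pairs, with
    the convention G_t = r_t + gamma * r_{t+1} + ... used in these notes."""
    N = defaultdict(int)      # N(s): number of first visits to s
    G = defaultdict(float)    # G(s): total return accumulated at first visits
    for episode in episodes:
        # Compute every return G_t for this episode, working backwards.
        G_t, ret = [0.0] * len(episode), 0.0
        for t in reversed(range(len(episode))):
            ret = episode[t][1] + gamma * ret
            G_t[t] = ret
        # Credit only the first visit to each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                N[s] += 1
                G[s] += G_t[t]
    return {s: G[s] / N[s] for s in N}   # estimate of V^pi(s)
```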

Every-Visit Monte Carlo (MC) On Policy Evaluation

Initialize $N(s)=0,G(s)=0\forall s \in S$

Loop
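The every-visit variant differs only in that every occurrence of a state contributes its return; a sketch under the same assumptions as the first-visit version:

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma):
    """Every-visit MC policy evaluation: identical to the first-visit sketch
    above, except that every occurrence of a state contributes a return."""
    N, G = defaultdict(int), defaultdict(float)
    for episode in episodes:
        G_t, ret = [0.0] * len(episode), 0.0
        for t in reversed(range(len(episode))):
            ret = episode[t][1] + gamma * ret
            G_t[t] = ret
        for t, (s, _) in enumerate(episode):
            N[s] += 1
            G[s] += G_t[t]
    return {s: G[s] / N[s] for s in N}
```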

Evaluation of the Quality of a Policy Estimation Approach: Bias, Variance and MSE

Consider a statistical model that is parameterized by θ and that determines a probability distribution over observed data P(x|θ)

Consider a statistic $\hat{\theta}$ that provides an estimate of θ and is a function of observed data x

Definition: the bias of an estimator $\hat{\theta}$ is: $$ Bias_{\theta}(\hat{\theta})=E_{x\mid \theta}[\hat{\theta}(x)]-\theta $$

Definition: the variance of an estimator $\hat{\theta}$ is: $$ Var(\hat{\theta})=E_{x\mid \theta}[(\hat{\theta}(x)-E[\hat{\theta}(x)])^2] $$

Definition: the mean squared error (MSE) of an estimator $\hat{\theta}$ is $MSE(\hat{\theta})=E_{x\mid\theta}[(\hat{\theta}(x)-\theta)^2]$, which decomposes as: $$ MSE(\hat{\theta})=Var(\hat{\theta})+Bias_{\theta}(\hat{\theta})^2 $$

Let n be the number of data points x used to estimate the parameter θ and call the resulting estimate of θ using that data $\hat{\theta}_n$

Then the estimator $\hat{\theta}_n$ is consistent if, for all $\epsilon > 0$

$$ \lim_{n \rightarrow \infty} Pr(|\hat{\theta}_n-\theta|>\epsilon)=0 $$

First-Visit Monte Carlo (MC) On Policy Evaluation

$V^\pi$ estimator is unbiased and consistent

Every-Visit Monte Carlo (MC) On Policy Evaluation

$V^\pi$ estimator is biased and consistent, but has better MSE

Temporal Difference Learning

Temporal Difference [TD(0)] Learning

$$ V^\pi(s_t) \leftarrow V^\pi(s_t)+\alpha\,[r_t+\gamma V^\pi(s_{t+1})-V^\pi(s_t)] $$

TD error: $\delta_t=r_t+\gamma V^\pi(s_{t+1})-V^\pi(s_t)$

Algorithm

Input: $\alpha$

Initialize $V^\pi(s) = 0, \forall s \in S$

Loop
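The loop body is again not written out. A minimal sketch of the TD(0) backup applied to a stream of observed transitions (the interface is hypothetical):

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, alpha, gamma):
    """One TD(0) backup: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s))."""
    delta = r + gamma * V[s_next] - V[s]   # TD error delta_t
    V[s] += alpha * delta
    return V

# Usage sketch over a hypothetical stream of (s, r, s') transitions from pi:
# V = defaultdict(float)
# for s, r, s_next in transitions:
#     td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99)
```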

Model Free Control

Maximization Bias

Value Function Approximation

So far we have represented the value function by a lookup table where each state has a corresponding entry, $V(s)$, or each state-action pair has an entry, $Q(s, a)$. However, this approach may not generalize well to problems with very large state and/or action spaces, or we may prefer to learn approximations quickly rather than wait for converged values of each state. A popular approach to address this is function approximation:

$$ v_\pi(s)\approxeq \hat{v}(s,\boldsymbol{w}) \text{ or } q_\pi(s,a)\approxeq \hat{q}(s,a,\boldsymbol{w}) $$

Here $\boldsymbol{w}$ is usually referred to as the parameters or weights of our function approximator. Popular choices of function approximator include linear combinations of features and neural networks, both of which are covered below.

Linear Feature Representations

In linear function representations, we use a feature vector to represent a state: $$ x(s)=[x_1(s),x_2(s),\ldots,x_n(s)] $$

We then approximate our value functions using a linear combination of features: $$ \hat{v}(s,\boldsymbol{w})=x(s)^T\boldsymbol{w}=\sum_{j=1}^n w_j x_j(s) $$

We define the (quadratic) objective (also known as the loss) function to be: $$ J(\boldsymbol{w})=E_\pi[(v_\pi(s)-\hat{v}(s,\boldsymbol{w}))^2] $$

Gradient descent

A common technique to minimize the above objective function is called Gradient Descent.

Mathematically, this can be summarized as:

compute the gradient: $$ \nabla_\boldsymbol{w}J(\boldsymbol{w})=[\frac{\partial J(\boldsymbol{w})}{\partial w_1},\frac{\partial J(\boldsymbol{w})}{\partial w_2}, \ldots, \frac{\partial J(\boldsymbol{w})}{\partial w_n}] $$

compute an update step using gradient descent: $$ \Delta \boldsymbol{w} = -\frac{1}{2} \alpha \nabla_\boldsymbol{w}J(\boldsymbol{w}) $$

take a step towards the local minimum $$ \boldsymbol{w} = \boldsymbol{w} + \Delta \boldsymbol{w} $$

Stochastic gradient descent

In practice, Gradient Descent is not considered a sample efficient optimizer and stochastic gradient descent (SGD) is used more often. Although the original SGD algorithm referred to updating the weights using a single sample, due to the convenience of vectorization, people often refer to gradient descent on a minibatch of samples as SGD as well. In (minibatch) SGD, we sample a minibatch of past experiences, compute our objective function on that minibatch, and update our parameters using gradient descent on the minibatch.
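A sketch of a single (semi-)gradient step for the linear VFA of the previous section; the target stands in for $v_\pi(s)$, e.g. the return $G_t$ in Monte Carlo or $r+\gamma \hat{v}(s',\boldsymbol{w})$ in TD(0), as described in the next two subsections:

```python
import numpy as np

def linear_vfa_update(w, x_s, target, alpha):
    """One SGD step on the quadratic objective for a linear VFA.
    For v_hat(s, w) = x(s)^T w, the sample gradient of the squared error is
    -(target - v_hat) * x(s), so the update is
        w <- w + alpha * (target - x(s)^T w) * x(s).
    In TD(0) the target r + gamma * x(s')^T w is treated as a constant,
    i.e. this is a semi-gradient step."""
    return w + alpha * (target - x_s @ w) * x_s
```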

Monte Carlo with linear VFA

Temporal Difference (TD(0)) with linear VFA

Convergence Guarantees for Linear VFA for Policy Evaluation

| Algorithm | Tabular | Linear VFA | Nonlinear VFA |
| --- | --- | --- | --- |
| Monte-Carlo Control | Yes | (Yes) | No |
| SARSA | Yes | (Yes) | No |
| Q-learning | Yes | No | No |

(Yes) means the result chatters around the near-optimal value function.

Control using VFA

Neural Networks

Although linear VFAs often work well given the right set of features, it can be difficult to handcraft such a feature set. Neural networks provide a much richer function approximation class that is able to work directly from states without requiring an explicit specification of features.

Value-Based Deep Reinforcement Learning

All three neural architectures are able to learn successful policies directly from high-dimensional inputs, e.g. preprocessed pixels from video games, by using end-to-end reinforcement learning, and they all achieved a level of performance comparable to a professional human games tester across a set of 49 games on the Atari 2600.

Convolutional Neural Networks (CNNs) are used in these architectures for feature extraction from pixel inputs. Understanding the mechanisms behind feature extraction via CNNs can help better understand how DQN works.

Generalization: Deep Q-Network (DQN)

The performance of linear function approximators depends heavily on the quality of the features. In general, handcrafting an appropriate set of features can be difficult and time-consuming. To scale up to making decisions in really large domains (e.g. huge state spaces) and enable automatic feature extraction, deep neural networks (DNNs) are used as function approximators.

DQN Architecture

Preprocessing Raw Pixels

The raw Atari 2600 frames are of size (210 × 160 × 3), where the last dimension corresponds to the RGB channels. The preprocessing step adopted in Mnih et al. (2015) aims at reducing the input dimensionality and dealing with some artifacts of the game emulator. We summarize the preprocessing as follows:

The above preprocessing is applied to the 4 most recent raw RGB frames, and the encoded frames are stacked together to produce the input (of shape (84 × 84 × 4)) to the Q-network. Stacking the recent frames into the game state is also a way to transform the game environment into an (almost) Markovian world.

Training Algorithm for DQN
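The training algorithm itself is not reproduced here. As a minimal sketch of its core regression target (names and interface are hypothetical), assuming a replay buffer of transitions and a periodically frozen target network:

```python
import numpy as np

def dqn_targets(q_target, batch, gamma):
    """DQN regression targets, assuming `q_target(s)` returns the Q-values of
    all actions under the frozen target network and `batch` is a list of
    (s, a, r, s_next, done) transitions sampled from a replay buffer."""
    ys = []
    for s, a, r, s_next, done in batch:
        y = r if done else r + gamma * np.max(q_target(s_next))
        ys.append(y)   # the online network is then regressed toward y at (s, a)
    return np.array(ys)
```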

Reducing Bias: Double Deep Q-Network (DDQN)

Decoupling Value and Advantage: Dueling Network

Stochastic policies

Policy optimization

Policy objective functions

Once we have defined a policy $\pi_\theta(a|s)$, we need to be able to measure how it is performing in order to optimize it. In an episodic environment, a natural measure is the start value of the policy, which is the expected value of the start state:

$$ J_1(\theta)=V^{\pi_\theta}(s_1)=E_{\pi_\theta}[v_1] $$

In continuing environments we can use the average value of the policy, where $d^{\pi_\theta}(s)$ is the stationary distribution of $\pi_\theta$:

$$ J_{avV}(\theta)=\sum_s d^{\pi_\theta}(s)V^{\pi_\theta}(s) $$

or alternatively we can use the average reward per time-step: $$ J_{avR}(\theta)=\sum_s d^{\pi_\theta}(s)\sum_a \pi_\theta(a\mid s)R(s,a) $$

Gradient-free optimization methods

With an objective function, we can treat our policy-based reinforcement learning problem as an optimization problem. We focus on gradient-based methods, because recently they have been the most common choice for policy-based RL. However, it is worth considering gradient-free optimization methods as well.

These methods have the advantage over gradient-based methods in that they do not have to compute a gradient of the objective function. This allows the policy parameterization to be non-differentiable, and these methods are also often easy to parallelize. Gradient-free methods are often a useful baseline to try, and sometimes they can work embarrassingly well.

However, these methods are usually not very sample efficient because they ignore the temporal structure of the rewards: updates only take into account the total reward over the entire episode and do not break it up into the rewards obtained at each state in the trajectory.

Policy gradient

Let us define $V^{\pi_\theta}=V(s_0, \theta)$ to be the objective function we wish to maximize over $\theta$. Policy gradient methods search for a local maximum of $V(\theta)$ by ascending the gradient of the policy objective w.r.t. the parameters $\theta$:

$$ \Delta \theta = \alpha \nabla_\theta V(s_0, \theta) $$

where $\alpha$ is a step-size parameter and $\nabla_\theta V(\theta)$ is the policy gradient.

$$ \nabla_\theta V(\theta)=\begin{bmatrix}\frac{\partial V(\theta)}{\partial \theta_1}\\ \vdots \\ \frac{\partial V(\theta)}{\partial \theta_n}\end{bmatrix} $$

Compute Gradients by Finite Differences

For each dimension $k \in [1,n]$

$$ \frac{\partial V(s_0, \theta)}{\partial \theta_k} \approxeq \frac{V(s_0,\theta+\epsilon u_k)-V(s_0,\theta)}{\epsilon} $$ where $u_k$ is a unit vector with 1 in the $k$th component and 0 elsewhere.

Finite difference policy gradient:
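A minimal sketch of this estimator, assuming $\theta$ is a 1-D parameter vector and `evaluate_policy(theta)` returns a rollout-based estimate of $V(s_0,\theta)$ (both are my own assumptions):

```python
import numpy as np

def finite_difference_gradient(evaluate_policy, theta, eps=1e-2):
    """Estimate the policy gradient by perturbing each parameter in turn."""
    grad = np.zeros_like(theta, dtype=float)
    base = evaluate_policy(theta)
    for k in range(theta.size):
        u_k = np.zeros_like(theta, dtype=float)
        u_k[k] = 1.0                              # unit vector along dimension k
        grad[k] = (evaluate_policy(theta + eps * u_k) - base) / eps
    return grad
```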

Computing the gradient analytically

Appendix

Stochastic Matrices

A stochastic matrix is a square matrix of non-negative real numbers that lists the transition probabilities of a finite Markov chain. It is also called a probability matrix, probability transition matrix, transition matrix, substitution matrix, or Markov matrix. These matrices come in two forms, depending on whether the rows or the columns sum to 1.

In the context of a Markov process, the transition probability matrix $P$ has size $\mid S\mid\times \mid S\mid$, with $(i,j)$ entry given by $P_{ij}=P(j\mid i)$, where $i,j$ refer to the states of $S$ ordered arbitrarily. The matrix $P$ is a non-negative row-stochastic matrix.

Let A be a column-stochastic matrix (each column sums to 1); the proof below is written for this case, and the row-stochastic case follows by applying the result to $A^T$, which has the same eigenvalues. Then:

  1. 1 is an eigenvalue of A.
  2. If $\lambda$ is a (real or complex) eigenvalue of A, then $\mid \lambda \mid \leq 1$.

Proof:

$$ A^T \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1\end{bmatrix}=\begin{bmatrix}1 \\ 1 \\ \vdots \\ 1\end{bmatrix} $$

Therefore, 1 is an eigenvalue of $A^T$. But A and $A^T$ have the same characteristic polynomial:

$$ \det(A-\lambda I_n)=\det((A-\lambda I_n)^T)=\det(A^T-\lambda I_n) $$

Therefore, 1 is an eigenvalue of A.

Now let $\lambda$ be any eigenvalue of A, so it is also an eigenvalue of $A^T$. Let $x=(x_1,x_2,...,x_n)$ be an eigenvector of $A^T$ with eigenvalue $\lambda$. So $A^T x = \lambda x$. The jth entry of this vector equation is $\lambda x_j=\sum_{i=1}^n a_{ij}x_i$.

Choose $x_j$ with the largest absolute value, so $|x_i| \leq |x_j|$ for all $i$. Then $$ |\lambda x_j|=\Big|\sum_{i=1}^n a_{ij}x_i\Big|\leq \sum_{i=1}^n a_{ij} |x_i|\leq \sum_{i=1}^n a_{ij} |x_j| = |x_j| $$

where the last equality holds because $\sum_{i=1}^n a_{ij} =1$. Since $x_j \neq 0$, this implies $\mid \lambda \mid \leq 1$.

Cauchy sequence

A sequence is called a Cauchy sequence if the terms of the sequence eventually all become arbitrarily close to one another. That is, given $\epsilon > 0$ there exists N such that if $m, n > N$ then $|a_m- a_n| < \epsilon$.

  1. Any Cauchy sequence is bounded.
  2. Any convergent sequence is a Cauchy sequence.
  3. A real Cauchy sequence is convergent.

Policy Gradient Theory

Let $\theta$ denote the vector of policy parameters and $\rho$ the performance of the corresponding policy. Then, in the policy gradient approach, the policy parameters are updated approximately proportionally to the gradient: $$ \Delta \theta \approxeq \alpha \frac{\partial \rho}{\partial \theta} $$

where $\alpha$ is a positive-definite step size.

With function approximation, two ways of formulating the agent's objective are useful. One is the average reward formulation, in which policies are ranked according to their long-term expected reward per step, $\rho(\pi)$: $$ \rho(\pi)=\lim_{n \rightarrow \infty} \frac{1}{n}E[r_1 + r_2 + \dots + r_n \mid \pi] = \sum_{s}d^{\pi}(s)\sum_a \pi(s,a)R_s^a $$

where $d^{\pi}(s)=\lim_{t\rightarrow \infty}\Pr\{s_t=s \mid s_0,\pi\}$ is the stationary distribution of states under $\pi$, which we assume exists and is independent of $s_0$ for all policies. In the average reward formulation, the value of a state-action pair given a policy is defined as $$ Q^{\pi}(s,a)=\sum_{t=1}^{\infty}E[r_t-\rho(\pi) \mid s_0=s,a_0=a,\pi] $$

The second formulation is that in which there is a designated start state $s_0$, and we care only about the long-term reward obtained from it: $$ \rho(\pi)=E\Big[\sum_{t=1}^\infty \gamma^{t-1}r_t \mid s_0,\pi\Big] $$

$$ Q^{\pi}(s,a)=E\Big[\sum_{k=1}^\infty \gamma^{k-1}r_{t+k} \mid s_t=s,a_t=a,\pi\Big] $$

where $\gamma \in [0,1]$ is a discount rate. In this formulation, we define $d^\pi(s)$ as a discounted weighting of states encountered starting at $s_0$ and then following $\pi$: $d^\pi(s)=\sum_{t=0}^{\infty} \gamma^t \Pr\{s_t=s\mid s_0,\pi\}$.

Score Function

A score function is the derivative of the log of a parameterized probability/likelihood. For example, let $p(s;\theta)$ be the probability of state $s$ under parameter $\theta$; then the score function is $\nabla_\theta \log{p(s;\theta)}$.
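For a concrete example (my own, not from the notes), the score function of a softmax policy $\pi_\theta(a\mid s)\propto \exp(\theta^T\phi(s,a))$ has the closed form $\phi(s,a)-\sum_b \pi_\theta(b\mid s)\phi(s,b)$; a sketch:

```python
import numpy as np

def softmax_score(theta, features, a):
    """Score function grad_theta log pi(a|s) for a softmax policy
    pi(a|s) proportional to exp(theta^T phi(s, a)), where features[b] = phi(s, b)."""
    logits = features @ theta
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return features[a] - probs @ features   # phi(s, a) - E_{b ~ pi}[phi(s, b)]
```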