Reinforcement Learning Theory

Introduction

In this article, we aim to understand the theory of reinforcement learning in the context of Large Language Models.

Reinforcement learning addresses sequential decision problems, in which an agent takes actions in an environment. Such settings are generally modeled as Markov Decision Processes (MDPs). An MDP is characterized by -

  • Time - T - Discrete, infinite
  • States - S - Discrete
  • Actions - A - Discrete
  • Transitions - \(\tau : S \times A \to \Delta(S)\) - a probability distribution over next states

    \[P(s' \vert s, a) = P_R[ \tau (s, a) = s']\]

    In more complicated setups, the transition function could also depend on the time \(t\) (non-stationarity). In finite-time horizons, the time can be embedded inside the state itself, converting non-stationary scenarios into stationary ones.

  • Reward - \(R: S \times A \times S \to [-M, M] \subset \mathbb R\). There is no need to make the reward stochastic, since that case is already covered by the stochasticity of \(\tau\). The rewards are usually bounded. The expected reward for an action is given by

    \[R'(s, a) = \sum_{s' \in S} R(s', a, s) p(s' \vert s, a)\]
  • Initial state \(S_0\) - Can be a single state or a distribution over states.
  • Discount Factor \(\gamma\) - A factor \(< 1\) that keeps the total expected future reward bounded. Its role will become clearer later in the text.

A policy \(\pi\) is a probability distribution over actions conditioned on the current state. Consider a map \(\beta: S \to \Delta(A)\); then a policy is formally

\[\pi(s, a) = P_R[\beta(s) = a]\]

A trajectory describes the sequence of states, actions, and rewards of an agent in a run - \(s_0, a_0, r_0, s_1, a_1, r_1, \dots\).
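
To make these definitions concrete, below is a minimal sketch in Python of a tabular MDP, a stationary stochastic policy, and trajectory sampling. The array names (`P`, `R`, `pi`) and the `sample_trajectory` helper are illustrative choices, not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# A small tabular MDP: |S| states, |A| actions.
n_states, n_actions = 4, 2

# Transition tensor: P[s, a] is a distribution over next states (tau : S x A -> Delta(S)).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))

# Reward R[s, a, s'] bounded in [-1, 1] (the text indexes it as R(s', a, s)).
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions, n_states))

# A stationary stochastic policy: pi[s] is a distribution over actions (beta : S -> Delta(A)).
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

def sample_trajectory(s0, horizon=10):
    """Roll out s_0, a_0, r_0, s_1, a_1, r_1, ... for a fixed number of steps."""
    traj, s = [], s0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])        # a_t ~ pi(. | s_t)
        s_next = rng.choice(n_states, p=P[s, a])  # s_{t+1} ~ P(. | s_t, a_t)
        r = R[s, a, s_next]                       # r_t = R(s_t, a_t, s_{t+1})
        traj.append((s, a, r))
        s = s_next
    return traj

print(sample_trajectory(s0=0, horizon=5))
```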

A policy is associated with a value function \(V_{\pi, t} (s) = \mathbb E\left[\sum_{i = t}^\infty \gamma^{i - t} r_i \,\middle\vert\, s_t = s\right]\). Note that the dependence of the value function on time is redundant in stationary settings.

The goal is to maximize this value function over all possible policies in an environment

\[\pi^* = \arg \max_{\pi} V_\pi (S_0)\]

This is known as the Markov Decision Problem. In a strict setting, the supremum may not be attained by a valid policy.

How is reinforcement learning different from supervised learning? The key differences are that in supervised learning

  • there is no sequentiality, and
  • every decision has an associated label (reward), with no stochasticity in the feedback.

Policies can be optimized with multiple methods

  • Hill-climbing methods - policy iteration, value iteration
  • Learning-based methods - think of a policy as a matrix that has to be optimized to satisfy certain constraints.

Bellman Equations

These are recursive formulations of policies and value functions.

\[V_\pi(s) = \sum_a \pi(s, a) \sum_{s'} p(s' \vert s, a) [R(s', a, s) + \gamma V_\pi (s')]\]

The Markov Assumption

Many processes in nature are Markovian - the action at the current state depends only on the current state, and not on the history of states.

A general policy is a function from trajectories \(H_t\) to a distribution over actions. However, such a general representation is difficult to compute. Additionally, due to the Markov assumption, a stationary policy \(\pi: S \to \Delta(A)\) is general enough for optimality. On the surface, this may seem like we are limiting the power of the agent, but we will see that this is not the case. In fact, even a deterministic policy will do.

What about the reward? Are we limiting the reward function with a Markovian assumption? Since we consider expected rewards, the expectation can be folded into state-based rewards as well. So in fact, both representations are equivalent.

With a general policy, the probability of a trajectory is given by

\[P[S_0 = s_0, A_0 = a_0, \dots, S_t = s_t] = \mu(s_0)\, \pi(a_0 \vert s_0)\, P(s_1 \vert s_0, a_0)\, \pi(a_1 \vert s_0, a_0, s_1) \cdots\]

Continuing with the Bellman equations, our goal was to find a policy that maximizes the value function

\[\begin{align*} V_\pi(s) &= \mathbb E_{P, \pi} \left[ \sum_{t = 0}^\infty \gamma^t R(S_t, A_t) \,\middle\vert\, S_0 = s \right] \\ V^*(s) &= \max_{\pi} V_\pi(s) \end{align*}\]

Firstly, does there exist a policy such that \(V_\pi(s) = V^*(s)\) for all \(s\)? An optimal policy for a particular state is guaranteed because of compactness, but is there such a policy for all states?

To aid our analysis, we also define a Q-function \(Q: S \times A \to \mathbb R\), where \(Q(s, a)\) is the expected return of taking action \(a\) at state \(s\) and following the policy thereafter. This function is very similar to the value function but conditions on the first action as well. Why have both these functions? Convenience. Analogously, we also define \(Q^*(s, a) = \max_\pi Q_\pi (s, a)\).

Theorem (Existence theorem). There exists a stationary and deterministic policy that has the optimal value for all states.

The Bellman equations are made possible by the Markovian assumptions on the policy and the state transition function. We have the following equations (writing \(\pi(s)\) for a deterministic policy):

\[\begin{align*} V_\pi(s) &= R(s, \pi(s)) + \gamma \sum_{s'} P(s' \vert s, \pi(s)) V_\pi (s') \\ Q_\pi(s, a) &= R(s, a) + \gamma \sum_{s'} P(s' \vert s, a) \sum_{a'} \pi(a' \vert s') Q_\pi(s', a') \\ V_\pi(s) &= \sum_a \pi(a \vert s) Q_\pi(s, a) \\ Q_\pi(s, a) &= R(s, a) + \gamma \sum_{s'} P(s' \vert s, a) V_\pi(s') \end{align*}\]
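
Since the Bellman expectation equations are linear in \(V_\pi\), a small tabular MDP can be evaluated exactly. A minimal sketch, assuming the same illustrative `P`, `R`, and `pi` arrays as the earlier snippet:

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9):
    """Solve V_pi = R_pi + gamma * P_pi V_pi exactly for a tabular MDP.

    P  : (S, A, S) transition probabilities P(s' | s, a)
    R  : (S, A, S) rewards, indexed here as R[s, a, s']
    pi : (S, A)    stochastic policy pi(a | s)
    """
    n_states = P.shape[0]
    # Policy-averaged transition matrix: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a)
    P_pi = np.einsum("sa,sax->sx", pi, P)
    # Policy-averaged expected reward: R_pi[s] = sum_{a, s'} pi(a|s) P(s'|s, a) R[s, a, s']
    R_pi = np.einsum("sa,sax,sax->s", pi, P, R)
    # (I - gamma * P_pi) V = R_pi has a unique solution for gamma < 1.
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # Q_pi(s, a) = E_{s'}[R] + gamma * sum_{s'} P(s'|s, a) V(s')  (last Bellman equation above)
    Q = np.einsum("sax,sax->sa", P, R) + gamma * np.einsum("sax,x->sa", P, V)
    return V, Q
```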

Summary

The equation we've been working with for value iteration, which repeatedly applies the Bellman optimality operator \(L\) to the current \(Q\)-function, is

\[Q_{t + 1}(s, a) = (L Q_t)(s, a) = R(s, a) + \gamma\, \mathbb E_{s' \sim P(\cdot \vert s, a)} \max_{a'} Q_{t}(s', a')\]

The key step here is to extract a policy from the current \(Q\) function. We noted that the optimal function \(Q^*\) satisfies the optimality equation.

Aside from the optimality equation, we have the expectation equation that every \(Q\) function satisfies

\[Q_\pi(s, a) = R(s, a) + \gamma\, \mathbb E_{s' \sim P(\cdot \vert s, a)}\, \mathbb E_{a' \sim \pi(\cdot \vert s')} Q_\pi(s', a')\]

The advantage of viewing the optimality equation as an operator \(Q_{t + 1} = L(Q_{t})\) is that we can apply contraction arguments and Banach's fixed-point theorem to arrive at \(Q^*\). This way, we prove that there is a unique optimal \(Q\) function.
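
As a concrete illustration of repeatedly applying \(L\), here is a minimal sketch of tabular Q-value iteration; the function name and stopping tolerance are illustrative, and a greedy policy can then be read off with `Q.argmax(axis=1)`.

```python
import numpy as np

def q_value_iteration(P, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Iterate Q <- L(Q) = E_{s'}[R] + gamma * E_{s'}[max_a' Q(s', a')] until convergence."""
    n_states, n_actions = P.shape[0], P.shape[1]
    expected_R = np.einsum("sax,sax->sa", P, R)  # E_{s'}[R(s, a, s')]
    Q = np.zeros((n_states, n_actions))          # Q_0 = 0
    for _ in range(max_iters):
        V = Q.max(axis=1)                        # V_t(s') = max_a' Q_t(s', a')
        Q_next = expected_R + gamma * P @ V      # Bellman optimality backup
        if np.max(np.abs(Q_next - Q)) < tol:     # sup-norm stopping criterion
            return Q_next
        Q = Q_next
    return Q
```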

Now we show that, after enough iterations, we can also get the value function \(V\) arbitrarily close to the optimal value. All these pieces together show that the value iteration algorithm works!

How do we obtain these bounds as a function of the number of iterations? We need an upper bound for \(\| Q_{t} - Q^* \|_\infty\). Since \(L\) is a \(\gamma\)-contraction, \(\| Q_{t} - Q^* \|_\infty \leq \gamma^t \|Q_0 - Q^* \|_\infty\). Assuming we start with \(Q_0\) all zeroes, the maximum value of \(Q^*\) is simply \(R_{\max}/(1 - \gamma) = (\max_{s, a} \vert R(s, a)\vert)/(1 - \gamma)\), so \(\|Q_0 - Q^*\|_\infty \leq R_{\max}/(1 - \gamma)\).

Lemma. \(\|V_m - V^*\|_\infty \leq \frac{2}{1 - \gamma}\| Q_m - Q^*\|_\infty\).

Proof.

\[\begin{align*} V^*(s) - V_m(s) &= Q^*(s, \pi^*(s)) - Q_m(s, a = \pi_m(s)) \\ &= Q^*(s, \pi^*(s)) - Q^*(s, a) + Q^*(s, a) - Q_m(s, a) \\ &= Q^*(s, \pi^*(s)) - Q^*(s, a) + \gamma \mathbb E_{s'} [V^*(s') - V_m(s')] \\ &\leq Q^*(s, \pi^*(s)) - Q^*(s, a) + \gamma \|V^* - V_m\|_\infty \\ &\leq (Q^*(s, \pi^*(s)) - Q_m(s, \pi^*(s))) + (Q_m(s, \pi^*(s)) - Q^*(s, a)) + \gamma \|V^* - V_m \|_\infty \\ &\leq (Q^*(s, \pi^*(s)) - Q_m(s, \pi^*(s))) + (Q_m(s, a) - Q^*(s, a)) + \gamma \|V^* - V_m \|_\infty \\ &\leq 2\|Q^* - Q_m\|_\infty + \gamma \|V^* - V_m\|_\infty \\ \implies \|V^* - V_m \|_\infty &\leq \frac{2}{1 - \gamma} \|Q^* - Q_m \|_\infty \end{align*}\]

Policy Iteration

Instead of modifying the policy based on the current value, why not do it the other way round - iterate over policies, compute the value of each one, and improve it again? There is a subtle difference compared to the previous algorithm, and it turns out that this method is much more efficient!

\[\pi_0 \underset{Q_0}{\longrightarrow} \pi_1 \longrightarrow \cdots \longrightarrow \pi_k \underset{Q_k}{\longrightarrow} \pi_{k + 1}\]

Each iteration of policy iteration takes \(\mathcal O(\vert S\vert^3 + \vert S \vert^2 \vert A\vert)\) time, whereas each iteration of value iteration takes \(\mathcal O(\vert S \vert^2 \vert A\vert)\).
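
A minimal sketch of policy iteration, assuming the illustrative `evaluate_policy` helper from the policy-evaluation snippet above; the evaluation step is the \(\mathcal O(\vert S\vert^3)\) linear solve and the improvement step is the \(\mathcal O(\vert S\vert^2 \vert A\vert)\) greedy update.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iters=1_000):
    """Alternate exact policy evaluation with greedy policy improvement."""
    n_states, n_actions = P.shape[0], P.shape[1]
    # Start from an arbitrary deterministic policy, stored one-hot as pi(a | s).
    pi = np.zeros((n_states, n_actions))
    pi[:, 0] = 1.0
    for _ in range(max_iters):
        _, Q = evaluate_policy(P, R, pi, gamma)  # evaluation step
        greedy = Q.argmax(axis=1)                # improvement step
        pi_next = np.eye(n_actions)[greedy]      # back to one-hot pi(a | s)
        if np.array_equal(pi_next, pi):          # policy stable -> optimal
            return pi, Q
        pi = pi_next
    return pi, Q
```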

Model-free methods

Policy iteration and value iteration are closely related to each other. For both algorithms, we need to evaluate a policy to find the corresponding value function. However, in many cases we do not know the exact transition and reward functions. In other cases, the environment can have a very large number of states, making it infeasible to model explicitly.

For such situations, we rely on Monte Carlo methods. Any method that solves a problem by generating suitable random samples and observing the fraction of samples that obey some property can be classified as a Monte Carlo method. The key idea is to build a heuristic estimator from sampled data. These methods do not rely heavily on the Markov assumption, making them much more generalizable.

The idea is to learn directly from episodes of experience without prior knowledge of the MDP transitions (or rewards). Since the idea relies on episodes, one caveat is that it can only be applied to episodic MDPs - episodes have to terminate.

Prediction Problem

Let us consider the first problem - estimating \(V_\pi(s)\) for a state. Instead of updating after every action, we update after each episode by taking the mean return across all sampled trajectories that visit this state. The first-visit algorithm is given by

\[\tilde V_\pi(s) = \frac{1}{m} \sum_{i = 1}^m G_i^s\]

where \(G_i^s\) is the total return after \(s\) first appears in episode \(i\). Even though the number of states is large, it is nonetheless finite. Using this fact, we can show theoretically that the error of the estimate above can be bounded. The convergence time also depends on the underlying transition probabilities (rare states require more episodes before they appear in a trajectory).
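
A minimal sketch of the first-visit estimator, assuming each episode is given as a list of `(state, reward)` pairs (an illustrative format, not specified in the text). Dropping the `seen` check turns it into the every-visit variant discussed below.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """Estimate V_pi(s) as the mean return G following the first visit to s."""
    returns = defaultdict(list)                # s -> list of first-visit returns G_i^s
    for episode in episodes:                   # episode: [(s_0, r_0), (s_1, r_1), ...]
        # Compute the return from every time step by scanning backwards.
        G, G_from = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            G_from[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                  # first visit only
                seen.add(s)
                returns[s].append(G_from[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```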

There are more questions to answer. For example, is this a biased or unbiased estimator?

\[\begin{align*} \mathbb E(\tilde V_\pi(s)) &= \frac{1}{m} \sum_i \mathbb E(G_i^s) \\ &= V_\pi(s) \quad \because \text{ Markovian assumption} \end{align*}\]

How about a second-visit or every-visit algorithm? They are valid approaches too; the theoretical analysis varies slightly. The estimators may not be unbiased, but they use the data more efficiently, which is useful when sampling is expensive. The every-visit algorithm typically has higher bias (due to dependencies between occurrences within an episode) but lower variance and higher data efficiency.

These differences are important to understand. For example, in the latest o3-mini model, they observed that the every-visit variant of an RL algorithm obtained much better performance than the first-visit variant.

In contrast, consider estimating the \(Q_\pi(s, a)\) function. A policy might choose only a certain action for a given state. However, to estimate all the Q-values, the agent must explore all actions at different states. This exploration is captured by the epsilon-greedy class of algorithms.
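
A minimal sketch of epsilon-greedy action selection over an illustrative tabular `Q` array:

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=None):
    """With probability epsilon explore uniformly, otherwise exploit argmax_a Q(s, a)."""
    rng = rng if rng is not None else np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))  # explore: uniform random action
    return int(Q[s].argmax())                # exploit: greedy action
```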

Temporal Difference TD(\(\lambda\)) Algorithms

The idea behind this class of algorithms is to improve our value estimates as we keep exploring the environment, rather than waiting for entire episodes to finish.

The first variant of these algorithms is TD(0) - we sample state transitions from the trajectory (one at a time) and update the existing value function based on the action and reward observed. Formally, given \(V_t(\cdot)\) and a sample from the trajectory \((s_t, a_t, r_t, s_{t + 1})\), how do we obtain \(V_{t + 1}(s_t)\)?

From the Bellman’s equations, we have

\[V_\pi(s) = \mathbb E_{a \sim \pi(s)} [R(s, a) + \gamma \mathbb E_{s' \sim P(\cdot \vert s, a)} V_\pi(s')]\]

So, can we do

\[V_{t + 1}(s_t) \gets r_t + \gamma V_t(s_{t + 1})\]

This equation discards our previous estimate \(V_t(s_t)\). How do we use it? We can do a form of averaging, akin to a gradient-descent update

\[\begin{align*} V_{t + 1}(s_t) &\gets V_t(s_t) + \alpha_{s_t} \delta_t \\ \delta_t &= r_t + \gamma V_t(s_{t + 1}) - V_t(s_t) \end{align*}\]

In essence, we are averaging over the previous visits in the trajectory but with slightly different update rules.
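
A minimal sketch of the tabular TD(0) update applied along a trajectory in the `(s, a, r)` format used earlier; the constant step size `alpha` is an illustrative choice (schedules are discussed next).

```python
import numpy as np

def td0(trajectory, n_states, gamma=0.9, alpha=0.1):
    """Tabular TD(0): V(s_t) <- V(s_t) + alpha * (r_t + gamma * V(s_{t+1}) - V(s_t))."""
    V = np.zeros(n_states)
    # Stop one step early: the last tuple has no recorded successor state.
    for t in range(len(trajectory) - 1):
        s, _, r = trajectory[t]
        s_next = trajectory[t + 1][0]
        delta = r + gamma * V[s_next] - V[s]  # TD error delta_t
        V[s] += alpha * delta
    return V
```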

How do we choose the \(\alpha\)'s? For convergence purposes, we require \(\sum_{t = 0}^{\infty} \alpha(t) = \infty\) and \(\sum_{t = 0}^\infty \alpha^2(t) < \infty\). These parameters are important for controlling the bias and variance of the estimators, as we will discuss later. Possible learning schedules include \(\alpha(t) = \frac{1}{t^{0.5 + \epsilon}}\) with \(0 < \epsilon \leq 0.5\) and \(\alpha(t) = \frac{1}{\sqrt{t}\log t}\).
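
For illustration, here are those two schedules written as (assumed) Python helpers; both satisfy the summability conditions above.

```python
import math

def alpha_power(t, eps=0.25):
    """1 / t^(0.5 + eps) with 0 < eps <= 0.5: the sum diverges, the sum of squares converges."""
    return 1.0 / (t ** (0.5 + eps))

def alpha_sqrt_log(t):
    """1 / (sqrt(t) * log(t)), valid for t >= 2."""
    return 1.0 / (math.sqrt(t) * math.log(t))
```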

With these learning schedules, one can prove convergence by viewing the update as a stochastic approximation of a contraction; the iterates converge to the fixed point of the Bellman equation, with a contraction factor governed by \(\gamma\).

To compare various statistical approaches (such as Monte Carlo methods), we need to take the bias, variance, convergence, and sample complexity into account. To improve on some of these criteria, we have the general TD(\(\lambda\)) algorithms.

Instead of considering a single transition, the general algorithm considers more transitions to update the value function.

\[V^k_t = r_t + \gamma r_{t + 1} + \dots + \gamma^{k} r_{t + k} + \gamma^{k + 1} V_t(s_{t + k + 1})\]

What is the advantage of this approach? Using more real transitions reduces the bias introduced by bootstrapping, at the cost of slower updates and higher variance. Another way to understand TD(\(\lambda\)) is to think of it as a combination of temporal-difference and Monte Carlo learning: it is an average of \(k\)-step returns, a kind of truncated Monte Carlo method.

What's more, we can generalize the above equation a bit further: instead of a single \(k\), we can consider multiple \(k\)-step returns and take a weighted average of them.

\[V_t^\lambda(s_t) = (1 - \lambda)\sum_{k = 0}^\infty \lambda^k V_t^k(s_t)\]

where the update algorithm is

\[V_{t + 1}(s_t) \gets V_t(s_t) + \alpha_t (V_t^{\lambda} (s_t) - V_t(s_t))\]

This reduces the variance because we are taking a geometrically weighted mean: since longer windows \(k\) have higher variance, we reduce their weight in the average via \(\lambda\).
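
A minimal sketch of the forward-view \(\lambda\)-return for a single time step of a finished episode; truncating the returns past the episode boundary to the full Monte Carlo return is a common convention and an assumption here.

```python
import numpy as np

def lambda_return(episode, V, t, gamma=0.9, lam=0.8):
    """Forward view: V_t^lambda(s_t) = (1 - lam) * sum_k lam^k * V_t^k(s_t).

    episode : list of (s, a, r) tuples; V : current tabular value estimates.
    """
    T = len(episode)
    k_returns, G = [], 0.0
    for k in range(T - t):                      # k-step returns V_t^k, k = 0 .. T - t - 1
        _, _, r_k = episode[t + k]
        G += (gamma ** k) * r_k
        if t + k + 1 < T:
            bootstrap = (gamma ** (k + 1)) * V[episode[t + k + 1][0]]
        else:
            bootstrap = 0.0                     # episode terminated: pure Monte Carlo return
        k_returns.append(G + bootstrap)
    # Geometric weights; the leftover tail weight goes to the final (Monte Carlo) return.
    weights = (1 - lam) * lam ** np.arange(len(k_returns))
    weights[-1] = lam ** (len(k_returns) - 1)
    return float(np.dot(weights, k_returns))
```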

In practice, when we reach a new state, we take our current estimate of \(V_t\) and update all our previous calculations to recompute the value function. The convergence arguments for these algorithms are not fully rigorous mathematically; finding better proofs, or better yet, more efficient algorithms, is a good area of research.

How do we execute these algorithms in practice? Theoretically, setting the upper limit of \(k\) to \(\infty\) gives the best estimate. However, \(\lambda\) acts like a discount factor, and if the rewards are sparse (arriving towards the end of the episode), then convergence would take a long time, just as for episodic algorithms.

In essence, this forward view of the algorithm requires many forward executions of the simulation. Since this is often infeasible in practice, a backward version of the algorithm is used instead.

We define the eligibility trace of a state as \(e_t(s) = \gamma \lambda e_{t - 1}(s) + \mathbb 1[s_t = s]\): every trace decays by \(\gamma\lambda\), and the trace of the currently visited state is incremented by one. The update equation becomes

\[V_{t + 1}(s) \gets V_t(s) + \alpha \delta_t e_t(s)\]

where \(\delta_t = r_t + \gamma V_t(s_{t + 1}) - V_t(s_t)\) is the usual TD error.

These equations are almost the same as the forward-view algorithm but admit a more practical implementation. (When you expand the formulae and interchange the summations, this is what you get.)
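
A minimal sketch of the backward view with accumulating eligibility traces, processing a trajectory online in the same `(s, a, r)` format:

```python
import numpy as np

def td_lambda_backward(trajectory, n_states, gamma=0.9, lam=0.8, alpha=0.1):
    """Backward-view TD(lambda) with accumulating eligibility traces."""
    V = np.zeros(n_states)
    e = np.zeros(n_states)                    # eligibility trace e_t(s)
    for t in range(len(trajectory) - 1):
        s, _, r = trajectory[t]
        s_next = trajectory[t + 1][0]
        e *= gamma * lam                      # decay every trace
        e[s] += 1.0                           # ...and bump the visited state
        delta = r + gamma * V[s_next] - V[s]  # TD error, as in TD(0)
        V += alpha * delta * e                # update every state in proportion to its trace
    return V
```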



