Reinforcement Learning Theory
Introduction
In this article, we aim to understand the theory of reinforcement learning in the context of Large Language Models.
Reinforcement learning addresses the domain of sequential decision problems, wherein an agent takes actions in an environment. These setups are generally represented using Markov Decision Processes (MDPs). An MDP is characterized by -
- Time - T - Discrete, infinite
- States - S - Discrete
- Actions - A - Discrete
- Transitions - \(\tau : S \times A \to \Delta(S)\) - a probability distribution over next states
\[P(s' \vert s, a) = \Pr[ \tau (s, a) = s']\]In more complicated setups, the transition function could also depend on the time \(t\) (stationary versus non-stationary dynamics). In finite-time horizons, the time can be embedded inside the state itself, converting non-stationary scenarios into stationary ones.
- Reward - \(R: S \times A \times S \to [-M, M]\). There is no need to make the reward non-deterministic, since that is already generalized by \(\tau\). The rewards are usually bounded. The expected reward for an action is given by
\[R'(s, a) = \sum_{s' \in S} R(s', a, s)\, p(s' \vert s, a)\]
- Initial state \(S_0\) - Can be a single state or a distribution over states.
- Discount Factor \(\gamma\) - A factor \(<1\) that bounds the total expected reward from the future. This will become clearer later in the text.
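To make these definitions concrete, here is a minimal sketch of a tabular MDP container; the class name, fields, and NumPy layout are illustrative assumptions rather than a standard API.

```python
import numpy as np

class TabularMDP:
    """A finite MDP with |S| states and |A| actions.

    P[s, a, s'] = probability of moving to s' after taking a in s.
    R[s, a, s'] = reward for that transition (bounded in [-M, M]).
    """

    def __init__(self, P, R, mu0, gamma=0.99):
        self.P = np.asarray(P)        # shape (S, A, S)
        self.R = np.asarray(R)        # shape (S, A, S)
        self.mu0 = np.asarray(mu0)    # initial state distribution, shape (S,)
        self.gamma = gamma

    def expected_reward(self, s, a):
        # R'(s, a) = sum_{s'} R(s, a, s') p(s' | s, a)
        return self.P[s, a] @ self.R[s, a]

    def step(self, s, a, rng=np.random):
        # Sample s' ~ P(. | s, a) and return (s', reward).
        s_next = rng.choice(len(self.mu0), p=self.P[s, a])
        return s_next, self.R[s, a, s_next]
```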
A policy \(\pi\) is a probability distribution over actions based on the current state. Consider a distribution \(\beta: S \to \Delta(A)\); then a policy formally is
\[\pi(s, a) = \Pr[\beta(s) = a]\]A trajectory describes the sequence of states, actions and rewards of an agent in a run - \(s_0, a_0, r_0, s_1, a_1, r_1, \dots\).
A policy is associated with a value function \(V_{\pi, t} (s) = \mathbb E\left[\sum_{i = t}^\infty \gamma^{i - t} r_i \,\middle\vert\, s_t = s\right]\). Note that the dependence of the value function on time is redundant in stationary situations.
The goal is to maximize this value function over all possible policies in an environment
\[\pi^* = \arg \max_{\pi} V_\pi (S_0)\]This is known as the Markov Decision Problem. In a strict setting, the supremum may not be attained (by a valid policy).
How is Reinforcement Learning different from Supervised Learning (SL)? The key differences are
- In SL, there is no sequentiality
- In SL, every decision has an associated reward (label), and there is no stochasticity
Policies are optimized using multiple methods
- Hill-climbing methods - policy iteration, value iteration
- Learning-based methods - think of a policy as a matrix that has to be optimized to satisfy certain constraints.
Bellman Equations
These are recursive formulations of policies and value functions.
\[V_\pi(s) = \sum_a \pi(s, a) \sum_{s'} p(s' \vert s, a) \left[R(s', a, s) + \gamma V_\pi (s')\right]\]
The Markov Assumption
Many processes in nature are Markovian - the future depends only on the current state (and action), and not on the history of states.
A general policy is a function from trajectories (histories) \(H_t\) to a distribution over actions. However, such a general representation is difficult to compute with. Additionally, due to the Markov assumption, a stationary policy \(\pi: S \to \Delta(A)\) is general enough for optimality. On the surface, this may seem like we are limiting the power of the agent, but we will see that this is not the case. In fact, even a deterministic policy will do.
How about the reward? Are we limiting the reward function with a Markovian assumption? Since we consider expected rewards, the expectation can be embedded within state-based rewards as well. So, in fact, both representations are equivalent.
With a general policy, the probability of a trajectory is given by
\[P[S_0 = s_0, A_0 = a_0, \dots, S_t = s_t] = \mu(s_0)\, \pi(a_0 \vert s_0)\, P(s_1 \vert s_0, a_0)\, \pi(a_1 \vert s_0, a_0, s_1) \dots\]Continuing with the Bellman equations, our goal was to find a policy that maximizes the value function
\[\begin{align*} V_\pi(s) &= \mathbb E_{P, \pi} \left[\sum_{t = 0}^\infty \gamma^t R(S_t, A_t) \,\middle\vert\, S_0 = s\right] \\ V^*(s) &= \max_{\pi} V_\pi(s) \end{align*}\]Firstly, does there exist a policy such that \(V_\pi(s) = V^*(s)\) for all \(s\)? An optimal policy for a particular state is guaranteed because of compactness, but is there a single such policy for all states?
To aid our analysis, we also define a Q-function \(Q: S \times A \to \mathbb R\) that describes the expected return \(Q(s, a)\) of taking action \(a\) at state \(s\). This function is very similar to the value function but additionally conditions on the action. Why have both these functions? Convenience. Analogously, we also define \(Q^*(s, a) = \max_\pi Q_\pi (s, a)\).
Theorem (Existence theorem). There exists a stationary and deterministic policy that has the optimal value for all states.
Therefore, the Bellman equations are possible due to the Markovian nature of the policy and the state transition function. We have the following equations
\[\begin{align*} V_\pi(s) &= R(s, \pi(s)) + \gamma \sum_{s'} P(s' \vert s, \pi(s)) V_\pi (s') \\ Q_\pi(s, a) &= R(s, a) + \gamma \sum_{s'} P(s' \vert s, a) \sum_{a'} \pi(a' \vert s') Q_\pi(s', a') \\ V_\pi(s) &= \sum_a \pi(a \vert s) Q_\pi(s, a) \\ Q_\pi(s, a) &= R(s, a) + \gamma \sum_{s'} P(s' \vert s, a) V_\pi(s') \end{align*}\]
Summary
The equation we’ve been working with for value iteration is
\[Q_{t + 1}(s, a) = L(Q_t) = R(s, a) + \gamma\, \mathbb E_{s' \sim P(\cdot \vert s, a)} \max_{a'} Q_{t}(s', a')\]The key step here is to extract a policy from the current \(Q\) function. We noted that the optimal function \(Q^*\) satisfies the optimality equation.
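Here is a minimal value iteration sketch that applies the operator \(L\) until the change falls below a tolerance, reusing the illustrative TabularMDP container from earlier (still an assumption, not a reference implementation).

```python
import numpy as np

def value_iteration(mdp, tol=1e-8, max_iters=10_000):
    """Iterate Q <- L(Q) until the sup-norm change is below tol."""
    S, A = mdp.P.shape[:2]
    Q = np.zeros((S, A))
    R_exp = np.einsum("sap,sap->sa", mdp.P, mdp.R)     # expected rewards R'(s, a)
    for _ in range(max_iters):
        V = Q.max(axis=1)                              # V(s') = max_a' Q(s', a')
        Q_new = R_exp + mdp.gamma * (mdp.P @ V)        # L(Q): (S, A, S) @ (S,) -> (S, A)
        if np.max(np.abs(Q_new - Q)) < tol:            # contraction => geometric convergence
            return Q_new, Q_new.argmax(axis=1)
        Q = Q_new
    return Q, Q.argmax(axis=1)                         # Q estimate and its greedy policy
```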
Aside from the optimality equation, we have the expectation equation that every \(Q\) function satisfies
\[Q_\pi(s, a) = R(s, a) + \gamma\, \mathbb E_{s'} \mathbb E_{a' \sim \pi(\cdot \vert s')} Q_\pi(s', a')\]The advantage of viewing the optimality equation as an operator \(Q_{t + 1} = L(Q_{t})\) is that we can apply contraction arguments and arrive at \(Q^*\) via Banach's fixed point theorem. This way, we prove that there is a unique optimal \(Q\) function.
Now we show that, after enough iterations, we can also get the value function \(V\) arbitrarily close to the optimal value. All these subtleties together show that the value iteration algorithm works!
How do we obtain these bounds in terms of the number of iterations? We need an upper bound on \(\| Q_{t} - Q^* \|_\infty\). Since \(L\) is a \(\gamma\)-contraction, this value is \(\leq \gamma^t \|Q_0 - Q^* \|_\infty\). Assuming we start with a \(Q\) of all zeroes, the maximum value of \(Q^*\) is simply \(R_{\max}/(1 - \gamma) = (\max_{s, a} \vert R(s, a)\vert)/(1 - \gamma)\).
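Putting the contraction bound together with this bound on \(\|Q_0 - Q^*\|_\infty\) gives a standard back-of-the-envelope estimate (assuming \(Q_0 = 0\)) of how many iterations suffice for an \(\epsilon\)-accurate \(Q\):
\[\|Q_t - Q^*\|_\infty \leq \gamma^t \|Q_0 - Q^*\|_\infty \leq \gamma^t \frac{R_{\max}}{1 - \gamma} \leq \epsilon \quad \text{whenever} \quad t \geq \frac{\log\left(R_{\max}/(\epsilon(1 - \gamma))\right)}{\log(1/\gamma)}\]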
Lemma. \(\|V_m - V^*\|_\infty \leq \frac{2}{1 - \gamma}\| Q_m - Q^*\|_\infty\).
Proof.
\[\begin{align*} V^*(s) - V_m(s) &= Q^*(s, \pi^*(s)) - Q_m(s, a = \pi_m(s)) \\ &= Q^*(s, \pi^*(s)) - Q^*(s, a) + Q^*(s, a) - Q_m(s, a) \\ &= Q^*(s, \pi^*(s)) - Q^*(s, a) + \gamma \mathbb E_{s'} (V^*(s') - V_m(s')) \\ &\leq Q^*(s, \pi^*(s)) - Q^*(s, a) + \gamma \|V^* - V_m\|_\infty \\ &\leq (Q^*(s, \pi^*(s)) - Q_m(s, \pi^*(s))) + (Q_m(s, \pi^*(s)) - Q^*(s, a)) + \gamma \|V^* - V_m \|_\infty \\ &\leq (Q^*(s, \pi^*(s)) - Q_m(s, \pi^*(s))) + (Q_m(s, a) - Q^*(s, a)) + \gamma \|V^* - V_m \|_\infty \\ &\leq 2\|Q^* - Q_m\|_\infty + \gamma \|V^* - V_m\|_\infty \\ \implies \|V^* - V_m \|_\infty &\leq \frac{2}{1 - \gamma} \|Q^* - Q_m \|_\infty \end{align*}\]
Policy Iteration
Instead of modifying the policy based on the current value, why not do it the other way round? Iterate over the policy, get its value, and improve it again. There is a subtle difference compared to the previous algorithm, and it turns out that this method is much more efficient!
\[\pi_0 \underset{Q_0}{\longrightarrow} \pi_1 \to \cdots \to \pi_k \underset{Q_k}{\longrightarrow} \pi_{k + 1}\]Each iteration of policy iteration takes \(\mathcal O(\vert S\vert^3 + \vert S \vert^2 \vert A\vert)\), whereas an iteration of value iteration takes \(\mathcal O(\vert S \vert^2 \vert A\vert)\).
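A corresponding policy iteration sketch, again over the illustrative TabularMDP container assumed above, alternates an exact evaluation (a linear solve, the \(\mathcal O(\vert S\vert^3)\) step) with greedy improvement.

```python
import numpy as np

def policy_iteration(mdp, max_iters=1_000):
    """Alternate exact policy evaluation (a linear solve) and greedy improvement."""
    S, A = mdp.P.shape[:2]
    R_exp = np.einsum("sap,sap->sa", mdp.P, mdp.R)   # expected rewards R'(s, a)
    pi = np.zeros(S, dtype=int)                      # start with an arbitrary policy
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = R_pi.
        P_pi = mdp.P[np.arange(S), pi]               # (S, S)
        R_pi = R_exp[np.arange(S), pi]               # (S,)
        V = np.linalg.solve(np.eye(S) - mdp.gamma * P_pi, R_pi)
        # Policy improvement: act greedily w.r.t. the one-step lookahead.
        Q = R_exp + mdp.gamma * (mdp.P @ V)          # (S, A)
        pi_new = Q.argmax(axis=1)
        if np.array_equal(pi_new, pi):               # stable policy => optimal
            return pi, V
        pi = pi_new
    return pi, V
```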
Model-free methods
Policy iteration and value iteration are closely related to each other. For both algorithms, we need to evaluate a policy to find the corresponding value function. However, in many cases, we do not know the exact transition and reward functions. In other cases, the environment can have a large number of states, making it infeasible to model exactly.
For such situations, we rely on Monte Carlo methods. Any method that solves a problem by generating suitable random numbers and observing that a fraction of them obey some property can be classified as a Monte Carlo method. The key idea here is to use sampling to build estimators. These methods make little use of the Markov assumption, which makes them much more generalizable.
The idea is to learn directly from episodes of experience without prior knowledge of the MDP transitions (or rewards). Since the idea relies on episodes, one caveat is that it can only be applied to episodic MDPs - episodes have to terminate.
Prediction Problem
Let us consider the first problem - estimating \(V_\pi(s)\) for a state. Instead of updating after every action, we update after each episode by taking the mean return across all sampled trajectories that visit this state. The first-visit algorithm is given by
\[\tilde V_\pi(s) = \frac{1}{m} \sum_{i = 1}^m G_i^s\]where \(G_i^s\) is the total (discounted) reward after \(s\) first appears in episode \(i\). Even though the number of states is large, it is nonetheless finite. Using this fact, we can show theoretically that the error of this estimate can be bounded. The convergence time is also tied to the underlying transition probabilities (rare states require more episodes to appear in the trajectory).
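Below is a small sketch of the first-visit estimator; the episode format (a list of (state, reward) pairs generated while following \(\pi\)) is an assumption for illustration.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=0.99):
    """Estimate V_pi(s) from sampled episodes, using only first visits."""
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the discounted return from every time step, backwards.
        G, G_at = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][1] + gamma * G
            G_at[t] = G
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:                # first-visit check
                seen.add(s)
                returns[s].append(G_at[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```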
There are more questions to answer. For example, is this a biased or unbiased estimator?
\[\begin{align*} \mathbb E(\tilde V_\pi(s)) &= \frac{1}{m} \sum_i \mathbb E(G_i^s) \\ &= V_\pi(s) \quad \because \text{ Markovian assumption} \end{align*}\]How about a second-visit or every-visit algorithm? They are valid approaches too; the theoretical analysis varies slightly. The estimators may not be unbiased, but they use the data more efficiently, which is useful when sampling is expensive. The every-visit algorithm typically has higher bias (due to dependencies between occurrences within an episode) but lower variance and higher sample efficiency.
These differences are important to understand. For example, in the latest o3-mini model, they observed that the every-visit variant of an RL algorithm obtained much better performance than the first-visit variant.
In contrast, consider estimating the \(Q_\pi(s, a)\) function. A policy might choose only certain actions for a given state. However, to estimate all the Q-values, we must also explore the other actions at each state. This exploration is handled by the epsilon-greedy class of algorithms.
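A minimal epsilon-greedy action-selection sketch; the Q_row argument (a list of current Q-estimates for one state) is assumed for illustration.

```python
import random

def epsilon_greedy(Q_row, epsilon):
    """Pick an action index for one state: explore w.p. epsilon, else act greedily."""
    if random.random() < epsilon:
        return random.randrange(len(Q_row))                     # uniform exploration
    return max(range(len(Q_row)), key=lambda a: Q_row[a])       # exploitation
```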
Temporal Difference TD(\(\lambda\)) Algorithms
The idea of this class of algorithms is to improve the policy as we keep exploring the environment more.
The first variant of these algorithms is TD(0) - we sample one state transition at a time from the trajectory and update the existing value function based on the action and the reward obtained. Formally, given \(V_t(\cdot)\) and a sample from the trajectory \((s_t, a_t, r_t, s_{t + 1})\), how do we obtain \(V_{t + 1}(s_t)\)?
From the Bellman equations, we have
\[V_\pi(s) = \mathbb E_{a \sim \pi(s)} [R(s, a) + \gamma \mathbb E_{s' \sim P(\cdot \vert s, a)} V_\pi(s')]\]So, can we do
\[V_{t + 1}(s_t) \gets r_t + \gamma V_t(s_{t + 1})\]This update discards our previous estimate \(V_t(s_t)\). How do we use it? We can do a form of averaging, akin to a gradient-descent step
\[\begin{align*} V_{t + 1}(s_t) &\gets V_t(s_t) + \alpha_{s_t} \delta_t \\ \delta_t &= r_t + \gamma V_t(s_{t + 1}) - V_t(s_t) \end{align*}\]In essence, we are averaging over the previous visits in the trajectory, but with slightly different update weights.
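This one-line update translates into a very small function; the dict-based value table and the transition tuple format are illustrative assumptions.

```python
def td0_update(V, transition, alpha, gamma=0.99):
    """One TD(0) step on a dict-based value table V, given (s, a, r, s_next)."""
    s, _, r, s_next = transition
    delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # TD error
    V[s] = V.get(s, 0.0) + alpha * delta
    return delta
```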
How do we choose the \(\alpha\)'s? For convergence, we require \(\sum_{t = 0}^{\infty} \alpha(t) = \infty\) and \(\sum_{t = 0}^\infty \alpha^2(t) < \infty\). These parameters are important for controlling the bias and variance of the estimators, as we will discuss later. Possible learning schedules include \(\alpha(t) = \frac{1}{t^{0.5 + \epsilon}}\) with \(0 < \epsilon \leq 0.5\) and \(\alpha(t) = \frac{1}{\sqrt{t}\log t}\).
With these learning schedules, one can prove convergence by viewing the update as a contraction; the iterates converge to the solution of the Bellman equation, with a contraction factor related to \(\gamma\).
To compare various statistical approaches (Monte Carlo methods), we need to take the bias, variance, convergence rate and sampling complexity into account. To improve on some of these criteria, we have the general TD(\(\lambda\)) algorithms.
Instead of considering a single transition, the general algorithm considers more transitions to update the value function.
\[V^k_t(s_t) = r_t + \gamma r_{t + 1} + \dots + \gamma^{k} r_{t + k} + \gamma^{k + 1} V_t(s_{t + k + 1})\]What is the advantage of this approach? We slowed down the updates, which seemingly increases the bias but may decrease the variance. Another way to understand TD(\(\lambda\)) is to think of it as a combination of temporal difference and Monte Carlo learning - an average of \(k\)-step returns, a kind of truncated Monte Carlo method.
What's more? We can generalize the above-mentioned equation a bit further. Instead of a single \(k\), we can consider all of them and take a weighted average.
\[V_t^\lambda(s_t) = (1 - \lambda)\sum_{k = 0}^\infty \lambda^k V_t^k(s_t)\]where the update rule is
\[V_{t + 1}(s_t) \gets V_t(s_t) + \alpha_t (V_t^{\lambda} (s_t) - V_t(s_t))\]This reduces the variance because we are taking a geometrically weighted mean. Since longer windows \(k\) have higher variance, we reduce their weight in the average via \(\lambda\).
In practice, when we reach a new state, we take our current estimate of \(V_t\) and update all our previous calculations to recompute the new value function. The convergence proofs for these algorithms are not fully rigorous; it is a good area of research to find better proofs, or better yet, more efficient algorithms.
How do we execute these algorithms in practice? Theoretically, setting the upper limit of \(k\) to \(\infty\) gives the best estimate. However, \(\lambda\) acts like a discount factor, and if the rewards are sparse (concentrated towards the end of the episode), then convergence can take a long time for episodic algorithms.
In essence, this forward-view version of the algorithm requires many forward executions of the simulation. Since this is infeasible in many practical cases, developers have started using a backward view of the algorithm.
We define the eligibility trace of a state as \(e_t(s) = \gamma \lambda e_{t - 1}(s) + \mathbb 1[S_t = s]\). The update equation becomes
\[V_{t + 1}(s) \gets V_t(s) + \alpha \delta_t e_t(s)\]where \(\delta_t = r_t + \gamma V_t(S_{t + 1}) - V_t(S_t)\).
These equations are almost the same as the previous algorithm, but lend themselves to a more practical implementation. (When you expand the formulae and interchange the summations, this is what you get.)
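Here is a hedged sketch of this backward view with eligibility traces; env_step and the dict-based tables are assumptions for illustration, not a particular library's API.

```python
from collections import defaultdict

def td_lambda_episode(env_step, start_state, V, alpha, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) over one episode.

    env_step(s) is assumed to return (reward, next_state, done) while following
    the policy being evaluated; V is a dict-based value table updated in place.
    """
    e = defaultdict(float)                        # eligibility traces
    s, done = start_state, False
    while not done:
        r, s_next, done = env_step(s)
        delta = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)   # TD error
        # e_t(x) = gamma * lam * e_{t-1}(x) + 1[x == s]
        for x in list(e):
            e[x] *= gamma * lam
        e[s] += 1.0
        # Every recently visited state receives a share of the TD error.
        for x, trace in e.items():
            V[x] = V.get(x, 0.0) + alpha * delta * trace
        s = s_next
    return V
```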
Policy Gradient Methods
The algorithms we have seen so far are for discrete MDPs; that is, the state-action space is discrete. However, many real-life applications deal with high-dimensional spaces or continuous representations, as we discussed before. In such cases, we need a more general class of algorithms to arrive at optimal policies in a computationally feasible manner.
Policy Representations
The generalized representation that seems to have worked for us is a parameterized distribution for \(V, Q\) or \(\pi\). For example, \(\pi\) could be a Gaussian probability distribution with \(\mu, \sigma\) as the parameters. With this, we move from tabular representations to a more generalized one.
How do we convert representations to probabilities? One simple idea is as follows. Consider a policy parameterized by \(\theta \in \mathbb R^d\). Then, for every (possibly continuous) state, think of mapping the actions to a real vector space
\[\phi_s : A \to \mathbb R^d\]The probabilities can be obtained from \(\text{softmax}(\phi_s(a)^\top \theta)\). This is one of the simplest, linear approaches to arrive at the probabilities.
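As a sketch, this linear softmax parameterization could look like the following; the array shapes are assumptions for illustration.

```python
import numpy as np

def softmax_policy(phi_s, theta):
    """Action probabilities pi(a | s) = softmax_a(phi_s(a)^T theta).

    phi_s: array of shape (num_actions, d), one feature vector per action.
    theta: parameter vector of shape (d,).
    """
    logits = phi_s @ theta
    logits -= logits.max()          # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```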
There are some things to keep in mind when working with this new paradigm. If the representation space is not chosen properly, a small change in parameters can produce two different policies whose value functions differ greatly. This can cause stability issues and other problems when using policy gradient approaches.
Policy updates
So, with this representation, how do we obtain the optimal policy?
The objective function is of the form \(J(\theta) = \sum_s \mu(s) V_\pi(s)\), where \(\mu\) is the initial state distribution. You can imagine the policy being a neural network with weights \(\theta\), and how this approach scales up.
\[\begin{align*} \nabla J(\theta) &= \frac{\partial J(\theta)}{\partial \theta} \\ &= \sum_s \mu(s) \frac{\partial V_\pi(s)}{\partial \theta} \\ &= \sum_s d^{\pi}(s) \sum_a Q_\pi(s, a) \frac{\partial}{\partial \theta} \pi(s, a, \theta) \end{align*}\]where \(d^\pi(s) = \sum_{k \geq 0} \gamma^k \Pr(S_k = s)\) - a discounted measure of how likely the state \(s\) is to appear in the Markov chain. Although it seems complicated, it can be computed easily.
Do we sum across all states and actions again? We are trading space for computation time.
Let us simplify the equation a bit more
\[\nabla J(\theta) = \sum_s d^\pi(s) \sum_a \pi(s, a, \theta)\, Q_\pi(s, a) \frac{\partial}{\partial \theta} \ln \pi(s, a, \theta)\]
This form of the equation helps us think about the gradient in terms of expectations under the policy. Let us derive these equations. We start with the definition of \(J(\theta)\)
\[\begin{align*} \frac{\partial}{\partial \theta} J(\theta) &= \sum_s \Pr(S_0 = s) \frac{\partial}{\partial \theta} V(s) \\ \frac{\partial}{\partial \theta} V(s) &= \sum_a Q(s, a) \frac{\partial}{\partial \theta} \pi(s, a, \theta) + \sum_a \pi(s, a, \theta) \frac{\partial}{\partial \theta} Q(s, a) \\ \frac{\partial}{\partial \theta} Q(s, a) &= \gamma \sum_{s'} P(s, a, s') \frac{\partial}{\partial \theta} V(s') \\ \frac{\partial}{\partial \theta} V(s) &= \sum_a Q(s, a) \frac{\partial}{\partial \theta} \pi(s, a, \theta) + \gamma \sum_a \pi(s, a, \theta) \sum_{s'} P(s, a, s') \frac{\partial}{\partial \theta} V(s') \\ &= \sum_a Q(s, a) \frac{\partial}{\partial \theta} \pi(s, a, \theta) + \gamma \sum_{s'} P[S_1 = s' \vert S_0 = s] \frac{\partial}{\partial \theta} V(s') \\ \frac{\partial}{\partial \theta} J(\theta) &= \sum_s \Pr(S_0 = s) \left(\sum_a Q(s, a) \frac{\partial}{\partial \theta} \pi(s, a, \theta) + \gamma \sum_{s'} P[S_1 = s' \vert S_0 = s] \frac{\partial}{\partial \theta} V(s')\right) \\ &= \sum_{s, a} \Pr(S_0 = s) Q(s, a)\frac{\partial}{\partial \theta} \pi(s, a, \theta) + \gamma \sum_{s'} \Pr(S_1 = s') \frac{\partial}{\partial \theta} V(s') \end{align*}\]We have unrolled the recursion once; unrolling it infinitely many times leads us to the equation we had before
\[\frac{\partial}{\partial \theta} J(\theta) = \sum_{s} d^\pi(s) \sum_a Q(s, a)\frac{\partial}{\partial \theta} \pi(s, a, \theta)\]
Optimal policies
Using this formula, how quickly do we reach a local/optimal policy? What is the right inductive bias for designing the neural network architecture? Can we formulate better loss functions? How do we perform the computations effectively? These remain questions to research further…
Let us try answering the computational-feasibility question. In a way, performing gradient ascent with the gradient formula we derived previously is analogous to performing gradient descent on the complete dataset in supervised training. We can do better with a technique analogous to stochastic gradient descent (this is my understanding and could be incorrect), using Monte Carlo methods to perform the ascent on an estimate of the gradient \(\nabla J(\theta)\). How do we get a Monte Carlo method out of the equation?
\[\begin{align*} \nabla J(\theta) &= \sum_{t \geq 0} \mathbb E_{S_t, A_t}\, Q(S_t, A_t) \frac{\partial}{\partial \theta} \ln \pi(S_t, A_t, \theta) \\ &\approx \mathbb E_{S_t, A_t} \underbrace{\left(\sum_{t \geq 0} Q(S_t, A_t) \frac{\partial}{\partial \theta} \ln \pi(S_t, A_t, \theta) \right)}_{\text{samples}} \end{align*}\]So, we take the average of the quantity \(Q \cdot \nabla_\theta \ln \pi\) over multiple episodes using Monte Carlo sampling. These equations form the basis of the REINFORCE algorithm.
We further analyze this equation - how do we reduce the variance of this unbiased estimator? We use something known as a control variate approach, wherein for an estimator of the random variable \(X\), we build another random variable \(Y\) such that
\[\begin{align*} var(Y) &< 2cov(X, Y) \\ var(X - Y + \mathbb E(Y)) &= var(X - Y) \\ &= var(X) + var(Y) - 2cov(X, Y) < var(X) \end{align*}\]The positively correlated random variables we choose are
\[\begin{align*} X &= G_t = R_t + \gamma R_{t + 1} + \gamma^2 R_{t + 2} + \dots \quad (\text{a sample of } Q(S_t, A_t)) \\ Y &= b(S_t) \end{align*}\]where \(b(S_t)\) is \(V(S_t)\), obtained, for example, from TD(\(\lambda\)). The term \(Q(S_t, A_t) - V(S_t)\) is called the advantage! The equation is modified to
\[\begin{align*} \nabla J(\theta) &= \mathbb E \left[\sum_{t = 0}^L (G_t - b(S_t)) \frac{\partial}{\partial \theta} \ln \pi(S_t, A_t, \theta) \right]\\ &= \mathbb E \left[\sum_{t = 0}^L G_t \frac{\partial}{\partial \theta} \ln \pi(S_t, A_t, \theta)\right] - \underbrace{\sum_{t = 0}^L \sum_s P[S_t = s]\, b(s) \frac{\partial}{\partial \theta} \sum_a \pi(s, a, \theta)}_{=0} \\ &= \mathbb E\left[\sum_{t = 0}^L G_t \frac{\partial}{\partial \theta} \ln \pi(S_t, A_t, \theta)\right] \end{align*}\]The baseline term vanishes in expectation (since \(\sum_a \pi(s, a, \theta) = 1\)), yet subtracting it from each sample reduces the variance of the estimator. We can decrease the variance further, but that may introduce bias. There are many research works that explore such variations of the algorithm depending on the outcome required.
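Putting the pieces together, here is a hedged REINFORCE-with-baseline sketch using the linear softmax parameterization from earlier; the episode format, phi, and the baseline table are illustrative assumptions.

```python
import numpy as np

def reinforce_with_baseline(episodes, phi, theta, baseline, alpha, gamma=0.99):
    """One gradient-ascent step of REINFORCE with a state-dependent baseline.

    episodes: list of trajectories, each a list of (state, action, reward).
    phi(s):   feature matrix of shape (num_actions, d) for state s.
    baseline: dict mapping state -> b(s), e.g. a V-estimate from TD(lambda).
    """
    grad = np.zeros_like(theta)
    for episode in episodes:
        # Discounted returns G_t, computed backwards.
        G, returns = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            G = episode[t][2] + gamma * G
            returns[t] = G
        for t, (s, a, _) in enumerate(episode):
            feats = phi(s)                                  # (num_actions, d)
            logits = feats @ theta
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()                            # pi(. | s), linear softmax
            grad_log = feats[a] - probs @ feats             # grad of ln pi(a | s)
            advantage = returns[t] - baseline.get(s, 0.0)   # G_t - b(S_t)
            grad += advantage * grad_log
    return theta + alpha * grad / len(episodes)             # Monte Carlo average ascent step
```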
Apart from this, there are other implementation tricks that seem to be more effective in practice. For example, removing the discount factor from the REINFORCE equation has been observed to yield more stable and faster training.
Language Models
Recently, RL techniques have been applied extensively to language models to improve their performance. What are the ideas behind it?
A large language model (LLM) can be thought of as an agent in an environment. We (humans) give it feedback, and thereby constitute the environment. The state of the LLM is the sequence of tokens generated so far, and the actions are the set of tokens it can generate at the next instance. The probabilities associated with each token can be considered the LLM's policy (the language model itself encodes the policy). So, we have an MDP formulation for the LLM. Can we use RL?
The problem with RL is that it does not scale well to high dimensions. With LLMs, the number of tokens is very large and rewards are super-sparse. We only give feedback to the LLM at the end of a generated sentence rather than for each token. That makes exploring this space to find the optimal policy (language model) intractable. Due to this, we have resorted to SFT (supervised fine-tuning) techniques to encode an initial policy. By training an LLM on internet text data, we implicitly constrain the search domain by developing a really good initial policy. This policy (language model) can then be improved even further with RL techniques.
The current state of research sits at this intersection - how do we improve credit assignment to make it more fine-grained (token-wise)? Obtaining human feedback continuously is not feasible; human feedback is subjective, inconsistent over time and sometimes without any rational basis. So some works train a reward model from human feedback, which can then be used with the LLM for RL. There is also the problem of catastrophic forgetting in smaller LLMs.
Overview of LLM training
There are three major steps involved in building models like InstructGPT.
Pre-training
The model, starting from randomized weights, is trained on a large corpus of data in a self-supervised manner. This step essentially involves extracting large amounts of text and training models like GPT for next-token prediction. Note that it does not require any labels. Intuitively, we are teaching the model human languages and making it learn how to generate meaningful sentences. The datasets are on the order of trillions of tokens.
Supervised Fine-tuning
In this step, we train the model with prompts and answers. We teach the model with a specific set of questions and the corresponding answers. Again, there are many ways in which this step is done; some developers train only a few layers of the model or use other fine-tuning techniques like LoRA or Adapters. The datasets are much smaller but still consist of millions of tokens.
Reinforcement-learning with Human Feedback (RLHF)
The most important aspect, as we have discussed previously, is defining the reward. Many works draw on techniques from cognitive science and neuroscience. One popular observation is that humans are better at giving preferences than at assigning numerical scores to choices. So, with this motivation, we ask the LLM to generate multiple responses to a given prompt. The human can then give a preference or a ranking order as feedback to the model. These preferences can be related to scores for the reward model using the Bradley-Terry and Plackett-Luce models.
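As an illustration, a Bradley-Terry pairwise loss for training such a reward model might be sketched as follows; reward_model and its signature are assumptions for illustration, not a specific library's API.

```python
import math

def bradley_terry_loss(reward_model, prompt, chosen, rejected):
    """Pairwise preference loss: -log sigma(r(chosen) - r(rejected)).

    Under the Bradley-Terry model, P(chosen > rejected) = sigma(r_c - r_r),
    so minimizing this loss pushes the reward of the preferred response up.
    """
    r_chosen = reward_model(prompt, chosen)      # scalar reward score
    r_rejected = reward_model(prompt, rejected)
    return -math.log(1.0 / (1.0 + math.exp(-(r_chosen - r_rejected))))
```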
AlphaZero
Prior to the mid-2010s, researchers came up with game-specific rules to master board games. For example, they created very sophisticated chess engines (e.g., Stockfish) and developed them iteratively. The issue with such games is that the state space is very large (a computationally infeasible number). Training reinforcement learning algorithms (most suitable since these are sequential decision problems) is difficult since the rewards are usually given only at the end of a game. A single reward for a whole trajectory of states and actions is too sparse to train any useful policy.
Hierarchical action space with rewards
Addressing this, AlphaZero by DeepMind was proposed in 2017, leveraging the advancements in neural networks. Given a state as input, the neural network returns a distribution over actions and an expected value for that state.
We take examples of trajectories from the game, \(\{s_i, \pi(s_i), V(s_i)\}\), and minimize the following loss function \(\mathcal L = (V_\theta(s_i) - V(s_i))^2 - \pi(s_i)^\top \log \pi_\theta (s_i) + c \|\theta\|^2\)
Note that the cross-entropy loss is the same as the KL divergence up to an additive constant - we only consider the trainable part for the purposes of back-propagation.
It is easy to see why this may work. Along with this, they introduced self-play to iteratively and autonomously improve the policy. For this to work, the agent has to explore the state-action space while exploiting the current policy. The trade-off is implemented through a UCB-style (PUCT) action choice
\[U(s, a) = Q(s, a) + c\, \pi_\theta (s, a) \frac{\sqrt{\sum_b N(s, b)}}{N(s, a) + 1}\]where \(N(s, a)\) is the number of times action \(a\) has been chosen in state \(s\) during the search.
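A minimal sketch of this selection rule at a single search node; the per-action lists of Q-values, visit counts, and priors are assumptions for illustration.

```python
import math

def select_action(Q, N, prior, c=1.5):
    """Pick the action maximizing U(a) = Q(a) + c * prior(a) * sqrt(sum_b N(b)) / (N(a) + 1).

    Q, N, prior are lists indexed by action for a single state/node.
    """
    total_visits = sum(N)
    scores = [
        Q[a] + c * prior[a] * math.sqrt(total_visits) / (N[a] + 1)
        for a in range(len(Q))
    ]
    return max(range(len(Q)), key=lambda a: scores[a])
```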
Multi-Arm Bandit Problems
The multi-armed bandit problem involves a decision maker iteratively selecting one of multiple fixed choices when the properties of each choice are only partially known at the time of selection. More formally, we are given \(K\) bandits/actions/choices, each returning a value in \(\{0, 1\}\) drawn from a distribution with mean \(\mu_k\). The goal of the decision maker is to choose the bandits so that the total returned value is the highest.
Since the player does not know the underlying distributions, they have to learn about these properties through experience. Ideally, they would greedily sample the bandit with the highest \(\mu_k\), but the estimate of each distribution is only built up over time. The player thus faces a tradeoff between exploitation of the choice that currently appears to give the highest expected payoff and exploration to get more information about the expected payoffs of the other machines.
To quantify these choices, we introduce a quantity called regret - defined as the expected difference between the reward sum of an optimal strategy and the sum of the collected rewards. We define the following quantities
\[\begin{align*} N_k(T) &= \sum_{t = 1}^T 1 [I(t) = k] \\ W_k(T) &= \sum_{t = 1}^T X_k(t) 1 [I(t) = k] \\ \hat \mu_k(T) &= \frac{W_k(T)}{N_k(T)} \\ W(T) &= \sum_{k = 1}^K W_k(T) \\ \mathbb E(R(T)) &= T \mu_1 - \sum_k \mu_k \mathbb E(N_k(T)) \end{align*}\]We say \(k = 1\) is the best choice without any loss of generality. Our goal is to minimize the expected regret \(\mathbb E(R(T))\).
Is it possible to make the regret sub-linear? The Upper Confidence Bound (UCB) algorithm achieves regret logarithmic in \(T\), and that is the best we have so far.
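For concreteness, here is a compact UCB1 sketch under the Bernoulli-reward setup above; pull(k) is an assumed callable that samples a reward from arm k.

```python
import math

def ucb1(pull, K, T):
    """Run UCB1 for T rounds over K arms; returns empirical means and pull counts."""
    counts = [0] * K
    means = [0.0] * K
    for t in range(1, T + 1):
        if t <= K:
            k = t - 1                        # pull each arm once to initialize
        else:
            # Choose the arm with the highest upper confidence bound.
            k = max(range(K),
                    key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))
        r = pull(k)                          # reward in {0, 1} from arm k
        counts[k] += 1
        means[k] += (r - means[k]) / counts[k]   # incremental mean update
    return means, counts
```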
Why do we define regret as the metric, and not something else? It is perhaps the simplest, most flexible and most intuitive metric for this problem. Furthermore, we have considered stationary distributions. What about non-stationary ones? There is detailed work on this with very rigorous theory by Wei et al.
So, within our stationary setup, let us see why the UCB algorithm works. For each bandit we estimate a range \([a, b]\) around \(\hat \mu_k(t)\) such that \(P[\mu_k \in [a, b]] \geq 1 - \delta\). That is, we want to be able to estimate the properties of the bandits with a certain confidence. Given that we are dealing with Bernoulli variables, we have the following result
\[P\left[\mu \geq \frac{1}{n} \sum_i Y_i + \epsilon\right] \leq e^{-2n \epsilon^2}\]This is Hoeffding's inequality; for Bernoulli variables, one way to see it is by bounding the tail of the binomial probabilities \(\binom{n}{k} \mu^k (1 - \mu)^{n - k}\).