Reinforcement Learning Theory
Introduction
In this article, we aim to understand the theory of reinforcement learning in the context of Large Language Models.
Reinforcement learning addresses sequential decision problems, wherein an agent takes actions in an environment. These settings are generally modeled as Markov Decision Processes (MDPs). An MDP is characterized by the following components (a minimal code sketch follows the list) -
- Time - \(T\) - Discrete, infinite
- States - \(S\) - Discrete
- Actions - \(A\) - Discrete
- Transitions - \(\tau : S \times A \to \Delta(S)\) - a probability distribution over next states
\[P(s' \vert s, a) = \Pr[\tau(s, a) = s']\]
In more complicated setups, the transition function can also depend on the time \(t\) (stationary vs. non-stationary dynamics). In finite-time horizons, the time can be embedded inside the state itself, converting non-stationary scenarios into stationary ones.
- Reward - \(R: S \times A \times S \to [-M, M]\). There is no need to make the reward stochastic, since that is already generalized by \(\tau\). The rewards are usually assumed to be bounded, hence the codomain \([-M, M]\). The expected reward for an action is given by
\[R'(s, a) = \sum_{s' \in S} R(s, a, s') \, p(s' \vert s, a)\]
- Initial state \(S_0\) - Can be a single state or a distribution over states.
- Discount Factor \(\gamma\) - A factor \(< 1\) that keeps the total expected future reward bounded. This will become clearer later in the text.
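To make these components concrete, here is a minimal sketch of a tiny two-state MDP in Python. The state names, action names, numbers, and the `transition`/`reward`/`expected_reward` dictionary layout are all illustrative choices, not anything fixed by the theory above.

```python
# A tiny, hand-written MDP with two states and two actions.
states = ["s0", "s1"]
actions = ["stay", "move"]
gamma = 0.9  # discount factor < 1

# transition[(s, a)] maps s' -> P(s' | s, a), i.e. tau(s, a) in Delta(S).
# Each row sums to 1.
transition = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.1, "s1": 0.9},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

# reward[(s, a, s2)] is R(s, a, s'), bounded (here in [0, 1]).
reward = {(s, a, s2): (1.0 if s2 == "s1" else 0.0)
          for s in states for a in actions for s2 in states}

# Expected reward R'(s, a) = sum_{s'} R(s, a, s') P(s' | s, a)
def expected_reward(s, a):
    return sum(reward[(s, a, s2)] * p for s2, p in transition[(s, a)].items())

print(expected_reward("s0", "move"))  # 0.8
```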
A policy \(\pi\) is a probability distribution over actions given the current state. Consider a random map \(\beta: S \to \Delta(A)\); then a policy is formally
\[\pi(s, a) = \Pr[\beta(s) = a]\]
A trajectory describes the sequence of states, actions, and rewards of an agent in a run - \(s_0, a_0, r_0, s_1, a_1, r_1, \dots\).
A policy is associated with a value function \(V_{\pi, t}(s) = \mathbb E\left[\sum_{i = t}^\infty \gamma^{i - t} r_i \,\middle\vert\, s_t = s\right]\). Note that the dependence of the value function on time is redundant in stationary settings.
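As a quick illustration of this definition, the sketch below estimates \(V_\pi(s)\) by Monte Carlo: roll the policy out many times from \(s\), truncate after a horizon (which \(\gamma < 1\) makes harmless), and average the discounted returns. The callables `policy(s)` and `step(s, a)` are hypothetical stand-ins for sampling from \(\pi\) and from the MDP dynamics.

```python
def discounted_return(rewards, gamma, t=0):
    """sum_{i >= t} gamma^(i - t) * r_i for a finite trajectory prefix."""
    return sum(gamma ** (i - t) * r for i, r in enumerate(rewards) if i >= t)

def estimate_value(s, policy, step, gamma, horizon=200, episodes=1000):
    """Monte Carlo estimate of V_pi(s).
    `policy(s)` samples an action; `step(s, a)` samples (next_state, reward)."""
    total = 0.0
    for _ in range(episodes):
        state, rewards = s, []
        for _ in range(horizon):
            a = policy(state)
            state, r = step(state, a)
            rewards.append(r)
        total += discounted_return(rewards, gamma)
    return total / episodes
```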
The goal is to maximize this value function over all possible policies in an environment
\[\pi^* = \arg \max_{\pi} V_\pi(S_0)\]
This is known as the Markov Decision Problem. In a strict setting, the supremum may not be attained by any valid policy.
How is Reinforcement Learning different from Supervised Learning (SL)? The key differences are
- In SL there is no sequentiality
- Every decision in SL has an associated reward (label), and there is no stochasticity
Policies can be optimized using multiple methods
- Hill-climbing style methods - policy iteration, value iteration (a sketch of value iteration follows this list)
- Learning-based methods - think of a policy as a matrix that has to be optimized to satisfy certain constraints
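As a rough sketch of what tabular value iteration looks like in code, the function below repeatedly applies the Bellman optimality backup until the values stop changing. It assumes the same illustrative `transition`/`reward` dictionary layout as the earlier MDP sketch; nothing here is tied to a particular library.

```python
def value_iteration(states, actions, transition, reward, gamma, tol=1e-8):
    """Tabular value iteration.
    `transition[(s, a)]` maps s' -> P(s' | s, a);
    `reward[(s, a, s2)]` is R(s, a, s')."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: max over actions of expected one-step return.
            best = max(
                sum(p * (reward[(s, a, s2)] + gamma * V[s2])
                    for s2, p in transition[(s, a)].items())
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```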
Bellman Equations
These are recursive formulations of the value functions associated with policies.
\[V_\pi(s) = \sum_{a} \pi(s, a) \sum_{s'} p(s' \vert s, a) \left[R(s, a, s') + \gamma V_\pi(s')\right]\]
The Markov Assumption
Many processes in nature are Markovian - what happens next depends only on the current state, and not on the history of states.
A general policy is a function from trajectories \(H_t\) to a distribution over actions. However, such a general representation is difficult to compute with. Additionally, due to the Markov assumption, a stationary policy \(\pi: S \to \Delta(A)\) is general enough for optimality. On the surface, this may seem like we are limiting the power of the agent, but we will see that this is not the case. In fact, even a deterministic policy will do.
What about the reward? Are we limiting the reward function with a Markovian assumption? Since we consider expected rewards, the expectation can be embedded within state-based rewards as well, so both representations are in fact equivalent.
With a general policy, the probability of a trajectory is given by
\[P[S_0 = s_0, A_0 = a_0, \dots, S_t = s_t] = \mu(s_0) \, \pi(a_0 \vert s_0) \, P(s_1 \vert s_0, a_0) \, \pi(a_1 \vert s_0, a_0, s_1) \cdots\]
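As a small illustration, this product can be computed directly from tabulated quantities. The sketch below uses hypothetical dictionaries `mu`, `policy_prob`, and `transition` with the obvious shapes, and simplifies to a Markovian (state-conditioned) policy rather than a fully history-dependent one.

```python
import math

def trajectory_log_prob(traj, mu, policy_prob, transition):
    """log P[S_0 = s_0, A_0 = a_0, ..., S_t = s_t] under a Markovian policy.
    `traj` is [(s_0, a_0), (s_1, a_1), ..., (s_t, None)];
    `mu[s]` = P(S_0 = s), `policy_prob[(s, a)]` = pi(a | s),
    `transition[(s, a)][s2]` = P(s2 | s, a)."""
    s0, _ = traj[0]
    logp = math.log(mu[s0])
    for (s, a), (s_next, _) in zip(traj[:-1], traj[1:]):
        logp += math.log(policy_prob[(s, a)])
        logp += math.log(transition[(s, a)][s_next])
    return logp
```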
Continuing with the Bellman equations, our goal was to find a policy that maximizes the value function
\[\begin{align*} V_\pi(s) &= \mathbb E_{P, \pi}\left[\sum_{t = 0}^\infty \gamma^t R(S_t, A_t) \,\middle\vert\, S_0 = s\right] \\ V^*(s) &= \max_{\pi} V_\pi(s) \end{align*}\]
Firstly, does there exist a policy such that \(V_\pi(s) = V^*(s)\) for all \(s\)? An optimal policy for a particular state is guaranteed because of compactness, but is there a single policy that is optimal for all states?
To aid our analysis, we also define a Q-function \(Q: S \times A \to \mathbb R\) that describes the expected return \(Q(s, a)\) of taking an action \(a\) at state \(s\). This function is very similar to the value function but additionally conditions on the first action. Why have both these functions? Convenience. Analogously, we also define \(Q^*(s, a) = \max_\pi Q_\pi(s, a)\).
Theorem (Existence theorem). There exists a stationary and deterministic policy that achieves the optimal value for all states.
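This theorem also suggests how such a policy is typically written down: act greedily with respect to \(Q^*\), i.e. \(\pi^*(s) = \arg\max_a Q^*(s, a)\). A minimal sketch, assuming a tabular `q_star[(s, a)]` dictionary (an illustrative name, not something defined earlier):

```python
def greedy_policy(states, actions, q_star):
    """Deterministic, stationary policy pi(s) = argmax_a Q*(s, a)."""
    return {s: max(actions, key=lambda a: q_star[(s, a)]) for s in states}
```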
To summarize, the Bellman equations are possible due to the Markovian nature of the policy and the state-transition function. We have the following equations
\[\begin{align*} V_\pi(s) &= R(s, \pi(s)) + \gamma \sum_{s'} P(s' \vert s, \pi(s)) V_\pi(s') \\ Q_\pi(s, a) &= R(s, a) + \gamma \sum_{s'} P(s' \vert s, a) \sum_{a'} \pi(a' \vert s') Q_\pi(s', a') \\ V_\pi(s) &= \sum_a \pi(a \vert s) Q_\pi(s, a) \\ Q_\pi(s, a) &= R(s, a) + \gamma \sum_{s'} P(s' \vert s, a) V_\pi(s') \end{align*}\]
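These fixed-point equations suggest a simple evaluation procedure: start from any estimate and repeatedly apply the right-hand side until convergence. The sketch below does this for a deterministic policy, using the same illustrative table layout as the earlier sketches, and then recovers \(Q_\pi\) from \(V_\pi\) via the last equation. Here \(R(s, a)\) is taken as the expected reward \(R'(s, a)\).

```python
def policy_evaluation(states, actions, transition, reward, gamma, policy, tol=1e-8):
    """Iteratively apply V(s) <- R(s, pi(s)) + gamma * sum_{s'} P(s' | s, pi(s)) V(s').
    `policy[s]` is a deterministic action; `transition[(s, a)]` maps s' -> P(s' | s, a),
    and `reward[(s, a, s2)]` is R(s, a, s')."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]
            new_v = sum(p * (reward[(s, a, s2)] + gamma * V[s2])
                        for s2, p in transition[(s, a)].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:
            break
    # Q_pi(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) V_pi(s')
    Q = {(s, a): sum(p * (reward[(s, a, s2)] + gamma * V[s2])
                     for s2, p in transition[(s, a)].items())
         for s in states for a in actions}
    return V, Q
```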