Study Notes: Stanford CS336 Language Modeling from Scratch
A Beginner's Guide to Reinforcement Learning for Language Models
Recent breakthroughs in AI reasoning, like DeepSeek R1 and OpenAI's o1, have been powered by reinforcement learning (RL). But if you're new to RL, the math can feel intimidating. Terms like "policy gradients," "baselines," and "importance sampling" get thrown around, and the equations look like alphabet soup.
In this note, I break down the core concepts of RL for language models in plain English, with simple examples and step-by-step explanations of a few key formulas.
This guide is based on my study notes from Stanford CS336 and resources like OpenAI's Spinning Up in Deep RL and Nathan Lambert's RLHF Book.
Table of Contents
- A Beginner's Guide to Reinforcement Learning for Language Models
- Table of Contents
- The Big Picture: Training Dogs and Language Models
- Part 1: Language Models as Policies
- Part 2: Trajectories - Recording the Journey
- Part 3: Rewards and Returns - Measuring Success
- Part 4: The Policy Gradient (Vanilla REINFORCE)
- Part 5: Baselines - Reducing the Noise
- Part 6: Off-Policy Learning - Reusing Old Data
- Part 7: GRPO - Group Relative Policy Optimization
- Part 8: Putting It All Together
The Big Picture: Training Dogs and Language Models
Before diving into equations, let's build intuition with an analogy.
Imagine you're training a dog to do tricks. You can't tell the dog exactly which muscles to move; you can only reward it when it does something good. Over time, the dog learns to repeat actions that led to treats and avoid actions that didn't.
Reinforcement learning for language models works the same way:
| Dog Training | LLM Training |
|---|---|
| Dog decides what action to take | LLM decides what token to generate |
| You give a treat (or not) | Reward function gives a score (0 or 1) |
| Dog repeats actions that got treats | LLM increases probability of tokens that led to rewards |
The key insight: we don't tell the model what to generate (i.e., we never show it the ground truth); we just tell it whether its answer was good or bad, and it figures out the rest.
Part 1: Language Models as Policies
What is a Policy?
In RL terminology, a policy is just a decision-making strategy. For language models:
- State ($s_t$): The text generated so far (the context, or the prefix)
- Action ($a_t$): The next token to generate
- Policy ($\pi_\theta$): The probability distribution over possible next tokens
Your LLM is a policy! Given a text prefix, it outputs probabilities for each possible next token:
State: "The capital of France is"
Policy: {"Paris": 0.85, "Lyon": 0.05, "the": 0.03, ...}
Action: Sample from this distribution → "Paris"
Mathematically, we write this as:
\[a_t \sim \pi_\theta(\cdot | s_t)\]This reads: "action $a_t$ is sampled from the policy $\pi_\theta$ given state $s_t$."
The Two Operations You Need
To train a policy with RL, you only need two operations:
| Operation | What It Does | Example |
|---|---|---|
| Sampling | Draw a token from the probability distribution | Sample "Paris" with probability 0.85 (the most likely token here) |
| Scoring | Compute the log-probability of a token | $\log \pi_\theta(\text{"Paris"} \mid s_t) = \log(0.85) \approx -0.16$ |
That's it! You don't need to know anything else about the model's internals.
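To make those two operations concrete, here is a minimal PyTorch sketch over a toy four-token vocabulary. The logits are made up for illustration; a real model would compute them from the prefix.
import torch

# Toy logits over a tiny vocabulary for the state "The capital of France is".
# (Values are invented for illustration; a real LLM computes them from the prefix.)
vocab = ["Paris", "Lyon", "the", "cheese"]
next_token_logits = torch.tensor([4.0, 1.2, 0.7, 0.1])

probs = torch.softmax(next_token_logits, dim=-1)  # the policy π_θ(· | s_t)

# Operation 1 (sampling): draw a token from the distribution
action = torch.multinomial(probs, num_samples=1).item()
print("sampled:", vocab[action])

# Operation 2 (scoring): log-probability of a specific token
log_probs = torch.log_softmax(next_token_logits, dim=-1)
print("log π(Paris | s_t) =", log_probs[vocab.index("Paris")].item())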
Part 2: Trajectories - Recording the Journey
What is a Trajectory?
A trajectory (also called an episode or rollout) is the complete sequence of states and actions from start to finish:
\[\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)\]Think of it like recording a chess game move by move: you capture everything that happened.
A Concrete Example
Let's trace a trajectory for a math problem:
Prompt: "What is 2+3? Think step by step."
| Timestep | State ($s_t$) | Action ($a_t$) |
|---|---|---|
| 0 | "What is 2+3? Think step by step." | "I" |
| 1 | "…I" | "need" |
| 2 | "…I need" | "to" |
| 3 | "…I need to" | "add" |
| … | … | … |
| T | "… </think>" | "5" |
| T+1 | "… 5" | "</answer>" |
The trajectory ends when the model emits an end-of-text token (like </answer>) or hits a maximum length.
Key observation: In LLM-land, the "environment" is trivially deterministic. The next state is just the old state plus the new token:
\[s_{t+1} = s_t | a_t\](where $|$ means concatenation)
Part 3: Rewards and Returns - Measuring Success
The Reward Function
The reward $r_t = R(s_t, a_t)$ judges how good an action was. For RL on math problems, we typically use sparse rewards:
- Intermediate steps: $r_t = 0$ (no feedback until the end)
- Final answer: $r_T = 1$ if correct, $0$ if wrong
Example:
| Trajectory | Final Answer | Correct? | Reward |
|---|---|---|---|
| "…" | 5 | ✓ | 1 |
| "…" | 6 | ✗ | 0 |
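For intuition, here is a minimal sketch of such a sparse 0/1 reward function in Python. The hypothetical `<answer>...</answer>` wrapper around the final answer is my own simplification, not the exact checker used in the course or any paper:
import re

def math_reward(response: str, ground_truth: str) -> float:
    """Sparse 0/1 reward: 1 if the text inside <answer>...</answer> matches."""
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0  # no parsable final answer: treat as wrong
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

# Intermediate tokens get reward 0; only the finished trajectory is scored.
print(math_reward("I need to add... <answer>5</answer>", "5"))  # 1.0
print(math_reward("I think <answer>6</answer>", "5"))           # 0.0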
The Return: Adding Up Rewards
The return $R(\tau)$ is the total reward accumulated over a trajectory:
\[R(\tau) = \sum_{t=0}^{T} r_t\]With sparse rewards, only the final step matters, so $R(\tau)$ equals the terminal reward (0 or 1).
The Objective: Maximize Expected Return
The goal of RL is to find policy parameters $\theta$ that maximize expected return:
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\]In plain English: "On average, how much reward does my policy get?"
If $J(\theta) = 0.7$, that means your model solves 70% of problems correctly.
Part 4: The Policy Gradient (Vanilla REINFORCE)
Now we get to the heart of RL: how do we actually improve the policy?
The Key Equation
The Vanilla REINFORCE policy gradient tells us how to update parameters:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R(\tau)\right]\]This looks complex, but the intuition is simple: increase the probability of actions that led to high rewards.
Symbol-by-Symbol Breakdown
Let me explain every symbol in this equation:
| Symbol | Name | Meaning |
|---|---|---|
| $J(\theta)$ | Objective function | Expected total reward: the thing we want to maximize |
| $\theta$ | Parameters | All the weights in your language model (millions of numbers) |
| $\nabla_\theta J$ | Gradient | "Which direction should I nudge each parameter to increase $J$, the expected total reward?" |
| $\mathbb{E}[\cdot]$ | Expectation | Average over many samples |
| $\tau$ | Trajectory | One complete episode (e.g., prompt → response → end) |
| $\tau \sim \pi_\theta$ | Sampling | Generate trajectories by running the policy |
| $\sum_t$ | Sum over timesteps | Add up contributions from every token |
| $s_t$ | State | Text prefix at timestep $t$ |
| $a_t$ | Action | Token generated at timestep $t$ |
| $\pi_\theta(a_t \mid s_t)$ | Probability | How likely was this token given this context? |
| $\log \pi_\theta(a_t \mid s_t)$ | Log-probability | Same info, but in log space (more stable) |
| $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ | Score function* | Gradient of the log-probability; points in the direction that increases this tokenβs probability |
| $R(\tau)$ | Return | Total reward for this trajectory (0 or 1) |
Note on terminology: the name "score function" (the * in the table above) comes from statistics. Despite "score" sounding like a scalar, the score function is a vector pointing in the direction of steepest increase of the log-probability.
The Log-Derivative Trick: Why the Math Works
The magic behind policy gradients is a simple calculus identity:
\[\nabla_\theta P = P \cdot \nabla_\theta \log P\]This comes from the chain rule for logarithms:
\[\frac{d}{d\theta} \log P = \frac{1}{P} \cdot \frac{d}{d\theta} P\]Rearranging:
\[\frac{d}{d\theta} P = P \cdot \frac{d}{d\theta} \log P\]Why is this useful? It lets us convert βgradient of an expectationβ into βexpectation of a gradientββwhich we can estimate by sampling!
Numerical example:
Suppose $P(a) = 0.3$ is the probability of some action, and the direct gradient happens to be $\nabla_\theta P = 1$.
Using the trick:
- $\nabla_\theta \log P = \frac{1}{P} \cdot \nabla_\theta P = \frac{1}{0.3} \cdot 1 \approx 3.33$
- $P \cdot \nabla_\theta \log P = 0.3 \times 3.33 \approx 1$ ✓
Same answer! The trick is just a rearrangement that makes the computation easier.
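If you prefer to check this with autograd rather than by hand, here is a tiny PyTorch experiment of my own (not from the course): parameterize a probability via a softmax over two logits and compare $\nabla_\theta P$ with $P \cdot \nabla_\theta \log P$.
import torch

theta = torch.tensor([0.2, -0.4], requires_grad=True)  # two logits as "parameters"
P = torch.softmax(theta, dim=0)[0]                      # probability of action 0

grad_P = torch.autograd.grad(P, theta, retain_graph=True)[0]  # ∇_θ P
grad_log_P = torch.autograd.grad(torch.log(P), theta)[0]      # ∇_θ log P

print(grad_P)                    # direct gradient
print(P.detach() * grad_log_P)   # P · ∇_θ log P: the same vector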
Deriving the Policy Gradient Step-by-Step
Letβs derive the REINFORCE equation from scratch.
Step 1: Write out the expectation explicitly
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_{\tau} P(\tau | \theta) \cdot R(\tau)\]This sums over all possible trajectories, weighted by their probability.
Step 2: Take the gradient
\[\nabla_\theta J(\theta) = \sum_{\tau} \nabla_\theta P(\tau | \theta) \cdot R(\tau)\]Note: $R(\tau)$ doesn't depend on $\theta$ (it's just "was the answer correct?")
Step 3: Apply the log-derivative trick
\[\nabla_\theta J(\theta)= \sum_{\tau} P(\tau | \theta) \cdot \nabla_\theta \log P(\tau | \theta) \cdot R(\tau)\]Step 4: Recognize this as an expectation
\[\nabla_\theta J(\theta)= \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log P(\tau | \theta) \cdot R(\tau)\right]\]Step 5: Expand the trajectory probability
A trajectoryβs probability is:
\[P(\tau | \theta) = \underbrace{\rho_0(s_0)}_{\text{initial prompt}} \cdot \prod_{t=0}^{T} \underbrace{P(s_{t+1}|s_t, a_t)}_{\text{environment}} \cdot \underbrace{\pi_\theta(a_t|s_t)}_{\text{policy}}\]Taking the log:
\[\log P(\tau | \theta) = \log \rho_0(s_0) + \sum_t \log P(s_{t+1}|s_t, a_t) + \sum_t \log \pi_\theta(a_t|s_t)\]When we take $\nabla_\theta$, the first two terms vanish (they donβt depend on $\theta$):
\[\nabla_\theta \log P(\tau | \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\]Final result:
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R(\tau)\right]\]
Intuitive Summary
The policy gradient says:
- Sample trajectories by running your policy
- For each token, compute "how to make it more likely" ($\nabla_\theta \log \pi_\theta$)
- Scale by the reward → good outcomes get reinforced, bad ones don't
- Average across trajectories
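In code, this estimator is just "sum of token log-probs, weighted by the return, averaged, negated." A minimal sketch, assuming you already have the per-token log-probabilities of the sampled trajectories (the tensor shapes are illustrative, and a real implementation would also mask padding tokens):
import torch

def reinforce_loss(token_log_probs: torch.Tensor, returns: torch.Tensor) -> torch.Tensor:
    """
    token_log_probs: (batch, seq_len) log π_θ(a_t | s_t) for the sampled tokens
    returns:         (batch,)         R(τ) for each trajectory (0 or 1)
    """
    # Sum log-probs over each trajectory and weight by its return
    per_trajectory = token_log_probs.sum(dim=-1) * returns
    # Negate: optimizers minimize, but we want to maximize expected return
    return -per_trajectory.mean()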
Part 5: Baselines - Reducing the Noise
The Problem with Vanilla REINFORCE
Vanilla REINFORCE has high variance. Here's why:
Suppose your model already solves 90% of problems. Most trajectories get $R(\tau) = 1$, so the gradient says "increase the probability of these tokens!" even for trajectories that succeeded by luck.
The signal is noisy: sometimes you're reinforcing good reasoning, sometimes just lucky guesses.
The Solution: Subtract a Baseline
The fix is to subtract a baseline $b(s)$ that estimates "what return do we typically get?":
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \underbrace{(R(\tau) - b(s_t))}_{\text{advantage}}\right]\]The quantity $(R(\tau) - b(s_t))$ is called the advantage:
| Advantage | Meaning | Effect |
|---|---|---|
| Positive | Better than expected | Reinforce these tokens |
| Negative | Worse than expected | Discourage these tokens |
| Zero | Exactly as expected | No change |
A Concrete Example
Without baseline:
| Trajectory | $R(\tau)$ | Gradient Signal |
|---|---|---|
| Correct | 1 | Make these tokens more likely! |
| Wrong | 0 | Do nothing |
Every correct answer gets the same reinforcement, regardless of difficulty; every wrong answer gets no punishment. The model only learns from successes. This is actually a key limitation of vanilla REINFORCE with 0/1 rewards! You're not learning what to avoid, only what worked.
With baseline (say, $b = 0.9$ because model gets 90% right):
| Trajectory | $R(\tau)$ | Advantage = $R - 0.9$ | Gradient Signal |
|---|---|---|---|
| Correct | 1 | +0.1 | "Slightly reinforce" |
| Wrong | 0 | -0.9 | "Strongly discourage!" |
Now the model also learns to avoid failures rather than redundantly reinforcing every success!
Why Baselines Donβt Add Bias
You might worry: "Doesn't subtracting something change the answer?"
No! The baseline term vanishes in expectation. Let's prove it.
The claim: For any baseline $b(s)$ that only depends on the state:
\[\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = 0\]The proof:
Since $b(s)$ doesnβt depend on the action $a$, we can pull it out:
\[\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)]= b(s) \cdot \mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)]\]Now we show the expectation of the score function is zero:
\[\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = \sum_{a} \pi_\theta(a|s) \cdot \nabla_\theta \log \pi_\theta(a|s)\]Using the identity $\nabla_\theta \log P = \frac{\nabla_\theta P}{P}$:
\[\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)]= \sum_{a} \pi_\theta(a|s) \cdot \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} = \sum_{a} \nabla_\theta \pi_\theta(a|s)\]Swapping sum and gradient:
\[\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)]= \nabla_\theta \sum_{a} \pi_\theta(a|s) = \nabla_\theta 1 = 0\]The last step works because probabilities over all possible actions that can be taken sum to 1.
Concrete example with softmax as the policy function:
Let's work through a real example: a language model where token probabilities come from a softmax (the policy) over logits:
\[\pi(a) = \frac{e^{z_a}}{\sum_k e^{z_k}}\]The log-probability simplifies nicely:
\[\log \pi(a) = z_a - \log \sum_k e^{z_k}\]Taking gradients with respect to each logit $z_i$:
| Gradient | Formula | Intuition |
|---|---|---|
| $\frac{\partial \log \pi(a)}{\partial z_a}$ | $1 - \pi(a)$ | Increasing the chosen token's own logit helps (less so if the model is already confident) |
| $\frac{\partial \log \pi(a)}{\partial z_b}$ | $-\pi(b)$ | Increasing a competitor's logit hurts |
Derivation: For the chosen token, $\frac{\partial}{\partial z_a}[z_a - \log\sum_k e^{z_k}] = 1 - \frac{e^{z_a}}{\sum_k e^{z_k}} = 1 - \pi(a)$. For other tokens, $\frac{\partial}{\partial z_b}[z_a - \log\sum_k e^{z_k}] = 0 - \frac{e^{z_b}}{\sum_k e^{z_k}} = -\pi(b)$.
Numerical example:
Suppose we have 3 tokens with probabilities $[\pi(A), \pi(B), \pi(C)] = [0.5, 0.3, 0.2]$.
The score function for each token (as a vector over logits $[z_A, z_B, z_C]$):
| Token | Score function $\nabla_z \log \pi$ | Weighted by $\pi$ |
|---|---|---|
| A | $[1-0.5, -0.3, -0.2] = [+0.5, -0.3, -0.2]$ | $0.5 \times [+0.5, -0.3, -0.2] = [+0.25, -0.15, -0.10]$ |
| B | $[-0.5, 1-0.3, -0.2] = [-0.5, +0.7, -0.2]$ | $0.3 \times [-0.5, +0.7, -0.2] = [-0.15, +0.21, -0.06]$ |
| C | $[-0.5, -0.3, 1-0.2] = [-0.5, -0.3, +0.8]$ | $0.2 \times [-0.5, -0.3, +0.8] = [-0.10, -0.06, +0.16]$ |
| Sum |  | $[0, 0, 0]$ ✓ |
Each component sums to zero! For example, the first component: $0.25 - 0.15 - 0.10 = 0$.
The "increase my probability" directions (positive entries) are exactly canceled by the "decrease others' probability" directions (negative entries) when weighted by the policy.
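You can reproduce this cancellation with a few lines of PyTorch (my own quick check; the logits are arbitrary and the loop is deliberately naive):
import torch

logits = torch.tensor([1.0, 0.5, 0.1], requires_grad=True)
probs = torch.softmax(logits, dim=0)

expected_score = torch.zeros_like(logits)
for a in range(len(logits)):
    log_prob_a = torch.log_softmax(logits, dim=0)[a]
    score_a = torch.autograd.grad(log_prob_a, logits)[0]  # ∇_z log π(a)
    expected_score += probs[a].detach() * score_a          # weight by π(a)

print(expected_score)  # ≈ tensor([0., 0., 0.])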
Why this matters:
We can subtract any function of the state from our rewards without changing the expected gradient:
\[\mathbb{E}[\nabla_\theta \log \pi_\theta \cdot (R(\tau) - b(s))] = \mathbb{E}[\nabla_\theta \log \pi_\theta \cdot R(\tau)] - \underbrace{\mathbb{E}[\nabla_\theta \log \pi_\theta \cdot b(s)]}_{= 0}\]We get lower variance (because advantages are centered around zero) without introducing bias. Free lunch!
Part 6: Off-Policy Learning - Reusing Old Data
The Inefficiency of On-Policy Learning
Vanilla REINFORCE is on-policy: you generate rollouts, take one gradient step, then throw away the data and generate fresh rollouts.
Generate 1000 responses → one gradient step → discard → generate 1000 more → ...

ON-POLICY REINFORCE

  Step 1: Sample 1000 questions from your question bank
          (e.g., "What is 2+3?", "Solve x²=4", ...)
  Step 2: The current model π_θ generates responses for each question
          (this is expensive: running inference 1000 times)
  Step 3: Check answers → rewards [1, 0, 1, 1, 0, ...]
  Step 4: Compute the gradient, update θ → θ'
  Step 5: DISCARD all 1000 responses  ← this is the wasteful part!
  Step 6: Go back to Step 1 with the NEW model π_θ'

This is wasteful! LLM inference is expensive, and we're only using each sample once.
The Solution: Importance Sampling
In off-policy learning, we reuse rollouts from a previous policy $\pi_{\theta_{\text{old}}}$ to train the current policy $\pi_\theta$.
OFF-POLICY

  Step 1: Sample questions
  Step 2: π_θ_old generates responses (expensive, but done ONCE)
  Step 3: Check answers → rewards
  Step 4: Gradient step 1 (with importance weights to correct for π_θ_old)
  Step 5: Gradient step 2 (same data!)
  Step 6: Gradient step 3 (same data!)
  ...     (typically 4-8 steps per batch)  ← reuse the same responses multiple times!
  Step 7: NOW discard and generate fresh responses
The trick is importance sampling: we reweight samples to correct for the mismatch between old and new policies.
\[g_{\text{off-policy}} = \frac{1}{N} \sum_{i=1}^{N} \sum_t \underbrace{\frac{\pi_\theta(a_t^{(i)}|s_t^{(i)})}{\pi_{\theta_{\text{old}}}(a_t^{(i)}|s_t^{(i)})}}_{\text{importance weight}} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) \cdot R(\tau^{(i)})\]Here $N$ is the number of trajectories in the batch (e.g., 1000 responses).
The importance weight $\rho_t = \frac{\pi_\theta}{\pi_{\theta_{\text{old}}}}$ corrects for the distribution shift.
A Concrete Example
Suppose the old policy generated the token "Paris" with probability 0.5, but your current policy would generate it with probability 0.7.
Without correction: You'd undercount "Paris" because it was sampled from a distribution that liked it less.
With importance weight: $\rho = 0.7 / 0.5 = 1.4$
You upweight this sample by 40% to compensate.
| Old Policy | Current Policy | Importance Weight |
|---|---|---|
| P("Paris") = 0.5 | P("Paris") = 0.7 | 0.7/0.5 = 1.4 |
| P("Lyon") = 0.3 | P("Lyon") = 0.1 | 0.1/0.3 = 0.33 |
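In an implementation you would form these ratios from stored log-probabilities and exponentiate the difference, which is numerically safer than dividing raw probabilities. A small sketch mirroring the table above:
import torch

old_log_probs = torch.log(torch.tensor([0.5, 0.3]))  # π_θ_old("Paris"), π_θ_old("Lyon")
new_log_probs = torch.log(torch.tensor([0.7, 0.1]))  # π_θ("Paris"),     π_θ("Lyon")

importance_weights = torch.exp(new_log_probs - old_log_probs)
print(importance_weights)  # ≈ tensor([1.4000, 0.3333])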
The Catch: Don't Stray Too Far
Importance sampling only works when $\pi_\theta$ and $\pi_{\theta_{\text{old}}}$ are similar. If they diverge:
- Some importance weights explode (e.g., 100×)
- Gradient estimates become unreliable
- Training becomes unstable
This is why algorithms like PPO and GRPO clip the importance weights; more on this next!
Now let's put everything together with GRPO, the algorithm used to train DeepSeek R1.
Part 7: GRPO - Group Relative Policy Optimization
The Core Idea: Compare Siblings
Remember, we need a baseline to reduce variance. The standard approach is to train a separate model to predict expected returns, but this is extra work.
GRPO's insight: Instead of learning a baseline, sample multiple answers for the same question and compare them to each other!
If you ask the model "What is 2+3?" five times and it gets three right and two wrong, the correct answers are "better than average" and the wrong ones are "worse than average." No separate baseline network needed!
Step 1: Sample Multiple Outputs Per Question
For each question $q$, sample $G$ outputs (the "group"):
\[\{o^{(1)}, o^{(2)}, \ldots, o^{(G)}\} \sim \pi_\theta(\cdot | q)\]Example with G=5:
| Question | Output $i$ | Answer | Correct? | Reward $r^{(i)}$ |
|---|---|---|---|---|
| "What is 15×7?" | 1 | "105" | ✓ | 1 |
|  | 2 | "105" | ✓ | 1 |
|  | 3 | "112" | ✗ | 0 |
|  | 4 | "105" | ✓ | 1 |
|  | 5 | "107" | ✗ | 0 |
Step 2: Compute Group-Normalized Advantages
The advantage for output $i$ is computed by normalizing within the group:
\[A^{(i)} = \frac{r^{(i)} - \text{mean}(r^{(1)}, \ldots, r^{(G)})}{\text{std}(r^{(1)}, \ldots, r^{(G)}) + \epsilon}\]Continuing the example:
- mean$(r) = (1+1+0+1+0)/5 = 0.6$
- std$(r) = 0.49$
| Output $i$ | $r^{(i)}$ | $A^{(i)} = \frac{r^{(i)} - 0.6}{0.49}$ | Interpretation |
|---|---|---|---|
| 1 | 1 | +0.82 | Better than siblings → reinforce |
| 2 | 1 | +0.82 | Better than siblings → reinforce |
| 3 | 0 | -1.22 | Worse than siblings → discourage |
| 4 | 1 | +0.82 | Better than siblings → reinforce |
| 5 | 0 | -1.22 | Worse than siblings → discourage |
Key insight: The same advantage applies to every token in that output. If output 1 was correct, every token in its reasoning chain gets $A = +0.82$.
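In code, Step 2 is essentially a one-liner per group. A minimal sketch (the small epsilon in the denominator matches the formula above; whether to divide by the std at all varies between GRPO variants):
import torch

def group_normalized_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) rewards for the G responses to a single question."""
    return (rewards - rewards.mean()) / (rewards.std(unbiased=False) + eps)

rewards = torch.tensor([1.0, 1.0, 0.0, 1.0, 0.0])
print(group_normalized_advantages(rewards))
# ≈ tensor([ 0.8165,  0.8165, -1.2247,  0.8165, -1.2247])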
Step 3: The GRPO-Clip Objective
GRPO combines off-policy learning with clipping to stay stable:
\[J_{\text{GRPO-Clip}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o^{(i)}|}\sum_{t} \min\left(\rho_t \cdot A^{(i)}, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) \cdot A^{(i)}\right)\right]\]Symbol-by-symbol breakdown:
| Symbol | Name | Meaning |
|---|---|---|
| $J_{\text{GRPO-Clip}}(\theta)$ | Objective function | The thing we want to maximize: "how good is our policy?" |
| $\theta$ | Parameters | All the weights in our neural network |
| $\mathbb{E}[\cdot]$ | Expectation | Average over many sampled questions and responses |
| $G$ | Group size | Number of responses we generate per question (e.g., 8) |
| $\frac{1}{G}\sum_{i=1}^{G}$ | Average over group | Average the objective across all G responses for this question |
| $i$ | Response index | Which response in the group (1st, 2nd, …, G-th) |
| $o^{(i)}$ | Response i | The i-th generated response (sequence of tokens) |
| $|o^{(i)}|$ | Response length | Number of tokens in response i |
| $\frac{1}{|o^{(i)}|}\sum_{t}$ | Average over tokens | Average the objective across all tokens in this response |
| $t$ | Token index | Which token position in the response |
| $\rho_t$ | Probability ratio | $\frac{\pi_\theta(o_t)}{\pi_{\theta_{old}}(o_t)}$: how much more or less likely is this token under the new vs. old policy? |
| $A^{(i)}$ | Advantage | Was response i better or worse than average in its group? |
| $\epsilon$ | Clip parameter | How far we allow the policy to change (typically 0.1-0.2) |
| $\text{clip}(\rho_t, 1-\epsilon, 1+\epsilon)$ | Clipped ratio | Force $\rho_t$ to stay in range $[1-\epsilon, 1+\epsilon]$ |
| $\min(\cdot, \cdot)$ | Minimum | Take the smaller of the two values (conservative update) |
The probability ratio $\rho_t$ in detail:
\[\rho_t = \frac{\pi_\theta(o_t^{(i)} | q, o_{<t}^{(i)})}{\pi_{\theta_{\text{old}}}(o_t^{(i)} | q, o_{<t}^{(i)})}\]
| $\rho_t$ value | Meaning |
|---|---|
| 1.0 | Token probability unchanged from the policy that generated the response |
| 1.5 | New policy is 50% more likely to generate this token |
| 0.7 | New policy is 30% less likely to generate this token |
The clipping function:
With $\epsilon = 0.2$, the clip function constrains $\rho_t$ to the range $[0.8, 1.2]$:
Input ρ_t:  0.5   0.8   1.0   1.2   1.5   2.0
Output:     0.8   0.8   1.0   1.2   1.2   1.2
           (clipped up)  (unchanged)  (clipped down)
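In PyTorch this clip is just `torch.clamp`; a quick check of the mapping above:
import torch

rho = torch.tensor([0.5, 0.8, 1.0, 1.2, 1.5, 2.0])
print(torch.clamp(rho, 0.8, 1.2))  # tensor([0.8000, 0.8000, 1.0000, 1.2000, 1.2000, 1.2000])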
The min operation: being conservative
We compute BOTH the clipped and unclipped objectives, then take the minimum:
\[\min\left(\rho_t \cdot A^{(i)}, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) \cdot A^{(i)}\right)\]Case 1: Positive advantage (good response, $A > 0$)
| $\rho_t$ | Unclipped $\rho_t \cdot A$ | Clipped | Min (used) |
|---|---|---|---|
| 0.8 | 0.8A | 0.8A | 0.8A |
| 1.0 | 1.0A | 1.0A | 1.0A |
| 1.2 | 1.2A | 1.2A | 1.2A |
| 1.5 | 1.5A | 1.2A | 1.2A ← capped! |
Once $\rho_t > 1 + \epsilon$, the objective stops increasing. No more reward for pushing probability higher.
Plotted against $\rho_t$, the objective for $A > 0$ rises linearly until $\rho_t = 1+\epsilon$ and is then capped at $(1+\epsilon) \cdot A$.
Case 2: Negative advantage (bad response, $A < 0$)
| $\rho_t$ | Unclipped $\rho_t \cdot A$ | Clipped | Min (used) |
|---|---|---|---|
| 1.2 | -1.2A | -1.2A | -1.2A |
| 1.0 | -1.0A | -1.0A | -1.0A |
| 0.8 | -0.8A | -0.8A | -0.8A |
| 0.5 | -0.5A | -0.8A | -0.8A ← capped! |
Once $\rho_t < 1 - \epsilon$, the objective flattens out: there is no further incentive to push this token's probability even lower.
Plotted against $\rho_t$, the objective for $A < 0$ is flat below $\rho_t = 1-\epsilon$ and slopes downward as $\rho_t$ increases beyond that point.
Complete worked example:
GRPO-Clip Objective: Step by Step

Question: "What is 2+3?"

Step 1: Generate G=4 responses from π_θ_old
  Response 1: "Let me think... 2+3 = 5"   → reward = 1
  Response 2: "2 plus 3 equals 6"         → reward = 0
  Response 3: "The answer is 5"           → reward = 1
  Response 4: "I believe it's 7"          → reward = 0

Step 2: Compute group-normalized advantages
  mean(rewards) = 0.5, std(rewards) = 0.5
  A^(1) = (1 - 0.5) / 0.5 = +1.0   (better than average)
  A^(2) = (0 - 0.5) / 0.5 = -1.0   (worse than average)
  A^(3) = (1 - 0.5) / 0.5 = +1.0   (better than average)
  A^(4) = (0 - 0.5) / 0.5 = -1.0   (worse than average)

Step 3: For each token, compute the clipped objective

  Example: token "5" in Response 1 (A = +1.0)
    π_θ_old("5" | context) = 0.4
    π_θ("5" | context)     = 0.6    (after some gradient steps)
    ρ_t = 0.6 / 0.4 = 1.5
    Unclipped: ρ_t × A = 1.5 × 1.0 = 1.50
    Clipped:   clip(1.5, 0.8, 1.2) × A = 1.2 × 1.0 = 1.20
    min(1.50, 1.20) = 1.20  ← use the conservative value

  Example: token "6" in Response 2 (A = -1.0)
    π_θ_old("6" | context) = 0.3
    π_θ("6" | context)     = 0.15   (model learned to avoid this)
    ρ_t = 0.15 / 0.3 = 0.5
    Unclipped: ρ_t × A = 0.5 × (-1.0) = -0.50
    Clipped:   clip(0.5, 0.8, 1.2) × A = 0.8 × (-1.0) = -0.80
    min(-0.50, -0.80) = -0.80  ← use the more negative value

Step 4: Average over all tokens and responses → final objective J(θ)
Plain English summary:
The GRPO-Clip objective says:
- For each question: Generate G different responses, score them, compute advantages
- For each token: Check how much the probability changed ($\rho_t$), multiply by advantage
- But clip the change: Don't let the policy move more than $\epsilon$ away from the old policy
- Average everything: Over all tokens and all responses
Why Clipping Matters
Without clipping, taking many gradient steps on the same batch overfits to it: the policy drifts far from the one that generated the data, and the importance-sampled gradients become unreliable. Clipping ensures each token's probability ratio can move at most about $\epsilon$ away from 1 per batch. This keeps training stable.
The GRPO Training Loop Visualized
Here's the complete GRPO workflow showing how it reuses generated responses:
GRPO TRAINING LOOP

Step 1: Sample a batch of questions from the dataset
        ["What is 2+3?", "Solve x²=4", "What is 7×8?", ...]
Step 2: Snapshot the current model as π_θ_old
Step 3: Generate G responses per question using π_θ_old
        (this is EXPENSIVE: full inference G times per question)
        Question: "What is 2+3?"
          Response 1: "5"  ✓
          Response 2: "6"  ✗
          Response 3: "5"  ✓
          Response 4: "7"  ✗
Step 4: Compute rewards and group-normalized advantages
        A^(1) = +1.0, A^(2) = -1.0, A^(3) = +1.0, A^(4) = -1.0
Steps 5-8: MULTIPLE gradient steps on the SAME responses
        (this is where we save compute!)
        Gradient step 1:
          - Compute ρ_t = π_θ(token) / π_θ_old(token) for all tokens
          - Apply clipping to keep ρ_t in [0.8, 1.2]
          - Update θ
        Gradient step 2: (same responses, updated π_θ)
          - Recompute ρ_t with the new π_θ
          - Clipping prevents ρ_t from going too far
          - Update θ again
        Gradient steps 3, 4, ... (typically 4-8 total)
          - Eventually ρ_t hits the clip boundaries
          - Gradients go to zero → time for fresh data
Step 9: Discard the responses, go back to Step 1 with the updated π_θ
Why is this more efficient than vanilla REINFORCE?
| Method | Responses generated | Gradient steps | Efficiency |
|---|---|---|---|
| REINFORCE | 1000 | 1 | 1 step per 1000 inferences |
| GRPO | 1000 | 4-8 | 4-8 steps per 1000 inferences |
GRPO extracts 4-8× more learning from each expensive batch of generated responses!
The Complete GRPO Algorithm
Algorithm: GRPO
Input: initial model π_θ_init, reward function R, questions D

π_θ ← π_θ_init                                  # start with the base model
for step = 1 to n_grpo_steps:                   # main training loop
    Sample a batch of questions D_b
    π_θ_old ← π_θ                               # snapshot the current model

    # Sample G outputs per question
    for each question q in D_b:
        Sample {o^(1), ..., o^(G)} from π_θ_old
        Compute rewards {r^(1), ..., r^(G)}
        Compute advantages A^(i) via group normalization

    # Take multiple gradient steps on the same rollouts (off-policy)
    for train_step = 1 to n_train_steps_per_rollout_batch:
        Update π_θ by maximizing the GRPO-Clip objective

Output: trained π_θ
Why this works:
- Group normalization provides a baseline without training a separate network
- Off-policy updates let us take multiple gradient steps per batch (efficient!)
- Clipping prevents the policy from changing too much (stable!)
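To see how these pieces compose, here is a rough Python skeleton of the loop. Every helper it calls (`sample_questions`, `generate_group`, `score_responses`, and a `compute_grpo_loss` like the one sketched in Part 8) is a hypothetical placeholder; treat this as pseudocode that happens to parse, not the course's or DeepSeek's actual training code:
import copy
import torch

def grpo_train(model, optimizer, questions, reward_fn,
               n_grpo_steps=100, group_size=8, n_train_steps=4):
    for _ in range(n_grpo_steps):
        batch = sample_questions(questions)          # hypothetical helper
        old_model = copy.deepcopy(model).eval()      # snapshot π_θ_old

        rollouts = []
        for q in batch:
            responses = generate_group(old_model, q, group_size)      # hypothetical helper
            rewards = torch.tensor([reward_fn(r, q) for r in responses])
            advantages = (rewards - rewards.mean()) / (rewards.std(unbiased=False) + 1e-6)
            old_log_probs = score_responses(old_model, q, responses)  # hypothetical helper
            rollouts.append((q, responses, old_log_probs, advantages))

        # Multiple clipped gradient steps on the same rollouts (off-policy)
        for _ in range(n_train_steps):
            for q, responses, old_log_probs, advantages in rollouts:
                loss = compute_grpo_loss(model, q, responses, old_log_probs, advantages)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
    return model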
Part 8: Putting It All Together
Summary Table
| Concept | What It Is | What It Does | Example |
|---|---|---|---|
| Policy $\pi_\theta$ | Your LLM parameterized by weights $\theta$ | Given context, outputs probability distribution over next tokens | P("Paris" \mid "The capital of France is") = 0.85 |
| State $s_t$ | The text generated so far | Provides context for the next decision | "What is 2+3? …" |
| Action $a_t$ | A single token | The choice made at each step | "add" |
| Trajectory $\tau$ | Complete sequence $(s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$ | One full episode from prompt to end-of-text | Question → reasoning → answer → </answer> |
| Reward $r_t$ | Scalar feedback signal | Judges quality of action at state | 0 for intermediate steps, 1 if final answer correct |
| Return $R(\tau)$ | Sum of rewards over trajectory | Single number measuring "how good was this trajectory?" | R = 1 (correct) or R = 0 (wrong) |
| Score function $\nabla_\theta \log \pi_\theta(a \mid s)$ | Gradient of log-probability | Direction in parameter space that increases P(action) | A vector with one entry per model parameter |
| REINFORCE | Vanilla policy gradient algorithm | Multiply score function by return, average over trajectories | Good trajectories → reinforce all their tokens |
| Baseline $b(s)$ | Estimate of expected return | Subtract from reward to get advantage | If model gets 70% right, b ≈ 0.7 |
| Advantage $A = R(\tau) - b$ | "Better or worse than expected?" | Positive → reinforce, Negative → discourage, Zero → no signal | R=1, b=0.7 → A=+0.3 (good!); R=0, b=0.7 → A=-0.7 (bad!) |
| On-policy | Generate → one gradient step → discard | Uses fresh data for each update | Wasteful: 1000 inferences for 1 gradient step |
| Off-policy | Generate → multiple gradient steps → discard | Reuses data with importance weights $\frac{\pi_\theta}{\pi_{\theta_{old}}}$ | Efficient: 1000 inferences for 4-8 gradient steps |
| Importance weight $\rho_t$ | Ratio $\frac{\pi_\theta(a)}{\pi_{\theta_{old}}(a)}$ | Corrects for distribution mismatch when reusing old data | If old P=0.4, new P=0.6, then ρ=1.5 |
| Clipping | Constrain $\rho_t$ to $[1-\epsilon, 1+\epsilon]$ | Prevents policy from changing too fast | With ε=0.2, ρ stays in [0.8, 1.2] |
| GRPO | Group Relative Policy Optimization | Sample G responses per question, use group statistics as baseline | No need to train a separate value network |
Key Equations at a Glance
| Equation | Name | Plain English |
|---|---|---|
| $\nabla_\theta J = \mathbb{E}_{\tau}[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau)]$ | Policy Gradient | "Increase probability of tokens that led to high rewards" |
| $\nabla_\theta J = \mathbb{E}_{\tau}[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (R(\tau) - b)]$ | Baselined Policy Gradient | "Reinforce better-than-expected, discourage worse-than-expected" |
| $A^{(i)} = \frac{r^{(i)} - \text{mean}(r)}{\text{std}(r)}$ | GRPO Advantage | "Compare this response to its siblings" |
| $\min(\rho_t \cdot A, \text{clip}(\rho_t) \cdot A)$ | Clipped Objective | "Update conservatively; don't change too much at once" |
The PyTorch Connection
When you implement GRPO, the core looks like this:
import torch
import torch.nn.functional as F

def get_log_probs(logits, input_ids):
    # Per-token log-probabilities of the realized tokens.
    # (A real implementation would shift logits/labels for a causal LM
    #  and mask out prompt and padding positions.)
    log_probs = F.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, input_ids.unsqueeze(-1)).squeeze(-1)

def compute_grpo_loss(model, batch, old_log_probs, advantages, epsilon=0.2):
    """
    Compute the GRPO-Clip loss for a batch of trajectories.
    advantages holds the response-level advantage, broadcast over tokens.
    """
    # Get current log probabilities (the model is assumed to return logits
    # of shape (batch, seq_len, vocab_size))
    logits = model(batch["input_ids"])
    log_probs = get_log_probs(logits, batch["input_ids"])
    # Compute probability ratios: π_θ / π_θ_old
    ratios = torch.exp(log_probs - old_log_probs)
    # Clipped objective
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
    # Take the minimum (conservative update); negate because optimizers minimize
    loss = -torch.min(unclipped, clipped).mean()
    return loss
The key steps are:
- Compute current vs. old log-probabilities
- Form the probability ratio
- Clip the ratio
- Take the minimum of clipped and unclipped objectives
- Maximize (so we minimize the negative)
The math may look intimidating at first, but the core ideas are simple:
- Try things (sample trajectories)
- See what works (check rewards)
- Do more of what works (policy gradient)
- Be smart about it (baselines, clipping, group normalization)
That's really all there is to it!
Resources:
- Stanford CS336: Language Modeling from Scratch
- OpenAI Spinning Up in Deep RL
- Nathan Lambert's RLHF Book
- DeepSeek R1 Paper
- DeepSeekMath Paper - Original GRPO formulation
- PPO Paper - Proximal Policy Optimization