A Beginner’s Guide to Reinforcement Learning for Language Models

Recent breakthroughs in AI reasoning, like DeepSeek R1 and OpenAI's o1, have been powered by reinforcement learning (RL). But if you're new to RL, the math can feel intimidating. Terms like "policy gradients," "baselines," and "importance sampling" get thrown around, and the equations look like alphabet soup.

In this note, I break down the core concepts of RL for language models in plain English, with simple examples and step-by-step explanations of a few key formulas.

This guide is based on my study notes from Stanford CS336 and resources like OpenAI's Spinning Up in Deep RL and Nathan Lambert's RLHF Book.


The Big Picture: Training Dogs and Language Models

Before diving into equations, let's build intuition with an analogy.

Imagine you're training a dog to do tricks. You can't tell the dog exactly which muscles to move; you can only reward it when it does something good. Over time, the dog learns to repeat actions that led to treats and avoid actions that didn't.

Reinforcement learning for language models works the same way:

| Dog Training | LLM Training |
| --- | --- |
| Dog decides what action to take | LLM decides what token to generate |
| You give a treat (or not) | Reward function gives a score (0 or 1) |
| Dog repeats actions that got treats | LLM increases probability of tokens that led to rewards |

The key insight: we don't tell the model what to generate (i.e., we never show it the ground-truth answer); we just tell it whether its answer was good or bad, and it figures out the rest.

Part 1: Language Models as Policies

What is a Policy?

In RL terminology, a policy is just a decision-making strategy. For language models:

  • State ($s_t$): The text generated so far (the context, or the prefix)
  • Action ($a_t$): The next token to generate
  • Policy ($\pi_\theta$): The probability distribution over possible next tokens

Your LLM is a policy! Given a text prefix, it outputs probabilities for each possible next token:

State:  "The capital of France is"
Policy: {"Paris": 0.85, "Lyon": 0.05, "the": 0.03, ...}
Action: Sample from this distribution → "Paris"

Mathematically, we write this as:

\[a_t \sim \pi_\theta(\cdot | s_t)\]

This reads: "action $a_t$ is sampled from the policy $\pi_\theta$ given state $s_t$."

The Two Operations You Need

To train a policy with RL, you only need two operations:

| Operation | What It Does | Example |
| --- | --- | --- |
| Sampling | Draw a token from the probability distribution | Pick "Paris" with 85% probability (the most likely token) |
| Scoring | Compute the log-probability of a token | $\log \pi_\theta(\text{"Paris"} \mid s_t) = \log(0.85) \approx -0.16$ |

That's it! You don't need to know anything else about the model's internals.
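
Both operations are a couple of lines in code. A minimal PyTorch sketch on a toy distribution like the one above (the numbers are illustrative, not from a real model):

import torch

# Toy next-token probabilities for the prefix "The capital of France is"
# (made-up numbers; the remaining probability mass goes to tokens we ignore here)
tokens = ["Paris", "Lyon", "the"]
probs = torch.tensor([0.85, 0.05, 0.03])

# Sampling: draw a token from the distribution
# (torch.multinomial treats the values as unnormalized weights)
idx = torch.multinomial(probs, num_samples=1).item()
action = tokens[idx]

# Scoring: log-probability of a specific token, e.g. "Paris"
log_prob_paris = torch.log(probs[tokens.index("Paris")])

print(action)                 # usually "Paris"
print(log_prob_paris.item())  # log(0.85) ≈ -0.163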

Part 2: Trajectories - Recording the Journey

What is a Trajectory?

A trajectory (also called an episode or rollout) is the complete sequence of states and actions from start to finish:

\[\tau = (s_0, a_0, s_1, a_1, \ldots, s_T, a_T)\]

Think of it like recording a chess game move by move: you capture everything that happened.

A Concrete Example

Let's trace a trajectory for a math problem:

Prompt: "What is 2+3? Think step by step."

| Timestep | State ($s_t$) | Action ($a_t$) |
| --- | --- | --- |
| 0 | "What is 2+3? Think step by step. " | "I" |
| 1 | "… I" | "need" |
| 2 | "… I need" | "to" |
| 3 | "… I need to" | "add" |
| … | … | … |
| T | "… </think> " | "5" |
| T+1 | "… 5" | "</answer>" |

The trajectory ends when the model emits an end-of-text token (like </answer>) or hits a maximum length.

Key observation: In LLM-land, the "environment" is trivially deterministic. The next state is just the old state plus the new token:

\[s_{t+1} = s_t \oplus a_t\]

(where $\oplus$ means concatenation)
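
To make the bookkeeping concrete, here is a toy rollout loop that records (state, action) pairs. The function generate_next_token is a purely hypothetical stand-in for sampling from the policy:

import random

def generate_next_token(state):
    # Hypothetical stand-in for sampling a_t ~ pi_theta(. | s_t)
    return random.choice(["I", "need", "to", "add", "5", "</answer>"])

state = "What is 2+3? Think step by step. "
trajectory = []                      # list of (s_t, a_t) pairs

for _ in range(50):                  # maximum length
    action = generate_next_token(state)
    trajectory.append((state, action))
    state = state + action + " "     # deterministic "environment": concatenation
    if action == "</answer>":        # end-of-text token terminates the episode
        break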

Part 3: Rewards and Returns - Measuring Success

The Reward Function

The reward $r_t = R(s_t, a_t)$ judges how good an action was. For RL on math problems, we typically use sparse rewards:

  • Intermediate steps: $r_t = 0$ (no feedback until the end)
  • Final answer: $r_T = 1$ if correct, $0$ if wrong

Example:

| Trajectory | Final Answer | Correct? | Reward |
| --- | --- | --- | --- |
| "… 5" | 5 | ✓ | 1 |
| "… 6" | 6 | ✗ | 0 |

The Return: Adding Up Rewards

The return $R(\tau)$ is the total reward accumulated over a trajectory:

\[R(\tau) = \sum_{t=0}^{T} r_t\]

With sparse rewards, only the final step matters, so $R(\tau)$ equals the terminal reward (0 or 1).

The Objective: Maximize Expected Return

The goal of RL is to find policy parameters $\theta$ that maximize expected return:

\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\]

In plain English: "On average, how much reward does my policy get?"

If $J(\theta) = 0.7$, that means your model solves 70% of problems correctly.
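
In practice we never compute this expectation exactly; we estimate it by sampling. A minimal sketch, assuming we already have a list of 0/1 rewards from sampled trajectories:

# Rewards from, say, 10 sampled trajectories (1 = correct, 0 = wrong)
rewards = [1, 0, 1, 1, 1, 0, 1, 1, 0, 1]

# Monte Carlo estimate of J(theta): the average return
J_estimate = sum(rewards) / len(rewards)
print(J_estimate)   # 0.7 -> the policy solves about 70% of problems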

Part 4: The Policy Gradient (Vanilla REINFORCE)

Now we get to the heart of RL: how do we actually improve the policy?

The Key Equation

The Vanilla REINFORCE policy gradient tells us how to update parameters:

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R(\tau)\right]\]

This looks complex, but the intuition is simple: increase the probability of actions that led to high rewards.

Symbol-by-Symbol Breakdown

Let me explain every symbol in this equation:

| Symbol | Name | Meaning |
| --- | --- | --- |
| $J(\theta)$ | Objective function | Expected total reward, the thing we want to maximize |
| $\theta$ | Parameters | All the weights in your language model (millions of numbers) |
| $\nabla_\theta J$ | Gradient | "Which direction should I nudge each parameter to increase $J$, the expected total reward?" |
| $\mathbb{E}[\cdot]$ | Expectation | Average over many samples |
| $\tau$ | Trajectory | One complete episode (e.g., prompt → response → end) |
| $\tau \sim \pi_\theta$ | Sampling | Generate trajectories by running the policy |
| $\sum_t$ | Sum over timesteps | Add up contributions from every token |
| $s_t$ | State | Text prefix at timestep $t$ |
| $a_t$ | Action | Token generated at timestep $t$ |
| $\pi_\theta(a_t \mid s_t)$ | Probability | How likely was this token given this context? |
| $\log \pi_\theta(a_t \mid s_t)$ | Log-probability | Same info, but in log space (more stable) |
| $\nabla_\theta \log \pi_\theta(a_t \mid s_t)$ | Score function* | Gradient of the log-probability; points in the direction that increases this token's probability |
| $R(\tau)$ | Return | Total reward for this trajectory (0 or 1) |

Note on terminology: the name "score function" comes from statistics. Despite "score" sounding like a scalar, the score function is a vector pointing in the direction of steepest increase for the log-probability.

The Log-Derivative Trick: Why the Math Works

The magic behind policy gradients is a simple calculus identity:

\[\nabla_\theta P = P \cdot \nabla_\theta \log P\]

This comes from the chain rule for logarithms:

\[\frac{d}{d\theta} \log P = \frac{1}{P} \cdot \frac{d}{d\theta} P\]

Rearranging:

\[\frac{d}{d\theta} P = P \cdot \frac{d}{d\theta} \log P\]

Why is this useful? It lets us convert "gradient of an expectation" into "expectation of a gradient", which we can estimate by sampling!

Numerical example:

Suppose $P(a) = 0.3$ is the probability of some action.

Say the direct gradient happens to be $\nabla_\theta P = 1$ (some arbitrary value).

Using the trick:

  • $\nabla_\theta \log P = \frac{1}{P} \cdot \nabla_\theta P = \frac{1}{0.3} \cdot 1 \approx 3.33$
  • $P \cdot \nabla_\theta \log P = 0.3 \times 3.33 \approx 1$ ✓

Same answer! The trick is just a rearrangement that makes computation easier.
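
You can check the identity numerically with autograd. A small sketch using a softmax over three arbitrary logits:

import torch

z = torch.tensor([1.0, 0.5, -0.2], requires_grad=True)
P = torch.softmax(z, dim=0)[0]           # probability of "action" 0
grad_P, = torch.autograd.grad(P, z)      # direct gradient of P

z2 = z.detach().clone().requires_grad_(True)
logP = torch.log_softmax(z2, dim=0)[0]   # log-probability of the same action
grad_logP, = torch.autograd.grad(logP, z2)

# The identity: grad(P) == P * grad(log P)
print(torch.allclose(grad_P, P.detach() * grad_logP))   # True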

Deriving the Policy Gradient Step-by-Step

Let's derive the REINFORCE equation from scratch.

Step 1: Write out the expectation explicitly

\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)] = \sum_{\tau} P(\tau | \theta) \cdot R(\tau)\]

This sums over all possible trajectories, weighted by their probability.

Step 2: Take the gradient

\[\nabla_\theta J(\theta) = \sum_{\tau} \nabla_\theta P(\tau | \theta) \cdot R(\tau)\]

Note: $R(\tau)$ doesn't depend on $\theta$ (it's just "was the answer correct?").

Step 3: Apply the log-derivative trick

\[\nabla_\theta J(\theta)= \sum_{\tau} P(\tau | \theta) \cdot \nabla_\theta \log P(\tau | \theta) \cdot R(\tau)\]

Step 4: Recognize this as an expectation

\[\nabla_\theta J(\theta)= \mathbb{E}_{\tau \sim \pi_\theta}\left[\nabla_\theta \log P(\tau | \theta) \cdot R(\tau)\right]\]

Step 5: Expand the trajectory probability

A trajectory's probability is:

\[P(\tau | \theta) = \underbrace{\rho_0(s_0)}_{\text{initial prompt}} \cdot \prod_{t=0}^{T} \underbrace{P(s_{t+1}|s_t, a_t)}_{\text{environment}} \cdot \underbrace{\pi_\theta(a_t|s_t)}_{\text{policy}}\]

Taking the log:

\[\log P(\tau | \theta) = \log \rho_0(s_0) + \sum_t \log P(s_{t+1}|s_t, a_t) + \sum_t \log \pi_\theta(a_t|s_t)\]

When we take $\nabla_\theta$, the first two terms vanish (they don't depend on $\theta$):

\[\nabla_\theta \log P(\tau | \theta) = \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t)\]

Final result:

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R(\tau)\right]\]

Intuitive Summary

The policy gradient says:

  1. Sample trajectories by running your policy
  2. For each token, compute "how to make it more likely" ($\nabla_\theta \log \pi_\theta$)
  3. Scale by the reward: good outcomes get reinforced, bad ones don't
  4. Average across trajectories (see the code sketch below)
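
These four steps map almost directly onto code. A minimal sketch of the REINFORCE loss, assuming per-token log-probabilities and one return per trajectory are already available:

import torch

# Stand-ins: per-token log-probs for 3 sampled trajectories of 4 tokens each.
# In practice these come from the model; requires_grad simulates that here.
log_probs = torch.randn(3, 4, requires_grad=True)
returns = torch.tensor([1.0, 0.0, 1.0])      # R(tau) for each trajectory

# REINFORCE: sum log-probs over tokens, weight by the return,
# average over trajectories, and negate because optimizers minimize.
loss = -(log_probs.sum(dim=1) * returns).mean()
loss.backward()                              # gradients flow back into log_probs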

Part 5: Baselines - Reducing the Noise

The Problem with Vanilla REINFORCE

Vanilla REINFORCE has high variance. Here's why:

Suppose your model already solves 90% of problems. Most trajectories get $R(\tau) = 1$, so the gradient says "increase probability of these tokens!" even for trajectories that succeeded by luck.

The signal is noisy: sometimes you're reinforcing good reasoning, sometimes just lucky guesses.

The Solution: Subtract a Baseline

The fix is to subtract a baseline $b(s)$ that estimates "what return do we typically get?":

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot \underbrace{(R(\tau) - b(s_t))}_{\text{advantage}}\right]\]

The quantity $(R(\tau) - b(s_t))$ is called the advantage:

| Advantage | Meaning | Effect |
| --- | --- | --- |
| Positive | Better than expected | Reinforce these tokens |
| Negative | Worse than expected | Discourage these tokens |
| Zero | Exactly as expected | No change |

A Concrete Example

Without baseline:

| Trajectory | $R(\tau)$ | Gradient Signal |
| --- | --- | --- |
| Correct | 1 | Make these tokens more likely! |
| Wrong | 0 | Do nothing |

Every correct answer gets the same reinforcement, regardless of difficulty; every wrong answer gets no punishment. The model only learns from successes. This is actually a key limitation of vanilla REINFORCE with 0/1 rewards! You're not learning what to avoid, only what worked.

With a baseline (say, $b = 0.9$ because the model gets 90% right):

| Trajectory | $R(\tau)$ | Advantage = $R - 0.9$ | Gradient Signal |
| --- | --- | --- | --- |
| Correct | 1 | +0.1 | "Slightly reinforce" |
| Wrong | 0 | -0.9 | "Strongly discourage!" |

Now the model also learns to avoid failures instead of just redundantly reinforcing successes!
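
A tiny sketch of this effect, using the fixed baseline of 0.9 from the example (in practice the baseline would be estimated):

import torch

rewards = torch.tensor([1.0, 1.0, 0.0, 1.0])   # three correct answers, one wrong
baseline = 0.9                                  # "we usually get ~0.9"

advantages = rewards - baseline
print(advantages)   # tensor([ 0.1000,  0.1000, -0.9000,  0.1000])
# Correct answers get a gentle push; the failure gets a strong negative signal.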

Why Baselines Don't Add Bias

You might worry: "Doesn't subtracting something change the answer?"

No! The baseline term vanishes in expectation. Let's prove it.

The claim: For any baseline $b(s)$ that only depends on the state:

\[\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)] = 0\]

The proof:

Since $b(s)$ doesn't depend on the action $a$, we can pull it out:

\[\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s) \cdot b(s)]= b(s) \cdot \mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)]\]

Now we show the expectation of the score function is zero:

\[\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)] = \sum_{a} \pi_\theta(a|s) \cdot \nabla_\theta \log \pi_\theta(a|s)\]

Using the identity $\nabla_\theta \log P = \frac{\nabla_\theta P}{P}$:

\[\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)]= \sum_{a} \pi_\theta(a|s) \cdot \frac{\nabla_\theta \pi_\theta(a|s)}{\pi_\theta(a|s)} = \sum_{a} \nabla_\theta \pi_\theta(a|s)\]

Swapping sum and gradient:

\[\mathbb{E}_{a \sim \pi_\theta}[\nabla_\theta \log \pi_\theta(a|s)]= \nabla_\theta \sum_{a} \pi_\theta(a|s) = \nabla_\theta 1 = 0\]

The last step works because the probabilities of all possible actions sum to 1.

Concrete example with softmax as the policy function:

Let's work through a real example. Consider a language model whose token probabilities come from a softmax (the policy) over the logits:

\[\pi(a) = \frac{e^{z_a}}{\sum_k e^{z_k}}\]

The log-probability simplifies nicely:

\[\log \pi(a) = z_a - \log \sum_k e^{z_k}\]

Taking gradients with respect to each logit $z_i$:

| Gradient | Formula | Intuition |
| --- | --- | --- |
| $\frac{\partial \log \pi(a)}{\partial z_a}$ | $1 - \pi(a)$ | Increasing your own logit helps (less so if you are already confident) |
| $\frac{\partial \log \pi(a)}{\partial z_b}$ | $-\pi(b)$ | Increasing a competitor's logit hurts |

Derivation: For the chosen token, $\frac{\partial}{\partial z_a}[z_a - \log\sum_k e^{z_k}] = 1 - \frac{e^{z_a}}{\sum_k e^{z_k}} = 1 - \pi(a)$. For other tokens, $\frac{\partial}{\partial z_b}[z_a - \log\sum_k e^{z_k}] = 0 - \frac{e^{z_b}}{\sum_k e^{z_k}} = -\pi(b)$.

Numerical example:

Suppose we have 3 tokens with probabilities $[\pi(A), \pi(B), \pi(C)] = [0.5, 0.3, 0.2]$.

The score function for each token (as a vector over logits $[z_A, z_B, z_C]$):

| Token | Score function $\nabla_z \log \pi$ | Weighted by $\pi$ |
| --- | --- | --- |
| A | $[1-0.5, -0.3, -0.2] = [+0.5, -0.3, -0.2]$ | $0.5 \times [+0.5, -0.3, -0.2] = [+0.25, -0.15, -0.10]$ |
| B | $[-0.5, 1-0.3, -0.2] = [-0.5, +0.7, -0.2]$ | $0.3 \times [-0.5, +0.7, -0.2] = [-0.15, +0.21, -0.06]$ |
| C | $[-0.5, -0.3, 1-0.2] = [-0.5, -0.3, +0.8]$ | $0.2 \times [-0.5, -0.3, +0.8] = [-0.10, -0.06, +0.16]$ |
| Sum | | $[0, 0, 0]$ ✓ |

Each component sums to zero! For example, the first component: $0.25 - 0.15 - 0.10 = 0$.

The "increase my probability" directions (positive entries) are exactly canceled by the "decrease others' probability" directions (negative entries) when weighted by the policy.
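
You can reproduce the table in a few lines; this sketch uses the same made-up probabilities [0.5, 0.3, 0.2]:

import torch

probs = torch.tensor([0.5, 0.3, 0.2])         # pi(A), pi(B), pi(C)

# Score function for token i w.r.t. the logits: one_hot(i) - probs
scores = torch.eye(3) - probs                 # row i = score vector for token i

# Weight each row by its probability and sum: the expectation of the score
weighted_sum = (probs[:, None] * scores).sum(dim=0)
print(weighted_sum)                           # ~[0, 0, 0] (up to floating-point error)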

Why this matters:

We can subtract any function of the state from our rewards without changing the expected gradient:

\[\mathbb{E}[\nabla_\theta \log \pi_\theta \cdot (R(\tau) - b(s))] = \mathbb{E}[\nabla_\theta \log \pi_\theta \cdot R(\tau)] - \underbrace{\mathbb{E}[\nabla_\theta \log \pi_\theta \cdot b(s)]}_{= 0}\]

We get lower variance (because advantages are centered around zero) without introducing bias. Free lunch!

Part 6: Off-Policy Learning - Reusing Old Data

The Inefficiency of On-Policy Learning

Vanilla REINFORCE is on-policy: you generate rollouts, take one gradient step, then throw away the data and generate fresh rollouts.

Generate 1000 responses → one gradient step → discard → generate 1000 more → ...

ON-POLICY REINFORCE

  1. Sample 1000 questions from your question bank (e.g., "What is 2+3?", "Solve x²=4", ...)
  2. The current model π_θ generates a response for each question (this is expensive: running inference 1000 times)
  3. Check answers → rewards [1, 0, 1, 1, 0, ...]
  4. Compute the gradient, update θ → θ'
  5. Discard all 1000 responses ← this is the wasteful part!
  6. Go back to step 1 with the new model π_θ'

This is wasteful! LLM inference is expensive, and we're only using each sample once.

The Solution: Importance Sampling

In off-policy learning, we reuse rollouts from a previous policy $\pi_{\theta_{\text{old}}}$ to train the current policy $\pi_\theta$.

OFF-POLICY

  1. Sample questions
  2. π_θ_old generates responses (expensive, but done once)
  3. Check answers → rewards
  4. Gradient step 1 (with importance weights to correct for π_θ_old)
  5. Gradient step 2 (same data!)
  6. Gradient step 3 (same data!)
     ... typically 4-8 gradient steps per batch, reusing the same responses multiple times
  7. Now discard and generate fresh responses

The trick is importance sampling: we reweight samples to correct for the mismatch between old and new policies.

\[g_{\text{off-policy}} = \frac{1}{N} \sum_{i=1}^{N} \sum_t \underbrace{\frac{\pi_\theta(a_t^{(i)}|s_t^{(i)})}{\pi_{\theta_{\text{old}}}(a_t^{(i)}|s_t^{(i)})}}_{\text{importance weight}} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}) \cdot R(\tau^{(i)})\]

\(N\) is the number of trajectories in the batch (e.g., 1000 responses).

The importance weight $\rho_t = \frac{\pi_\theta}{\pi_{\theta_{\text{old}}}}$ corrects for the distribution shift.

A Concrete Example

Suppose the old policy generated token β€œParis” with probability 0.5, but your current policy would generate it with probability 0.7.

Without correction: you'd undercount "Paris" because it was sampled from a distribution that liked it less.

With importance weight: $\rho = 0.7 / 0.5 = 1.4$

You upweight this sample by 40% to compensate.

| Old Policy | Current Policy | Importance Weight |
| --- | --- | --- |
| P("Paris") = 0.5 | P("Paris") = 0.7 | 0.7/0.5 = 1.4 |
| P("Lyon") = 0.3 | P("Lyon") = 0.1 | 0.1/0.3 = 0.33 |
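
In practice the ratio is computed from log-probabilities, since that is what the model gives you. A small sketch with the two tokens above:

import torch

# Log-probs of the same sampled tokens under the old and current policy
old_log_probs = torch.log(torch.tensor([0.5, 0.3]))   # "Paris", "Lyon" under pi_old
new_log_probs = torch.log(torch.tensor([0.7, 0.1]))   # same tokens under pi_theta

ratios = torch.exp(new_log_probs - old_log_probs)     # pi_theta / pi_old
print(ratios)    # tensor([1.4000, 0.3333])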

The Catch: Don't Stray Too Far

Importance sampling only works when $\pi_\theta$ and $\pi_{\theta_{\text{old}}}$ are similar. If they diverge:

  • Some importance weights explode (e.g., 100×)
  • Gradient estimates become unreliable
  • Training becomes unstable

This is why algorithms like PPO and GRPO clip the importance weights; more on this next!

Now let's put everything together with GRPO, the algorithm used to train DeepSeek R1.

Part 7: GRPO - Group Relative Policy Optimization

The Core Idea: Compare Siblings

Remember, we need a baseline to reduce variance. The standard approach is to train a separate model to predict expected returns, but this is extra work.

GRPO's insight: instead of learning a baseline, sample multiple answers for the same question and compare them to each other!

If you ask the model "What is 2+3?" five times and it gets three right and two wrong, the correct answers are "better than average" and the wrong ones are "worse than average." No separate baseline network needed!

Step 1: Sample Multiple Outputs Per Question

For each question $q$, sample $G$ outputs (the "group"):

\[\{o^{(1)}, o^{(2)}, \ldots, o^{(G)}\} \sim \pi_\theta(\cdot | q)\]

Example with G=5:

| Question | Output $i$ | Answer | Correct? | Reward $r^{(i)}$ |
| --- | --- | --- | --- | --- |
| "What is 15×7?" | 1 | "105" | ✓ | 1 |
| | 2 | "105" | ✓ | 1 |
| | 3 | "112" | ✗ | 0 |
| | 4 | "105" | ✓ | 1 |
| | 5 | "107" | ✗ | 0 |

Step 2: Compute Group-Normalized Advantages

The advantage for output $i$ is computed by normalizing within the group:

\[A^{(i)} = \frac{r^{(i)} - \text{mean}(r^{(1)}, \ldots, r^{(G)})}{\text{std}(r^{(1)}, \ldots, r^{(G)}) + \epsilon}\]

Continuing the example:

  • mean$(r) = (1+1+0+1+0)/5 = 0.6$
  • std$(r) = 0.49$
| Output $i$ | $r^{(i)}$ | $A^{(i)} = \frac{r^{(i)} - 0.6}{0.49}$ | Interpretation |
| --- | --- | --- | --- |
| 1 | 1 | +0.82 | Better than siblings → reinforce |
| 2 | 1 | +0.82 | Better than siblings → reinforce |
| 3 | 0 | -1.22 | Worse than siblings → discourage |
| 4 | 1 | +0.82 | Better than siblings → reinforce |
| 5 | 0 | -1.22 | Worse than siblings → discourage |

Key insight: The same advantage applies to every token in that output. If output 1 was correct, every token in its reasoning chain gets $A = +0.82$.
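
Here is a sketch of the group normalization for this example; the small epsilon simply guards against a zero standard deviation when every reward in the group is identical:

import torch

rewards = torch.tensor([1.0, 1.0, 0.0, 1.0, 0.0])   # G = 5 outputs for one question
eps = 1e-6

mean = rewards.mean()                 # 0.6
std = rewards.std(unbiased=False)     # 0.49 (population std)
advantages = (rewards - mean) / (std + eps)
print(advantages)   # tensor([ 0.8165,  0.8165, -1.2247,  0.8165, -1.2247])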

Step 3: The GRPO-Clip Objective

GRPO combines off-policy learning with clipping to stay stable:

\[J_{\text{GRPO-Clip}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o^{(i)}|}\sum_{t} \min\left(\rho_t \cdot A^{(i)}, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) \cdot A^{(i)}\right)\right]\]

Symbol-by-symbol breakdown:

| Symbol | Name | Meaning |
| --- | --- | --- |
| $J_{\text{GRPO-Clip}}(\theta)$ | Objective function | The thing we want to maximize: "how good is our policy?" |
| $\theta$ | Parameters | All the weights in our neural network |
| $\mathbb{E}[\cdot]$ | Expectation | Average over many sampled questions and responses |
| $G$ | Group size | Number of responses we generate per question (e.g., 8) |
| $\frac{1}{G}\sum_{i=1}^{G}$ | Average over group | Average the objective across all $G$ responses for this question |
| $i$ | Response index | Which response in the group (1st, 2nd, ..., $G$-th) |
| $o^{(i)}$ | Response $i$ | The $i$-th generated response (sequence of tokens) |
| $\lvert o^{(i)} \rvert$ | Response length | Number of tokens in response $i$ |
| $\frac{1}{\lvert o^{(i)} \rvert}\sum_{t}$ | Average over tokens | Average the objective across all tokens in this response |
| $t$ | Token index | Which token position in the response |
| $\rho_t$ | Probability ratio | $\frac{\pi_\theta(o_t)}{\pi_{\theta_{\text{old}}}(o_t)}$: how much more or less likely is this token under the new vs. old policy? |
| $A^{(i)}$ | Advantage | Was response $i$ better or worse than average in its group? |
| $\epsilon$ | Clip parameter | How far we allow the policy to change (typically 0.1-0.2) |
| $\text{clip}(\rho_t, 1-\epsilon, 1+\epsilon)$ | Clipped ratio | Force $\rho_t$ to stay in the range $[1-\epsilon, 1+\epsilon]$ |
| $\min(\cdot, \cdot)$ | Minimum | Take the smaller of the two values (conservative update) |

The probability ratio $\rho_t$ in detail:

\[\rho_t = \frac{\pi_\theta(o_t^{(i)} | q, o_{<t}^{(i)})}{\pi_{\theta_{\text{old}}}(o_t^{(i)} | q, o_{<t}^{(i)})}\]
| $\rho_t$ value | Meaning |
| --- | --- |
| 1.0 | Token probability unchanged from when the response was generated |
| 1.5 | New policy is 50% more likely to generate this token |
| 0.7 | New policy is 30% less likely to generate this token |

The clipping function:

With $\epsilon = 0.2$, the clip function constrains $\rho_t$ to the range $[0.8, 1.2]$:

Input ρ_t:    0.5   0.8   1.0   1.2   1.5   2.0
              ↓     ↓     ↓     ↓     ↓     ↓
Output:       0.8   0.8   1.0   1.2   1.2   1.2
              ↑           ↑           ↑
           clipped     unchanged   clipped
              up                     down

The min operation - being conservative:

We compute BOTH the clipped and unclipped objectives, then take the minimum:

\[\min\left(\rho_t \cdot A^{(i)}, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) \cdot A^{(i)}\right)\]

Case 1: Positive advantage (good response, $A > 0$)

| $\rho_t$ | Unclipped $\rho_t \cdot A$ | Clipped | Min (used) |
| --- | --- | --- | --- |
| 0.8 | 0.8A | 0.8A | 0.8A |
| 1.0 | 1.0A | 1.0A | 1.0A |
| 1.2 | 1.2A | 1.2A | 1.2A |
| 1.5 | 1.5A | 1.2A | 1.2A ← capped! |

Once $\rho_t > 1 + \epsilon$, the objective stops increasing. No more reward for pushing probability higher.

Objective
    ▲
    │            ┌────────────── capped at (1+ε)A
    │           /
    │          /
    │         /
    └────────┴─────────────────► ρ_t
           1.0  1.2

Case 2: Negative advantage (bad response, $A < 0$)

| $\rho_t$ | Unclipped $\rho_t \cdot A$ | Clipped | Min (used) |
| --- | --- | --- | --- |
| 1.2 | 1.2A | 1.2A | 1.2A |
| 1.0 | 1.0A | 1.0A | 1.0A |
| 0.8 | 0.8A | 0.8A | 0.8A |
| 0.5 | 0.5A | 0.8A | 0.8A ← capped! |

(Since $A < 0$, $0.8A$ is more negative than $0.5A$, so the min picks the clipped value.) Once $\rho_t < 1 - \epsilon$, the objective stops increasing: there is no extra reward for pushing the probability even lower.

Objective
    ▲
    │   ─────────┐
    │             \
    │              \
    │               \
    └───────────────┴────────► ρ_t
               0.8  1.0

Complete worked example:

GRPO-Clip Objective: Step by Step

Question: "What is 2+3?"

Step 1: Generate G=4 responses from π_θ_old

  • Response 1: "Let me think... 2+3 = 5" ✓   reward = 1
  • Response 2: "2 plus 3 equals 6" ✗         reward = 0
  • Response 3: "The answer is 5" ✓           reward = 1
  • Response 4: "I believe it's 7" ✗          reward = 0

Step 2: Compute group-normalized advantages

  mean(rewards) = 0.5,  std(rewards) = 0.5

  A⁽¹⁾ = (1 - 0.5) / 0.5 = +1.0   (better than average)
  A⁽²⁾ = (0 - 0.5) / 0.5 = -1.0   (worse than average)
  A⁽³⁾ = (1 - 0.5) / 0.5 = +1.0   (better than average)
  A⁽⁴⁾ = (0 - 0.5) / 0.5 = -1.0   (worse than average)

Step 3: For each token, compute the clipped objective

  Example: token "5" in Response 1 (A = +1.0)

    π_θ_old("5" | context) = 0.4
    π_θ("5" | context)     = 0.6    (after some gradient steps)
    ρ_t = 0.6 / 0.4 = 1.5

    Unclipped: ρ_t × A = 1.5 × 1.0 = 1.50
    Clipped:   clip(1.5, 0.8, 1.2) × A = 1.2 × 1.0 = 1.20
    min(1.50, 1.20) = 1.20   ← use the conservative value

  Example: token "6" in Response 2 (A = -1.0)

    π_θ_old("6" | context) = 0.3
    π_θ("6" | context)     = 0.15   (model learned to avoid this)
    ρ_t = 0.15 / 0.3 = 0.5

    Unclipped: ρ_t × A = 0.5 × (-1.0) = -0.50
    Clipped:   clip(0.5, 0.8, 1.2) × A = 0.8 × (-1.0) = -0.80
    min(-0.50, -0.80) = -0.80   ← use the more negative value

Step 4: Average over all tokens and responses → final objective J(θ)

Plain English summary:

The GRPO-Clip objective says:

  1. For each question: Generate G different responses, score them, compute advantages
  2. For each token: Check how much the probability changed ($\rho_t$), multiply by advantage
  3. But clip the change: Don't let the policy move more than $\epsilon$ away from the old policy
  4. Average everything: Over all tokens and all responses

Why Clipping Matters

Without clipping, taking many gradient steps on the same batch leads to overfitting to that batch. Clipping ensures each token's probability ratio can only move about $\epsilon$ (e.g., 20%) away from the old policy per batch. This keeps training stable.

The GRPO Training Loop Visualized

Here's the complete GRPO workflow, showing how it reuses generated responses:

GRPO TRAINING LOOP

  Step 1: Sample a batch of questions from the dataset
          ["What is 2+3?", "Solve x²=4", "What is 7×8?", ...]

  Step 2: Snapshot the current model as π_θ_old

  Step 3: Generate G responses per question using π_θ_old
          (This is EXPENSIVE: full inference G times per question.)
          Question: "What is 2+3?" → Response 1: "5" ✓, Response 2: "6" ✗, Response 3: "5" ✓, Response 4: "7" ✗

  Step 4: Compute rewards and group-normalized advantages
          A⁽¹⁾ = +1.0, A⁽²⁾ = -1.0, A⁽³⁾ = +1.0, A⁽⁴⁾ = -1.0

  Steps 5-8: MULTIPLE gradient steps on the SAME responses (this is where we save compute!)
          • Gradient step 1: compute ρ_t = π_θ(token) / π_θ_old(token) for all tokens,
            apply clipping to keep ρ_t in [0.8, 1.2], update θ
          • Gradient step 2: same responses, updated π_θ; recompute ρ_t with the new π_θ,
            clipping prevents ρ_t from going too far, update θ again
          • Gradient steps 3, 4, ... (typically 4-8 total): eventually ρ_t hits the clip
            boundaries and the gradients go to zero → time for fresh data

  Step 9: Discard the responses and go back to Step 1 with the updated π_θ

Why is this more efficient than vanilla REINFORCE?

| Method | Responses generated | Gradient steps | Efficiency |
| --- | --- | --- | --- |
| REINFORCE | 1000 | 1 | 1 step per 1000 inferences |
| GRPO | 1000 | 4-8 | 4-8 steps per 1000 inferences |

GRPO extracts 4-8× more learning from each expensive batch of generated responses!

The Complete GRPO Algorithm

Algorithm: GRPO

Input: initial model π_θ, reward function R, questions D

1:  π_θ ← π_θ_init                        # Start with base model

2:  for step = 1 to n_grpo_steps:          # Main training loop
3:      Sample batch of questions D_b
4:      π_θ_old ← π_θ                      # Snapshot current model
5:
6:      # Sample G outputs per question
7:      for each question q in D_b:
8:          Sample {o^(1), ..., o^(G)} from π_θ_old
9:          Compute rewards {r^(1), ..., r^(G)}
10:         Compute advantages A^(i) via group normalization
11:
12:     # Take multiple gradient steps on same rollouts (off-policy)
13:     for train_step = 1 to n_train_steps_per_rollout_batch:
14:         Update π_θ by maximizing the GRPO-Clip objective
15:
16: Output: trained π_θ

Why this works:

  1. Group normalization provides a baseline without training a separate network
  2. Off-policy updates let us take multiple gradient steps per batch (efficient!)
  3. Clipping prevents the policy from changing too much (stable!)

Part 8: Putting It All Together

Summary Table

| Concept | What It Is | What It Does | Example |
| --- | --- | --- | --- |
| Policy $\pi_\theta$ | Your LLM, parameterized by weights $\theta$ | Given context, outputs a probability distribution over next tokens | $P(\text{"Paris"} \mid \text{"The capital of France is"}) = 0.85$ |
| State $s_t$ | The text generated so far | Provides context for the next decision | "What is 2+3? I need to" |
| Action $a_t$ | A single token | The choice made at each step | "add" |
| Trajectory $\tau$ | Complete sequence $(s_0, a_0, s_1, a_1, \ldots, s_T, a_T)$ | One full episode from prompt to end-of-text | Question → reasoning → answer → </answer> |
| Reward $r_t$ | Scalar feedback signal | Judges the quality of an action at a state | 0 for intermediate steps, 1 if final answer correct |
| Return $R(\tau)$ | Sum of rewards over the trajectory | Single number measuring "how good was this trajectory?" | R = 1 (correct) or R = 0 (wrong) |
| Score function $\nabla_\theta \log \pi_\theta(a \mid s)$ | Gradient of log-probability | Direction in parameter space that increases P(action) | A vector with one entry per model parameter |
| REINFORCE | Vanilla policy gradient algorithm | Multiply score function by return, average over trajectories | Good trajectories → reinforce all their tokens |
| Baseline $b(s)$ | Estimate of expected return | Subtract from reward to get the advantage | If the model gets 70% right, b ≈ 0.7 |
| Advantage $A = R(\tau) - b$ | "Better or worse than expected?" | Positive → reinforce, negative → discourage, zero → no signal | R=1, b=0.7 → A=+0.3 (good!); R=0, b=0.7 → A=-0.7 (bad!) |
| On-policy | Generate → one gradient step → discard | Uses fresh data for each update | Wasteful: 1000 inferences for 1 gradient step |
| Off-policy | Generate → multiple gradient steps → discard | Reuses data with importance weights $\frac{\pi_\theta}{\pi_{\theta_{\text{old}}}}$ | Efficient: 1000 inferences for 4-8 gradient steps |
| Importance weight $\rho_t$ | Ratio $\frac{\pi_\theta(a)}{\pi_{\theta_{\text{old}}}(a)}$ | Corrects for distribution mismatch when reusing old data | If old P=0.4, new P=0.6, then ρ=1.5 |
| Clipping | Constrain $\rho_t$ to $[1-\epsilon, 1+\epsilon]$ | Prevents the policy from changing too fast | With ε=0.2, ρ stays in [0.8, 1.2] |
| GRPO | Group Relative Policy Optimization | Sample G responses per question, use group statistics as the baseline | No need to train a separate value network |

Key Equations at a Glance

| Equation | Name | Plain English |
| --- | --- | --- |
| $\nabla_\theta J = \mathbb{E}_{\tau}[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau)]$ | Policy Gradient | "Increase probability of tokens that led to high rewards" |
| $\nabla_\theta J = \mathbb{E}_{\tau}[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot (R(\tau) - b)]$ | Baselined Policy Gradient | "Reinforce better-than-expected, discourage worse-than-expected" |
| $A^{(i)} = \frac{r^{(i)} - \text{mean}(r)}{\text{std}(r)}$ | GRPO Advantage | "Compare this response to its siblings" |
| $\min(\rho_t \cdot A, \text{clip}(\rho_t, 1-\epsilon, 1+\epsilon) \cdot A)$ | Clipped Objective | "Update conservatively: don't change too much at once" |

The PyTorch Connection

When you implement GRPO, the core looks like this:

import torch

def get_log_probs(logits, input_ids):
    """Per-token log-probabilities of the realized tokens (one possible implementation)."""
    log_probs = torch.log_softmax(logits, dim=-1)
    return log_probs.gather(-1, input_ids.unsqueeze(-1)).squeeze(-1)

def compute_grpo_loss(model, batch, old_log_probs, advantages, epsilon=0.2):
    """
    Compute GRPO-Clip loss for a batch of trajectories.
    """
    # Get current log probabilities
    logits = model(batch["input_ids"])
    log_probs = get_log_probs(logits, batch["input_ids"])

    # Compute probability ratios: π_θ / π_θ_old
    ratios = torch.exp(log_probs - old_log_probs)

    # Clipped objective
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages

    # Take minimum (conservative update)
    loss = -torch.min(unclipped, clipped).mean()

    return loss
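
As a quick shape-level sanity check, you can call it with dummy tensors; the stand-in "model" below just returns random logits and is purely illustrative:

import torch

vocab, seq_len, batch_size = 50, 6, 4
# Hypothetical model: returns random logits of shape (batch, seq, vocab)
dummy_model = lambda ids: torch.randn(ids.shape[0], ids.shape[1], vocab, requires_grad=True)

batch = {"input_ids": torch.randint(0, vocab, (batch_size, seq_len))}
old_log_probs = torch.log(torch.rand(batch_size, seq_len))              # pretend π_θ_old log-probs
advantages = torch.tensor([[1.0], [-1.0], [1.0], [-1.0]]).expand(batch_size, seq_len)  # same A per token

loss = compute_grpo_loss(dummy_model, batch, old_log_probs, advantages)
loss.backward()     # gradients flow through the current log-probs
print(loss.item())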

The key steps are:

  1. Compute current vs. old log-probabilities
  2. Form the probability ratio
  3. Clip the ratio
  4. Take the minimum of clipped and unclipped objectives
  5. Maximize (so we minimize the negative)

The math may look intimidating at first, but the core ideas are simple:

  • Try things (sample trajectories)
  • See what works (check rewards)
  • Do more of what works (policy gradient)
  • Be smart about it (baselines, clipping, group normalization)

That's really all there is to it!


Resources: