April 23, 2026

GRPO, Simply Explained

Education
Authors
Dylan Ebert

GRPO (Group Relative Policy Optimization) is a reinforcement learning algorithm for training language models. Most recent RL-on-LLM methods, such as GSPO and DAPO, are variations on it. They share the same core loop: generate a group of outputs for a prompt, score them, and nudge the model toward the outputs that beat the group average.

Background

Supervised fine-tuning teaches a model from examples: given this prompt, produce this output. Reinforcement learning teaches from rewards.

The model being trained is called the policy model. It attempts a task, earns a reward, and learns to earn higher rewards.

A reward tells the model what "good" means. It can come from an AI judge, a model trained on human feedback, or direct verification.

What GRPO does

Instead of one output per prompt, GRPO produces a group of rollouts. Each rollout is one complete generation from the policy model.
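A minimal sketch of the rollout step. The toy "policy" here is just a fixed distribution over whole completions for one prompt; a real policy model would sample token by token from an LLM. The names `policy_probs` and `sample_rollouts` are illustrative, not from the paper:

```python
import random

# Toy "policy": a distribution over possible completions for the prompt
# "2 + 3 = ?". A real policy samples token-by-token from a language model.
policy_probs = {"4": 0.5, "5": 0.3, "22": 0.2}

def sample_rollouts(group_size):
    """Draw a group of complete generations for the same prompt."""
    completions = list(policy_probs)
    weights = list(policy_probs.values())
    return random.choices(completions, weights=weights, k=group_size)

group = sample_rollouts(8)  # a group of eight rollouts
```

Everything downstream (rewards, advantages, the loss) operates on this group as a unit.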

Each rollout gets a reward. Rather than judging each score on its own, GRPO compares it to the group's average.

Each rollout's standardized score is its advantage. It's applied to each of the rollout's tokens.

A positive advantage makes those tokens more likely; a negative one makes them less likely.

In the training loss, each token's log-probability is weighted by its rollout's advantage.

Calculating loss

For each token in rollout i, the loss includes the term −A_i × log π(token | context). Minimizing it pushes the token's probability up when A_i is positive, down when negative.
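A toy version of this loss in plain Python, assuming per-token log-probabilities are already computed elsewhere. The function name `grpo_loss` and the numbers in the example are made up for illustration, and the sketch omits the clipping and KL terms of the full GRPO objective:

```python
import math

def grpo_loss(token_logprobs, advs):
    """Negative advantage-weighted log-likelihood, averaged over all tokens.

    token_logprobs: one list per rollout of log pi(token | context)
    advs: one advantage per rollout, applied to each of its tokens
    """
    total, n_tokens = 0.0, 0
    for logps, a in zip(token_logprobs, advs):
        for lp in logps:
            total += a * lp
            n_tokens += 1
    # Minimizing this raises the probability of positive-advantage tokens
    # and lowers the probability of negative-advantage tokens.
    return -total / n_tokens

# Two rollouts: the first beat the group average (A = +1), the second did not (A = -1).
loss = grpo_loss(
    [[math.log(0.5), math.log(0.25)], [math.log(0.8)]],
    [1.0, -1.0],
)
```

In practice the log-probabilities come from the policy model's forward pass, and an optimizer backpropagates through them; the advantage itself is treated as a constant.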

That's GRPO. The DeepSeekMath paper introduces it with the full gradient math. GSPO aggregates the signal at the sequence level instead of the token level, stabilizing training for mixture-of-experts models.

Copyright © 2026

Adaptive ML, Inc.
All rights reserved