
December 2, 2025

A Simple Explanation of GSPO

Author: Dylan Ebert

GSPO (Group Sequence Policy Optimization) is a reinforcement learning technique for training language models. Step through the visualization below to see how it works.

Interactive Explanation

[Interactive visualization, 20 steps. Step 1, The Language Model: we begin with a neural network that generates text token by token.]

For a more detailed explanation, continue reading.

The Problem

Training a language model on open-ended tasks is tricky. There's no single correct answer, but some responses are better than others. Reinforcement learning addresses this by rewarding preferred outputs. Here's how it works with GSPO. First, we generate multiple responses.

Generating Multiple Outputs
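As a concrete illustration, here is a minimal sketch of this sampling step using the Hugging Face transformers library. The model name, prompt, and sampling settings are placeholders chosen for the example, not the configuration of any particular GSPO run.

```python
# Sketch: sample a group of candidate responses for one prompt.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder policy model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Explain why the sky is blue."
inputs = tokenizer(prompt, return_tensors="pt")

# Sample a group of G responses to the same prompt.
G = 4
outputs = model.generate(
    **inputs,
    do_sample=True,
    temperature=0.8,
    max_new_tokens=128,
    num_return_sequences=G,
    pad_token_id=tokenizer.eos_token_id,
)

# Strip the prompt tokens, keeping only the generated responses.
responses = tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
```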

Score and Rank

Next, a reward function scores each response. This could be a trained model, a set of rules, or human ratings. Scores are compared to the group average: above-average responses get reinforced, below-average ones get suppressed.

Scoring Responses
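In code, this group-relative scoring can be sketched as below. Normalizing by the group's standard deviation follows the convention of GRPO-style methods; treat the exact normalization as an assumption rather than a detail specified in this post.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Compare each response's reward to the group average.

    rewards: shape (G,), one scalar reward per response in the group.
    Returns one advantage per response: positive for above-average
    responses (reinforced), negative for below-average ones (suppressed).
    """
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + 1e-6)

# Example: four responses scored by a reward function.
rewards = torch.tensor([0.2, 0.9, 0.5, 0.1])
print(group_relative_advantages(rewards))  # above-average -> positive
```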

Comparison to GRPO

Everything so far is identical to GRPO (Group Relative Policy Optimization). The difference is how we assign credit to individual tokens.

GRPO weights each token individually, estimating how much each token contributed to the reward. GSPO, by contrast, applies a single weight to the entire sequence, so every token is treated equally and credit is assigned at the sequence level. This improves training stability and typically yields better results.

[Figure: Token Weighting, GRPO vs. GSPO]
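The sketch below contrasts the two weighting schemes: GRPO computes one importance ratio per token, while GSPO collapses them into a single, length-normalized ratio applied to the whole sequence. Both methods also clip these ratios and multiply by the group-relative advantage; those steps are omitted here to keep the contrast clear.

```python
import torch

def grpo_token_ratios(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """GRPO-style weighting: one importance ratio per token, so each
    token in a response is up- or down-weighted independently."""
    return torch.exp(logp_new - logp_old)           # shape (T,)

def gspo_sequence_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    """GSPO-style weighting: a single, length-normalized ratio for the
    whole sequence, applied equally to every token."""
    return torch.exp((logp_new - logp_old).mean())  # scalar

# Toy example: per-token log-probabilities of one sampled response under
# the current policy (new) and the policy that generated it (old).
logp_old = torch.tensor([-1.2, -0.7, -2.1, -0.9])
logp_new = torch.tensor([-1.0, -0.8, -1.5, -0.9])

print(grpo_token_ratios(logp_new, logp_old))    # varies token by token
print(gspo_sequence_ratio(logp_new, logp_old))  # one number for the sequence
```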

Wrapping Up

Thanks for reading! This was a simple overview of how GSPO trains language models through group-based reinforcement learning.

A technical deep dive with implementation details is planned. Join our mailing list to be notified when it's published.
