Copyright © 2026
Adaptive ML, Inc.
All rights reserved

We’ve been focused on making Adaptive Engine a more powerful post-training platform for deploying specialized agents. This release improves control over outputs, increases visibility into training, and expands how teams evaluate and promote models.
Function graders and constrained decoding tighten how teams define evaluation logic and enforce output format. Checkpoint promotion gives teams more control over which checkpoints reach production. A new Monitoring tab brings more training observability into the platform.
Here's what's new.
As teams build more specialized agents, evaluation becomes more central to training quality.
Custom Python graders are now reusable objects in Adaptive Engine. Create, test, and manage them via UI or SDK and reuse them across RL and evaluation recipes without modifying code.
Function graders provide a deterministic alternative to LLM-based evaluation for cases where correctness is structural or rule-based. Instead of relying on an LLM judge, you define evaluation logic directly in Python and validate it against a live sample of your dataset before attaching it to an RL recipe.
Each grader runs in an isolated sandbox, making it safe to reuse across recipes. Full CRUD is available in the Python SDK for programmatic management.
This improves evaluation consistency for tasks with explicit correctness criteria. Learn more about our function graders in our docs.
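As a minimal sketch of what a rule-based grader can look like, the snippet below scores a completion on structural correctness alone: valid JSON with a `label` field from a known set. The function signature and field names are illustrative assumptions, not the Adaptive Engine grader API.

```python
import json

# Hypothetical function grader: deterministic, rule-based scoring in plain
# Python, with no LLM judge involved. Signature and fields are assumptions.
def grade(completion: str) -> float:
    """Return 1.0 if the completion is valid JSON with a 'label' field
    from a known set, else 0.0 — a purely structural correctness check."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if obj.get("label") in {"positive", "negative", "neutral"} else 0.0

print(grade('{"label": "positive"}'))  # 1.0
print(grade('not json'))               # 0.0
```

Because the logic is deterministic, the same completion always receives the same score, which is what makes this kind of grader safe to reuse across RL and evaluation recipes.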
Debugging training runs often requires stitching together metrics across multiple tools.
The Monitoring tab centralizes training telemetry across runs, including loss curves, reward signals, and live metrics, with side-by-side run comparisons.
It consolidates fine-tuning and RL observability into a single interface, reducing reliance on external dashboards and manual debugging workflows.
Monitoring reduces iteration time and removes visibility gaps during training.

Model outputs are often difficult to parse reliably.
Chat completions now support a JSON Schema or a Pydantic model as the response_format parameter. Invalid tokens are excluded at each step, keeping generation within the schema.
The constraint is enforced during decoding by restricting token vocabulary based on the current parse state.
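A toy sketch of that mechanism, with a made-up character-level vocabulary: at each decoding step, only tokens that keep the output a valid prefix of an allowed value survive the mask.

```python
# Toy illustration of constrained decoding. Vocabulary and allowed values
# are invented for the example; real implementations mask a model's full
# token vocabulary against a JSON Schema or grammar parse state.
ALLOWED_VALUES = ["true", "false", "null"]
VOCAB = ["t", "r", "u", "e", "f", "a", "l", "s", "n", "x", "{", "}"]

def allowed_tokens(prefix: str) -> list[str]:
    """Return the subset of VOCAB that extends `prefix` toward an allowed value."""
    return [
        tok for tok in VOCAB
        if any(v.startswith(prefix + tok) for v in ALLOWED_VALUES)
    ]

# With an empty prefix, only first characters of valid values are allowed:
print(allowed_tokens(""))    # ['t', 'f', 'n']
# After emitting "tr", only "u" keeps generation on track toward "true":
print(allowed_tokens("tr"))  # ['u']
```

The real constraint tracks a parse state rather than a string prefix, but the effect is the same: the model can never emit a token that takes generation outside the schema.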
This works with models deployed on Harmony and external providers including OpenAI, Anthropic, and Gemini. Available in the Python SDK, REST API, and in-product chat with presets for classification, entity extraction, and simple object output.
For agentic systems where model outputs are passed between steps as structured data, this turns a parsing assumption into a typed interface and removes an entire class of parsing errors in downstream systems.
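For a sense of what a schema for an extraction-style task looks like, here is a minimal JSON Schema of the kind that could be passed as response_format; the exact parameter shape each provider expects is an assumption here, not a documented contract.

```python
import json

# A minimal JSON Schema for a classification-style output. Field names and
# the enum values are illustrative, not part of any product API.
schema = {
    "type": "object",
    "properties": {
        "label": {"type": "string", "enum": ["positive", "negative", "neutral"]},
        "confidence": {"type": "number"},
    },
    "required": ["label"],
}

# A schema-conforming completion parses cleanly into typed data:
sample = '{"label": "positive", "confidence": 0.93}'
parsed = json.loads(sample)
print(parsed["label"] in schema["properties"]["label"]["enum"])  # True
```

With decoding constrained to this schema, the downstream step can rely on `label` always being present and always being one of the enumerated values.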
The best-performing model is not always the final checkpoint.
Training runs now save model state at configurable intervals. Any checkpoint can be promoted to a standalone model in the registry for evaluation or deployment.
Intermediate checkpoints sometimes outperform the final one on held-out evaluations. Promotion allows direct comparison and shipping of the best-performing checkpoint. Each promoted model retains full lineage to its source run.
A promoted LoRA checkpoint automatically binds to its backbone model. Interrupted runs can resume from the last saved checkpoint rather than starting over. Multi-stage runs such as SFT, PPO, and GRPO track and resume each stage independently.
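The selection step itself is simple to picture: compare checkpoints on a held-out score and promote the argmax rather than defaulting to the final one. Checkpoint names and scores below are made up for illustration.

```python
# Toy illustration: the best checkpoint by held-out evaluation is not
# necessarily the last one saved. Names and scores are invented.
checkpoints = {
    "step-1000": 0.71,
    "step-2000": 0.78,
    "step-3000": 0.74,  # final checkpoint underperforms the middle one here
}

best = max(checkpoints, key=checkpoints.get)
print(best)  # step-2000
```

In Adaptive Engine, that chosen checkpoint is then promoted to a standalone model in the registry, keeping full lineage back to its source run.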
For teams building specialized agents with post-training, we'll continue to publish additional resources.