Blog posts

June 11, 2026

Multimodality in Adaptive Engine

Product
Authors
Dani Lang
Editors
No items found.
Acknowledgements

Multimodality in Adaptive Engine

Today, we’re announcing multimodal support in Adaptive Engine. Teams can now fine-tune and serve models that take images and text as input across the same SDK, UI, training pipeline, and serving stack they already use for text.

This release is built for enterprise teams running high-volume document systems. In this context, multimodal refers to vision-language models: systems that read images alongside text. The same Harmony codebase that powers Adaptive Engine today now supports end-to-end structured extraction from these workflows end-to-end.

Enterprises are sitting on gold mines of verifiable rewards

Document extraction is a model problem

Document extraction is becoming a model problem rather than a pipeline problem.

For most enterprises, the data already exists. It lives in scans, PDFs, forms, receipts, and multi-page documents. The challenge is not capture, but conversion into structured fields that downstream systems can reliably consume.

Insurance claims, prescriptions, receipts, invoices, onboarding forms, and compliance documents all share one constraint: information is embedded in layouts and scans, while downstream systems require structured fields.

These workflows share three characteristics:

  • High volume, often millions of documents per quarter
  • Structured outputs mapped to fixed schemas
  • Tight accuracy requirements where missing fields break downstream systems

Correctness is measured at the schema level. A model that extracts 76 of 80 fields correctly can still fail in production if the missing fields break downstream workflows.

Most teams end up choosing between two approaches: 

  • Traditional document pipelines combine OCR, layout detection, extraction models, and rules. They work in narrow use cases and become difficult to evolve as document formats change.
  • Frontier multimodal APIs simplify integration, but introduce fixed behavior and linear cost scaling with volume.

Enterprise teams do not want a generic model endpoint. They want models they can tune, evaluate, and improve against the documents their systems actually run on.

That shift is what Adaptive Engine is built for: models trained on proprietary schemas, deployed in customer environments, and improved as document data evolves.

Multimodal support across Adaptive Engine

Multimodal support is now available across Adaptive Engine:

  • SDK - Send images alongside text through the existing API with support for structured outputs.
  • Chat UI - Upload multiple images and validate extracted fields using a multi-image carousel that keeps source documents and outputs side by side.
  • Training - Fine-tune supported open-source multimodal models on image-and-text datasets using the same post-training workflow already used for text models.
  • Serving - Deploy multimodal models through the same inference system used for text, without introducing a separate serving path for image workloads.

The workflow stays consistent end-to-end. Teams prepare datasets, run fine-tuning, and deploy models directly into production systems.

One training and inference system for multimodal models

Adaptive Engine supports image-and-text workloads through the same training and inference system used for text models. The system extends the Harmony codebase rather than introducing a separate multimodal stack. Training, evaluation, and inference run in one place.

Most multimodal systems drift when training, evaluation, and serving are split across different environments. Small differences in preprocessing or token handling accumulate over time and create mismatches between training behavior and production behavior.

Adaptive Engine avoids this by running training, evaluation, and inference on the same codebase. Teams see the same behavior across all three stages without reconciling multiple environments.

Built for production document workloads

A multimodal model in Adaptive Engine consists of two components:

  • A vision tower encodes the image.
  • A language model consumes those encodings alongside text tokens and produces the completion.

Adaptive Engine can scale image processing separately from text generation. This matters in document systems where scanned forms, multi-page packets, and high-resolution images behave very differently from text requests. Separating these paths lets teams allocate compute based on workload type as volume grows.

Most enterprise document workflows involve multiple images per request. Claims packets, onboarding forms, prescription sets, and invoice bundles often span multiple pages. The multi-image carousel lets users review documents alongside extracted fields in a single view.

A healthcare team is using Adaptive Engine to extract 20+ structured fields from scanned prescriptions, forms, and clinical notes  into a fixed JSON schema used by downstream validation systems. They initially evaluated frontier multimodal APIs, but found them too generalist to accurately capture complex business rules (e.g. handling specialized medical abbreviations) while reliably extracting fields in the exact required format. They moved to a fine-tuned model running in their own environment, trained and evaluated on their own schema and specific requirements. The system now adapts and improves as new document formats appear in production, with full control over retraining and evaluation cycles.

Next steps 

We are continuing to expand multimodal support for document-heavy workflows. 

  • Direct PDF ingest - removing the need to convert PDFs into images before processing
  • Performance improvements for high - resolution and multi-page documents - improving throughput and memory behavior on large inputs

Start building

Multimodal support is available in Adaptive Engine today. Teams can start fine-tuning and deploying multimodal models using the existing SDK, training workflow, and serving system. Documentation and implementation details are available in the product docs, or teams can chat with our engineers to learn more about how we can support multimodals for post-training for you.

Copyright © 2026

Adaptive ML, Inc.
All rights reserved
Privacy Policy