Copyright © 2026
Adaptive ML, Inc.
All rights reserved

Multimodality in Adaptive Engine
Today, we’re announcing multimodal support in Adaptive Engine. Teams can now fine-tune and serve models that take images and text as input, using the same SDK, UI, training pipeline, and serving stack they already use for text.
This release is built for enterprise teams running high-volume document systems. In this context, multimodal refers to vision-language models: systems that read images alongside text. The same Harmony codebase that powers Adaptive Engine today now supports end-to-end structured extraction for these workflows.
Document extraction is becoming a model problem rather than a pipeline problem.
For most enterprises, the data already exists. It lives in scans, PDFs, forms, receipts, and multi-page documents. The challenge is not capture, but conversion into structured fields that downstream systems can reliably consume.
Insurance claims, prescriptions, receipts, invoices, onboarding forms, and compliance documents all share one constraint: information is embedded in layouts and scans, while downstream systems require structured fields.
These workflows share three characteristics:
Correctness is measured at the schema level. A model that extracts 76 of 80 fields correctly can still fail in production if the missing fields break downstream workflows.
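To make the gap between field-level and schema-level correctness concrete, here is a minimal Python sketch. The field names, values, and both helper functions are hypothetical illustrations and are not part of the Adaptive Engine SDK; the only figures taken from the text are the 76 of 80 fields.

```python
from typing import Any, Mapping


def field_accuracy(pred: Mapping[str, Any], gold: Mapping[str, Any]) -> float:
    """Fraction of gold fields whose predicted value matches exactly."""
    correct = sum(1 for name, value in gold.items() if pred.get(name) == value)
    return correct / len(gold)


def is_schema_exact(pred: Mapping[str, Any], gold: Mapping[str, Any]) -> bool:
    """True only if every gold field is present in pred with the right value."""
    return all(pred.get(name) == value for name, value in gold.items())


# Hypothetical 80-field document; names and values are placeholders.
gold = {f"field_{i}": f"value_{i}" for i in range(80)}

# Simulate an extraction that misses 4 of the 80 fields.
pred = dict(gold)
for i in range(4):
    pred[f"field_{i}"] = None

print(field_accuracy(pred, gold))   # 0.95
print(is_schema_exact(pred, gold))  # False
```

A 95% field-level score looks strong in isolation, but the document still fails the schema-level check, which is the check a downstream system effectively applies.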
Most teams end up choosing between two approaches:
Enterprise teams do not want a generic model endpoint. They want models they can tune, evaluate, and improve against the documents their systems actually run on.
That shift is what Adaptive Engine is built for: models trained on proprietary schemas, deployed in customer environments, and improved as document data evolves.
Multimodal support is now available across Adaptive Engine:
The workflow stays consistent end-to-end. Teams prepare datasets, run fine-tuning, and deploy models directly into production systems.
Adaptive Engine supports image-and-text workloads through the same training and inference system used for text models. The system extends the Harmony codebase rather than introducing a separate multimodal stack. Training, evaluation, and inference run in one place.
Most multimodal systems drift when training, evaluation, and serving are split across different environments. Small differences in preprocessing or token handling accumulate over time and create mismatches between training behavior and production behavior.
Adaptive Engine avoids this by running training, evaluation, and inference on the same codebase. Teams see the same behavior across all three stages without reconciling multiple environments.
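For teams that do run training and serving in separate environments, one common way to catch this kind of drift is to fingerprint the preprocessing configuration and compare it at deploy time. The sketch below shows that general technique only; it is not a description of how Adaptive Engine works internally, and `PreprocessConfig`, its fields, and their default values are all assumptions made up for the example.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PreprocessConfig:
    # Hypothetical settings; real pipelines will have different ones.
    image_size: int = 448
    normalize_mean: tuple = (0.5, 0.5, 0.5)
    max_pages: int = 8


def fingerprint(cfg: PreprocessConfig) -> str:
    """Stable hash of the config, independent of field ordering."""
    payload = json.dumps(asdict(cfg), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def check_parity(train_cfg: PreprocessConfig, serve_cfg: PreprocessConfig) -> bool:
    """True when training and serving would preprocess inputs identically."""
    return fingerprint(train_cfg) == fingerprint(serve_cfg)


train_cfg = PreprocessConfig()
serve_cfg = PreprocessConfig(image_size=384)  # a small, easy-to-miss difference

print(check_parity(train_cfg, train_cfg))  # True
print(check_parity(train_cfg, serve_cfg))  # False
```

When every stage runs on one codebase with one configuration, a check like this becomes unnecessary, which is the point the paragraph above makes.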
A multimodal model in Adaptive Engine consists of two components:
Adaptive Engine can scale image processing separately from text generation. This matters in document systems where scanned forms, multi-page packets, and high-resolution images behave very differently from text requests. Separating these paths lets teams allocate compute based on workload type as volume grows.
Most enterprise document workflows involve multiple images per request. Claims packets, onboarding forms, prescription sets, and invoice bundles often span multiple pages. The multi-image carousel lets users review documents alongside extracted fields in a single view.
A healthcare team is using Adaptive Engine to extract 20+ structured fields from scanned prescriptions, forms, and clinical notes into a fixed JSON schema used by downstream validation systems. They initially evaluated frontier multimodal APIs, but found them too generalist to accurately capture complex business rules (e.g. handling specialized medical abbreviations) while reliably extracting fields in the exact required format. They moved to a fine-tuned model running in their own environment, trained and evaluated on their own schema and specific requirements. The system now adapts and improves as new document formats appear in production, with full control over retraining and evaluation cycles.
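As a rough illustration of what "a fixed JSON schema used by downstream validation systems" can mean in practice, the sketch below checks a model's raw output against a set of required fields and types using only the Python standard library. The four field names are invented for the example and are not the customer's actual schema, and `validate_extraction` is a hypothetical helper, not an Adaptive Engine API.

```python
import json

# Hypothetical schema: field name -> accepted Python type(s).
REQUIRED_FIELDS = {
    "patient_name": str,
    "medication": str,
    "dosage_mg": (int, float),
    "refills": int,
}


def validate_extraction(raw: str) -> list[str]:
    """Return a list of schema violations; an empty list means the record passes."""
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(record, dict):
        return ["top-level value must be an object"]

    errors = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            errors.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(record[name], bool) or not isinstance(record[name], expected):
            errors.append(f"wrong type for {name}")
    for name in sorted(record.keys() - REQUIRED_FIELDS.keys()):
        errors.append(f"unexpected field: {name}")
    return errors


good = '{"patient_name": "A", "medication": "X", "dosage_mg": 5, "refills": 0}'
bad = '{"patient_name": "A", "medication": "X", "dosage_mg": "5 mg", "notes": "n/a"}'

print(validate_extraction(good))
print(validate_extraction(bad))
```

A strict check like this rejects output that is almost right, for example a dosage returned as the string "5 mg" instead of a number, which is the kind of format requirement the team needed the model to meet reliably.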
We are continuing to expand multimodal support for document-heavy workflows.
Multimodal support is available in Adaptive Engine today. Teams can start fine-tuning and deploying multimodal models using the existing SDK, training workflow, and serving system. Documentation and implementation details are available in the product docs, or teams can chat with our engineers to learn how Adaptive Engine can support multimodal post-training for their workloads.