Copyright © 2024
Adaptive ML, Inc.
All rights reserved






On-Brand and On-Policy: Reinforcement Fine-Tuning Reliable AI Agents for Customer Support
Introduction
Nowhere has the promise of the LLM revolution been greater, and the reality of implementation more fraught, than in customer support. Why? Automating quality customer service is astonishingly complex, and the stakes are high for any mistake.
AI agents must consistently follow brand guidelines while adhering to complex support policies across multi-turn conversations—a feat even frontier models struggle with.
Facilitating these complex exchanges through prompt engineering alone is often not reliable enough for production workflows: models struggle to adhere to such multifaceted instructions. Ignored instructions can expose customers to frustrating, cyclical conversations, hallucinated information, or further delays to resolution.
Instead, using Adaptive Engine’s reinforcement fine-tuning capabilities, we trained Mistral Small 24B to identify customer intent and execute on policy guidelines better than GPT-4o, encoding the desired customer service behaviors into the weights of the model itself. For enterprises, this unlocks the ability to use small, specialized models at a fraction of the cost of proprietary alternatives, all while improving customer success by 10% or more.
Here’s how we did it.
Challenge: Better Intent Detection and Policy Following than GPT-4o
To be successful, our customer service AI agent must do two things particularly well: correctly identify the customer’s intent, and follow the support policy associated with that intent.
This intent classification → policy guideline workflow is the backbone of any customer service operation; whether the agents are LLMs or human operators, they need to know (A) what the customer wants, and (B) what actions to take, or data to gather, to remedy the situation. See an example list of intents and policies below:
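For illustration only, here is a hypothetical sketch of such a mapping; the intent identifiers and policy texts are invented for this post (echoing the $4.99 cancellation fee that appears later) and are not a real client configuration.

```python
# Hypothetical intent -> policy mapping, for illustration only.
INTENT_POLICIES = {
    "cancel_order": "Confirm the order ID, then inform the customer that a "
                    "$4.99 cancellation fee applies before processing.",
    "request_refund": "Collect the order number and reason, then open a new "
                      "refund case.",
    "track_refund": "Look up the existing refund case and report its status; "
                    "do not open a new refund.",
    "delivery_issue": "Apologize, verify the shipping address, and offer to "
                      "re-dispatch or refund per the customer's preference.",
}
```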
The correct course of action for a given input depends on the intent expressed by the customer. Therefore, we need our LLM agent to first classify what the intent is, and then act accordingly—preferably in a single pass. The end-user does not see this reflection and classification step, but it is necessary for proper routing and resolution of the support case, as well as for explainability and continuous evaluation of our system.
We model the expected output as follows, where bounding XML tags allow us to conditionally reveal parts of the generation to the customer:
<analysis>One-sentence analysis of the customer’s intent.</analysis><intent>predicted intent identifier</intent> Model response, following the company policy for the predicted intent.
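For instance, a thin post-processing layer can log the hidden reasoning and predicted intent while surfacing only the final reply to the customer. The sketch below is illustrative; the parsing helper and its handling of malformed outputs are our own assumptions, not part of Adaptive Engine.

```python
import re

# Illustrative post-processing: extract the hidden <analysis> and <intent>
# spans for routing/evaluation, and show only the free-form reply to the customer.
TAGGED = re.compile(
    r"<analysis>(?P<analysis>.*?)</analysis>\s*<intent>(?P<intent>.*?)</intent>\s*(?P<reply>.*)",
    re.DOTALL,
)

def split_completion(completion: str) -> dict:
    match = TAGGED.match(completion.strip())
    if match is None:
        # Malformed output: fall back to treating the whole text as the reply.
        return {"analysis": None, "intent": None, "reply": completion.strip()}
    return {
        "analysis": match.group("analysis").strip(),
        "intent": match.group("intent").strip(),
        "reply": match.group("reply").strip(),
    }

completion = (
    "<analysis>The customer wants to cancel an order they placed today.</analysis>"
    "<intent>cancel_order</intent> I can help with that. Please note a $4.99 "
    "cancellation fee applies."
)
parsed = split_completion(completion)
print(parsed["intent"])   # routed internally, never shown to the customer
print(parsed["reply"])    # the only part surfaced in the chat window
```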
To establish a performance baseline for this task, we gather a small dataset of past user questions and their ground-truth intents, and run an initial evaluation of GPT-4o. Scored by an AI judge, GPT-4o achieves 92% accuracy for intent detection and 95% accuracy for policy following.
At first glance, these figures don’t sound particularly poor; both are above 90%. But consider that this is a two-step process: the total system accuracy is actually 87% (92% × 95%), and that applies to each turn within a multi-turn conversation. Statistically, that means GPT-4o could make one or more mistakes in every ten customer interactions—not nearly reliable enough for production deployment.
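The compounding is easy to verify with a quick back-of-the-envelope calculation; the five-turn conversation length and the independence assumption below are illustrative simplifications.

```python
# Per-turn system accuracy compounds intent detection and policy following.
intent_accuracy = 0.92
policy_accuracy = 0.95
per_turn_accuracy = intent_accuracy * policy_accuracy   # ~0.874

# Probability of at least one mistake across an n-turn conversation,
# assuming turns are independent (an illustrative simplification).
n_turns = 5
p_at_least_one_error = 1 - per_turn_accuracy ** n_turns

print(f"Per-turn accuracy: {per_turn_accuracy:.1%}")                      # ~87.4%
print(f"Chance of an error in a {n_turns}-turn chat: {p_at_least_one_error:.1%}")  # ~49%
```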
We’ll use a combination of reinforcement learning from execution feedback (RLEF) and reinforcement learning using AI feedback (RLAIF) to improve these results.
Improving Intent Detection using RLEF
After selecting the model we want to fine-tune, Mistral Small in this case, we are presented with several methods for tuning: RL with production feedback, RL with synthetic feedback, and RL with execution feedback (RLEF). RLEF allows us to write a custom reward function to use during training; since we want to reward our model based on an intent parsed from its completion, it makes the most sense to tune using this technique.
We provide the trainee model with a dataset composed of ~2000 prompts, i.e. sample questions that clients have historically asked to start a conversation with customer support, and their ground-truth intents. During training, we evaluate the model’s completions using our feedback endpoint. We use a regular expression to find the predicted intent bounded by <intent></intent> tags, and compare it against the ground-truth intent. The model is maximally rewarded if the predicted intent is correct, and punished if the wrong intent was classified or if the output formatting does not respect XML tags.
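A reward function in that spirit might look like the sketch below; the function name and signature are our own illustration, not the exact interface Adaptive Engine’s feedback endpoint expects.

```python
import re

INTENT_TAG = re.compile(r"<intent>(.*?)</intent>", re.DOTALL)

def intent_reward(completion: str, ground_truth_intent: str) -> float:
    """Illustrative RLEF reward: maximally reward a correct intent inside
    well-formed <intent></intent> tags, penalize a wrong intent or broken
    formatting."""
    match = INTENT_TAG.search(completion)
    if match is None:
        return -1.0  # formatting violation: tags missing or malformed
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == ground_truth_intent.strip().lower() else -1.0

# Example: the model predicted "track_refund" but the label is "request_refund".
print(intent_reward("<intent>track_refund</intent> Let me check on that.",
                    "request_refund"))  # -1.0
```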
Adaptive Engine allows us to granularly adjust the training parameters for each RL training run. For instance, we can choose to run PPO, DPO, or GRPO, or even adjust individual hyperparameters. For simplicity, Adaptive Engine will recommend default settings based on the task, model, and feedback method.
That’s it. We can initiate a reinforcement learning training run in just 6 clicks. In a few hours, we will have a model trained using RLEF on intent identification. Next, we’ll want to train this same model for policy following as well.
Historically, fine-tuning a model for something like policy following with supervised fine-tuning would require collecting and annotating golden examples of answers that perfectly follow each policy, and guaranteeing that all policies and desired behaviors were represented in our training data.
Instead, using Adaptive Engine, we can use an AI judge and our trainee model to generate synthetic data and feedback, allowing us to train with RL without additional data collection or annotation.
In the same training flow, we select RL with synthetic feedback; this prompts us to enter free-form guidelines regarding how we want our model to behave. Here, we add our list of policies and respective guidelines. For instance, if the customer is looking to cancel their order, we want the model to inform them of a $4.99 fee.
While these might look like prompts to our model, they’re actually guidelines to inform an AI judge on how to evaluate completions from our trainee model and give feedback. Thus, we can upload the same dataset of potential customer service inquiries, generate responses with our trainee model, and evaluate for policy following using this customized AI judge—in this case, a Llama 3.3 70B.
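Conceptually, the judge turns those free-form guidelines into a grading rubric and scores each completion from the trainee model. Below is a minimal sketch of that loop; the judge call is stubbed out because the actual Adaptive Engine API is not shown here, and the guideline text is abbreviated.

```python
# Illustrative RLAIF setup: the policy guidelines become a rubric for an AI
# judge (Llama 3.3 70B in our case), whose score is used as the reward signal.
# `call_judge` is a placeholder stub, not a real Adaptive Engine API.
POLICY_GUIDELINES = """
- cancel_order: inform the customer of the $4.99 cancellation fee.
- track_refund: report the status of the existing refund; do not open a new one.
"""

def build_judge_prompt(customer_message: str, agent_reply: str) -> str:
    return (
        "You are grading a customer support reply against these policies:\n"
        f"{POLICY_GUIDELINES}\n"
        f"Customer message: {customer_message}\n"
        f"Agent reply: {agent_reply}\n"
        "Answer with a single score between 0 (violates policy) and 1 (fully compliant)."
    )

def call_judge(prompt: str) -> float:
    # Placeholder: in practice this is a call to the judge model.
    raise NotImplementedError

def policy_reward(customer_message: str, agent_reply: str) -> float:
    return call_judge(build_judge_prompt(customer_message, agent_reply))
```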
Because of inference-time compute strategies such as self-consistency and best-of-N sampling, we’re able to get results from our candidate model that actually outperform our teacher model. We’re not simply distilling the capabilities of the teacher model; we can reinforcement fine-tune a much smaller model that is actually better at policy following than our 70B teacher model.
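As an illustration of the best-of-N idea, the `generate` and `judge` callables below are hypothetical placeholders for the trainee model and the AI judge.

```python
from typing import Callable

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              judge: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n candidate replies and keep the one the judge scores highest.
    The highest-scoring completions then provide the training signal, which is
    how a small trainee model can end up surpassing its larger judge/teacher."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda reply: judge(prompt, reply))
```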
Now that we have tuned our model for both intent detection using RLEF and policy following using RLAIF, we want to evaluate how it compares to GPT-4o.
For our evaluation, we use the same external feedback endpoint used for training to assess intent classification performance, and the same AI judge to evaluate policy-following, which greatly reduces the burden of building evals that match our training process. And, to save compute, we can run this AI judge against interactions we’ve already generated through training and previous evaluations.
What we find is that our tuned Mistral model achieves a 100% success rate for intent detection—8 percentage points better than GPT-4o, and 13 points better than the base model’s performance. Moreover, our tuned model follows the provided policy guidelines 97% of the time, improving on GPT-4o’s score of 95%.
This brings our total system accuracy to 97% with Mistral 24B versus 87% for GPT-4o, a much safer score to promote to production. And this score is just a starting point for refinement: we can further improve policy following by leveraging a small set of annotated examples from customer service experts, and/or by using production feedback to iron out edge cases and anomalous situations.
These evaluation scores are helpful to understand overall model performance; however, it’s often important for organizations to understand precisely where an agent went wrong, or right.
Adaptive Engine’s Interaction Browser allows us to inspect every trace from our models and pinpoint precise examples where one model followed policy and the other didn’t, or cases where a model misidentified customer intent. For example, see the trace below, where GPT-4o misidentified the intent of a customer question:
While the question is a bit awkwardly phrased, it’s clear that the customer would like to initiate a new refund rather than track an existing one. GPT-4o takes the correct steps to track a refund (it follows the provided policy), but those actions are not needed, because the customer doesn’t have an ongoing refund to track. This would result in wasted time for the customer, reduced CSAT, and escalation to a human agent, costing company resources.
Adaptive Engine’s integrated fine-tuning and evaluation capabilities allowed us to rapidly train a small, cost-efficient model to outperform proprietary APIs for customer service intent detection and policy following, closing the gap between PoC and production. Book a demo today.