Coaching a multi-turn agent in Amazon SageMaker AI to resolve help tickets or reasonable content material means dealing with a sequence of dependent steps, not a single response. These brokers learn directions, make instrument calls, learn the outcomes, resolve the following motion, and get better from a mistake earlier than committing to a solution. That flexibility can also be what makes agentic reinforcement studying (RL) difficult. Extra methods to behave imply extra methods to fulfill the reward with out doing the duty, and the surroundings the agent trains towards can quietly corrupt the coaching sign.
On this submit, we share finest practices for dependable multi-turn RL coaching. We cowl the right way to construct a coaching surroundings you possibly can belief, arrange an exterior analysis, design a reward aligned with the top activity, handle what modifications as soon as the agent runs for a number of turns, and monitor the metrics that let you know when to iterate. We draw our examples from the SOP-Bench dataset, an Amazon Science benchmark that evaluates brokers’ capacity to resolve duties primarily based on complicated Customary Working Procedures (SOP) throughout 12 enterprise domains.
SageMaker AI multi-turn reinforcement studying
Amazon SageMaker AI multi-turn RL (SageMaker AI MTRL) supplies the coaching loop for agentic duties. Your agent can run on Amazon Bedrock AgentCore, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Compute Cloud (Amazon EC2), AWS Fargate, or infrastructure of your alternative. You join it by way of a small adapter that exposes your instrument floor to the rollout server, and SageMaker AI MTRL handles the remaining:
- A modular agent-environment interface that retains integration low-code whereas supplying you with full algorithmic management. Customized rewards, customized instrument loops, and multi-turn dialog shapes are all yours to outline.
- Serverless execution that simplifies infrastructure considerations, so that you get production-scale agentic RL at per-token pricing with out provisioning or managing GPU clusters.
- Asynchronous rollout and trajectory assortment with bounded off-policy staleness. Era and gradient updates run in parallel with out drifting too removed from the present coverage, which hastens coaching.
- A local algorithm library spanning Proximal Coverage Optimization (PPO), Clipped Significance Sampling Coverage Optimization (CISPO), and importance-sampling (IS) losses, paired with a number of group-based benefit estimators (GRPO, GRPO
go@okay, RLOO, and extra). These cowl the alternatives most related to multi-turn agentic RL. - Sequence-extension coaching to maintain wall-clock down on lengthy multi-turn trajectories.
- Trajectory and reward observability in MLflow managed by Amazon SageMaker AI, so you possibly can learn what your agent did flip by flip, and throughout coaching steps.
- Analysis jobs report reward,
go@okay, trajectory metrics, and extra earlier than you deploy to a SageMaker AI endpoint or Amazon Bedrock.
The service supplies the coaching loop, {hardware}, and orchestration. The alternatives that resolve whether or not you get a dependable agent are yours. You construct the surroundings the agent trains towards, measure success outdoors the reward, design the reward itself, and resolve the right way to iterate when the curve stalls.

Determine 1: Overview of the SageMaker AI multi-turn RL service
Construct a coaching surroundings that’s low-cost, reproducible, and consultant
Single-turn RL wants a immediate and a reward operate. Multi-turn RL provides an surroundings for the agent to behave in throughout turns: the instruments it calls and the techniques behind them. That surroundings is a part of your coaching setup, and the way in which you construct it shapes each what the mannequin can study and whether or not you possibly can belief your metrics.
When coaching an agent, construct a sandboxed or simulated surroundings that resembles manufacturing however stays remoted from stay visitors. Instrument calls and responses maintain the identical schemas and enterprise logic. They’re pushed by recorded responses or remoted state as an alternative of stay calls.
Simulated environments are the advisable place to begin as a result of a typical run produces many hundreds of rollouts, every making a number of instrument calls. For instance, a batch measurement of 128 with group measurement 8 is 1,024 rollouts per step. Pointing that visitors at stay techniques can result in buyer impression. And not using a simulated surroundings, exploration can produce actual uncomfortable side effects. For instance, an agent studying by trial and error will problem refunds, delete data, or set off workflows that you simply didn’t intend. Moreover, stay knowledge shifts underneath you, so the identical trajectory scores in another way throughout runs. You should know the proper final result to compute a reward, which implies a set, labeled set of duties (or a reliable choose mannequin) no matter the place the instrument calls go.
The way you construct the simulated surroundings is determined by what your instruments do. Three patterns cowl most use-cases you’ll encounter:
- Learn-only instruments: Replay recorded responses keyed by their inputs. These instruments assist the agent retrieve data related to a activity. For instance, in SOP-Bench the customer support activity supplies ten mocked instruments (
validateAccount,getAuthenticationDetails,createSessionAndOpenTicket, and so forth), every returning a deterministic response from a fixture, reminiscent of a selected row from a CSV file primarily based on the instrument name arguments. - Stateful instruments: Seeded sandboxes that maintain state for the size of an episode. When the agent writes one thing and reads it again, the surroundings wants reminiscence. The sample: allocate per-episode assets at the beginning of the rollout, and register every little thing the agent creates. Tear all of it down in a
strive/lastlyblock when the episode ends, whether or not by reaching a terminal motion, hittingmax_turns, or crashing. No state leaks into the following rollout. - Verifiable outcomes: Real execution in an remoted simulation surroundings. When the agent’s output is code, SQL, or math, you possibly can run it in an remoted surroundings. Use a Docker
execfor code, an in-memory SQLite per rollout for SQL, a pure Python eval for math. Actual execution, deterministic per-instance, similar enter plus similar sandbox state equals similar end result. For instance, AgentCore Code Interpreter supplies managed remoted environments for code execution.
Whichever sample suits, maintain two properties mounted:
- Reproducibility: the identical instrument referred to as with the identical arguments returns the identical end result, so the reward for an an identical trajectory is steady and your analysis is comparable throughout runs.
- Representativeness: construct the surroundings out of your actual schemas and knowledge distributions so the conduct the mannequin learns transfers to manufacturing.
Earlier than you begin coaching, verify your surroundings is configured accurately:
Arrange an exterior analysis earlier than you prepare
After your surroundings is in place and verified, construct a technique to measure success earlier than you write a reward operate. That measure ought to seize your finish objective straight. RL optimizes the reward sign actually, so if the reward is the one quantity you watch, you can’t separate progress on the duty from progress on satisfying the reward standards. You want an exterior analysis you possibly can belief to information your selections whilst you iterate on rewards, surroundings seeding, and hyperparameters.
Sample
Arise a held-out analysis that scores the end result you care about at deployment, computed independently of the reward. In observe this can be a small piece of code that takes a mannequin, runs it by way of the rollout server on a set take a look at break up, and returns a single task-success fee. It may be minimal, so long as it’s sincere.
For SOP-Bench, the analysis is exact-match on the ultimate JSON object inside : each subject within the agent’s output has to match the ground-truth subject, or the rollout scores zero. The reward operate can compute partial credit score and weighted elements. The analysis doesn’t.
Earlier than any coaching, set up a baseline. Run the bottom mannequin and a reference mannequin (a frontier mannequin hosted on Amazon Bedrock is an effective match) by way of the identical analysis. This tells you two issues: how far the bottom mannequin has to go, and what good seems like on this activity.
Anti-pattern
Treating the coaching reward, or a metric derived from it, as your measure of success. This may appear intuitive, however to seize reward hacking, you want exterior analysis. Multi-turn brokers want particular consideration: a reward that pays out for instrument calls teaches the agent to name as many instruments as it will probably. A reward that penalizes flip rely teaches the agent to decide to a solution earlier than it has the knowledge it wants. Both means, the coaching reward rises however the agent’s actual success at its activity falls.
Earlier than you begin coaching, verify your analysis is reliable:
Design an excellent multi-turn RL reward operate
Reward design is among the more difficult open issues in RL. The identical flexibility that lets the agent resolve an actual activity lets it discover methods to fulfill the reward with out doing the duty. Each element you add, each reward weight you tune, each formatting bonus you layer in is one other floor the place the agent can climb with out fixing the duty. The mannequin optimizes what you wrote down, not what you meant. By default use the identical scoring rule for coaching and analysis, and solely deviate when you have got a concrete motive.
Take SOP-Bench. The benchmark expects the reply as a JSON object inside tags:
The benchmark scores 1 if each subject matches and 0 in any other case. Coaching and analysis often share this scoring rule and differ solely in what you observe round it. The coach consumes one reward (scalar or listing of scalars) per rollout. Analysis runs at decrease frequency on a set break up, so you possibly can monitor extra metrics: per-field accuracy, completion fee (did the agent emit in any respect), tool-call distribution, flip funds exhaustion, format compliance.
There are two actual causes to deviate from the default benchmark scoring rule, and each name for a denser reward.
The primary is algorithmic. RL computes the educational sign from variance throughout a gaggle of group_size rollouts per immediate, utilizing a group-based benefit methodology (advantage_method). The service default group_based is GRPO. Many different strategies like rloo and grpo_passk are additionally accessible. See the documentation for a full listing. A binary rating can collapse that variance: when each rollout in a gaggle scores the identical, the relative sign is zero and the group contributes no gradient. When rollout/reward/valid_mean (the imply over non-zero-advantage teams) drifts under rollout/reward/imply and the mannequin stalls, that hole is the symptom.
The second is convergence velocity. Even when group variance is wholesome, a dense reward offers the mannequin gradient towards partial progress on each rollout, not solely those that totally succeed. A rollout that will get 5 of six fields proper teaches the mannequin what nearer seems like. A binary rating teaches it nothing about that.
A dense reward for the SOP-Bench activity scores every subject independently and returns a reward scalar or listing of scalars (per-turn rewards) plus a metrics dictionary.
Your agent stories the reward by way of update_reward, and the metrics dictionary (completion, field_acc) seems in MLflow. To credit score particular person turns as an alternative of the entire trajectory, update_reward additionally accepts a per-turn listing, paired with the group_based_per_turn benefit methodology, so your reward operate may return one reward worth per flip.
- Confirm the reward on actual outputs earlier than you prepare on it. A reward parser extra forgiving than your analysis is its personal type of reward hack. In certainly one of our SOP-Bench runs the reward accepted a looser output format than the benchmark scored: a naked
wrapper earned credit score regardless that the benchmark solely reads. Coaching did precisely what we requested: the mannequin realized to drop the tag the benchmark wanted, the reward climbed, however the exterior analysis fell. - Make sure that the bottom mannequin has a foothold first. RL improves what the bottom mannequin can already do some fraction of the time. It doesn’t invent functionality from nothing. If the bottom mannequin produces zero profitable trajectories in your activity, the reward sign has nothing to amplify and coaching stalls.
SageMaker AI MTRL can run such a baseline as a managed analysis job. MultiTurnRLEvaluator replays your agent over a held-out immediate set and stories eval/reward and go@okay. When you’ve got already skilled a mannequin, a single name with evaluate_base_model=True scores the bottom and fine-tuned mannequin facet by facet. As a result of go@okay thresholds the reward at success_threshold, setting success_threshold=1 offers you a strict success fee: the fraction of rollouts that scored an ideal reward alongside the imply.
Within the specified s3_output_path, you will discover the reported metrics of the analysis which it’s also possible to evaluate in MLflow, together with analysis trajectories. For reward-based analysis of fine-tuned and base fashions, see the documentation on Mannequin analysis.
Maintain one distinction in thoughts: the analysis job scores rollouts together with your agent’s personal reward operate, so it measures held-out generalization, not independence from the reward. A lenient reward parser would look wholesome right here, as a result of the metric is the reward itself. The impartial verify that catches reward-parser bugs stays separate: rating the identical rollouts with a stricter, impartial parser (for SOP-Bench, the benchmark’s exact-match scorer) and evaluate. You possibly can even run that strict scorer as its personal analysis job by pointing MultiTurnRLEvaluator at an agent whose reward is the impartial metric.
For a deeper remedy of reward design, sparse vs. dense rewards, choose fashions, multi-objective shaping, and the trade-offs between them, see the SageMaker AI reward design finest practices.
Earlier than you belief your reward, verify:
Handle what modifications when the agent runs for a number of turns
A multi-turn agent has to handle considerations single-turn doesn’t see. These are value designing for explicitly earlier than you begin coaching.
Context grows each flip, and switch budgets are a part of the reward design. Every instrument name extends the dialog: the decision, its arguments, the end result, and the reasoning the mannequin produces between them. Lengthy trajectories accumulate context quick, and MTRL makes use of sequence-extension coaching to maintain wall-clock manageable as they develop. A activity that wants eight calls in sequence may run out of room earlier than it finishes. Two budgets sure this: max_turns, which your agent loop controls, and the per-turn token funds, which the service units by way of sampling_max_tokens (rollout) and val_sampling_params.sampling_max_tokens (analysis). Decide each to match what your activity wants and what you possibly can afford to serve at deployment.
For SOP-Bench, eight turns and a 2,048-token per-turn funds cowl the canonical process with margin to spare (sampling_max_tokens permits as much as 8,192). A rule of thumb: if a human walkthrough of the duty takes N turns, set max_turns = ceil(N * 1.5) in your agent loop. The appropriate flip funds is the smallest one which lets the agent end with a small security margin. Watch rollout/tokens/response_max for responses clustering on the cap. If greater than 5 % of rollouts hit it, elevate sampling_max_tokens. That sign is silent loss in any other case. The mannequin learns from a truncated trajectory however doesn’t see the reward it will have earned by ending.
Separate completion from correctness
A trajectory that finishes with the mistaken reply and one which by no means finishes are totally different failures, and conflating them hides the place the mannequin is breaking. The rollout and val metric households in MLflow provide you with each alerts individually:
| Metric | What it tells you | |
| 1 | rollout/reward/imply |
Common trajectory reward, your training-side sign |
| 2 | rollout/reward/zero_frac |
Fraction of trajectories that scored precisely 0 |
| 3 | rollout/turns/imply |
Common turns per trajectory |
| 4 | evaluation/zero_adv_groups |
Teams the place each rollout scored the identical, losing rollouts |
| 5 | val/reward/imply |
Imply validation reward your held-out knowledge sign |
| 6 | val/reward/pass_k_1, pass_k_8 |
go@1 and go@okay on the held-out set |
A excessive val/reward/pass_k_1 on a low completion fee (rollouts hitting max_turns earlier than emitting a ) means the mannequin will get the straightforward paths proper and stalls on the arduous ones, suggesting turn-budget tuning. A excessive completion fee on a low val/reward/pass_k_1 means it solutions fluently however mistaken, suggesting reward redesign. The 2 failure modes name for various fixes, so it’s value telling them aside.
Earlier than you commit a flip funds, verify:
Monitor coaching metrics
After you’ve arrange and verified your analysis, surroundings, and reward, it’s time to begin coaching. SageMaker AI MTRL supplies the high-level MultiTurnRLTrainer and MultiTurnRLEvaluator constructs to coach and rating your agent:
Whereas coaching, watch rollout/reward/imply subsequent to the completion fee and open just a few trajectories in MLflow (underneath the Traces tab), so a reward that rises on flat completion doesn’t slip previous. The sign that issues at analysis is disagreement: when rollout/reward/imply climbs however val/reward/imply stays flat, the reward is being hacked. Open these trajectories and evaluate what the reward credited towards what the analysis scored. That comparability drives your reward design iteration: tighten the reward parser, reshape a element, or curate the info, then run once more. Every iteration is quicker than the final as a result of the surroundings and analysis keep mounted. Solely the reward and the info change, and MTRL’s per-model starter recipes provide you with a tuned level to begin from.
For instance, in certainly one of our earliest makes an attempt we had been attempting to coach an agent on all SOP-Bench duties on the similar time, which led to duties competing and reward fluctuating:

Determine 2: Reward fluctuating when attempting to coach all SOP-Bench duties collectively
After limiting our knowledge to give attention to a single activity (aircraft_inspection), we seen validation reward happening whereas rollout reward had saturated. In our reward formulation the max reward was 5.0, however reward had stalled round 3.7:

Determine 3: Reward stalling and validation reward dropping
The mannequin wasn’t incomes full reward on aircraft_inspection, and the Process Success Price on the exterior benchmark went down for the fine-tuned mannequin in comparison with the bottom mannequin. We wanted to evaluate rollout trajectories to seek out out why. The SOP’s one-shot instance didn’t match the duty’s ground-truth knowledge in two methods. It omitted the cross_check_response subject that the info required, so the mannequin couldn’t produce a whole reply, and it wrapped the output in a unique tag than the analysis anticipated. We aligned the instance with the info and dropped the unanswerable subject, which let the reward and the analysis measure the identical factor.

Determine 4: Wholesome reward alerts for the aircraft_inspection activity of SOP-Bench
When measuring the Process Success Price (TSR) of a fine-tuned GPT-OSS 20B mannequin towards the exterior benchmark, we noticed TSR improve by 13 % and per-field accuracy develop by roughly 16 % on the aircraft_inspection activity, confirming that our reward operate aligns with our exterior analysis.
Placing it collectively: An iteration loop
The items described earlier add as much as a single coaching loop, run within the order they had been launched. You construct the surroundings and the analysis first, as a result of they’re the mounted scaffolding each later step is determined by. You then design the reward towards that analysis, and solely after that do you prepare and browse the metrics. Protecting the early items mounted is what makes every go quick, so most of your effort goes into the reward and the info. A model that has labored properly for us:
- Acquire consultant activity knowledge and break up into prepare, validation, and held-out take a look at units.
- Construct the coaching surroundings from manufacturing schemas: airtight, seeded, reproducible.
- Arise the exterior analysis towards the take a look at set, computed independently of the reward.
- Set up a baseline by working the bottom mannequin and a frontier reference mannequin by way of the analysis. If the bottom mannequin scores zero, cease and simplify earlier than persevering with.
- Design the reward, then validate it on actual mannequin outputs from the baseline earlier than any coaching has occurred.
- Prepare, monitoring rollout/reward, completion fee, and a pattern of trajectories to know what your mannequin is producing throughout coaching.
- Consider the skilled mannequin with the exterior analysis. Learn trajectories, particularly those the place the reward and the analysis disagree.
- Regulate the reward, the surroundings, or the info, and run once more.
When the curve stalls or collapses, stroll these so as earlier than tuning anything:
| Symptom | Very first thing to alter | Diagnostic to verify | |
| 1 | Reward flat from step 0 | Confirm mannequin output codecs are aligned with reward | Carry out standalone evaluations on totally different rewards to align format reward with mannequin’s output construction |
| 2 | Prepare reward flat, all teams rating the identical | Drop group_size from 8 to 4 and improve batch_size |
Watch evaluation/zero_adv_groups, ought to drop |
| 3 | Prepare reward rising however val/reward/imply flat |
Reward is being hacked. Re-read trajectories, tighten the reward parser | Re-run the offline reward evaluate towards new baseline rollouts |
| 4 | Reward collapses (drops to ~0.0) after step 40–80 | Set async_config.max_steps_off_policy = 0. If on CISPO, swap to PPO with (0.8, 1.2) |
Reward ought to stabilize, even when decrease |
| 5 | Reward stalls with restricted enchancment, all knobs wholesome | Double LoRA capability (lora_rank=64, lora_alpha=128) |
Greater ceiling inside 50 steps if there’s room to develop |
Make one change at a time, observing metrics for 25–50 coaching steps (gradient updates) per choice. In our runs, most failures grew to become identifiable inside roughly 30 steps when these parameters are adjusted intentionally.
Conclusion
Your reward high quality and your analysis resolve whether or not coaching produces a helpful agent, way more than the algorithm or the hyperparameters do. The reward is the one sign the mannequin optimizes, and an analysis stored separate from it’s what tells you whether or not the agent is studying the duty or studying the reward. A rigorously designed reward and an analysis that matches the top activity can produce a helpful agent; with out them, even a robust algorithm yields a mannequin that appears good in coaching and fails in manufacturing.
SageMaker AI multi-turn RL takes care of many of the operational work and complexity of working a distributed agentic RL coaching, abstracting away the {hardware}, orchestration, and coaching engine. With SageMaker AI multi-turn RL, you give attention to creating an correct surroundings, the place Strands Brokers and AgentCore may also help you transition your manufacturing surroundings to an agentic setup, and give attention to the reward design, analysis, and parameter tuning.
To get began with agentic RL, you possibly can stroll by way of the instance pocket book for MTRL setup. See the SageMaker AI multi-turn RL documentation for service-level steering and the reward design finest practices for a deeper remedy of the reward matter, or this AWS weblog submit on GRPO with verifiable rewards. Lastly, the SOP-Bench paper and dataset are the supply of the working instance used right here.
Concerning the authors

