Agentic tool calling is what makes AI agents useful in production. It’s how they query databases, trigger workflows, retrieve real-time data, and act on a user’s behalf. But base models frequently hallucinate tools, pass bad parameters, and attempt actions when they should ask for clarification. These failures erode trust and block production deployment.
You can use serverless model customization in Amazon SageMaker AI to fix these problems without managing infrastructure. With Reinforcement Learning with Verifiable Rewards (RLVR), the model generates its own candidate responses, receives a reward signal indicating quality, and updates its behavior to favor what works. You select a model, configure a technique, point to your data and reward function, and SageMaker AI handles the rest. In this post, we walk through how we fine-tuned Qwen 2.5 7B Instruct for tool calling using RLVR. We cover dataset preparation across three distinct agent behaviors, reward function design with tiered scoring, training configuration and results interpretation, evaluation on held-out data with unseen tools, and deployment. By the end, our fine-tuned model improved tool call reward by 57% over the base model on scenarios that it didn’t see during training.
Tool calling has a naturally verifiable objective (whether the model called the right function with the right parameters), so it maps well to RLVR. The challenge with self-managed reinforcement learning (RL) is the operational overhead. GPU procurement, memory orchestration between rollout and training phases, reward infrastructure, and checkpointing add up quickly. Hyperparameter sensitivity adds another layer of complexity. SageMaker AI takes on that work so you can focus on your model, your data, and your reward function.
SageMaker AI supports model families including Amazon Nova, GPT-OSS, Llama, Qwen, and DeepSeek, with techniques including Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), RLVR, and Reinforcement Learning from AI Feedback (RLAIF). Training and validation metrics are tracked through integrated MLflow.
Why RLVR for tool calling
SFT requires labeled examples of each behavior that you want the model to learn. For tool calling, that means examples of calling a tool, asking for clarification, and refusing. But tool calling also requires the model to decide between these behaviors, and SFT can struggle to generalize that decision-making beyond the specific patterns in its training data.
RLVR works differently. For each prompt, the model generates multiple candidate responses (we use eight). A reward function verifies which ones are correct. The model then updates its policy to favor what worked, using Group Relative Policy Optimization (GRPO). GRPO compares each candidate’s reward score against the mean score of the group and reinforces responses that score above average. Over time, the model learns both the format of a tool call and when to call a tool versus when to ask.
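To make the group-relative comparison concrete, here is a minimal sketch of how per-candidate advantages could be computed from one group's rewards. It is an illustration only: the helper name `group_relative_advantages` and the sample rewards are hypothetical, and dividing by the group standard deviation follows the commonly described GRPO formulation rather than anything confirmed about the SageMaker AI implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Return one advantage per candidate: (reward - group mean) / (group std + eps).

    Hypothetical helper for illustration; not the SageMaker AI implementation.
    """
    group_mean = mean(rewards)
    group_std = pstdev(rewards)  # population standard deviation over the group
    return [(r - group_mean) / (group_std + eps) for r in rewards]


# Eight candidates for one prompt, scored by a tiered reward function (made-up values).
rewards = [1.0, 0.5, 0.0, 1.0, 0.0, 0.5, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    if advantage > 0:
        direction = "reinforced"
    elif advantage < 0:
        direction = "discouraged"
    else:
        direction = "neutral"
    print(f"reward={reward:.1f} advantage={advantage:+.2f} ({direction})")
```

Candidates that score above the group mean get a positive advantage, those below get a negative one, and a group in which every candidate earns the same reward produces no learning signal at all.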
Prerequisites
To use serverless model customization in SageMaker AI, you must have the following prerequisites:
Fine-tune Qwen 2.5 7B Instruct in SageMaker AI
To get started, we open Amazon SageMaker AI Studio and choose Models in the left navigation pane to browse the foundation models (FMs) that are available for customization.

In the Customize model menu, select Qwen 2.5 7B Instruct, and choose Customize with UI. This opens the customization configuration page where you select your technique, point to your training data and reward function, and configure hyperparameters. We selected Reinforcement Learning from Verifiable Rewards (RLVR) as our customization technique.

Prepare your training data
A tool calling dataset needs to teach more than correct API invocations. Production agents face three distinct situations:
- The user provides enough information, and the model should call a tool.
- The user’s request is missing required parameters, and the model should ask for clarification.
- The request is harmful or out of scope, and the model should refuse.
We generated 1,500 synthetic training examples from our tool schemas (weather, flights, translation, currency conversion, statistics) using Kiro, the Amazon AI-powered IDE, to produce prompts with realistic variation in phrasing and specificity across the three behaviors. Here’s an example of the prompt we used:
Generate 1,500 JSONL training examples for RLVR tool-calling
fine-tuning across 5 tool schemas: get_weather_forecast,
search_flights, translate_text, currency_convert, and
get_statistics.
Each line must follow this format:
{"prompt": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}], "reward_model": {"ground_truth": "..."}}
Distribute examples across three behaviors:
1. Execute (60%): User provides all required params → ground_truth is the tool call JSON
2. Clarify (25%): User is missing required params → ground_truth is a clarifying question
3. Refuse (15%): Request is harmful or out of scope → ground_truth is a polite refusal
Vary phrasing between formal, casual, and terse.
Output valid JSONL only, no commentary.
This is a practical path for teams that don’t yet have production logs to draw from. For organizations already running agentic workflows, real user prompts and tool calls from production will yield even higher-quality training data.
Each training example contains a prompt (a system instruction and user request) and a ground truth in the reward_model field that the reward function scores against. Here are examples of each behavior.
Execute when the user provides everything the tool needs:
Clarify when a required parameter is missing:
Execute with multiple parameters:
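As an illustration of these three cases, here is a minimal sketch of what such records could look like in the prompt / reward_model layout. Only the tool names and the overall record shape come from this post. The system prompt text, the argument names (`city`, `origin`, `destination`, `date`), the clarifying question, and the helper names `make_record` and `tool_call` are all hypothetical.

```python
import json

# Hypothetical system prompt; a real one would describe the available tools.
SYSTEM_PROMPT = "You are an assistant that can call tools. Ask for missing details."


def make_record(user_message, ground_truth):
    """Build one training record in the prompt / reward_model layout."""
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        "reward_model": {"ground_truth": ground_truth},
    }


def tool_call(name, **arguments):
    """Serialize a single tool call as a JSON string (argument names are assumed)."""
    return json.dumps([{"name": name, "arguments": arguments}])


records = [
    # Execute: the user supplies everything the tool needs.
    make_record(
        "Get weather for San Francisco",
        tool_call("get_weather_forecast", city="San Francisco"),
    ),
    # Clarify: the location is missing, so the ground truth is a question.
    make_record("Get the weather", "Which city would you like the weather for?"),
    # Execute with multiple parameters.
    make_record(
        "Find flights from Seattle to Boston on 2025-03-14",
        tool_call("search_flights", origin="Seattle", destination="Boston", date="2025-03-14"),
    ),
]

# One JSON object per line, as JSONL requires.
jsonl = "\n".join(json.dumps(record) for record in records)
print(jsonl)
```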
Notice the difference between “Get weather for San Francisco” (tool call) and “Get the weather” (clarification). This is the kind of distinction GRPO learns well. For each prompt, the model generates eight candidates, the reward function scores them, and the scores are averaged across the group. Candidates above the mean get reinforced, and over time the model picks up when to call and when to ask.
Define your reward function
The reward function defines what correct means for our use case. We write it as a Python function that receives the model’s response and the ground truth from the training data and returns a numerical score. Ours extracts tool calls from the model’s response, parses them as JSON, and compares against the ground truth.
The full function handles response extraction, flexible parsing for alternative formats during early training, and edge cases around JSON type mismatches. Here is the core scoring logic:
The three tiers (1.0, 0.5, and 0.0) give GRPO a richer learning signal. If several of the eight candidates get the function right but miss a parameter, the 0.5 score distinguishes them from completely wrong answers. This helps the model recognize that it’s on the right track.
For clarification and refusal cases where the ground truth is natural language (no TOOLCALL tags), the reward function checks whether the model also avoided calling a tool. An unnecessary API call when the model should have asked a question earns 0.0.
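The tiered scoring described above could be sketched roughly as follows. This is not the authors' function. It assumes tool calls are wrapped in `<TOOLCALL>...</TOOLCALL>` tags around a JSON list, it assumes a response that correctly avoids a tool call earns 1.0, and the names `extract_tool_calls`, `call_names`, and `score` are hypothetical. The flexible parsing for alternative formats mentioned earlier is left out.

```python
import json
import re

# Assumed wire format: tool calls wrapped as <TOOLCALL>[...]</TOOLCALL>.
TOOLCALL_PATTERN = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)


def extract_tool_calls(text):
    """Return the parsed list of tool calls in text, or None if no tags are present."""
    match = TOOLCALL_PATTERN.search(text)
    if match is None:
        return None
    try:
        calls = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []  # tags present, but the payload is not valid JSON
    return calls if isinstance(calls, list) else [calls]


def call_names(calls):
    """Function name of each call, in order (None for malformed entries)."""
    return [c.get("name") if isinstance(c, dict) else None for c in calls]


def score(response, ground_truth):
    """Tiered reward: 1.0 exact match, 0.5 right function names only, 0.0 otherwise."""
    expected = extract_tool_calls(ground_truth)
    predicted = extract_tool_calls(response)

    if expected is None:
        # Clarification or refusal: reward only the absence of a tool call.
        return 1.0 if predicted is None else 0.0
    if not predicted:
        return 0.0  # a tool call was required, but none (or an unparseable one) was made
    if predicted == expected:
        return 1.0  # same functions, same arguments, same JSON types
    if call_names(predicted) == call_names(expected):
        return 0.5  # right functions, wrong or missing arguments
    return 0.0
```

Because the comparison runs on parsed JSON rather than raw strings, key order and whitespace do not affect the score, while a string passed where an array is expected still falls into the 0.5 tier.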
Configure and launch training
On the customization configuration page, we point to our training dataset and reward function, then set our hyperparameters. We use a batch size of 128, learning rate of 5e-6, 3 epochs, and eight rollouts per prompt.
The rollouts setting is the core GRPO mechanism. For each training prompt, the model generates eight different responses, the reward function scores each one, and responses that score above the group average get reinforced. Training and validation metrics are logged to MLflow. In this example, training takes approximately 40 minutes.
Training results

Train Reward Statistics (top left) is the chart to focus on. The mean reward across the rollouts started around 0.28 and climbed to 0.65–0.68 over 30 steps, more than doubling. The steepest gains happen in the first 10 steps as the model learns the basic tool calling format and decision structure. The curve then flattens after step 20 as training converges.
The other charts confirm healthy training:
- Policy Entropy decreases, meaning the model is getting more confident rather than guessing.
- Gradient Norm stabilizes, meaning updates are getting smaller and more refined.
- Mean Advantage Estimate converges toward zero, indicating that the model’s policy is stabilizing and the average response quality is aligning with the reward baseline.
Evaluate the fine-tuned model
After the training job is complete, you can see the models that you created in the My Models tab. To expand the details, choose View details on one of your models.

You can choose Continue customization to iterate further by adjusting hyperparameters or training with a different technique. Choose Evaluate to compare your customized model against the base model.
We evaluate on a separate test set of 300 examples that were excluded from training. The evaluation dataset covers the same three behaviors but includes tools, phrasings, and scenarios that the model hasn’t seen. It tests search_restaurants, get_stock_price, and calculate_standard_deviation, none of which appeared during training. It also includes refusal cases for harmful requests like generating violent content or creating malware, testing whether the model generalizes safe behavior to new threats.
The evaluation runs standard NLP metrics alongside our custom reward function against the held-out set.

Tool Call Reward is our custom metric and the most direct measure of what we trained for. It jumped from 0.35 to 0.55, a 57% relative improvement. In practical terms, this means that the fine-tuned model makes the right tool calling decision significantly more often. It calls the right function with the right parameters when it should, asks for clarification when information is missing, and refuses when appropriate.
F1 Score Quasi, Rouge1, and RougeL all improved by 14–19 percentage points, reflecting better generation of correct function names, parameter keys, and values across the board.
Exact Match nearly doubled, from 11% to 21%. This metric requires character-for-character output matching, so even small formatting differences count as a miss. The 21% exact match alongside 64.6% F1 suggests that the model often gets the semantics right even when surface formatting differs slightly.
These results are all on held-out data with tools and scenarios that the model didn’t see during training. The improvements carry over to new tools, which suggests the model picked up general tool calling patterns rather than memorizing the training set.
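The figures in this section mix relative improvements (the 57%) with absolute changes in percentage points (the 14–19 points). As a small worked check, the following sketch recomputes both kinds of change from the before and after values reported in this post. The helper name `relative_improvement` is ours, not part of any evaluation tooling.

```python
def relative_improvement(before, after):
    """Relative change from before to after, as a fraction of before."""
    return (after - before) / before


# Values reported in this post.
tool_call_reward = relative_improvement(0.35, 0.55)
exact_match_relative = relative_improvement(0.11, 0.21)
exact_match_points = (0.21 - 0.11) * 100

print(f"Tool Call Reward: {tool_call_reward:.0%} relative improvement")
print(f"Exact Match: {exact_match_points:.0f} points absolute, {exact_match_relative:.0%} relative")
```

The Tool Call Reward gain works out to about 57% relative, while Exact Match gains 10 percentage points, which is roughly a 91% relative increase.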
Here are responses from the fine-tuned model on prompts that it didn’t see during training:
Tool execution:
> Prompt: “What’s the stock price of AMZN?”
>
> Model response: [{"name": "get_stock_price", "arguments": {"ticker": "AMZN"}}]
Clarification:
> Prompt: “Find me a restaurant nearby”
>
> Model response: “I’d be happy to help you find a restaurant! Could you please provide the cuisine type or location you’re interested in?”
Near-miss:
> Prompt: “What’s the standard deviation of 10, 20, 30, 40, 50?”
>
> Model response: [{"name": "calculate_standard_deviation", "arguments": {"numbers": "10, 20, 30, 40, 50"}}]
In the near-miss case, the model selected the right tool but passed the numbers as a string instead of an array. This earns a 0.5 reward score (right function, wrong parameter format) and represents the kind of error that you’d target in the next iteration through additional training data or reward function refinement.
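One way to surface this class of error during analysis is to compare the JSON types of predicted and expected arguments. The sketch below is a hypothetical diagnostic, not part of the reward function described in this post. The helper name `argument_type_mismatches` is ours, and the expected array is an assumption, since the post says only that an array was expected.

```python
import json


def argument_type_mismatches(predicted_args, expected_args):
    """List (name, predicted type, expected type) for arguments whose types differ."""
    mismatches = []
    for name, expected_value in expected_args.items():
        if name in predicted_args and type(predicted_args[name]) is not type(expected_value):
            mismatches.append(
                (name, type(predicted_args[name]).__name__, type(expected_value).__name__)
            )
    return mismatches


# Model output from the near-miss example.
predicted = json.loads(
    '[{"name": "calculate_standard_deviation", "arguments": {"numbers": "10, 20, 30, 40, 50"}}]'
)
# Assumed ground truth for illustration.
expected = json.loads(
    '[{"name": "calculate_standard_deviation", "arguments": {"numbers": [10, 20, 30, 40, 50]}}]'
)

mismatches = argument_type_mismatches(predicted[0]["arguments"], expected[0]["arguments"])
print(mismatches)
```

This check is strict: it would also flag an integer passed where the ground truth holds a float, so a real diagnostic might treat numeric types as interchangeable.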
Deploy the fine-tuned model
With evaluation confirming improvement, deploy the fine-tuned model directly from the model details page. Choose Deploy, and select your deployment target: either a SageMaker AI endpoint or Amazon Bedrock. You can also download the model weights from Amazon S3 for self-managed deployment.

Conclusion
In this post, we fine-tuned Qwen 2.5 7B Instruct for agentic tool calling using RLVR and GRPO through serverless model customization in Amazon SageMaker AI. We prepared a dataset spanning three tool-calling behaviors (execute, clarify, refuse), defined a tiered reward function, trained the model in about 40 minutes, evaluated on held-out data with unseen tools and scenarios, and deployed. The fine-tuned model improved tool call reward by 57% over the base model.
To push accuracy further, you can expand your training data with more tools, edge cases, and multi-turn conversations to cover more of the scenarios that your agents encounter in production. You can also refine your reward function to penalize specific failure modes, like the string-vs-array parameter issue shown in the evaluation section, or add partial credit for other near-miss patterns. If you’re running agentic workflows, your production logs are a high-quality source of training data that can make the model even more effective for your specific use case. Beyond tool calling, RLVR applies to other reasoning tasks where correctness is verifiable, such as multi-step planning, structured data extraction, or code generation.
While this post walks through the UI workflow, an SDK for programmatic access is also available. To learn more, see the SageMaker AI model customization documentation.
To get started, try serverless AI model customization in Amazon SageMaker AI with your own use cases.