The artwork and science of hyperparameter optimization on Amazon Nova Forge

Giant language fashions (LLMs) ship sturdy outcomes on normal duties, however they usually battle with specialised work that requires understanding proprietary knowledge, inner processes, or domain-specific terminology. Amazon Nova Forge addresses this by enabling you to construct your personal frontier fashions utilizing Amazon Nova. You can begin growth from early mannequin checkpoints, mix proprietary knowledge with Amazon Nova-curated coaching knowledge, and host customized fashions securely on AWS. A key functionality is knowledge mixing, which blends your coaching knowledge with curated datasets. This helps the mannequin take up your area whereas retaining broad reasoning, instruction-following, and language capabilities. This prevents catastrophic forgetting that usually undermines area customization.

Profitable customization requires cautious hyperparameter tuning. Studying charge, knowledge mixing ratio, checkpoint choice, and coaching strategies all work together in methods that may silently undermine a coaching run. If any of them are mistaken, you commerce one downside for an additional. This publish covers the artwork (strategic trade-offs) and science (metric-driven choices) of hyperparameter tuning on Amazon Nova Forge that will help you keep away from costly failed coaching runs.

Tremendous-tuning for domain-specific duties means bettering efficiency in a single space with out degrading the mannequin’s normal capabilities, and getting that steadiness proper is tougher than it appears. This publish walks by find out how to navigate that steadiness, from choosing the appropriate customization technique in your knowledge and process, to configuring the coaching parameters that almost all affect outcomes, like studying charge, batch dimension, and checkpointing. We additionally cowl the widespread errors that result in wasted coaching runs and find out how to catch them early, so you may enhance area efficiency with out degrading normal capabilities or burning by compute on avoidable failures.

By the tip, you’ll know find out how to enhance area efficiency with out degrading normal capabilities and find out how to keep away from the costly failures that come from getting the steadiness mistaken.

The hyperparameter tuning problem

Reaching this steadiness is tougher than it seems. Three elementary challenges make hyperparameter tuning notably troublesome on domain-specialized fashions.

Problem 1: Catastrophic forgetting

Once you prepare a mannequin on slim area knowledge, the mannequin can overwrite normal capabilities it discovered throughout pre-training. This phenomenon, referred to as catastrophic forgetting, reveals up as degraded efficiency on duties outdoors your coaching area. The mannequin turns into extremely specialised however loses instruction-following capability, reasoning functionality, and broad information. In manufacturing, this implies a customer support mannequin fine-tuned in your help tickets could now not purpose about ambiguous requests or keep coherent multi-turn conversations.

This creates a stability-flexibility tradeoff. Ideally, the mannequin is versatile sufficient to study a corporation’s area however steady sufficient to retain normal capabilities. Nova Forge addresses this by knowledge mixing, which blends your coaching knowledge with curated datasets throughout coaching, and checkpoint choice, which helps you to select how a lot present alignment to protect.

Problem 2: Discovering the appropriate studying charge

The educational charge controls how a lot the mannequin’s weights change in response to every batch of coaching examples. It’s essentially the most delicate hyperparameter throughout all customization strategies. A studying charge that’s too excessive causes the mannequin to overshoot the optimum state, destabilize throughout coaching, or neglect base capabilities quickly. A studying charge that’s too low wastes compute on very gradual convergence. The proper worth is determined by your knowledge distribution, mixing ratio, and coaching method.

Nova Forge gives calibrated service defaults for every coaching method that account for these interactions. Once you use knowledge mixing, the sensitivity will increase additional. Deviating from the default studying charge when mixing Nova knowledge with your personal knowledge is the commonest supply of coaching instability, so these service defaults are the really useful start line.

Problem 3: Baseline efficiency constraints

Reinforcement fine-tuning (RFT) is a method that improves mannequin conduct by producing a number of candidate responses and scoring them in opposition to high quality standards. The mannequin learns by evaluating its personal outputs and reinforcing the higher ones. RFT works at its full capability inside a selected vary of baseline process accuracy, measured by how usually the mannequin produces right or high-quality responses earlier than fine-tuning. If baseline accuracy is simply too low (the mannequin hardly ever produces right responses), there aren’t sufficient good examples for reward-guided exploration to be taught from. If baseline accuracy is already very excessive, further coaching yields diminishing returns and dangers degrading present efficiency. This implies RFT can’t shut massive competence gaps the place the mannequin essentially lacks the information or reasoning capability to try a process. It refines and strengthens behaviors the mannequin can already partially exhibit, somewhat than instructing completely new capabilities from scratch.

The Nova Forge pipeline addresses each bounds. For low-baseline eventualities, run supervised fine-tuning (SFT) first to determine the foundational capabilities wanted for efficient reward-based studying. For top-baseline duties, guarantee that your reward operate has discriminative energy throughout the mannequin’s high quality vary. If most responses already rating extremely, RFT has no significant sign to optimize in opposition to.

The Nova Forge customization pipeline

Understanding these challenges frames how the Amazon Nova Forge customization pipeline is designed to handle them. Nova Forge gives three complementary customization strategies, every serving a definite function within the mannequin growth lifecycle.

Method	What it does	When to make use of	Enter knowledge
Continued pre-training (CPT)	Expands foundational mannequin (FM) information by self-supervised studying on massive portions of unlabeled, domain-specific proprietary knowledge. CPT teaches the mannequin area terminology and patterns out of your textual content corpus.	You want the mannequin to know specialised vocabulary, business ideas, or organizational information that doesn’t exist within the base mannequin.	Giant volumes of unlabeled area textual content. Nova Forge helps CPT with knowledge mixing and three checkpoint choices (pre-trained, mid-trained, and post-trained), every suited to totally different knowledge scales and downstream necessities.
Supervised fine-tuning (SFT)	Customizes mannequin conduct utilizing a coaching dataset of input-output pairs particular to your goal duties. SFT teaches the mannequin “given X, output Y” conduct by demonstrations.	You want the mannequin to comply with particular response codecs, undertake explicit tones, or carry out structured duties like classification or extraction.	1,000–10,000 high-quality demonstrations per process. High quality, consistency, and variety matter greater than quantity. Nova Forge helps SFT with knowledge mixing utilizing Amazon Nova-curated datasets, together with reasoning-instruction-following classes that protect normal capabilities.
Reinforcement fine-tuning (RFT)	Steers mannequin output towards most popular outcomes utilizing reward indicators. RFT optimizes the mannequin inside a behavioral neighborhood established by prior coaching for single-turn or multi-turn conversational duties.	You have got a transparent reward operate that may consider response high quality and need to push efficiency past what SFT alone achieves.	Prompts and a reward operate. Nova Forge helps bringing your personal exterior reward setting by AWS Lambda, enabling customized verification logic for domain-specific high quality evaluation.

When all three phases are used collectively (CPT, then SFT, then RFT), they produce the strongest outcomes. Nonetheless, with the appropriate pipeline, every stage could be non-obligatory. It is determined by your knowledge availability, process kind, and start line. CPT is simply wanted when the bottom mannequin lacks area vocabulary or information your process requires. SFT and RFT can be utilized independently or mixed relying on what your process calls for.

Determine 1: The Amazon Nova Forge customization pipeline. CPT teaches area information from unlabeled textual content, SFT teaches task-specific conduct from demonstrations, and RFT optimizes efficiency utilizing reward indicators. Every stage is non-obligatory, and the complete pipeline (CPT, then SFT, then RFT) produces the strongest outcomes when all three are relevant to your use case.

Amazon SageMaker AI affords totally different environments for personalisation: SageMaker Serverless gives a UI-driven expertise with computerized compute provisioning, SageMaker AI coaching jobs (SMTJ) present a totally managed expertise with out cluster administration, whereas Amazon SageMaker HyperPod affords specialised environments for superior distributed coaching eventualities.

Strategic choices

With the customization pipeline in view, the following step is knowing the qualitative trade-offs that form your configuration. These strategic choices matter as a lot as any particular person hyperparameter worth: checkpoint choice, knowledge mixing, and coaching mode.

Checkpoint choice (most impactful resolution)

For CPT, checkpoint choice is extra impactful than any hyperparameter. Amazon Nova Forge gives three checkpoint choices, every suited to totally different knowledge scales and downstream necessities.

Pre-trained checkpoints are essentially the most versatile and provide the quickest convergence. These checkpoints settle for new patterns readily and work greatest for large-scale CPT with substantial token budgets exceeding 100 billion tokens. When utilizing pre-trained checkpoints with massive datasets, you should use a better studying charge (akin to 1e-4) to speed up information absorption. You then must steadily cut back the training charge again to roughly 1e-6 for mannequin stability earlier than operating SFT to let the mannequin “settle” into what it discovered with out overshooting. Bear in mind that pre-trained checkpoints don’t have any directions for tuning. After CPT, you need to run SFT to make the mannequin helpful for downstream duties.
Mid-trained checkpoints steadiness flexibility and alignment. They settle for area information whereas retaining some instruction-following conduct. Use mid-trained checkpoints for medium-sized datasets the place you need quicker area adaptation than post-trained however extra stability than pre-trained. Mid-trained checkpoints work nicely for full rank coaching, which updates each parameter within the mannequin throughout fine-tuning, with massive, structured datasets.
Submit-trained checkpoints are essentially the most proof against new patterns however protect instruction-following and normal capabilities. Use post-trained for smaller-scale CPT the place preserving alignment issues greater than maximizing area information absorption. Submit-trained checkpoints are the really useful start line for LoRA (Low-Rank Adaptation), which freezes the unique mannequin weights and trains small adapter matrices on high, and different parameter-efficient fine-tuning strategies, as they keep the mannequin’s present capabilities whereas permitting focused adaptation. For small datasets or later-stage checkpoints, use conservative studying charge values from the service defaults.

Determine 2: Checkpoint choice for continued pre-training. Pre-trained checkpoints provide most flexibility for big datasets however require SFT afterward to revive instruction-following. Submit-trained checkpoints protect alignment and swimsuit smaller datasets or parameter-efficient strategies like LoRA.

Information mixing technique

With out knowledge mixing, coaching on slim area knowledge could cause the mannequin to develop into unstable, leading to erratic coaching conduct (gradient instability or loss spikes) or a sudden degradation in efficiency.

When configuring knowledge mixing, steadiness your buyer knowledge round 50 % of the entire combine for many use circumstances. For SFT, at all times embody the “reasoning-instruction-following” class in your Nova knowledge combine. This single class considerably improves generic benchmark efficiency after fine-tuning. Skipping this class is a standard reason behind degraded reasoning efficiency in fine-tuned fashions.

Information mixing could be very delicate to studying charge. Deviating from the default studying charge when utilizing knowledge mixing causes instability. That is the commonest mistake practitioners make. If you happen to observe coaching instability with knowledge mixing, the training charge is the primary suspect.

Discovering the optimum mixing ratio requires experimentation. Maintain your area knowledge fixed and differ the Nova knowledge proportion throughout a number of runs. Area efficiency usually stays fixed whereas normal capabilities preserve bettering the extra Nova knowledge is combined in. Place your highest-quality knowledge towards the tip of coaching for higher convergence.

Coaching mode: Low-Rank Adaptation (LoRA) vs Full Rank

Amazon Nova Forge helps two coaching modes that decide how mannequin parameters are up to date throughout coaching:

LoRA updates solely adapter layers, providing decrease compute prices, quicker iteration, and compatibility with on-demand inference. LoRA achieves close to Full Rank efficiency for many duties whereas being extra forgiving of suboptimal hyperparameters. The default alpha scaling issue of 64 works for many duties. Improve alpha if LoRA is under-adapting to your knowledge or lower it if LoRA is over-adapting and dropping normal capabilities. Use post-trained checkpoints as your start line for LoRA coaching.
Full Rank updates all mannequin parameters, offering most adaptation capability. Full Rank requires Amazon Bedrock Provisioned Throughput for deployment (On-Demand is simply obtainable for LoRA-based customization) and better compute throughout coaching. Use Full Rank when you could have validated your pipeline and your deployment structure justifies the extra price. Mid-trained checkpoints work nicely for Full Rank coaching with massive, structured datasets.

Begin with LoRA to validate your pipeline, knowledge high quality, and reward operate (for RFT). Graduate to Full Rank when you could have confirmed the method works, and your manufacturing necessities justify it (for instance, mannequin efficiency or price constraints).

Really helpful workflow

Making use of these strategic choices to your particular state of affairs is determined by what knowledge and goals you could have. The next paths map your beginning circumstances to the appropriate sequence of strategies.

When you’ve got labeled demonstrations and a verifiable reward operate (SFT then RFT):

Begin with SFT utilizing LoRA to show the goal conduct and set up baseline competency.
Allow knowledge mixing with “reasoning-instruction-following” included to protect the mannequin’s capability to comply with structured prompts and produce well-formatted outputs throughout area adaptation.
Use default studying charges with out modification.
Monitor validation loss to pick the very best SFT checkpoint.
Graduate to RFT on the SFT checkpoint to optimize additional by reward indicators.
Think about Full Rank coaching solely after validating the method with LoRA.
Take a look at completely on each your area process and normal benchmarks earlier than manufacturing deployment (see the Experiments and insights part for an instance).

If you happen to can outline verifiable outcomes however can not simply label responses at scale (RFT solely):

Consider base mannequin efficiency on a consultant pattern of your process first.
Proceed with RFT immediately if the bottom mannequin achieves greater than roughly 5 % constructive reward.
Fall again to SFT if reward scores are constantly close to zero. The mannequin wants baseline competency earlier than reward-guided studying can take impact.

If the bottom mannequin lacks area vocabulary or information your process requires, begin with CPT:

Run CPT to soak up area information from unlabeled textual content.
Comply with with SFT. Pre-trained checkpoints used for CPT don’t have any instruction tuning, so SFT is required after CPT to make the mannequin helpful.
Optionally comply with with RFT to additional optimize efficiency.

Parameter configuration

With strategic choices made, now you can optimize particular hyperparameters that govern how every method executes. This part gives steerage for every method.

Studying charge configuration

Studying charge controls how rapidly the mannequin updates based mostly on coaching indicators. Service defaults signify examined configurations that work throughout various use circumstances.

For CPT: Begin at service defaults. For giant datasets exceeding one trillion tokens, you should use a better studying charge (akin to 1e-4) to speed up information absorption, however you want a ramp-down stage to cut back the training charge again to roughly 1e-6 for mannequin stability earlier than SFT. The constant_steps parameter controls what number of steps the mannequin trains on the peak studying charge earlier than this ramp-down stage begins. Improve constant_steps for very massive token runs the place extra steps at full studying charge assist area absorption. For smaller datasets or later-stage checkpoints, use the default (decrease) studying charge from the beginning.
For SFT: Stick with service defaults, particularly with knowledge mixing. The really useful studying charge is 1e-5 for LoRA and 5e-6 for full-rank SFT. Deviating from the default studying charge when mixing Nova knowledge causes instability. If you happen to observe coaching instability with knowledge mixing, the training charge is the primary suspect.
For RFT: Begin at service defaults. Regulate in small multiplier increments provided that wanted. If reward drops out of the blue and doesn’t get better, the training charge is probably going too excessive. Even a small multiplier enhance can drop efficiency under baseline.

Configure warmup steps to roughly 15 % of your complete coaching steps. Warmup stabilizes preliminary coaching by steadily rising the training charge somewhat than beginning on the full worth.

Batch dimension and coaching length

Batch dimension (managed by global_batch_size) is the batch parameter throughout all coaching strategies (CPT, SFT, RFT) and all environments (SageMaker Serverless, SMTJ, HyperPod). It defines the variety of coaching samples processed per optimizer step. For CPT and SFT, that is easy with one pattern equal to 1 input-output pair (SFT) or one token sequence (CPT). RFT introduces an extra parameter, number_generation, that controls what number of candidate responses are generated per immediate for reward scoring. This parameter doesn’t exist in CPT or SFT recipes, as a result of these strategies prepare immediately on offered input-output pairs somewhat than producing candidates. When the variety of generations parameter is current, batch dimension semantics differ between environments. Getting this mistaken results in surprising conduct.

On SMTJ (RFT solely): Batch dimension means prompts per step. Every immediate generates N candidate responses (managed by number_generation). Complete samples per step equals batch dimension multiplied by variety of generations.
On SageMaker HyperPod (RFT solely): Batch dimension means complete samples per step (prompts multiplied by generations). Translate rigorously when shifting configurations between environments.

For CPT, goal 2-20 million tokens per step. Use 20 million for big token budgets and a pair of million for smaller budgets. Calculate world batch dimension as the closest energy of two of tokens per step divided by max sequence size. For instance, 4 million tokens per step with a 4096-sequence size yields a batch dimension of roughly 1024. Smaller batch sizes produce noisier gradients, which may also help generalization and allow quicker iteration. Bigger batch sizes produce smoother gradients however could over-smooth domain-specific indicators. Begin with average batch sizes for stability.

Match your max sequence size to your knowledge distribution. Don’t exceed what your knowledge wants. Smaller context lengths enhance token throughput and cut back coaching prices. For CPT, course of at most one epoch of your dataset. Keep away from repeating knowledge, as a number of epochs on restricted CPT knowledge results in overfitting and lack of normal capabilities. Monitor validation loss to trace progress. For SFT, Full Rank coaching usually wants fewer epochs than LoRA. LoRA coaching can tolerate barely extra epochs. Monitor validation loss to detect overfitting and choose the very best checkpoint.

RFT-specific parameters

RFT introduces further parameters not current in CPT or SFT.

Variety of generations controls what number of candidate responses the mannequin generates per immediate for the reward operate to check. Fewer candidates imply quicker coaching however much less sign variety. Too many candidates add noise with out bettering sign and almost double coaching time. Average values hit the very best accuracy-to-time ratio. Improve in case your process has excessive variance in response high quality. Lower for fast reward operate iteration throughout growth.
KL-Divergence Loss Coefficient constrains how far the mannequin’s coverage can drift from its unique conduct. This parameter is obtainable on SMTJ solely. A low coefficient lets the mannequin discover freely however dangers discovering shortcuts that recreation the reward operate. A excessive coefficient prevents significant studying by pulling the mannequin again to its start line. Improve if KL divergence spikes throughout coaching to steadiness real studying in opposition to behavioral drift.
Reasoning Effort controls how a lot chain-of-thought reasoning the mannequin performs earlier than answering. Excessive reasoning effort produces the very best accuracy however will increase latency and serving price. Low reasoning effort affords quicker inference with modest accuracy trade-offs. Use excessive for optimum accuracy throughout validation, then take into account decreasing for latency-sensitive manufacturing deployments.
Lambda Concurrency Restrict (SMTJ solely) controls parallel AWS Lambda capabilities for reward analysis. Improve considerably for quick reward capabilities to keep away from analysis throughput turning into a bottleneck.

Keep in mind that batch dimension semantics differ between platforms. On SMTJ, global_batch_size means prompts per step the place every generates N candidates. On SageMaker HyperPod, global_batch_size means complete samples (prompts multiplied by generations). Translate rigorously between environments.

Regularization parameters

Regularization parameters assist stop overfitting, particularly on smaller datasets.

Weight decay defaults to zero. Improve modestly in case you observe overfitting on small datasets. Weight decay applies L2 regularization to constrain parameter magnitudes.
Dropout (hidden and a focus) defaults to zero. Improve hidden dropout modestly for smaller datasets to cut back overfitting. Improve consideration dropout cautiously, as excessive values can harm complicated reasoning capabilities.
Clip ratio and age tolerance are superior SageMaker HyperPod parameters. Clip ratio limits how a lot the coverage can change in a single coaching step. Age tolerance determines how lengthy coaching knowledge stays legitimate earlier than being thought of too stale. Refit frequency controls how usually the mannequin collects recent coaching knowledge. Defaults work for many use circumstances. Solely alter these superior settings in case you perceive the particular stability situation you’re addressing.

Experiments and insights

With these hyperparameters in thoughts, we ran a collection of HPO experiments utilizing Amazon Nova 2.0 throughout public benchmarks together with CoCoHD, MedReason and LLaVA-CoT. The next desk summarizes the experimental configurations and key findings for every parameter sweep.

Dataset	Rank	Alpha	GBS	LR	Max Steps	Warmup	Base Goal Perf.	SFT Goal Perf.	Rank	Perf Diff
MedReason	32	64	32	1.00E-05	312	47	57.38%	63.54%	2	10.75% ↑
MedReason	64	64	32	1.00E-05	312	47	57.38%	63.78%	1	11.16% ↑
MedReason	32	64	32	5.00E-06	312	47	57.38%	63.33%
MedReason	32	64	32	1.00E-05	624	94	57.38%	61.42%
LLavaCOT	64	64	32	1.00E-05	312	47	16.22%	68.47%	1	322.13% ↑
LLavaCOT	32	128	32	1.00E-05	312	47	16.22%	65.77%	2	305.49% ↑

We ran LoRA SFT on Amazon Nova 2 Lite utilizing Nova Forge with rank 32, alpha 64, batch dimension 32, 15 % warmup, and 1 epoch, sweeping solely the training charge to isolate its impact on the right track accuracy. The service default of 1e-5 produced the very best consequence at 63.54 %, a ten.75 % raise over the v4 base. Dropping the training charge to 5e-6 adversely impacted goal efficiency with out meaningfully defending normal capabilities, as MMLU, IFEval, and GPQA scores have been inside noise of the 1e-5 run. Doubling to 2 epochs on the identical studying charge dropped accuracy to 61.42 %, confirming that overtraining on slim area knowledge erodes each area and normal efficiency.

We diversified LoRA rank (32 vs 64) and alpha (64 vs 128) on a multimodal reasoning process the place the bottom mannequin begins at solely 16.22 % accuracy. The most effective configuration, rank 64 with alpha 64, lifted accuracy to 68.47 %, a 322 % relative enchancment over the bottom. Doubling alpha to 128 at rank 32 produced an identical goal acquire at 65.77 %, however at a meaningfully increased general-capability regression price. For duties the place the baseline accuracy is low, rising rank is a higher-leverage adjustment than rising alpha. Alpha needs to be elevated solely when LoRA is under-adapting, and decreased if the mannequin is dropping normal capabilities.

No single hyperparameter configuration works greatest for all use circumstances. These really useful defaults are sturdy beginning factors, not ensures of optimum efficiency.

Widespread pitfalls and find out how to keep away from them

The next desk summarizes the commonest errors practitioners ought to keep away from when tuning Amazon Nova Forge fashions.

Pitfall	Symptom	Resolution
Skipping SFT earlier than RFT	RFT produces no enchancment or degrades efficiency	Run SFT first to get the mannequin into the appropriate behavioral neighborhood earlier than RFT optimization.
Deviating from default LR with knowledge mixing	Coaching instability, loss spikes, functionality collapse	Stick with service defaults when utilizing knowledge mixing. That is the commonest mistake.
Poor reward operate high quality	Accuracy decreases regardless of coaching, or mannequin video games the metric	Refine your reward operate earlier than altering any coaching parameter. Validate with no less than two unbiased judges.
A number of epochs on restricted CPT knowledge	Overfitting, lack of normal capabilities, memorization	Course of at most one epoch of your CPT dataset. Monitor validation loss to detect overfitting early.
Mismatched reasoning settings	Inference conduct doesn’t match coaching conduct	Match `reasoning_enabled` between coaching and inference. If you happen to prepare with reasoning, infer with reasoning.

When tuning fashions with Nova Forge, put money into your reward operate earlier than anything. A poor reward operate will lower accuracy no matter different hyperparameter decisions, whereas a refined one produces constant positive factors on an identical infrastructure. Be certain your reward operate has discriminative energy throughout the mannequin’s high quality vary, as a result of if the whole lot scores excessive, RFT has no gradient to optimize.

The identical validation self-discipline applies to LLM-as-judge choice. Your choose mannequin should reliably distinguish high quality variations throughout the mannequin’s output vary. Validate choose settlement with no less than two unbiased evaluators earlier than committing to a coaching run.

Bear in mind that coaching setting stability mechanisms differ between platforms. SMTJ applies steady KL penalty as a tender constraint, whereas SageMaker HyperPod makes use of gradient clipping as a tough cap per step. Each obtain comparable accuracy, however they require totally different tuning intuitions. Don’t assume parameters switch immediately between environments.

All through all of this, prioritize knowledge high quality over quantity. Filtering aggressively and ensuring coaching examples precisely signify the goal conduct will outperform merely scaling up low-quality knowledge.

Measuring success

Once you apply correct hyperparameter tuning, the outcomes could be substantial. The AWS China Utilized Science staff demonstrated this of their analysis of Amazon Nova Forge, reaching 17 % F1 rating enchancment on a fancy Voice of Buyer classification process whereas sustaining near-baseline MMLU scores.

Key metrics to watch

Coaching loss ought to lower steadily with out sudden spikes. Spikes usually point out studying charge points or knowledge high quality issues.

Validation loss reveals overfitting. If validation loss will increase whereas coaching loss decreases, you’re overfitting. Cut back epochs, enhance regularization, or add extra various knowledge.

KL divergence (for RFT) reveals how far the coverage has drifted. Sudden spikes counsel the mannequin is making massive, doubtlessly unstable updates. Improve the KL loss coefficient if this happens.

Reward metrics (for RFT) ought to enhance steadily. If reward improves quickly then plateaus or drops, the mannequin could also be gaming the reward operate. Revisit your reward design.

Conclusion

Optimizing mannequin customization with Amazon Nova Forge requires balancing artwork and science. The artwork entails understanding trade-offs: checkpoint choice, knowledge mixing technique, and coaching mode choices form your consequence greater than any single hyperparameter. The science entails systematic tuning: studying charge, batch dimension, and technique-specific parameters require cautious configuration based mostly in your knowledge and goals.

Information and reward high quality exceed any hyperparameter in significance. Earlier than tuning coaching parameters, optimize your knowledge pipeline and reward operate. Begin with service defaults, particularly for studying charge and knowledge mixing, as these defaults exist as a result of they work throughout a variety of use circumstances.

For many manufacturing eventualities, the strongest pipeline is SFT adopted by RFT. RFT refines present functionality however can not get better from a low baseline, so supervised fine-tuning wants to determine stable efficiency first. Information mixing needs to be handled as important for manufacturing workloads, not non-obligatory. It prevents catastrophic forgetting and gives optimization stability wanted for dependable outcomes.

When working with continued pre-training, checkpoint choice is essentially the most impactful resolution you’ll make. Match checkpoint flexibility to your knowledge scale: earlier checkpoints for large-scale area adaptation, later checkpoints for smaller datasets the place preserving instruction-following conduct issues.

To get began with Amazon Nova Forge, discover the Amazon Nova documentation and the SageMaker HyperPod recipes repository on GitHub. For hands-on examples of information mixing in motion, see the Nova Forge knowledge mixing weblog publish. For a deeper dive into RFT with Nova Forge see the Reinforcement fine-tuning for Amazon Nova: Educating AI by suggestions weblog publish.

Acknowledgements

The authors wish to thank Zheng Du, Bharathan Balaji, Anjie Fang, and Mengnong Xu from the AWS AGI Customization Science staff for his or her technical steerage.