Training a frontier model is extremely compute-intensive, requiring a distributed system of hundreds, or thousands, of accelerated instances running for several weeks or months to complete a single job. For example, pre-training the Llama 3 70B model with 15 trillion training tokens took 6.5 million H100 GPU hours. On 256 Amazon EC2 P5 instances (p5.48xlarge, each with 8 NVIDIA H100 GPUs), this would take approximately 132 days.
Distributed training workloads run in a synchronous manner because each training step requires all participating instances to complete their calculations before the model can advance to the next step. This means that if a single instance fails, it stops the entire job. As cluster sizes grow, the likelihood of failure increases due to the number of hardware components involved. Each hardware failure can result in wasted GPU hours and requires valuable engineering time to identify and resolve the issue, making the system prone to downtime that can disrupt progress and delay completion. To assess system reliability, engineering teams often rely on key metrics such as mean time between failures (MTBF), which measures the average operational time between hardware failures and serves as a valuable indicator of system robustness.
In this post, we explore the challenges of large-scale frontier model training, focusing on hardware failures and the benefits of Amazon SageMaker HyperPod, a resilient solution that minimizes disruptions, enhances efficiency, and reduces training costs.
Instance failure rate
To understand the typical MTBF for large-scale frontier model training, it helps to first understand instance failure rates by reviewing three noteworthy examples:
- When training OPT-175B on 992 A100 GPUs, Meta AI encountered significant hardware reliability challenges. Over 2 months, the team managed 35 manual restarts and cycled over 100 hosts due to hardware issues, and automated systems triggered more than 70 restarts. Running 124 instances (each with 8 GPUs) continuously over 1,440 hours, Meta accumulated a total of 178,560 instance-hours. The observed failure rate during this period was around 0.0588% per instance-hour, underscoring the reliability hurdles in training large frontier models at this scale.
- During the training of Llama 3.1 405B on 16,000 H100 GPUs, a total of 417 unscheduled hardware failures occurred during a 54-day period. This translates to an effective failure rate of about 0.0161% per instance-hour.
- MPT-7B was trained on 1 trillion tokens over the course of 9.5 days on 440 x A100-40GB. During this period, the training job experienced 4 hardware failures, resulting in an effective failure rate of approximately 0.0319% per instance-hour.
Based on these examples, it's realistic to expect that in a single hour of large-scale distributed training, an instance will fail about 0.02%–0.06% of the time.
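To make the arithmetic behind these rates explicit, here is a minimal sketch that recomputes two of them from the figures quoted above (the helper function and its name are ours; instance counts assume 8 GPUs per instance):

```python
def failure_rate_per_instance_hour(failures: int, instances: int, hours: float) -> float:
    """Observed failure rate: fraction of instance-hours that end in a hardware failure."""
    return failures / (instances * hours)

# Llama 3.1 405B: 16,000 H100 GPUs = 2,000 instances, 417 failures over 54 days
llama = failure_rate_per_instance_hour(failures=417, instances=16_000 // 8, hours=54 * 24)

# MPT-7B: 440 A100 GPUs = 55 instances, 4 failures over 9.5 days
mpt = failure_rate_per_instance_hour(failures=4, instances=440 // 8, hours=9.5 * 24)

print(f"Llama 3.1 405B: {llama:.4%} per instance-hour")  # ~0.0161%
print(f"MPT-7B:         {mpt:.4%} per instance-hour")    # ~0.0319%
```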
Larger clusters, more failures, smaller MTBF
As cluster size increases, the entropy of the system increases, resulting in a lower MTBF. The following table illustrates how the MTBF (in hours) changes with the number of instances in a cluster and the assumed per-instance failure rate. For example, with a 0.04% per-hour failure rate per instance, a 512-instance cluster is expected to experience a failure approximately every 5 hours.
| Failure rate (per instance per hour) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 0.01% | 2500 | 1250 | 625 | 313 | 157 | 79 | 40 | 20 |
| 0.02% | 1250 | 625 | 313 | 157 | 79 | 40 | 20 | 10 |
| 0.04% | 625 | 313 | 157 | 79 | 40 | 20 | 10 | 5 |
| 0.08% | 313 | 157 | 79 | 40 | 20 | 10 | 5 | 3 |

Table 1: The change in MTBF (in hours) with the number of instances in a training cluster (columns are cluster sizes in instances; rows are assumed per-instance failure rates)
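Table 1 follows from treating instance failures as independent events, so the expected cluster-wide MTBF is the reciprocal of the per-instance hourly failure rate times the cluster size. A minimal sketch that reproduces the table under that assumption (values rounded up, as in the table):

```python
import math

failure_rates = [0.0001, 0.0002, 0.0004, 0.0008]   # 0.01% .. 0.08% per instance per hour
cluster_sizes = [4, 8, 16, 32, 64, 128, 256, 512]

def mtbf_hours(rate_per_instance_hour: float, instances: int) -> float:
    # Expected hours between failures anywhere in the cluster, assuming
    # independent, identically distributed per-instance failures.
    return 1.0 / (rate_per_instance_hour * instances)

for rate in failure_rates:
    row = [math.ceil(mtbf_hours(rate, n)) for n in cluster_sizes]
    print(f"{rate:.2%}: {row}")
# The 0.04% row gives about 5 hours between failures at 512 instances, as in Table 1.
```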
What happens after a failure?
In a perfect world, without failures, the training job proceeds as shown in the following graph, which illustrates the total training time without failures, demonstrating a linear progression.

Figure 1: Training is linear in a perfect world without failures, since there are no interruptions to completion.
However, as previously noted, hardware failures are inevitable. Troubleshooting these failures typically involves several steps:
- Root cause analysis (mean time to detect) – Identifying hardware failures as the root cause of training interruptions can be time-consuming, especially in complex systems with multiple potential failure points. The time taken to determine the root cause is referred to as mean time to detect (MTTD).
- Hardware repair or replacement (mean time to replace) – Sometimes, a simple instance restart resolves the issue. At other times, the instance must be replaced, which can involve logistical delays, especially if specialized components aren't readily available. If a replacement instance isn't on hand when a GPU fails, the system must wait for one to become available. Common distribution strategies, such as PyTorch FSDP, don't allow workload redistribution among the remaining instances.
- System recovery and resumption (mean time to restart) – After resolving hardware issues and replacing the instance, additional time is needed to restore it to its previous state. The new instance must match the original configuration, and the entire cluster must load the model weights from the latest saved checkpoint.
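The total downtime per failure is simply the sum of these three components. As a quick illustration, using the 256-instance P90 measurements reported later in Table 2:

```python
# P90 values for a 256-instance cluster, taken from Table 2 later in this post (in seconds).
mean_time_to_detect = 89
mean_time_to_replace = 903
mean_time_to_resume = 1398

downtime_per_failure_s = mean_time_to_detect + mean_time_to_replace + mean_time_to_resume
print(f"{downtime_per_failure_s} s (~{downtime_per_failure_s / 60:.0f} minutes) of idle cluster time per failure")
# 2390 s, roughly 40 minutes
```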
Each failure requires engineering effort to identify its root cause. When hardware issues arise, diagnostics confirm the problem and isolate the faulty instance, pausing the training job and increasing downtime. The impact of these failures is illustrated in the following figure and can be empirically measured for large distributed training jobs. The figure outlines the troubleshooting steps that follow a failure.

Figure 2: Impact of failures on a distributed training run. Once a failure occurs, time (idle GPUs) is spent on detecting (MTTD), replacing (MTT Replace), and resuming (MTT Restart) the training run, often wasting time and expensive resources.
In a scenario where a distributed training job is running on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster with n reserved instances and an Auto Scaling group set to maintain a minimum of n instances, a hardware issue such as a GPU failure can cause the job to fail. The affected instance will be marked as unhealthy by a Kubernetes health monitor such as Node Problem Detector, and Amazon EKS will attempt to reschedule the training pods to healthy instances. If no instances have sufficient resources, the pods remain in a Pending state, and because the instance count is limited to n, no new instance will be automatically provisioned.
In such cases, the failed job must be manually identified through pod logs or the Kubernetes API and deleted. The failed instance also needs to be isolated and terminated manually, either through the AWS Management Console, the AWS Command Line Interface (AWS CLI), or tools like kubectl or eksctl. To restore cluster capacity, the user must increase the cluster size by modifying the Auto Scaling group or updating the instance group. After the new instance is provisioned, bootstrapped, and added to the cluster, the training job must be restarted manually. If checkpointing is enabled, the job can resume from the last saved state. The overall downtime depends on the time required to provision a new instance and restart the job by rescheduling the pods.
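As an illustration only (not a prescribed procedure), the following sketch strings these manual steps together with kubectl and the AWS SDK for Python (boto3). The node name, instance ID, job name, and manifest file are placeholders for values you would look up from your own cluster, and in practice an operator would typically run these commands interactively.

```python
import subprocess
import boto3

# Placeholder identifiers -- substitute values from your own cluster.
FAILED_NODE = "ip-10-0-1-23.ec2.internal"
FAILED_INSTANCE_ID = "i-0123456789abcdef0"

# 1. Delete the failed training job after confirming the root cause in the pod logs.
subprocess.run(["kubectl", "delete", "job", "llama-pretrain"], check=True)

# 2. Cordon the unhealthy node and remove it from the cluster so nothing is scheduled there.
subprocess.run(["kubectl", "cordon", FAILED_NODE], check=True)
subprocess.run(["kubectl", "delete", "node", FAILED_NODE], check=True)

# 3. Terminate the faulty instance. Terminating through the Auto Scaling API without
#    decrementing desired capacity prompts the group to launch a replacement; alternatively,
#    terminate via EC2 and raise the group's desired capacity.
boto3.client("autoscaling").terminate_instance_in_auto_scaling_group(
    InstanceId=FAILED_INSTANCE_ID,
    ShouldDecrementDesiredCapacity=False,
)

# 4. Once the replacement node has joined and is Ready, resubmit the training job so it
#    resumes from the latest checkpoint.
subprocess.run(["kubectl", "apply", "-f", "training-job.yaml"], check=True)
```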
Faster failure detection (shorter MTTD), shorter replacement times (shorter MTTR), and rapid resumption will all contribute to reducing total training time. Automating these processes with minimal user intervention is a key advantage of Amazon SageMaker HyperPod.
Amazon SageMaker HyperPod resilient training infrastructure
SageMaker HyperPod is a compute environment optimized for large-scale frontier model training. This means users can build resilient clusters for machine learning (ML) workloads and develop or fine-tune state-of-the-art frontier models, as demonstrated by organizations such as Luma Labs and Perplexity AI. SageMaker HyperPod runs health monitoring agents in the background for each instance. When it detects a hardware failure, SageMaker HyperPod automatically repairs or replaces the faulty instance and resumes training from the last saved checkpoint. This automation alleviates the need for manual management, which means customers can train in distributed settings for weeks or months with minimal disruption. The benefits are particularly significant for customers deploying many instances (greater than 16) in a cluster.
Frontier model builders can further enhance model performance using built-in ML tools within SageMaker HyperPod. They can use Amazon SageMaker AI with MLflow to create, manage, and track ML experiments, or use Amazon SageMaker AI with TensorBoard to visualize model architecture and address convergence issues. Additionally, integrating with observability tools such as Amazon CloudWatch Container Insights, Amazon Managed Service for Prometheus, and Amazon Managed Grafana provides deeper insights into cluster performance, health, and utilization, ultimately saving valuable development time. The following figure compares the downtime of an infrastructure system using SageMaker HyperPod versus one without SageMaker HyperPod.

Figure 3: Comparing the downtime chart from Figure 1 with downtime on SageMaker HyperPod. When a failure occurs, it is detected automatically by HyperPod agents, and the instance is replaced in the background. Training is resumed from the latest checkpoint.
SageMaker HyperPod reduces the downtime per hardware failure by automatically detecting hardware issues. When these issues are detected, SageMaker HyperPod automatically replaces the faulty node(s) and resumes your training job from the latest checkpoint, assuming that checkpoints are written.
To evaluate this, we conducted experiments on SageMaker HyperPod using different cluster sizes of p5.48xlarge instances. The results in the following table show empirical measurements of time to resume by cluster size, reported as the 90th percentile (P90), the value at or below which 90% of the observed measurements fall.
| Cluster size (number of instances) | P90 time to detect (seconds) | P90 time to replace (seconds) | P90 time to resume (seconds) | Total downtime per failure (seconds) | Total downtime per failure (minutes) |
|---|---|---|---|---|---|
| 16 | 83 | 912 | 1212 | 2207 | 36.8 |
| 64 | 90 | 963 | 1320 | 2373 | 39.6 |
| 256 | 89 | 903 | 1398 | 2390 | 39.8 |
| 1024 | 80 | 981 | 1440 | 2501 | 41.7 |

Table 2: Empirical P90 time to resume (in seconds) on clusters of different sizes
As shown, the mean time to replace an instance is independent of cluster size. For a cluster of 256 p5.48xlarge instances training the Meta Llama 3.1 70B parameter model with batch size = 8, replacing an instance takes about 940 seconds (or 15.7 minutes). After replacement, the new instance must install additional packages using lifecycle scripts and run deep health checks before reading from the latest saved checkpoint. When it's operational, the training job resumes from the latest checkpoint, minimizing progress loss despite the interruption. For a 256-instance cluster, it took us about 2,390 seconds (about 40 minutes) to automatically resume the training job after each failure.
Without SageMaker HyperPod, when a GPU failure occurs during a training job, the time it takes to resume training can vary widely depending on the infrastructure and processes in place. With proper checkpointing, automated job orchestration, and efficient hardware provisioning, the resume time can be reduced. However, without these optimizations, the impact can be much more severe. Empirical evidence from customer experiences, including a leading open source frontier model provider, a top large language model (LLM) startup, an AI company specializing in enterprise frontier models, and a cutting-edge scientific research institute, indicates that without SageMaker HyperPod, the total downtime per GPU failure averages roughly 280 minutes per failure. Thus, Amazon SageMaker HyperPod saves about 240 minutes (or about 4 hours) of downtime per failure:
| Metric | Without SageMaker HyperPod (minutes) | With SageMaker HyperPod (minutes) |
|---|---|---|
| Mean time to root-cause | 10 | 1.5 |
| Mean time to replace | 240 | 15 |
| Mean time to resume | 30 | 23.5 |
| Total downtime per failure | 280 | 40 |

Table 3: Typical failure numbers, in minutes (as described in the section "What happens after a failure?"), with and without SageMaker HyperPod
Quantifying the downtime savings
Depending on the frequency of failures, we can calculate the time to train and the cost savings of using SageMaker HyperPod. To illustrate this calculation, we assume it takes 40 minutes to replace an instance with SageMaker HyperPod compared to 280 minutes without it (as previously explained). Additionally, for this calculation, let's assume a training job requiring 10 million GPU hours on H100 instances, running on a 256-instance P5 cluster.
Although the absolute overhead (in hours) depends on the size of the training job, the relative overhead remains constant. The benefits of SageMaker HyperPod in reducing total training time are shown in the following table. For example, in a 256-instance cluster with a failure rate of 0.05%, SageMaker HyperPod reduces total training time by 32% (a sketch of this calculation follows the table).
| Failure rate (per instance per hour) | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 |
|---|---|---|---|---|---|---|---|---|
| 0.01% | 0% | 0% | 1% | 1% | 2% | 5% | 9% | 17% |
| 0.02% | 0% | 1% | 1% | 2% | 5% | 9% | 17% | 28% |
| 0.05% | 1% | 2% | 3% | 6% | 11% | 20% | 32% | 48% |
| 0.07% | 1% | 2% | 4% | 8% | 15% | 25% | 40% | 55% |

Table 4: Total % of training time reduced by SageMaker HyperPod compared to a P5 cluster of similar size (columns are cluster sizes in instances)
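Table 4 can be reproduced with a short calculation that compares the ideal training time against the time inflated by roughly 40 minutes of downtime per failure with SageMaker HyperPod versus 280 minutes without. A minimal sketch under those assumptions, using the 10 million GPU-hour, 256-instance example from the text (function and variable names are ours):

```python
def total_training_days(ideal_days: float, instances: int,
                        failure_rate: float, downtime_hours: float) -> float:
    """Ideal training days plus days lost to failures (failures accrue during ideal compute time)."""
    failures = instances * failure_rate * 24 * ideal_days
    return ideal_days + failures * downtime_hours / 24

ideal_days = 10_000_000 / (256 * 8) / 24      # ~203 days of pure compute on 2,048 H100 GPUs
with_hp = total_training_days(ideal_days, 256, 0.0005, downtime_hours=40 / 60)
without_hp = total_training_days(ideal_days, 256, 0.0005, downtime_hours=280 / 60)

reduction = (without_hp - with_hp) / without_hp
print(f"{with_hp:.0f} vs {without_hp:.0f} days -> {reduction:.0%} less total training time")
# ~221 vs ~325 days -> ~32%, matching the 256-instance, 0.05% cell of Table 4
```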
To translate this into actual savings, for a training job requiring 10 million GPU hours on a 256-instance cluster, SageMaker HyperPod saves 104 days of training time. Consequently, customers can reduce time-to-market by 3.5 months. Without SageMaker HyperPod, the total time to train would be roughly 325 days, 121 of which are spent solely on isolating and mitigating hardware issues. The following table shows the time-to-train benefits.
| Parameter | Value |
|---|---|
| H100 GPU hours for training | 10,000,000 |
| Number of instances | 256 |
| Failure rate (per instance per hour) | 0.05% |
| Extra time to fix per failure (hours) | 4 |
| Days lost due to hardware issues (with SageMaker HyperPod) | 17 |
| Days lost due to hardware issues (without SageMaker HyperPod) | 121 |
| Time to train with SageMaker HyperPod (days) | 221 |
| Time to train without SageMaker HyperPod (days) | 325 |
| SageMaker HyperPod improvement | 32% |
| Time saved with SageMaker HyperPod (days) | 104 |

Table 5: Benefits provided by SageMaker HyperPod for a training run requiring 10 million GPU hours and a 256-instance cluster. SageMaker HyperPod saves 104 days of training time overall, resulting in a faster time to market (by 3.5 months!)
For the same example, we can estimate the total cost savings using:
Days lost due to hardware issues = (Number of instances) × (Failure rate per instance per hour) × (24 hours per day) × (Total training days) × (Downtime per failure in hours) ÷ (24 hours per day)
The following table shows the cost-to-train benefits.
| Parameter | Value |
|---|---|
| H100 GPU hours for training | 10,000,000 |
| Number of instances | 256 |
| Failure rate (per instance per hour) | 0.05% |
| Time saved with SageMaker HyperPod (days) | 104 |
| Cost per GPU per hour | $5 |
| Total cost savings with SageMaker HyperPod | $25,559,040 |

Table 6: Using the calculation described above, the cost-to-train benefits for a training run requiring 10 million GPU hours, 256 GPU-based instances, and an assumed failure rate of 0.05% per instance per hour
A training job requiring 10 million GPU hours and 104 additional days of resolving hardware issues results in significant idle cluster time. Assuming a GPU cost of $5 per hour (equivalent to the price of P5 instances on Capacity Blocks for ML), the total cost savings with SageMaker HyperPod amounts to $25,559,040.
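The dollar figure follows directly from the inputs just listed; a minimal sketch of the arithmetic (variable names are ours, and 8 GPUs per p5.48xlarge instance is assumed):

```python
instances = 256
gpus_per_instance = 8
gpu_price_per_hour = 5.00      # USD per GPU-hour, roughly the P5 Capacity Blocks for ML price
days_saved = 104               # from Table 5

# Every saved day is a day the full cluster is not sitting idle waiting on repairs.
total_savings = days_saved * 24 * instances * gpus_per_instance * gpu_price_per_hour
print(f"${total_savings:,.0f}")   # $25,559,040
```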
Summary
Training frontier models is a complex, resource-intensive process that is particularly vulnerable to hardware failures. In this post, we explored the instance failure rate, which can range from about 0.02% to 0.07% per hour during large-scale distributed training. As cluster size grows, the likelihood of failures increases, and the MTBF decreases. We also examined what happens after a failure, including root cause analysis, hardware repair or replacement, and system recovery and resumption.
Next, we examined Amazon SageMaker HyperPod, a purpose-built, fully resilient cluster for frontier model training. By incorporating robust fault-tolerance mechanisms and automated health monitoring, SageMaker HyperPod minimizes disruptions caused by hardware issues. This not only streamlines the training process but also enhances the reliability and efficiency of model development, enabling faster and more effective innovation delivery. The benefits are measurable and correlate with both cluster size and failure rate. For a 256-instance cluster with a 0.05% per-instance-per-hour failure rate, SageMaker HyperPod reduces total training time by 32%, resulting in approximate savings of $25.6 million in total training costs.
By addressing the reliability challenges of frontier model training, SageMaker HyperPod allows ML teams to focus on model innovation rather than infrastructure management. Organizations can now conduct long training runs with confidence, knowing that hardware failures will be automatically detected and resolved with minimal disruption to their ML workloads. Get started with Amazon SageMaker HyperPod.
Special thanks to Roy Allela, Senior AI/ML Specialist Solutions Architect, for his support on the launch of this post.
About the Authors
Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top frontier model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.
Trevor Harvey is a Principal Specialist in generative AI at Amazon Web Services (AWS) and an AWS Certified Solutions Architect – Professional. Trevor works with customers to design and implement machine learning solutions and leads go-to-market strategies for generative AI services.
Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.