As organizations scale their AI infrastructure to support trillion-parameter models, they face a difficult trade-off: faster recovery at a higher cost, or lower cost at the risk of losing progress. When they checkpoint frequently to speed up recovery and minimize lost training time, they incur significantly higher storage costs. And when they checkpoint infrequently, they reduce costs at the risk of losing valuable training progress when failures occur.
This challenge is exacerbated in large distributed training environments with thousands of accelerators, where issues can occur frequently. According to an article published by Meta, one failure occurred every 3 hours during Meta Llama 3 model training. GPU issues accounted for 60% of the total failures, and network, CPU, and disks accounted for the other 40%. With infrequent checkpointing, these accumulated failures can result in losing days of training progress over the course of an entire training run, thereby driving up costs and time to market. Frequent checkpoints, on the other hand, can saturate networks, overload storage, and result in unpredictable performance.
To help solve these challenges, AWS announced managed tiered checkpointing in Amazon SageMaker HyperPod, a purpose-built infrastructure to scale and accelerate generative AI model development across thousands of AI accelerators. Managed tiered checkpointing uses CPU memory for high-performance checkpoint storage, with automatic data replication to adjacent compute nodes for enhanced reliability. Although SageMaker HyperPod identifies node issues automatically and replaces the affected nodes so your training can resume, managed tiered checkpointing helps you implement the best checkpointing strategy and maximize your training throughput.
Managed tiered checkpointing has been tested on large distributed training clusters ranging from hundreds of GPUs to over 15,000 GPUs, with checkpoints saved within seconds.
In this post, we dive deep into these concepts and show how to use the managed tiered checkpointing feature.
Solution overview
Checkpointing is the method of saving an intermediate model state during the training process. By saving the model's parameters, optimizer states, and other metadata during training, you can resume from a recent checkpoint in the event of an issue. Additionally, you can fix training problems, such as irregular learning rates, without a full restart by loading an earlier checkpoint state.
Use the following formula to get a rough initial estimate of the total checkpoint size for your model, without the optimizer state:

Model checkpoint size (GB) = (Number of parameters × Bytes per parameter) ÷ 1024³

For example, if you train a Meta Llama 3 70-billion-parameter model using BFloat16 as the parameter precision, the checkpoint size will be 130 GB. If you train a DeepSeek-R1 671-billion-parameter model using BFloat16, the checkpoint size will be 1.25 TB. Both without storing optimizer states.

Checkpoints also include optimizer states, training metadata (such as the step number), and other additional data, resulting in a larger than expected size. When using an Adam optimizer, the optimizer saves three additional float16 statistics per parameter, adding an extra 6 bytes per parameter. Therefore, with the optimizer state saved, the Meta Llama 3 70B model checkpoint will be roughly 521 GB, and the DeepSeek-R1 671B model checkpoint will be roughly 5 TB. That is a four-times increase in size, and handling these checkpoints becomes a challenge.
The following table summarizes the checkpoint sizes for each model, followed by a short calculation you can use to reproduce the estimates.
| Model name | Checkpoint size | Checkpoint size + optimizer states |
| --- | --- | --- |
| Meta Llama 3 70B | 130 GB | 521 GB |
| DeepSeek-R1 671B | 1.25 TB | 5 TB |
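As a quick sanity check, the following sketch applies the formula above to both models. It is plain arithmetic with no AWS dependencies:

```python
def checkpoint_size_gb(num_params: float, bytes_per_param: float) -> float:
    """Rough checkpoint size estimate in GB, excluding training metadata."""
    return num_params * bytes_per_param / 1024**3

for name, params in [("Meta Llama 3 70B", 70e9), ("DeepSeek-R1 671B", 671e9)]:
    weights_only = checkpoint_size_gb(params, 2)        # BFloat16 weights: 2 bytes per parameter
    with_optimizer = checkpoint_size_gb(params, 2 + 6)  # plus ~6 bytes per parameter of Adam state
    print(f"{name}: ~{weights_only:,.0f} GB weights only, ~{with_optimizer:,.0f} GB with optimizer state")
```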
It's also important to consider the training strategy. In a Fully Sharded Data Parallel (FSDP) scenario, each rank (a single GPU process in a distributed training job) saves its own part of the checkpoint. This reduces the amount of data each rank has to save during a checkpoint, but it puts pressure on the file system. On a Network File System (NFS) shared file system, these concurrent writes become a bottleneck. Using a distributed file system, such as Amazon FSx for Lustre, can help alleviate that pressure at a higher total cost. In a Distributed Data Parallel (DDP) scenario, a single rank writes the whole checkpoint at once, and all ranks read the checkpoint when loading it back. At the file system level, this means a single writer and multiple readers. On an NFS file system, many readers can be a problem because they are constrained by the file system, network stack, and queue size, and a single writer over the network won't take advantage of all the available network throughput. Here again, a fast, distributed file system like FSx for Lustre can help solve these problems at a higher total cost of ownership.
As we can see, traditional checkpointing strategies that rely solely on remote persistent storage create overhead during checkpoint creation, because writing terabytes of model parameters to persistent storage can throttle it, consume expensive network bandwidth, and require complex orchestration across distributed systems. By storing checkpoints in fast-access in-memory locations, such as CPU RAM, while maintaining a configurable backup to Amazon Simple Storage Service (Amazon S3) for persistence, the system delivers faster recovery times and is a cost-effective solution compared to traditional disk-based approaches.
Managed tiered checkpointing works as follows:
- When training your model, you define the checkpoint frequency.
- Model training uses GPU HBM memory to store the model, its parameters, and intermediate results, and to do the heavy computation.
- Triggering a checkpoint pauses model training. The GPU converts the model weights (tensors) into a state dictionary and copies the data to the instance's CPU; training then resumes while managed tiered checkpointing copies the data to RAM.
- Because RAM is volatile, managed tiered checkpointing asynchronously copies the data from the host RAM to adjacent nodes using RDMA over Elastic Fabric Adapter (EFA). If a node experiences an issue, its checkpoint data is also available on other nodes.
- Periodically, it copies the data to a second layer of persistent storage, such as Amazon S3. This helps both when writing to RAM fails and when you want to persistently store the checkpoint data for future use.
With managed tiered checkpointing, you can configure frequency and retention policies for both the in-memory and persistent storage tiers. You use the first layer (in-memory) to save checkpoints at a high frequency for fast recovery, while periodically saving to Amazon S3 for backup. Managed tiered checkpointing provides a file system that can be seamlessly integrated with your PyTorch Distributed Checkpointing (DCP) training; adding it to your training script only requires a few lines of code. Additionally, it improves checkpoint performance by using in-memory storage while using other tiers for persistence. PyTorch DCP solves the problem of saving a model's checkpoint when training uses distributed resources, such as multiple GPUs across multiple compute nodes. Trainers, parameters, and the dataset are partitioned across these nodes and resources, and PyTorch DCP saves and loads from multiple ranks in parallel. PyTorch DCP produces multiple files per checkpoint, at least one per rank. Depending on the number and size of these files, shared and network file systems such as NFS struggle with inode and metadata management. Managed tiered checkpointing helps solve that issue by making it possible to use multiple tiers, reducing the intrusion into training time while keeping the benefits of PyTorch DCP, such as deduplication of checkpoint data.
With managed tiered checkpointing in SageMaker HyperPod, you can maintain high training throughput even in large-scale environments prone to failures. It uses your existing SageMaker HyperPod cluster orchestrated by Amazon Elastic Kubernetes Service (Amazon EKS) and its compute nodes, and there is no additional cost to use the library.
In the following sections, we explore how to configure the SageMaker HyperPod cluster and training scripts to use this new feature.
Configure your SageMaker HyperPod cluster for managed tiered checkpointing
SageMaker HyperPod provisions resilient clusters for running machine learning (ML) workloads and developing state-of-the-art models such as large language models (LLMs), diffusion models, and foundation models (FMs). By reducing the complex work of building and maintaining compute clusters that use accelerators like AWS Trainium and NVIDIA H200/B200 GPUs, it speeds up the creation of foundation models. To create a new SageMaker HyperPod cluster, refer to the Amazon SageMaker HyperPod Developer Guide. If you want to accelerate your deployment by using field-hardened assets, refer to the following GitHub repo.
The examples shared in this post are intended to help you learn more about this new feature. If you're considering running the examples provided here in a production environment, have your security team review the content and make sure they adhere to your security standards. At AWS, security is the top priority, and we understand that every customer has their own security framework.

Before creating or updating a cluster to add the managed tiered checkpointing feature, you must set up the EKS pods to access an S3 bucket, either in your own account or across accounts. When working with buckets in the same account as the SageMaker HyperPod EKS cluster, you can use the following policy (change the bucket name before applying it):
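The following is a minimal sketch of such a policy, assuming a placeholder bucket named amzn-s3-demo-bucket; adjust the actions and resources to match your checkpoint prefix and security requirements.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListCheckpointBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::amzn-s3-demo-bucket"
    },
    {
      "Sid": "ReadWriteCheckpointObjects",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": "arn:aws:s3:::amzn-s3-demo-bucket/*"
    }
  ]
}
```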
If the bucket is in a different account, you must authorize an AWS Identity and Access Management (IAM) principal to access it. The following IAM policy does that for you. Be sure to change both the bucket name and the IAM principal (for example, your AWS account ID).
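One way to grant that cross-account access is with a bucket policy attached to the bucket in the owning account. The sketch below uses the same placeholder bucket name and a placeholder account ID of 111122223333; scope the principal and actions down to what your training jobs need.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCrossAccountCheckpointAccess",
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
      "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket",
        "arn:aws:s3:::amzn-s3-demo-bucket/*"
      ]
    }
  ]
}
```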
To create a new cluster with managed tiered checkpointing, you can pass the `--tiered-storage-config` parameter and set `Mode` to `Enable` in an AWS Command Line Interface (AWS CLI) command:
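A sketch of what that command might look like. Only the `--tiered-storage-config` parameter and the `Mode` value come from this post; the cluster name, orchestrator ARN, and instance group file are placeholders, so confirm the exact JSON shape in the CreateCluster API reference.

```bash
aws sagemaker create-cluster \
  --cluster-name my-hyperpod-cluster \
  --orchestrator 'Eks={ClusterArn=arn:aws:eks:us-east-1:111122223333:cluster/my-eks-cluster}' \
  --instance-groups file://instance-groups.json \
  --tiered-storage-config '{"Mode": "Enable"}'
```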
You can also update an existing cluster using the `UpdateCluster` API, passing the `CachingConfig` parameter with the required `AllocatedMemory` configuration. You can use the `CachingConfig` parameter to define either a fixed value or a percentage of the CPU RAM for checkpointing.
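A sketch of that update under the same caveat; the nesting of `CachingConfig` and `AllocatedMemory` shown here is an assumption for illustration, so check the UpdateCluster API reference for the authoritative schema.

```bash
aws sagemaker update-cluster \
  --cluster-name my-hyperpod-cluster \
  --tiered-storage-config '{
    "Mode": "Enable",
    "CachingConfig": {
      "AllocatedMemory": {"Percentage": 20}
    }
  }'
```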
Now that your SageMaker HyperPod cluster has the managed tiered checkpointing feature, let's prepare and update the training scripts.
Install the managed tiered checkpointing libraries and integrate them with your training script
Managed tiered checkpointing integrates with PyTorch DCP. You start by installing the `sagemaker-checkpointing` library. Then you create and configure a namespace to store the checkpoints based on the defined frequency. Finally, you add the checkpoint function inside your training loop.
To install the library, we simply use Python's pip. Make sure you already have the dependencies installed: Python 3.10 or higher, PyTorch with DCP support, and AWS credentials configured correctly. To integrate Amazon S3 as another storage layer, you also need `s3torchconnector` installed.
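For example, assuming the package name `amzn-sagemaker-checkpointing` referenced in the cleanup section later in this post:

```bash
pip install amzn-sagemaker-checkpointing s3torchconnector
```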
Now you can import the library in your script and configure the namespace and frequency for checkpointing:
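The following is a minimal sketch of that configuration. Only `namespace`, `world_size`, the 100-step Amazon S3 backup interval, and the writer/reader class names are taken from this post; the module path, configuration class name, and S3-related parameter names are assumptions for illustration, so check the library documentation for the exact API.

```python
import os

import torch.distributed as dist

# Hypothetical module path and class name for the checkpoint configuration.
from amzn_sagemaker_checkpointing.config import SageMakerCheckpointConfig

# world_size is the total number of ranks (GPUs) participating in the training job.
world_size = dist.get_world_size() if dist.is_initialized() else int(os.environ.get("WORLD_SIZE", "1"))

checkpoint_config = SageMakerCheckpointConfig(
    namespace="llama3-70b-pretraining",                   # required: unique name for this training run
    world_size=world_size,                                # required: number of ranks in the cluster
    s3_base_path="s3://amzn-s3-demo-bucket/checkpoints",  # assumed parameter: persistent S3 tier location
    save_to_s3_interval_steps=100,                        # assumed parameter: back up to Amazon S3 every 100 steps
)
```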
In the preceding code snippet, we configured managed tiered checkpointing with the same `world_size` as the number of ranks in our cluster. When you start a distributed training job, each GPU in the cluster is assigned a rank number, and the total number of GPUs available is the `world_size`. We set up Amazon S3 as our backup persistent storage, configuring managed tiered checkpointing to store data in Amazon S3 every 100 training steps. Both `world_size` and `namespace` are required parameters; the others are optional.
Now that the configuration is ready, let's set up PyTorch DCP and integrate managed tiered checkpointing.
First, configure the storage writer. This component is passed to the PyTorch DCP `async_save` function along with the model's state dictionary. We use the `SageMakerTieredStorageWriter` when writing checkpoints and the `SageMakerTieredStorageReader` when restoring from them.
Inside your model training loop, add the storage writer configuration and pass along both the managed tiered checkpointing configuration and the step number:
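A minimal sketch of that loop follows. The writer class name comes from this post, while the import path and the `checkpoint_config`/`step` argument names are assumptions for illustration; `dataloader` and `train_step` stand in for your existing training code.

```python
from torch.distributed.checkpoint import async_save  # PyTorch DCP asynchronous save, called after the writer is built

# Hypothetical import path for the storage writer.
from amzn_sagemaker_checkpointing.filesystem import SageMakerTieredStorageWriter

checkpoint_interval = 10  # save to the in-memory tier every 10 steps

for step, batch in enumerate(dataloader, start=1):
    loss = train_step(model, optimizer, batch)  # your existing forward/backward/optimizer logic

    if step % checkpoint_interval == 0:
        # State dictionary that PyTorch DCP will persist.
        state_dict = {
            "model": model.state_dict(),
            "optimizer": optimizer.state_dict(),
            "step": step,
        }
        # Pass both the managed tiered checkpointing configuration and the step number.
        storage_writer = SageMakerTieredStorageWriter(
            checkpoint_config=checkpoint_config,  # assumed argument name
            step=step,                            # assumed argument name
        )
        # The async_save call that consumes storage_writer is shown next.
```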
You can define the step number explicitly for the storage writer, or you can let the storage writer identify the step number from the path where the checkpoint is being saved. If you want the storage writer to infer the step number from the base path, don't set the `step` parameter and make sure your path contains the step number.
Now you can call the PyTorch DCP asynchronous save function, passing the state dictionary and the storage writer: `async_save(state_dict=state_dict, storage_writer=storage_writer)`
We have now set up managed tiered checkpointing to write checkpoints at our desired frequency and location (in-memory). Let's use the storage reader to restore these checkpoints. First, pass the managed tiered checkpointing configuration to the `SageMakerTieredStorageReader`, then call the PyTorch DCP load function, passing the model state dictionary and the storage reader configuration:
To work through a complete example, refer to the following GitHub repository, where we created a simple training script that includes the managed tiered checkpointing feature.
Clean up
After you have finished working with managed tiered checkpointing and want to clean up the environment, simply remove the `amzn-sagemaker-checkpointing` library by running `pip uninstall amzn-sagemaker-checkpointing`.
If you installed the solution in a Python virtual environment, deleting the virtual environment will suffice.

Managed tiered checkpointing is a free feature that doesn't require additional resources to run. You use your existing SageMaker HyperPod EKS cluster and compute nodes.
Best practices to optimize your checkpoint strategy with managed tiered checkpointing
Managed tiered checkpointing attempts to write to the in-memory tier first, which optimizes write performance because in-memory storage provides ultra-low-latency checkpoint access. You should also configure managed tiered checkpointing to write periodically to a second layer, such as Amazon S3. For example, configure it to write to the in-memory layer every 10 steps and to Amazon S3 every 100 steps.
If managed tiered checkpointing fails to write to the in-memory layer and the node then experiences an issue, you still have your checkpoint saved in Amazon S3. When writing to Amazon S3, managed tiered checkpointing uses multiple TCP streams (chunks) to optimize S3 writes.
In terms of consistency, managed tiered checkpointing uses an all-or-nothing write strategy. It implements a fallback mechanism that seamlessly transitions between the storage tiers, and checkpoint metadata, such as the step number, is saved alongside the data in every tier.
To troubleshoot managed tiered checkpointing, you can check the log written locally to `/var/log/sagemaker_checkpointing/{namespace}_checkpointing.log`. It publishes data about the training step, rank number, and operation details. The following is an example of that file's output:
Managed tiered checkpointing also writes these metrics to the console, making it easy to troubleshoot during development. They include which step number is being written to which storage layer, along with the throughput and total time taken to write the data. With that information, you can fully monitor and troubleshoot managed tiered checkpointing.
When you combine these tools with the SageMaker HyperPod observability stack, you get a complete view of all the metrics for your training or inference workload.
Conclusion
The new managed tiered checkpointing feature in SageMaker HyperPod improves FM training efficiency by intelligently distributing checkpoints across multiple storage tiers. This approach places model states in fast-access locations such as CPU RAM, while using persistent storage such as Amazon S3 for cost-effective, long-term persistence. At the time of this launch, managed tiered checkpointing is supported only on SageMaker HyperPod with Amazon EKS orchestration.
Managed tiered checkpointing delivers fast recovery times without increased storage costs, avoiding complex trade-offs between resiliency, training efficiency, and storage costs. It has been validated on large distributed training clusters ranging from hundreds of GPUs to more than 15,000 GPUs, with checkpoints saved within seconds.
Integrating managed tiered checkpointing into your training scripts is straightforward, requiring just a few lines of code, and gives you access to sophisticated checkpoint management without deep engineering expertise.
For more information on how managed tiered checkpointing works, how to set it up, and other details, refer to HyperPod managed tiered checkpointing.
About the authors
Paulo Aragao is a Principal Worldwide Solutions Architect focused on generative AI within the Specialist Organisation at AWS. He helps enterprises and startups build their foundation model strategy and innovate faster by leveraging his extensive knowledge of high performance computing and machine learning. A long-time bass player and natural-born rock fan, Paulo enjoys spending time travelling with his family, scuba diving, and playing real-time strategy and role-playing games.

Kunal Jha is a Principal Product Manager at AWS. He is focused on building Amazon SageMaker HyperPod as the best-in-class choice for generative AI model training and inference. In his spare time, Kunal enjoys skiing and exploring the Pacific Northwest.

Mandar Kulkarni is a Software Development Engineer II at AWS, where he works on Amazon SageMaker. He specializes in building scalable and performant machine learning libraries and infrastructure solutions, particularly focusing on SageMaker HyperPod. His technical interests span machine learning, artificial intelligence, distributed systems, and application security. When not architecting ML solutions, Mandar enjoys hiking, practicing Indian classical music, sports, and spending quality time with his young family.

Vinay Devadiga is a Software Development Engineer II at AWS with a deep passion for artificial intelligence and cloud computing. He focuses on building scalable, high-performance systems that unlock the power of AI and machine learning to solve complex problems. Vinay enjoys staying at the forefront of technology, continuously learning, and applying new advancements to drive innovation. Outside of work, he likes playing sports and spending quality time with his family.

Vivek Maran is a Software Engineer at AWS. He currently works on the development of Amazon SageMaker HyperPod, a resilient platform for large-scale distributed training and inference. His interests include large-scale distributed systems, network systems, and artificial intelligence. Outside of work, he enjoys music, running, and keeping up to date with business and technology trends.