We’re thrilled to introduce Amazon Elastic Kubernetes Service (Amazon EKS) support in Amazon SageMaker HyperPod, a purpose-built infrastructure engineered with resilience at its core. This capability allows for the seamless addition of SageMaker HyperPod managed compute to EKS clusters, using automated node and job resiliency features for foundation model (FM) development.
FMs are typically trained on massive compute clusters with hundreds or thousands of accelerators. Under such circumstances, hardware failures pose a significant challenge, because a single accelerator failure among thousands can halt the entire training process. For example, Meta Llama 3 405B pre-training over 54 days on 16K NVIDIA H100 Tensor Core GPUs experienced 419 unexpected interruptions, with 78% attributed to confirmed or suspected hardware issues, and with 58.7% of those interruptions being GPU-related problems, including NVLink failures and HBM3 memory failures.
Since its inception, SageMaker HyperPod was designed with a focus on managed resiliency features to mitigate such hardware failures, enabling FM builders such as Thomson Reuters, Perplexity AI, and Hugging Face to scale their FM training and inference on Slurm clusters. With the EKS support in HyperPod, you can now also benefit from the resiliency features on Kubernetes clusters by managing machine learning (ML) workloads using the HyperPod compute and a managed Kubernetes control plane on the EKS cluster.
AI startups like Observea and Articul8, and enterprises like Thomson Reuters use this new feature set to manage their ML model development lifecycle:
“Through our use of SageMaker HyperPod, our customers and internal teams no longer have to worry about operating and configuring the Kubernetes control plane, and SageMaker HyperPod provides the network performance and optimized configurations to support complex HPC workloads. With Amazon EKS support in SageMaker HyperPod, we can reduce the time we spend on undifferentiated heavy lifting in infrastructure management and reduce operational costs by over 30%.”
– Observea
“As a Kubernetes house, we are now thrilled to welcome the launch of Amazon EKS support for SageMaker HyperPod. This is a game changer for us as it integrates seamlessly with our existing training pipelines and makes it even easier for us to manage and operate our large-scale Kubernetes clusters. In addition, this also helps our end customers as we are now able to package and productize this capability into our GenAI platform, enabling our customers to run their own training and fine-tuning workloads in a more streamlined manner.”
– Articul8 AI
This post is designed for Kubernetes cluster administrators and ML scientists, providing an overview of the key features that SageMaker HyperPod introduces to facilitate large-scale model training on an EKS cluster.
The post is organized into the following three sections:
- Overview of Amazon EKS support in SageMaker HyperPod – This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introducing three key resiliency features HyperPod compute provides on the EKS cluster. Additionally, this section explains how HyperPod provides a smooth developer experience for admins and scientists.
- HyperPod cluster setup and node resiliency features – This section provides a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, emphasizing how its built-in resiliency features provide infrastructure stability. This section is especially helpful for admins.
- Training job resiliency with the job auto resume functionality – In this section, we demonstrate how scientists can submit and manage their distributed training jobs using either the native Kubernetes CLI (kubectl) or optionally the new HyperPod CLI (hyperpod) with automatic job recovery enabled.
Overview of Amazon EKS support in SageMaker HyperPod
This section provides a high-level overview of Amazon EKS support in SageMaker HyperPod, introduces three key resiliency features HyperPod compute provides on the EKS cluster, and discusses how SageMaker HyperPod provides smooth user experiences for admins and scientists.
Architecture overview
Amazon EKS support in HyperPod supports a 1-to-1 mapping between an EKS cluster (serving as a Kubernetes control plane) and a HyperPod compute (attached as a group of worker nodes). There are three virtual private clouds (VPCs) in this architecture, hosting different types of resources:
- Amazon EKS VPC – An AWS managed VPC hosts the EKS control plane. This VPC doesn’t appear in the customer account. Amazon EKS creates a highly available endpoint for the managed Kubernetes API server that you use to communicate with your cluster (using tools like kubectl). The managed endpoint uses Network Load Balancer to load balance Kubernetes API servers.
- HyperPod VPC – An AWS managed VPC hosts the HyperPod compute. This VPC doesn’t appear in the customer account. The nodes connect to the EKS control plane through a cross-account elastic network interface (ENI).
- SageMaker user VPC – A user-managed VPC hosts resources such as Amazon FSx for Lustre, which is optionally associated with Amazon Simple Storage Service (Amazon S3) using a data repository association, in your account.
Cross-account ENIs also bridge communication between HyperPod compute instances and other AWS services in your account, such as Amazon Elastic Container Registry (Amazon ECR) and Amazon CloudWatch.
The following diagram illustrates the high-level architecture of Amazon EKS support in HyperPod.
HyperPod managed resiliency features
Amazon EKS support in HyperPod provides the following three capabilities to make sure the cluster stays healthy and training jobs continue under unexpected interruptions:
- Deep health checks – This is a managed health check for stress testing GPUs and AWS Trainium instances, as well as performing Elastic Fabric Adapter (EFA) checks. These checks can be run during the cluster creation, update, or node replacement phases, and can be enabled or disabled through HyperPod APIs.
- Automatic node recovery – HyperPod performs managed, lightweight, and noninvasive checks, coupled with automated node replacement capability. The HyperPod monitoring agent continuously monitors and detects potential issues, including memory exhaustion, disk failures, GPU anomalies, kernel deadlocks, container runtime issues, and out-of-memory (OOM) crashes. Based on the underlying issue, the monitoring agent either replaces or reboots the node.
- Job auto resume – SageMaker HyperPod provides a job auto resume capability using the Kubeflow Training Operator for PyTorch to provide recovery and continuation of training jobs in the event of interruptions or failures. The extension makes sure the job waits and restarts after the node is replaced.
User experiences
In addition to the aforementioned managed resiliency features, SageMaker HyperPod provides smooth user experiences for both admins and scientists that are essential for managing a large cluster and running large-scale training jobs on it as part of the Amazon EKS integration:
- Admin experience – SageMaker HyperPod provides APIs and a console experience to create and manage node groups in the EKS cluster, along with the ability to SSH into the cluster nodes. SageMaker HyperPod also provides a mechanism to install additional dependencies on the cluster nodes using lifecycle scripts, and an API-based mechanism to provide cluster software updates and improve overall observability.
- Scientist experience – Along with enabling scientists to train FMs using Amazon EKS as the orchestrator, SageMaker HyperPod provides additional capabilities for scientists to effortlessly train models. With the HyperPod CLI, scientists can submit training jobs by providing a .yaml file and manage jobs (list, describe, view, cancel) without needing to use kubectl. Scientists can use open source tools like Kueue (a Kubernetes tool for job queuing) and adjacent SageMaker capabilities like managed MLflow to manage their experiments and training runs. Scientists can also access native SageMaker distributed training libraries that provide performance improvements of up to 20%. You can also use SageMaker HyperPod compute with Amazon EKS support through third-party tools like KubeRay, which runs on the Kubernetes API. This lets you bring your preferred job submission and management capabilities used with other Kubernetes clusters into your HyperPod environment.
HyperPod compute setup and node resiliency features
In this section, we provide a detailed guide on integrating HyperPod managed compute into your EKS cluster as Kubernetes worker nodes, and discuss how its built-in resiliency features provide infrastructure stability.
Prerequisites
You need to have the following in place prior to the HyperPod compute deployment:
- EKS cluster – You can associate HyperPod compute with an existing EKS cluster that satisfies the set of prerequisites. Alternatively, you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the architecture guide for step-by-step setup instructions.
- Custom resources – Running multi-node distributed training requires various components, such as device plugins, CSI drivers, and training operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health check. The HyperPod Helm charts simplify the process using Helm, one of the most commonly used package managers for Kubernetes. Refer to the developer guide for installation.
HyperPod compute setup
With the aforementioned resources successfully deployed, you’re now ready to create the HyperPod compute. The cluster configuration is specified using a JSON file.
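A minimal sketch of such a configuration follows; the cluster name, EKS cluster ARN, execution role, S3 lifecycle script location, subnet and security group IDs, and instance counts are illustrative placeholders, not values from this post:

```json
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "arn:aws:eks:us-west-2:123456789012:cluster/my-eks-cluster"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://my-bucket/LifecycleScripts/base-config/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"]
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"]
    },
    "NodeRecovery": "Automatic"
}
```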
The configuration file contains two key highlights:
- "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"] – Instructs HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
- "NodeRecovery": "Automatic" – Enables HyperPod’s automatic node recovery functionality
You can create a HyperPod compute with the aws CLI (you need version 2.17.47 or newer).
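Assuming the configuration shown earlier is saved as cluster-config.json, a minimal invocation looks like this:

```
aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json
```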
To verify the cluster status, you can list your clusters.
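For example, with the list-clusters API:

```
aws sagemaker list-clusters
```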
This command displays the cluster details, including the cluster name, status, and creation time.
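The response resembles the following; the ARN, name, and timestamp here are illustrative:

```json
{
    "ClusterSummaries": [
        {
            "ClusterArn": "arn:aws:sagemaker:us-west-2:123456789012:cluster/abcd1234efgh",
            "ClusterName": "ml-cluster",
            "CreationTime": "2024-09-09T16:59:09.433000+00:00",
            "ClusterStatus": "InService"
        }
    ]
}
```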
Alternatively, you can verify the cluster status on the SageMaker console. After a brief period, you can observe that the status for all nodes transitions to Running.
Node resiliency features
To gain further insight into the instances, you can use kubectl get nodes and examine the node labels. The sagemaker.amazonaws.com/node-health-status label reveals the life stage of each node. For instance, nodes with the ml.m5.2xlarge instance type are labeled as Schedulable, indicating that they have successfully passed the regular HyperPod health check. Conversely, nodes with the ml.p5.48xlarge instance type are labeled as Unschedulable, indicating that they have entered the initial deep health checks.
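One way to surface both labels is kubectl’s -L flag; the node names and label values below are illustrative, not output reproduced from this post:

```
kubectl get nodes \
    -L sagemaker.amazonaws.com/node-health-status \
    -L sagemaker.amazonaws.com/deep-health-check
```

The output would look similar to the following:

```
NAME                           STATUS   ROLES    AGE   VERSION   NODE-HEALTH-STATUS   DEEP-HEALTH-CHECK
hyperpod-i-0123456789abcdef0   Ready    <none>   40m   v1.29.3   Schedulable
hyperpod-i-0fedcba9876543210   Ready    <none>   40m   v1.29.3   Unschedulable        InProgress
```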
The deep health check logs are saved in the CloudWatch log group at /aws/sagemaker/Clusters/. The log streams are logged at DeepHealthCheckResults/. When the deep health checks identify an issue, the output log provides detailed information, including the instance ID that failed the deep health checks and the specific failure reason.
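The exact log schema isn’t reproduced in this post, so the following entry is a hypothetical illustration of the kind of fields to expect:

```json
{
    "instance-id": "i-0123456789abcdef0",
    "deep-health-check-status": "Failed",
    "reason": "GPU stress test failed on device 3"
}
```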
You can check the progress of the deep health check with the following values for the sagemaker.amazonaws.com/deep-health-check label on each node:
- sagemaker.amazonaws.com/deep-health-check: InProgress
- sagemaker.amazonaws.com/deep-health-check: Passed
- sagemaker.amazonaws.com/deep-health-check: Failed
If a node fails the deep health checks, it will be replaced. Otherwise, it will be marked with the Schedulable label:
sagemaker.amazonaws.com/node-health-status: Schedulable
When you want to manually replace a specific node in your cluster, you can do so by manually modifying the label.
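For example, the following sketch assumes that setting the node-health-status label to UnschedulablePendingReplace marks a node for replacement; confirm the exact label value and procedure in the AWS documentation referenced below:

```
kubectl label node <node-name> \
    sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace --overwrite
```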
For the full list of resilience-related Kubernetes labels, refer to the AWS documentation.
Even after the initial deep health checks, HyperPod periodically runs regular health checks. To view the health events detected by the HyperPod health monitoring agent, you can check the CloudWatch log stream:
- Example log group name – /aws/sagemaker/Clusters/
- Example log stream name – SagemakerHealthMonitoringAgent/
The SagemakerHealthMonitoringAgent log stream for each node contains only the detection events from the health monitoring agent.
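As an illustration only (the real event schema may differ), a detection event might resemble:

```json
{
    "level": "info",
    "message": "Health monitoring agent detected an issue",
    "condition": "KernelDeadlock",
    "action": "Reboot"
}
```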
When the deep health checks or the health monitoring agent identify issues on a certain node, the node is labeled with sagemaker.amazonaws.com/node-health-status=UnschedulablePendingReplace:NoSchedule to avoid scheduling pods, and then the node is replaced or rebooted.
You can monitor the health status of HyperPod nodes through CloudWatch Container Insights, now with enhanced observability for Amazon EKS. Container Insights helps collect, aggregate, and summarize metrics and logs from containerized applications and microservices, providing detailed insights into performance, health, and status metrics for CPU, GPU, Trainium, EFA, and file system up to the container level. For the complete list of metrics tracked, see Amazon EKS and Kubernetes Container Insights metrics. With the Container Insights integration with SageMaker HyperPod, you can also check the individual node health status and the total number of schedulable and unschedulable nodes, as shown in the following screenshots.
You can find the Container Insights setup guide in the Amazon EKS Support in Amazon SageMaker HyperPod workshop.
Training job resiliency with the job auto resume functionality
In addition to infrastructure resiliency features, you can use the job auto resume capability with the Kubeflow Training Operator for PyTorch to maintain the recovery and continuation of training jobs in the event of interruptions or failures. The job auto resume feature attempts to continue the job, while the HyperPod node auto recovery functionality works on resolving node failures (node reboot or replacement as needed) to minimize training downtime. This section demonstrates the job auto resume feature using a PyTorch FSDP example from the awsome-distributed-training repository.
To enable the job auto resume feature, you create a PyTorchJob with the fsdp.yaml manifest, which includes annotations for auto resume and a nodeSelector.
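The relevant excerpt might look like the following sketch; the job name, replica count, and container image are placeholders (see the fsdp.yaml manifest in the repository for the full, authoritative version):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: fsdp-training            # illustrative job name
  annotations:
    sagemaker.amazonaws.com/enable-job-auto-resume: "true"
    sagemaker.amazonaws.com/job-max-retry-count: "2"
spec:
  pytorchReplicaSpecs:
    Worker:
      replicas: 4                # illustrative replica count
      template:
        spec:
          nodeSelector:
            sagemaker.amazonaws.com/node-health-status: Schedulable
          containers:
            - name: pytorch
              image: <training-image-uri>   # placeholder
```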
With the annotations sagemaker.amazonaws.com/enable-job-auto-resume: "true" and sagemaker.amazonaws.com/job-max-retry-count: "2", SageMaker HyperPod resumes interrupted training jobs up to two times and schedules the resumed jobs onto healthy nodes. These healthy nodes are identified by the node selector label sagemaker.amazonaws.com/node-health-status: Schedulable, ensuring that only nodes that have passed basic health checks and are available for running workloads are used for resumed jobs.
Submit the PyTorchJob using kubectl.
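Assuming the manifest above is saved as fsdp.yaml:

```
kubectl apply -f fsdp.yaml
```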
With the job auto resume feature enabled, if a job fails due to a hardware failure or any transient issue during training, SageMaker HyperPod initiates the node replacement workflow and restarts the job after the faulty nodes are replaced. You can verify the status of job auto resume by describing the PyTorchJob.
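For instance, with the illustrative job name used above:

```
kubectl describe pytorchjob fsdp-training
```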
In the event of a hardware failure, the Kubeflow training job restarts.
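As a sketch of what the Kubeflow Training Operator typically reports in the job status (not verbatim output), you might see a condition such as:

```
Type:     Restarting
Reason:   JobRestarting
Message:  PyTorchJob fsdp-training is restarting because 1 Worker replica(s) failed.
```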
When you submit a training job with the HyperPod CLI, you can also request that the job be auto resumed.
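A sketch of the invocation, assuming the CLI exposes --auto-resume and --max-retry options (check the CLI documentation referenced below for the current flags):

```
hyperpod start-job \
    --config-file ./config.yaml \
    --auto-resume true \
    --max-retry 2
```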
Refer to config.yaml for the full configuration. For other CLI options, refer to the documentation in the GitHub repository.
Clean up
To delete your SageMaker HyperPod compute, use either the SageMaker console or the AWS Command Line Interface (AWS CLI).
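For example, with the delete-cluster API and the illustrative cluster name used earlier:

```
aws sagemaker delete-cluster --cluster-name ml-cluster
```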
Cluster deletion can take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker console.
Conclusion
With the support for Amazon EKS in SageMaker HyperPod, customers who have standardized their FM development workflows on Kubernetes can adopt SageMaker HyperPod and manage their cluster resources using a familiar Kubernetes interface. When training an FM, SageMaker HyperPod automatically monitors cluster health, and when an infrastructure fault such as a GPU failure occurs, SageMaker HyperPod automatically remediates the issue and restarts the training process from the last saved checkpoint, without any human intervention. Amazon EKS support in SageMaker HyperPod further enhances this capability by running deep health checks. Whenever a new instance is added to the SageMaker HyperPod compute, it undergoes a deep health check process to identify and replace potentially problematic instances. SageMaker HyperPod then automatically replaces or reboots nodes identified as faulty and resumes training processes in the event of unexpected interruptions, involving node replacement and job resubmission.
For an end-to-end tutorial on cluster management and FM training, visit the Amazon EKS Support in Amazon SageMaker HyperPod workshop. For more information on infrastructure deployment and additional distributed training test cases, refer to the awsome-distributed-training repository. If you’re interested in deploying HyperPod with step-by-step commands, you can start from the aws-do-hyperpod repository.
About the authors
Keita Watanabe is a Senior GenAI Specialist Solutions Architect in the worldwide specialist organization at Amazon Web Services, where he helps develop machine learning solutions using OSS projects such as Slurm and Kubernetes. His background is in machine learning research and development. Prior to joining AWS, Keita worked in the ecommerce industry as a research scientist developing image retrieval systems for product search. Keita holds a PhD in Science from the University of Tokyo.
Alex Iankoulski is a full-stack software and infrastructure architect who likes to do deep, hands-on work. He is currently a Principal Solutions Architect in the worldwide specialist organization at AWS. In his role, he focuses on helping customers with the orchestration and scaling of ML and AI workloads on container-powered AWS services. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world’s biggest challenges. During the past 10 years, Alex has worked on democratizing generative AI and ML, fighting climate change, and making travel safer, healthcare better, and energy smarter.
Tomonori Shimomura is a Senior Solutions Architect on the Amazon SageMaker team, where he provides in-depth technical consultation to SageMaker customers and suggests product improvements to the product team. Before joining Amazon, he worked on the design and development of embedded software for video game consoles, and now he leverages his in-depth skills in cloud-side technology. In his free time, he enjoys playing video games, reading books, and writing software.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Manoj Ravi is a Senior Product Manager on the Amazon SageMaker team. He is passionate about building next-gen AI products and works on applications and tools to make foundation model development and deployment simple for customers. He holds an MBA from the Haas School of Business and a master’s degree from Carnegie Mellon University. In his spare time, Manoj enjoys playing tennis and pursuing landscape photography.