Modern generative AI model providers require unprecedented computational scale, with pre-training often involving thousands of accelerators running continuously for days, and sometimes months. Foundation Models (FMs) demand distributed training clusters, coordinated groups of accelerated compute instances using frameworks like PyTorch, to parallelize workloads across hundreds of accelerators (like AWS Trainium and AWS Inferentia chips or NVIDIA GPUs).
Orchestrators like SLURM and Kubernetes manage these complex workloads, scheduling jobs across nodes, managing cluster resources, and processing requests. Paired with AWS infrastructure like Amazon Elastic Compute Cloud (Amazon EC2) accelerated computing instances, Elastic Fabric Adapter (EFA), and distributed file systems like Amazon Elastic File System (Amazon EFS) and Amazon FSx, these ultra clusters can run large-scale machine learning (ML) training and inference, handling parallelism, gradient synchronization and collective communications, and even routing and load balancing. However, at scale, even robust orchestrators face challenges around cluster resilience. Distributed training workloads in particular run synchronously, because each training step requires the participating instances to complete their calculations before proceeding to the next step. This means that if a single instance fails, the entire job fails. The likelihood of these failures increases with the size of the cluster.
Although resilience and infrastructure reliability can be a challenge, developer experience remains equally pivotal. Traditional ML workflows create silos, where data and research scientists prototype on local Jupyter notebooks or Visual Studio Code instances, lacking access to cluster-scale storage, while engineers manage production jobs through separate SLURM or Kubernetes (kubectl or helm, for example) interfaces. This fragmentation has consequences, including mismatches between notebook and production environments, lack of local access to cluster storage, and most importantly, suboptimal use of ultra clusters.
In this post, we explore these challenges. Specifically, we propose a solution to enhance the data scientist experience on Amazon SageMaker HyperPod, a resilient ultra-cluster solution.
Amazon SageMaker HyperPod
SageMaker HyperPod is a compute environment purpose-built for large-scale frontier model training. You can build resilient clusters for ML workloads and develop state-of-the-art frontier models. SageMaker HyperPod runs health monitoring agents in the background for each instance. When it detects a hardware failure, SageMaker HyperPod automatically repairs or replaces the faulty instance and resumes training from the last saved checkpoint. This automation alleviates the need for manual intervention, which means you can train in distributed settings for weeks or months with minimal disruption.
To learn more about the resilience and Total Cost of Ownership (TCO) benefits of SageMaker HyperPod, check out Reduce ML training costs with Amazon SageMaker HyperPod. As of writing this post, SageMaker HyperPod supports both SLURM and Amazon Elastic Kubernetes Service (Amazon EKS) as orchestrators.
To deploy a SageMaker HyperPod cluster, refer to the SageMaker HyperPod workshops (SLURM, Amazon EKS). To learn more about what's being deployed, check out the architecture diagrams later in this post. You can choose either of the two orchestrators based on your preference.
Amazon SageMaker Studio
Amazon SageMaker Studio is a fully integrated development environment (IDE) designed to streamline the end-to-end ML lifecycle. It provides a unified, web-based interface where data scientists and developers can perform ML tasks, including data preparation, model building, training, tuning, evaluation, deployment, and monitoring.
By centralizing these capabilities, SageMaker Studio alleviates the need to switch between multiple tools, significantly enhancing productivity and collaboration. SageMaker Studio supports a variety of IDEs, such as JupyterLab notebooks, Code Editor based on Code-OSS (Visual Studio Code Open Source), and RStudio, offering flexibility for diverse development preferences. SageMaker Studio supports private and shared spaces, so teams can collaborate effectively while optimizing resource allocation. Shared spaces allow multiple users to access the same compute resources across profiles, and private spaces provide dedicated environments for individual users. This flexibility empowers data scientists and developers to seamlessly scale their compute resources and enhance collaboration within SageMaker Studio. Additionally, it integrates with advanced tooling like managed MLflow and Partner AI Apps to streamline experiment tracking and accelerate AI-driven innovation.
Distributed file systems: Amazon FSx
Amazon FSx for Lustre is a fully managed file storage service designed to provide high-performance, scalable, and cost-effective storage for compute-intensive workloads. Powered by the Lustre architecture, it's optimized for applications requiring access to fast storage, such as ML, high-performance computing, video processing, financial modeling, and big data analytics.
FSx for Lustre delivers sub-millisecond latencies, scales up to 1 GBps per TiB of throughput, and supports millions of IOPS. This makes it ideal for workloads demanding rapid data access and processing. The service integrates with Amazon Simple Storage Service (Amazon S3), enabling seamless access to S3 objects as files and facilitating fast data transfers between Amazon FSx and Amazon S3. Updates in S3 buckets are automatically reflected in FSx file systems and vice versa. For more information on this integration, check out Exporting files using HSM commands and Linking your file system to an Amazon S3 bucket.
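For example, after preparing data on an instance that has the file system mounted, you can push new or changed files to the linked S3 bucket with Lustre HSM commands. The following is a minimal sketch; the mount point /fsx and the file path are placeholders for your own layout:

sudo lfs hsm_archive /fsx/datasets/prepared.jsonl   # export the file to the linked S3 bucket
sudo lfs hsm_action /fsx/datasets/prepared.jsonl    # check whether the archive operation has completed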
Theory behind mounting an FSx for Lustre file system to SageMaker Studio spaces
You can use FSx for Lustre as a shared high-performance file system to connect SageMaker Studio domains with SageMaker HyperPod clusters, streamlining ML workflows for data scientists and researchers. By using FSx for Lustre as a shared volume, you can build and refine your training or fine-tuning code using IDEs like JupyterLab and Code Editor in SageMaker Studio, prepare datasets, and save your work directly in the FSx for Lustre volume. This same volume is mounted by SageMaker HyperPod during the execution of training workloads, enabling direct access to prepared data and code without the need for repetitive data transfers or custom image creation. Data scientists can iteratively make changes, prepare data, and submit training workloads directly from SageMaker Studio, providing consistency across development and execution environments while enhancing productivity. This integration alleviates the overhead of moving data between environments and provides a seamless workflow for large-scale ML projects requiring high-throughput, low-latency storage. You can configure FSx for Lustre volumes to provide file system access to SageMaker Studio user profiles in two distinct ways, each tailored to different collaboration and data management needs.
Option 1: Shared file system partition across every user profile
Infrastructure administrators can set up a single FSx for Lustre file system partition shared across user profiles within a SageMaker Studio domain, as illustrated in the following diagram.

Figure 1: An FSx for Lustre file system partition shared across multiple user profiles within a single SageMaker Studio domain
This approach provides the following key benefits:
- Shared project directories – Teams working on large-scale projects can collaborate seamlessly by accessing a shared partition. This makes it possible for multiple users to work on the same files, datasets, and FMs without duplicating resources.
- Simplified file management – You don't need to manage private storage; instead, you can rely on the shared directory for your file-related needs, reducing complexity.
- Improved data governance and security – The shared FSx for Lustre partition is centrally managed by the infrastructure admin, enabling robust access controls and data policies to maintain the security and integrity of shared resources.
Option 2: Dedicated file system partition for each user profile
Alternatively, administrators can configure dedicated FSx for Lustre file system partitions for each individual user profile in SageMaker Studio, as illustrated in the following diagram.

Figure 2: An FSx for Lustre file system with a dedicated partition per user
This setup provides personalized storage and facilitates data isolation. Key benefits include:
- Individual data storage and analysis – Each user gets a private partition to store personal datasets, models, and files. This facilitates independent work on projects with clear segregation by user profile.
- Centralized data management – Administrators retain centralized control over the FSx for Lustre file system, facilitating secure backups and direct access while maintaining data security for users.
- Cross-instance file sharing – You can access your private files across multiple SageMaker Studio spaces and IDEs, because the FSx for Lustre partition provides persistent storage at the user profile level.
Solution overview
The following diagram illustrates the architecture of SageMaker HyperPod with SLURM integration.

Figure 3: Architecture diagram for SageMaker HyperPod with SLURM as the orchestrator
The following diagram illustrates the architecture of SageMaker HyperPod with Amazon EKS integration.

Figure 4: Architecture diagram for SageMaker HyperPod with Amazon EKS as the orchestrator
These diagrams illustrate what you will provision as part of this solution. In addition to the SageMaker HyperPod cluster you already have, you provision a SageMaker Studio domain and attach the cluster's FSx for Lustre file system to the SageMaker Studio domain. Depending on whether or not you choose a SharedFSx, you can either attach the file system to be mounted with a single partition shared across the user profiles (that you configure) within your SageMaker domain, or attach it to be mounted with multiple partitions for multiple isolated users. To learn more about this distinction, refer to the earlier section of this post discussing the theory behind mounting an FSx for Lustre file system to SageMaker Studio spaces.
In the following sections, we present a walkthrough of this integration by demonstrating, on a SageMaker HyperPod with Amazon EKS cluster, how you can:
- Attach a SageMaker Studio domain.
- Use that domain to fine-tune DeepSeek-R1-Distill-Qwen-14B using the FreedomIntelligence/medical-o1-reasoning-SFT dataset.
Prerequisites
This post assumes that you have a SageMaker HyperPod cluster.
Deploy resources using AWS CloudFormation
As part of this integration, we provide an AWS CloudFormation stack template (SLURM, Amazon EKS). Before deploying the stack, make sure you have a SageMaker HyperPod cluster set up.
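The following sketch shows how such a stack could be deployed with the AWS CLI; the stack name and template file name are placeholders, only the SharedFSx parameter discussed in this post is shown, and the actual template may require additional parameters (for example, your cluster and file system identifiers):

aws cloudformation deploy \
  --stack-name studio-hyperpod-integration \
  --template-file hyperpod-studio-integration.yaml \
  --parameter-overrides SharedFSx=True \
  --capabilities CAPABILITY_NAMED_IAM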
In the stack for SageMaker HyperPod with SLURM, you create the following resources:
- A SageMaker Studio domain.
- Lifecycle configurations for installing necessary packages for the SageMaker Studio IDE, including SLURM. Lifecycle configurations will be created for both JupyterLab and Code Editor. We set it up so that your Code Editor or JupyterLab instance will essentially be configured as a login node for your SageMaker HyperPod cluster.
- An AWS Lambda function that:
  - Associates the created security-group-for-inbound-nfs security group to the SageMaker Studio domain.
  - Associates the security-group-for-inbound-nfs security group to the FSx for Lustre ENIs.
  - Optional:
    - If SharedFSx is set to True, the created partition is shared in the FSx for Lustre volume and associated with the SageMaker Studio domain.
    - If SharedFSx is set to False, a Lambda function creates the partition /{user_profile_name} and associates it with the SageMaker Studio user profile.
In the stack for SageMaker HyperPod with Amazon EKS, you create the following resources:
- A SageMaker Studio domain.
- Lifecycle configurations for installing necessary packages for the SageMaker Studio IDE, such as kubectl and jq. Lifecycle configurations will be created for both JupyterLab and Code Editor.
. Lifecycle configurations might be created for each JupyterLab and Code Editor. - A Lambda operate that:
- Associates the created
security-group-for-inbound-nfs
safety group to the SageMaker Studio area. - Associates the
security-group-for-inbound-nfs
safety group to the FSx for Lustre ENIs. - Optionally available:
- If
SharedFSx
is ready toTrue
, the created partition is shared within the FSx for Lustre quantity and related to the SageMaker Studio area. - If
SharedFSx
is ready toFalse
, a Lambda operate creates the partition/{user_profile_name}
and associates it to the SageMaker Studio person profile.
- If
- Associates the created
The main difference between the two implementations is in the lifecycle configurations for the JupyterLab or Code Editor servers running on them; this is because of the difference in how you interact with the cluster using the different orchestrators (kubectl or helm for Amazon EKS, and ssm or ssh for SLURM). In addition to mounting your cluster's FSx for Lustre file system, for SageMaker HyperPod with Amazon EKS, the lifecycle scripts configure your JupyterLab or Code Editor server to run well-known Kubernetes-based command line interfaces, including kubectl, eksctl, and helm. Additionally, they preconfigure your context, so that your cluster is ready to use as soon as your JupyterLab or Code Editor instance is up.
You can find the lifecycle configuration for SageMaker HyperPod with Amazon EKS in the deployed CloudFormation stack template. SLURM works a bit differently. We designed the lifecycle configuration so that your JupyterLab or Code Editor instance serves as a login node for your SageMaker HyperPod with SLURM cluster. Login nodes let you log in to the cluster, submit jobs, and view and manipulate data without running on the critical slurmctld scheduler node. This also makes it possible to run monitoring servers such as Aim, TensorBoard, Grafana, or Prometheus. Therefore, the lifecycle configuration here automatically installs SLURM and configures it so that you can interface with your cluster from your JupyterLab or Code Editor instance. You can find the script used to configure SLURM on these instances on GitHub.
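In practice, this means you can use the usual SLURM client commands against the cluster directly from a JupyterLab or Code Editor terminal; a short sketch, where the batch script name is a placeholder:

sinfo                      # view partitions and node states
squeue -u $USER            # view your queued and running jobs
sbatch fine-tune.sbatch    # submit a training job that reads from the shared FSx for Lustre volume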
Both configurations use the same logic to mount the file systems. The instructions found in Adding a custom file system to a domain are implemented in a custom resource (Lambda function) defined in the CloudFormation stack template.
For more details on deploying the provided stacks, check out the respective workshop pages for SageMaker HyperPod with SLURM and SageMaker HyperPod with Amazon EKS.
Data science journey on SageMaker HyperPod with SageMaker Studio
As a data scientist, after you set up the SageMaker HyperPod and SageMaker Studio integration, you can log in to the SageMaker Studio environment through your user profile.

Figure 5: You can log in to your SageMaker Studio environment through your created user profile.
In SageMaker Studio, you can select your preferred IDE to start prototyping your fine-tuning workload, and create the MLflow tracking server to track training and system metrics during the execution of the workload.

Figure 6: Select your preferred IDE to connect to your HyperPod cluster
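You can create the managed MLflow tracking server from the SageMaker Studio UI; as an alternative sketch, the equivalent AWS CLI call looks like the following, where the server name, S3 artifact location, and IAM role are placeholders:

aws sagemaker create-mlflow-tracking-server \
  --tracking-server-name hyperpod-finetuning-experiments \
  --artifact-store-uri s3://amzn-s3-demo-bucket/mlflow-artifacts \
  --role-arn arn:aws:iam::111122223333:role/MlflowTrackingServerRole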
The SageMaker HyperPod clusters page provides information about the available clusters and details on the nodes.
Figures 7, 8: You can also see information about your SageMaker HyperPod cluster in SageMaker Studio
For this post, we selected Code Editor as our preferred IDE. The automation provided by this solution preconfigured the FSx for Lustre file system and the lifecycle configuration to install the necessary modules for submitting workloads to the cluster by using the hyperpod-cli or kubectl. For the instance type, you can choose from a range of available instances. In our case, we opted for the default ml.t3.medium.

Figure 9: Code Editor configuration
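When the Code Editor space is running, the lifecycle configuration has already set up your Kubernetes context, so you can verify connectivity to the cluster from the terminal; the label used here is a standard Kubernetes node label and is shown for illustration:

kubectl config current-context                           # context preconfigured by the lifecycle script
kubectl get nodes -L node.kubernetes.io/instance-type    # list the HyperPod nodes and their instance types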
The development environment already presents the partition mounted as a file system, where you can start prototyping your code for data preparation or model fine-tuning. For the purpose of this example, we fine-tune DeepSeek-R1-Distill-Qwen-14B using the FreedomIntelligence/medical-o1-reasoning-SFT dataset.

Figure 10: Your cluster's files are accessible directly in your Code Editor space, because your file system is mounted directly to your Code Editor space. This means you can develop locally and deploy onto your ultra-cluster.
The repository is organized as follows (an example invocation of the data preparation scripts follows the list):
- download_model.py – The script to download the open source model directly to the FSx for Lustre volume. This way, we provide a faster and more consistent execution of the training workload on SageMaker HyperPod.
- scripts/dataprep.py – The script to download and prepare the dataset for the fine-tuning workload. In the script, we format the dataset by using the prompt style defined for the DeepSeek R1 models and save the dataset in the FSx for Lustre volume. This way, we provide a faster execution of the training workload by avoiding asset copying from other data repositories.
- scripts/train.py – The script containing the fine-tuning logic, using open source modules like Hugging Face Transformers, with optimization and distribution techniques based on FSDP and QLoRA.
- scripts/evaluation.py – The script to run ROUGE evaluation on the fine-tuned model.
- pod-finetuning.yaml – The manifest file containing the definition of the container used to execute the fine-tuning workload on the SageMaker HyperPod cluster.
- pod-evaluation.yaml – The manifest file containing the definition of the container used to execute the evaluation workload on the SageMaker HyperPod cluster.
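For example, from the Code Editor terminal you might run the preparation scripts so that the model weights and the formatted dataset land on the FSx for Lustre partition. The invocations below are a sketch; the actual scripts may expect arguments such as the model ID or output paths:

python download_model.py      # download DeepSeek-R1-Distill-Qwen-14B into the FSx for Lustre volume
python scripts/dataprep.py    # download and format the medical-o1-reasoning-SFT dataset into train and test splits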
After downloading the model and preparing the dataset for fine-tuning, you can start prototyping the fine-tuning script directly in the IDE.

Figure 11: You can start developing locally!
The updates made to the script are automatically reflected in the container for the execution of the workload. When you're ready, you can define the manifest file for the execution of the workload on SageMaker HyperPod. Below, we highlight the key components of the manifest. For a complete example of a Kubernetes manifest file, refer to the awsome-distributed-training GitHub repository.
The key components are as follows:
- replicas: 8 – This specifies that eight worker pods will be created for this PyTorchJob. This is particularly important for distributed training because it determines the scale of your training job. Having eight replicas means your PyTorch training will be distributed across eight separate pods, allowing for parallel processing and faster training times.
- Persistent volume configuration – This consists of the following:
  - name: fsx-volume – Defines a named volume that will be used for storage.
  - persistentVolumeClaim – Indicates this is using Kubernetes's persistent storage mechanism.
  - claimName: fsx-claim – References a pre-created PersistentVolumeClaim, pointing to the FSx for Lustre file system used in the SageMaker Studio environment.
- Container image – The container image used to execute the workload.
- Training command – The highlighted command shows the execution instructions for the training workload:
  - pip install -r /data/Data-Scientist/deepseek-r1-distill-qwen-14b/requirements.txt – Installs dependencies at runtime, to customize the container with the packages and modules required for the fine-tuning workload.
  - torchrun … /data/Data-Scientist/deepseek-r1-distill-qwen-14b/scripts/train.py – The actual training script, pointing to the shared FSx for Lustre file system, in the partition created for the SageMaker Studio user profile Data-Scientist.
  - --config /data/Data-Scientist/deepseek-r1-distill-qwen-14b/args-fine-tuning.yaml – Arguments provided to the training script, which contains the definition of the training parameters and additional variables used during the execution of the workload.
The args-fine-tuning.yaml file contains the definition of the training parameters to provide to the script. In addition, the training script is set up to save training and system metrics to the managed MLflow server in SageMaker Studio, if the Amazon Resource Name (ARN) and experiment name are provided:
# Location in the FSx for Lustre file system where the base model was saved
model_id: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/DeepSeek-R1-Distill-Qwen-14B"
mlflow_uri: "${MLFLOW_ARN}"
mlflow_experiment_name: "deepseek-r1-distill-llama-8b-agent"
# sagemaker specific parameters
# File system path where the workload will store the model
output_dir: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/model/"
# File system path where the workload can access the train dataset
train_dataset_path: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/data/train/"
# File system path where the workload can access the test dataset
test_dataset_path: "/data/Data-Scientist/deepseek-r1-distill-qwen-14b/data/test/"
# training parameters
lora_r: 8
lora_alpha: 16
lora_dropout: 0.1
learning_rate: 2e-4 # learning rate scheduler
num_train_epochs: 1 # number of training epochs
per_device_train_batch_size: 2 # batch size per device during training
per_device_eval_batch_size: 2 # batch size for evaluation
gradient_accumulation_steps: 2 # number of steps before performing a backward/update pass
gradient_checkpointing: true # use gradient checkpointing
bf16: true # use bfloat16 precision
tf32: false # use tf32 precision
fsdp: "full_shard auto_wrap offload"
fsdp_config:
  backward_prefetch: "backward_pre"
  cpu_ram_efficient_loading: true
  offload_params: true
  forward_prefetch: false
  use_orig_params: true
merge_weights: true
The parameters model_id, output_dir, train_dataset_path, and test_dataset_path follow the same logic described for the manifest file and refer to the location where the FSx for Lustre volume is mounted in the container, under the partition Data-Scientist created for the SageMaker Studio user profile.
When you have completed the development of the fine-tuning script and defined the training parameters for the workload, you can deploy the workload with the following commands:
$ kubectl apply -f pod-finetuning.yaml
service/etcd unchanged
deployment.apps/etcd unchanged
pytorchjob.kubeflow.org/deepseek-r1-qwen-14b-fine-tuning created
You can explore the logs of the workload execution directly from the SageMaker Studio IDE.

Figure 12: View the logs of the submitted training run directly in your Code Editor terminal
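From that same terminal, you can follow the logs with kubectl. The label and pod name below follow the Kubeflow training operator naming conventions for the PyTorchJob created earlier, so adjust them to your own job:

kubectl get pods -l training.kubeflow.org/job-name=deepseek-r1-qwen-14b-fine-tuning   # list the pods created by the PyTorchJob
kubectl logs -f deepseek-r1-qwen-14b-fine-tuning-worker-0                              # stream the logs of the first worker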
You can monitor training and system metrics from the managed MLflow server in SageMaker Studio.

Figure 13: SageMaker Studio directly integrates with a managed MLflow server. You can use it to track training and system metrics directly from your Studio domain
In the SageMaker HyperPod clusters section, you can explore cluster metrics thanks to the integration of SageMaker Studio with SageMaker HyperPod observability.

Figure 14: You can view additional cluster-level and infrastructure metrics, including GPU utilization, in the Compute > SageMaker HyperPod clusters section.
At the conclusion of the fine-tuning workload, you can use the same cluster to run batch evaluation workloads on the model by deploying the pod-evaluation.yaml manifest file, which runs an evaluation of the fine-tuned model using ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-L-Sum), which measure the similarity between machine-generated text and human-written reference text.
The evaluation script uses the same SageMaker HyperPod cluster and compares results with the previously downloaded base model.
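As with the fine-tuning job, submitting the evaluation workload and following its output takes a couple of commands; the pod name in the second command is illustrative:

kubectl apply -f pod-evaluation.yaml
kubectl logs -f <evaluation-pod-name>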
Clean up
To clean up your resources and avoid incurring additional charges, follow these steps (a sketch of the corresponding CLI calls follows the list):
- Delete unused SageMaker Studio resources.
- Optionally, delete the SageMaker Studio domain.
- If you created a SageMaker HyperPod cluster, delete the cluster to stop incurring costs.
- If you created the networking stack from the SageMaker HyperPod workshop, delete that stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.
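The corresponding CLI calls could look like the following; all stack and cluster names are placeholders:

aws cloudformation delete-stack --stack-name studio-hyperpod-integration   # removes the Studio domain, lifecycle configurations, and Lambda function created by the integration stack
aws sagemaker delete-cluster --cluster-name ml-cluster                      # deletes the SageMaker HyperPod cluster if you no longer need it
aws cloudformation delete-stack --stack-name hyperpod-networking            # removes the workshop networking stack with the VPC resources and FSx for Lustre volume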
Conclusion
In this post, we discussed how SageMaker HyperPod and SageMaker Studio can improve and speed up the development experience of data scientists by combining the IDEs and tooling of SageMaker Studio with the scalability and resiliency of SageMaker HyperPod with Amazon EKS. The solution simplifies the setup for the administrator of the centralized system by using the governance and security capabilities provided by the AWS services.
We recommend starting your journey by exploring the workshops Amazon EKS Support in Amazon SageMaker HyperPod and Amazon SageMaker HyperPod, and prototyping your customized large language model by using the resources available in the awsome-distributed-training GitHub repository.
A special thanks to our colleagues Nisha Nadkarni (Sr. WW Specialist SA GenAI), Anoop Saha (Sr. Specialist WW Foundation Models), and Mair Hasco (Sr. WW GenAI/ML Specialist) on the AWS ML Frameworks team for their help in the publication of this post.
About the authors
Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect for AWS based in Milan. He works with large customers, helping them deeply understand their technical needs and design AI and machine learning solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His expertise includes end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.
Aman Shanbhag is a Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services (AWS), where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.