This post is co-written with Zhanghao Wu, co-creator of SkyPilot.
The rapid advancement of generative AI and foundation models (FMs) has significantly increased the computational resource requirements for machine learning (ML) workloads. Modern ML pipelines require efficient ways to distribute workloads across accelerated compute resources while keeping developer productivity high. Organizations need infrastructure solutions that are not only powerful but also flexible, resilient, and straightforward to manage.
SkyPilot is an open source framework that simplifies running ML workloads by providing a unified abstraction layer that helps ML engineers run their workloads on different compute resources without managing underlying infrastructure complexities. It offers a simple, high-level interface for provisioning resources, scheduling jobs, and managing distributed training across multiple nodes.
Amazon SageMaker HyperPod is purpose-built infrastructure to develop and deploy large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also delivers optimal performance through same-spine placement of instances, as well as built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of SkyPilot provides a powerful framework to scale up your generative AI workloads.
In this post, we share how SageMaker HyperPod, in collaboration with SkyPilot, is streamlining AI development workflows. This integration makes our advanced GPU infrastructure more accessible to ML engineers, enhancing productivity and resource utilization.
Challenges of orchestrating machine learning workloads
Kubernetes has become popular for ML workloads because of its scalability and rich open source tooling. SageMaker HyperPod orchestrated on Amazon Elastic Kubernetes Service (Amazon EKS) combines the power of Kubernetes with the resilient environment of SageMaker HyperPod designed for training large models. Amazon EKS support in SageMaker HyperPod strengthens resilience through deep health checks, automated node recovery, and job auto-resume capabilities, providing uninterrupted training for large-scale and long-running jobs.
ML engineers transitioning from traditional VM or on-premises environments often face a steep learning curve. The complexity of Kubernetes manifests and cluster administration can pose significant challenges, potentially slowing down development cycles and resource utilization.
Additionally, AI infrastructure teams faced the challenge of balancing the need for advanced management tools with the desire to provide a user-friendly experience for their ML engineers. They required a solution that could offer both high-level control and ease of use for day-to-day operations.
SageMaker HyperPod with SkyPilot
To address these challenges, we partnered with SkyPilot to showcase a solution that uses the strengths of both platforms. SageMaker HyperPod excels at managing the underlying compute resources and instances, providing the robust infrastructure necessary for demanding AI workloads. SkyPilot complements this by offering an intuitive layer for job management, interactive development, and team coordination.
Through this partnership, we can offer our customers the best of both worlds: the powerful, scalable infrastructure of SageMaker HyperPod, combined with a user-friendly interface that significantly reduces the learning curve for ML engineers. For AI infrastructure teams, this integration provides advanced management capabilities while simplifying the experience for their ML engineers, creating a win-win situation for all stakeholders.
SkyPilot helps AI teams run their workloads on different infrastructures with a unified high-level interface and powerful management of resources and jobs. An AI engineer can bring their AI framework and specify the resource requirements for the job; SkyPilot will intelligently schedule the workloads on the best infrastructure: find the available GPUs, provision the GPUs, run the job, and manage its lifecycle.
Solution overview
Implementing this solution is straightforward, whether you're working with existing SageMaker HyperPod clusters or setting up a new deployment. For existing clusters, you can connect using AWS Command Line Interface (AWS CLI) commands to update your kubeconfig and verify the setup. For new deployments, we guide you through setting up the API server, creating clusters, and configuring high-performance networking options like Elastic Fabric Adapter (EFA).
The following diagram illustrates the solution architecture.
In the following sections, we show how to run SkyPilot jobs for multi-node distributed training on SageMaker HyperPod. We go over the process of creating a SageMaker HyperPod cluster, installing SkyPilot, creating a SkyPilot cluster, and deploying a SkyPilot training job.
Prerequisites
You must have the following prerequisites:
- An existing SageMaker HyperPod cluster with Amazon EKS (to create one, refer to Deploy Your HyperPod Cluster). You should provision a single ml.p5.48xlarge instance for the code samples in the following sections.
- Access to the AWS CLI and the kubectl command line tool.
- A Python environment for installing SkyPilot.
Create a SageMaker HyperPod cluster
You can create an EKS cluster with a single AWS CloudFormation stack following the instructions in Using CloudFormation, configured with a virtual private cloud (VPC) and storage resources.
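As a sketch, assuming the stack template from those instructions has been downloaded locally (the stack name and template file name below are placeholders), the stack can be created from the AWS CLI:
# Placeholder stack name and template file; use the template referenced
# in the Using CloudFormation instructions
aws cloudformation create-stack \
    --stack-name hyperpod-eks-stack \
    --template-body file://hyperpod-eks-full-stack.yaml \
    --capabilities CAPABILITY_NAMED_IAM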
To create and manage SageMaker HyperPod clusters, you can use either the AWS Management Console or the AWS CLI. If you use the AWS CLI, specify the cluster configuration in a JSON file and choose the EKS cluster created from the CloudFormation stack as the orchestrator of the SageMaker HyperPod cluster. You then create the cluster worker nodes with NodeRecovery set to Automatic to enable automatic node recovery, and for OnStartDeepHealthChecks, add InstanceStress and InstanceConnectivity to enable deep health checks. See the following code:
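The following cluster-config.json is a minimal sketch under assumed values: the cluster name, EKS cluster ARN, instance count, lifecycle script location, execution role, and VPC IDs are placeholders to replace with your own.
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "arn:aws:eks:us-east-2:123456789012:cluster/my-eks-cluster"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 2,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket/lifecycle-scripts/",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "arn:aws:iam::123456789012:role/HyperPodExecutionRole",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"]
    },
    "NodeRecovery": "Automatic"
}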
You can add InstanceStorageConfigs to provision and mount additional Amazon Elastic Block Store (Amazon EBS) volumes on SageMaker HyperPod nodes.
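For illustration, an instance group entry could carry an InstanceStorageConfigs block like the following, where the 500 GB volume size is an assumed value:
"InstanceStorageConfigs": [
    {
        "EbsVolumeConfig": {
            "VolumeSizeInGB": 500
        }
    }
]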
To create the cluster using the SageMaker HyperPod APIs, run the following AWS CLI command:
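Assuming the configuration above was saved as cluster-config.json, the call looks like this:
aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json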
You are now ready to set up SkyPilot on your SageMaker HyperPod cluster.
Connect to your SageMaker HyperPod EKS cluster
From your AWS CLI environment, run the aws eks update-kubeconfig command to update your local kube config file (located at ~/.kube/config) with the credentials and configuration needed to connect to your EKS cluster using the kubectl command (provide your specific EKS cluster name):
aws eks update-kubeconfig --name $EKS_CLUSTER_NAME
You can verify that you're connected to the EKS cluster by running the following command:
kubectl config current-context
Install SkyPilot with Kubernetes support
Use the following code to install SkyPilot with Kubernetes support using pip:
pip install skypilot[kubernetes]
This installs the latest build of SkyPilot, which includes the necessary Kubernetes integrations.
Verify SkyPilot's connection to the EKS cluster
Check whether SkyPilot can connect to your Kubernetes cluster:
sky check k8s
The output should look similar to the following:
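The exact text varies by SkyPilot version; as a rough sketch, a successful check reports Kubernetes as enabled:
Checking credentials to enable clouds for SkyPilot.
  Kubernetes: enabled

🎉 Enabled clouds 🎉
  ✔ Kubernetes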
If this is your first time using SkyPilot with this Kubernetes cluster, you might see a prompt to create GPU labels for your nodes. Follow the instructions by running the following code:
python -m sky.utils.kubernetes.gpu_labeler --context
This script helps SkyPilot identify what GPU resources are available on each node in your cluster. The GPU labeling job might take a few minutes depending on the number of GPU resources in your cluster.
Discover available GPUs in the cluster
To see what GPU resources are available in your SageMaker HyperPod cluster, use the following code:
sky show-gpus --cloud k8s
This will list the available GPU types and their counts. We have two p5.48xlarge instances, each equipped with 8 NVIDIA H100 GPUs:
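The table format depends on the SkyPilot version; with two 8-GPU nodes, the output is roughly of this shape (the node names here are invented placeholders):
Kubernetes GPUs
GPU   REQUESTABLE_QTY_PER_NODE  TOTAL_GPUS  TOTAL_FREE_GPUS
H100  1, 2, 4, 8                16          16

Kubernetes per node GPU availability
NODE_NAME                   GPU_NAME  TOTAL_GPUS  FREE_GPUS
hyperpod-i-0123456789abcde  H100      8           8
hyperpod-i-0fedcba98765432  H100      8           8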
Launch an interactive development environment
With SkyPilot, you can launch a SkyPilot cluster for interactive development:
sky launch -c dev --gpus H100
This command creates an interactive development environment (IDE) with a single H100 GPU and will sync the local working directory to the cluster. SkyPilot handles the pod creation, resource allocation, and setup of the IDE.
After it's launched, you can connect to your IDE:
ssh dev
This gives you an interactive shell in your IDE, where you can run your code, install packages, and perform ML experiments.
Run training jobs
With SkyPilot, you can run distributed training jobs on your SageMaker HyperPod cluster. The following is an example of launching a distributed training job using a YAML configuration file.
First, create a file named train.yaml with your training job configuration:
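The following train.yaml is a minimal sketch, not the exact configuration from this post: the setup commands and the train.py entry point are assumed placeholders for your own training code, while the SKYPILOT_* environment variables are injected by SkyPilot at runtime.
name: train

resources:
  accelerators: H100:8

num_nodes: 1

workdir: .

setup: |
  # Install your training dependencies (placeholder)
  pip install torch torchvision

run: |
  # SkyPilot injects SKYPILOT_NODE_IPS, SKYPILOT_NUM_NODES,
  # SKYPILOT_NODE_RANK, and SKYPILOT_NUM_GPUS_PER_NODE at runtime
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=12375 \
    train.py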
Then launch your training job:
sky launch -c train train.yaml
This creates a training job on a single p5.48xlarge node, equipped with 8 NVIDIA H100 GPUs. You can monitor the output with the following command:
sky logs train
Running multi-node training jobs with EFA
Elastic Fabric Adapter (EFA) is a network interface for Amazon Elastic Compute Cloud (Amazon EC2) instances that enables you to run applications requiring high levels of inter-node communications at scale on AWS through its custom-built operating system bypass hardware interface. This allows applications to communicate directly with the network hardware while bypassing the operating system kernel, significantly reducing latency and CPU overhead. This direct hardware access is particularly beneficial for distributed ML workloads where frequent inter-node communication during gradient synchronization can become a bottleneck. By using EFA-enabled instances such as p5.48xlarge or p6-b200.48xlarge, data scientists can scale their training jobs across multiple nodes while maintaining the low-latency, high-bandwidth communication essential for efficient distributed training, ultimately reducing training time and improving resource utilization for large-scale AI workloads.
The following code snippet shows how to incorporate EFA into your SkyPilot job:
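One way to do this on EKS is through SkyPilot's Kubernetes pod_config passthrough, which requests EFA devices from the AWS EFA Kubernetes device plugin. The sketch below assumes two p5.48xlarge nodes (32 EFA interfaces each); depending on your SkyPilot version, this block can live in the task YAML or in ~/.sky/config.yaml, so verify the schema and device count against the SkyPilot and EFA device plugin documentation:
num_nodes: 2

resources:
  accelerators: H100:8

config:
  kubernetes:
    pod_config:
      spec:
        containers:
          - resources:
              limits:
                vpc.amazonaws.com/efa: 32
              requests:
                vpc.amazonaws.com/efa: 32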
Clean up
To delete your SkyPilot cluster, run the following command:
sky down
To delete the SageMaker HyperPod cluster created in this post, you can use either the SageMaker AI console or the following AWS CLI command:
aws sagemaker delete-cluster --cluster-name
Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.
If you used the CloudFormation stack to create resources, you can delete it using the following command:
aws cloudformation delete-stack --stack-name
Conclusion
By combining the robust infrastructure capabilities of SageMaker HyperPod with SkyPilot's user-friendly interface, we've showcased a solution that helps teams focus on innovation rather than infrastructure complexity. This approach not only simplifies operations but also enhances productivity and resource utilization across organizations of all sizes. To get started, refer to SkyPilot in the Amazon EKS Support in Amazon SageMaker HyperPod workshop.
About the authors
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS. He helps AWS customers, from small startups to large enterprises, train and deploy foundation models efficiently on AWS. He is passionate about computational optimization problems and improving the performance of AI workloads.
Zhanghao Wu is a co-creator of the SkyPilot open source project and holds a PhD in computer science from UC Berkeley. He works on SkyPilot core, client-server architecture, managed jobs, and improving the AI experience on diverse cloud infrastructure in general.
Ankit Anand is a Senior Foundation Models Go-To-Market (GTM) Specialist at AWS. He partners with top generative AI model builders, strategic customers, and AWS service teams to enable the next generation of AI/ML workloads on AWS. Ankit's experience includes product management expertise within the financial services industry for high-frequency and low-latency trading, as well as business development for Amazon Alexa.