
Ray jobs on Amazon SageMaker HyperPod: scalable and resilient distributed AI

by admin
April 3, 2025
in Artificial Intelligence


Foundation model (FM) training and inference have led to a significant increase in computational needs across the industry. These models require massive amounts of accelerated compute to train and operate effectively, pushing the boundaries of traditional computing infrastructure. They require efficient systems for distributing workloads across multiple GPU accelerated servers, and for optimizing developer velocity as well as performance.

Ray is an open source framework that makes it straightforward to create, deploy, and optimize distributed Python jobs. At its core, Ray offers a unified programming model that allows developers to seamlessly scale their applications from a single machine to a distributed cluster. It provides a set of high-level APIs for tasks, actors, and data that abstract away the complexities of distributed computing, enabling developers to focus on the core logic of their applications. Ray promotes the same coding patterns for both a simple machine learning (ML) experiment and a scalable, resilient production application. Ray's key features include efficient task scheduling, fault tolerance, and automatic resource management, making it a powerful tool for building a wide range of distributed applications, from ML models to real-time data processing pipelines. With its growing ecosystem of libraries and tools, Ray has become a popular choice for organizations looking to use the power of distributed computing to tackle complex and data-intensive problems.
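To make the task and actor primitives concrete, the following minimal sketch (not taken from this post's training workload) shows how an ordinary Python function and class become distributed units of work with Ray Core:

import ray

# Start Ray locally; on a cluster you would connect with ray.init(address="auto")
ray.init()

@ray.remote
def square(x):
    # A stateless task that Ray can schedule on any worker in the cluster
    return x * x

@ray.remote
class Counter:
    # A stateful actor pinned to a single worker process
    def __init__(self):
        self.n = 0

    def increment(self):
        self.n += 1
        return self.n

# Launch four tasks in parallel and gather their results
print(ray.get([square.remote(i) for i in range(4)]))  # [0, 1, 4, 9]

# Create an actor and call a method on it
counter = Counter.remote()
print(ray.get(counter.increment.remote()))  # 1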

Amazon SageMaker HyperPod is a purpose-built infrastructure to develop and deploy large-scale FMs. SageMaker HyperPod not only provides the flexibility to create and use your own software stack, but also provides optimal performance through same spine placement of instances, as well as built-in resiliency. Combining the resiliency of SageMaker HyperPod and the efficiency of Ray provides a powerful framework to scale up your generative AI workloads.

In this post, we demonstrate the steps involved in running Ray jobs on SageMaker HyperPod.

Overview of Ray

This section provides a high-level overview of the Ray tools and frameworks for AI/ML workloads. We primarily focus on ML training use cases.

Ray is an open source distributed computing framework designed to run highly scalable and parallel Python applications. Ray manages, executes, and optimizes compute needs across AI workloads. It unifies infrastructure through a single, flexible framework, enabling AI workloads from data processing, to model training, to model serving and beyond.

For distributed jobs, Ray provides intuitive tools for parallelizing and scaling ML workflows. It allows developers to focus on their training logic without the complexities of resource allocation, task scheduling, and inter-node communication.

At a high level, Ray is made up of three layers:

  • Ray Core: The foundation of Ray, providing primitives for parallel and distributed computing
  • Ray AI libraries:
    • Ray Train – A library that simplifies distributed training by offering built-in support for popular ML frameworks like PyTorch, TensorFlow, and Hugging Face
    • Ray Tune – A library for scalable hyperparameter tuning
    • Ray Serve – A library for distributed model deployment and serving
  • Ray clusters: A distributed computing platform where worker nodes run user code as Ray tasks and actors, often in the cloud

In this post, we dive deep into running Ray clusters on SageMaker HyperPod. A Ray cluster consists of a single head node and a number of connected worker nodes. The head node orchestrates task scheduling, resource allocation, and communication between nodes. The Ray worker nodes execute the distributed workloads using Ray tasks and actors, such as model training or data preprocessing.

Ray clusters and Kubernetes clusters pair well together. By running a Ray cluster on Kubernetes using the KubeRay operator, both Ray users and Kubernetes administrators benefit from a smooth path from development to production. For this use case, we use a SageMaker HyperPod cluster orchestrated through Amazon Elastic Kubernetes Service (Amazon EKS).

The KubeRay operator lets you run a Ray cluster on a Kubernetes cluster. KubeRay creates the following custom resource definitions (CRDs):

  • RayCluster – The primary resource for managing Ray instances on Kubernetes. The nodes in a Ray cluster manifest as pods in the Kubernetes cluster.
  • RayJob – A single executable job designed to run on an ephemeral Ray cluster. It serves as a higher-level abstraction for submitting tasks or batches of tasks to be executed by the Ray cluster. A RayJob also manages the lifecycle of the Ray cluster, making it ephemeral by automatically spinning up the cluster when the job is submitted and shutting it down when the job is complete.
  • RayService – A Ray cluster and a Serve application that runs on top of it in a single Kubernetes manifest. It allows for the deployment of Ray applications that need to be exposed for external communication, typically through a service endpoint.

For the remainder of this post, we don't focus on RayJob or RayService; we focus on creating a persistent Ray cluster to run distributed ML training jobs.
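For orientation only, a minimal RayJob manifest might look like the following sketch. The entrypoint path is a placeholder, and the exact apiVersion and fields depend on your KubeRay version; this post does not use RayJob, so treat this purely as an illustration of the ephemeral pattern:

# Minimal RayJob sketch (illustration only; entrypoint and image are placeholders)
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: example-rayjob
spec:
  entrypoint: python /home/ray/samples/train.py   # placeholder training script
  shutdownAfterJobFinishes: true                   # tears the Ray cluster down when the job completes
  rayClusterSpec:
    headGroupSpec:
      rayStartParams: {}
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.42.1-py310-gpu  # same base image used later in this post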

When Ray clusters are paired with SageMaker HyperPod clusters, they unlock enhanced resiliency and auto-resume capabilities, which we dive deeper into later in this post. This combination provides a solution for handling dynamic workloads, maintaining high availability, and providing seamless recovery from node failures, which is crucial for long-running jobs.

Overview of SageMaker HyperPod

In this section, we introduce SageMaker HyperPod and its built-in resiliency features that provide infrastructure stability.

Generative AI workloads such as training, inference, and fine-tuning involve building, maintaining, and optimizing large clusters of thousands of GPU accelerated instances. For distributed training, the goal is to efficiently parallelize workloads across these instances in order to maximize cluster utilization and minimize time to train. For large-scale inference, it's important to minimize latency, maximize throughput, and seamlessly scale across these instances for the best user experience. SageMaker HyperPod is a purpose-built infrastructure to address these needs. It removes the undifferentiated heavy lifting involved in building, maintaining, and optimizing a large GPU accelerated cluster. It also provides flexibility to fully customize your training or inference environment and compose your own software stack. You can use either Slurm or Amazon EKS for orchestration with SageMaker HyperPod.

Due to their massive size and the need to train on large amounts of data, FMs are often trained and deployed on large compute clusters composed of thousands of AI accelerators such as GPUs and AWS Trainium. A single failure in one of these thousands of accelerators can interrupt the entire training process, requiring manual intervention to identify, isolate, debug, repair, and recover the faulty node in the cluster. This workflow can take several hours for each failure, and as the scale of the cluster grows, it's common to see a failure every few days or even every few hours. SageMaker HyperPod provides resiliency against infrastructure failures by applying agents that continuously run health checks on cluster instances, fix the bad instances, reload the last valid checkpoint, and resume the training, without user intervention. As a result, you can train your models up to 40% faster. You can also SSH into an instance in the cluster for debugging and gather insights on hardware-level optimization during multi-node training. Orchestrators like Slurm or Amazon EKS facilitate efficient allocation and management of resources, provide optimal job scheduling, monitor resource utilization, and automate fault tolerance.

Solution overview

This section provides an overview of how to run Ray jobs for multi-node distributed training on SageMaker HyperPod. We go over the architecture and the process of creating a SageMaker HyperPod cluster, installing the KubeRay operator, and deploying a Ray training job.

Although this post provides a step-by-step guide to manually create the cluster, feel free to check out the aws-do-ray project, which aims to simplify the deployment and scaling of distributed Python applications using Ray on Amazon EKS or SageMaker HyperPod. It uses Docker to containerize the tools necessary to deploy and manage Ray clusters, jobs, and services. In addition to the aws-do-ray project, we'd like to highlight the Amazon SageMaker HyperPod EKS workshop, which offers an end-to-end experience for running various workloads on SageMaker HyperPod clusters. There are several examples of training and inference workloads in the GitHub repository awsome-distributed-training.

As introduced earlier in this post, KubeRay simplifies the deployment and management of Ray applications on Kubernetes. The following diagram illustrates the solution architecture.

SMHP EKS Architecture

Create a SageMaker HyperPod cluster

Prerequisites

Before deploying Ray on SageMaker HyperPod, you need a HyperPod cluster.

If you prefer to deploy HyperPod on an existing EKS cluster, follow the instructions here, which include:

  • EKS cluster – You can associate SageMaker HyperPod compute with an existing EKS cluster that satisfies the set of prerequisites. Alternatively, and recommended, you can deploy a ready-made EKS cluster with a single AWS CloudFormation template. Refer to the GitHub repo for instructions on setting up an EKS cluster.
  • Custom resources – Running multi-node distributed training requires various resources, such as device plugins, Container Storage Interface (CSI) drivers, and training operators, to be pre-deployed on the EKS cluster. You also need to deploy additional resources for the health monitoring agent and deep health check. HyperPodHelmCharts simplify the process using Helm, one of the most commonly used package managers for Kubernetes (see the sketch after this list). Refer to Install packages on the Amazon EKS cluster using Helm for installation instructions.
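As a rough sketch only, installing the HyperPod dependency packages with Helm might look like the following. The repository URL and chart path are assumptions based on the HyperPod documentation at the time of writing; follow the linked instructions for the authoritative steps:

# Sketch: deploy device plugins, CSI drivers, training operators, and health-monitoring agents
# (repository and chart paths are assumptions; see the linked Helm installation instructions)
git clone https://github.com/aws/sagemaker-hyperpod-cli.git
cd sagemaker-hyperpod-cli/helm_chart
helm dependency update HyperPodHelmChart
helm install dependencies HyperPodHelmChart --namespace kube-system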

The following shows an example workflow for creating a HyperPod cluster on an existing EKS cluster after deploying the prerequisites. This is for reference only and not required for the quick deploy option.

cat > cluster-config.json << EOL
{
    "ClusterName": "ml-cluster",
    "Orchestrator": {
        "Eks": {
            "ClusterArn": "${EKS_CLUSTER_ARN}"
        }
    },
    "InstanceGroups": [
        {
            "InstanceGroupName": "worker-group-1",
            "InstanceType": "ml.p5.48xlarge",
            "InstanceCount": 4,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
            "OnStartDeepHealthChecks": [
                "InstanceStress",
                "InstanceConnectivity"
            ]
        },
        {
            "InstanceGroupName": "head-group",
            "InstanceType": "ml.m5.2xlarge",
            "InstanceCount": 1,
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://amzn-s3-demo-bucket",
                "OnCreate": "on_create.sh"
            },
            "ExecutionRole": "${EXECUTION_ROLE}",
            "ThreadsPerCore": 1,
        }
    ],
    "VpcConfig": {
        "SecurityGroupIds": [
            "${SECURITY_GROUP_ID}"
        ],
        "Subnets": [
            "${SUBNET_ID}"
        ]
    },
    "NodeRecovery": "Computerized"
}
EOL

The provided configuration file contains two key highlights:

  • "OnStartDeepHealthChecks": ["InstanceStress", "InstanceConnectivity"] – Instructs SageMaker HyperPod to conduct a deep health check whenever new GPU or Trainium instances are added
  • "NodeRecovery": "Automatic" – Enables SageMaker HyperPod automatic node recovery

You can create a SageMaker HyperPod compute with the following AWS Command Line Interface (AWS CLI) command (AWS CLI version 2.17.47 or newer is required):

aws sagemaker create-cluster \
    --cli-input-json file://cluster-config.json
{
"ClusterArn": "arn:aws:sagemaker:us-east-2:xxxxxxxxxx:cluster/wccy5z4n4m49"
}

To verify the cluster status, you can use the following command:

aws sagemaker list-clusters --output table

This command displays the cluster details, including the cluster name, status, and creation time:

------------------------------------------------------------------------------------------------------------------------------------------------------
|                                                                    ListClusters                                                                    |
+----------------------------------------------------------------------------------------------------------------------------------------------------+
||                                                                 ClusterSummaries                                                                 ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|
||                           ClusterArn                           |        ClusterName        | ClusterStatus  |           CreationTime             ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|
||  arn:aws:sagemaker:us-west-2:xxxxxxxxxxxx:cluster/zsmyi57puczf |         ml-cluster        |   InService     |  2025-03-03T16:45:05.320000+00:00  ||
|+----------------------------------------------------------------+---------------------------+----------------+------------------------------------+|

Alternatively, you can verify the cluster status on the SageMaker console. After a brief period, you can observe that the status of the nodes transitions to Running.
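If you prefer the command line, you can also check the individual nodes. The commands below assume the cluster name ml-cluster used in this example:

# List the HyperPod nodes and their status for the cluster created above
aws sagemaker list-cluster-nodes --cluster-name ml-cluster
# After the nodes join the EKS cluster, they also show up as Kubernetes nodes
kubectl get nodes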

Create an FSx for Lustre shared file system

For us to deploy the Ray cluster, we need the SageMaker HyperPod cluster to be up and running, and additionally we need a shared storage volume (for example, an Amazon FSx for Lustre file system). This is a shared file system that the SageMaker HyperPod nodes can access. This file system can be provisioned statically before launching your SageMaker HyperPod cluster or dynamically afterwards.

Specifying a shared storage location (such as cloud storage or NFS) is optional for single-node clusters, but it is required for multi-node clusters. Using a local path will raise an error during checkpointing for multi-node clusters.

The Amazon FSx for Lustre CSI driver uses IAM roles for service accounts (IRSA) to authenticate AWS API calls. To use IRSA, an IAM OpenID Connect (OIDC) provider needs to be associated with the OIDC issuer URL that comes provisioned with your EKS cluster.

Create an IAM OIDC identity provider for your cluster with the following command:

eksctl utils associate-iam-oidc-provider --cluster $EKS_CLUSTER_NAME --approve

Deploy the FSx for Lustre CSI driver:

helm repo add aws-fsx-csi-driver https://kubernetes-sigs.github.io/aws-fsx-csi-driver
helm repo update
helm upgrade --install aws-fsx-csi-driver aws-fsx-csi-driver/aws-fsx-csi-driver \
  --namespace kube-system

This Helm chart includes a service account named fsx-csi-controller-sa that gets deployed in the kube-system namespace.

Use the eksctl CLI to create an AWS Identity and Access Management (IAM) role bound to the service account used by the driver, attaching the AmazonFSxFullAccess AWS managed policy:

eksctl create iamserviceaccount \
  --name fsx-csi-controller-sa \
  --override-existing-serviceaccounts \
  --namespace kube-system \
  --cluster $EKS_CLUSTER_NAME \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess \
  --approve \
  --role-name AmazonEKSFSxLustreCSIDriverFullAccess \
  --region $AWS_REGION

The --override-existing-serviceaccounts flag lets eksctl know that the fsx-csi-controller-sa service account already exists on the EKS cluster, so it skips creating a new one and updates the metadata of the existing service account instead.

Annotate the driver's service account with the Amazon Resource Name (ARN) of the AmazonEKSFSxLustreCSIDriverFullAccess IAM role that was created:

SA_ROLE_ARN=$(aws iam get-role --role-name AmazonEKSFSxLustreCSIDriverFullAccess --query 'Role.Arn' --output text)

kubectl annotate serviceaccount -n kube-system fsx-csi-controller-sa \
  eks.amazonaws.com/role-arn=${SA_ROLE_ARN} --overwrite=true

This annotation lets the driver know what IAM role it should use to interact with the FSx for Lustre service on your behalf.

Verify that the service account has been properly annotated:

kubectl get serviceaccount -n kube-system fsx-csi-controller-sa -o yaml

Restart the fsx-csi-controller deployment for the changes to take effect:

kubectl rollout restart deployment fsx-csi-controller -n kube-system

The FSx for Lustre CSI driver presents you with two options for provisioning a file system:

  • Dynamic provisioning – This option uses Persistent Volume Claims (PVCs) in Kubernetes. You define a PVC with the desired storage specifications. The CSI driver automatically provisions the FSx for Lustre file system for you based on the PVC request. This allows for straightforward scaling and eliminates the need to manually create file systems.
  • Static provisioning – In this method, you manually create the FSx for Lustre file system before using the CSI driver. You will need to configure details like the subnet ID and security groups for the file system. Then, you can use the driver to mount this pre-created file system inside your container as a volume (see the sketch after this list).
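For reference, a statically provisioned file system is exposed to Kubernetes through a PersistentVolume that points at the pre-created file system and is then paired with a matching PVC. The following is a sketch based on the FSx for Lustre CSI driver examples; the file system ID, DNS name, and mount name are placeholders:

# Static provisioning sketch (placeholder IDs and DNS name)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: fsx-pv
spec:
  capacity:
    storage: 1200Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: fsx.csi.aws.com
    volumeHandle: fs-0123456789abcdef0                                # pre-created file system ID
    volumeAttributes:
      dnsname: fs-0123456789abcdef0.fsx.us-west-2.amazonaws.com       # file system DNS name
      mountname: abcdefgh                                             # file system mount name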

For this example, we use dynamic provisioning. Start by creating a storage class that uses the fsx.csi.aws.com provisioner:

cat <<EOF > storageclass.yaml
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fsx-sc
provisioner: fsx.csi.aws.com
parameters:
  subnetId: ${SUBNET_ID}
  securityGroupIds: ${SECURITYGROUP_ID}
  deploymentType: PERSISTENT_2
  automaticBackupRetentionDays: "0"
  copyTagsToBackups: "true"
  perUnitStorageThroughput: "250"
  dataCompressionType: "LZ4"
  fileSystemTypeVersion: "2.12"
mountOptions:
  - flock
EOF

kubectl apply -f storageclass.yaml

  • SUBNET_ID: The subnet ID in which the FSx for Lustre file system will be created. It should be the same private subnet that was used for HyperPod creation.
  • SECURITYGROUP_ID: The security group IDs that will be attached to the file system. They should be the same security group ID that is used in HyperPod and EKS (example exports follow this list).
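Because the heredoc above expands these variables when storageclass.yaml is written, export them before running the cat command. The values below are placeholders:

# Placeholder values; set these before generating storageclass.yaml above
export SUBNET_ID=subnet-0123456789abcdef0        # private subnet used by the HyperPod cluster
export SECURITYGROUP_ID=sg-0123456789abcdef0     # security group shared by HyperPod and EKS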

Next, create a PVC named fsx-claim that uses the fsx-sc storage class:

cat <<EOF > pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: fsx-claim
spec:
  accessModes:
  - ReadWriteMany
  storageClassName: fsx-sc
  resources:
    requests:
      storage: 1200Gi
EOF

kubectl apply -f pvc.yaml

This PVC will start the dynamic provisioning of an FSx for Lustre file system based on the specifications provided in the storage class.
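You can watch the claim while the backing file system is created:

# The claim stays in Pending while FSx for Lustre creates the file system
# (typically several minutes), then transitions to Bound
kubectl get pvc fsx-claim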

Create the Ray cluster

Now that we have both the SageMaker HyperPod cluster and the FSx for Lustre file system created, we can set up the Ray cluster:

  1. Set up dependencies. We'll create a new namespace in our Kubernetes cluster and install the KubeRay operator using a Helm chart.

We recommend using KubeRay operator version 1.2.0 or higher, which supports automatic Ray Pod eviction and replacement in case of failures (for example, hardware issues on EKS or SageMaker HyperPod nodes).

# Create KubeRay namespace
kubectl create namespace kuberay
# Deploy the KubeRay operator with the Helm chart repository
helm repo add kuberay https://ray-project.github.io/kuberay-helm/
helm repo update
# Install both CRDs and the KubeRay operator v1.2.0
helm install kuberay-operator kuberay/kuberay-operator --version 1.2.0 --namespace kuberay
# The KubeRay operator pod will be deployed onto the head node
kubectl get pods --namespace kuberay

  2. Create a Ray container image for the Ray cluster manifest. With the recent deprecation of the `rayproject/ray-ml` images starting from Ray version 2.31.0, it's necessary to create a custom container image for our Ray cluster. Therefore, we will build on top of the `rayproject/ray:2.42.1-py310-gpu` image, which has all necessary Ray dependencies, and include our training dependencies to build our own custom image. Feel free to modify this Dockerfile as you wish.

First, create a Dockerfile that builds upon the base Ray GPU image and includes only the necessary dependencies:

cat <<'EOF' > Dockerfile

FROM rayproject/ray:2.42.1-py310-gpu
# Install Python dependencies for PyTorch, Ray, Hugging Face, and more
RUN pip install --no-cache-dir \
    torch torchvision torchaudio \
    numpy \
    pytorch-lightning \
    transformers datasets evaluate tqdm click \
    ray[train] ray[air] \
    ray[train-torch] ray[train-lightning] \
    torchdata \
    torchmetrics \
    torch_optimizer \
    accelerate \
    scikit-learn \
    Pillow==9.5.0 \
    protobuf==3.20.3

RUN pip install --upgrade datasets transformers

# Set the user
USER ray
WORKDIR /home/ray

# Verify ray installation
RUN which ray && \
    ray --version

# Default command
CMD [ "/bin/bash" ]

EOF

Then, build and push the image to your container registry (Amazon ECR) using the provided script:

export AWS_REGION=$(aws configure get region)
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REGISTRY=${ACCOUNT}.dkr.ecr.${AWS_REGION}.amazonaws.com/

echo "This process may take 10-15 minutes to complete..."

echo "Building image..."

docker build --platform linux/amd64 -t ${REGISTRY}aws-ray-custom:latest .

# Create registry if needed
REGISTRY_COUNT=$(aws ecr describe-repositories | grep "aws-ray-custom" | wc -l)
if [ "$REGISTRY_COUNT" == "0" ]; then
    aws ecr create-repository --repository-name aws-ray-custom
fi

# Login to registry
echo "Logging in to $REGISTRY ..."
aws ecr get-login-password --region $AWS_REGION | docker login --username AWS --password-stdin $REGISTRY

echo "Pushing image to $REGISTRY ..."

# Push image to registry
docker image push ${REGISTRY}aws-ray-custom:latest

Now, our Ray container image is in Amazon ECR with all necessary Ray dependencies, as well as code library dependencies.

  3. Create a Ray cluster manifest. We use a Ray cluster to host our training jobs. The Ray cluster is the primary resource for managing Ray instances on Kubernetes. It represents a cluster of Ray nodes, including a head node and multiple worker nodes. The Ray cluster CRD determines how the Ray nodes are set up, how they communicate, and how resources are allocated among them. The nodes in a Ray cluster manifest as pods in the EKS or SageMaker HyperPod cluster.

Note that there are two distinct sections in the cluster manifest. Whereas the `headGroupSpec` defines the head node of the Ray cluster, the `workerGroupSpecs` define the worker nodes of the Ray cluster. While a job could technically run on the head node as well, it is common to separate the head node from the actual worker nodes where jobs are executed. Therefore, the instance for the head node can typically be a smaller instance (in this example, we chose an ml.m5.2xlarge). Because the head node also manages cluster-level metadata, it can be beneficial to have it run on a non-GPU node to minimize the risk of node failure (GPUs are a common source of node failure).

cat <<'EOF' > raycluster.yaml
apiVersion: ray.io/v1alpha1
kind: RayCluster
metadata:
  name: rayml
  labels:
    controller-tools.k8s.io: "1.0"
spec:
  # Ray head pod template
  headGroupSpec:
    # The `rayStartParams` are used to configure the `ray start` command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
    rayStartParams:
      dashboard-host: '0.0.0.0'
    #pod template
    template:
      spec:
        #        nodeSelector:
        #node.kubernetes.io/instance-type: "ml.m5.2xlarge"
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          fsGroup: 0
        containers:
        - name: ray-head
          image: ${REGISTRY}aws-ray-custom:latest     ## IMAGE: Here you may choose which image your head pod will run
          env:                                ## ENV: Here is where you can pass environment variables to the head pod
            - name: RAY_GRAFANA_IFRAME_HOST   ## PROMETHEUS AND GRAFANA
              value: http://localhost:3000
            - name: RAY_GRAFANA_HOST
              value: http://prometheus-grafana.prometheus-system.svc:80
            - name: RAY_PROMETHEUS_HOST
              value: http://prometheus-kube-prometheus-prometheus.prometheus-system.svc:9090
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:                                    ## LIMITS: Set resource limits for your head pod
              cpu: 1
              memory: 8Gi
            requests:                                    ## REQUESTS: Set resource requests for your head pod
              cpu: 1
              memory: 8Gi
          ports:
          - containerPort: 6379
            name: gcs-server
          - containerPort: 8265 # Ray dashboard
            name: dashboard
          - containerPort: 10001
            name: client
          - containerPort: 8000
            name: serve
          volumeMounts:                                    ## VOLUMEMOUNTS
          - name: fsx-storage
            mountPath: /fsx
          - name: ray-logs
            mountPath: /tmp/ray
        volumes:
          - name: ray-logs
            emptyDir: {}
          - name: fsx-storage
            persistentVolumeClaim:
              claimName: fsx-claim
  workerGroupSpecs:
  # the pod replicas in this group typed worker
  - replicas: 4                                    ## REPLICAS: How many worker pods you want
    minReplicas: 1
    maxReplicas: 10
    # logical group name, for this called gpu-group, also can be helpful
    groupName: gpu-group
    rayStartParams:
      num-gpus: "8"
    #pod template
    template:
      spec:
        #nodeSelector:
        # node.kubernetes.io/instance-type: "ml.p5.48xlarge"
        securityContext:
          runAsUser: 0
          runAsGroup: 0
          fsGroup: 0
        containers:
        - name: ray-worker
          image: ${REGISTRY}aws-ray-custom:latest             ## IMAGE: Here you may choose which image your worker pods will run
          env:
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh","-c","ray stop"]
          resources:
            limits:                                    ## LIMITS: Set resource limits for your worker pods
              nvidia.com/gpu: 8
              #vpc.amazonaws.com/efa: 32
            requests:                                    ## REQUESTS: Set resource requests for your worker pods
              nvidia.com/gpu: 8
              #vpc.amazonaws.com/efa: 32
          volumeMounts:                                    ## VOLUMEMOUNTS
          - name: ray-logs
            mountPath: /tmp/ray
          - name: fsx-storage
            mountPath: /fsx
        volumes:
        - name: fsx-storage
          persistentVolumeClaim:
            claimName: fsx-claim
        - name: ray-logs
          emptyDir: {}
EOF

  4. Deploy the Ray cluster:
envsubst < raycluster.yaml | kubectl apply -f -

  5. Optionally, expose the Ray dashboard using port forwarding:
# Gets the name of the Kubernetes service that runs the head pod
export SERVICEHEAD=$(kubectl get service | grep head-svc | awk '{print $1}' | head -n 1)
# Port forwards the dashboard from the head pod service
kubectl port-forward --address 0.0.0.0 service/${SERVICEHEAD} 8265:8265 > /dev/null 2>&1 &

Now, you can visit http://localhost:8265/ to access the Ray dashboard.
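Optionally, you can confirm from the command line that the worker pods have joined the cluster. The pod names below derive from the rayml metadata name in the manifest:

# List the Ray head and worker pods created by the RayCluster manifest
kubectl get pods | grep rayml
# Check the resources Ray sees, run from inside the head pod
export HEAD_POD=$(kubectl get pods -o name | grep rayml | grep head | head -n 1)
kubectl exec -it $HEAD_POD -- ray status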

  6. To launch a training job, there are multiple options:
    1. Use the Ray jobs submission SDK, where you can submit jobs to the Ray cluster through the Ray dashboard port (8265 by default) where Ray listens for job requests. To learn more, see Quickstart using the Ray Jobs CLI.
    2. Execute a Ray job in the head pod, where you exec directly into the head pod and then submit your job (see the sketch after this list). To learn more, see RayCluster Quickstart.
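For completeness, the second method looks roughly like the following sketch. The head pod name and the path to the training code on the shared volume are placeholders:

# Sketch of method 2: exec into the head pod and submit the job from there
kubectl exec -it <head-pod-name> -- /bin/bash
# Inside the head pod, assuming the training code was copied to the shared FSx volume
cd /fsx/jobs/fsdp-ray
ray job submit --working-dir . -- python3 fsdp-ray.py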

For this example, we use the first method and submit the job through the SDK. Therefore, we simply run from a local environment where the training code is available in --working-dir. Relative to this path, we specify the main training Python script.
Within the working-dir folder, we can also include additional scripts we might need to run the training.

The fsdp-ray.py example is located at aws-do-ray/Container-Root/ray/raycluster/jobs/fsdp-ray/fsdp-ray.py in the aws-do-ray GitHub repo.

# Within the jobs/ folder
ray job submit --address http://localhost:8265 --working-dir "fsdp-ray" -- python3 fsdp-ray.py

For our Python training script to run, we need to make sure our training scripts are correctly set up to use Ray. This includes the following steps:

  • Configure a model to run distributed and on the correct CPU/GPU device
  • Configure a data loader to shard data across the workers and place data on the correct CPU or GPU device
  • Configure a training function to report metrics and save checkpoints
  • Configure scaling and CPU or GPU resource requirements for a training job
  • Launch a distributed training job with a TorchTrainer class

For further details on how to adjust your existing training script to get the most out of Ray, refer to the Ray documentation.

The following diagram illustrates the complete architecture you have built after completing these steps.

Ray on Hyperpod EKS Architecture

Implement training job resiliency with the job auto resume functionality

Ray is designed with robust fault tolerance mechanisms to provide resilience in distributed systems where failures are inevitable. These failures generally fall into two categories: application-level failures, which stem from bugs in user code or external system issues, and system-level failures, caused by node crashes, network disruptions, or internal bugs in Ray. To address these challenges, Ray provides tools and strategies that enable applications to detect, recover, and adapt seamlessly, providing reliability and performance in distributed environments. In this section, we look at two of the most common types of failures, and how to implement fault tolerance for them in a way that SageMaker HyperPod complements: Ray Train worker failures and Ray worker node failures.

  • Ray Train worker – This is a worker process specifically used for training tasks within Ray Train, Ray's distributed training library. These workers handle individual tasks or shards of a distributed training job. Each worker is responsible for processing a portion of the data, training a subset of the model, or performing computation during distributed training. They are coordinated by the Ray Train orchestration logic to collectively train a model.
  • Ray worker node – At the Ray level, this is a Ray node in a Ray cluster. It is part of the Ray cluster infrastructure and is responsible for running tasks, actors, and other processes as orchestrated by the Ray head node. Each worker node can host multiple Ray processes that execute tasks or manage distributed objects. At the Kubernetes level, a Ray worker node is a Kubernetes pod that is managed by a KubeRay operator. For this post, we will be talking about the Ray worker nodes at the Kubernetes level, so we will refer to them as pods.

At the time of writing, there are no official updates regarding head pod fault tolerance and auto resume capabilities. Though head pod failures are rare, in the unlikely event of such a failure, you will need to manually restart your training job. However, you can still resume progress from the last saved checkpoint. To minimize the risk of hardware-related head pod failures, it's advised to place the head pod on a dedicated, CPU-only SageMaker HyperPod node, because GPU failures are a common training job failure point.
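The RayCluster manifest above already contains a commented-out nodeSelector for this purpose; uncommenting it in the headGroupSpec pod template pins the head pod to the CPU-only instance group (the instance type shown is the example used earlier in this post):

# In the headGroupSpec pod template of raycluster.yaml
nodeSelector:
  node.kubernetes.io/instance-type: "ml.m5.2xlarge"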

Ray Train worker failures

Ray Train is designed with fault tolerance to handle worker failures, such as RayActorErrors. When a failure occurs, the affected workers are stopped and new ones are automatically started to maintain operations. However, for training progress to continue seamlessly after a failure, saving and loading checkpoints is essential. Without proper checkpointing, the training script will restart, but all progress will be lost. Checkpointing is therefore a critical component of Ray Train's fault tolerance mechanism and needs to be implemented in your code.

Automatic recovery

When a failure is detected, Ray shuts down the failed workers and provisions new ones. While this happens, we can tell the training function to always keep retrying until training can continue. Each instance of recovery from a worker failure is considered a retry. We can set the number of retries through the max_failures attribute of the FailureConfig, which is set in the RunConfig passed to the Trainer (for example, TorchTrainer). See the following code:

from ray.train import RunConfig, FailureConfig
# Tries to recover a run up to this many times.
run_config = RunConfig(failure_config=FailureConfig(max_failures=2))
# No limit on the number of retries.
run_config = RunConfig(failure_config=FailureConfig(max_failures=-1))

For more information, see Handling Failures and Node Preemption.

Checkpoints

A checkpoint in Ray Train is a lightweight interface representing a directory stored either locally or remotely. For example, a cloud-based checkpoint might point to s3://my-bucket/checkpoint-dir, and a local checkpoint might point to /tmp/checkpoint-dir. To learn more, see Saving checkpoints during training.

To save a checkpoint in the training loop, you first need to write your checkpoint to a local directory, which can be temporary. When saving, you can use checkpoint utilities from other frameworks like torch.save, pl.Trainer.save_checkpoint, accelerator.save_model, save_pretrained, tf.keras.Model.save, and more. Then you create a checkpoint from the directory using Checkpoint.from_directory. Finally, report the checkpoint to Ray Train using ray.train.report(metrics, checkpoint=...). The metrics reported alongside the checkpoint are used to keep track of the best-performing checkpoints. Reporting will upload the checkpoint to persistent storage.

If you save checkpoints with ray.train.report(..., checkpoint=...) and run on a multi-node cluster, Ray Train will raise an error if NFS or cloud storage is not set up. This is because Ray Train expects all workers to be able to write the checkpoint to the same persistent storage location.

Finally, clean up the local temporary directory to free up disk space (for example, by exiting the tempfile.TemporaryDirectory context). We can save a checkpoint every epoch or every few iterations.

The following diagram illustrates this setup.

Ray Checkpointing Architecture

The following code is an example of saving checkpoints using native PyTorch:

import os
import tempfile

import numpy as np
import torch
import torch.nn as nn
from torch.optim import Adam

import ray.train.torch
from ray import train
from ray.train import Checkpoint, ScalingConfig
from ray.train.torch import TorchTrainer


def train_func(config):
    n = 100
    # create a toy dataset
    # data   : X - dim = (n, 4)
    # target : Y - dim = (n, 1)
    X = torch.Tensor(np.random.normal(0, 1, size=(n, 4)))
    Y = torch.Tensor(np.random.uniform(0, 1, size=(n, 1)))
    # toy neural network : 1-layer
    # Wrap the model in DDP
    model = ray.train.torch.prepare_model(nn.Linear(4, 1))
    criterion = nn.MSELoss()

    optimizer = Adam(model.parameters(), lr=3e-4)
    for epoch in range(config["num_epochs"]):
        y = model.forward(X)
        loss = criterion(y, Y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        metrics = {"loss": loss.item()}

        with tempfile.TemporaryDirectory() as temp_checkpoint_dir:
            checkpoint = None

            should_checkpoint = epoch % config.get("checkpoint_freq", 1) == 0
            # In standard DDP training, where the model is the same across all ranks,
            # only the global rank 0 worker needs to save and report the checkpoint
            if train.get_context().get_world_rank() == 0 and should_checkpoint:
                torch.save(
                    model.module.state_dict(),  # NOTE: Unwrap the model.
                    os.path.join(temp_checkpoint_dir, "model.pt"),
                )
                checkpoint = Checkpoint.from_directory(temp_checkpoint_dir)

            train.report(metrics, checkpoint=checkpoint)


trainer = TorchTrainer(
    train_func,
    train_loop_config={"num_epochs": 5},
    scaling_config=ScalingConfig(num_workers=2),
)
result = trainer.fit()

Ray Train also comes with CheckpointConfig, a way to configure checkpointing options:

from ray.train import RunConfig, CheckpointConfig
# Example 1: Only keep the 2 *most recent* checkpoints and delete the others.
run_config = RunConfig(checkpoint_config=CheckpointConfig(num_to_keep=2))
# Example 2: Only keep the 2 *best* checkpoints and delete the others.
run_config = RunConfig(
    checkpoint_config=CheckpointConfig(
        num_to_keep=2,
        # *Best* checkpoints are determined by these params:
        checkpoint_score_attribute="mean_accuracy",
        checkpoint_score_order="max",
    ),
    # This will store checkpoints on S3.
    storage_path="s3://remote-bucket/location",
)

To restore training state from a checkpoint if your training job were to fail and retry, you should modify your training loop to auto resume and then restore a Ray Train job. By pointing to the path of your saved checkpoints, you can restore your trainer and continue training. Here's a quick example:

from ray.train.torch import TorchTrainer

restored_trainer = TorchTrainer.restore(
    path="~/ray_results/dl_trainer_restore",  # Can also be a cloud storage path like S3
    datasets=get_datasets(),
)
result = restored_trainer.fit()

To streamline recovery, you can add auto resume logic to your script. This checks if a valid experiment directory exists and restores the trainer if available. If not, it starts a new experiment:

experiment_path = "~/ray_results/dl_restore_autoresume"
if TorchTrainer.can_restore(experiment_path):
    trainer = TorchTrainer.restore(experiment_path, datasets=get_datasets())
else:
    trainer = TorchTrainer(
        train_loop_per_worker=train_loop_per_worker,
        datasets=get_datasets(),
        scaling_config=train.ScalingConfig(num_workers=2),
        run_config=train.RunConfig(
            storage_path="~/ray_results",
            name="dl_restore_autoresume",
        ),
    )
result = trainer.fit()

To summarize, to provide fault tolerance and auto resume when using Ray Train libraries, set your max_failures parameter in the FailureConfig (we recommend setting it to -1 to make sure it will keep retrying until the SageMaker HyperPod node is rebooted or replaced), and make sure you have enabled checkpointing in your code.
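Putting those two pieces together, a run configuration might look like the following sketch. The /fsx path and the experiment name are examples that assume the shared volume mount from the RayCluster manifest earlier in this post:

from ray.train import RunConfig, FailureConfig, CheckpointConfig

run_config = RunConfig(
    storage_path="/fsx/ray_results",                 # shared FSx for Lustre mount visible to all workers
    name="fsdp-training",                            # example experiment name
    failure_config=FailureConfig(max_failures=-1),   # retry until the HyperPod node is rebooted or replaced
    checkpoint_config=CheckpointConfig(num_to_keep=2),
)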

Ray worker pod failures

In addition to the aforementioned mechanisms to recover from Ray Train worker failures, Ray also provides fault tolerance at the worker pod level. When a worker pod fails (this includes scenarios in which the raylet process fails), the running tasks and actors on it will fail and the objects owned by the worker processes of this pod will be lost. In this case, the tasks, actors, and objects fault tolerance mechanisms will kick in and try to recover the failures using other worker pods.

These mechanisms will be implicitly handled by the Ray Train library. To learn more about the underlying fault tolerance for tasks, actors, and objects (implemented at the Ray Core level), see Fault Tolerance.

In practice, this means that in case of a worker pod failure, the following occurs:

  • If there is a free worker pod in the Ray cluster, Ray will recover the failed worker pod by replacing it with the free worker pod.
  • If there is no free worker pod, but there are free SageMaker HyperPod nodes in the underlying SageMaker HyperPod cluster, Ray will schedule a new worker pod onto one of the free SageMaker HyperPod nodes. This pod will join the running Ray cluster and the failure will be recovered using this new worker pod.

In the context of KubeRay, Ray worker nodes are represented by Kubernetes pods, and failures at this level can include issues such as pod eviction or preemption caused by software-level factors.

However, another critical scenario to consider is hardware failures. If the underlying SageMaker HyperPod node becomes unavailable due to a hardware issue, such as a GPU error, it will inevitably cause the Ray worker pod running on that node to fail as well. In that case, the fault tolerance and auto-healing mechanisms of your SageMaker HyperPod cluster kick in and will reboot or replace the faulty node. After the new healthy node is added to the SageMaker HyperPod cluster, Ray will schedule a new worker pod onto it and recover the interrupted training. Here, both the Ray fault tolerance mechanism and the SageMaker HyperPod resiliency features work together seamlessly and make sure that even in case of a hardware failure, your ML training workload can auto resume and pick up from where it was interrupted.

As you have seen, there are various built-in resiliency and fault-tolerance mechanisms that allow your Ray Train workload on SageMaker HyperPod to recover and auto resume. Because these mechanisms primarily recover by restarting the training job, it's crucial that checkpointing is implemented in the training script. It is also generally advised to save the checkpoints on a shared and persistent path, such as an Amazon Simple Storage Service (Amazon S3) bucket or FSx for Lustre file system.

Clean up

To delete the SageMaker HyperPod cluster created in this post, you can either use the SageMaker AI console or use the following AWS CLI command:

aws sagemaker delete-cluster --cluster-name <cluster_name>

Cluster deletion will take a few minutes. You can confirm successful deletion after you see no clusters on the SageMaker AI console.

If you used the CloudFormation stack to create resources, you can delete it using the following command:

aws cloudformation delete-stack --stack-name <stack_name>

Conclusion

This post demonstrated how to set up and deploy Ray clusters on SageMaker HyperPod, highlighting key considerations such as storage configuration and fault tolerance and auto resume mechanisms.

Running Ray jobs on SageMaker HyperPod offers a powerful solution for distributed AI/ML workloads, combining the flexibility of Ray with the robust infrastructure of SageMaker HyperPod. This integration provides enhanced resiliency and auto resume capabilities, which are crucial for long-running and resource-intensive tasks. By using Ray's distributed computing framework and the built-in features of SageMaker HyperPod, you can efficiently manage complex ML workflows, especially training workloads as covered in this post. As AI/ML workloads continue to grow in scale and complexity, the combination of Ray and SageMaker HyperPod offers a scalable, resilient, and efficient platform for tackling the most demanding computational challenges in machine learning.

To get started with SageMaker HyperPod, refer to the Amazon EKS Support in Amazon SageMaker HyperPod workshop and the Amazon SageMaker HyperPod Developer Guide. To learn more about the aws-do-ray framework, refer to the GitHub repo.


About the Authors

Mark Vinciguerra is an Associate Specialist Solutions Architect at Amazon Web Services (AWS) based in New York. He focuses on the Automotive and Manufacturing sector, specializing in helping organizations architect, optimize, and scale artificial intelligence and machine learning solutions, with particular expertise in autonomous vehicle technologies. Prior to AWS, he went to Boston University and graduated with a degree in Computer Engineering.

Florian Stahl is a Worldwide Specialist Solutions Architect at AWS, based in Hamburg, Germany. He specializes in Artificial Intelligence, Machine Learning, and Generative AI solutions, helping customers optimize and scale their AI/ML workloads on AWS. With a background as a Data Scientist, Florian focuses on working with customers in the Autonomous Vehicle space, bringing deep technical expertise to help organizations design and implement sophisticated machine learning solutions. He works closely with customers worldwide to transform their AI initiatives and maximize the value of their machine learning investments on AWS.

Anoop Saha is a Sr GTM Specialist at Amazon Web Services (AWS) focusing on generative AI model training and inference. He partners with top foundation model builders, strategic customers, and AWS service teams to enable distributed training and inference at scale on AWS and to lead joint GTM motions. Before AWS, Anoop held several leadership roles at startups and large corporations, primarily focusing on silicon and system architecture of AI infrastructure.

Alex Iankoulski is a Principal Solutions Architect, ML/AI Frameworks, who focuses on helping customers orchestrate their AI workloads using containers and accelerated computing infrastructure on AWS. He is also the author of the open source do framework and a Docker captain who loves applying container technologies to accelerate the pace of innovation while solving the world's biggest challenges.
