Node problem detection and recovery for AWS Neuron nodes within Amazon EKS clusters

July 28, 2024


Implementing hardware resiliency in your training infrastructure is crucial for mitigating risks and enabling uninterrupted model training. By implementing solutions such as proactive health monitoring and automated recovery mechanisms, organizations can create a fault-tolerant environment capable of handling hardware failures or other issues without compromising the integrity of the training process.

In this post, we introduce the AWS Neuron node problem detector and recovery DaemonSet for AWS Trainium and AWS Inferentia on Amazon Elastic Kubernetes Service (Amazon EKS). This component can quickly detect rare occurrences of issues when Neuron devices fail by tailing monitoring logs. It marks the worker nodes with a defective Neuron device as unhealthy and promptly replaces them with new worker nodes. By accelerating the speed of issue detection and remediation, it increases the reliability of your ML training and reduces the wasted time and cost caused by hardware failure.

This solution is applicable if you're using managed nodes or self-managed node groups (which use Amazon EC2 Auto Scaling groups) on Amazon EKS. At the time of writing this post, automatic recovery of nodes provisioned by Karpenter is not yet supported.

Solution overview

The solution is based on the node problem detector and recovery DaemonSet, a powerful tool designed to automatically detect and report various node-level problems in a Kubernetes cluster.

The node problem detector component continuously monitors the kernel message (kmsg) logs on the worker nodes. If it detects error messages specifically related to the Neuron device (the Trainium or AWS Inferentia chip), it changes the NodeCondition to NeuronHasError on the Kubernetes API server.
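As a quick illustration (not part of the official walkthrough), once the plugin is deployed you can read that condition directly from the Kubernetes API; the node name below is a placeholder:

# Show the NeuronHealth condition that the detector maintains on a worker node
kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="NeuronHealth")]}'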

The node recovery agent is a separate component that periodically checks the Prometheus metrics exposed by the node problem detector. When it finds a node condition indicating an issue with the Neuron device, it takes automated action. First, it marks the affected instance in the associated Auto Scaling group as unhealthy, which invokes the Auto Scaling group to stop the instance and launch a replacement. Additionally, the node recovery agent publishes Amazon CloudWatch metrics so users can monitor and alert on these events.
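Conceptually, the recovery action is similar to making the following AWS CLI call yourself; this is shown only to clarify the mechanism, the agent performs it automatically, and the instance ID is a placeholder:

# Mark an instance unhealthy so its Auto Scaling group stops and replaces it
aws autoscaling set-instance-health \
    --instance-id i-0123456789abcdef0 \
    --health-status Unhealthy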

The following diagram illustrates the solution architecture and workflow.

In the following walkthrough, we create an EKS cluster with Trn1 worker nodes, deploy the Neuron plugin for the node problem detector, and inject an error message into a node. We then observe the failing node being stopped and replaced with a new one, and find a metric in CloudWatch indicating the error.

Prerequisites

Before you start, make sure you have the following tools installed on your machine:

Deploy the node problem detection and recovery plugin

Complete the following steps to configure the node problem detection and recovery plugin:

  1. Create an EKS cluster using the Data on EKS Terraform module:
    git clone https://github.com/awslabs/data-on-eks.git
    
    export TF_VAR_region=us-east-2
    export TF_VAR_trn1_32xl_desired_size=4
    export TF_VAR_trn1_32xl_min_size=4
    cd data-on-eks/ai-ml/trainium-inferentia/ && chmod +x install.sh
    ./install.sh
    
    aws eks --region us-east-2 describe-cluster --name trainium-inferentia
    
    # Creates k8s config file to authenticate with EKS
    aws eks --region us-east-2 update-kubeconfig --name trainium-inferentia
    
    kubectl get nodes
    NAME STATUS ROLES AGE VERSION
    ip-100-64-161-213.us-east-2.compute.internal Ready 31d v1.29.0-eks-5e0fdde
    ip-100-64-227-31.us-east-2.compute.internal Ready 31d v1.29.0-eks-5e0fdde
    ip-100-64-70-179.us-east-2.compute.internal Ready 31d v1.29.0-eks-5e0fdde

  2. Install the required AWS Identity and Access Management (IAM) role for the service account and the node problem detector plugin.
  3. Create a policy as shown below. Update the Resource key value to match your node group ARN that contains the Trainium and AWS Inferentia nodes, and update the ec2:ResourceTag/aws:autoscaling:groupName key value to match the Auto Scaling group name.

You can get these values from the Amazon EKS console. Choose Clusters in the navigation pane, open the trainium-inferentia cluster, choose Node groups, and locate your node group.
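Alternatively, you can look these values up with the AWS CLI; the node group name in this sketch is a placeholder, so substitute the one returned by list-nodegroups:

# List the node groups in the cluster, then fetch the node group ARN and its Auto Scaling group name
aws eks list-nodegroups --cluster-name trainium-inferentia --region us-east-2
aws eks describe-nodegroup \
    --cluster-name trainium-inferentia \
    --nodegroup-name <your-trn1-node-group> \
    --region us-east-2 \
    --query 'nodegroup.[nodegroupArn,resources.autoScalingGroups[0].name]' \
    --output text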

# Create npd-policy-trimmed.json
cat << EOF > npd-policy-trimmed.json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:SetInstanceHealth",
                "autoscaling:DescribeAutoScalingInstances"
            ],
            "Effect": "Allow",
            "Resource": "<your node group ARN>"
        },
        {
            "Action": [
                "ec2:DescribeInstances"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "ForAllValues:StringEquals": {
                    "ec2:ResourceTag/aws:autoscaling:groupName": "<your Auto Scaling group name>"
                }
            }
        },
        {
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Effect": "Allow",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "cloudwatch:Namespace": "NeuronHealthCheck"
                }
            }
        }
    ]
}
EOF

This component will be installed as a DaemonSet in your EKS cluster.

# To create the policy, the AWS CLI can be used as shown below, where npd-policy-trimmed.json is the policy JSON built from the template above.

aws iam create-policy \
    --policy-name NeuronProblemDetectorPolicy \
    --policy-document file://npd-policy-trimmed.json

# Note the ARN

CLUSTER_NAME=trainium-inferentia # Your EKS cluster name
AWS_REGION=us-east-2
ACCOUNT_ID=$(aws sts get-caller-identity --query 'Account' --output text)
POLICY_ARN=arn:aws:iam::$ACCOUNT_ID:policy/NeuronProblemDetectorPolicy

eksctl create addon --cluster $CLUSTER_NAME --name eks-pod-identity-agent \
  --region $AWS_REGION

eksctl create podidentityassociation \
    --cluster $CLUSTER_NAME \
    --namespace neuron-healthcheck-system \
    --service-account-name node-problem-detector \
    --permission-policy-arns="$POLICY_ARN" \
    --region $AWS_REGION
    
# Install the Neuron NPD and recovery plugin 

kubectl create ns neuron-healthcheck-system
curl https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/215b421ac448d85f89be056e27e29842a6b03c9c/src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery.yml | kubectl apply -f - 
curl https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/215b421ac448d85f89be056e27e29842a6b03c9c/src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery-rbac.yml | kubectl apply -f - 
curl https://raw.githubusercontent.com/aws-neuron/aws-neuron-sdk/215b421ac448d85f89be056e27e29842a6b03c9c/src/k8/neuron-problem-detector/k8s-neuron-problem-detector-and-recovery-config.yml | kubectl apply -f -

# Expected result (with 4 Neuron nodes in the cluster):

kubectl get pod -n neuron-healthcheck-system
NAME READY STATUS RESTARTS AGE
node-problem-detector-49p6w 2/2 Running 0 31s
node-problem-detector-j7wct 2/2 Running 0 31s
node-problem-detector-qr6jm 2/2 Running 0 31s
node-problem-detector-vwq8x 2/2 Running 0 31s

The container images in the Kubernetes manifests are stored in public repositories such as registry.k8s.io and public.ecr.aws. For production environments, it's recommended that customers limit external dependencies on these registries, host the container images in a private registry, and sync them from the public repositories. For detailed implementation, refer to the blog post Announcing pull through cache for registry.k8s.io in Amazon Elastic Container Registry.
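As one possible approach (a sketch, not a complete setup), you could create an Amazon ECR pull through cache rule for registry.k8s.io and then point the manifests at the cached repository URIs:

# Create a pull through cache rule that mirrors registry.k8s.io into your private ECR
aws ecr create-pull-through-cache-rule \
    --ecr-repository-prefix registry-k8s-io \
    --upstream-registry-url registry.k8s.io \
    --region us-east-2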

By default, the node problem detector will not take any action on a failed node. If you want the EC2 instance to be terminated automatically by the agent, update the DaemonSet as follows:

kubectl edit -n neuron-healthcheck-system ds/node-problem-detector

...
   env:
   - name: ENABLE_RECOVERY
     value: "true"
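As a shortcut, the same change can usually be applied with kubectl set env instead of an interactive edit; if the DaemonSet runs multiple containers, you may need to target the right one with -c:

# Set ENABLE_RECOVERY=true on the DaemonSet; its pods are rolled out again automatically
kubectl set env -n neuron-healthcheck-system daemonset/node-problem-detector ENABLE_RECOVERY=true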

Test the node problem detector and recovery solution

After the plugin is installed, you can see Neuron conditions show up by running kubectl describe node. We simulate a device error by injecting error logs on the instance:

# Verify node conditions on any node. Neuron conditions should show up.

kubectl describe node ip-100-64-58-151.us-east-2.compute.internal | grep Conditions: -A7

Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  NeuronHealth     False   Fri, 29 Mar 2024 15:52:08 +0800   Thu, 28 Mar 2024 13:59:19 +0800   NeuronHasNoError             Neuron has no error
  MemoryPressure   False   Fri, 29 Mar 2024 15:51:03 +0800   Thu, 28 Mar 2024 13:58:39 +0800   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Fri, 29 Mar 2024 15:51:03 +0800   Thu, 28 Mar 2024 13:58:39 +0800   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Fri, 29 Mar 2024 15:51:03 +0800   Thu, 28 Mar 2024 13:58:39 +0800   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Fri, 29 Mar 2024 15:51:03 +0800   Thu, 28 Mar 2024 13:59:08 +0800   KubeletReady                 kubelet is posting ready status

# To get the provider ID
kubectl describe node ip-100-64-58-151.us-east-2.compute.internal | grep -i provider | sed -E 's/.*\/([^\/]+)$/\1/'

i-0381404aa69eae3f6

# Connect to the worker node and simulate the hardware error on the Neuron device
aws ssm start-session --target i-0381404aa69eae3f6 --region us-east-2

Starting session with SessionId: lindarr-0069460593240662a

sh-4.2$
sh-4.2$ sudo bash
[root@ip-192-168-93-211 bin]# echo "test NEURON_HW_ERR=DMA_ERROR test" >> /dev/kmsg

Around 2 minutes later, you can see that the error has been identified:

kubectl describe node ip-100-64-58-151.us-east-2.compute.internal | grep 'Conditions:' -A7
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  NeuronHealth     True    Fri, 29 Mar 2024 17:42:43 +0800   Fri, 29 Mar 2024 17:42:38 +0800   NeuronHasError_DMA_ERROR     test NEURON_HW_ERR=DMA_ERROR test

...

Events:
  Type     Reason                    Age   From            Message
  ----     ------                    ----  ----            -------
  Warning  NeuronHasError_DMA_ERROR  36s   kernel-monitor  Node condition NeuronHealth is now: True, reason: NeuronHasError_DMA_ERROR, message: "test NEURON_HW_ERR=DMA_ERROR test"

Now that the error has been detected by the node problem detector and the recovery agent has automatically taken action to set the node as unhealthy, Amazon EKS cordons the node and evicts the pods on it:

# Verify that node scheduling is disabled.
kubectl get node 
NAME                                          STATUS                        ROLES   AGE    VERSION
ip-100-64-1-48.us-east-2.compute.internal     Ready                                 156m   v1.29.0-eks-5e0fdde
ip-100-64-103-26.us-east-2.compute.internal   Ready                                 94s    v1.29.0-eks-5e0fdde
ip-100-64-239-245.us-east-2.compute.internal  Ready                                 154m   v1.29.0-eks-5e0fdde
ip-100-64-52-40.us-east-2.compute.internal    Ready                                 156m   v1.29.0-eks-5e0fdde
ip-100-64-58-151.us-east-2.compute.internal   NotReady,SchedulingDisabled           27h    v1.29.0-eks-5e0fdde

You can open the CloudWatch console and verify the metrics in the NeuronHealthCheck namespace. You can see the NeuronHasError_DMA_ERROR metric has the value 1.
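You can also confirm this from the CLI; this is a quick sketch, and the exact metric dimensions depend on your setup:

# List the metrics published in the NeuronHealthCheck namespace
aws cloudwatch list-metrics --namespace NeuronHealthCheck --region us-east-2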

After the replacement, you can see that a new worker node has been created:

# The node with age 28s is the new node

kubectl get node 
NAME                                          STATUS   ROLES   AGE   VERSION
ip-192-168-65-77.us-east-2.compute.internal   Ready            28s   v1.29.0-eks-5e0fdde
ip-192-168-81-176.us-east-2.compute.internal  Ready            9d    v1.29.5-eks-5e0fdde
ip-192-168-91-218.us-east-2.compute.internal  Ready            9d    v1.29.0-eks-5e0fdde
ip-192-168-94-83.us-east-2.compute.internal   Ready            9d    v1.29.0-eks-5e0fdde

Let's look at a real-world scenario, in which you're running a distributed training job using an MPI operator as outlined in Llama-2 on Trainium, and there's an irrecoverable Neuron error in one of the nodes. Before the plugin is deployed, the training job becomes stuck, resulting in wasted time and computational costs. With the plugin deployed, the node problem detector proactively removes the problem node from the cluster. The training scripts save checkpoints periodically so that training can resume from the previous checkpoint.

The following screenshot shows example logs from a distributed training job.

The training has been started. (You can ignore loss=nan for now; it's a known issue and will be removed. For immediate use, refer to the reduced_train_loss metric.)

The following screenshot shows the checkpoint created at step 77.

Training stopped after one of the nodes had a problem at step 86. The error was injected manually for testing.

After the faulty node was detected and replaced by the Neuron plugin for node problem detection and recovery, the training process resumed at step 77, which was the last checkpoint.

Although Auto Scaling groups will stop unhealthy nodes, they may encounter issues that prevent the launch of replacement nodes. In such cases, training jobs will stall and require manual intervention. However, the stopped node will not incur further charges for the associated EC2 instance.
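If a replacement does not appear, inspecting the Auto Scaling group activity history is a reasonable first step; the group name below is a placeholder:

# Review recent scaling activities for launch failures
aws autoscaling describe-scaling-activities \
    --auto-scaling-group-name <your-auto-scaling-group-name> \
    --max-items 5 \
    --region us-east-2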

If you want to take custom actions in addition to stopping instances, you can create CloudWatch alarms that watch the metrics NeuronHasError_DMA_ERROR, NeuronHasError_HANG_ON_COLLECTIVES, NeuronHasError_HBM_UNCORRECTABLE_ERROR, NeuronHasError_SRAM_UNCORRECTABLE_ERROR, and NeuronHasError_NC_UNCORRECTABLE_ERROR, and use a CloudWatch Metrics Insights query such as SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck to sum up these values and evaluate the alarms. The following screenshots show an example.
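As a sketch of what such an alarm could look like from the CLI (the alarm name, period, and threshold are assumptions to adapt):

# Alarm on a Metrics Insights query over the NeuronHealthCheck namespace
aws cloudwatch put-metric-alarm \
    --alarm-name NeuronHasError-DMA-ERROR \
    --comparison-operator GreaterThanThreshold \
    --threshold 0 \
    --evaluation-periods 1 \
    --metrics '[{"Id":"q1","Expression":"SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck","Period":60,"ReturnData":true}]' \
    --region us-east-2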

Clean up

To clean up all the provisioned resources for this post, run the cleanup script:

# neuron-problem-detector-role-$CLUSTER_NAME
eksctl delete podidentityassociation \
    --service-account-name node-problem-detector \
    --namespace neuron-healthcheck-system \
    --cluster $CLUSTER_NAME \
    --region $AWS_REGION

# Delete the EKS cluster
cd data-on-eks/ai-ml/trainium-inferentia
./cleanup.sh

Conclusion

In this post, we showed how the Neuron problem detector and recovery DaemonSet for Amazon EKS works for EC2 instances powered by AWS Trainium and AWS Inferentia. If you're running Neuron-based EC2 instances and using managed nodes or self-managed node groups, you can deploy the detector and recovery DaemonSet in your EKS cluster and benefit from improved reliability and fault tolerance of your machine learning training workloads in the event of node failure.


About the authors

Harish Rao is a senior solutions architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.

Ziwen Ning is a software development engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with badminton, swimming and other various sports, and immersing himself in music.

Geeta Gharpure is a senior software developer on the Annapurna ML engineering team. She is focused on running large-scale AI/ML workloads on Kubernetes. She lives in Sunnyvale, CA and enjoys listening to Audible in her free time.

Darren Lin is a Cloud Native Specialist Solutions Architect at AWS who focuses on domains such as Linux, Kubernetes, containers, observability, and open source technologies. In his spare time, he likes to work out and have fun with his family.
