Implementing hardware resiliency in your training infrastructure is crucial to mitigating risks and enabling uninterrupted model training. By implementing solutions such as proactive health monitoring and automated recovery mechanisms, organizations can create a fault-tolerant environment capable of handling hardware failures or other issues without compromising the integrity of the training process.
In this post, we introduce the AWS Neuron node problem detector and recovery DaemonSet for AWS Trainium and AWS Inferentia on Amazon Elastic Kubernetes Service (Amazon EKS). This component can quickly detect rare occurrences of issues when Neuron devices fail by tailing monitoring logs. It marks the worker nodes with a defective Neuron device as unhealthy and promptly replaces them with new worker nodes. By accelerating the speed of issue detection and remediation, it increases the reliability of your ML training and reduces the wasted time and cost caused by hardware failure.
This solution is applicable if you're using managed nodes or self-managed node groups (which use Amazon EC2 Auto Scaling groups) on Amazon EKS. At the time of writing this post, automatic recovery of nodes provisioned by Karpenter is not yet supported.
Solution overview
The solution is based on the node problem detector and recovery DaemonSet, a powerful tool designed to automatically detect and report various node-level problems in a Kubernetes cluster.
The node problem detector component continuously monitors the kernel message (kmsg) logs on the worker nodes. If it detects error messages specifically related to the Neuron device (which is the Trainium or AWS Inferentia chip), it changes the NodeCondition to NeuronHasError on the Kubernetes API server.
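For reference, such a condition follows the standard Kubernetes node condition structure. The snippet below is an illustration only; the reason, message, and timestamp values are placeholders, not the plugin's actual output:

status:
  conditions:
  - type: NeuronHasError
    status: "True"
    lastTransitionTime: "2024-01-01T00:00:00Z"            # placeholder timestamp
    reason: "<error category reported by the detector>"    # placeholder
    message: "<error details reported by the detector>"    # placeholder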
The node recovery agent is a separate component that periodically checks the Prometheus metrics exposed by the node problem detector. When it finds a node condition indicating an issue with the Neuron device, it takes automated actions. First, it marks the affected instance in the relevant Auto Scaling group as unhealthy, which invokes the Auto Scaling group to stop the instance and launch a replacement. Additionally, the node recovery agent publishes Amazon CloudWatch metrics for users to monitor and alert on these events.
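To make the automation concrete, the following AWS CLI calls approximate what the agent does on your behalf; the instance ID is a placeholder, and the metric name is one of the examples shown later in this post:

# Mark the faulty instance as unhealthy so its Auto Scaling group stops and replaces it (placeholder instance ID)
aws autoscaling set-instance-health --instance-id i-0123456789abcdef0 --health-status Unhealthy
# Publish a CloudWatch metric of the kind the agent emits for monitoring and alerting
aws cloudwatch put-metric-data --namespace NeuronHealthCheck --metric-name NeuronHasError_DMA_ERROR --value 1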
The following diagram illustrates the solution architecture and workflow.
In the following walkthrough, we create an EKS cluster with Trn1 worker nodes, deploy the Neuron plugin for the node problem detector, and inject an error message into the node. We then observe the failing node being stopped and replaced with a new one, and find a metric in CloudWatch indicating the error.
Prerequisites
Before you start, make sure you have installed the following tools on your machine:
Deploy the node problem detection and recovery plugin
Complete the following steps to configure the node problem detection and recovery plugin:
- Create an EKS cluster using the Data on EKS Terraform module:
- Install the required AWS Identity and Access Management (IAM) role for the service account and the node problem detector plugin.
- Create a policy as shown below. Update the Resource key value to match your node group ARN that contains the Trainium and AWS Inferentia nodes, and update the ec2:ResourceTag/aws:autoscaling:groupName key value to match the Auto Scaling group name.
You can get these values from the Amazon EKS console. Choose Clusters in the navigation pane, open the trainium-inferentia cluster, choose Node groups, and locate your node group.
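The authoritative policy document ships with the plugin; the sketch below only illustrates its general shape, with placeholder ARNs and an assumed action list (SetInstanceHealth to mark instances unhealthy, TerminateInstances scoped by the Auto Scaling group tag, and PutMetricData for the CloudWatch metrics):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "autoscaling:SetInstanceHealth",
      "Resource": "<your-node-group-auto-scaling-group-arn>"
    },
    {
      "Effect": "Allow",
      "Action": "ec2:TerminateInstances",
      "Resource": "*",
      "Condition": {
        "StringEquals": {
          "ec2:ResourceTag/aws:autoscaling:groupName": "<your-auto-scaling-group-name>"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": "cloudwatch:PutMetricData",
      "Resource": "*"
    }
  ]
}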
This component will be installed as a DaemonSet in your EKS cluster.
The container images in the Kubernetes manifests are stored in public repositories such as registry.k8s.io and public.ecr.aws. For production environments, it's recommended that customers limit external dependencies on these registries by hosting the container images in a private registry and syncing them from the public repositories. For detailed implementation, refer to the blog post Announcing pull through cache for registry.k8s.io in Amazon Elastic Container Registry.
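As a quick illustration of that approach, an Amazon ECR pull through cache rule for registry.k8s.io can be created with the AWS CLI; the repository prefix below is an arbitrary example:

# Create a pull through cache rule so images from registry.k8s.io are cached in your private ECR
aws ecr create-pull-through-cache-rule --ecr-repository-prefix registry-k8s-io --upstream-registry-url registry.k8s.io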
By default, the node problem detector will not take any actions on a failed node. If you would like the EC2 instance to be terminated automatically by the agent, update the DaemonSet as follows:
kubectl edit -n neuron-healthcheck-system ds/node-problem-detector
...
   env:
   - name: ENABLE_RECOVERY
     value: "true"
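After you save the edit, the DaemonSet rolls the change out to its pods; one way to confirm the rollout has finished is, for example:

kubectl -n neuron-healthcheck-system rollout status ds/node-problem-detector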
Test the node problem detector and recovery solution
After the plugin is installed, you can see Neuron conditions show up by running kubectl describe node. We simulate a device error by injecting error logs in the instance:
Around 2 minutes later, you can see that the error has been identified:
kubectl describe node ip-100-64-58-151.us-east-2.compute.internal | grep 'Conditions:' -A7
Now that the error has been detected by the node problem detector, and the recovery agent has automatically taken the action to set the node as unhealthy, Amazon EKS will cordon the node and evict the pods on the node:
You can open the CloudWatch console and verify the metric for NeuronHealthCheck. You can see the CloudWatch NeuronHasError_DMA_ERROR metric has the value 1.
After replacement, you can see a new worker node has been created:
Let's look at a real-world scenario, in which you're running a distributed training job using an MPI operator as outlined in Llama-2 on Trainium, and there's an irrecoverable Neuron error in one of the nodes. Before the plugin is deployed, the training job will become stuck, resulting in wasted time and computational costs. With the plugin deployed, the node problem detector will proactively remove the problem node from the cluster. The training scripts save checkpoints periodically so that the training will resume from the previous checkpoint.
The following screenshot shows example logs from a distributed training job.
The training has been started. (You can ignore loss=nan for now; it's a known issue and will be removed. For now, refer to the reduced_train_loss metric.)
The following screenshot shows the checkpoint created at step 77.
Training stopped after one of the nodes had a problem at step 86. The error was injected manually for testing.
After the faulty node was detected and replaced by the Neuron plugin for node problem detection and recovery, the training process resumed at step 77, which was the last checkpoint.
Although Auto Scaling groups will stop unhealthy nodes, they may encounter issues preventing the launch of replacement nodes. In such cases, training jobs will stall and require manual intervention. However, the stopped node will not incur further charges on the associated EC2 instance.
If you want to take custom actions in addition to stopping instances, you can create CloudWatch alarms watching the metrics NeuronHasError_DMA_ERROR, NeuronHasError_HANG_ON_COLLECTIVES, NeuronHasError_HBM_UNCORRECTABLE_ERROR, NeuronHasError_SRAM_UNCORRECTABLE_ERROR, and NeuronHasError_NC_UNCORRECTABLE_ERROR, and use a CloudWatch Metrics Insights query like SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck to sum up these values to evaluate the alarms. The following screenshots show an example.
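For example, an alarm on such a Metrics Insights query could be created with the AWS CLI roughly as follows; the alarm name, period, threshold, and SNS topic ARN are placeholders:

# Alarm whenever the averaged NeuronHasError_DMA_ERROR metric rises above 0
aws cloudwatch put-metric-alarm \
  --alarm-name neuron-dma-error-alarm \
  --comparison-operator GreaterThanThreshold \
  --threshold 0 \
  --evaluation-periods 1 \
  --metrics '[{"Id":"q1","Expression":"SELECT AVG(NeuronHasError_DMA_ERROR) FROM NeuronHealthCheck","Period":60,"ReturnData":true}]' \
  --alarm-actions arn:aws:sns:us-east-2:111122223333:neuron-alerts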
Clean up
To clean up all the provisioned resources for this post, run the cleanup script:
Conclusion
In this post, we showed how the Neuron problem detector and recovery DaemonSet for Amazon EKS works for EC2 instances powered by Trainium and AWS Inferentia. If you're running Neuron-based EC2 instances and using managed nodes or self-managed node groups, you can deploy the detector and recovery DaemonSet in your EKS cluster and benefit from improved reliability and fault tolerance of your machine learning training workloads in the event of node failure.
About the authors
Harish Rao is a senior solutions architect at AWS, specializing in large-scale distributed AI training and inference. He empowers customers to harness the power of AI to drive innovation and solve complex challenges. Outside of work, Harish embraces an active lifestyle, enjoying the tranquility of hiking, the intensity of racquetball, and the mental clarity of mindfulness practices.
Ziwen Ning is a software development engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with badminton, swimming, and various other sports, and immersing himself in music.
Geeta Gharpure is a senior software developer on the Annapurna ML engineering team. She is focused on running large-scale AI/ML workloads on Kubernetes. She lives in Sunnyvale, CA and enjoys listening to Audible in her free time.
Darren Lin is a Cloud Native Specialist Solutions Architect at AWS who focuses on domains such as Linux, Kubernetes, Container, Observability, and Open Source Technologies. In his spare time, he likes to work out and have fun with his family.