Training and deploying large AI models requires advanced distributed computing capabilities, but managing these distributed systems shouldn't be complex for data scientists and machine learning (ML) practitioners. The command line interface (CLI) and software development kit (SDK) for Amazon SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS) orchestration simplify how you manage cluster infrastructure and use the service's distributed training and inference capabilities.
The SageMaker HyperPod CLI gives data scientists an intuitive command-line experience, abstracting away the underlying complexity of distributed systems. Built on top of the SageMaker HyperPod SDK, the CLI offers straightforward commands for managing HyperPod clusters and common workflows like launching training or fine-tuning jobs, deploying inference endpoints, and monitoring cluster performance. This makes it ideal for rapid experimentation and iteration.
A layered architecture for simplicity
The HyperPod CLI and SDK follow a multi-layered, shared architecture. The CLI and the Python module serve as user-facing entry points and are both built on top of common SDK components to provide consistent behavior across interfaces. For infrastructure automation, the SDK orchestrates cluster lifecycle management through a combination of AWS CloudFormation stack provisioning and direct AWS API interactions. Training and inference workloads and integrated development environments (IDEs) (Spaces) are expressed as Kubernetes Custom Resource Definitions (CRDs), which the SDK manages through the Kubernetes API.
In this post, we demonstrate how to use the CLI and the SDK to create and manage SageMaker HyperPod clusters in your AWS account. We walk through a practical example and dive deeper into the user workflow and parameter choices.
This post focuses on cluster creation and management. For a deep dive into using the HyperPod CLI and SDK to submit training jobs and deploy inference endpoints, see our companion post: Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK.
Prerequisites
To follow the examples in this post, you must have the following prerequisites:
Install the SageMaker HyperPod CLI
First, install the latest version of the SageMaker HyperPod CLI and SDK. The examples in this post are based on version 3.5.0. From your local environment, run the following command (you can alternatively install the CLI in a Python virtual environment):
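The following is a minimal install command, assuming the sagemaker-hyperpod package referenced later in this post is the one to install from PyPI; the version pin is optional and only mirrors the version used in our examples:

pip install "sagemaker-hyperpod>=3.5.0"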
This command sets up the tools needed to interact with SageMaker HyperPod clusters. For an existing installation, make sure you have the latest version of the package installed (SageMaker HyperPod 3.5.0 or later) to be able to use the relevant set of features described in this post. To verify that the CLI is installed correctly, run the hyp command and check the output:
The output will be similar to the following and includes instructions on how to use the CLI:
For more information on CLI usage and the available commands and their respective parameters, see the CLI reference documentation.
The HyperPod CLI provides commands to manage the full lifecycle of HyperPod clusters. The following sections explain how to create new clusters, monitor their creation, modify instance groups, and delete clusters.
Creating a new HyperPod cluster
HyperPod clusters can be created through the AWS Management Console or the HyperPod CLI, both of which provide streamlined experiences for cluster creation. The console offers the simplest and most guided approach, while the CLI is especially helpful for customers who prefer a programmatic experience, for example to enable reproducibility or to build automation around cluster creation. Both methods use the same underlying CloudFormation template, which is available in the SageMaker HyperPod cluster setup GitHub repository. For a walkthrough of the console-based experience, see the cluster creation experience blog post.
Creating a new cluster through the HyperPod CLI follows a configuration-based workflow: the CLI first generates configuration files, which are then edited to match the intended cluster specifications. These files are subsequently submitted as a CloudFormation stack that creates the HyperPod cluster along with the required resources, such as a VPC and an FSx for Lustre file system, among others. To initialize a new cluster configuration, run the following command:
hyp init cluster-stack
This initializes a new cluster configuration in the current directory and generates a config.yaml file that you can use to specify the configuration of the cluster stack. Additionally, it creates a README.md with information about the functionality and workflow, along with a template for the CloudFormation stack parameters in cfn_params.jinja.
The cluster stack's configuration variables are defined in config.yaml. The following is an excerpt from the file:
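The excerpt below is a minimal illustration limited to the two parameters discussed in this post; the prefix value is a placeholder, and your generated file will contain additional fields:

resource_name_prefix: hyperpod-cli-demo
kubernetes_version: "1.33"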
The resource_name_prefix parameter serves as the primary identifier for the AWS resources created during deployment. Each deployment should use a unique resource name prefix to avoid conflicts. The value of the prefix parameter is automatically appended with a unique identifier during cluster creation to provide resource uniqueness.
The configuration can be edited either directly, by opening config.yaml in an editor of your choice, or by running the hyp configure command. The following example shows how to specify the Kubernetes version of the Amazon EKS cluster that will be created by the stack:
hyp configure --kubernetes-version 1.33
Updating variables through the CLI commands provides added safety by performing validation against the defined schema before setting the value in config.yaml.
Besides the Kubernetes version and the resource name prefix, some examples of significant parameters are listed below:
There are two important nuances when updating the configuration values through hyp configure commands:
- Underscores (_) in variable names inside config.yaml become hyphens (-) in the CLI commands. Thus, kubernetes_version in config.yaml is configured via hyp configure --kubernetes-version in the CLI.
- Variables that contain lists of entries inside config.yaml are configured as JSON lists in the CLI command. For example, multiple instance groups are configured inside config.yaml as the following:
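(The snippet below is an illustrative sketch; the authoritative field names and structure are defined in the generated config.yaml and its accompanying README.md.)

instance_group_settings:
  - InstanceGroupName: worker-group-1
    InstanceType: ml.g5.8xlarge
    InstanceCount: 2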
This translates to the following CLI command:
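Following the underscore-to-hyphen mapping described above, the flag would be --instance-group-settings; treat this flag name and the JSON keys as assumptions and confirm them with hyp configure --help:

hyp configure --instance-group-settings '[{"InstanceGroupName": "worker-group-1", "InstanceType": "ml.g5.8xlarge", "InstanceCount": 2}]'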
After you're done making the desired changes, validate your configuration file by running the following command:
hyp validate
This validates the parameters in config.yaml against the defined schema. If successful, the CLI will output the following:
The cluster creation stack can be submitted to CloudFormation by running the following command:
hyp create --region
The hyp create command performs validation and injects values from config.yaml into the cfn_params.jinja template. If no AWS Region is explicitly provided, the command uses the default Region from your AWS credentials configuration. The resolved configuration file and CloudFormation template values are saved to a timestamped subdirectory under the ./run/ directory, providing a lightweight local versioning mechanism to track which configuration was used to create a cluster at a given point in time. You can also choose to commit these artifacts to your version control system to improve reproducibility and auditability. If successful, the command outputs the CloudFormation stack ID:
Monitoring the HyperPod cluster creation process
You can list the existing CloudFormation stacks by running the following command:
hyp list cluster-stack --region
You can optionally filter the output by stack status by adding the following flag: --status "['CREATE_COMPLETE', 'UPDATE_COMPLETE']".
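For example, a filtered listing (with us-west-2 used only as a placeholder Region) looks like the following:

hyp list cluster-stack --region us-west-2 --status "['CREATE_COMPLETE', 'UPDATE_COMPLETE']"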
The output of this command will look similar to the following:
Depending on the configuration in config.yaml, multiple nested stacks are created that cover different parts of the HyperPod cluster setup, such as the EKSClusterStack, the FsxStack, and the VPCStack.
You can use the describe command to view details about any of the individual stacks:
hyp describe cluster-stack
The output for an example substack, the S3EndpointStack, will look like the following:
If any of the stacks show a CREATE_FAILED, ROLLBACK_*, or DELETE_* status, open the CloudFormation page in the console to investigate the root cause. Failed cluster creation stacks are often related to insufficient service quotas for the cluster itself, the instance groups, or the network components such as VPCs or NAT gateways. Check the corresponding SageMaker HyperPod quotas to learn more about the required quotas for SageMaker HyperPod.
Connecting to a cluster
After the cluster stack has successfully created the required resources and the status has changed to CREATE_COMPLETE, you can configure the CLI and your local Kubernetes environment to interact with the HyperPod cluster:
hyp set-cluster-context --cluster-name
The --cluster-name option specifies the name of the HyperPod cluster to connect to, and the --region option specifies the Region where the cluster has been created. Optionally, a specific namespace can be configured using the --namespace parameter. The command updates your local Kubernetes config in ~/.kube/config, so you can use both the HyperPod CLI and Kubernetes utilities such as kubectl to manage the resources in your HyperPod cluster.
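For example, using a hypothetical cluster name and Region, you can set the context and then confirm connectivity with kubectl:

hyp set-cluster-context --cluster-name my-hyperpod-cluster --region us-west-2
kubectl get nodes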
See our companion blog post for further details about how to use the CLI to submit training jobs and inference deployments to your newly created HyperPod cluster: Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK.
Modifying an existing HyperPod cluster
The HyperPod CLI provides a command to modify the instance groups and node recovery mode of an existing HyperPod cluster through the hyp update cluster command. This can be helpful if you need to scale your cluster by adding or removing worker nodes, or if you want to change the instance types used by the node groups.
To update the instance groups, run the following command, adapted with your cluster name and desired instance group settings:
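The following is a minimal sketch; the --cluster-name and --instance-groups flag names and the instance group JSON keys are assumptions based on the command's description, so confirm the exact interface with hyp update cluster --help:

hyp update cluster --cluster-name my-hyperpod-cluster --instance-groups '[{"InstanceGroupName": "worker-group-1", "InstanceType": "ml.g5.8xlarge", "InstanceCount": 4}]' --node-recovery Automatic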
Note that all of the fields in the preceding command are required to run the update command, even if, for example, only the instance count is changed. You can list the current cluster and instance group configurations to obtain the required values by running the hyp describe cluster command.
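For example, assuming the command accepts a --cluster-name flag like the other cluster commands:

hyp describe cluster --cluster-name my-hyperpod-cluster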
The output of the update command will look like the following:
The --node-recovery option lets you configure the node recovery behavior, which can be set to either Automatic or None. For information about the SageMaker HyperPod automatic node recovery feature, see Automatic node recovery.
Deleting an existing HyperPod cluster
To delete an existing HyperPod cluster, run the following command. Note that this action is not reversible:
hyp delete cluster-stack
This command removes the specified CloudFormation stack and the associated AWS resources. You can use the optional --retain-resources flag to specify a comma-separated list of logical resource IDs to retain during the deletion process. It's important to carefully consider which resources you need to retain, because the delete operation can't be undone.
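For example, to delete the cluster stack while keeping the FSx for Lustre file system, you might pass its logical resource ID; the stack-selection flag and the logical ID below are assumptions, so use the values from your own stack:

hyp delete cluster-stack --stack-name hyperpod-cli-demo-stack --region us-west-2 --retain-resources FsxStack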
The output of this command will look like the following, asking you to confirm the resource deletion:
SageMaker HyperPod SDK
SageMaker HyperPod also includes a Python SDK for programmatic access to the features described earlier. The Python SDK is used by the CLI commands and is installed when you install the sagemaker-hyperpod Python package as described at the beginning of this post. The HyperPod CLI is best suited for users who prefer a streamlined, interactive experience for common HyperPod management tasks like creating and monitoring clusters, training jobs, and inference endpoints. It's particularly useful for quick prototyping, experimentation, and automating repetitive HyperPod workflows through scripts or continuous integration and delivery (CI/CD) pipelines. In contrast, the HyperPod SDK provides more programmatic control and flexibility, making it the preferred choice when you need to embed HyperPod functionality directly into your application, integrate with other AWS or third-party services, or build complex, customized HyperPod management workflows. Consider the complexity of your use case, the need for automation and integration, and your team's familiarity with programming languages when deciding whether to use the HyperPod CLI or SDK.
The SageMaker HyperPod CLI GitHub repository shows examples of how cluster creation and management can be implemented using the Python SDK.
Conclusion
The SageMaker HyperPod CLI and SDK simplify cluster creation and management. With the examples in this post, we've demonstrated how these tools provide value through:
- Simplified lifecycle management – From initial configuration to cluster updates and cleanup, the CLI aligns with how teams manage long-running training and inference environments and abstracts away unnecessary complexity.
- Declarative control when needed – The SDK exposes the underlying configuration model, so that teams can codify cluster specifications, instance groups, storage file systems, and more.
- Integrated observability – Visibility into CloudFormation stacks is available without switching tools, supporting smooth iteration during development and operation.
Getting started with these tools is as simple as installing the SageMaker HyperPod package. The SageMaker HyperPod CLI and SDK provide the right level of abstraction for both data scientists looking to quickly experiment with distributed training and ML engineers building production systems.
If you're interested in how to use the HyperPod CLI and SDK for submitting training jobs and deploying models to your new cluster, make sure to check out our companion blog post: Train and deploy models on Amazon SageMaker HyperPod using the new HyperPod CLI and SDK.
About the authors


