Maximize HyperPod Cluster utilization with HyperPod process governance fine-grained quota allocation

We’re excited to announce the final availability of fine-grained compute and reminiscence quota allocation with HyperPod process governance. With this functionality, prospects can optimize Amazon SageMaker HyperPod cluster utilization on Amazon Elastic Kubernetes Service (Amazon EKS), distribute truthful utilization, and help environment friendly useful resource allocation throughout totally different groups or tasks. For extra info, see HyperPod process governance greatest practices for maximizing the worth of SageMaker HyperPod process governance.

Compute quota administration is an administrative mechanism that units and controls compute useful resource limits throughout customers, groups, and tasks. It controls truthful useful resource distribution, stopping a single entity from monopolizing cluster assets, thereby optimizing general computational effectivity.

Due to price range constraints, prospects may need to allocate compute assets throughout a number of groups pretty. For instance, a knowledge scientist may want some GPUs (for instance, 4 H100 GPUs) for mannequin growth, however not all the occasion’s compute capability. In different circumstances, prospects have restricted compute assets however many groups, and so they need to pretty share compute assets throughout these groups, in order that no idle capability is left unused.

With HyperPod process governance, directors can now allocate granular GPU, vCPU, and vCPU reminiscence to groups and tasks—along with all the occasion assets—primarily based on their most popular technique. Key capabilities embrace GPU-level quota allocation by occasion sort and household, or {hardware} sort—supporting each Trainium and NVIDIA GPUs—and elective CPU and reminiscence allocation for fine-tuned useful resource management. Directors also can outline the burden (or precedence degree) a workforce is given for fair-share idle compute allocation.

“With all kinds of frontier AI knowledge experiments and manufacturing pipelines, with the ability to maximize SageMaker HyperPod Cluster utilization is extraordinarily excessive affect. This requires truthful and managed entry to shared assets like state-of-the-art GPUs, granular {hardware} allocation, and extra. That is precisely what HyperPod process governance is constructed for, and we’re excited to see AWS pushing environment friendly cluster utilization for quite a lot of AI use circumstances.”

– Daniel Xu, Director of Product at Snorkel AI, whose AI knowledge expertise platform empowers enterprises to construct specialised AI functions by leveraging their organizational experience at scale.

On this submit, we dive deep into the best way to outline quotas for groups or tasks primarily based on granular or instance-level allocation. We talk about totally different strategies to outline such insurance policies, and the way knowledge scientists can schedule their jobs seamlessly with this new functionality.

Resolution overview

Conditions

To comply with the examples on this weblog submit, you must meet the next stipulations:

To schedule and execute the instance jobs within the Submitting Duties part, additionally, you will want:

An area surroundings (both your native machine or a cloud-based compute surroundings), from which to run the HyperPod CLI and kubectl instructions, configured as follows:
HyperPod Coaching Operator put in within the cluster

Allocating granular compute and reminiscence quota utilizing the AWS console

Directors are the first persona interacting with SageMaker HyperPod process governance and are chargeable for managing cluster compute allocation in alignment with the group’s strategic priorities and targets.

Implementing this characteristic follows the acquainted compute allocation creation workflow of HyperPod process governance. To get began, check in to the AWS Administration Console and navigate to Cluster Administration beneath HyperPod Clusters within the Amazon SageMaker AI console. After choosing your HyperPod cluster, choose the Insurance policies tab within the cluster element web page. Navigate to Compute allocations and select Create.

As with present performance, you’ll be able to allow process prioritization and fair-share useful resource allocation by cluster insurance policies that prioritize crucial workloads and distribute idle compute throughout groups. By utilizing HyperPod process governance, you’ll be able to outline queue admission insurance policies (first-come-first-serve by default or process rating) and idle compute allocation strategies (first-come-first-serve or fair-share by default). Within the Compute allocation part, you’ll be able to create and edit allocations to distribute assets amongst groups, allow lending and borrowing of idle compute, configure preemption of low-priority duties, and assign fair-share weights.

The important thing innovation is within the Allocations part proven within the following determine, the place you’ll now discover fine-grained choices for useful resource allocation. Along with the prevailing instance-level quotas, now you can immediately specify GPU quotas by occasion sort and household or by {hardware} sort. Once you outline GPU allocations, HyperPod process governance intelligently calculates applicable default values for vCPUs and reminiscence that are set proportionally.

For instance, when allocating 2 GPUs from a single p5.48xlarge occasion (which has 8 GPUs, 192 vCPUs, and a pair of TiB reminiscence) in your HyperPod cluster, HyperPod process governance assigns 48 vCPUs and 512 GiB reminiscence as default values—which is equal to 1 quarter of the occasion’s whole assets. Equally, in case your HyperPod cluster comprises 2 ml.g5.2xlarge cases (every with 1 GPU, 8 vCPUs, and 32 GiB reminiscence), allocating 2 GPUs would routinely assign 16 vCPUs and 64 GiB reminiscence from each cases as proven within the following picture.

You’ll be able to both proceed with these routinely calculated default values or customise the allocation by manually adjusting the vCPUs and vCPU reminiscence fields as seen within the following picture.

Amazon SageMaker HyperPod helps clusters that embrace CPU-based cases, GPU-based cases, and AWS Neuron-based {hardware} (AWS Inferentia and AWS Trainium chips). You’ll be able to specify useful resource allocation in your workforce by cases, GPUs, vCPUs, vCPU reminiscence, or Neuron gadgets, as proven within the following picture.

Quota allocation will be greater than capability. Assets added to the compute allocation coverage that aren’t presently obtainable within the cluster symbolize planning for future capability upgrades. Jobs that require these unprovisioned assets will likely be routinely queued and stay in a pending state till the required assets grow to be obtainable. It’s necessary to grasp that in SageMaker HyperPod, compute allocations operate as quotas, that are verified throughout workload scheduling to grasp if a workload ought to be admitted or not, no matter precise capability availability. When useful resource requests are inside these outlined allocation limits and present utilization, the Kubernetes scheduler (kube-scheduler) handles the precise distribution and placement of pods throughout the HyperPod cluster nodes.

Allocating granular compute and reminiscence quota utilizing AWS CLI

You can even create or replace compute quotas utilizing the AWS CLI. The next is an instance for making a compute quota with solely GPU rely specification utilizing the AWS CLI:

aws sagemaker 
create-compute-quota 
--region  
--name "only-gpu-quota" 
--cluster-arn "arn:aws:sagemaker: ::cluster/" 
--description "take a look at description" 
--compute-quota-config "ComputeQuotaResources=[{InstanceType=ml.g6.12xlarge,Accelerators=2}],ResourceSharingConfig={Technique=LendAndBorrow,BorrowLimit=10}" 
--activation-state "Enabled" 
--compute-quota-target "TeamName=onlygputeam2,FairShareWeight=10"

Compute quotas may also be created with blended quota varieties, together with a sure variety of cases and granular compute assets, as proven within the following instance:

aws sagemaker 
create-compute-quota 
--region  
--name "mix-quota-type" 
--cluster-arn "arn:aws:sagemaker:::cluster/" 
--description "Blended quota allocation" 
--compute-quota-config "ComputeQuotaResources=[{InstanceType=ml.g6.12xlarge,Accelerators=2}, {InstanceType=ml.p5.48xlarge,Count=3}, {InstanceType=ml.c5.2xlarge,VCpu=2}],ResourceSharingConfig={Technique=LendAndBorrow,BorrowLimit=10}" 
--activation-state "Enabled" 
--compute-quota-target "TeamName=mixquotatype,FairShareWeight=10"

HyperPod process governance deep dive

SageMaker HyperPod process governance permits allocation of GPU, CPU, and reminiscence assets by integrating with Kueue, a Kubernetes-native system for job queueing.

Kueue doesn’t exchange present Kubernetes scheduling parts, however moderately integrates with the kube-scheduler, such that Kueue decides whether or not a workload ought to be admitted primarily based on the useful resource quotas and present utilization, after which the kube-scheduler takes care of pod placement on the nodes.

When a workload requests particular assets, Kueue selects an applicable useful resource taste primarily based on availability, node affinity, and job precedence. The scheduler then injects the corresponding node labels and tolerations into the PodSpec, permitting Kubernetes to put the pod on nodes with the requested {hardware} configuration. This helps exact useful resource governance and environment friendly allocation for multi-tenant clusters.

When a SageMaker HyperPod process governance compute allocation is created, Kueue creates ClusterQueues that outline useful resource quotas and scheduling insurance policies, together with ResourceFlavors for the chosen occasion varieties with their distinctive useful resource traits.

For instance, the next compute allocation coverage allocates ml.g6.12xlarge cases with 2 GPUs and 48 vCPUs to the onlygputeam workforce, implementing a LendAndBorrow technique with an as much as 50% borrowing restrict. This configuration permits versatile useful resource sharing whereas sustaining precedence by a justifiable share weight of 10 and the power to preempt decrease precedence duties from different groups.

aws sagemaker describe-compute-quota                        
--region  
--compute-quota-id 

#output
{
    "ComputeQuotaArn": "arn:aws:sagemaker:::compute-quota/",
    "ComputeQuotaId": "",
    "Title": "only-gpu-quota",
    "Description": "Solely GPU quota allocation",
    "ComputeQuotaVersion": 1,
    "Standing": "Created",
    "ClusterArn": "arn:aws:sagemaker:::cluster/",
    "ComputeQuotaConfig": {
        "ComputeQuotaResources": [
            {
                "InstanceType": "ml.g6.12xlarge",
                "Accelerators": 2,
                "VCpu": 48.0
            }
        ],
        "ResourceSharingConfig": {
            "Technique": "LendAndBorrow",
            "BorrowLimit": 50
        },
        "PreemptTeamTasks": "LowerPriority"
    },
    "ComputeQuotaTarget": {
        "TeamName": "onlygputeam",
        "FairShareWeight": 10
    },
    "ActivationState": "Enabled",
    "CreationTime": "2025-07-24T11:12:12.021000-07:00",
    "CreatedBy": {},
    "LastModifiedTime": "2025-07-24T11:15:45.205000-07:00",
    "LastModifiedBy": {}
}

The corresponding Kueue ClusterQueue is configured with the ml.g6.12xlarge taste, offering quotas for two NVIDIA GPUs, 48 CPU cores, and 192 Gi reminiscence.

kubectl describe clusterqueue hyperpod-ns-onlygputeam-clusterqueue

# output
Title:         hyperpod-ns-onlygputeam-clusterqueue
Namespace:
Labels:       sagemaker.amazonaws.com/quota-allocation-id=onlygputeam
              sagemaker.amazonaws.com/sagemaker-managed-queue=true
Annotations:  
API Model:  kueue.x-k8s.io/v1beta1
Variety:         ClusterQueue
Metadata:
  ...
Spec:
  Cohort:  shared-pool
  Honest Sharing:
    Weight:  10
  Taste Fungibility:
    When Can Borrow:   TryNextFlavor
    When Can Preempt:  TryNextFlavor
  Namespace Selector:
    Match Labels:
      kubernetes.io/metadata.title:  hyperpod-ns-onlygputeam
  Preemption:
    Borrow Inside Cohort:
      Coverage:               LowerPriority
    Reclaim Inside Cohort:  Any
    Inside Cluster Queue:   LowerPriority
  Queueing Technique:        BestEffortFIFO
  Useful resource Teams:
    Lined Assets:
      nvidia.com/gpu
      aws.amazon.com/neurondevice
      cpu
      reminiscence
      vpc.amazonaws.com/efa
    Flavors:
      Title:  ml.g6.12xlarge
      Assets:
        Borrowing Restrict:  1
        Title:             nvidia.com/gpu
        Nominal Quota:    2
        Borrowing Restrict:  0
        Title:             aws.amazon.com/neurondevice
        Nominal Quota:    0
        Borrowing Restrict:  24
        Title:             cpu
        Nominal Quota:    48
        Borrowing Restrict:  96Gi
        Title:             reminiscence
        Nominal Quota:    192Gi
        Borrowing Restrict:  0
        Title:             vpc.amazonaws.com/efa
        Nominal Quota:    1
    ...

A Kueue LocalQueue will likely be additionally created, and can reference the corresponding ClusterQueue. The LocalQueue acts because the namespace-scoped useful resource by which customers can submit workloads, and these workloads are then admitted and scheduled in line with the quotas and insurance policies outlined within the ClusterQueue.

kubectl describe localqueue hyperpod-ns-onlygputeam-localqueue -n hyperpod-ns-onlygputeam

# output
Title:         hyperpod-ns-onlygputeam-localqueue
Namespace:    hyperpod-ns-onlygputeam
Labels:       sagemaker.amazonaws.com/quota-allocation-id=onlygputeam
              sagemaker.amazonaws.com/sagemaker-managed-queue=true
Annotations:  
API Model:  kueue.x-k8s.io/v1beta1
Variety:         LocalQueue
Metadata:
    ...
Spec:
  Cluster Queue:  hyperpod-ns-onlygputeam-clusterqueue
  Cease Coverage:    None
Standing:
  Admitted Workloads:  0

Submitting duties

There are two methods to submit duties on Amazon EKS orchestrated SageMaker HyperPod clusters: the SageMaker HyperPod CLI and the Kubernetes command-line device, kubectl. With each choices, knowledge scientists must reference their workforce’s namespace and process precedence class—along with the requested GPU and vCPU compute and reminiscence assets—to make use of their granular allotted quota with applicable prioritization. If the consumer doesn’t specify a precedence class, then SageMaker HyperPod process governance will routinely assume the bottom precedence. The particular GPU sort comes from an occasion sort choice, as a result of knowledge scientists need to use GPUs with sure capabilities (for instance, H100 as an alternative of H200) to carry out their duties effectively.

HyperPod CLI

The HyperPod CLI was created to summary the complexities of working with kubectl and in order that builders utilizing SageMaker HyperPod can iterate quicker with customized instructions.The next is an instance of a job submission with the HyperPod CLI requesting each compute and reminiscence assets:

hyp create hyp-pytorch-job 
--job-name sample-job1 
--image  .dkr.ecr..amazonaws.com/: 
--pull-policy "All the time" 
--tasks-per-node 1 
--max-retry 1 
--priority high-priority 
--namespace hyperpod-ns-team1 
--queue-name hyperpod-ns-team1-localqueue 
--instance-type ml.g5.8xlarge 
--accelerators 1 
--vcpu 4 
--memory 1 
--accelerators-limit 1 
--vcpu-limit 5 
--memory-limit 2

The highlighted parameters allow requesting granular compute and reminiscence assets. The HyperPod CLI requires to put in the HyperPod Coaching Operator within the cluster after which construct a container picture that features the HyperPod Elastic Agent. For additional directions on the best way to construct such container picture, please seek advice from the HyperPod Coaching Operator documentation.

For extra info on the supported HyperPod CLI arguments and associated description, see the SageMaker HyperPod CLI reference documentation.

Kubectl

The next is an instance of a kubectl command to submit a job to the HyperPod cluster utilizing the required queue. It is a easy instance of a PyTorch job that may examine for GPU availability after which sleep for five minutes. Compute and reminiscence assets are requested utilizing the usual Kubernetes useful resource administration constructs.

apiVersion: batch/v1 
type: Job 
metadata: 
  title: gpu-training-job 
  namespace: hyperpod-ns-team1 
spec: 
  parallelism: 1 
  completions: 1 
  droop: true 
  template: 
    metadata: 
      labels: 
        kueue.x-k8s.io/queue-name: hyperpod-ns-team1-localqueue
        kueue.x-k8s.io/priority-class: high-priority
    spec: 
      containers: 
      - title: training-container 
        picture: pytorch/pytorch:2.0.0-cuda11.7-cudnn8-runtime
        command: 
        - "python"
        - "-c"
        - "import torch; print('GPU obtainable:', torch.cuda.is_available()); import time; time.sleep(15)" 
        assets: 
          requests:  
            nvidia.com/gpu: 1  
            cpu: "4"  
            reminiscence: "1Gi"  
          limits:  
            nvidia.com/gpu: 1         
      restartPolicy: By no means

Pattern instructions

Following is a brief reference information for useful instructions when interacting with SageMaker HyperPod process governance:

Describing cluster coverage with the AWS CLI – This AWS CLI command is beneficial for viewing the cluster coverage settings in your cluster.
Checklist compute quota allocations with the AWS CLI – Use this AWS CLI command to view the totally different groups and arrange process governance and their respective quota allocation settings.
HyperPod CLI – The HyperPod CLI abstracts frequent kubectl instructions used to work together with SageMaker HyperPod clusters equivalent to submitting, itemizing, and cancelling duties. See the SageMaker HyperPod CLI reference documentation for a full listing of instructions.
kubectl – You can even use kubectl to work together with process governance; some helpful instructions are:

kubectl get workloads -n hyperpod-ns- kubectl describe workload -n hyperpod-ns-. These instructions present the workloads operating in your cluster per namespace and supply detailed reasonings on Kueue admission. You should use these instructions to reply questions equivalent to “Why was my process preempted?” or “Why did my process get admitted?”

Frequent situations

A standard use case for extra granular allocation of GPU compute is fine-tuning small and medium sized massive language fashions (LLMs). A single H100 or H200 GPU is likely to be enough to deal with such a use case (additionally relying on the chosen batch measurement and different elements), and machine studying (ML) platform directors can select to allocate a single GPU to every knowledge scientist or ML researcher to optimize the utilization of an occasion like ml.p5.48xlarge, which comes with 8 H100 GPUs onboard.

Small language fashions (SLMs) have emerged as a big development in generative AI, providing decrease latency, decreased deployment prices, and enhanced privateness capabilities whereas sustaining spectacular efficiency on focused duties, making them more and more important for agentic workflows and edge computing situations. The brand new SageMaker HyperPod process governance with fine-grained GPU, CPU, and reminiscence allocation considerably enhances SLM growth by enabling exact matching of assets to mannequin necessities, permitting groups to effectively run a number of experiments concurrently with totally different architectures. This useful resource optimization is especially helpful as organizations develop specialised SLMs for domain-specific functions, with priority-based scheduling in order that crucial mannequin coaching jobs obtain assets first whereas maximizing general cluster utilization. By offering precisely the fitting assets on the proper time, HyperPod accelerates the event of specialised, domain-specific SLMs that may be deployed as environment friendly brokers in advanced workflows, enabling extra responsive and cost-effective AI options throughout industries.

With the rising reputation of SLMs, organizations can use granular quota allocation to create focused quota insurance policies that prioritize GPU assets, addressing the budget-sensitive nature of ML infrastructure the place GPUs symbolize essentially the most important price and efficiency issue. Organizations can now selectively apply CPU and reminiscence limits the place wanted, making a granular useful resource administration strategy that effectively helps various machine studying workloads no matter mannequin measurement.

Equally, to help inference workloads, a number of groups won’t require a complete occasion to deploy their fashions, serving to to keep away from having whole cases outfitted with a number of GPUs allotted to every workforce and leaving GPU compute sitting idle.

Lastly, throughout experimentation and algorithm growth, knowledge scientists and ML researchers can select to deploy a container internet hosting their most popular IDE on HyperPod, like JupyterLab or Code-OSS (Visible Studio Code open supply). On this situation, they typically experiment with smaller batch sizes earlier than scaling to multi-GPU configurations, therefore not needing whole multi-GPU cases to be allotted.Comparable concerns apply to CPU cases; for instance, an ML platform administrator may resolve to make use of CPU cases for IDE deployment, as a result of knowledge scientists want to scale their coaching or fine-tuning with jobs moderately than experimenting with the native IDE compute. In such circumstances, relying on the cases of selection, partitioning CPU cores throughout the workforce is likely to be helpful.

Conclusion

The introduction of fine-grained compute quota allocation in SageMaker HyperPod represents a big development in ML infrastructure administration. By enabling GPU-level useful resource allocation alongside instance-level controls, organizations can now exactly tailor their compute assets to match their particular workloads and workforce constructions.

This granular strategy to useful resource governance addresses crucial challenges confronted by ML groups immediately, balancing price range constraints, maximizing costly GPU utilization, and making certain truthful entry throughout knowledge science groups of all sizes. Whether or not fine-tuning SLMs that require single GPUs, operating inference workloads with diversified useful resource wants, or supporting growth environments that don’t require full occasion energy, this versatile functionality helps make sure that no compute assets sit idle unnecessarily.

ML workloads proceed to diversify of their useful resource necessities and SageMaker HyperPod process governance now offers the adaptability organizations must optimize their GPU capability investments. To study extra, go to the SageMaker HyperPod product web page and HyperPod process governance documentation.

Give this a attempt within the Amazon SageMaker AI console and depart your feedback right here.

In regards to the authors

Siamak Nariman is a Senior Product Supervisor at AWS. He’s centered on AI/ML expertise, ML mannequin administration, and ML governance to enhance general organizational effectivity and productiveness. He has intensive expertise automating processes and deploying varied applied sciences.

Zhenshan Jin is a Senior Software program Engineer at Amazon Internet Providers (AWS), the place he leads software program growth for process governance on SageMaker HyperPod. In his function, he focuses on empowering prospects with superior AI capabilities whereas fostering an surroundings that maximizes engineering workforce effectivity and productiveness.

Giuseppe Angelo Porcelli is a Principal Machine Studying Specialist Options Architect for Amazon Internet Providers. With a number of years of software program engineering and an ML background, he works with prospects of any measurement to grasp their enterprise and technical wants and design AI and ML options that make the very best use of the AWS Cloud and the Amazon Machine Studying stack. He has labored on tasks in numerous domains, together with MLOps, laptop imaginative and prescient, and NLP, involving a broad set of AWS providers. In his free time, Giuseppe enjoys taking part in soccer.

Sindhura Palakodety is a Options Architect at AWS. She is enthusiastic about serving to prospects construct enterprise-scale Nicely-Architected options on the AWS platform and specializes within the knowledge analytics area.

Maximize HyperPod Cluster utilization with HyperPod process governance fine-grained quota allocation

The Finish-to-Finish Information Scientist’s Immediate Playbook

Implementing the Gaussian Problem in Python

Implementing the Gaussian Problem in Python

Leave a Reply Cancel reply

Popular News

How Aviva constructed a scalable, safe, and dependable MLOps platform utilizing Amazon SageMaker

Speed up edge AI improvement with SiMa.ai Edgematic with a seamless AWS integration

Unlocking Japanese LLMs with AWS Trainium: Innovators Showcase from the AWS LLM Growth Assist Program

The Journey from Jupyter to Programmer: A Fast-Begin Information

The right way to run Qwen 2.5 on AWS AI chips utilizing Hugging Face libraries

About Us

Category

Recent Posts