Increasingly, organizations across industries are turning to generative AI foundation models (FMs) to enhance their applications. To achieve optimal performance for specific use cases, customers are adopting and adapting these FMs to their unique domain requirements. This need for customization has become even more pronounced with the emergence of new models, such as those released by DeepSeek.
However, customizing DeepSeek models effectively while managing computational resources remains a significant challenge. Tuning the model architecture, setting training and fine-tuning parameters, and managing distributed training infrastructure all require deep technical expertise. This often forces companies to choose between model performance and practical implementation constraints, creating a critical need for more accessible and streamlined model customization solutions.
In this two-part series, we discuss how you can reduce DeepSeek model customization complexity by using the pre-built fine-tuning workflows (also called "recipes") for both the DeepSeek-R1 model and its distilled variants, released as part of Amazon SageMaker HyperPod recipes.
In this first post, we build a solution architecture for fine-tuning DeepSeek-R1 distilled models and demonstrate the approach with a step-by-step example of customizing the DeepSeek-R1 Distill Qwen 7B model using recipes, achieving an average improvement of 25% across the ROUGE scores, with a maximum of 49% on the ROUGE-2 score, using both SageMaker HyperPod and SageMaker training jobs. The second part of the series will focus on fine-tuning the DeepSeek-R1 671B model itself.
At the time of this writing, the DeepSeek-R1 model and its distilled variants for Llama and Qwen were the latest released recipes. Check out sagemaker-hyperpod-recipes on GitHub for the latest released recipes, including support for fine-tuning the DeepSeek-R1 671B parameter model.
Amazon SageMaker HyperPod recipes
At re:Invent 2024, we announced the general availability of Amazon SageMaker HyperPod recipes. SageMaker HyperPod recipes help data scientists and developers of all skill sets get started training and fine-tuning popular publicly available generative AI models in minutes with state-of-the-art training performance. These recipes include a training stack validated by Amazon Web Services (AWS), which removes the tedious work of experimenting with different model configurations, minimizing the time it takes for iterative evaluation and testing. They automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.
Recipes, paired with the resilient infrastructure of AWS (Amazon SageMaker HyperPod and Amazon SageMaker Model Training), provide a resilient training environment for fine-tuning FMs such as DeepSeek-R1 with out-of-the-box customization.
To help customers quickly use DeepSeek's powerful and cost-efficient models to accelerate generative AI innovation, we released new recipes to fine-tune six DeepSeek models, including the DeepSeek-R1 distilled Llama and Qwen models, using supervised fine-tuning (SFT), Quantized Low-Rank Adaptation (QLoRA), and Low-Rank Adaptation (LoRA) techniques. In this post, we introduce these new recipes and walk you through a solution to fine-tune a DeepSeek Qwen 7B model for an advanced medical reasoning use case.
Solution overview
At its core, as depicted in the following diagram, the recipe architecture implements a hierarchical workflow that begins with a recipe specification: a comprehensive configuration defining the training parameters, model architecture, and distributed training strategies. These recipes are processed through the HyperPod recipe launcher, which serves as the orchestration layer responsible for launching a job on the corresponding architecture. The launcher interfaces with underlying cluster management systems such as SageMaker HyperPod (Slurm or Kubernetes) or training jobs, which handle resource allocation and scheduling. It's a familiar NeMo-style launcher with which you can choose a recipe and run it on your infrastructure of choice (SageMaker HyperPod or training jobs).
For example, after choosing your recipe, you can pre-train or fine-tune a model by running python3 main.py recipes=recipe-name. Alternatively, you can use a launcher script, which is a bash script preconfigured to run the chosen training or fine-tuning job on your cluster. You can check out main.py (the NeMo style launcher) and the launcher scripts for DeepSeek in the GitHub repository hosting the SageMaker HyperPod recipes. A sketch of such an invocation follows.
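For illustration, a recipe invocation with a few command-line overrides might look like the following sketch. The recipe name and override keys mirror the launcher script shown later in this post; the data paths are placeholders to replace with your own locations.
# Run a recipe through the NeMo-style launcher, overriding selected recipe parameters (paths are placeholders)
python3 main.py \
    recipes=fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq16k_gpu_fine_tuning \
    recipes.trainer.num_nodes=1 \
    recipes.model.data.train_dir="/path/to/train" \
    recipes.model.data.val_dir="/path/to/val"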
A key component of this architecture is the HyperPod training adapter for NeMo, which is built on the NVIDIA NeMo framework and the Neuronx Distributed training package. It loads data, creates models, and facilitates efficient data parallelism, model parallelism, and hybrid parallelism strategies, enabling optimal utilization of computational resources across the distributed infrastructure. The architecture's modular design allows for scalability and flexibility, making it particularly effective for training LLMs that require distributed computing capabilities.
You can run these recipes using SageMaker HyperPod or as SageMaker training jobs. For organizations that require granular control over training infrastructure and extensive customization options, SageMaker HyperPod is the ideal choice. SageMaker training jobs, on the other hand, are tailored for organizations that want a fully managed experience for their training workflows. To learn more about these service features, refer to Generative AI foundation model training on Amazon SageMaker.
In the following sections, we go over the solution architecture for these services before presenting a step-by-step implementation example for each.
SageMaker HyperPod
To submit jobs using SageMaker HyperPod, you can use the HyperPod recipes launcher, which provides a straightforward mechanism to run recipes on both Slurm and Kubernetes. After you choose your orchestrator, you can choose your recipe's launcher and have it run on your HyperPod cluster. The launcher will interface with your cluster using Slurm or Kubernetes native constructs. For this post, we use the HyperPod recipes launcher mechanism to run the training on a Slurm cluster. The following image shows the solution architecture for SageMaker HyperPod.
SageMaker training jobs
The workflow for SageMaker training jobs begins with an API request that interfaces with the SageMaker control plane, which manages the orchestration of training resources. The system uses the training jobs launcher to efficiently run workloads on a managed cluster.
The architecture uses Amazon Elastic Container Registry (Amazon ECR) for container image management. Training jobs are executed across a distributed cluster, with seamless integration to multiple storage solutions, including Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre. All of this runs under the SageMaker managed environment, providing optimal resource utilization and security.
This design simplifies the complexity of distributed training while maintaining the flexibility needed for diverse machine learning (ML) workloads, making it an ideal solution for enterprise AI development. The following image shows the solution architecture for SageMaker training jobs.
Solution walkthrough
For this solution, consider a use case for a healthcare industry startup that aims to create an accurate, medically verified chat assistant application that bridges complex medical information with patient-friendly explanations. By fine-tuning DeepSeek-R1 Distill Qwen 7B using the FreedomIntelligence/medical-o1-reasoning-SFT dataset, you can use its medical reasoning capabilities to produce content that maintains clinical accuracy.
Prerequisites
You need to complete the following prerequisites before you can run the DeepSeek-R1 Distill Qwen 7B model fine-tuning notebook.
- Make the following quota increase requests for SageMaker. You need to request a minimum of one p4d.24xlarge instance (with 8 x NVIDIA A100 GPUs), up to a maximum of two p4d.24xlarge instances (depending on the time-to-train and cost-to-train trade-offs for your use case). On the Service Quotas console, request the following SageMaker quotas:
  - P4 instances (p4d.24xlarge) for training job usage: 1–2
  - P4 instances (p4d.24xlarge) for HyperPod clusters ("ml.p4d.24xlarge for cluster usage"): 1–2
- If you choose to use HyperPod clusters to run your training, set up a HyperPod Slurm cluster by following the documentation at Tutorial for getting started with SageMaker HyperPod. Alternatively, you can use the AWS CloudFormation template provided in the AWS Workshop Studio at Amazon SageMaker HyperPod Own Account and follow the instructions to set up a cluster and a development environment to access and submit jobs to the cluster.
- (Optional) If you choose to use SageMaker training jobs, you can create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. (You can use JupyterLab in your local setup, too.)
- Clone the GitHub repository with the assets for this deployment. This repository includes a notebook that references training assets; an example clone step follows this list.
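The following commands are a sketch of the clone step. The repository URL is an assumption based on the sagemaker-distributed-training-workshop repository referenced later in this post; adjust the URL and directory to the repository you are using.
# Clone the repository containing the fine-tuning notebook and training assets (URL assumed)
git clone https://github.com/aws-samples/sagemaker-distributed-training-workshop.git
cd sagemaker-distributed-training-workshop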
Next, we run the model_trainer_deepseek_r1_recipe_lora.ipynb notebook to fine-tune the DeepSeek-R1 model using QLoRA on SageMaker.
Prepare the dataset
To prepare the dataset, you need to load the FreedomIntelligence/medical-o1-reasoning-SFT dataset, tokenize and chunk the dataset, and configure the data channels for SageMaker training on Amazon S3. Complete the following steps:
- Format the dataset by applying the prompt format for DeepSeek-R1 Distill Qwen 7B:
def generate_prompt(data_point):
    full_prompt = f"""
Below is an instruction that describes a task, paired with an input that provides further context.
Write a response that appropriately completes the request.
Before answering, think carefully about the question and create a step-by-step chain of thoughts to ensure a logical and accurate response.

### Instruction:
You are a medical expert with advanced knowledge in clinical reasoning, diagnostics, and treatment planning.
Please answer the following medical question.

### Question:
{data_point["Question"]}

### Response:
{data_point["Complex_CoT"]}
"""
    return {"prompt": full_prompt.strip()}
- Load the FreedomIntelligence/medical-o1-reasoning-SFT dataset and split it into training and validation datasets:
# Load dataset from the hub
train_set = load_dataset(dataset_name, 'en', split="train[5%:]")
test_set = load_dataset(dataset_name, 'en', split="train[:5%]")

...

train_dataset = train_set.map(
    generate_and_tokenize_prompt,
    remove_columns=columns_to_remove,
    batched=False
)

test_dataset = test_set.map(
    generate_and_tokenize_prompt,
    remove_columns=columns_to_remove,
    batched=False
)
- Load the DeepSeek-R1 Distill Qwen 7B tokenizer from the Hugging Face Transformers library and generate tokens for the train and validation datasets:
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-7B"
max_seq_length=1024
# Initialize a tokenizer by loading a pre-trained tokenizer configuration, using the fast tokenizer implementation if available.
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
...
train_dataset = train_dataset.map(tokenize, remove_columns=["prompt"])
test_dataset = test_dataset.map(tokenize, remove_columns=["prompt"])
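The generate_and_tokenize_prompt and tokenize helpers used in the preceding maps are abbreviated in the notebook excerpt. The following is a minimal sketch of one possible implementation, consistent with the surrounding calls; the exact logic in the notebook may differ, and the truncation settings and labels column are assumptions.
def generate_and_tokenize_prompt(data_point):
    # In this flow, the helper applies the DeepSeek-R1 prompt template; tokenization happens in the next map
    return generate_prompt(data_point)

def tokenize(data_point):
    # Tokenize the formatted prompt, truncating to the configured maximum sequence length
    result = tokenizer(
        data_point["prompt"],
        truncation=True,
        max_length=max_seq_length,
    )
    # Assumption: labels mirror input_ids for causal language model fine-tuning
    result["labels"] = result["input_ids"].copy()
    return result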
- Prepare the training and validation datasets for SageMaker training by saving them as arrow files, which is required by SageMaker HyperPod recipes, and setting up the S3 paths where these files will be uploaded:
train_dataset_s3_path = f"s3://{bucket_name}/{input_path}/train"
val_dataset_s3_path = f"s3://{bucket_name}/{input_path}/test"

train_dataset.save_to_disk(train_dataset_s3_path)
val_dataset.save_to_disk(val_dataset_s3_path)
The preceding dataset will be used in the examples for both SageMaker training jobs and SageMaker HyperPod.
Option A: Fine-tune using SageMaker training jobs
To fine-tune the model using SageMaker training jobs with recipes, this example uses the ModelTrainer class. The ModelTrainer class is a newer and more intuitive approach to model training that significantly enhances the user experience and supports distributed training, Bring Your Own Container (BYOC), and recipes. For more information about ModelTrainer, refer to Accelerate your ML lifecycle using the new and improved Amazon SageMaker Python SDK – Part 1: ModelTrainer.
To set up the fine-tuning workload, complete the following steps:
- Select the instance type, the container image for the training job, and define the checkpoint path where the model will be saved:
instance_type = "ml.p4d.24xlarge"
image_uri = (
f"658645717510.dkr.ecr.{sagemaker_session.boto_session.region_name}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
)
checkpoint_s3_path = f"s3://{bucket_name}/deepseek-r1-distilled-qwen-7b-recipe-lora/checkpoints"
- Create the ModelTrainer function to encapsulate the training setup from a specific recipe:
from sagemaker.modules.configs import CheckpointConfig, Compute, InputData, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

instance_count = 1

# Working override for custom dataset
recipe_overrides = {
    ...
    "trainer": {
        "num_nodes": instance_count,
        ...
    },
    ...
    "use_smp_model": False, # Required for PEFT
    "model": {
        "hf_model_name_or_path": model_id,
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/test",
        },
    },
}

# Define the compute
compute_configs = Compute(
    instance_type=instance_type,
    instance_count=instance_count,
    keep_alive_period_in_seconds=0
)

model_trainer = ModelTrainer.from_recipe(
    training_image=image_uri,
    training_recipe="fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq8k_gpu_lora",
    recipe_overrides=recipe_overrides,
    requirements="./requirements.txt",
    compute=compute_configs,
    ...
    checkpoint_config=CheckpointConfig(
        s3_uri=f"{checkpoint_s3_path}/{job_prefix}"
    ),
)
You can point to the specific recipe with the training_recipe argument and override the recipe arguments by providing a dictionary as the recipe_overrides argument. In the previous example:
- num_nodes: Indicates the number of instances that will be used for the fine-tuning execution
- checkpoint_dir: Location in the container where the job will save model checkpoints
The ModelTrainer class simplifies the experience by encapsulating code and training setup directly from the selected recipe. In this example:
- training_recipe: hf_deepseek_r1_distilled_qwen_7b_seq8k_gpu_lora defines the fine-tuning setup for the LoRA technique
- Set up the input channels for ModelTrainer by creating InputData objects from the provided S3 bucket paths for the training and test or validation datasets, as shown in the following sketch
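The following is a minimal sketch of this step. The channel names are assumptions chosen to match the train and test container paths configured in recipe_overrides; the notebook may name the channels differently.
# Map the S3 dataset locations to input channels for the training job
train_input = InputData(
    channel_name="train",
    data_source=train_dataset_s3_path, # S3 path where the training data is stored
)

test_input = InputData(
    channel_name="test",
    data_source=val_dataset_s3_path, # S3 path where the validation data is stored
)

data = [train_input, test_input]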
- Submit the training job:
# starting the train job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)
Option B: Fine-tune using SageMaker HyperPod with Slurm
To fine-tune the model using HyperPod, make sure your cluster is up and ready by following the prerequisites. To access the login or head node of the HyperPod Slurm cluster from your development environment, follow the login instructions at Log in to your cluster in the Amazon SageMaker HyperPod workshop.
Alternatively, you can also use AWS Systems Manager and run a command like the following to start the session. You can find the cluster ID, instance group name, and instance ID on the Amazon SageMaker console.
aws ssm start-session --target sagemaker-cluster:[cluster-id]_[instance-group-name]-[instance-id] --region region_name
- In the cluster's login or head node, run the following commands to set up the environment. Run sudo su - ubuntu to run the remaining commands as the ubuntu user, unless you have a specific user ID to access the cluster and your POSIX user is created through a lifecycle script on the cluster. Refer to the multi-user setup for more details.
# create a virtual environment
python3 -m venv ${PWD}/venv
source venv/bin/activate

# clone the recipes repository and set up the environment
git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
cd sagemaker-hyperpod-recipes
pip3 install -r requirements.txt
- Create a squash file using Enroot to run the job on the cluster. The Enroot runtime offers GPU acceleration, rootless container support, and seamless integration with high performance computing (HPC) environments, making it ideal for running our workflows securely.
# create a squash file using Enroot
REGION=
IMAGE="658645717510.dkr.ecr.${REGION}.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121"
aws ecr get-login-password --region "${REGION}" | docker login --username AWS --password-stdin 658645717510.dkr.ecr.${REGION}.amazonaws.com
enroot import -o $PWD/smdistributed-modelparallel.sqsh dockerd://${IMAGE}
- After you have created the squash file, update the recipes_collection/config.yaml file with the absolute path to the squash file (created in the preceding step), and update the instance_type if needed. The final config file should have the following parameters:
...
cluster_type: slurm
...
instance_type: p4d.24xlarge
...
container: /fsx/.sqsh
...
- Download the prepared dataset that you uploaded to Amazon S3 into the FSx for Lustre volume attached to the cluster. Run the following commands to download the files from Amazon S3:
aws s3 cp s3://{bucket_name}/{input_path}/train /fsx/ubuntu/deepseek/data/train --recursive
aws s3 cp s3://{bucket_name}/{input_path}/test /fsx/ubuntu/deepseek/data/test --recursive
- Update the launcher script for fine-tuning the DeepSeek-R1 Distill Qwen 7B model. The launcher scripts serve as convenient wrappers for executing the training script (the main.py file), which streamlines the process of fine-tuning and parameter adjustment. For fine-tuning the DeepSeek-R1 Qwen 7B model, you can find the specific script at:
launcher_scripts/deepseek/run_hf_deepseek_r1_qwen_7b_seq16k_gpu_fine_tuning.sh
- Before running the script, you need to modify the location of the training and validation files and update the Hugging Face model ID and, optionally, the access token for private models and datasets. The script should look like the following (update recipes.trainer.num_nodes if you're using a multi-node cluster):
SAGEMAKER_TRAINING_LAUNCHER_DIR=${SAGEMAKER_TRAINING_LAUNCHER_DIR:-"$(pwd)"}

HF_MODEL_NAME_OR_PATH="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B" # HuggingFace pretrained model name or path
HF_ACCESS_TOKEN="hf_xxxx" # Optional HuggingFace access token

TRAIN_DIR="/fsx/ubuntu/deepseek/data/train" # Location of training dataset
VAL_DIR="/fsx/ubuntu/deepseek/data/test" # Location of validation dataset

EXP_DIR="/fsx/ubuntu/deepseek/results" # Location to save experiment info including logging, checkpoints, etc.

HYDRA_FULL_ERROR=1 python3 "${SAGEMAKER_TRAINING_LAUNCHER_DIR}/main.py" \
    recipes=fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq16k_gpu_fine_tuning \
    base_results_dir="${SAGEMAKER_TRAINING_LAUNCHER_DIR}/results" \
    recipes.run.name="hf-deepseek-r1-distilled-qwen-7b-fine-tuning" \
    recipes.exp_manager.exp_dir="$EXP_DIR" \
    recipes.trainer.num_nodes=1 \
    recipes.model.data.train_dir="$TRAIN_DIR" \
    recipes.model.data.val_dir="$VAL_DIR" \
    recipes.model.hf_model_name_or_path="$HF_MODEL_NAME_OR_PATH" \
    recipes.model.hf_access_token="$HF_ACCESS_TOKEN"
You can view the recipe for this fine-tuning job at the following path, overriding any additional parameters as needed:
recipes_collection/recipes/fine-tuning/deepseek/hf_deepseek_r1_distilled_qwen_7b_seq16k_gpu_fine_tuning.yaml
- Submit the job by running the launcher script:
bash launcher_scripts/deepseek/run_hf_deepseek_r1_qwen_7b_seq16k_gpu_fine_tuning.sh
You can monitor the job using Slurm commands such as squeue and scontrol show to view the status of the job and the corresponding logs.
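For example, the following standard Slurm commands list your queued or running jobs and print the details of a specific job (replace the job ID placeholder with the ID reported by squeue):
# List jobs submitted by the current user
squeue -u $USER

# Show detailed information for a specific job
scontrol show job <job-id>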
After the job is complete, the trained model will also be available in the results folder, as shown in the following code:
cd results
ls -R
.:
checkpoints experiment
./checkpoints:
full
./checkpoints/full:
steps_50
./checkpoints/full/steps_50:
config.json pytorch_model.bin
./experiment:
...
- Upload the fine-tuned model checkpoint to Amazon S3 for evaluating the model using the validation data:
aws s3 cp /fsx/ s3://{bucket_name}/{model_prefix}/qwen7b --recursive
Evaluate the fine-tuned model
To objectively evaluate your fine-tuned model, you can run an evaluation job on the validation portion of the dataset.
You can run a SageMaker training job and use ROUGE metrics (ROUGE-1, ROUGE-2, ROUGE-L, and ROUGE-L-Sum), which measure the similarity between machine-generated text and human-written reference text. The SageMaker training job will compute ROUGE metrics for both the base DeepSeek-R1 Distill Qwen 7B model and the fine-tuned one. You can access the code sample for ROUGE evaluation in the sagemaker-distributed-training-workshop on GitHub. Refer to this notebook for details.
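To illustrate the metric itself, the following is a minimal sketch of computing ROUGE scores with the Hugging Face evaluate library on a pair of example strings. This is not the workshop evaluation script; it only shows what the metric computation looks like, and it assumes the evaluate and rouge_score packages are installed.
import evaluate

# Load the ROUGE metric implementation (requires the rouge_score package)
rouge = evaluate.load("rouge")

predictions = ["The patient most likely has iron deficiency anemia."]
references = ["The most probable diagnosis is iron deficiency anemia."]

# Returns rouge1, rouge2, rougeL, and rougeLsum scores between 0 and 1
scores = rouge.compute(predictions=predictions, references=references)
print(scores)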
Complete the following steps:
- Define the S3 path where the fine-tuned checkpoints are stored, the instance_type, and the image URI to use in the training job:
trained_model =
instance_type = "ml.p4d.24xlarge"
image_uri = sagemaker.image_uris.retrieve(
    framework="pytorch",
    region=sagemaker_session.boto_session.region_name,
    version="2.4",
    instance_type=instance_type,
    image_scope="training"
)
#763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:2.4-gpu-py311
- Create the ModelTrainer function to encapsulate the evaluation script and define the input data:
from sagemaker.modules.configs import Compute, InputData, OutputDataConfig, SourceCode, StoppingCondition
from sagemaker.modules.distributed import Torchrun
from sagemaker.modules.train import ModelTrainer

# Define the script to be run
source_code = SourceCode(
    source_dir="./scripts",
    requirements="requirements.txt",
    entry_script="evaluate_recipe.py",
)

# Define the compute
...

# Define the ModelTrainer
model_trainer = ModelTrainer(
    training_image=image_uri,
    source_code=source_code,
    compute=compute_configs,
    ...
    hyperparameters={
        "model_id": model_id, # Hugging Face model id
        "dataset_name": dataset_name
    }
)

# Pass the input data
train_input = InputData(
    channel_name="adapterdir",
    data_source=trained_model,
)

test_input = InputData(
    channel_name="testdata",
    data_source=test_dataset_s3_path, # S3 path where the test data is stored
)

# Check input channels configured
data = [train_input, test_input]
- Submit the training job:
# starting the train job with our uploaded datasets as input
model_trainer.train(input_data_config=data, wait=True)
The following table shows the task output for the fine-tuned model and the base model.
Model | ROUGE-1 | ROUGE-2 | ROUGE-L | ROUGE-L-Sum |
Base | 0.36362 | 0.08739 | 0.16345 | 0.3204 |
Fine-tuned | 0.44232 | 0.13022 | 0.17769 | 0.38989 |
% Difference | 21.64207 | 49.01703 | 8.7121 | 21.68871 |
Our fine-tuned model demonstrates remarkable efficiency, achieving about a 22% overall improvement on the reasoning task after only one training epoch. The most significant gain appears in the ROUGE-2 score, which measures bigram overlap, with about a 49% increase, indicating better alignment between generated and reference summaries.
Notably, preliminary experiments suggest these results could be further improved by extending the training duration. Increasing the number of epochs shows promising potential for additional performance gains while maintaining computational efficiency.
Clean up
To clean up your resources and avoid incurring any additional charges, follow these steps:
- Delete any unused SageMaker Studio resources.
- (Optional) Delete the SageMaker Studio domain.
- Verify that your training job isn't running anymore. To do so, on the SageMaker console, choose Training and check Training jobs.
- If you created a HyperPod cluster, delete the cluster to stop incurring costs. If you created the networking stack from the HyperPod workshop, delete the stack as well to clean up the virtual private cloud (VPC) resources and the FSx for Lustre volume.
Conclusion
In the first post of this two-part DeepSeek-R1 series, we discussed how SageMaker HyperPod recipes provide a powerful yet accessible solution for organizations to scale their AI model training capabilities with large language models (LLMs), including DeepSeek. The architecture streamlines complex distributed training workflows through its intuitive recipe-based approach, reducing setup time from weeks to minutes.
We recommend starting your LLM customization journey by exploring our sample recipes in the Amazon SageMaker HyperPod documentation. The AWS AI/ML community offers extensive resources, including workshops and technical guidance, to support your implementation journey.
To begin using the SageMaker HyperPod recipes, visit the sagemaker-hyperpod-recipes repo on GitHub for comprehensive documentation and example implementations. Our team continues to expand the recipe ecosystem based on customer feedback and emerging ML trends, making sure that you have the tools needed for successful AI model training.
In our second post, we discuss how these recipes can further be used to fine-tune the DeepSeek-R1 671B model. Stay tuned!
About the Authors
Kanwaljit Khurmi is a Principal Worldwide Generative AI Solutions Architect at AWS. He collaborates with AWS product teams, engineering departments, and customers to provide guidance and technical assistance, helping them improve the value of their hybrid machine learning solutions on AWS. Kanwaljit specializes in assisting customers with containerized applications and high-performance computing solutions.
Bruno Pistone is a Senior World Wide Generative AI/ML Specialist Solutions Architect at AWS based in Milan, Italy. He works with AWS product teams and large customers to help them fully understand their technical needs and design AI and machine learning solutions that take full advantage of the AWS Cloud and the Amazon Machine Learning stack. His expertise includes end-to-end machine learning, model customization, and generative AI. He enjoys spending time with friends, exploring new places, and traveling to new destinations.
Arun Kumar Lokanatha is a Senior ML Solutions Architect with the Amazon SageMaker team. He specializes in large language model training workloads, helping customers build LLM workloads using SageMaker HyperPod, SageMaker training jobs, and SageMaker distributed training. Outside of work, he enjoys running, hiking, and cooking.
Durga Sury is a Senior Solutions Architect on the Amazon SageMaker team. Over the past 5 years, she has worked with multiple enterprise customers to set up a secure, scalable AI/ML platform built on SageMaker.
Aman Shanbhag is an Associate Specialist Solutions Architect on the ML Frameworks team at Amazon Web Services, where he helps customers and partners with deploying ML training and inference solutions at scale. Before joining AWS, Aman graduated from Rice University with degrees in computer science, mathematics, and entrepreneurship.
Anirudh Viswanathan is a Sr Product Manager, Technical – External Services with the SageMaker AI Training team. He holds a Masters in Robotics from Carnegie Mellon University, an MBA from the Wharton School of Business, and is named inventor on over 40 patents. He enjoys long-distance running, visiting art galleries, and Broadway shows.