Scale LLM fine-tuning with Hugging Face and Amazon SageMaker AI

February 17, 2026 | Artificial Intelligence

Enterprises are increasingly shifting from relying solely on massive, general-purpose language models to developing specialized large language models (LLMs) fine-tuned on their own proprietary data. Although foundation models (FMs) offer impressive general capabilities, they often fall short when applied to the complexities of enterprise environments, where accuracy, security, compliance, and domain-specific knowledge are non-negotiable.

To meet these demands, organizations are adopting cost-efficient models tailored to their internal data and workflows. By fine-tuning on proprietary documents and domain-specific terminology, enterprises are building models that understand their unique context, resulting in more relevant outputs, tighter data governance, and simpler deployment across internal tools.

This shift is also a strategic move to reduce operational costs, improve inference latency, and maintain greater control over data privacy. As a result, enterprises are redefining their AI strategy around customized, right-sized models aligned to their business needs.

Scaling LLM fine-tuning for enterprise use cases presents real technical and operational hurdles, which are being overcome through the partnership between Hugging Face and Amazon SageMaker AI.

Many organizations face fragmented toolchains and growing complexity when adopting advanced fine-tuning techniques like Low-Rank Adaptation (LoRA), QLoRA, and Reinforcement Learning from Human Feedback (RLHF). In addition, the resource demands of large model training, including memory limitations and distributed infrastructure challenges, often slow down innovation and strain internal teams.

To overcome this, SageMaker AI and Hugging Face have joined forces to simplify and scale model customization. By integrating the Hugging Face Transformers libraries into SageMaker's fully managed infrastructure, enterprises can now:

  • Run distributed fine-tuning jobs out of the box, with built-in support for parameter-efficient tuning methods
  • Use optimized compute and storage configurations that reduce training costs and improve GPU utilization
  • Accelerate time to value by using familiar open source libraries in a production-grade environment

This collaboration helps businesses focus on building domain-specific, right-sized LLMs, unlocking AI value faster while maintaining full control over their data and models.

In this post, we show how this integrated approach transforms enterprise LLM fine-tuning from a complex, resource-intensive challenge into a streamlined, scalable solution for achieving better model performance in domain-specific applications. We use the meta-llama/Llama-3.1-8B model and execute a Supervised Fine-Tuning (SFT) job to improve the model's reasoning capabilities on the MedReason dataset, using distributed training and optimization techniques such as Fully Sharded Data Parallel (FSDP) and LoRA with the Hugging Face Transformers library, executed with Amazon SageMaker Training Jobs.

Understanding the core concepts

The Hugging Face Transformers library is an open source toolkit designed to fine-tune LLMs by enabling seamless experimentation and deployment with popular transformer models.

The Transformers library supports a variety of methods for aligning LLMs to specific objectives, including the following (a short usage sketch follows the list):

  • Thousands of pre-trained models – Access to a vast collection of models like BERT, Meta Llama, Qwen, T5, and more, which can be used for tasks such as text classification, translation, summarization, question answering, object detection, and speech recognition.
  • Pipelines API – Simplifies common tasks (such as sentiment analysis, summarization, and image segmentation) by handling tokenization, inference, and output formatting in a single call.
  • Trainer API – Provides a high-level interface for training and fine-tuning models, supporting features like mixed precision, distributed training, and integration with popular hardware accelerators.
  • Tokenization tools – Efficient and flexible tokenizers for converting raw text into model-ready inputs, supporting multiple languages and formats.
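
As a quick illustration of the Pipelines API and the tokenizers, the following minimal sketch is not part of the original walkthrough; the model names are arbitrary examples:

from transformers import AutoTokenizer, pipeline

# Pipelines API: one call handles tokenization, inference, and output formatting
classifier = pipeline("sentiment-analysis")  # downloads a default sentiment model
print(classifier("SageMaker training jobs made this fine-tuning run painless."))

# Tokenizer: convert raw text into model-ready input IDs
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # example model id
print(tokenizer("What is the most sensitive test for H. pylori?")["input_ids"][:10])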

SageMaker Training Jobs is a fully managed, on-demand machine learning (ML) service that runs remotely on AWS infrastructure to train a model using your data, code, and chosen compute resources. This service abstracts away the complexities of provisioning and managing the underlying infrastructure, so you can focus on developing and fine-tuning your ML and foundation models. Key capabilities offered by SageMaker training jobs are:

  • Fully managed – SageMaker handles resource provisioning, scaling, and management for your training jobs, so you don't need to manually set up servers or clusters.
  • Flexible input – You can use built-in algorithms, pre-built containers, or bring your own custom training scripts and Docker containers to execute training workloads with the most popular frameworks, such as the Hugging Face Transformers library.
  • Scalable – It supports single-node or distributed training across multiple instances, making it suitable for both small and large-scale ML workloads.
  • Integration with multiple data sources – Training data can be stored in Amazon Simple Storage Service (Amazon S3), Amazon FSx, and Amazon Elastic Block Store (Amazon EBS), and output model artifacts are saved back to Amazon S3 after training is complete.
  • Customizable – You can specify hyperparameters, resource types (such as GPU or CPU instances), and other settings for each training job.
  • Cost-efficient options – Features like managed Spot Instances, flexible training plans, and heterogeneous clusters help optimize training costs (see the short sketch after this list).
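
As an example of the cost-efficiency options, the following hedged sketch (not from the original post; the container URI, role ARN, and S3 paths are placeholders) shows how managed Spot Instances can be requested with the classic SageMaker Estimator interface:

from sagemaker.estimator import Estimator

# Minimal sketch: run a training job on managed Spot capacity.
# image_uri, role, and the S3 paths below are placeholders for your own values.
estimator = Estimator(
    image_uri="<training-container-uri>",
    role="arn:aws:iam::<account-id>:role/<sagemaker-execution-role>",
    instance_type="ml.g5.2xlarge",
    instance_count=1,
    use_spot_instances=True,   # request Spot capacity for the training job
    max_run=3600,              # max training time in seconds
    max_wait=7200,             # max time to wait for Spot capacity (must be >= max_run)
    output_path="s3://<bucket>/<prefix>/output",
)
estimator.fit({"train": "s3://<bucket>/<prefix>/train"})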

Solution overview

The following diagram illustrates the solution workflow of using the Hugging Face Transformers library with a SageMaker Training job.

The workflow consists of the following steps:

  1. The user prepares the dataset by formatting it with the specific prompt style used for the chosen model.
  2. The user prepares the training script using the Hugging Face Transformers library to start the training workload, specifying the configuration for the chosen distribution option, such as Distributed Data Parallel (DDP) or Fully Sharded Data Parallel (FSDP).
  3. The user submits an API request to SageMaker AI, passing the location of the training script, the Hugging Face Training container URI, and the required training configurations, such as the distribution algorithm, instance type, and instance count.
  4. SageMaker AI uses the training job launcher script to run the training workload on a managed compute cluster. Based on the chosen configuration, SageMaker AI provisions the required infrastructure, orchestrates distributed training, and, upon completion, automatically decommissions the cluster.

This streamlined architecture delivers a fully managed user experience, helping you quickly develop your training code, define training parameters, and choose your preferred infrastructure. SageMaker AI handles the end-to-end infrastructure management with a pay-as-you-go pricing model that bills only for the net training time in seconds.

Prerequisites

You must complete the following prerequisites before you can run the Meta Llama 3.1 8B fine-tuning notebook:

  1. Make the following quota increase requests for SageMaker AI. For this use case, you will need to request a minimum of 1 p4d.24xlarge instance (with 8 x NVIDIA A100 GPUs) and scale to more p4d.24xlarge instances (depending on time-to-train and cost-to-train trade-offs for your use case). To help determine the right cluster size for the fine-tuning workload, you can use tools like the VRAM Calculator or "Can it run LLM". On the Service Quotas console, request the following SageMaker AI quotas:
    • P4D instances (p4d.24xlarge) for training job usage: 1
  2. Create an AWS Identity and Access Management (IAM) role with the managed policies AmazonSageMakerFullAccess and AmazonS3FullAccess to give SageMaker AI the access required to run the examples.
  3. Assign the following policy as a trust relationship to your IAM role:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    "Service": [
                        "sagemaker.amazonaws.com"
                    ]
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }
    

  4. (Optional) Create an Amazon SageMaker Studio domain (refer to Use quick setup for Amazon SageMaker AI) to access Jupyter notebooks with the preceding role. You can also use JupyterLab in your local setup.

These permissions grant broad access and are not recommended for use in production environments. See the SageMaker Developer Guide for guidance on defining more fine-grained permissions.

Prepare the dataset

To prepare the dataset, you must load the UCSC-VLAA/MedReason dataset. MedReason is a large-scale, high-quality medical reasoning dataset designed to enable faithful and explainable medical problem-solving in LLMs. The following table shows an example of the data.

dataset_name id_in_dataset question answer reasoning options
medmcqa 7131 Urogenital Diaphragm is made up of the following… Colle's fascia. Explanation: Colle's fascia do… Finding reasoning paths:\n1. Urogenital diaphr… Answer Choices:\nA. Deep transverse Perineus\n…
medmcqa 7133 Child with Type I Diabetes. What is the advise… After 5 years. Explanation: Screening for diab… **Finding reasoning paths:**\n\n1. Type 1 Diab… Answer Choices:\nA. After 5 years\nB. After 2 …
medmcqa 7134 Most sensitive test for H pylori is- Biopsy urease test. Explanation:\n\nDavidson&… **Finding reasoning paths:**\n\n1. Consider th… Answer Choices:\nA. Fecal antigen test\nB. Bio…

We want to use the following columns for preparing our dataset:

  • question – The question being posed
  • answer – The correct answer to the question
  • reasoning – A detailed, step-by-step logical explanation of how to arrive at the correct answer

We can use the following steps to format the input in the proper form used for Meta Llama 3.1 and configure the data channels for SageMaker training jobs on Amazon S3:

  1. Load the UCSC-VLAA/MedReason dataset, using the first 10,000 rows of the original dataset:
    from datasets import load_dataset
    dataset = load_dataset("UCSC-VLAA/MedReason", split="train[:10000]")
  2. Apply the proper chat template to the dataset by using the apply_chat_template method of the tokenizer:
    from transformers import AutoTokenizer
    
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    
    def prepare_dataset(sample):
    
        system_text = (
            "You are a deep-thinking AI assistant.\n\n"
            "For every user question, first write your thoughts and reasoning inside <think> ... </think> tags, then provide your answer."
        )
    
        messages = []
    
        messages.append({"role": "system", "content": system_text})
        messages.append({"role": "user", "content": sample["question"]})
        messages.append(
            {
                "role": "assistant",
                "content": f"<think>\n{sample['reasoning']}\n</think>\n\n{sample['answer']}",
            }
        )
    
        # Apply chat template
        sample["text"] = tokenizer.apply_chat_template(
            messages, tokenize=False
        )
    
        return sample
    

    The prepare_dataset function iterates over the elements of the dataset and uses the apply_chat_template function to produce a prompt template in the following form:

    <|begin_of_text|><|start_header_id|>system<|end_header_id|>
    {{SYSTEM_PROMPT}}
    <|eot_id|><|start_header_id|>user<|end_header_id|>
    {{QUESTION}}
    <|eot_id|><|start_header_id|>assistant<|end_header_id|>
    <think>
    {{REASONING}}
    </think>

    {{FINAL_ANSWER}}
    <|eot_id|><|end_of_text|>

    The following code is an example of the formatted prompt:

    <|begin_of_text|><|start_header_id|>system<|end_header_id|> 
    You are a deep-thinking AI assistant. 
    For every user question, first write your thoughts and reasoning inside <think> ... </think> tags, then provide your answer.
    <|eot_id|><|start_header_id|>user<|end_header_id|> 
    A 66-year-old man presents to the emergency room with blurred vision, lightheadedness, and chest pain that started 30 minutes ago. The patient is awake and alert. 
    His history is significant for uncontrolled hypertension and coronary artery disease, and he previously underwent percutaneous coronary intervention. 
    He is afebrile. The heart rate is 102/min, the blood pressure is 240/135 mm Hg, and the O2 saturation is 100% on room air. 
    An ECG is performed and shows no acute changes. A rapid intravenous infusion of a drug that increases peripheral venous capacitance is started. 
    This drug has an onset of action of less than 1 minute and rapid serum clearance that necessitates a continuous infusion. What is the most severe side effect of this treatment?
    <|eot_id|><|start_header_id|>assistant<|end_header_id|> 
    <think>
    ### Finding Reasoning Paths: 
    1. **Blurred vision, lightheadedness, and chest pain** → Malignant hypertension → Rapid IV antihypertensive therapy. 
    2. **Uncontrolled hypertension and coronary artery disease** → Malignant hypertension → Rapid IV antihypertensive therapy. 
    3. **Severe hypertension (BP 240/135 mm Hg)** → Risk of end-organ damage → Malignant hypertension → Rapid IV antihypertensive therapy. 
    4. **Chest pain and history of coronary artery disease** → Risk of myocardial ischemia → Malignant hypertension → Rapid IV antihypertensive therapy. --- 
    
    ### Reasoning Process: 
    1. **Clinical Presentation and Diagnosis**: - The patient presents with blurred vision...
    ...
    </think>
    
    Cyanide poisoning
    <|eot_id|><|end_of_text|>
    

  3. Split the dataset into train, validation, and test datasets:
    from datasets import Dataset, DatasetDict
    from random import randint
    
    # train, val, and test are pandas DataFrames holding the corresponding splits
    train_dataset = Dataset.from_pandas(train)
    val_dataset = Dataset.from_pandas(val)
    test_dataset = Dataset.from_pandas(test)
    
    dataset = DatasetDict({"train": train_dataset, "val": val_dataset})
    train_dataset = dataset["train"].map(
        prepare_dataset, remove_columns=list(train_dataset.features)
    )
    
    val_dataset = dataset["val"].map(
        prepare_dataset, remove_columns=list(val_dataset.features)
    )
    

  4. Prepare the training and validation datasets for the SageMaker training job by saving them as JSON files and constructing the S3 paths where these files will be uploaded:
    ...
    
    train_dataset.to_json("./data/train/dataset.jsonl")
    val_dataset.to_json("./data/val/dataset.jsonl")
    
    
    s3_client.upload_file(
        "./data/train/dataset.jsonl", bucket_name, f"{input_path}/train/dataset.jsonl"
    )
    s3_client.upload_file(
        "./data/val/dataset.jsonl", bucket_name, f"{input_path}/val/dataset.jsonl"
    )
    

Prepare the training script

To fine-tune meta-llama/Llama-3.1-8B with a SageMaker Training job, we prepared the train.py file, which serves as the entry point of the training job to execute the fine-tuning workload.

The training process can use the Trainer or SFTTrainer classes to fine-tune our model. This simplifies the process of continued pre-training for LLMs and makes fine-tuning efficient for adapting pre-trained models to specific tasks or domains.

The Trainer and SFTTrainer classes both facilitate model training with Hugging Face Transformers. The Trainer class is the standard high-level API for training and evaluating transformer models on a wide range of tasks, including text classification, sequence labeling, and text generation. The SFTTrainer is a subclass built specifically for supervised fine-tuning of LLMs, particularly for instruction-following or conversational tasks.
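
For comparison, the following is a minimal, hedged sketch of what the SFTTrainer variant could look like (this post uses the Trainer class later on; the model, tokenizer, and dataset variables are assumed to be the ones prepared earlier, and SFTConfig fields may differ slightly across TRL versions):

from trl import SFTConfig, SFTTrainer

# Minimal sketch of supervised fine-tuning with TRL's SFTTrainer.
# model, tokenizer, train_dataset, and val_dataset are assumed to exist from the earlier steps.
sft_config = SFTConfig(
    output_dir="/opt/ml/model",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    bf16=True,
    dataset_text_field="text",   # use the "text" column produced by prepare_dataset
)

trainer = SFTTrainer(
    model=model,
    args=sft_config,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    processing_class=tokenizer,
)
trainer.train()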

To accelerate model fine-tuning, we distribute the training workload using the FSDP technique. FSDP is an advanced parallelism technique designed to train large models that might not fit in the memory of a single GPU, with the following benefits:

  • Parameter sharding – Instead of replicating the full model on each GPU, FSDP splits (shards) model parameters, optimizer states, and gradients across GPUs
  • Memory efficiency – By sharding, FSDP drastically reduces the memory footprint on each device, enabling training of larger models or larger batch sizes
  • Synchronization – During training, FSDP gathers only the necessary parameters for each computation step, then releases memory immediately after, further saving resources
  • CPU offload – Optionally, FSDP can offload some data to CPUs to save even more GPU memory
  1. In our example, we use the Trainer class and define the required TrainingArguments to execute the FSDP distributed workload:
    import transformers
    from transformers import (
        Trainer,
        TrainingArguments
    )
    
    trainer = Trainer(
        model=model,
        train_dataset=train_ds,
        eval_dataset=test_ds if test_ds is not None else None,
        args=transformers.TrainingArguments(
            **training_args,
        ),
        callbacks=callbacks,
        data_collator=transformers.DataCollatorForLanguageModeling(
            tokenizer, mlm=False
        )
    )
    

  2. To further optimize the fine-tuning workload, we use the QLoRA technique, which quantizes a pre-trained language model to 4 bits and attaches small low-rank adapters, which are then fine-tuned:
    import torch
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )
    
    # Load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained(script_args.model_id)
    
    # Define PAD token
    tokenizer.pad_token = tokenizer.eos_token
    
    # Configure quantization
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_storage=torch.bfloat16
    )
    
    # Load the model
    model = AutoModelForCausalLM.from_pretrained(
        script_args.model_id,
        trust_remote_code=True,
        quantization_config=bnb_config,
        use_cache=not training_args.gradient_checkpointing,
        cache_dir="/tmp/.cache",
        **model_configs,
    )
    

  3. The script_args and training_args are provided as hyperparameters for the SageMaker Training job in a configuration recipe .yaml file and parsed in the train.py file using the TrlParser class provided by Hugging Face TRL (a parsing sketch follows this list):
    model_id: "meta-llama/Llama-3.1-8B-Instruct"      # Hugging Face model id
    # sagemaker specific parameters
    output_dir: "/opt/ml/model"                       # path to where SageMaker will upload the model
    checkpoint_dir: "/opt/ml/checkpoints/"            # path to where SageMaker will upload the model checkpoints
    train_dataset_path: "/opt/ml/input/data/train/"   # path to where S3 saves the train dataset
    val_dataset_path: "/opt/ml/input/data/val/"       # path to where S3 saves the validation dataset
    save_steps: 100                                   # save a checkpoint every this many steps
    token: ""
    # training parameters
    lora_r: 32
    lora_alpha: 64
    lora_dropout: 0.1
    learning_rate: 2e-4                    # learning rate
    num_train_epochs: 2                    # number of training epochs
    per_device_train_batch_size: 4         # batch size per device during training
    per_device_eval_batch_size: 2          # batch size for evaluation
    gradient_accumulation_steps: 4         # number of steps before performing a backward/update pass
    gradient_checkpointing: true           # use gradient checkpointing
    bf16: true                             # use bfloat16 precision
    tf32: false                            # use tf32 precision
    fsdp: "full_shard auto_wrap offload"   # FSDP configurations
    fsdp_config:
        backward_prefetch: "backward_pre"
        cpu_ram_efficient_loading: true
        offload_params: true
        forward_prefetch: false
        use_orig_params: true
    warmup_steps: 100
    weight_decay: 0.01
    merge_weights: true                    # merge weights into the base model
    

    For the implemented use case, we decided to fine-tune the adapter with the following values:

    • lora_r: 32 – Allows the adapter to capture more complex reasoning transformations.
    • lora_alpha: 64 – Given the reasoning task we are trying to improve, this value allows the adapter to have a significant impact on the base model.
    • lora_dropout: 0.05 – We want to preserve reasoning connections by avoiding breaking important ones.
    • warmup_steps: 100 – Gradually increases the learning rate to the specified value. For this reasoning task, we want the model to learn a new structure without forgetting its previous knowledge.
    • weight_decay: 0.01 – Maintains model generalization.
  4. Prepare the configuration file for the SageMaker Training job by saving it as a YAML file and constructing the S3 path where the file will be uploaded:
    import os
    
    if default_prefix:
        input_path = f"{default_prefix}/datasets/llm-fine-tuning-modeltrainer-sft"
    else:
        input_path = f"datasets/llm-fine-tuning-modeltrainer-sft"
    
    train_config_s3_path = f"s3://{bucket_name}/{input_path}/config/args.yaml"
    
    # upload the model yaml file to s3
    model_yaml = "args.yaml"
    s3_client.upload_file(model_yaml, bucket_name, f"{input_path}/config/args.yaml")
    os.remove("./args.yaml")
    
    print(f"Training config uploaded to:")
    print(train_config_s3_path)
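
As referenced in step 3, the following is a minimal, hedged sketch of how train.py could parse this recipe with TrlParser and turn the lora_* values into a PEFT LoraConfig. The dataclass fields and the target modules shown here are illustrative assumptions, not the exact contents of the original script:

from dataclasses import dataclass, field

from peft import LoraConfig
from transformers import TrainingArguments
from trl import TrlParser

@dataclass
class ScriptArguments:
    # Illustrative subset of the fields defined in the recipe .yaml
    model_id: str = field(default="meta-llama/Llama-3.1-8B-Instruct")
    train_dataset_path: str = field(default="/opt/ml/input/data/train/")
    val_dataset_path: str = field(default="/opt/ml/input/data/val/")
    lora_r: int = field(default=32)
    lora_alpha: int = field(default=64)
    lora_dropout: float = field(default=0.1)
    merge_weights: bool = field(default=True)
    token: str = field(default="")

# TrlParser reads --config args.yaml (passed as a SageMaker hyperparameter)
# and splits its keys across the two dataclasses.
parser = TrlParser((ScriptArguments, TrainingArguments))
script_args, training_args = parser.parse_args_and_config()

# Map the recipe values onto a PEFT LoraConfig for the QLoRA adapter
peft_config = LoraConfig(
    r=script_args.lora_r,
    lora_alpha=script_args.lora_alpha,
    lora_dropout=script_args.lora_dropout,
    target_modules="all-linear",   # assumption: adapt all linear projections
    task_type="CAUSAL_LM",
)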

SFT training using a SageMaker Training job

To run a fine-tuning workload using the SFT training script and SageMaker Training jobs, we use the ModelTrainer class.

The ModelTrainer class is a newer, more intuitive approach to model training that significantly improves the user experience and supports distributed training, Bring Your Own Container (BYOC), and recipes. For additional information, refer to the SageMaker Python SDK documentation.

Set up the fine-tuning workload with the following steps:

  1. Specify the instance type, the container image for the training job, and the checkpoint path where the model will be saved:
    from sagemaker import image_uris
    
    instance_type = "ml.p4d.24xlarge"
    instance_count = 1
    
    image_uri = image_uris.retrieve(
        framework="huggingface",
        region=sagemaker_session.boto_session.region_name,
        version="4.56.2",
        base_framework_version="pytorch2.8.0",
        instance_type=instance_type,
        image_scope="training",
    )
    

  2. Define the source code configuration by pointing to the created train.py:
    from sagemaker.train.configs import SourceCode
    
    source_code = SourceCode(
        source_dir="./scripts",
        requirements="requirements.txt",
        entry_script="train.py",
    )
    

  3. Configure the training compute, optionally providing the parameter keep_alive_period_in_seconds to use managed warm pools, which retain and reuse the cluster during the experimentation phase:
    from sagemaker.train.configs import Compute
    
    compute_configs = Compute(
        instance_type=instance_type,
        instance_count=instance_count,
        keep_alive_period_in_seconds=0,
    )
    

  4. Create the ModelTrainer by providing the required training setup, and define the argument distributed=Torchrun() to use torchrun as a launcher to execute the training job in a distributed manner across the available GPUs in the chosen instance:
    from sagemaker.train.configs import (
        CheckpointConfig,
        OutputDataConfig,
        StoppingCondition,
    )
    from sagemaker.train.distributed import Torchrun
    from sagemaker.train.model_trainer import ModelTrainer
    
    
    # define the Training Job name
    job_name = f"train-{model_id.split('/')[-1].replace('.', '-')}-sft"
    
    # define the OutputDataConfig path
    output_path = f"s3://{bucket_name}/{job_name}"
    
    # Define the ModelTrainer
    model_trainer = ModelTrainer(
        training_image=image_uri,
        source_code=source_code,
        base_job_name=job_name,
        compute=compute_configs,
        distributed=Torchrun(),
        stopping_condition=StoppingCondition(max_runtime_in_seconds=18000),
        hyperparameters={
            "config": "/opt/ml/input/data/config/args.yaml"  # path to the TRL config that was uploaded to S3
        },
        output_data_config=OutputDataConfig(s3_output_path=output_path),
        checkpoint_config=CheckpointConfig(
            s3_uri=output_path + "/checkpoint", local_path="/opt/ml/checkpoints"
        ),
    )
    

  5. Set up the input channels for the ModelTrainer by creating InputData objects from the provided S3 bucket paths for the training and validation datasets, and for the configuration parameters:
    from sagemaker.train.configs import InputData
    
    # Pass the input data
    train_input = InputData(
        channel_name="train",
        data_source=train_dataset_s3_path, # S3 path where training data is saved
    )
    val_input = InputData(
        channel_name="val",
        data_source=val_dataset_s3_path, # S3 path where validation data is saved
    )
    config_input = InputData(
        channel_name="config",
        data_source=train_config_s3_path, # S3 path where configurations are saved
    )
    # Check the configured input channels
    data = [train_input, val_input, config_input]
    

  6. Submit the training job:
    model_trainer.train(input_data_config=data, wait=False)

The training job with Flash Attention 2 for one epoch with a dataset of 10,000 samples takes approximately 18 minutes to complete.

Deploy and test the fine-tuned Meta Llama 3.1 8B on SageMaker AI

To evaluate your fine-tuned model, you have several options. You can use an additional SageMaker Training job to evaluate the model with Hugging Face Lighteval on SageMaker AI, or you can deploy the model to a SageMaker real-time endpoint and interactively test it by using techniques like LLM-as-a-judge to compare generated content with ground truth content. For a more comprehensive evaluation that demonstrates the impact of fine-tuning on model performance, you can use the MedReason evaluation script to compare the base meta-llama/Llama-3.1-8B model with your fine-tuned version.

In this example, we use the deployment approach, iterating over the test dataset and evaluating the model on those samples with a simple loop.

  1. Select the instance type and the container image for the endpoint:
    import boto3
    
    sm_client = boto3.client("sagemaker", region_name=sess.boto_region_name)
    
    image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/vllm:0.13-gpu-py312"
    

  2. Create the SageMaker Model using the container URI for vLLM and the S3 path to your model. Set your vLLM configuration, including the number of GPUs and max input tokens. For a full list of configuration options, see vLLM engine arguments.
    env = {
        "SM_VLLM_MODEL": "/opt/ml/model",
        "SM_VLLM_DTYPE": "bfloat16",
        "SM_VLLM_GPU_MEMORY_UTILIZATION": "0.8",
        "SM_VLLM_MAX_MODEL_LEN": json.dumps(1024 * 16),
        "SM_VLLM_MAX_NUM_SEQS": "1",
        "SM_VLLM_ENABLE_CHUNKED_PREFILL": "true",
        "SM_VLLM_KV_CACHE_DTYPE": "auto",
        "SM_VLLM_TENSOR_PARALLEL_SIZE": "4",
    }
    
    model_response = sm_client.create_model(
        ModelName=f"{model_id.split('/')[-1].replace('.', '-')}-model",
        ExecutionRoleArn=role,
        PrimaryContainer={
            "Image": image_uri,
            "Environment": env,
            "ModelDataSource": {
                "S3DataSource": {
                    "S3Uri": f"s3://{bucket_name}/{job_prefix}/{job_name}/output/model.tar.gz",
                    "S3DataType": "S3Prefix",
                    "CompressionType": "Gzip",
                }
            },
        },
    )
    

  3. Create the endpoint configuration by specifying the type and number of instances:
    instance_count = 1
    instance_type = "ml.g5.12xlarge"
    health_check_timeout = 700
    
    endpoint_config_response = sm_client.create_endpoint_config(
        EndpointConfigName=f"{model_id.split('/')[-1].replace('.', '-')}-config",
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": f"{model_id.split('/')[-1].replace('.', '-')}-model",
                "InstanceType": instance_type,
                "InitialInstanceCount": instance_count,
                "ModelDataDownloadTimeoutInSeconds": health_check_timeout,
                "ContainerStartupHealthCheckTimeoutInSeconds": health_check_timeout,
                "InferenceAmiVersion": "al2-ami-sagemaker-inference-gpu-3-1",
            }
        ],
    )
    

  4. Deploy the model:
    endpoint_response = sm_client.create_endpoint(
        EndpointName=f"{model_id.split('/')[-1].replace('.', '-')}-sft",
        EndpointConfigName=f"{model_id.split('/')[-1].replace('.', '-')}-config",
    )
    

SageMaker AI will now create the endpoint and deploy the model to it. This can take 5–10 minutes. Afterwards, you can test the model by sending some example inputs to the endpoint. Use the invoke_endpoint method of the sagemaker-runtime client to send the input to the model and get the output:

import json
import pandas as pd

eval_dataset = []

for index, el in enumerate(test_dataset, 1):
    print("Processing item ", index)

    payload = {
        "messages": [
            {
                "role": "system",
                "content": "You are a deep-thinking AI assistant.\n\nFor every user question, first write your thoughts and reasoning inside <think> ... </think> tags, then provide your answer.",
            },
            {"role": "user", "content": el["question"]},
        ],
        "max_tokens": 4096,
        "stop": ["<|eot_id|>", "<|end_of_text|>"],
        "temperature": 0.4,
        "top_p": 0.9,
        "repetition_penalty": 1.15,
        "no_repeat_ngram_size": 3,
        "do_sample": True,
    }

    response = predictor.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(payload),
    )

    result = json.loads(response["Body"].read().decode())
    eval_dataset.append([el["question"], result["choices"][0]["message"]["content"]])

    print("**********************************************")

eval_dataset_df = pd.DataFrame(
    eval_dataset, columns=["question", "answer"]
)

eval_dataset_df.to_json(
    "./eval_dataset_results.jsonl", orient="records", lines=True
)

The following are some examples of generated answers:

Question: "Perl's stain or Prussian blue test is for:"

Answer (fine-tuned model): """
<think>
The Perl's stain or Prussian blue test is used to detect the presence of iron in biological samples. 
It involves adding potassium ferrocyanide (K4[Fe(CN)6]) to the sample, 
which reacts with the iron ions present in it to form a dark blue-colored compound known as ferric ferrocyanide. 
This reaction can be observed visually, allowing researchers to determine if iron is present in the sample.
</think>

In simpler terms, the Perl's stain or Prussian blue test is used to identify iron in biological samples.
"""

The fine-tuned model shows strong reasoning capabilities, providing structured, detailed explanations with a clear thought process and breaking the concepts down step by step before arriving at the final answer. This example showcases the effectiveness of our fine-tuning approach using Hugging Face Transformers and a SageMaker Training job.

Clean up

To clean up your resources and avoid incurring additional costs, follow these steps:

  1. Delete any unused SageMaker Studio resources.
  2. (Optional) Delete the SageMaker Studio domain.
  3. Verify that your training job is no longer running. To do so, on the SageMaker console, under Training in the navigation pane, choose Training jobs.
  4. Delete the SageMaker endpoint (a boto3 sketch follows this list).
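
If you created the endpoint with the boto3 calls shown earlier, the following hedged sketch removes the endpoint together with its endpoint configuration and model; the resource names mirror the ones used in the deployment steps and are assumptions about your setup:

import boto3

sm_client = boto3.client("sagemaker")

# Names built the same way as in the deployment steps above
base_name = model_id.split('/')[-1].replace('.', '-')

sm_client.delete_endpoint(EndpointName=f"{base_name}-sft")
sm_client.delete_endpoint_config(EndpointConfigName=f"{base_name}-config")
sm_client.delete_model(ModelName=f"{base_name}-model")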

Conclusion

In this post, we demonstrated how enterprises can efficiently scale fine-tuning of both small and large language models by using the integration between the Hugging Face Transformers library and SageMaker Training jobs. This powerful combination transforms traditionally complex and resource-intensive processes into streamlined, scalable, and production-ready workflows.

Using a practical example with the meta-llama/Llama-3.1-8B model and the MedReason dataset, we demonstrated how to apply advanced techniques like FSDP and LoRA to reduce training time and cost without compromising model quality.

This solution highlights how enterprises can effectively address common LLM fine-tuning challenges such as fragmented toolchains, high memory and compute requirements, multi-node scaling inefficiencies, and GPU underutilization.

By using the integrated Hugging Face and SageMaker architecture, businesses can now build and deploy customized, domain-specific models faster, with greater control, cost-efficiency, and scalability.

To get started with your own LLM fine-tuning project, explore the code samples provided in our GitHub repository.


About the Authors

Florent Gbelidji is a Machine Learning Engineer for Customer Success at Hugging Face. Based in Paris, France, Florent joined Hugging Face 3.5 years ago as an ML Engineer in the Expert Acceleration Program, helping companies build solutions with open source AI. He is now the Cloud Partnership Tech Lead for the AWS account, driving integrations between the Hugging Face ecosystem and AWS services.

Bruno Pistone is a Senior Worldwide Generative AI/ML Specialist Solutions Architect at AWS based in Milan, Italy. He works with AWS product teams and large customers to help them fully understand their technical needs and design AI and machine learning solutions that take full advantage of the AWS cloud and the Amazon ML stack. His expertise includes distributed training and inference workloads, model customization, generative AI, and end-to-end ML. He enjoys spending time with friends, exploring new places, and traveling.

Louise Ping is a Senior Worldwide GenAI Specialist, where she helps partners build go-to-market strategies and leads cross-functional initiatives to expand opportunities and drive adoption. Drawing on her diverse AWS experience across Storage, APN Partner Marketing, and AWS Marketplace, she works closely with strategic partners like Hugging Face to drive technical collaborations. When not working at AWS, she attempts home improvement projects, ideally with limited mishaps.

Safir Alvi is a Worldwide GenAI/ML Go-To-Market Specialist at AWS based in New York. He focuses on advising strategic global customers on scaling their model training and inference workloads on AWS, and driving adoption of Amazon SageMaker AI Training Jobs and Amazon SageMaker HyperPod. He specializes in optimizing and fine-tuning generative AI and machine learning models across various industries, including financial services, healthcare, automotive, and manufacturing.
