PEFT high-quality tuning of Llama 3 on SageMaker HyperPod with AWS Trainium

Coaching massive language fashions (LLMs) fashions has change into a major expense for companies. For a lot of use circumstances, corporations need to use LLM basis fashions (FM) with their domain-specific information. Nevertheless, corporations are discovering that performing full high-quality tuning for these fashions with their information isn’t value efficient. To cut back prices whereas persevering with to make use of the facility of AI, many corporations have shifted to high-quality tuning LLMs on their domain-specific information utilizing Parameter-Environment friendly Effective Tuning (PEFT). PEFT is a set of strategies designed to adapt pre-trained LLMs to particular duties whereas minimizing the variety of parameters that must be up to date. Methods similar to Low-Rank Adaptation (LoRA) and Weighted-Decomposed Low Rank Adaptation (DoRA), considerably decreasing the variety of trainable parameters and leading to decrease prices for high-quality tuning.

Along with value, performing high-quality tuning for LLMs at scale presents vital technical challenges. The method of establishing and configuring a distributed coaching surroundings may be advanced, requiring experience in server administration, cluster configuration, networking and distributed computing. Manually managing such complexity can usually be counter-productive and take away beneficial sources out of your companies AI improvement. To simplify infrastructure setup and speed up distributed coaching, AWS launched Amazon SageMaker HyperPod in late 2023.

On this weblog publish, we showcase how one can carry out environment friendly supervised high-quality tuning for a Meta Llama 3 mannequin utilizing PEFT on AWS Trainium with SageMaker HyperPod. We use HuggingFace’s Optimum-Neuron software program improvement package (SDK) to use LoRA to fine-tuning jobs, and use SageMaker HyperPod as the first compute cluster to carry out distributed coaching on Trainium. Utilizing LoRA supervised fine-tuning for Meta Llama 3 fashions, you possibly can additional cut back your value to high-quality tune fashions by as much as 50% and cut back the coaching time by 70%.

Resolution overview

SageMaker HyperPod is designed to assist cut back the time required to coach generative AI FMs by offering a purpose-built infrastructure for distributed coaching at scale. When utilizing SageMaker HyperPod for coaching, SageMaker will actively monitor the cluster’s well being, robotically changing defective nodes and resuming mannequin coaching from checkpoints. The clusters come pre-configured with SageMaker distributed coaching libraries that allow you to separate your coaching information and mannequin throughout 1000’s of compute nodes, permitting information to be processed in parallel whereas totally using the cluster’s compute and community infrastructure. You may also customise your distributed coaching. The structure diagram that follows gives a excessive degree overview of those numerous elements:

Compute cluster: This comprises a head node that orchestrates computation throughout a cluster of employee nodes. As a result of the pinnacle node is barely facilitating the coaching, it’s sometimes a a lot smaller occasion. On this publish, we use Amazon Elastic Compute Cloud (Amazon EC2) Trn1 situations for the employee nodes and a single Amazon EC2 C5 occasion for the pinnacle node.
Shared Quantity: FSx for Lustre is used because the shared storage quantity throughout nodes to maximise information throughput. It’s mounted at /fsx on the pinnacle and compute nodes.
Exterior storage: Amazon Easy Storage Service (Amazon S3) is used to retailer the cluster’s lifecycle scripts, configuration information, datasets, and checkpoints.
Scheduler: SLURM is used because the job scheduler for the cluster.

Trainium chips are purpose-built for deep studying coaching of 100 billion and bigger parameter fashions. Mannequin coaching on Trainium is supported by the AWS Neuron SDK, which gives compiler, runtime, and profiling instruments that unlock high-performance and cost-effective deep studying acceleration. To study extra about Trainium chips and the Neuron SDK, see Welcome to AWS Neuron.

To combine Trainium chips with current fashions and instruments offered by means of the transformers bundle, Hugging Face’s Optimum-Neuron bundle features as an interface with Neuron. With Optimum-Neuron, customers can apply strategies similar to LoRA to their fine-tuning jobs, streamlining the method of adapting LLMs for particular duties whereas capitalizing on the efficiency good points offered by the AWS infrastructure.

Conventional high-quality tuning entails modifying all of the parameters of a mannequin, which may be computationally costly and reminiscence intensive. PEFT approaches similar to LoRA concentrate on introducing a smaller set of trainable parameters, usually within the type of low-rank matrices that regulate the mannequin’s habits whereas preserving most of its parameters frozen. The benefit of LoRA lies in its capacity to take care of the efficiency of the bottom mannequin whereas considerably decreasing the computational burden and useful resource necessities. The Neuron 2.20 launch helps mannequin coaching with LoRA on Trainium.

Within the subsequent part, we’ll stroll by means of the code in three steps for PEFT on Trainium with HyperPod:

Organising and deploying a HyperPod cluster for distributed coaching.
Effective tuning a Meta Llama 3-8B mannequin on Trainium occasion with the dolly 15k dataset.
Mannequin weights consolidation and inference.

Amazon SageMaker HyperPod cluster setup

On this first part, you’ll start establishing your Amazon SageMaker HyperPod compute surroundings for high-quality tuning.

Conditions

The next are the conditions for configuring and deploying a SageMaker HyperPod cluster for high-quality tuning:

Step 1: Infrastructure setup

After finishing the conditions, deploy an AWS CloudFormation stack that comprises the required infrastructure elements for distributed coaching by means of SageMaker HyperPod. The default Area specified within the template is us-west-2, however you possibly can modify that. Additionally, you will must specify the Availability Zone the place your subnets can be deployed. The template configures your surroundings with an Amazon Digital Non-public Cloud (Amazon VPC) and corresponding private and non-private subnets for community isolation. It establishes further elements inside your VPC together with an S3 bucket for lifecycle scripts and FSx for Lustre, a file system shared throughout the pinnacle and compute nodes of the HyperPod cluster.

Step 2: Cluster configuration

Configure and deploy the HyperPod cluster. Start by defining your infrastructure’s surroundings variables by means of the create_config script. This script makes use of the AWS CLI to extract infrastructure part variables out of your CloudFormation stack together with Area, useful resource IDs, and Amazon Useful resource Identify (ARN).

# Set area
export AWS_REGION=us-west-2

# Fetch create_config script
curl 'https://static.us-east-1.prod.workshops.aws/public/05a78a77-24f9-4f29-867c-64c9687646e1/static/scripts/create_config.sh' --output create_config.sh

# Set surroundings variables
bash create_config.sh
supply env_vars

After setting your surroundings variables, obtain the lifecycle scripts required for bootstrapping the compute nodes in your SageMaker HyperPod cluster and outline its configuration settings earlier than importing the scripts to your S3 bucket.

# Obtain Lifecycle scripts
git clone --depth=1 https://github.com/aws-samples/awsome-distributed-training/

# add scripts to s3
aws s3 cp --recursive awsome-distributed-training/1.architectures/5.sagemaker-hyperpod/LifecycleScripts/base-config/ s3://${BUCKET}/src

After importing the Lifecycle scripts to Amazon S3, create your cluster and file system configurations. See the Create Cluster part of the SageMaker HyperPod workshop to create these information. After producing the cluster-config.json and provisioning_parameters.json configuration information, validate them and add the FSx for Lustre configuration file to Amazon S3.

# validate and test config for recognized points
curl -O https://uncooked.githubusercontent.com/aws-samples/awsome-distributed-training/most important/1.architectures/5.sagemaker-hyperpod/validate-config.py
python3 validate-config.py --cluster-config cluster-config.json --provisioning-parameters provisioning_parameters.json

# Add FSx configuration to S3
aws s3 cp provisioning_parameters.json s3://${BUCKET}/src/

Step 3: Cluster deployment

Now that the cluster’s configuration is outlined, you possibly can create the cluster.

aws sagemaker create-cluster 
--cli-input-json file://cluster-config.json 
--region $AWS_REGION

It is best to have the ability to see your cluster by navigating to SageMaker Hyperpod within the AWS Administration Console and see a cluster named ml-cluster listed. After a couple of minutes, its standing ought to change from Creating to InService.

If you choose your cluster, it is possible for you to to see the main points of your compute cluster together with the pinnacle and employee nodes.

After putting in the Techniques Supervisor Session Supervisor plugin, you possibly can ssh into your cluster’s head node utilizing the easy-ssh script to start coaching.

# Modify permissions and ssh
chmod +x easy-ssh.sh
./easy-ssh.sh -c controller-machine ml-cluster

# Change to ubuntu consumer
sudo su - ubuntu

# Change listing
cd /fsx

Now that your cluster is working and accessible by means of ssh, you possibly can start importing the mannequin coaching scripts to the shared file system by means of both curl or the AWS CLI. For extra directions on establishing your cluster, see the SageMaker HyperPod workshop.

Effective tuning

Now that your SageMaker HyperPod cluster is deployed, you can begin getting ready to execute your high-quality tuning job.

Information preparation

The muse of profitable language mannequin high-quality tuning lies in correctly structured and ready coaching information. This implementation focuses on instruction-tuned datasets, which kind the spine of recent language mannequin adaptation. These datasets work collectively to create significant coaching examples by means of three important elements:

Directions that information the mannequin’s activity.
Optionally available context that gives background data.
Responses that characterize the specified output.

Coaching begins by loading your dataset and formatting your dataset examples with this construction. Loading your dataset may be completed by means of the Hugging Face datasets library, which gives an easy interface for accessing and managing coaching information. Hugging Face additionally gives this format perform for the databricks-dolly-15k dataset. Notice that the format perform must be embedded in your prepare.py file (as proven within the following pattern). It’s referenced by the NeuronSFTTrainer to format your dataset throughout high-quality tuning.

# Load dataset
dataset = load_dataset(args.dataset, cut up="prepare")

def format_dolly(examples):
    output_text = []
    for i in vary(len(examples["instruction"])):
        instruction = f"### Instructionn{examples['instruction'][i]}"
        context = f"### Contextn{examples['context'][i]}" if examples["context"][i] else None
        response = f"### Answern{examples['response'][i]}"
        immediate = "nn".be part of([i for i in [instruction, context, response] if i is just not None])
        output_text.append(immediate)
    return output_text

The formatting perform employs delimiter tokens ("###") to create clear boundaries between totally different elements of every coaching instance. This separation is vital as a result of it helps the mannequin distinguish between totally different elements of the enter throughout coaching. The perform handles circumstances the place context could be lacking, ensuring that the ultimate format stays constant no matter whether or not all elements are current. Double newlines between sections present further structural readability that helps the mannequin acknowledge the pure breaks within the enter.

Tokenization

After formatting your dataset, the following step is tokenization—the method of changing your textual content information right into a numerical format that your mannequin can perceive. Tokenization serves because the bridge between your human-readable textual content and the mathematical operations that drive your mannequin’s understanding of language. To start, you utilize Hugging Face’s AutoTokenizer to load your mannequin’s tokenizer.

tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)
tokenizer.pad_token = tokenizer.eos_token

The AutoTokenizer class robotically selects the suitable tokenizer to your mannequin, loading not simply the vocabulary, but in addition the principles and particular tokens that match your coaching configuration. The task of the padding token to match the end-of-sequence token is especially vital for causal language modeling, as a result of it verifies the constant dealing with of your variable-length sequences.

The tokenization course of itself operates in a number of phases. First, it breaks down your enter textual content into tokens based mostly on its vocabulary. These tokens are then transformed to numerical IDs that your mannequin can course of. Throughout this course of, your tokenizer additionally handles particular tokens that mark the start and finish of sequences, along with padding tokens that guarantee that the sequences in your batch have the identical size.

When working with tokenizers, your sequence size administration turns into a essential consideration. Your most sequence size should stability between preserving sufficient data to your mannequin to grasp the context and staying inside your mannequin’s architectural limitations. Too quick, and also you threat shedding vital context; too lengthy, and also you may exceed reminiscence constraints or introduce pointless computational overhead.

Mannequin compilation and high-quality tuning

For this answer, you created a SageMaker HyperPod cluster with the controller node and one employee node. The employee node comprises one ml.trn1.32xlarge occasion which has 32 Neuron cores. You may conduct distributed high-quality tuning utilizing all 32 Neuron cores throughout the employee node.

Step 1: Setting setup

You first want to put in the required Python packages for high-quality tuning. The next is the bash script for the Python surroundings setup. Notice that the answer makes use of probably the most not too long ago launched Neuron SDK. From the HOME listing, create a file contact surroundings.sh with the next code and run it with sbatch ./surroundings.sh. You may want to switch the permissions of the shell scripts all through this publish earlier than working them with the command chmod +x surroundings.sh.

#!/usr/bin/env bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/surroundings.out

sudo apt set up -y python3.8-venv git
python3.8 -m venv $HOME/peft_ft/env_llama3_8B_peft
supply $HOME/peft_ft/env_llama3_8B_peft/bin/activate
pip set up -U pip

python3 -m pip config set international.extra-index-url "https://pip.repos.neuron.amazonaws.com"
python3 -m pip set up torch-neuronx==2.1.2.2.3.0 neuronx-cc==2.15.128.0 neuronx_distributed==0.9.0 torchvision
python3 -m pip set up datasets transformers peft huggingface_hub trl PyYAML
python3 -m pip set up git+https://github.com/huggingface/optimum-neuron.git

Together with your surroundings created, swap to your fine-tuning listing earlier than continuing to the following step: cd $HOME/peft_ft.

Step 1: Obtain the bottom Llama 3 8B mannequin and tokenizer from Hugging Face

Obtain the bottom Meta Llama 3 8B mannequin and the corresponding tokenizer from Hugging Face. You will want to first request entry for the mannequin from Meta on Hugging Face after which use your Hugging Face entry token to obtain the mannequin. The next is the Python code for the get_model.py script to obtain the mannequin and tokenizer. Create this file with contact get_model.py and duplicate the next code to this file earlier than shifting on to the following step.

import os
import argparse
from transformers import AutoTokenizer, LlamaForCausalLM

def download_model_and_tokenizer(model_id: str, model_output_path: str, tokenizer_output_path: str, huggingface_token: str = None) -> None:
    huggingface_token = os.environ.get("HUGGINGFACE_TOKEN", None)
    mannequin = LlamaForCausalLM.from_pretrained(model_id, token=huggingface_token)
    mannequin.save_pretrained(model_output_path)
    tokenizer = AutoTokenizer.from_pretrained(model_id, token=huggingface_token)
    tokenizer.save_pretrained(tokenizer_output_path)
    
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_id", sort=str, required=True, assist="Hugging Face Mannequin id")
    parser.add_argument("--model_output_path", sort=str, required=True, assist="Path to avoid wasting mannequin/weights file")
    parser.add_argument("--tokenizer_output_path", sort=str, required=True, assist="Path to avoid wasting tokenizer file")
    args, _ = parser.parse_known_args()
    download_model_and_tokenizer(model_id=args.model_id, model_output_path=args.model_output_path, tokenizer_output_path=args.tokenizer_output_path)

Subsequent, create the bash script contact get_model.sh with the code that follows and run it with the command sbatch ./get_model.sh. This may set off the get_model.py script to obtain the mannequin and tokenizer utilizing Slurm. Since you’re utilizing the Llama 3 8B mannequin, Hugging Face requires you to authenticate with an entry token previous to obtain. Make sure you add your entry token to get_model.sh earlier than working the script.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/get_model.out

export OMP_NUM_THREADS=1
export HUGGINGFACE_TOKEN=""
supply $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 $HOME/peft_ft/get_model.py 
--model_id meta-llama/Meta-Llama-3-8B-Instruct 
--model_output_path $HOME/peft_ft/model_artifacts/llama3-8B 
--tokenizer_output_path $HOME/peft_ft/tokenizer/llama3-8B

Step 2: Pre-compile mannequin

Coaching deep studying fashions on Trainium requires mannequin compilation. To do this, use the neuron_parallel_compile CLI utility, which can extract graphs from a trial run of your script, and carry out parallel pre-compilation of the computation graphs. Notice that the scripts for mannequin pre-compilation are an identical to these for the precise coaching, apart from max_steps. It’s because pre-compilation doesn’t require the completion of the whole coaching cycle; fairly, it necessitates roughly 10 coaching steps to extract the graphs. Earlier than compiling the mannequin, that you must create the coaching script, contact prepare.py which is used for each pre-compilation and mannequin high-quality tuning steps. Add the next code after creating the file, together with the format perform beforehand talked about.

import os
import torch
import argparse
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from optimum.neuron import NeuronSFTConfig, NeuronSFTTrainer
from optimum.neuron.distributed import lazy_load_for_parallelism
import torch_xla.core.xla_model as xm

# add format_dolly perform right here

def training_function(args):
    dataset = load_dataset(args.dataset, cut up="prepare")    
    tokenizer = AutoTokenizer.from_pretrained(args.tokenizer_path)
    tokenizer.pad_token = tokenizer.eos_token
    with lazy_load_for_parallelism(tensor_parallel_size=args.tp_size):
        mannequin = AutoModelForCausalLM.from_pretrained(
            args.model_path, 
            low_cpu_mem_usage=True, 
            torch_dtype=torch.bfloat16 if args.bf16 else torch.float32
        )

    lora_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],
        bias="none",
        task_type="CAUSAL_LM",
    )
        
    training_args = NeuronSFTConfig(
        output_dir=args.model_checkpoint_path,
        overwrite_output_dir=True,
        num_train_epochs=args.epochs,
        per_device_train_batch_size=args.train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        learning_rate=args.learning_rate,
        weight_decay=args.weight_decay,
        warmup_steps=args.warmup_steps,
        bf16=args.bf16,
        tensor_parallel_size=args.tp_size,
        pipeline_parallel_size=args.pp_size,
        save_steps=args.checkpoint_frequency,
        logging_steps=100,
        max_steps=args.max_steps,
        )

    coach = NeuronSFTTrainer(
        args=training_args,
        mannequin=mannequin,
        peft_config=lora_config,
        tokenizer=tokenizer,
        train_dataset=dataset,
        formatting_func=format_dolly,
    )

    coach.prepare()
    coach.save_model(args.model_final_path)
    
if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model_path", sort=str)
    parser.add_argument("--tokenizer_path", sort=str)
    parser.add_argument("--epochs", sort=int)
    parser.add_argument("--train_batch_size", sort=int)
    parser.add_argument("--learning_rate", sort=float)
    parser.add_argument("--weight_decay", sort=float)
    parser.add_argument("--bf16", sort=bool)
    parser.add_argument("--tp_size", sort=int)
    parser.add_argument("--pp_size", sort=int)
    parser.add_argument("--gradient_accumulation_steps", sort=int)
    parser.add_argument("--warmup_steps", sort=int)
    parser.add_argument("--early_stopping_patience", sort=int)
    parser.add_argument("--checkpoint_frequency", sort=int)
    parser.add_argument("--dataset", sort=str)
    parser.add_argument("--max_steps", sort=int)
    parser.add_argument("--max_seq_length", sort=int)
    parser.add_argument("--model_checkpoint_path", sort=str)
    parser.add_argument("--model_final_path", sort=str)
    args = parser.parse_args()
    training_function(args)

After creating the coaching file, use the next code to create the compile.sh script, which can set off finetune-llama3-8B.sh to compile the Llama 3 8B mannequin utilizing the neuron_parallel_compile command. You may run this with the sbatch compile.sh command.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH -o /fsx/ubuntu/peft_ft/logs/8b/compile.out

supply $HOME/peft_ft/env_llama3_8B_peft/bin/activate

export NEURON_EXTRACT_GRAPHS_ONLY=0
srun bash ${HOME}/peft_ft/finetune-llama3-8B.sh

The next is the finetune-llama3-8B.sh script, which lists the hyper-parameters to your mannequin high-quality tuning. The script makes use of tensor parallelism for the coaching with diploma of 8. With 32 NeuronCores within the ml.trn1.32xlarge occasion, you get information parallel of diploma 4. Notice that the script additionally units XLA_USE_BF16=1 to map each torch.float and torch.double tensors to bfloat16 tensors. This will each cut back reminiscence footprint and enhance efficiency. The script then units gradient_accumulation_steps to be 3 to get a bigger efficient batch measurement for gradient replace.

#!/bin/bash
GPUS_PER_NODE=32
if [ $NEURON_EXTRACT_GRAPHS_ONLY -gt 0 ]; then
    MAX_STEPS=10
    MAYBE_COMPILE="neuron_parallel_compile"
else
    MAX_STEPS=-1
fi

declare -a TORCHRUN_ARGS=(
    --nproc_per_node=$GPUS_PER_NODE
    --nnodes=$SLURM_JOB_NUM_NODES
)
export TRAIN_SCRIPT=${HOME}/peft_ft/prepare.py

declare -a TRAINING_ARGS=(
    --bf16 True 
    --checkpoint_frequency 400 
    --dataset "databricks/databricks-dolly-15k" 
    --max_steps $MAX_STEPS 
    --max_seq_length 1024 
    --epochs 1 
    --gradient_accumulation_steps 3 
    --learning_rate 2e-05 
    --model_path "/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B" 
    --tokenizer_path "/fsx/ubuntu/peft_ft/tokenizer/llama3-8B" 
    --model_checkpoint_path "/fsx/ubuntu/peft_ft/model_checkpoints" 
    --model_final_path "/fsx/ubuntu/peft_ft/model_checkpoints/ultimate" 
    --tp_size 8 
    --pp_size 1 
    --train_batch_size 1 
    --warmup_steps 100 
    --weight_decay 0.01 
)
$MAYBE_COMPILE torchrun "${TORCHRUN_ARGS[@]}" $TRAIN_SCRIPT "${TRAINING_ARGS[@]}"

Step 3: Mannequin high-quality tuning

After the mannequin compiling is full, you possibly can then begin the mannequin high-quality tuning by reusing the compile.sh script. To do that, stop the neuron_parallel_compile utility from being utilized by setting export NEURON_EXTRACT_GRAPHS_ONLY=-1 in compile.sh, after which re-run the script to begin high-quality tuning your mannequin. You may must delete the model_consolidation listing created through the earlier mannequin compilation step earlier than you begin your fine-tuning job.

Mannequin consolidation

When working with distributed machine studying workflows, you’ll usually must handle and merge mannequin weights effectively. Let’s discover two important processes that you simply’ll incessantly encounter: checkpoint consolidation and weight merging when performing LoRA high-quality tuning.

Checkpoint consolidation

Throughout distributed coaching, your mannequin checkpoints are sometimes cut up throughout a number of units based on the mannequin parallelism configuration that you simply present. To deliver these items again collectively, you’ll use a consolidation course of. Your consolidation perform handles three major duties. First, it combines distributed checkpoints right into a unified mannequin. Then, it manages reminiscence effectively by processing tensors in chunks. Lastly, it creates sharded outputs with an index file for fast entry.

LoRA weight merging

While you’re working with LoRA, that you must merge these adapters together with your base mannequin. The merging course of is easy however requires cautious consideration to element. Begin by loading your base mannequin and LoRA configuration. Then remodel the LoRA weight names to match your base mannequin’s construction. The method concludes by merging the adapters and saving the ultimate mannequin in a sharded format.

To place these instruments into follow, you should utilize the next scripts after your fine-tuning job has completed. First, create the Python file, contact consolidation.py and shell file, contact consolidation.sh utilizing the next code.

import argparse
import json
from pathlib import Path
from huggingface_hub import split_torch_state_dict_into_shards
from safetensors.torch import save_file
from optimum.neuron.distributed.checkpointing import consolidate_model_parallel_checkpoints
import torch

def custom_consolidate_to_unified_checkpoint(checkpoint_dir: str, output_dir: str, save_format: str = "safetensors"):
    output_dir.mkdir(mother and father=True, exist_ok=True)
    state_dict = consolidate_model_parallel_checkpoints(checkpoint_dir)
    for key, worth in state_dict.objects():
        if isinstance(worth, torch.Tensor):
            state_dict[key] = worth.contiguous()

    split_result = split_torch_state_dict_into_shards(state_dict, max_shard_size="5GB")
    # Save shards
    for shard_file, shard_tensors in split_result.filename_to_tensors.objects():
        shard_dict = {identify: state_dict[name] for identify in shard_tensors}
        shard_path = output_dir / shard_file
        if save_format == "safetensors":
            save_file(shard_dict, shard_path, metadata={"format": "pt"})
        else:
            torch.save(shard_dict, shard_path)

    index = {
        "metadata": split_result.metadata,
        "weight_map": split_result.tensor_to_filename
    }
    
    index_file = "mannequin.safetensors.index.json" if save_format == "safetensors" else "pytorch_model.bin.index.json"
    with open(output_dir / index_file, "w") as f:
        json.dump(index, f, indent=2)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_dir", sort=str, required=True)
    parser.add_argument("--output_dir", sort=str, required=True)
    parser.add_argument("--save_format", sort=str, selections=["safetensors", "pytorch"])
    args = parser.parse_args()
    output_dir = Path(args.output_dir)
    checkpoint_dir = Path(args.input_dir) / "adapter_shards"
    custom_consolidate_to_unified_checkpoint(
        checkpoint_dir=checkpoint_dir,
        output_dir=output_dir,
        save_format=args.save_format
    )

This code will consolidate the sharded checkpoint information generated throughout coaching right into a consolidated LoRA adaptersafetensor format. After saving the file, you possibly can invoke this script to set off the mannequin checkpoint consolidation job. The enter listing that you simply present factors to your fine-tuned mannequin’s sharded checkpoints and the output listing for the consolidated LoRA adapter safetensor file. You set off this with sbatch consolidation.sh.

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive

export OMP_NUM_THREADS=1
supply $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 "$HOME/peft_ft/consolidation.py" 
--input_dir "/fsx/ubuntu/peft_ft/model_checkpoints/checkpoint-1251" 
--output_dir "$HOME/peft_ft/model_checkpoints/adapter_shards_consolidation"
--save_format "safetensors"

After consolidation is full, that you must merge the LoRA adapter weights from the consolidated information with the bottom mannequin’s weights. Start by creating a brand new Python file contact merge_lora.py and shell file merge_lora.sh utilizing the next code.

import json
from peft import LoraConfig, PeftModel
from transformers import AutoModelForCausalLM
import torch
import argparse
from safetensors import safe_open


def merge_lora_weights(args):
    base_model = AutoModelForCausalLM.from_pretrained(args.base_model_path)
    with open(args.adapter_config_path, "r") as f:
        config_dict = json.load(f)
    peft_config = LoraConfig(**config_dict)
    mannequin = PeftModel(base_model, peft_config)
    
    lora_weights_tensors = {}
    with safe_open(args.lora_safetensors_path, framework="pt", system="cpu") as f:
        for okay in f.keys():
            lora_weights_tensors[k] = f.get_tensor(okay)
            
    for layer_name in listing(lora_weights_tensors):
        if 'layer' in layer_name and 'lora' in layer_name:
            new_layer_name = layer_name.change('weight', 'default.weight')
            lora_weights_tensors[new_layer_name] = lora_weights_tensors[layer_name].clone()
            del lora_weights_tensors[layer_name]
        else:
            del lora_weights_tensors[layer_name]

    updated_state_dict = mannequin.state_dict().copy()
    for layer, weights in lora_weights_tensors.objects():
        updated_state_dict[layer] = weights
    mannequin.load_state_dict(updated_state_dict)    
    merged_model = mannequin.merge_and_unload()    
    merged_model.save_pretrained(args.final_model_path, safe_serialization=True, max_shard_size="5GB")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--final_model_path", sort=str)
    parser.add_argument("--adapter_config_path", sort=str)
    parser.add_argument("--base_model_path", sort=str)
    parser.add_argument("--lora_safetensors_path", sort=str)
    args = parser.parse_args()
    merge_lora_weights(args)

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive
#SBATCH --output=/fsx/ubuntu/peft_ft/logs/8b/lora_weights.log

export OMP_NUM_THREADS=1
supply $HOME/peft_ft/env_llama3_8B_peft/bin/activate

srun python3 "$HOME/peft_ft/merge_lora.py" 
    --final_model_path "/fsx/ubuntu/peft_ft/model_checkpoints/final_model_output" 
    --adapter_config_path "/fsx/ubuntu/peft_ft/model_checkpoints/checkpoint-1251/adapter_config.json"
    --base_model_path "/fsx/ubuntu/peft_ft/model_artifacts/llama3-8B" 
    --lora_safetensors_path "/fsx/ubuntu/peft_ft/model_checkpoints/adapter_shards_consolidation/mannequin.safetensors"

Set off the run with sbatch merge_lora.sh to merge the mannequin weights. Right here the base_model_path parameter is the native listing the place you beforehand downloaded the mannequin from Hugging Face in step 1 of “Mannequin compilation and high-quality tuning.” Equally, the adapter_config_path parameter would be the mannequin’s configuration file beforehand downloaded and the lora_safetensors_path parameter would be the path to the mannequin.safetensor file output by the LoRA consolidation within the earlier step.

Inference

After consolidation and merging, the safetensors information can be saved to your final_model_path output listing containing the up to date mannequin weights after high-quality tuning. Utilizing these up to date weights, you possibly can load and generate a prediction to your educated mannequin within the context of the dolly dataset. To test that the fine-tuned mannequin understands the databricks-dolly-15k dataset it was high-quality tuned on, choose a query from the dataset for validation, as proven within the following determine.

Utilizing Hugging Face’s LlamaForCausalLM class you possibly can load your newly fine-tuned mannequin, and generate a prediction for the query, “Who’re the Smiths?” (proven within the following determine):

Evaluating the generated reply to the bottom fact context and response from the coaching dataset, it’s clear that the fine-tuned Meta Llama 3 mannequin now understands this information and can provide coherent responses to posed questions.

Outcomes

Method	Trainable parameters	Samples processed per second	Coaching time (minutes)
FPFT	7,570,591,744	2.083	90
PEFT	6,815,744	3.554	53

To benchmark the fine-tuned mannequin’s efficiency with LoRA on a single ml.trn1.32xlarge, we in contrast it to full parameter high-quality tuning (FPFT) for the mannequin over three coaching epochs. Measuring coaching samples processed per second confirmed a 70% improve in throughput and discount in coaching time for the LoRA fine-tuned mannequin. Subsequently, on-demand hours required to high-quality tune the mannequin on the dolly 15k dataset for 3 epochs was halved in comparison with FPFT, leading to a 50% discount of coaching prices.

Clear up

To wash up the sources provisioned for this publish, first delete the SageMaker HyperPod cluster. This may be completed both by means of the AWS CLI or within the SageMaker console.

aws sagemaker delete-cluster --cluster-name ml-cluster

After the cluster is deleted, delete the CloudFormation template to delete the remaining provisioned sources.

aws cloudformation delete-stack --stack-name sagemaker-hyperpod

Conclusion

On this publish, we confirmed you the right way to arrange a SageMaker HyperPod compute cluster for coaching. Then we confirmed you the right way to carry out multi-node distributed high-quality tuning with Trainium for a Meta Llama 3 mannequin utilizing LoRA. Lastly, we confirmed you the right way to consolidate mannequin weights throughout a distributed coaching surroundings to generate coherent predictions for the newly fine-tuned mannequin.

In regards to the Authors

Georgios Ioannides is a Deep Studying Architect with the AWS Generative AI Innovation Middle. Earlier than AWS, Georgios labored in startups, the place he specialised in sign processing, deep studying, and multi-modal and cross-modal machine studying methods for speech, imaginative and prescient, and textual content functions. He holds Grasp’s levels from Imperial Faculty London and Carnegie Mellon College.

Bingchen Liu is a Machine Studying Engineer with the AWS Generative AI Innovation Middle. Earlier than AWS, he labored as a lead MLE in ADP specializing in RAG functions, vector database, mannequin improvement, and serving. He holds a Grasp’s diploma in Laptop Science from Columbia College and a PhD in Statistics from Southern Methodist College.

Hannah Marlowe is a Senior Supervisor of Mannequin Customization on the AWS Generative AI Innovation Middle. Her workforce focuses on serving to prospects develop differentiating generative AI options utilizing their distinctive and proprietary information to realize key enterprise outcomes. She holds a PhD in Physics from the College of Iowa, with a concentrate on astronomical X-ray evaluation and instrumentation improvement. Exterior of labor, she may be discovered mountain climbing, mountain biking, and snowboarding across the mountains in Colorado.

Jeremy Roghair is a Machine Studying Engineer with the AWS Generative AI Innovation Middle, the place he focuses on creating generative AI options for distributed coaching workloads and mannequin internet hosting for patrons. Previous to becoming a member of AWS, Jeremy labored as a Information Scientist within the finance/insurance coverage business and earned a Grasp’s diploma in Laptop Science with analysis in reinforcement studying from Iowa State College.

PEFT high-quality tuning of Llama 3 on SageMaker HyperPod with AWS Trainium

Classifier-Free Steerage for LLMs Efficiency Enhancing | by Roman S | Dec, 2024

Chance Distributions: Poisson vs. Binomial Distribution | by Sunghyun Ahn | Dec, 2024

Chance Distributions: Poisson vs. Binomial Distribution | by Sunghyun Ahn | Dec, 2024

Leave a Reply Cancel reply

Popular News

Greatest practices for Amazon SageMaker HyperPod activity governance

How Cursor Really Indexes Your Codebase

Construct a serverless audio summarization resolution with Amazon Bedrock and Whisper

Context Engineering — A Complete Fingers-On Tutorial with DSPy

Speed up edge AI improvement with SiMa.ai Edgematic with a seamless AWS integration

About Us

Category

Recent Posts