Import a fine-tuned Meta Llama 3 model for SQL query generation on Amazon Bedrock

August 2, 2024


Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. Amazon Bedrock also provides a broad set of capabilities needed to build generative AI applications with security, privacy, and responsible AI practices.

Some FMs are publicly available, which allows for customization tailored to specific use cases and domains. However, deploying customized FMs to support generative AI applications in a secure and scalable manner isn't a trivial task. Hosting large models involves complexity around the selection of instance type and deployment parameters. To address this challenge, AWS recently announced the preview of Amazon Bedrock Custom Model Import, a feature that you can use to import customized models created in other environments, such as Amazon SageMaker, Amazon Elastic Compute Cloud (Amazon EC2) instances, and on premises, into Amazon Bedrock. This feature abstracts the complexity of the deployment process through simple APIs for model deployment and invocation. Currently, Custom Model Import supports importing custom weights for selected model architectures (Meta Llama 2 and Llama 3, Flan, and Mistral) and precisions (FP32, FP16, and BF16), and serving the models on demand and with provisioned throughput.

Customizing FMs can unlock significant value by tailoring their capabilities to specific domains or tasks. This is the first in a series of posts about model customization scenarios that can be imported into Amazon Bedrock to simplify the process of building scalable and secure generative AI applications. By demonstrating the process of deploying fine-tuned models, we aim to empower data scientists, ML engineers, and application builders to harness the full potential of FMs while addressing unique application requirements.

In this post, we demonstrate the process of fine-tuning Meta Llama 3 8B on SageMaker to specialize it in the generation of SQL queries (text-to-SQL). Meta Llama 3 8B is a relatively small model that offers a balance between performance and resource efficiency. AWS customers have explored fine-tuning Meta Llama 3 8B for the generation of SQL queries, especially when using non-standard SQL dialects, and have requested methods to import their customized models into Amazon Bedrock to benefit from the managed infrastructure and security that Amazon Bedrock provides when serving these models.

Solution overview

We walk through the steps of fine-tuning an FM using SageMaker, then importing and evaluating the fine-tuned FM for SQL query generation using Amazon Bedrock. The complete flow is shown in the following figure and covers the following steps:

  1. The user invokes a SageMaker training job to fine-tune the model using QLoRA and store the weights in an Amazon Simple Storage Service (Amazon S3) bucket in the user's account.
  2. When the fine-tuning job is complete, the user runs the model import job using the Amazon Bedrock console. This step runs Steps 3–5 automatically.
  3. Amazon Bedrock starts an import job in an AWS operated deployment account.
  4. Model artifacts are copied from the user's account into an AWS managed S3 bucket.
  5. When the import job is complete, the fine-tuned model is made available to be invoked.

Bedrock custom model import architecture

All data remains within the selected AWS Region, the model artifacts are imported into the AWS operated deployment account using a VPC endpoint, and you can encrypt your model data with your own AWS Key Management Service (AWS KMS) keys. The scripts for fine-tuning and evaluation are available in the GitHub repository.

A copy of your model artifacts is stored in an AWS operated deployment account. This copy remains until the custom model is deleted. Deleting artifacts in the user's account won't delete the model or the artifacts in the AWS operated account. If different versions of a model are imported into Amazon Bedrock, each version is managed as an independent project with its own set of artifacts. You can apply tags to models and import jobs to keep track of different projects and versions.

Meta Llama 3 8B is a gated model on Hugging Face, which means that users must be granted access before they're allowed to download and customize the model. Sign in to your Hugging Face account, read the Meta Llama 3 Acceptable Use Policy, and submit your contact information to be granted access. This process might take a couple of hours.

We use the sql-create-context dataset available on Hugging Face for fine-tuning. The dataset contains 78,577 tuples of context (table schema), question (query expressed in natural language), and answer (SQL query). Refer to the licensing information regarding this dataset before proceeding further.
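
For reference, the dataset can be loaded and inspected with the Hugging Face datasets library. The following is a minimal sketch (the dataset identifier b-mc2/sql-create-context is assumed from the Hugging Face listing):

from datasets import load_dataset

# Load the text-to-SQL dataset (identifier assumed from the Hugging Face listing)
dataset = load_dataset("b-mc2/sql-create-context", split="train")

print(dataset)     # 78,577 records with 'context', 'question', and 'answer' fields
print(dataset[0])  # one record: table schema, natural language question, and SQL answer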

We use Amazon SageMaker Studio to create a remote fine-tuning job, which will run as a SageMaker training job. SageMaker Studio is a single web-based interface for end-to-end machine learning (ML) development. If you need help configuring your SageMaker Studio domain and your JupyterLab environment, see Launch Amazon SageMaker Studio. The training job will use QLoRA and the PyTorch FullyShardedDataParallel API (FSDP) to fine-tune the Meta Llama 3 model. QLoRA quantizes a pretrained language model to 4 bits and attaches smaller low-rank adapters (LoRA), which are fine-tuned with our training data. PyTorch FSDP is a parallelism technique that shards the model across GPUs for efficient training. See the following notebook for the complete code sample.

Data preparation

In the data preparation stage, we use the following prompt template to insert specific instructions for interpreting the context and fulfilling the request, and store the modified training dataset as JSON files that are uploaded to Amazon S3:

system_message = """You are a powerful text-to-SQL model. Your job is to answer questions about a database."""

def create_conversation(record):
    sample = {"messages": [
        {"role": "system", "content": system_message + f"""You can use the following table schema for context: {record["context"]}"""},
        {"role": "user", "content": f"""Return the SQL query that answers the following question: {record["question"]}"""},
        {"role": "assistant", "content": f"""{record["answer"]}"""}
    ]}
    return sample
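
Continuing the loading sketch above, the template might be applied and the splits persisted as follows (a sketch, not code from the notebook; bucket names are placeholders):

from sagemaker.s3 import S3Uploader

# Apply the chat template to every record and drop the original columns
dataset = dataset.map(create_conversation, remove_columns=list(dataset.features))

# Hold out a small test split and write JSON Lines files for the training job
dataset = dataset.train_test_split(test_size=0.05)
dataset["train"].to_json("train_dataset.json", orient="records")
dataset["test"].to_json("test_dataset.json", orient="records")

# Upload both files to Amazon S3 (bucket and prefixes are placeholders)
S3Uploader.upload("train_dataset.json", "s3://your-bucket/train")
S3Uploader.upload("test_dataset.json", "s3://your-bucket/test")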

Fine-tune the Meta Llama 3 8B model

Refer to the run_fsdp_qlora.py file defined in the notebook for a full description of the fine-tuning script. The following snippets describe the configuration of the QLoRA job:

if script_args.use_qlora:
    print(f"Using QLoRA - {torch_dtype}")
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch_dtype,
        bnb_4bit_quant_storage=quant_storage_dtype,
    )
else:
    quantization_config = None

peft_config = LoraConfig(
    lora_alpha=8,
    lora_dropout=0.05,
    r=16,
    bias="none",
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)
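
For context, the two configurations come together when the base model is loaded; the adapters defined by peft_config are attached later by the trainer. The following is a sketch under the assumptions above (the model ID and compute dtype may differ in the actual script):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in 4-bit precision; quantization_config comes from the snippet above
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,  # assumed compute dtype
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokenizer.pad_token = tokenizer.eos_token  # Llama 3 has no pad token by default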

The trainer class is based on the Supervised Fine-tuning Trainer (SFTTrainer) from Hugging Face, which is an API to create your SFT models and train them with just a few lines of code:

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    dataset_text_field="text",
    eval_dataset=test_dataset,
    peft_config=peft_config,
    max_seq_length=script_args.max_seq_length,
    tokenizer=tokenizer,
    packing=True,
    dataset_kwargs={
        "add_special_tokens": False,  # We template with special tokens
        "append_concat_token": False,  # No need to add an additional separator token
    },
)

Once the adapter is trained, it's merged with the original model before persisting the weights. Custom Model Import doesn't support LoRA adapters at the moment.

model = model.merge_and_unload()
model.save_pretrained(
    sagemaker_save_dir, safe_serialization=True, max_shard_size="2GB"
)

For this use case, we use an ml.g5.12xlarge instance, which has four NVIDIA A10G GPUs. The key configurations are as follows:

huggingface_estimator = HuggingFace(
    entry_point          = "run_fsdp_qlora.py", # training script
    source_dir           = "scripts/trl/",      # directory containing all the files needed for training
    instance_type        = "ml.g5.12xlarge",    # instance type used for the training job
    instance_count       = 1,                   # the number of instances used for training
    max_run              = 2*24*60*60,          # maximum runtime in seconds (days * hours * minutes * seconds)
    base_job_name        = job_name,            # the name of the training job
    role                 = role,                # IAM role used in the training job to access AWS resources, e.g. S3
    volume_size          = 300,                 # the size of the EBS volume in GB
    transformers_version = '4.36.0',            # the transformers version used in the training job
    pytorch_version      = '2.1.0',             # the pytorch version used in the training job
    py_version           = 'py310',             # the python version used in the training job
    hyperparameters      = hyperparameters,     # the hyperparameters passed to the training job
    disable_output_compression = True,          # don't compress output to save training time and cost
    distribution         = {"torch_distributed": {"enabled": True}},
    environment          = {
        "HUGGINGFACE_HUB_CACHE": "/tmp/.cache", # set env variable to cache models in /tmp
        "HF_TOKEN": HfFolder.get_token(),       # Hugging Face token used to download the gated base model
        "ACCELERATE_USE_FSDP": "1",
        "FSDP_CPU_RAM_EFFICIENT_LOADING": "1"
    },
)
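
With the estimator defined, the training job can be launched by pointing it at the data in Amazon S3. The following is a minimal sketch (channel names and URIs are placeholders that must match what the training script expects):

# Start the SageMaker training job; each channel is mounted under /opt/ml/input/data/<channel>
huggingface_estimator.fit(
    {
        "train": "s3://your-bucket/train",  # placeholder URI
        "test": "s3://your-bucket/test",    # placeholder URI
    },
    wait=True,
)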

In our testing, the training job completed two epochs in approximately 2.5 hours on a single ml.g5.12xlarge instance, which incurred approximately $18 in training cost. After training is complete, the model weights in the Hugging Face safetensors format, the tokenizer, and the configuration file are uploaded to the S3 bucket defined in the training script. This path should be saved to be used as the base directory for the import job in the next section.

s3_files_path = huggingface_estimator.model_data["S3DataSource"]["S3Uri"]

The configuration file config.json tells Amazon Bedrock how to load the weights from the safetensors files. Some parameters to keep in mind are the model_type, which must be one of the types currently supported by Amazon Bedrock; max_position_embeddings, which sets the maximum length of input sequence that the model can handle; the model dimensions (hidden_size, intermediate_size, num_hidden_layers, and num_attention_heads); and the rotary position embedding (RoPE) parameters, which describe the encoding of position information. See the following configuration:

{
  "_name_or_path": "meta-llama/Meta-Llama-3-8B",
  "architectures": [
    "LlamaForCausalLM"
  ],
  "attention_bias": false,
  "attention_dropout": 0.0,
  "bos_token_id": 128000,
  "eos_token_id": 128001,
  "hidden_act": "silu",
  "hidden_size": 4096,
  "initializer_range": 0.02,
  "intermediate_size": 14336,
  "max_position_embeddings": 8192,
  "model_type": "llama",
  "num_attention_heads": 32,
  "num_hidden_layers": 32,
  "num_key_value_heads": 8,
  "pretraining_tp": 1,
  "rms_norm_eps": 1e-05,
  "rope_scaling": null,
  "rope_theta": 500000.0,
  "tie_word_embeddings": false,
  "torch_dtype": "float16",
  "transformers_version": "4.40.2",
  "use_cache": true,
  "vocab_size": 128256
}
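
Before running the import job, it can be worth validating these fields locally. The following is a sketch of a basic sanity check (the set of supported model_type values here is an assumption based on the architectures listed earlier):

import json

with open("config.json") as f:
    config = json.load(f)

# model_type values assumed from the supported architectures named above
supported_types = {"llama", "mistral", "t5"}
assert config["model_type"] in supported_types, f"Unsupported model_type: {config['model_type']}"
print(f"Maximum input sequence length: {config['max_position_embeddings']}")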

Import the fine-tuned model into Amazon Bedrock

To import the fine-tuned Meta Llama 3 model into Amazon Bedrock, complete the following steps:

  1. On the Amazon Bedrock console, choose Imported models in the navigation pane.
  2. Choose Import model.
  3. For Model name, enter llama-3-8b-text-to-sql.
  4. For Model import settings, enter the Amazon S3 location from the previous steps.
  5. Choose Import model. The model import job should take 15–18 minutes to complete.
  6. When it's done, choose Models to see your model.
  7. Copy the model Amazon Resource Name (ARN) so you can invoke the model with the AWS SDK in the next section.
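
Alternatively, the import job can be started programmatically with the CreateModelImportJob API. The following is a sketch using the AWS SDK for Python (the role ARN is a placeholder; the role needs read access to the S3 location):

import boto3

bedrock = boto3.client("bedrock")

# Start the import job; s3_files_path is the model data location saved earlier
response = bedrock.create_model_import_job(
    jobName="llama-3-8b-text-to-sql-import",
    importedModelName="llama-3-8b-text-to-sql",
    roleArn="arn:aws:iam::111122223333:role/BedrockModelImportRole",  # placeholder
    modelDataSource={"s3DataSource": {"s3Uri": s3_files_path}},
)
print(response["jobArn"])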

Evaluate SQL queries generated by the fine-tuned model

In this section, we provide two examples to evaluate the SQL queries generated by the fine-tuned model: one using the Amazon Bedrock Text Playground and one using a large language model (LLM) as a judge.

Using the Amazon Bedrock Text Playground

You can test the model using the Amazon Bedrock Text Playground. For optimal results, use the same prompt template used to preprocess your training data:

[INST] <<SYS>>You are a powerful text-to-SQL model. Your job is to answer questions about a database. You can use the following table schema for context: CREATE TABLE table_name_11 (tournament VARCHAR)<</SYS>>

[INST]Human: Return the SQL query that answers the following question: Which Tournament has A in 1987?[/INST]

Assistant:

The following animation shows the results.

Using an LLM as a judge

In the same example notebook, we used the Amazon Bedrock InvokeModel API to call our imported model on demand to generate SQL queries for records in our test dataset. We use the same prompt template used with the training data in the fine-tuning step. The imported model only supports parameters that were supported by the base model (max_tokens, top_p, and temperature). Imported models don't support penalty terms (repetition_penalty or length_penalty) or the use of token sampling instead of greedy decoding (do_sample). See the following code:

def get_sql_query(system_prompt, user_question):
    """
    Generate a SQL query using Llama 3 8B
    Remember to use the same template used in fine-tuning
    """
    formatted_prompt = f"[INST] <<SYS>>{system_prompt}<</SYS>>\n\n[INST]Human: {user_question}[/INST]\n\nAssistant:"
    native_request = {
        "prompt": formatted_prompt,
        "max_tokens": 100,
        "top_p": 0.9,
        "temperature": 0.1
    }
    response = client.invoke_model(modelId=model_id,
                                   body=json.dumps(native_request))
    response_text = json.loads(response.get('body').read())["outputs"][0]["text"]

    return response_text
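
The function assumes a Bedrock Runtime client and the model ARN copied earlier; the following is a minimal setup and invocation sketch (the ARN and Region are placeholders):

import json
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")  # placeholder Region
model_id = "arn:aws:bedrock:us-east-1:111122223333:imported-model/abcd1234"  # placeholder ARN

system_prompt = ("You are a powerful text-to-SQL model. Your job is to answer questions about a database. "
                 "You can use the following table schema for context: CREATE TABLE table_name_11 (tournament VARCHAR)")
user_question = "Return the SQL query that answers the following question: Which Tournament has A in 1987?"

print(get_sql_query(system_prompt, user_question))  # prints the generated SELECT statement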

After we generate model predictions, we use a different (more powerful) model to act as a judge and evaluate our fine-tuned model's responses. For this example, we use the Anthropic Claude 3 Sonnet LLM on Amazon Bedrock to measure the similarity between the desired answer and the predicted answer using the following prompt:

formatted_prompt = f"""You are a data science teacher that is introducing students to SQL. Consider the following question and schema:
{question}
{db_schema}

Here is the correct answer:
{correct_answer}

Here is the student's answer:
{test_answer}

Please provide a numeric score from 0 to 100 on how well the student's answer matches the correct answer for this question.
The score should be high if the answers say essentially the same thing.
The score should be lower if some parts are missing, or if extra unnecessary parts have been included.
The score should be 0 for an entirely wrong answer. Put the score in <SCORE> XML tags.
Do not consider your own answer to the question, but instead score based only on the correct answer above.
"""

The expected score based on our holdout split of the dataset was 96.65%, which is excellent for a small model tuned to a specific task.

Clean up

The model will scale down to zero after a period of no activity, and your cost will stop accruing. However, we recommend deleting the imported model using the Amazon Bedrock console. Remember to also delete the model artifacts from your S3 bucket when the fine-tuned model is no longer needed, to prevent incurring costs.
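
Cleanup can also be scripted; the following is a sketch using the AWS SDK for Python (the model name, bucket, and prefix are placeholders):

import boto3

# Delete the imported custom model when it is no longer needed
bedrock = boto3.client("bedrock")
bedrock.delete_imported_model(modelIdentifier="llama-3-8b-text-to-sql")

# Remove the fine-tuned weights from your own bucket (bucket and prefix are placeholders)
s3 = boto3.resource("s3")
s3.Bucket("your-bucket").objects.filter(Prefix="llama-3-8b-text-to-sql/").delete()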

Conclusion

This post presented an overview of the process of fine-tuning a small model using SageMaker to help generate more accurate SQL queries based on questions asked in natural language, and then importing the fine-tuned model into Amazon Bedrock using the Custom Model Import feature. After we imported the model, it was made available on demand through the Amazon Bedrock Playground and the InvokeModel API, which was used to evaluate the performance of the fine-tuned model against a holdout dataset using an LLM as a judge.

The following are recommended best practices that may be helpful when using fine-tuned FMs for code generation tasks:

  • Select a dataset that's relevant and diverse enough for your code generation task
  • Monitor the training job and PEFT parameters to prevent overfitting and catastrophic forgetting
  • Preprocess training data with a consistent instruction template
  • Store model weights using safetensors for fast loading
  • Invoke the model using the same instruction template used in fine-tuning, using only inference parameters that are supported by the base model and the Custom Model Import feature in Amazon Bedrock

Explore the Amazon Bedrock Custom Model Import feature as a way to deploy FMs fine-tuned for code generation tasks in a secure and scalable manner. Visit our GitHub repository to explore samples prepared for fine-tuning and importing models from various families.


About the Authors

Evandro Franco is a Sr. AI/ML Specialist Solutions Architect working at Amazon Web Services. He helps AWS customers overcome business challenges related to AI/ML on top of AWS. He has more than 18 years of experience working with technology, from software development, infrastructure, and serverless, to machine learning.

Felipe Lopez is a Senior AI/ML Specialist Solutions Architect at AWS. Prior to joining AWS, Felipe worked with GE Digital and SLB, where he focused on modeling and optimization products for industrial applications.

Jay Pillai is a Principal Solutions Architect at Amazon Web Services. In this role, he functions as the Global Generative AI Lead Architect and also the Lead Architect for Supply Chain Solutions with AABG. As an Information Technology Leader, Jay specializes in artificial intelligence, data integration, business intelligence, and user interface domains. He has 23 years of extensive experience working with several clients across supply chain, legal technologies, real estate, financial services, insurance, payments, and market research business domains.

Rupinder Grewal is a Senior AI/ML Specialist Solutions Architect with AWS. He currently focuses on the serving of models and MLOps on Amazon SageMaker. Prior to this role, he worked as a Machine Learning Engineer building and hosting models. Outside of work, he enjoys playing tennis and biking on mountain trails.

Sandeep Singh is a Senior Generative AI Data Scientist at Amazon Web Services, helping businesses innovate with generative AI. He specializes in Generative AI, Artificial Intelligence, Machine Learning, and System Design. He is passionate about developing state-of-the-art AI/ML-powered solutions to solve complex business problems for diverse industries, optimizing efficiency and scalability.

Ragha Prasad is a Principal Engineer and a founding member of Amazon Bedrock, where he has had the privilege to listen to customer needs first-hand, and understands what it takes to build and launch scalable and secure Gen AI products. Prior to Bedrock, he worked on numerous products in Amazon, ranging from devices to Ads to Robotics.
