
Metadata filtering for tabular data with Knowledge Bases for Amazon Bedrock

by admin
July 20, 2024
in Artificial Intelligence


Amazon Bedrock is a fully managed service that offers a choice of high-performing foundation models (FMs) from leading artificial intelligence (AI) companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API. To equip FMs with up-to-date and proprietary information, organizations use Retrieval Augmented Generation (RAG), a technique that fetches data from company data sources and enriches the prompt to provide more relevant and accurate responses. Knowledge Bases for Amazon Bedrock is a fully managed capability that helps you implement the entire RAG workflow, from ingestion to retrieval and prompt augmentation. However, information about one dataset can live in another dataset, known as metadata. Without using metadata, your retrieval process can return unrelated results, thereby decreasing FM accuracy and increasing cost in the FM prompt tokens.

On March 27, 2024, Amazon Bedrock announced a key new feature called metadata filtering and also changed the default engine. This change allows you to use metadata fields during the retrieval process. However, the metadata fields have to be configured during the knowledge base ingestion process. Often, you might have tabular data where details about one field are available in another field. Also, you could have a requirement to cite the exact text document or text field to prevent hallucination. In this post, we show you how to use the new metadata filtering feature with Knowledge Bases for Amazon Bedrock for such tabular data.

Solution overview

The solution consists of the following high-level steps:

  1. Prepare data for metadata filtering.
  2. Create and ingest data and metadata into the knowledge base.
  3. Retrieve data from the knowledge base using metadata filtering.

Prepare data for metadata filtering

As of this writing, Knowledge Bases for Amazon Bedrock supports Amazon OpenSearch Serverless, Amazon Aurora, Pinecone, Redis Enterprise, and MongoDB Atlas as underlying vector store providers. In this post, we create and access an OpenSearch Serverless vector store using the Amazon Bedrock Boto3 SDK. For more details, see Set up a vector index for your knowledge base in a supported vector store.
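If you prefer to provision the vector store yourself, the following is a minimal sketch of creating an OpenSearch Serverless collection with Boto3 (the same aoss_client that the cleanup code at the end of this post uses). The policy and collection names are placeholders, and you would still need a data access policy and a vector index before ingestion:

import json
import boto3

aoss_client = boto3.client('opensearchserverless')
collection_name = 'bedrock-recipe-kb'  # placeholder name

# An encryption policy is required before the collection can be created
aoss_client.create_security_policy(
    name=f'{collection_name}-enc',
    type='encryption',
    policy=json.dumps({
        'Rules': [{'ResourceType': 'collection',
                   'Resource': [f'collection/{collection_name}']}],
        'AWSOwnedKey': True
    })
)

# Allow public network access for this demo; tighten this for real workloads
aoss_client.create_security_policy(
    name=f'{collection_name}-net',
    type='network',
    policy=json.dumps([{
        'Rules': [{'ResourceType': 'collection',
                   'Resource': [f'collection/{collection_name}']}],
        'AllowFromPublic': True
    }])
)

# Create the vector search collection that backs the knowledge base
collection = aoss_client.create_collection(name=collection_name,
                                           type='VECTORSEARCH')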

For this post, we create a knowledge base using the public dataset Food.com – Recipes and Reviews. The following screenshot shows an example of the dataset.

The TotalTime is in ISO 8601 format. You can convert it to minutes using the following logic:

import re

# Function to convert an ISO 8601 duration (for example, PT1H30M) to minutes
def convert_to_minutes(duration):
    hours = 0
    minutes = 0
    
    # Find hours and minutes using regex
    match = re.match(r'PT(?:(\d+)H)?(?:(\d+)M)?', duration)
    
    if match:
        if match.group(1):
            hours = int(match.group(1))
        if match.group(2):
            minutes = int(match.group(2))
    
    # Convert the total time to minutes
    total_minutes = hours * 60 + minutes
    return total_minutes

df['TotalTimeInMinutes'] = df['TotalTime'].apply(convert_to_minutes)
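As a quick check, convert_to_minutes('PT1H30M') returns 90 and convert_to_minutes('PT25M') returns 25; a missing hours or minutes component simply contributes zero.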

After converting some of the features like CholesterolContent, SugarContent, and RecipeInstructions, the data frame looks like the following screenshot.

To allow the FM to point to a specific menu with a link (cite the document), we split each row of the tabular data into a separate text file, with each file containing RecipeInstructions as the data field and TotalTimeInMinutes, CholesterolContent, and SugarContent as metadata. The metadata has to be stored in a separate JSON file with the same name as the data file and .metadata.json appended to its name. For example, if the data file name is 100.txt, the metadata file name has to be 100.txt.metadata.json. For more details, see Add metadata to your files to allow for filtering. Also, the content in the metadata file has to be in the following format:

{
    "metadataAttributes": {
        "${attribute1}": "${value1}",
        "${attribute2}": "${value2}",
        ...
    }
}
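For instance, the metadata file 100.txt.metadata.json for this dataset might look like the following (the values shown are illustrative):

{
    "metadataAttributes": {
        "Name": "Quick Tomato Salad",
        "TotalTimeInMinutes": "15",
        "CholesterolContent": "0",
        "SugarContent": "4.2"
    }
}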

For the sake of simplicity, we only process the top 2,000 rows to create the knowledge base.

  1. After you import the necessary libraries, create a local directory using the following Python code:
    import pandas as pd
    import os, json, tqdm, boto3
    
    metafolder = "multi_file_recipe_data"
    os.mkdir(metafolder)

  2. Iterate over the top 2,000 rows to create data and metadata files to store in the local folder:
    for i in tqdm.trange(2000):
        desc = str(df['RecipeInstructions'][i])
        meta = {
            "metadataAttributes": {
                "Name": str(df['Name'][i]),
                "TotalTimeInMinutes": str(df['TotalTimeInMinutes'][i]),
                "CholesterolContent": str(df['CholesterolContent'][i]),
                "SugarContent": str(df['SugarContent'][i]),
            }
        }
        # Write the recipe instructions as the data file
        filename = metafolder + '/' + str(i+1) + '.txt'
        with open(filename, 'w') as f:
            f.write(desc)
        # Write the companion metadata file next to it
        metafilename = filename + '.metadata.json'
        with open(metafilename, 'w') as f:
            json.dump(meta, f)
    

  3. Create an Amazon Simple Storage Service (Amazon S3) bucket named recipe-kb and upload the files:
    # Upload data to S3
    s3_client = boto3.client("s3")
    bucket_name = "recipe-kb"
    data_root = metafolder + '/'
    
    def uploadDirectory(path, bucket_name):
        for root, dirs, files in os.walk(path):
            for file in tqdm.tqdm(files):
                s3_client.upload_file(os.path.join(root, file), bucket_name, file)
    
    uploadDirectory(data_root, bucket_name)
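Note that each .txt data file and its .txt.metadata.json companion land side by side in the bucket; the ingestion job pairs them by name, which is what attaches the filterable attributes to each document's chunks.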

Create and ingest data and metadata into the knowledge base

When the S3 folder is ready, you can create the knowledge base on the Amazon Bedrock console or programmatically with the SDK, according to this example notebook.
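If you follow the SDK path, the following is a minimal sketch of the last two steps, assuming the knowledge base and its execution role already exist (for example, created by the notebook): attach the S3 bucket as a data source and start an ingestion job with the bedrock-agent client. The data source name is a placeholder.

import boto3

bedrock_agent = boto3.client('bedrock-agent')
kb_id = 'EIBBXVFDQP'  # placeholder knowledge base ID

# Attach the S3 bucket as a data source of the existing knowledge base
ds = bedrock_agent.create_data_source(
    knowledgeBaseId=kb_id,
    name='recipe-data-source',  # placeholder name
    dataSourceConfiguration={
        'type': 'S3',
        's3Configuration': {'bucketArn': 'arn:aws:s3:::recipe-kb'}
    }
)['dataSource']

# Start the ingestion job that chunks, embeds, and indexes the documents
job = bedrock_agent.start_ingestion_job(
    knowledgeBaseId=kb_id,
    dataSourceId=ds['dataSourceId']
)['ingestionJob']
print(job['status'])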

Retrieve data from the knowledge base using metadata filtering

Now let's retrieve some data from the knowledge base. For this post, we use Anthropic Claude Sonnet on Amazon Bedrock for our FM, but you can choose from a variety of Amazon Bedrock models. First, you need to set the following variables, where kb_id is the ID of your knowledge base. The knowledge base ID can be found programmatically, as shown in the example notebook, or from the Amazon Bedrock console by navigating to the individual knowledge base, as shown in the following screenshot.

Set the required Amazon Bedrock parameters using the following code:

import boto3
import pprint
from botocore.client import Config
import json

pp = pprint.PrettyPrinter(indent=2)
session = boto3.session.Session()
region = session.region_name
bedrock_config = Config(connect_timeout=120, read_timeout=120, retries={'max_attempts': 0})
bedrock_client = boto3.client('bedrock-runtime', region_name=region)
bedrock_agent_client = boto3.client("bedrock-agent-runtime",
                              config=bedrock_config, region_name=region)
kb_id = "EIBBXVFDQP"
model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

# Retrieve API for fetching only the relevant context

query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"

relevant_documents = bedrock_agent_client.retrieve(
    retrievalQuery={
        'text': query
    },
    knowledgeBaseId=kb_id,
    retrievalConfiguration={
        'vectorSearchConfiguration': {
            'numberOfResults': 2
        }
    }
)
pp.pprint(relevant_documents["retrievalResults"])

The following is the output of the retrieval from the knowledge base without metadata filtering for the query "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10." As we can see, of the two retrieved recipes, the preparation times are 30 and 480 minutes, respectively, and the cholesterol contents are 86 and 112.4, respectively. Therefore, the retrieval isn't following the query exactly.
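To verify those numbers yourself, you can print the metadata returned with each chunk. The following assumes the Retrieve response shape, in which each result carries a relevance score and the metadata attributes ingested earlier:

# Print the score and filterable attributes of each retrieved chunk
for doc in relevant_documents["retrievalResults"]:
    meta = doc.get("metadata", {})
    print(doc["score"],
          meta.get("TotalTimeInMinutes"),
          meta.get("CholesterolContent"))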

The following code demonstrates how to use the Retrieve API with the metadata filters set to a cholesterol content less than 10 and a preparation time less than 30 minutes for the same query:

def retrieve(query, kbId, numberOfResults=5):
    return bedrock_agent_client.retrieve(
        retrievalQuery={
            'text': query
        },
        knowledgeBaseId=kbId,
        retrievalConfiguration={
            'vectorSearchConfiguration': {
                'numberOfResults': numberOfResults,
                'filter': {
                    'andAll': [
                        {
                            'lessThan': {
                                'key': 'CholesterolContent',
                                'value': 10
                            }
                        },
                        {
                            'lessThan': {
                                'key': 'TotalTimeInMinutes',
                                'value': 30
                            }
                        }
                    ]
                }
            }
        }
    )

query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"
response = retrieve(query, kb_id, 2)
retrievalResults = response['retrievalResults']
pp.pprint(retrievalResults)

As we can see in the following results, of the two retrieved recipes, the preparation times are 27 and 20 minutes, respectively, and the cholesterol contents are both 0. With the use of metadata filtering, we get more accurate results.

The following code shows how to get accurate output using the same metadata filtering with the RetrieveAndGenerate API. First, we set the prompt, then we set up the API with metadata filtering:

prompt = f"""
Human: You have great knowledge about food, so provide answers to questions by using facts.
If you don't know the answer, just say that you don't know, don't try to make up an answer.

Assistant:"""

def retrieve_and_generate(query, kb_id, model_id, numberOfResults=10):
    return bedrock_agent_client.retrieve_and_generate(
        input={
            'text': query,
        },
        retrieveAndGenerateConfiguration={
            'type': 'KNOWLEDGE_BASE',
            'knowledgeBaseConfiguration': {
                'knowledgeBaseId': kb_id,
                'modelArn': model_id,
                'generationConfiguration': {
                    'promptTemplate': {
                        'textPromptTemplate': f"{prompt} $search_results$"
                    }
                },
                'retrievalConfiguration': {
                    'vectorSearchConfiguration': {
                        'numberOfResults': numberOfResults,
                        'overrideSearchType': 'HYBRID',
                        'filter': {
                            'andAll': [
                                {
                                    'lessThan': {
                                        'key': 'CholesterolContent',
                                        'value': 10
                                    }
                                },
                                {
                                    'lessThan': {
                                        'key': 'TotalTimeInMinutes',
                                        'value': 30
                                    }
                                }
                            ]
                        }
                    }
                }
            }
        }
    )

query = "Tell me a recipe that I can make under 30 minutes and has cholesterol less than 10"
response = retrieve_and_generate(query, kb_id, model_id, numberOfResults=10)
pp.pprint(response['output']['text'])

As we can see in the following output, the model returns a detailed recipe that follows the specified metadata filtering of less than 30 minutes of preparation time and a cholesterol content of less than 10.
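Because every chunk maps to exactly one recipe file, the same response can also be used to cite the source document, as discussed earlier. The following is a small sketch, assuming the standard RetrieveAndGenerate response shape, that prints the S3 URI of each cited chunk:

# Print the S3 URI of each document cited in the generated answer
for citation in response.get('citations', []):
    for ref in citation.get('retrievedReferences', []):
        print(ref['location']['s3Location']['uri'])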

Clean up

Make sure to comment out the following section if you're planning to use the knowledge base that you created for building your RAG application. If you only wanted to try out creating the knowledge base using the SDK, make sure to delete all the resources that were created, because you will incur costs for storing documents in the OpenSearch Serverless index. See the following code:

# Delete the data source, knowledge base, index, and collection
bedrock_agent_client.delete_data_source(dataSourceId=ds["dataSourceId"], knowledgeBaseId=kb['knowledgeBaseId'])
bedrock_agent_client.delete_knowledge_base(knowledgeBaseId=kb['knowledgeBaseId'])
oss_client.indices.delete(index=index_name)
aoss_client.delete_collection(id=collection_id)
aoss_client.delete_access_policy(type="data", name=access_policy['accessPolicyDetail']['name'])
aoss_client.delete_security_policy(type="network", name=network_policy['securityPolicyDetail']['name'])
aoss_client.delete_security_policy(type="encryption", name=encryption_policy['securityPolicyDetail']['name'])
# Delete roles and policies
iam_client.delete_role(RoleName=bedrock_kb_execution_role)
iam_client.delete_policy(PolicyArn=policy_arn)

Conclusion

In this post, we explained how to split a large tabular dataset into rows to prepare a knowledge base with metadata for each of those records, and how to then retrieve outputs with metadata filtering. We also showed how retrieving results with metadata filtering is more accurate than retrieving results without it. Finally, we showed how to use the result with an FM to get accurate results.

To further explore the capabilities of Knowledge Bases for Amazon Bedrock, refer to the following resources:


About the Author

Tanay Chowdhury is a Data Scientist at the Generative AI Innovation Center at Amazon Web Services. He helps customers solve their business problems using generative AI and machine learning.
