
Benchmarking Amazon Nova and GPT-4o models with FloTorch

March 11, 2025


Based on an original post by Dr. Hemant Joshi, CTO, FloTorch.ai

A recent evaluation conducted by FloTorch compared the performance of Amazon Nova models with OpenAI’s GPT-4o.

Amazon Nova is a new generation of state-of-the-art foundation models (FMs) that deliver frontier intelligence and industry-leading price-performance. The Amazon Nova family of models includes Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro, which support text, image, and video inputs while generating text-based outputs. These models offer enterprises a range of capabilities, balancing accuracy, speed, and cost-efficiency.

Using its enterprise software, FloTorch conducted a detailed comparison between Amazon Nova models and OpenAI’s GPT-4o models with the Comprehensive Retrieval Augmented Generation (CRAG) benchmark dataset. FloTorch’s evaluation focused on three critical factors (latency, accuracy, and cost) across five diverse topics.

Key findings from the benchmark study:

  • GPT-4o demonstrated a slight advantage in accuracy over Amazon Nova Pro
  • Amazon Nova Pro outperformed GPT-4o in efficiency, running 21.97% faster while being 65.26% more cost-effective
  • Amazon Nova Micro and Amazon Nova Lite outperformed GPT-4o mini by 4 and 2 percentage points in accuracy, respectively
  • In terms of affordability, Amazon Nova Micro and Amazon Nova Lite were 73.10% and 56.59% cheaper than GPT-4o mini, respectively
  • Amazon Nova Micro and Amazon Nova Lite also demonstrated faster response times, with 20.48% and 26.60% improvements, respectively

In this post, we discuss the findings from this benchmarking in more detail.

The growing need for cost-effective AI models

The landscape of generative AI is rapidly evolving. OpenAI launched GPT-4o in May 2024, and Amazon introduced Amazon Nova models at AWS re:Invent in December 2024. Although GPT-4o has gained traction in the AI community, enterprises are showing increased interest in Amazon Nova due to its lower latency and cost-effectiveness.

Large language models (LLMs) are generally proficient at responding to user queries, but they sometimes generate overly broad or inaccurate responses. Additionally, LLMs might provide answers that extend beyond the company-specific context, making them unsuitable for certain enterprise use cases.

One of the most crucial applications for LLMs today is Retrieval Augmented Generation (RAG), which enables AI models to ground responses in enterprise knowledge bases such as PDFs, internal documents, and structured data. This is a critical requirement for enterprises that want their AI systems to provide responses strictly within a defined scope.

To better serve enterprise customers, the evaluation aimed to answer three key questions:

  • How does Amazon Nova Pro compare to GPT-4o in terms of latency, cost, and accuracy?
  • How do Amazon Nova Micro and Amazon Nova Lite perform against GPT-4o mini on these same metrics?
  • How well do these models handle RAG use cases across different industry domains?

By addressing these questions, the evaluation provides enterprises with actionable insights for selecting the right AI models for their specific needs, whether optimizing for speed, accuracy, or cost-efficiency.

Overview of the CRAG benchmark dataset

The CRAG dataset was released by Meta for testing with factual queries across five domains with eight question types and a large number of question-answer pairs. The five domains in the CRAG dataset are Finance, Sports, Music, Movie, and Open (miscellaneous). The eight question types are simple, simple_w_condition, comparison, aggregation, set, false_premise, post-processing, and multi-hop. The following table provides example questions with their domain and question type.

Domain     Question                                                                              Question Type
Sports     Can you carry less than the maximum number of clubs during a round of golf?          simple
Music      Can you tell me how many grammies were won by arlo guthrie till 60th grammy (2017)?  simple_w_condition
Open       Can i make cookies in an air fryer?                                                  simple
Finance    Did meta have any mergers or acquisitions in 2022?                                   simple_w_condition
Movie      In 2016, which movie was distinguished for its visual effects at the oscars?         simple_w_condition

The evaluation considered 200 queries from this dataset representing five domains and two question types, simple and simple_w_condition. Both types of questions are common from users, and a typical Google search for a query such as “Can you tell me how many grammies were won by arlo guthrie till 60th grammy (2017)?” might not give you the correct answer (one Grammy). FloTorch used these queries and their ground truth answers to create a subset benchmark dataset. The CRAG dataset also provides the top 5 search result pages for each query. These five webpages act as a knowledge base (source data) to constrain the RAG model’s responses. The goal is to index these five webpages dynamically using a common embedding algorithm and then use a retrieval (and reranking) strategy to retrieve chunks of data from the indexed knowledge base to infer the final answer.

Evaluation setup

The RAG evaluation pipeline consists of several key components, as illustrated in the following diagram.

In this section, we explore each component in more detail.

Knowledge base

FloTorch used the top 5 HTML webpages provided with the CRAG dataset for each query as the knowledge base source data. The HTML pages were parsed to extract text for the embedding stage.
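
The post does not include the parsing code itself; the following is a minimal sketch of the HTML-to-text step assuming a standard parser such as BeautifulSoup is used (the helper name and the set of stripped tags are illustrative, not FloTorch’s actual implementation):

# Minimal sketch of HTML-to-text extraction (assumed approach)
from bs4 import BeautifulSoup

def extract_text_from_html(html: str) -> str:
    """Strip markup, scripts, and styles; return plain text for embedding."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()  # drop non-content elements
    # collapse whitespace so the chunking stage sees clean, contiguous text
    return " ".join(soup.get_text(separator=" ").split())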

Chunking strategy

FloTorch used a fixed chunking strategy with a chunk size of 512 tokens (four characters is typically around one token) and a 10% overlap between chunks. Further experiments with different chunking strategies, chunk sizes, and percent overlaps will be conducted in the coming weeks, and this post will be updated with the results.
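
The fixed chunking described above can be approximated in a few lines. This sketch works in characters using the four-characters-per-token heuristic mentioned above; the function name and the character-based approximation are assumptions for illustration:

# Fixed-size chunking: 512 tokens ~ 2,048 characters, with 10% overlap
def chunk_text(text: str, chunk_tokens: int = 512, overlap: float = 0.10) -> list:
    chunk_chars = chunk_tokens * 4           # ~4 characters per token
    step = int(chunk_chars * (1 - overlap))  # advance 90% of a chunk each time
    return [text[i:i + chunk_chars] for i in range(0, len(text), step)]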

Embedding strategy

FloTorch used the Amazon Titan Text Embeddings V2 model on Amazon Bedrock with an output vector size of 1024. With a maximum input token limit of 8,192 for the model, the system efficiently embedded both the chunks from the knowledge base source data and the short queries from the CRAG dataset. Amazon Bedrock APIs make it straightforward to use Amazon Titan Text Embeddings V2 for embedding data.
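
For reference, a call to Amazon Titan Text Embeddings V2 through the Amazon Bedrock InvokeModel API looks roughly like the following. The request fields follow the Titan V2 documentation; the helper name and region are assumptions, and FloTorch’s actual create_embeddings_with_titan_bedrock helper may differ:

# Sketch of embedding text with Amazon Titan Text Embeddings V2 on Bedrock
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

def embed_with_titan_v2(text: str, normalize: bool = True) -> list:
    body = json.dumps({
        "inputText": text,
        "dimensions": 1024,   # output vector size used in this evaluation
        "normalize": normalize,
    })
    response = bedrock_runtime.invoke_model(
        modelId="amazon.titan-embed-text-v2:0", body=body)
    return json.loads(response["body"].read())["embedding"]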

Vector database

FloTorch selected Amazon OpenSearch Service as the vector database for its high performance. The implementation included a provisioned three-node sharded OpenSearch Service cluster. Each provisioned node was r7g.4xlarge, selected for its availability and sufficient capacity to meet the performance requirements. FloTorch used HNSW indexing in OpenSearch Service.
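
An HNSW-backed k-NN index for the 1024-dimensional Titan embeddings can be defined as shown below. The index name, field names, engine, and distance metric are assumptions for illustration; FloTorch’s exact index settings are not published:

# Sketch of creating a k-NN (HNSW) index in OpenSearch Service
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "my-opensearch-endpoint", "port": 443}], use_ssl=True)

client.indices.create(
    index="crag_titan_chunks",
    body={
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": 1024,  # matches the Titan V2 output vector size
                    "method": {"name": "hnsw", "engine": "nmslib", "space_type": "cosinesimil"},
                },
            }
        },
    },
)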

Retrieval (and reranking) strategy

FloTorch used a retrieval strategy with k-nearest neighbors (k-NN) and k = 5 for retrieved chunks. The experiments excluded reranking algorithms so that the retrieved chunks remained identical for both models when inferring the answer to the given query. The following code snippet embeds the given query and passes the embeddings to the search function:

import os
import logging
from typing import List

logger = logging.getLogger(__name__)

def search_results(interaction_ids: List[str], queries: List[str], k: int):
    """Retrieve top-k search results for each query from the vector index."""
    results = []
    embedding_max_length = int(os.getenv("EMBEDDING_MAX_LENGTH", 1024))
    normalize_embeddings = os.getenv("NORMALIZE_EMBEDDINGS", "True").lower() == "true"

    for interaction_id, query in zip(interaction_ids, queries):
        try:
            # embed the query, then search the per-interaction index
            _, _, embedding = create_embeddings_with_titan_bedrock(query, embedding_max_length, normalize_embeddings)
            results.append(search(interaction_id + '_titan', embedding, k))
        except Exception as e:
            logger.error(f"Error processing query {query}: {e}")
            results.append(None)
    return results
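
The search helper called above is not shown in the post. Assuming the OpenSearch client from the index sketch earlier, a minimal version could issue a k-NN query against the per-interaction index and return the top-k chunks (field names and return shape are assumptions):

# Assumed shape of the search() helper: a k-NN query for the top-k chunks
def search(index_name: str, embedding: list, k: int = 5):
    response = client.search(
        index=index_name,
        body={"size": k,
              "query": {"knn": {"embedding": {"vector": embedding, "k": k}}}},
    )
    return [{"text": hit["_source"]["text"], "score": hit["_score"]}
            for hit in response["hits"]["hits"]]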

Inferencing

FloTorch accessed the GPT-4o model from OpenAI using the available API key and used the Amazon Nova Pro model through Converse APIs. GPT-4o supports a context window of 128,000 tokens, compared to Amazon Nova Pro’s context window of 300,000 tokens. The maximum output token limit of GPT-4o is 16,384 vs. Amazon Nova Pro’s maximum output token limit of 5,000. The benchmarking experiments were conducted without Amazon Bedrock Guardrails functionality. The implementation used the universal gateway provided by the FloTorch enterprise version to enable consistent API calls using the same function and to track token count and latency metrics uniformly. The inference function code is as follows:

from tqdm import tqdm

def generate_responses(dataset_path: str, model_name: str, batch_size: int, api_endpoint: str, auth_header: str,
                       max_tokens: int, search_k: int, system_prompt: str):
    """Generate responses for queries."""
    results = []

    for batch in tqdm(load_data_in_batches(dataset_path, batch_size), desc="Generating responses"):
        interaction_ids = [item["interaction_id"] for item in batch]
        queries = [item["query"] for item in batch]
        # retrieve the top search_k chunks for every query in the batch
        search_results_list = search_results(interaction_ids, queries, search_k)

        for i, item in enumerate(batch):
            item["search_results"] = search_results_list[i]

        responses = send_batch_request(batch, model_name, api_endpoint, auth_header, max_tokens, system_prompt)

        for i, response in enumerate(responses):
            results.append({
                "interaction_id": interaction_ids[i],
                "query": queries[i],
                "prediction": response.get("choices", [{}])[0].get("message", {}).get("content") if response else None,
                "response_time": response.get("response_time") if response else None,
                "response": response,
            })

    return results

Evaluation

Both models were evaluated by running batch queries. A batch size of eight was chosen to comply with Amazon Bedrock quota limits as well as GPT-4o rate limits. The query function code is as follows:

import time
import requests
from typing import Dict, List

def send_batch_request(batch: List[Dict], model_name: str, api_endpoint: str, auth_header: str, max_tokens: int,
                       system_prompt: str):
    """Send batch queries to the API."""
    headers = {"Authorization": auth_header, "Content-Type": "application/json"}
    responses = []

    for item in batch:
        query = item["query"]
        query_time = item["query_time"]
        retrieval_results = item.get("search_results", [])

        # build the grounding context from the retrieved chunks
        references = "# References \n" + "\n".join(
            [f"Reference {_idx + 1}:\n{res['text']}\n" for _idx, res in enumerate(retrieval_results)])
        user_message = f"{references}\n------\n\nUsing only the references listed above, answer the following question:\nQuestion: {query}\n"

        payload = {
            "model": model_name,
            "messages": [{"role": "system", "content": system_prompt},
                         {"role": "user", "content": user_message}],
            "max_tokens": max_tokens,
        }

        try:
            start_time = time.time()
            response = requests.post(api_endpoint, headers=headers, json=payload, timeout=25000)
            response.raise_for_status()
            response_json = response.json()
            # latency = wall-clock time from request to full response
            response_json['response_time'] = time.time() - start_time
            responses.append(response_json)
        except requests.RequestException as e:
            logger.error(f"API request failed for query: {query}. Error: {e}")
            responses.append(None)

    return responses

Benchmarking on the CRAG dataset

In this section, we discuss the latency, accuracy, and cost measurements from benchmarking on the CRAG dataset.

Latency

Latency for each query response was calculated as the difference between two timestamps: the timestamp when the API call is made to the inference LLM, and a second timestamp when the entire response is received from the inference endpoint. A lower latency indicates a faster-performing LLM, making it suitable for applications requiring rapid response times. The study indicates that latency can be further reduced for both models through optimizations and caching techniques; however, the evaluation focused on measuring out-of-the-box latency performance for both models.

Accuracy

FloTorch used a modified version of the local_evaluation.py script provided with the CRAG benchmark for accuracy evaluations. The script was enhanced to provide proper categorization of correct, incorrect, and missing responses. The default GPT-4o evaluation LLM in the evaluation script was replaced with the mixtral-8x7b-instruct-v0:1 model API. Additional modifications to the script enabled tracking of input and output tokens and latency, as described earlier.
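
The modified evaluation script is not reproduced in the post, but an LLM-as-judge check of this kind typically sends the query, ground truth, and prediction to the judge model with a classification prompt. The wording below is an illustrative assumption, not the actual prompt from FloTorch’s modified local_evaluation.py:

# Illustrative judge prompt for classifying responses as correct/incorrect/missing
EVALUATION_PROMPT = """You are given a question, the ground truth answer, and a model prediction.
Classify the prediction as exactly one of: correct, incorrect, or missing
("missing" means the model declined to answer or returned nothing relevant).

Question: {query}
Ground truth: {ground_truth}
Prediction: {prediction}

Respond with a single word: correct, incorrect, or missing."""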

Cost

Cost calculations were straightforward because both Amazon Nova Pro and GPT-4o have published prices per million input and output tokens. The calculation multiplies input tokens by the corresponding rate, applies the same process to output tokens, and sums the two; the total cost of running 200 queries combines the input token and output token costs. OpenSearch Service provisioned cluster costs were excluded from this analysis because the cost comparison focused solely on the inference stage between the Amazon Nova Pro and GPT-4o LLMs.
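
The methodology reduces to a one-line formula per model. In this sketch the rate arguments are placeholders to be filled in from the published pricing pages, which this analysis does not reproduce:

# Cost = input tokens x input rate + output tokens x output rate (rates per million tokens)
def inference_cost(input_tokens: int, output_tokens: int,
                   input_rate_per_m: float, output_rate_per_m: float) -> float:
    return (input_tokens / 1_000_000) * input_rate_per_m \
         + (output_tokens / 1_000_000) * output_rate_per_m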

Results

The following table summarizes the results.

Metric                                    Amazon Nova Pro             GPT-4o                      Observation
Accuracy on subset of the CRAG dataset    51.50% (103/200 correct)    53.00% (106/200 correct)    GPT-4o outperforms Amazon Nova Pro by 1.5% on accuracy
Cost of inference for 200 queries         $0.00030205                 $0.000869537                Amazon Nova Pro saves 65.26% in cost compared to GPT-4o
Average latency (seconds)                 1.682539835                 2.15615045                  Amazon Nova Pro is 21.97% faster than GPT-4o
Average of input and output tokens        1946.621359                 1782.707547                 Typical GPT-4o responses are shorter than Amazon Nova responses

For simple queries, Amazon Nova Pro and GPT-4o have similar accuracies (55 and 56 correct responses, respectively), but for simple queries with conditions, GPT-4o performs slightly better than Amazon Nova Pro (50 vs. 48 correct answers). Imagine you are part of an organization running an AI assistant service that handles 1,000 questions per month from each of 10,000 users (10,000,000 queries per month). Amazon Nova Pro will save your organization $5,674.88 per month ($68,098 per year) compared to GPT-4o.
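
The projection follows directly from the table if its cost figures are read as average per-query costs (which is what the published savings figure implies):

# Monthly savings projection from the per-query cost difference
nova_pro_cost_per_query = 0.00030205
gpt_4o_cost_per_query = 0.000869537
queries_per_month = 10_000_000  # 10,000 users x 1,000 questions each

monthly_savings = (gpt_4o_cost_per_query - nova_pro_cost_per_query) * queries_per_month
print(f"${monthly_savings:,.2f} per month, ${monthly_savings * 12:,.0f} per year")
# -> about $5,675 per month, or roughly $68,098 per year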

Let’s look at the corresponding results for the Amazon Nova Micro, Amazon Nova Lite, and GPT-4o mini models on the same dataset.
Metric                                    Amazon Nova Lite                Amazon Nova Micro               GPT-4o mini                 Observation
Accuracy on subset of the CRAG dataset    52.00% (104/200 correct)        54.00% (108/200 correct)        50.00% (100/200 correct)    Amazon Nova Lite and Amazon Nova Micro outperform GPT-4o mini by 2 and 4 points, respectively
Cost of inference for 200 queries         $0.00002247 (56.59% cheaper)    $0.000013924 (73.10% cheaper)   $0.000051768                Amazon Nova Lite and Amazon Nova Micro are cheaper than GPT-4o mini by 56.59% and 73.10%, respectively
Average latency (seconds)                 1.553371465 (26.60% faster)     1.6828564 (20.48% faster)       2.116291895                 Amazon Nova models are at least 20% faster than GPT-4o mini
Average of input and output tokens        1930.980769                     1940.166667                     1789.54                     GPT-4o mini returns shorter answers

Amazon Nova Micro is significantly faster and cheaper than GPT-4o mini while providing more accurate answers. If you are running a service that handles about 10 million queries each month, switching from GPT-4o mini will save you on average 73% of what you would otherwise be paying for slightly less accurate results.

Conclusion

Based on these evaluations of RAG use cases, Amazon Nova models produce comparable or higher accuracy at significantly lower cost and latency than the GPT-4o and GPT-4o mini models. FloTorch is continuing further experimentation with other comparable LLMs. Future evaluations will include additional experiments with the remaining question types, such as comparison, aggregation, set, false_premise, post-processing, and multi-hop queries.

Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.

About FloTorch

FloTorch.ai helps enterprise customers design and manage agentic workflows in a secure and scalable manner. FloTorch’s mission is to help enterprises make data-driven decisions across the end-to-end generative AI pipeline, including but not limited to model selection, vector database selection, and evaluation strategies. FloTorch offers an open source version for customers with scalable experimentation across different chunking, embedding, retrieval, and inference strategies. The open source version runs in a customer’s AWS account, so you can experiment in your own AWS account with your proprietary data. Users are invited to try out FloTorch from AWS Marketplace or from GitHub. FloTorch also offers an enterprise version of this product for scalable experimentation with LLM models and vector databases on cloud platforms. The enterprise version also includes a universal gateway with a model registry to custom define new LLMs, and a recommendation engine to suggest new LLMs and agent workflows. For more information, contact us at info@flotorch.ai.


About the authors

Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative generative AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.

Dr. Hemant Joshi has over 20 years of industry experience building products and services with AI/ML technologies. As CTO of FloTorch, Hemant is engaged with customers to implement state-of-the-art generative AI solutions and agentic workflows for enterprises.
