Based on an original post by Dr. Hemant Joshi, CTO, FloTorch.ai
A recent evaluation conducted by FloTorch compared the performance of Amazon Nova models with OpenAI's GPT-4o.
Amazon Nova is a new generation of state-of-the-art foundation models (FMs) that deliver frontier intelligence and industry-leading price-performance. The Amazon Nova family of models includes Amazon Nova Micro, Amazon Nova Lite, and Amazon Nova Pro, which support text, image, and video inputs while generating text-based outputs. These models offer enterprises a range of capabilities, balancing accuracy, speed, and cost-efficiency.
Using its enterprise software, FloTorch conducted a detailed comparison between Amazon Nova models and OpenAI's GPT-4o models with the Comprehensive Retrieval Augmented Generation (CRAG) benchmark dataset. FloTorch's evaluation focused on three critical factors (latency, accuracy, and cost) across five diverse topics.
Key findings from the benchmark study:
- GPT-4o demonstrated a slight advantage in accuracy over Amazon Nova Pro
- Amazon Nova Pro outperformed GPT-4o in efficiency, running 21.97% faster while being 65.26% more cost-effective
- Amazon Nova Lite and Amazon Nova Micro outperformed GPT-4o mini by 2 and 4 percentage points in accuracy, respectively
- In terms of affordability, Amazon Nova Micro and Amazon Nova Lite were 73.10% and 56.59% cheaper than GPT-4o mini, respectively
- Amazon Nova Micro and Amazon Nova Lite also demonstrated faster response times, with 20.48% and 26.60% improvements, respectively
In this post, we discuss the findings from this benchmarking in more detail.
The growing need for cost-effective AI models
The landscape of generative AI is rapidly evolving. OpenAI released GPT-4o in May 2024, and Amazon released Amazon Nova models at AWS re:Invent in December 2024. Although GPT-4o has gained traction in the AI community, enterprises are showing increased interest in Amazon Nova due to its lower latency and cost-effectiveness.
Large language models (LLMs) are generally proficient in responding to user queries, but they sometimes generate overly broad or inaccurate responses. Additionally, LLMs might provide answers that extend beyond the company-specific context, making them unsuitable for certain enterprise use cases.
One of the most critical applications for LLMs today is Retrieval Augmented Generation (RAG), which enables AI models to ground responses in enterprise knowledge bases such as PDFs, internal documents, and structured data. This is a crucial requirement for enterprises that want their AI systems to provide responses strictly within a defined scope.
To better serve enterprise customers, the evaluation aimed to answer three key questions:
- How does Amazon Nova Pro compare to GPT-4o in terms of latency, cost, and accuracy?
- How do Amazon Nova Micro and Amazon Nova Lite perform against GPT-4o mini on these same metrics?
- How well do these models handle RAG use cases across different industry domains?
By addressing these questions, the evaluation provides enterprises with actionable insights for selecting the right AI models for their specific needs, whether optimizing for speed, accuracy, or cost-efficiency.
Overview of the CRAG benchmark dataset
The CRAG dataset was released by Meta for testing with factual queries across five domains, with eight question types and a large number of question-answer pairs. The five domains in the CRAG dataset are Finance, Sports, Music, Movie, and Open (miscellaneous). The eight question types are simple, simple_w_condition, comparison, aggregation, set, false_premise, post-processing, and multi-hop. The following table provides example questions with their domain and question type.
Domain | Question | Question Type
Sports | Can you carry less than the maximum number of clubs during a round of golf? | simple
Music | Can you tell me how many grammys were won by arlo guthrie till 60th grammy (2017)? | simple_w_condition
Open | Can i make cookies in an air fryer? | simple
Finance | Did meta have any mergers or acquisitions in 2022? | simple_w_condition
Movie | In 2016, which movie was distinguished for its visual effects at the oscars? | simple_w_condition
The evaluation considered 200 queries from this dataset representing five domains and two question types, simple and simple_w_condition. Both types of questions are common from users, and a typical Google search for a query such as "Can you tell me how many grammys were won by arlo guthrie till 60th grammy (2017)?" won't provide the correct answer (one Grammy). FloTorch used these queries and their ground truth answers to create a subset benchmark dataset. The CRAG dataset also provides the top 5 search result pages for each query. These five webpages act as a knowledge base (source data) to limit the RAG model's responses. The goal is to index these five webpages dynamically using a common embedding algorithm and then use a retrieval (and reranking) strategy to retrieve chunks of data from the indexed knowledge base to infer the final answer.
Evaluation setup
The RAG evaluation pipeline consists of several key components, as illustrated in the following diagram.
In this section, we explore each component in more detail.
Knowledge base
FloTorch used the top 5 HTML webpages provided with the CRAG dataset for each query as the knowledge base source data. The HTML pages were parsed to extract text for the embedding stage.
Chunking strategy
FloTorch used a fixed chunking strategy with a chunk size of 512 tokens (four characters is typically around one token) and a 10% overlap between chunks. Further experiments with different chunking strategies, chunk sizes, and percentage overlaps will be conducted in the coming weeks, and this post will be updated with the results.
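The fixed-size strategy can be sketched in a few lines of Python. This is an illustrative reconstruction, not FloTorch's actual code; the four-characters-per-token approximation comes from the text above.

```python
def chunk_text(text, chunk_tokens=512, overlap=0.10, chars_per_token=4):
    """Split text into fixed-size chunks with fractional overlap.

    Token counts are approximated at ~4 characters per token, so a
    512-token chunk maps to roughly 2,048 characters.
    """
    chunk_chars = chunk_tokens * chars_per_token   # ~2,048 characters per chunk
    step = int(chunk_chars * (1 - overlap))        # advancing by 90% leaves 10% overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):       # last window reached the end
            break
    return chunks
```

Each chunk is then embedded and indexed independently, with the overlap reducing the chance that an answer span is cut at a chunk boundary.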
Embedding strategy
FloTorch used the Amazon Titan Text Embeddings V2 model on Amazon Bedrock with an output vector size of 1,024. With a maximum input token limit of 8,192 for the model, the system efficiently embedded both chunks from the knowledge base source data and the short queries from the CRAG dataset. Amazon Bedrock APIs make it straightforward to use Amazon Titan Text Embeddings V2 for embedding data.
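As a hedged sketch (the exact FloTorch code is not shown in this post), calling Titan Text Embeddings V2 through the Bedrock Runtime InvokeModel API looks roughly like the following; the model ID and request shape follow the public Bedrock documentation, and the client is passed in rather than created here:

```python
import json

TITAN_MODEL_ID = "amazon.titan-embed-text-v2:0"

def build_embedding_request(text, dimensions=1024):
    # Titan Text Embeddings V2 accepts an output `dimensions` parameter;
    # 1024 matches the vector size used in this evaluation.
    return json.dumps({"inputText": text, "dimensions": dimensions})

def embed_text(bedrock_runtime, text):
    """Return the embedding vector for `text`, given a boto3
    bedrock-runtime client (not constructed in this sketch)."""
    response = bedrock_runtime.invoke_model(
        modelId=TITAN_MODEL_ID,
        body=build_embedding_request(text),
        contentType="application/json",
        accept="application/json",
    )
    return json.loads(response["body"].read())["embedding"]
```

The same function serves both indexing (embedding chunks) and querying (embedding the user question), keeping the two vector spaces identical.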
Vector database
FloTorch selected Amazon OpenSearch Service as the vector database for its high-performance metrics. The implementation included a provisioned three-node sharded OpenSearch Service cluster. Each provisioned node was r7g.4xlarge, selected for its availability and sufficient capacity to meet the performance requirements. FloTorch used HNSW indexing in OpenSearch Service.
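A minimal sketch of what an HNSW-backed index definition might look like, assuming the OpenSearch k-NN plugin; the field names are illustrative, not FloTorch's actual schema:

```python
def knn_index_body(dimension=1024):
    """Index body for an OpenSearch k-NN index with HNSW vector search."""
    return {
        "settings": {"index": {"knn": True}},   # enable the k-NN plugin
        "mappings": {
            "properties": {
                "chunk_text": {"type": "text"},  # raw chunk for the LLM context
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,      # matches Titan V2 output size
                    "method": {
                        "name": "hnsw",
                        "space_type": "l2",
                        "engine": "nmslib",
                    },
                },
            }
        },
    }
```

This body would be passed to the index-creation call of an OpenSearch client when dynamically indexing the five webpages per query.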
Retrieval (and reranking) strategy
FloTorch used a retrieval strategy with a k-nearest neighbor (k-NN) value of 5 for retrieved chunks. The experiments excluded reranking algorithms so that the retrieved chunks remained consistent for both models when inferring the answer to the provided query. The following code snippet embeds the given query and passes the embeddings to the search function:
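The original snippet is not reproduced in this extract; a minimal reconstruction of embedding the query and running a top-5 k-NN search might look like the following, where the `embed_text` helper and field names are assumptions:

```python
def knn_search_body(query_vector, k=5):
    """OpenSearch k-NN query retrieving the top-k nearest chunks."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": query_vector, "k": k}}},
    }

def retrieve_chunks(os_client, index_name, query_vector, k=5):
    """Run the k-NN search and return the matched chunk texts."""
    response = os_client.search(index=index_name,
                                body=knn_search_body(query_vector, k))
    return [hit["_source"]["chunk_text"] for hit in response["hits"]["hits"]]
```

Because no reranker is applied, both models receive exactly these five chunks, in the same order, as grounding context.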
Inferencing
FloTorch used the GPT-4o model from OpenAI through the available API key and the Amazon Nova Pro model through conversation APIs. GPT-4o supports a context window of 128,000 tokens, compared to the 300,000-token context window of Amazon Nova Pro. The maximum output token limit of GPT-4o is 16,384, vs. 5,000 for Amazon Nova Pro. The benchmarking experiments were conducted without Amazon Bedrock Guardrails functionality. The implementation used the universal gateway provided by the FloTorch enterprise version to enable consistent API calls with the same function and to track token counts and latency metrics uniformly. The inference function code is as follows:
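The inference function itself is not included in this extract; a hedged sketch of the Amazon Nova Pro path through the Bedrock Converse API could look like this, where the prompt wording and inference parameters are illustrative:

```python
def build_rag_prompt(query, chunks):
    # Ground the model in the retrieved chunks only.
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

def infer_nova_pro(bedrock_runtime, query, chunks,
                   model_id="amazon.nova-pro-v1:0"):
    """Call Amazon Nova Pro via the Bedrock Converse API; return the answer text."""
    response = bedrock_runtime.converse(
        modelId=model_id,
        messages=[{"role": "user",
                   "content": [{"text": build_rag_prompt(query, chunks)}]}],
        inferenceConfig={"maxTokens": 1024, "temperature": 0.0},
    )
    return response["output"]["message"]["content"][0]["text"]
```

A gateway like FloTorch's would wrap both this call and the OpenAI equivalent behind one function so that token counts and latency are recorded identically for the two models.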
Evaluation
Both models were evaluated by running batch queries. A batch size of eight was chosen to comply with Amazon Bedrock quota limits as well as GPT-4o rate limits. The query function code is as follows:
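The query function is omitted from this extract; the batching itself reduces to a simple grouping helper (a sketch), with each batch of eight then dispatched to the inference function:

```python
def batched(items, batch_size=8):
    """Yield successive batches; a size of 8 fits the Bedrock quota
    and GPT-4o rate limits noted above."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]
```

With 200 queries, this produces 25 batches per model.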
Benchmarking on the CRAG dataset
In this section, we discuss the latency, accuracy, and cost measurements of benchmarking on the CRAG dataset.
Latency
Latency for each query response was calculated as the difference between two timestamps: the timestamp when the API call is made to the inference LLM, and the timestamp when the entire response is received from the inference endpoint. A lower latency indicates a faster-performing LLM, making it suitable for applications requiring rapid response times. The study indicates that latency can be further reduced for both models through optimizations and caching techniques; however, the evaluation focused on measuring out-of-the-box latency performance for both models.
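Measured this way, latency reduces to wrapping the inference call between two timestamps; a minimal sketch:

```python
import time

def timed_inference(infer_fn, *args, **kwargs):
    """Return (response, latency_seconds) for a single inference call."""
    start = time.perf_counter()             # timestamp when the API call is made
    response = infer_fn(*args, **kwargs)    # blocks until the full response arrives
    latency = time.perf_counter() - start   # difference of the two timestamps
    return response, latency
```

Averaging these per-query latencies over the 200 queries gives the figures reported in the results tables.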
Accuracy
FloTorch used a modified version of the local_evaluation.py script provided with the CRAG benchmark for accuracy evaluations. The script was enhanced to provide proper categorization of correct, incorrect, and missing responses. The default GPT-4o evaluation LLM in the script was replaced with the mixtral-8x7b-instruct-v0:1 model API. Additional modifications enabled tracking of input and output tokens and latency as described earlier.
Cost
Cost calculations were straightforward because both Amazon Nova Pro and GPT-4o have published prices per million input and output tokens. The calculation involved multiplying input tokens by the corresponding rate and applying the same process to output tokens. The total cost of running 200 queries was determined by combining the input token and output token costs. OpenSearch Service provisioned cluster costs were excluded from this analysis because the comparison focused solely on the inference stage between the Amazon Nova Pro and GPT-4o LLMs.
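The arithmetic can be sketched as follows; the per-million-token rates in the usage example are placeholders, not the actual published prices:

```python
def inference_cost(input_tokens, output_tokens,
                   input_rate_per_m, output_rate_per_m):
    """Total cost = tokens consumed x published per-million-token rates."""
    return (input_tokens / 1_000_000) * input_rate_per_m \
         + (output_tokens / 1_000_000) * output_rate_per_m
```

Summing this quantity over all 200 queries, separately per model, yields the cost columns in the tables below.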
Results
The following table summarizes the results.
. | Amazon Nova Pro | GPT-4o | Observation
Accuracy on subset of the CRAG dataset | 51.50% (103 correct responses out of 200) | 53.00% (106 correct responses out of 200) | GPT-4o outperforms Amazon Nova Pro by 1.5% on accuracy
Cost for running inference for 200 queries | $0.00030205 | $0.000869537 | Amazon Nova Pro saves 65.26% in costs compared to GPT-4o
Average latency (seconds) | 1.682539835 | 2.15615045 | Amazon Nova Pro is 21.97% faster than GPT-4o
Average of input and output tokens | 1946.621359 | 1782.707547 | Typical GPT-4o responses are shorter than Amazon Nova responses
For simple queries, Amazon Nova Pro and GPT-4o have similar accuracies (55 and 56 correct responses, respectively), but for simple queries with conditions, GPT-4o performs slightly better than Amazon Nova Pro (50 vs. 48 correct answers). Imagine you're part of an organization running an AI assistant service that handles 1,000 questions per month from 10,000 users (10,000,000 queries per month). Amazon Nova Pro will save your organization $5,674.88 per month ($68,098 per year) compared to GPT-4o.
Let's look at similar results for the Amazon Nova Micro, Amazon Nova Lite, and GPT-4o mini models on the same dataset.
. | Amazon Nova Lite | Amazon Nova Micro | GPT-4o mini | Observation
Accuracy on subset of the CRAG dataset | 52.00% (104 correct responses out of 200) | 54.00% (108 correct responses out of 200) | 50.00% (100 correct responses out of 200) | Amazon Nova Lite and Amazon Nova Micro outperform GPT-4o mini by 2 and 4 points, respectively
Cost for running inference for 200 queries | $0.00002247 (56.59% cheaper than GPT-4o mini) | $0.000013924 (73.10% cheaper than GPT-4o mini) | $0.000051768 | Amazon Nova Lite and Amazon Nova Micro are cheaper than GPT-4o mini by 56.59% and 73.10%, respectively
Average latency (seconds) | 1.553371465 (26.60% faster than GPT-4o mini) | 1.6828564 (20.48% faster than GPT-4o mini) | 2.116291895 | Amazon Nova models are at least 20% faster than GPT-4o mini
Average of input and output tokens | 1930.980769 | 1940.166667 | 1789.54 | GPT-4o mini returns shorter answers
Amazon Nova Micro is significantly faster and cheaper than GPT-4o mini while providing more accurate answers. If you are running a service that handles about 10 million queries each month, it will save you on average 73% of what you would be paying for slightly less accurate results from the GPT-4o mini model.
Conclusion
Based on these tests for RAG use cases, Amazon Nova models produce comparable or higher accuracy at significantly lower cost and latency than the GPT-4o and GPT-4o mini models. FloTorch is continuing further experimentation with other relevant LLMs for comparison. Future research will include additional experiments with the remaining query types, such as comparison, aggregation, set, false_premise, post-processing, and multi-hop queries.
Get started with Amazon Nova on the Amazon Bedrock console. Learn more at the Amazon Nova product page.
About FloTorch
FloTorch.ai helps enterprise customers design and manage agentic workflows in a secure and scalable manner. FloTorch's mission is to help enterprises make data-driven decisions in the end-to-end generative AI pipeline, including but not limited to model selection, vector database selection, and evaluation strategies. FloTorch offers an open source version for customers with scalable experimentation with different chunking, embedding, retrieval, and inference strategies. The open source version works on a customer's AWS account, so you can experiment on your own AWS account with your proprietary data. Interested users are invited to try out FloTorch from AWS Marketplace or from GitHub. FloTorch also offers an enterprise version of this product for scalable experimentation with LLM models and vector databases on cloud platforms. The enterprise version also includes a universal gateway with a model registry to custom define new LLMs, and a recommendation engine to suggest new LLMs and agent workflows. For more information, contact us at info@flotorch.ai.
About the authors
Prasanna Sridharan is a Principal Gen AI/ML Architect at AWS, specializing in designing and implementing AI/ML and Generative AI solutions for enterprise customers. With a passion for helping AWS customers build innovative Gen AI applications, he focuses on creating scalable, cutting-edge AI solutions that drive business transformation. You can connect with Prasanna on LinkedIn.
Dr. Hemant Joshi has over 20 years of industry experience building products and services with AI/ML technologies. As CTO of FloTorch, Hemant is engaged with customers to implement state-of-the-art GenAI solutions and agentic workflows for enterprises.