You can use Amazon Bedrock Custom Model Import to seamlessly integrate your customized models, such as Llama, Mistral, and Qwen, that you have fine-tuned elsewhere into Amazon Bedrock. The experience is fully serverless, minimizing infrastructure management while providing your imported models with the same unified API access as native Amazon Bedrock models. Your custom models benefit from automatic scaling, enterprise-grade security, and native integration with Amazon Bedrock features such as Amazon Bedrock Guardrails and Amazon Bedrock Knowledge Bases.
Understanding how confident a model is in its predictions is essential for building reliable AI applications, particularly when working with specialized custom models that might encounter domain-specific queries.
With log probability support now added to Custom Model Import, you can access information about your models' confidence in their predictions at the token level. This enhancement provides greater visibility into model behavior and enables new capabilities for model evaluation, confidence scoring, and advanced filtering techniques.
In this post, we explore how log probabilities work with imported models in Amazon Bedrock. You will learn what log probabilities are, how to enable them in your API calls, and how to interpret the returned data. We also highlight practical applications, from detecting potential hallucinations to optimizing RAG systems and evaluating fine-tuned models, that demonstrate how these insights can improve your AI applications and help you build more trustworthy solutions with your custom models.
Understanding log probabilities
In language models, a log probability represents the logarithm of the probability that the model assigns to a token in a sequence. These values indicate how confident the model is about each token it generates or processes. Log probabilities are expressed as negative numbers, with values closer to zero indicating higher confidence. For example, a log probability of -0.1 corresponds to roughly 90% confidence, while a value of -3.0 corresponds to about 5% confidence. By examining these values, you can identify when a model is highly certain versus when it is making less confident predictions. Log probabilities provide a quantitative measure of how likely the model considered each generated token, offering valuable insight into the confidence of its output. By analyzing them, you can:
- Gauge confidence across a response: Assess how confident the model was in different sections of its output, helping you identify where it was certain versus uncertain.
- Score and compare outputs: Compare overall sequence likelihood (by summing or averaging log probabilities) to rank or filter multiple model outputs.
- Detect potential hallucinations: Identify sudden drops in token-level confidence, which can flag segments that might require verification or review.
- Reduce RAG costs with early pruning: Run fast, low-cost draft generations based on retrieved contexts, compute log probabilities for these drafts, and discard low-scoring candidates early, avoiding unnecessary full-length generations or expensive reranking while keeping only the most promising contexts in the pipeline.
- Build confidence-aware applications: Adapt system behavior based on certainty levels, for example by triggering clarifying prompts, providing fallback responses, or flagging content for human review.
Overall, log probabilities are a powerful tool for interpreting and debugging model responses with measurable certainty, and they are particularly valuable for applications where understanding why a model responded in a certain way can be as important as the response itself.
Prerequisites
To use log probability support with Custom Model Import in Amazon Bedrock, you need:
- An active AWS account with access to Amazon Bedrock
- A custom model created in Amazon Bedrock using the Custom Model Import feature after July 31, 2025, when log probability support was launched
- Appropriate AWS Identity and Access Management (IAM) permissions to invoke models through the Amazon Bedrock Runtime
Introducing log probability support in Amazon Bedrock
With this launch, Amazon Bedrock now enables models imported using the Custom Model Import feature to return token-level log probabilities as part of the inference response.
When invoking a model through the Amazon Bedrock InvokeModel API, you can access token log probabilities by setting "return_logprobs": true in the JSON request body. With this flag enabled, the model's response includes additional fields providing log probabilities for both the prompt tokens and the generated tokens, so that customers can analyze the model's confidence in its predictions. These log probabilities let you quantitatively assess how confident your custom models are when processing inputs and generating responses. The granular metrics allow for better evaluation of response quality, troubleshooting of unexpected outputs, and optimization of prompts or model configurations.
Let's walk through an example of invoking a custom model on Amazon Bedrock with log probabilities enabled and examine the output format. Suppose you have already imported a custom model (for instance, a fine-tuned Llama 3.2 1B model) into Amazon Bedrock and have its model Amazon Resource Name (ARN). You can invoke this model using the Amazon Bedrock Runtime SDK (Boto3 for Python in this example) as shown in the following example:
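The following is a minimal sketch of such a call. The Region, account ID, and model ARN are placeholders, and the request body assumes a Llama-style schema as described below; adjust the field names to match your imported model.

```python
import json
import boto3

# Amazon Bedrock Runtime client (the Region shown is an example).
bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder ARN of your imported custom model.
model_arn = "arn:aws:bedrock:us-east-1:111122223333:imported-model/your-model-id"

# Llama-style request body with token log probabilities enabled.
request_body = {
    "prompt": "The quick brown fox jumps",
    "max_gen_len": 50,        # maximum generation length
    "temperature": 0.5,       # moderate randomness
    "stop": [".", "\n"],      # stop at a period or a newline
    "return_logprobs": True,  # ask Amazon Bedrock to return log probabilities
}

response = bedrock_runtime.invoke_model(
    modelId=model_arn,
    body=json.dumps(request_body),
    contentType="application/json",
    accept="application/json",
)

result = json.loads(response["body"].read())
print(json.dumps(result, indent=2))
```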
In the preceding code, we send a prompt, "The quick brown fox jumps", to our custom imported model. We configure standard inference parameters: a maximum generation length of 50 tokens, a temperature of 0.5 for moderate randomness, and a stop condition (either a period or a newline). The "return_logprobs": True parameter tells Amazon Bedrock to return log probabilities in the response.
The InvokeModel API returns a JSON response containing three main components: the standard generated text output, metadata about the generation process, and now log probabilities for both prompt and generated tokens. These values reveal the model's internal confidence for each token prediction, so you can understand not just what text was produced, but how certain the model was at each step of the process. The following is an example response from the "quick brown fox jumps" prompt, showing log probabilities (appearing as negative numbers):
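The response below is an illustrative reconstruction rather than captured output: the field values mirror the breakdown that follows, the token IDs are hypothetical placeholders, and the exact shape can vary by model family.

```json
{
  "generation": " over the lazy dog.",
  "prompt_token_count": 6,
  "generation_token_count": 5,
  "stop_reason": "stop",
  "prompt_logprobs": [
    null,
    {"791": -3.61, "14924": -1.18},
    {"4062": -9.21},
    {"14198": -2.30},
    {"39935": -0.05},
    {"35308": -0.21}
  ],
  "logprobs": [
    {"927": -0.04},
    {"279": -0.02},
    {"16053": -0.07},
    {"5679": -0.03},
    {"13": -1.19}
  ]
}
```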
The raw API response provides token IDs paired with their log probabilities. To make this data interpretable, we need to first decode the token IDs using the appropriate tokenizer (in this case, the Llama 3.2 1B tokenizer), which maps each ID back to its actual text token. Then we convert log probabilities to probabilities by applying the exponential function, translating these values into more intuitive probabilities between 0 and 1. We have implemented these transformations using custom code (not shown here) to produce a human-readable format where each token appears alongside its probability, making the model's confidence in its predictions immediately clear.
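The original post-processing code is not shown, but a minimal sketch of the idea might look like the following. It assumes the Hugging Face transformers tokenizer for Llama 3.2 1B (substitute any tokenizer that matches your imported model) and the result dictionary from the earlier invocation.

```python
import math
from transformers import AutoTokenizer

# Assumption: a locally available tokenizer that matches the imported model.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")

def readable_logprobs(logprob_entries):
    """Convert [{token_id: logprob, ...}, ...] into lists of (token_text, probability) pairs."""
    readable = []
    for entry in logprob_entries:
        if entry is None:  # the first prompt position has no preceding context
            readable.append(None)
            continue
        readable.append([
            (tokenizer.decode([int(token_id)]), round(math.exp(logprob), 4))
            for token_id, logprob in entry.items()
        ])
    return readable

print(readable_logprobs(result.get("prompt_logprobs", [])))
print(readable_logprobs(result.get("logprobs", [])))
```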
Let's break down what this tells us about the model's internal processing:
- generation: The actual text generated by the model (in our example, a continuation of the prompt that we sent). This is the same field you would normally get from any model invocation.
- prompt_token_count and generation_token_count: The number of tokens in the input prompt and in the output, respectively. In our example, the prompt was tokenized into six tokens, and the model generated five tokens in its completion.
- stop_reason: The reason the generation stopped ("stop" means the model naturally stopped at a stop sequence or end-of-text, "length" means it hit the maximum token limit, and so on). In our case it shows "stop", indicating the model stopped on its own or because of the stop condition we provided.
- prompt_logprobs: This array provides log probabilities for each token in the prompt. As the model processes your input, it continually predicts what should come next based on what it has seen so far. These values measure which tokens in your prompt were expected or surprising to the model.
  - The first entry is None because the very first token has no preceding context; the model cannot predict anything without prior information. Each subsequent entry contains token IDs mapped to their log probabilities. We have converted these IDs to readable text and transformed the log probabilities into percentages for easier understanding.
  - You can observe the model's increasing confidence as it processes familiar sequences. For example, after seeing The quick brown, the model predicted fox with 95.1% confidence. After seeing the full context up to fox, it predicted jumps with 81.1% confidence.
  - Many positions show multiple tokens with their probabilities, revealing alternatives the model considered. For instance, at the second position, the model evaluated both The (2.7%) and Question (30.6%), which suggests the model considered both tokens viable at that position. This added visibility helps you understand where the model weighed alternatives and can reveal when it was more uncertain or had difficulty choosing among multiple options.
  - Notably low probabilities appear for some tokens; quick received just 0.01%, indicating the model found these words unexpected in their context.
  - The overall pattern tells a clear story: individual words initially received low probabilities, but as the complete quick brown fox jumps phrase emerged, the model's confidence increased dramatically, showing it recognized this as a familiar expression.
  - When multiple tokens in your prompt consistently receive low probabilities, your phrasing might be unusual for the model. This uncertainty can affect the quality of completions. Using these insights, you can reformulate prompts to better align with patterns the model encountered in its training data.
- logprobs: This array contains log probabilities for each token in the model's generated output. The format is similar: a dictionary mapping token IDs to their corresponding log probabilities.
  - After decoding these values, we can see that the tokens over, the, lazy, and dog all have high probabilities. This demonstrates that the model recognized it was completing the well-known phrase the quick brown fox jumps over the lazy dog, a common pangram the model appears to know well.
  - In contrast, the final period token has a much lower probability (30.3%), revealing the model's uncertainty about how to conclude the sentence. This makes sense because the model had several valid options: ending the sentence with a period, continuing with more content, or choosing another punctuation mark altogether.
Practical use cases of log probabilities
Token-level log probabilities from the Custom Model Import feature provide valuable insight into your model's decision-making process. These metrics transform how you interact with your custom models by revealing their confidence levels for each generated token. The following are some impactful ways to use these insights.
Ranking multiple completions
You can use log probabilities to quantitatively rank multiple generated outputs for the same prompt. When your application needs to choose between different possible completions, whether for summarization, translation, or creative writing, you can calculate each completion's overall likelihood by averaging or summing the log probabilities across all its tokens.
Example:
Prompt: Translate the phrase "Battre le fer pendant qu'il est chaud"
- Completion A: "Strike while the iron is hot" (Average log probability: -0.39)
- Completion B: "Beat the iron while it's hot." (Average log probability: -0.46)
In this example, Completion A receives a higher log probability score (closer to zero), indicating that the model found this idiomatic translation more natural than the more literal Completion B. This numerical approach lets your application automatically select the most probable output or present multiple candidates ranked by the model's confidence level.
This ranking capability extends beyond translation to many scenarios where multiple valid outputs exist, including content generation, code completion, and creative writing, providing an objective quality metric based on the model's confidence rather than relying solely on subjective human judgment.
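A minimal sketch of this idea, reusing the Boto3 invocation pattern from earlier: it samples several completions and keeps the one whose generated tokens have the highest average log probability. The model ARN is a placeholder, and the scoring assumes each logprobs entry holds the sampled token as a single key-value pair, as in the example response above.

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ARN = "arn:aws:bedrock:us-east-1:111122223333:imported-model/your-model-id"  # placeholder

def generate_scored_completion(prompt, temperature=0.8):
    """Generate one completion and score it by the average log probability of its tokens."""
    body = {
        "prompt": prompt,
        "max_gen_len": 64,
        "temperature": temperature,
        "return_logprobs": True,
    }
    response = bedrock_runtime.invoke_model(modelId=MODEL_ARN, body=json.dumps(body))
    result = json.loads(response["body"].read())

    # Assumes one {token_id: logprob} entry per generated token.
    token_logprobs = [next(iter(entry.values())) for entry in result["logprobs"] if entry]
    average = sum(token_logprobs) / len(token_logprobs)
    return result["generation"], average

prompt = 'Translate the phrase "Battre le fer pendant qu\'il est chaud" into English.'
candidates = [generate_scored_completion(prompt) for _ in range(3)]
best_text, best_score = max(candidates, key=lambda c: c[1])
print(f"Best completion (average log probability {best_score:.2f}): {best_text}")
```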
Detecting hallucinations and low-confidence answers
Models can produce hallucinations, plausible-sounding but factually incorrect statements, when handling ambiguous prompts, complex queries, or topics outside their expertise. Log probabilities provide a practical way to detect these situations by revealing the model's internal uncertainty, helping you identify potentially inaccurate information even when the output appears confident.
By analyzing token-level log probabilities, you can identify which parts of a response the model was likely uncertain about, even when the text appears confident on the surface. This capability is especially valuable in retrieval-augmented generation (RAG) systems, where responses should be grounded in retrieved context. When a model has relevant information available, it typically generates answers with higher confidence. Conversely, low confidence across multiple tokens suggests the model might be generating content without sufficient supporting information.
Example:
- Prompt:
- Model output:
In this example, we intentionally asked about a fictional metric, the Portfolio Synergy Quotient (PSQ), to demonstrate how log probabilities reveal uncertainty in model responses. Despite generating a professional-sounding definition for this nonexistent financial concept, the token-level confidence scores tell a revealing story. The confidence scores shown below are derived by applying the exponential function to the log probabilities returned by the model.
- PSQ shows medium confidence (63.8%), indicating that the model recognized the acronym format but wasn't highly certain about this specific term.
- Common finance terms like classes (98.2%) and portfolio (92.8%) exhibit high confidence, likely because these are standard concepts widely used in financial contexts.
- Important connecting concepts show notably low confidence: measure (14.0%) and diversification (31.8%) reveal the model's uncertainty when attempting to explain what PSQ means or does.
- Functional words like is (45.9%) and of (56.6%) hover in the medium confidence range, suggesting uncertainty about the overall structure of the explanation.
By identifying these low-confidence segments, you can implement targeted safeguards in your applications, such as flagging content for verification, retrieving additional context, generating clarifying questions, or applying confidence thresholds for sensitive information. This approach helps create more reliable AI systems that can distinguish between high-confidence information and uncertain responses.
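As one way to implement such a safeguard, the following sketch flags generated tokens whose probability falls below a threshold. The 0.3 cutoff, the single-key logprob entries, and the result and tokenizer objects from the earlier sketches are assumptions to adapt to your own application.

```python
import math

def flag_low_confidence_tokens(result, tokenizer, threshold=0.3):
    """Return (token_text, probability) pairs for generated tokens below the confidence threshold."""
    flagged = []
    for entry in result.get("logprobs", []):
        if not entry:
            continue
        token_id, logprob = next(iter(entry.items()))  # assumes one sampled token per entry
        probability = math.exp(logprob)
        if probability < threshold:
            flagged.append((tokenizer.decode([int(token_id)]), probability))
    return flagged

# Example policy: route the answer to human review if too much of it is uncertain.
flagged = flag_low_confidence_tokens(result, tokenizer)
if len(flagged) > 5:
    print("Low-confidence answer, flagging for review:", flagged)
```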
Monitoring prompt quality
When engineering prompts for your application, log probabilities reveal how well the model understands your instructions. If the first few generated tokens show unusually low probabilities, it often signals that the model struggled to interpret what you are asking.
By monitoring the average log probability of the initial tokens, typically the first 5-10 generated tokens, you can quantitatively measure prompt clarity. Well-structured prompts with clear context usually produce higher probabilities because the model immediately knows what to do. Vague or underspecified prompts often yield lower initial token likelihoods as the model hesitates or searches for direction.
Example:
Prompt comparison for customer service responses:
- Basic prompt:
  - Average log probability of first five tokens: -1.215 (lower confidence)
- Optimized prompt:
  - Average log probability of first five tokens: -0.333 (higher confidence)
The optimized prompt generates higher log probabilities, demonstrating that precise instructions and clear context reduce the model's uncertainty. Rather than making absolute judgments about prompt quality, this approach lets you measure relative improvement between versions. You can directly observe how specific elements (role definitions, contextual details, and explicit expectations) increase model confidence. By systematically measuring these confidence scores across different prompt iterations, you build a quantitative framework for prompt engineering that reveals exactly when and how your instructions become unclear to the model, enabling continuous, data-driven refinement.
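To turn this into a measurement, you can average the log probabilities of the first few generated tokens for each prompt variant. The helper below is a small sketch under the same single-key assumption as the earlier examples; invoke() stands in for a hypothetical wrapper around the InvokeModel call shown earlier.

```python
def initial_confidence(result, first_n=5):
    """Average log probability of the first N generated tokens (a rough prompt-clarity signal)."""
    logprobs = [
        next(iter(entry.values()))
        for entry in result.get("logprobs", [])[:first_n]
        if entry
    ]
    return sum(logprobs) / len(logprobs) if logprobs else float("nan")

# Compare prompt variants: a value closer to zero suggests a clearer prompt.
# basic_score = initial_confidence(invoke(basic_prompt))          # e.g. around -1.2
# optimized_score = initial_confidence(invoke(optimized_prompt))  # e.g. around -0.3
```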
Reducing RAG costs with early pruning
In traditional RAG implementations, systems retrieve 5-20 documents and generate full responses using these retrieved contexts. This approach drives up inference costs because every retrieved context consumes tokens regardless of its actual usefulness.
Log probabilities enable a more cost-effective alternative through early pruning. Instead of immediately processing all of the retrieved documents in full:
- Generate draft responses based on each retrieved context
- Calculate the average log probability across these short drafts
- Rank contexts by their average log probability scores
- Discard low-scoring contexts that fall below a confidence threshold
- Generate the complete response using only the highest-confidence contexts
This approach works because contexts that contain relevant information produce higher log probabilities in the draft generation phase. When the model encounters helpful context, it generates text with greater confidence, reflected in log probabilities closer to zero. Conversely, irrelevant or tangential contexts produce more uncertain outputs with lower log probabilities.
By filtering contexts before full generation, you can reduce token consumption while maintaining or even improving answer quality. This shifts the process from a brute-force approach to a targeted pipeline that directs full generation only toward contexts where the model demonstrates genuine confidence in the source material.
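A condensed sketch of this pruning loop, building on the earlier invocation pattern (same client, placeholder model ARN, and imports): the draft length, prompt template, and number of contexts kept are illustrative choices rather than recommendations.

```python
def prune_contexts(question, contexts, draft_tokens=40, keep_top=3):
    """Score each retrieved context with a short draft generation and keep the most promising ones."""
    scored = []
    for context in contexts:
        body = {
            "prompt": f"Context:\n{context}\n\nQuestion: {question}\nAnswer:",
            "max_gen_len": draft_tokens,   # short, low-cost draft
            "temperature": 0.0,
            "return_logprobs": True,
        }
        response = bedrock_runtime.invoke_model(modelId=MODEL_ARN, body=json.dumps(body))
        result = json.loads(response["body"].read())
        logprobs = [next(iter(e.values())) for e in result.get("logprobs", []) if e]
        average = sum(logprobs) / len(logprobs) if logprobs else float("-inf")
        scored.append((average, context))

    # Keep only the highest-confidence contexts for the full-length generation.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [context for _, context in scored[:keep_top]]
```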
Fine-tuning evaluation
When you have fine-tuned a model for your specific domain, log probabilities offer a quantitative way to assess the effectiveness of your training. By analyzing confidence patterns in responses, you can determine whether your model has developed proper calibration: high confidence for correct domain-specific answers and appropriate uncertainty elsewhere.
A well-calibrated fine-tuned model should assign higher probabilities to accurate information within its specialized area while maintaining lower confidence when operating outside its training domain. Problems with calibration appear in two main forms. Overconfidence occurs when the model assigns high probabilities to incorrect responses, suggesting it hasn't properly learned the boundaries of its knowledge. Underconfidence manifests as consistently low probabilities despite accurate answers, indicating that training might not have sufficiently reinforced correct patterns.
By systematically testing your model across varied scenarios and analyzing the log probabilities, you can identify areas that need additional training or detect potential biases in your current approach. This creates a data-driven feedback loop for iterative improvement, helping ensure your model performs reliably within its intended scope while maintaining appropriate boundaries around its expertise.
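One simple way to put this into practice is to compare average token confidence on in-domain versus out-of-domain evaluation prompts. The sketch below assumes the same response format as the earlier examples; the evaluation sets and the invoke() wrapper are hypothetical.

```python
import math

def average_confidence(result):
    """Mean probability of the generated tokens, as a rough calibration signal."""
    probs = [math.exp(next(iter(e.values()))) for e in result.get("logprobs", []) if e]
    return sum(probs) / len(probs) if probs else float("nan")

# Hypothetical evaluation sets; invoke() wraps the InvokeModel call shown earlier.
# in_domain = [average_confidence(invoke(p)) for p in domain_eval_prompts]
# out_domain = [average_confidence(invoke(p)) for p in out_of_scope_prompts]
# A well-calibrated model scores noticeably higher in domain and lower out of domain.
```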
Getting started
Here is how to start using log probabilities with models imported through the Amazon Bedrock Custom Model Import feature:
- Enable log probabilities in your API calls: Add "return_logprobs": true to your request payload when invoking your custom imported model. This parameter works with both the InvokeModel and InvokeModelWithResponseStream APIs (a streaming sketch follows this list). Begin with familiar prompts to observe which tokens your model predicts with high confidence compared to which it finds surprising.
- Analyze confidence patterns in your custom models: Examine how your fine-tuned or domain-adapted models respond to different inputs. The log probabilities reveal whether your model is appropriately calibrated for your specific domain, showing high confidence where it should be certain.
- Develop confidence-aware applications: Implement practical use cases such as hallucination detection, response ranking, and content verification to make your applications more robust. For example, you can flag low-confidence sections of responses for human review or select the highest-confidence response from multiple generations.
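For the streaming case, a minimal sketch with InvokeModelWithResponseStream might look like the following. It assumes that each streamed chunk carries the same generation and logprobs fields as the non-streaming response; verify the chunk format for your model, and treat the ARN as a placeholder.

```python
import json
import boto3

bedrock_runtime = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ARN = "arn:aws:bedrock:us-east-1:111122223333:imported-model/your-model-id"  # placeholder

body = {
    "prompt": "Summarize the benefits of token-level log probabilities.",
    "max_gen_len": 100,
    "return_logprobs": True,
}

response = bedrock_runtime.invoke_model_with_response_stream(
    modelId=MODEL_ARN,
    body=json.dumps(body),
)

streamed_logprobs = []
for event in response["body"]:
    chunk = json.loads(event["chunk"]["bytes"])
    print(chunk.get("generation", ""), end="", flush=True)
    # Assumption: logprob entries arrive alongside each streamed chunk.
    streamed_logprobs.extend(chunk.get("logprobs") or [])
```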
Conclusion
Log probability support for Amazon Bedrock Custom Model Import offers enhanced visibility into model decision-making. This feature transforms previously opaque model behavior into quantifiable confidence metrics that developers can analyze and act on.
Throughout this post, we demonstrated how to enable log probabilities in your API calls, interpret the returned data, and use these insights for practical applications. From detecting potential hallucinations and ranking multiple completions to optimizing RAG systems and evaluating fine-tuning quality, log probabilities offer tangible benefits across diverse use cases.
For customers working with customized foundation models like Llama, Mistral, or Qwen, these insights address a fundamental challenge: understanding not just what a model generates, but how confident it is in its output. This distinction becomes critical when deploying AI in domains that require high reliability, such as finance, healthcare, or enterprise applications, where incorrect outputs can have significant consequences.
By revealing confidence patterns across different types of queries, log probabilities help you assess how your model customizations have affected calibration, highlighting where your model excels and where it might need refinement. Whether you are evaluating fine-tuning effectiveness, debugging unexpected responses, or building systems that adapt to varying confidence levels, this capability represents an important step toward greater transparency and control in generative AI development on Amazon Bedrock.
We look forward to seeing how you use log probabilities to build more intelligent and trustworthy applications with your custom imported models. This capability reflects Amazon Bedrock's commitment to providing developers with tools that enable confident innovation while delivering the scalability, security, and ease of a fully managed service.
About the authors
Manoj Selvakumar is a Generative AI Specialist Solutions Architect at AWS, where he helps organizations design, prototype, and scale AI-powered solutions in the cloud. With expertise in deep learning, scalable cloud-native systems, and multi-agent orchestration, he focuses on turning emerging innovations into production-ready architectures that drive measurable business value. He is passionate about making complex AI concepts practical and enabling customers to innovate responsibly at scale, from early experimentation to enterprise deployment. Before joining AWS, Manoj worked in consulting, delivering data science and AI solutions for enterprise clients, building end-to-end machine learning systems supported by strong MLOps practices for training, deployment, and monitoring in production.
Yanyan Zhang is a Senior Generative AI Data Scientist at Amazon Web Services, where she has been working on cutting-edge AI/ML technologies as a Generative AI Specialist, helping customers use generative AI to achieve their desired outcomes. Yanyan graduated from Texas A&M University with a PhD in Electrical Engineering. Outside of work, she loves traveling, working out, and exploring new things.
Lokeshwaran Ravi is a Senior Deep Learning Compiler Engineer at AWS, specializing in ML optimization, model acceleration, and AI security. He focuses on improving efficiency, reducing costs, and building secure ecosystems to democratize AI technologies, making cutting-edge ML accessible and impactful across industries.
Revendra Kumar is a Senior Software Development Engineer at Amazon Web Services. In his current role, he focuses on model hosting and inference MLOps on Amazon Bedrock. Prior to this, he worked on hosting quantum computers on the cloud and developing infrastructure solutions for on-premises cloud environments. Outside of his professional pursuits, Revendra enjoys staying active by playing tennis and hiking.