Introducing Gemma 4 models on Amazon Bedrock

Today, we’re announcing the availability of the Gemma 4 family on Amazon Bedrock. Built by Google DeepMind and released under the Apache 2.0 license, Gemma 4 is a family of open-weight models designed with a focus on intelligence-per-parameter across a broad range of deployment scenarios. The family includes three instruction-tuned variants: Gemma 4 31B, Gemma 4 26B-A4B, and Gemma 4 E2B. These cover dense and mixture-of-experts (MoE) architectures; in an MoE model, only a fraction of the parameters activate for each token. The variants offer built-in reasoning, native function calling, and multimodal input across text and image.

Independent benchmarks reflect Gemma 4’s intelligence-per-parameter focus: Artificial Analysis reports an Intelligence Index of 39 for Gemma 4 31B, well above the median of 15 in the 4B–40B open-weights category.

Organizations adopting open-weight foundation models (FMs) for production face a constant trade-off: they want access to the leading models, but without compromising on data security, regulatory alignment, or operational control. Amazon Bedrock removes that trade-off. It gives you leading open-weight FMs through a fully managed service, with inference running entirely on infrastructure operated by AWS and the security and privacy controls you expect from Amazon Bedrock.

In this post, we walk through how to get started with Gemma 4 models on Amazon Bedrock. We cover the capabilities supported by these models, the service tiers available, how on-demand inference scales to handle your workloads, and the different APIs you can use to access them. With these models, you can build multimodal agents, lightweight applications, document understanding pipelines, and software engineering workflows on Amazon Bedrock. Your prompts and completions are not used to train any models, and your content is not shared with third parties.

Key capabilities of Gemma 4

The Gemma 4 family on Amazon Bedrock spans a 2.3B-effective-parameter compact model up to a 30.7B-parameter dense model, giving you a choice of variants to match different cost and latency profiles. All variants support a built-in reasoning mode, native function calling for agentic workflows, multimodal input that combines text and image, and out-of-the-box support for over 35 languages with pre-training across 140+. Because the models are open-weight, you can independently evaluate the model architecture and training methodology, benchmark on your own workloads, and fine-tune on proprietary data when customization is required. You can access the models through a fully managed AWS service without provisioning infrastructure, hosting model weights, or running inference stacks yourself.

For the latest list of supported models, refer to the Amazon Bedrock model catalog.

The Gemma 4 family on Amazon Bedrock at a glance

The family includes three instruction-tuned variants optimized for different cost and latency profiles. The following table summarizes the key specifications for each model on Amazon Bedrock:

	Gemma 4 31B	Gemma 4 26B-A4B	Gemma 4 E2B
Model ID	`google.gemma-4-31b`	`google.gemma-4-26b-a4b`	`google.gemma-4-e2b`
Architecture	Dense	Mixture-of-Experts	Dense (PLE)
Total / Active parameters	30.7B	25.2B total / 3.8B active	5.1B total / 2.3B effective
Context window	256K tokens	256K tokens	128K tokens
Modalities	Text, image	Text, image	Text, image
Reasoning mode	Yes	Yes	Yes
Function calling	Native	Native	Native
Service tiers	Standard, Priority, Flex	Standard, Priority, Flex	Standard, Priority, Flex

Choosing a variant

Select the variant that best matches your workload’s performance and cost requirements. The following table provides guidance on which model to choose based on your use case:

If your workload is…	Choose	Why
Reasoning-heavy or coding-heavy with a single dense model	Gemma 4 31B	Largest dense variant in the family; strong reasoning and coding performance with a 256K context window.
Cost-sensitive at high throughput, with knowledge breadth requirements	Gemma 4 26B-A4B	MoE design means inference cost and latency closer to a 4B dense model while retaining the knowledge capacity of a much larger one.
Latency-sensitive, on-device-style, or multimodal classification	Gemma 4 E2B	Smallest, fastest variant; suited to lowest-cost or fastest-response multimodal workloads. Set `reasoning_effort=high` for this variant (see Enable reasoning mode).

Across the family, Gemma 4 models share a common interface: system prompts, structured tool calling, image input, and a thinking mode that can be toggled per request. You can develop an application against the API surface once and switch between variants based on the cost and latency profile that fits the workload.
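Because the variants share one interface, switching between them can come down to changing the model ID. The following is a minimal sketch of that idea. The profile names and the pick_model helper are hypothetical labels invented for this example; only the model IDs come from the table in this post.

```python
# Model IDs come from the table in this post. The profile keys are
# hypothetical labels chosen for this sketch.
MODEL_BY_PROFILE = {
    "reasoning_or_coding": "google.gemma-4-31b",
    "high_throughput": "google.gemma-4-26b-a4b",
    "low_latency": "google.gemma-4-e2b",
}


def pick_model(profile: str) -> str:
    """Return the Gemma 4 model ID for a workload profile."""
    try:
        return MODEL_BY_PROFILE[profile]
    except KeyError:
        raise ValueError(f"Unknown workload profile: {profile!r}") from None


print(pick_model("high_throughput"))
```

Keeping the mapping in one place means the rest of the application only passes the returned ID as the model argument.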

Architecture highlights

All Gemma 4 variants use a hybrid attention design that interleaves local and global attention, supporting long contexts up to 256K tokens on 31B and 26B-A4B while keeping a small memory footprint. The 26B-A4B variant is a mixture-of-experts model: 25.2B total parameters but only 3.8B active per token, giving roughly 4B-class cost and latency with the knowledge capacity of a larger model. The E2B variant uses Per-Layer Embeddings (PLE) to keep its effective parameter count (2.3B of 5.1B total) small, lowering memory and compute cost. For architecture details, refer to the Gemma 4 model card.
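To make the parameter figures concrete, the following short calculation shows what share of each model’s total parameters is active (26B-A4B) or effective (E2B). It uses only the numbers quoted in this post; the active_fraction helper is a name introduced for this sketch.

```python
def active_fraction(active_billions: float, total_billions: float) -> float:
    """Share of total parameters that are active or effective."""
    return active_billions / total_billions


moe_share = active_fraction(3.8, 25.2)  # 26B-A4B: active per token
ple_share = active_fraction(2.3, 5.1)   # E2B: effective parameters

print(f"26B-A4B active share: {moe_share:.1%}")
print(f"E2B effective share: {ple_share:.1%}")
```

Roughly 15% of the 26B-A4B parameters are active per token, which is why its cost profile sits closer to a 4B-class dense model.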

Note: Use the bedrock-mantle endpoint for Gemma 4 models.

Reasoning mode

Gemma 4 includes a built-in reasoning mode. When enabled, the model emits its internal thought process before producing the final answer. On the bedrock-mantle endpoint, you enable reasoning through the Responses API reasoning parameter, and the thought process is returned as a separate reasoning item alongside the final answer (see Enable reasoning mode in the walkthrough).

In multi-turn conversations, send back only the final answers from earlier turns, not their reasoning items. Replaying prior reasoning back to the model can degrade its responses. You can still keep the reasoning in your own logs or audit trail; strip it only from the history you send on the next turn.
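As a rough sketch of that filtering step, the following helper drops reasoning items from a stored history before the next turn. It assumes the history is kept as a list of plain dictionaries that carry a "type" field, which is an assumption of this example rather than a requirement of the endpoint; strip_reasoning is a hypothetical helper name.

```python
from typing import Any, Dict, List


def strip_reasoning(history: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return a copy of the history without items of type "reasoning"."""
    return [item for item in history if item.get("type") != "reasoning"]


# Hypothetical stored history: the full record, including reasoning,
# stays available for logs or an audit trail.
history = [
    {"type": "message", "role": "user", "content": "What is 12 * 12?"},
    {"type": "reasoning", "content": [{"text": "12 * 12 = 144"}]},
    {"type": "message", "role": "assistant", "content": "144"},
]

# Only the filtered copy is sent on the next turn.
next_input = strip_reasoning(history) + [
    {"type": "message", "role": "user", "content": "Now divide that by 6."}
]
print([item["type"] for item in next_input])
```

The original list is left untouched, so the audit copy and the request copy can diverge safely.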

Accessing Gemma 4 models on Amazon Bedrock

You access Gemma 4 models on Amazon Bedrock through the bedrock-mantle endpoint, the OpenAI-compatible API purpose-built for the next-generation inference engine for Amazon Bedrock. Its endpoint URL is https://bedrock-mantle.{region}.api.aws/openai/v1, and it exposes the Chat Completions and Responses APIs.

It’s helpful to keep the endpoint and the engine distinct: the engine is the underlying serving infrastructure (designed with Model Deployment Account isolation and zero operator access), and the endpoint is the HTTPS API surface you call. The bedrock-mantle endpoint is the public API for that next-generation inference engine and exposes its full feature set. For a deeper look at the underlying inference engine, refer to Exploring the zero operator access design of Amazon Bedrock’s next-generation inference engine.

The bedrock-mantle endpoint uses the same interface as the OpenAI Python and TypeScript SDKs, so teams already on those SDKs can switch to Gemma 4 models on Amazon Bedrock by updating only the base URL and model ID. It also supports Amazon Bedrock API keys, projects, and client-side tool calling.
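Because the base URL differs only by AWS Region, you might build it from a Region code instead of hard-coding it. The following is a minimal sketch; mantle_base_url is a hypothetical helper name, and the URL pattern is the one given earlier in this section.

```python
def mantle_base_url(region: str) -> str:
    """Build the bedrock-mantle base URL for an AWS Region code."""
    if not region:
        raise ValueError("region must be a non-empty AWS Region code")
    return f"https://bedrock-mantle.{region}.api.aws/openai/v1"


print(mantle_base_url("us-east-1"))
```

The returned string is what you would pass as base_url when constructing the OpenAI SDK client.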

Get started with Gemma 4 family models on Amazon Bedrock

Complete the following steps to start using Gemma 4 on Amazon Bedrock.

Prerequisites

To use Gemma 4 models, you need an AWS account with permissions to run inference on the bedrock-mantle endpoint. The simplest way to grant these is to attach the AWS managed policy AmazonBedrockMantleInferenceAccess to your AWS Identity and Access Management (IAM) principal. It grants read and inference-creation access on Mantle, which are the permissions the examples in this post need. These include bedrock-mantle:CreateInference (which authorizes the Chat Completions and Responses inference calls) and bedrock-mantle:CallWithBearerToken (which authorizes calling the endpoint with an Amazon Bedrock API key). For details on creating and managing API keys, refer to Amazon Bedrock API keys.

If you also need to manage projects, fine-tuning, custom models, and more, attach AmazonBedrockMantleFullAccess instead, which grants the full bedrock-mantle action set.

Console playground

The Amazon Bedrock console includes a chat/text playground where you can quickly test models without writing any code. Follow these steps to load a Gemma 4 model and prepare it for interaction:

Navigate to the Amazon Bedrock console.
From the left menu, choose Test playgrounds.
Choose the Chat/Text playground.
Choose Select model in the upper-left corner of the playground.
From the category list, choose Google.
Choose a Gemma 4 model.
Choose Apply to load the model.
Verify that the model loaded successfully: the model name appears in the playground header, and the chat interface is ready for input.

To demonstrate Gemma 4 31B’s reasoning and code generation capabilities, try the following prompt in the playground: “Design a Python microservice that exposes a REST API for managing a job queue. Include error handling, input validation, and write unit tests. Explain your design choices.”

Call the bedrock-mantle endpoint from the OpenAI SDK

The following example uses the OpenAI Python SDK as a client library to call the bedrock-mantle endpoint. When using the OpenAI SDK, you need an Amazon Bedrock API key. For production workloads, use short-term API keys. They expire automatically (maximum 12 hours) and inherit the permissions of the IAM role that generated them. If you are already using local AWS credentials and don’t have an API key, the aws-bedrock-token-generator package generates a short-term bearer token from those credentials.

from openai import OpenAI

# Set api_key to your Amazon Bedrock API key (left blank here).
client = OpenAI(
    api_key="",
    base_url="https://bedrock-mantle.us-east-1.api.aws/openai/v1",
)

response = client.chat.completions.create(
    model="google.gemma-4-31b",
    messages=[
        {"role": "user", "content": "Explain the benefits of mixture-of-experts architectures for production inference."}
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)

If you are migrating an existing application that already uses the OpenAI SDK format with a different model, the migration typically requires updating only the base URL and the model ID. To control permissions for generating and using API keys, refer to Control permissions for generating and using Amazon Bedrock API keys.

Note: The code examples in this post read the API key from an environment variable for demonstration only. In production, store and retrieve credentials from a managed secrets service such as AWS Secrets Manager or AWS Systems Manager Parameter Store rather than environment variables.

Multimodal input (image)

Because Gemma 4 supports image input across all variants, the same Chat Completions API works for vision tasks. The bedrock-mantle endpoint accepts images as inline base64-encoded data URLs or as Amazon Simple Storage Service (Amazon S3) URLs (s3://); arbitrary public https:// image URLs are not supported. The following example reads a local image file, encodes it as a base64 data URL, and includes it in the message content alongside the text prompt:

import base64

# Read a local image file and encode it as a base64 data URL.
with open("chart.png", "rb") as image_file:
    image_b64 = base64.b64encode(image_file.read()).decode("utf-8")
data_url = f"data:image/png;base64,{image_b64}"

# Reuses the client created in the previous example.
response = client.chat.completions.create(
    model="google.gemma-4-31b",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": data_url}},
            {"type": "text", "text": "Describe the trend shown in this chart."}
        ]
    }],
)
print(response.choices[0].message.content)

Alternatively, you can reference an image stored in Amazon S3 by passing an s3:// URL in place of the data URL (for example, {"url": "s3://my-bucket/chart.png"}). This avoids inlining large images in the request body. For best results, place image content before the text in the prompt, which matches Google DeepMind’s recommended ordering for Gemma 4 multimodal inputs.
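The two accepted image sources and the image-first ordering can be wrapped in a small helper. The following is a sketch under stated assumptions: image_part and image_then_text are hypothetical helper names, and rejecting every string that does not start with s3:// is a simplification chosen for this example.

```python
import base64
from typing import Any, Dict, List, Union


def image_part(source: Union[str, bytes], mime_type: str = "image/png") -> Dict[str, Any]:
    """Build an image content part from raw bytes or an s3:// URL."""
    if isinstance(source, bytes):
        encoded = base64.b64encode(source).decode("utf-8")
        url = f"data:{mime_type};base64,{encoded}"
    elif source.startswith("s3://"):
        url = source
    else:
        # Public https:// image URLs are not supported by the endpoint.
        raise ValueError("Pass raw image bytes or an s3:// URL")
    return {"type": "image_url", "image_url": {"url": url}}


def image_then_text(source: Union[str, bytes], prompt: str) -> List[Dict[str, Any]]:
    """Place the image before the text, as recommended for Gemma 4."""
    return [image_part(source), {"type": "text", "text": prompt}]


content = image_then_text("s3://my-bucket/chart.png", "Describe the trend shown in this chart.")
print(content[0]["image_url"]["url"])
```

The returned list can be used as the content value of a user message in a Chat Completions request.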

Streaming responses

For chat and agent use cases where you want to surface tokens to the user as they’re generated, set stream=True. The response becomes an iterator of incremental delta events:

stream = client.chat.completions.create(
    model="google.gemma-4-31b",
    messages=[
        {"role": "user", "content": "Write a short poem about distributed systems."}
    ],
    stream=True,
)

for chunk in stream:
    # Some chunks can arrive without choices; skip them.
    if not chunk.choices:
        continue
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

Streaming uses the same bedrock-mantle:CreateInference permission as a non-streaming call, so the IAM policy shown earlier already covers it.

Tool calling

Gemma 4 supports native function calling for agentic workflows. The following example shows a complete tool-calling loop: defining a tool, receiving a tool call from the model, executing the function, and passing the result back.

import json
from openai import OpenAI

# Set api_key to your Amazon Bedrock API key (left blank here).
client = OpenAI(
    api_key="",
    base_url="https://bedrock-mantle.us-east-1.api.aws/openai/v1",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country (e.g., Seattle, US)"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Step 1: Send the user request with tool definitions
messages = [
    {"role": "user", "content": "What's the weather like in Seattle?"}
]

response = client.chat.completions.create(
    model="google.gemma-4-31b",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

assistant_message = response.choices[0].message

# Step 2: Check if the model wants to call a tool
if assistant_message.tool_calls:
    messages.append(assistant_message)

    for tool_call in assistant_message.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)

        # Step 3: Validate the function name and run it (your implementation)
        if function_name == "get_weather":
            location = arguments.get("location", "Unknown")
            unit = arguments.get("unit", "fahrenheit")
            # Placeholder data; replace with a real weather lookup.
            result = {
                "location": location,
                "temperature": 18 if unit == "celsius" else 64,
                "unit": unit,
                "condition": "Partly cloudy",
                "humidity": 72,
            }
        else:
            result = {"error": f"Unknown function: {function_name}"}

        # Step 4: Return the function result to the model
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    # Step 5: Get the final response incorporating tool results
    final_response = client.chat.completions.create(
        model="google.gemma-4-31b",
        messages=messages,
        tools=tools,
    )

    print(final_response.choices[0].message.content)
else:
    print(assistant_message.content)

Responses API

In addition to Chat Completions, the bedrock-mantle endpoint supports the OpenAI Responses API, which uses a single input field instead of a messages list and returns a top-level output_text for the generated response. It is a good fit for single-turn generation and for applications already built against the Responses interface:

response = client.responses.create(
    model="google.gemma-4-31b",
    input="Explain the benefits of mixture-of-experts architectures for production inference.",
    max_output_tokens=512,
)
print(response.output_text)

The Responses API uses the same Amazon Bedrock API key and base URL as the Chat Completions examples; only the method and response shape differ. Use Chat Completions when you need multi-turn message history or client-side tool-calling loops, and the Responses API for streamlined single-turn generation or when you want the model’s reasoning (covered next).

Enable reasoning mode

Gemma 4 can produce an explicit thought process before its final answer, which is useful for complex multi-step tasks at the cost of additional latency and token usage. On the bedrock-mantle endpoint, you enable this through the Responses API by setting the reasoning parameter. The model returns the thought process as a separate reasoning item in the output, alongside the final answer in output_text:

response = client.responses.create(
    model="google.gemma-4-31b",
    input="If a train leaves at 3pm at 60 km/h and another leaves an hour later at 90 km/h from the same station, when does the second catch up?",
    reasoning={"effort": "high"},
)

# Final answer.
print(response.output_text)

# The thought process is returned as a separate output item of type "reasoning".
for item in response.output:
    if item.type == "reasoning":
        # content can be absent on a reasoning item, so fall back to an empty list.
        for block in item.content or []:
            print(block.text)

The effort value (low, medium, or high) controls how much the model reasons before answering. For multi-turn conversations, pass only the final answers back as history and exclude prior reasoning items from the input on the next turn.

For Gemma 4 E2B, we recommend setting reasoning_effort to high. The smallest variant tends to reason extensively by default, and a high reasoning effort keeps that thinking in the dedicated reasoning channel, which improves output quality and prevents reasoning text from leaking into the final answer.

Recommended sampling parameters

On the bedrock-mantle endpoint, sampling is controlled by temperature and top_p. For Gemma 4, we recommend temperature=1.0 and top_p=0.95, which work well for both reasoning and non-reasoning modes.
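One way to apply these defaults consistently is to merge them into the request arguments in a single place. The following is a minimal sketch; GEMMA4_SAMPLING and with_sampling are hypothetical names introduced for this example, and the values are the ones recommended above.

```python
from typing import Any, Dict

# Recommended values from this post.
GEMMA4_SAMPLING = {"temperature": 1.0, "top_p": 0.95}


def with_sampling(request: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the request with sampling defaults filled in.

    Values already present in the request take precedence.
    """
    return {**GEMMA4_SAMPLING, **request}


kwargs = with_sampling({"model": "google.gemma-4-31b"})
print(kwargs)
```

The resulting dictionary can be unpacked into a Chat Completions or Responses call alongside the messages or input argument.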

Clean up

On-demand inference incurs charges only when invoked, so there is no infrastructure to tear down. To avoid unintended charges:

If you generated short-term Amazon Bedrock API keys for testing, the keys expire automatically (maximum 12 hours). To revoke earlier, delete the API key in the console. Deleting an API key immediately revokes access for all applications using that key, so make sure no active applications depend on it before you delete it.
If you opted in to the Priority tier for testing, switch back to Standard for non-latency-sensitive traffic by removing the service_tier parameter from your invocations.
For pricing details by tier and model, refer to Amazon Bedrock pricing.

Service tiers

Amazon Bedrock offers multiple service tiers to match different workload requirements:

Tier	Best for	Characteristics
Priority	Mission-critical, customer-facing workflows that need the fastest response times	Up to 25% better output tokens per second (OTPS) latency compared to Standard. Prioritized ahead of Standard and Flex requests. Premium over standard on-demand pricing. No upfront reservation or commitment.
Standard	Everyday AI tasks such as content generation, text analysis, and routine document processing	Consistent performance at standard on-demand pricing. Default tier when no tier is specified. No commitment.
Flex	Workloads that can tolerate longer processing times, such as model evaluations, content summarization, and agentic workflows	Discounted pricing relative to Standard. Higher latency, especially during peak traffic, because Flex requests are processed after Standard. No commitment.

For the latest tier availability per model, refer to the Amazon Bedrock model catalog.

Scaling on-demand inference

When you invoke Gemma 4 models on Amazon Bedrock, requests use on-demand inference (Standard tier) by default, where you pay per token without reserving capacity. On-demand throughput is shared and allocated per AWS Region, so during periods of high regional demand a request might be briefly queued or throttled. Designing for this is important for applications that need to scale reliably in production.

On the bedrock-mantle endpoint, there is no requests-per-minute (RPM) quota. Inference is governed by per-model, per-Region token-based quotas: separate input-tokens-per-minute and output-tokens-per-minute limits. Gemma 4 and other open models don’t currently have per-account token quotas published in the Service Quotas console. Their throughput is governed by internal service capacity, so use retry logic with exponential backoff to handle transient throttling. Cached input tokens read through prompt caching don’t count toward the input-token quota. For details, see Quotas for the bedrock-mantle endpoint.

Amazon Bedrock surfaces two HTTP error codes that indicate that a request can’t be served:

Error	What it means	What to do
HTTP 429	A token-per-minute quota for the model has been exceeded.	Reduce the submission rate and retry with exponential backoff; request a quota increase through AWS Support if you consistently hit the limit.
HTTP 503	Regional capacity for the model is under pressure.	Retry with exponential backoff for occasional responses; reduce the submission rate for sustained ones.

The distinction between these two responses matters when troubleshooting production traffic. A 429 indicates that a token-per-minute quota has been exceeded and is best addressed by reducing your submission rate or requesting a quota increase through AWS Support. A 503 indicates that regional capacity for the model is under pressure. Following the guidance in the next section to ramp gradually allows applications to continue running on on-demand inference without disruption.
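In application code, that distinction can be captured as a small lookup from status code to the action described in the table. The following is a sketch only; recommended_action is a hypothetical helper name, and the returned strings paraphrase the guidance in this post.

```python
def recommended_action(status_code: int) -> str:
    """Map an HTTP status code to the handling guidance in this post."""
    if status_code == 429:
        # Token-per-minute quota exceeded.
        return "reduce submission rate, retry with backoff, consider a quota increase"
    if status_code == 503:
        # Regional capacity under pressure.
        return "retry with backoff; reduce submission rate if sustained"
    return "not a quota or capacity response"


for code in (429, 503, 200):
    print(code, "->", recommended_action(code))
```

Logging the two cases separately makes it easier to tell a quota problem from a regional capacity problem when reviewing production traffic.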

Handle one-off 503 responses

Some on-demand inference requests might see occasional 503 responses when the model is in high demand. The recommended way to handle them is exponential backoff with jitter and a bounded retry count. The OpenAI SDK supports this through its built-in max_retries setting, which retries failed requests with exponential backoff. The following example configures the client used with the bedrock-mantle endpoint:

from openai import OpenAI

# Retry transient failures with exponential backoff.
# Set api_key to your Amazon Bedrock API key (left blank here).
client = OpenAI(
    api_key="",
    base_url="https://bedrock-mantle.us-east-1.api.aws/openai/v1",
    max_retries=6,
)

Backoff and retry is the recommended way to handle these transient responses on the bedrock-mantle endpoint.
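If you call the endpoint without the SDK’s built-in retries, the same pattern can be written by hand. The following is a generic sketch of exponential backoff with full jitter and a bounded retry count; it is not the OpenAI SDK’s internal implementation. TransientError, backoff_delay, and call_with_retries are hypothetical names, and the base and cap values are illustrative assumptions.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for an HTTP 429 or 503 response."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0,
                  rng: Callable[[], float] = random.random) -> float:
    """Full-jitter delay: a random value below min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn: Callable[[], T], max_retries: int = 6,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on TransientError at most max_retries times."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    attempt = 0
    while True:
        try:
            return fn()
        except TransientError:
            if attempt >= max_retries:
                raise  # Retry budget exhausted; surface the error.
            sleep(backoff_delay(attempt))
            attempt += 1
```

The sleep and rng parameters are injectable so that the retry logic can be exercised in unit tests without real waiting.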

If 503 responses become sustained, retries alone won’t resolve the issue because the effective request rate is exceeding available capacity for the model. In that case, consider routing latency-sensitive traffic to the Priority tier, which receives preferential processing ahead of Standard and Flex requests during periods of high demand.

Handle steep traffic ramps

When using Standard tier on-demand inference, your application’s incoming traffic should align with how the model’s regional capacity scales. Sudden, large jumps in request rate are more likely to trigger 503s than gradual increases that the system can accommodate. Whenever you increase the request rate against Gemma 4 models, scale up in measured increments rather than stepping directly to a new target volume. The recommended ramp procedure is as follows:

Start at your target request rate.
If you receive 503 responses, reduce the rate by 50% and continue reducing until requests are succeeding consistently.
Hold at that steady state for 15 minutes.
Increase the rate by 50% and hold for another 15 minutes.
Repeat until you reach your target volume.

As a worked example: if your target is 2,000 RPM and you encounter 503s, reduce to 1,000 RPM, then to 500 if errors persist. Once 500 RPM is steady for 15 minutes, scale to 750, then 1,125, and so on. Skipping the 15-minute hold turns each step-up into a fresh load test.
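The step-up phase of that procedure is simple arithmetic, which the following sketch makes explicit. ramp_schedule is a hypothetical helper name; the 50% step comes from the procedure above, and capping the last step at the target is an assumption of this example.

```python
from typing import List


def ramp_schedule(steady_rpm: float, target_rpm: float, step: float = 0.5) -> List[float]:
    """List the request rates to hold, from the steady rate up to the target.

    Each rate is the previous one increased by `step` (50% by default),
    capped at the target. Hold each rate for 15 minutes before moving on.
    """
    if steady_rpm <= 0 or step <= 0:
        raise ValueError("steady_rpm and step must be positive")
    rates = [steady_rpm]
    while rates[-1] < target_rpm:
        rates.append(min(target_rpm, rates[-1] * (1 + step)))
    return rates


for rpm in ramp_schedule(500, 2000):
    print(f"hold at {rpm:g} RPM for 15 minutes")
```

For the worked example, this yields 500, 750, 1,125, 1,687.5, and finally 2,000 RPM, so reaching the target takes four step-ups after the initial steady state.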

Choose the Priority tier for latency-sensitive workloads

Beyond reactive use during sustained 503s, the Priority tier can be a useful lever to reduce occurrences of 503 while continuing to use on-demand inference. Priority delivers up to 25% better output tokens per second compared to Standard, and there is no upfront reservation or commitment. Applications opt in by setting the service_tier parameter to priority on each invocation, and tiers can be mixed within the same application. Customer-facing prompts, real-time agents, and other user interactions where response time directly impacts experience are good candidates for Priority. For background and batch-style work, Standard or Flex is usually the right choice and avoids paying the Priority premium on requests that wouldn’t benefit from it.
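Mixing tiers within one application can be as simple as choosing the service_tier argument per request. The following is a sketch under stated assumptions: tier_kwargs is a hypothetical helper, the priority value comes from this post, and the flex value is an assumption that you should confirm against the Amazon Bedrock documentation. Omitting the parameter leaves the request on Standard.

```python
from typing import Dict


def tier_kwargs(latency_sensitive: bool, tolerates_delay: bool = False) -> Dict[str, str]:
    """Return the extra request arguments that select a service tier."""
    if latency_sensitive:
        return {"service_tier": "priority"}
    if tolerates_delay:
        # Assumed value for the Flex tier; not confirmed by this post.
        return {"service_tier": "flex"}
    # No parameter: the request uses the default Standard tier.
    return {}


print(tier_kwargs(latency_sensitive=True))
print(tier_kwargs(latency_sensitive=False, tolerates_delay=True))
print(tier_kwargs(latency_sensitive=False))
```

The returned dictionary can be unpacked into each invocation, so customer-facing calls and background jobs can share the same client.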

Additional best practices for production scale

A handful of practices help keep inference workloads running smoothly at scale:

Spread large workloads across multiple minutes, rather than firing them in tight bursts, to reduce pressure on regional capacity.
When migrating production traffic to a new Gemma 4 model version, use feature flags to ramp the percentage of traffic gradually instead of cutting over all at once.
Route asynchronous work (model evaluations, content summarization, agentic backfills) to the Flex tier for improved price-performance.
For workloads without data residency requirements, distribute across multiple Regions to improve resilience during regional demand spikes.
For workloads expected to grow, plan headroom for two to three times the expected peak as a buffer for traffic surges.

For full guidance, refer to Scaling and throughput best practices in the Amazon Bedrock User Guide.

Reduce latency with implicit prompt caching

Gemma 4 models on Amazon Bedrock support implicit prompt caching, which is automatically enabled. Consecutive requests that share a common prompt prefix might result in a cache hit, allowing the model to reuse the cached internal state instead of recomputing it. Cache hits reduce inference latency on the matching tokens, with no code changes or cache markers required.

Implicit prompt caching is available across all on-demand service tiers (Standard, Priority, and Flex), so applications can benefit from it regardless of how their traffic is routed. Cache hits are not guaranteed on every request, but they are common in workloads with stable prefixes such as multi-turn agents, retrieval-augmented generation, and long-context analysis, where system prompts, tool definitions, or source documents are reused across requests. If you place static content at the front of the prompt and dynamic content at the end, applications can take advantage of caching when it occurs.
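The static-first ordering can be enforced where the prompt is assembled. The following is a minimal sketch; build_messages is a hypothetical helper, and the message layout (system prompt, then reference documents, then the question) is one possible arrangement rather than a required format.

```python
from typing import Dict, List


def build_messages(system_prompt: str, reference_docs: List[str],
                   user_question: str) -> List[Dict[str, str]]:
    """Put static content first and the per-request question last."""
    static_context = "\n\n".join(reference_docs)
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"{static_context}\n\nQuestion: {user_question}"},
    ]


docs = ["Policy document A", "Policy document B"]
first = build_messages("You are a support agent.", docs, "How do I reset my password?")
second = build_messages("You are a support agent.", docs, "How do I close my account?")

# Both requests share the same prefix; only the tail differs.
print(first[0] == second[0])
```

Because only the final question changes between requests, the shared prefix is as long as possible, which gives implicit caching the best chance of a hit.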

Pricing and availability

At launch, Gemma 4 models are available in four AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), and Europe (Frankfurt). For the latest list, refer to the Amazon Bedrock model catalog.

Pricing is per token and varies by model and service tier. For current rates, refer to Amazon Bedrock pricing.

Conclusion

In this post, we explored the Gemma 4 family on Amazon Bedrock: open-weight models from Google DeepMind covering dense and mixture-of-experts architectures, with built-in reasoning, native function calling, and multimodal input across text and image. We walked through the bedrock-mantle endpoint and the security architecture that runs beneath it (Model Deployment Account isolation with zero operator access), and showed how to get started using the OpenAI Python SDK against bedrock-mantle. We covered service tiers (Standard, Priority, and Flex), the shared-throughput model for on-demand inference, and how to handle 503 responses and ramp traffic in production.

To get began:

Open the Amazon Bedrock console and try Gemma 4 in the Chat/Text playground.
Run the bedrock-mantle Python sample in this post against your own data.
Evaluate Gemma 4 31B, 26B-A4B, and E2B on your workloads to choose the variant that fits your cost and latency profile.
For production deployment, review the Scaling and throughput best practices and consider the Priority tier for latency-sensitive traffic.

For more information, refer to the following resources: