Run NVIDIA Nemotron and OpenAI GPT OSS models on Amazon Bedrock in AWS GovCloud (US)

Government agencies running workloads in AWS GovCloud (US) need AI capabilities that keep pace with the commercial sector. At the same time, they can't compromise the security and compliance controls their missions require. As open-weight foundation models (FMs) move from experimentation into mission systems, two requirements shape every model selection. First, the model must deliver the capability the mission demands. Second, the inference environment must satisfy the agency's security, compliance, and data residency obligations. For U.S. government agencies, the defense and intelligence community, and the contractors that serve them, these requirements are non-negotiable. Access to advanced open-weight models is essential for work such as intelligence analysis, mission planning, acquisition and contract document review, security log analysis, and compliance automation. This access must not require moving sensitive data outside the boundary that governs it.

We're excited to introduce US-based frontier open-weight models in AWS GovCloud (US). With this launch, Amazon Bedrock now supports OpenAI's open-weight GPT OSS models (120B and 20B) and NVIDIA Nemotron (Nano 9B v2, Nano 12B v2, Nano 30B, Super 120B) models. With these new models, you can build and scale generative AI applications with diverse, high-performance FMs. This gives you the flexibility to use OpenAI's and NVIDIA's latest models alongside other leading AI models through a single, unified API. You can use this unified API to select the right model for each specific use case without changing your application code.

AWS GovCloud (US) provides an isolated set of AWS Regions designed to host sensitive data and regulated workloads. Regions are physically located in the United States and administered only by U.S. citizens. They help customers meet compliance frameworks including FedRAMP High (Provisional Authority to Operate) and DoD Cloud Computing Security Requirements Guide (SRG) Impact Levels 2, 4, and 5. Additional frameworks include International Traffic in Arms Regulations (ITAR) and Criminal Justice Information Services (CJIS).

Amazon Bedrock is a fully managed service for accessing FMs from independent model providers, with inference running entirely on AWS-operated infrastructure.

With Amazon Bedrock, inference runs inside the AWS GovCloud (US) isolation boundary, on infrastructure operated by U.S. citizens on U.S. soil. For details on how Amazon Bedrock handles your data, refer to Data protection in Amazon Bedrock.

OpenAI's open-weight GPT OSS models and NVIDIA Nemotron open-weight models are now available on Amazon Bedrock in AWS GovCloud (US). This launch delivers two open-weight model families into the AWS GovCloud (US) Regions: OpenAI gpt-oss-120b and gpt-oss-20b, and the NVIDIA Nemotron 3 family, including Nemotron 3 Super 120B alongside the Nemotron 3 Nano models. With these models, you can build agentic applications and mission workflows such as automated security control assessments, multi-document intelligence synthesis, contract and acquisition analysis, and policy compliance checking. All of this runs within the AWS GovCloud (US) compliance boundary.

In this post, we cover the models currently available in AWS GovCloud (US) and their capabilities, the inference options for data residency, the available service tiers, and how to get started.

About the models

This section introduces the two open-weight model families now available in AWS GovCloud (US) and the capabilities that set each apart.

NVIDIA Nemotron

The NVIDIA Nemotron family delivers both small language model (SLM) and large language model (LLM) capabilities, built for compute efficiency and accuracy in specialized agentic AI systems. NVIDIA describes the two models as follows:

NVIDIA Nemotron 3 Super is a 120B open hybrid mixture-of-experts (MoE) model for complex multi-agent workloads, with 120 billion total parameters, that activates only 12 billion parameters per token. This MoE design delivers up to 5 times higher throughput than the previous generation for cost-efficient inference, and its 1-million-token context window gives agents the long-term memory to stay focused across long, multi-step tasks.
NVIDIA Nemotron 3 Nano is a 30-billion-parameter open model that activates approximately 3 billion parameters per token, delivering 4 times higher throughput than the previous generation and reducing reasoning-token generation by up to 60 percent. Its 1-million-token context window supports long-running, multi-step agent workflows.

For the full list of NVIDIA Nemotron models available in AWS GovCloud (US), refer to NVIDIA models on Amazon Bedrock.

OpenAI GPT OSS

OpenAI's GPT OSS models are open-weight, text-to-text models designed for reasoning, agentic, and developer tasks, with adjustable reasoning effort and support for external tool integration. This post focuses on two variants:

gpt-oss-120b is OpenAI's 120-billion parameter open-weight model, designed for production, general-purpose, and high-reasoning use cases.

gpt-oss-20b is the 20-billion parameter model, designed for lower latency and local or specialized use cases.

Both models provide a 128K-token context window and up to 16K output tokens, and both accept text input and generate text output. Because the weights are open, organizations can independently evaluate the model architecture, review the published model card, and run their own benchmarks on representative workloads. For government teams, this transparency supports organizational risk assessments, allows customer security teams to evaluate model behavior before deployment, and aligns with the zero-trust principles many U.S. government agencies are adopting.

For the full list of OpenAI models available in AWS GovCloud (US), refer to OpenAI models on Amazon Bedrock.

Serverless inference inside your compliance boundary

NVIDIA Nemotron and GPT OSS models on Amazon Bedrock are served by the next-generation inference engine in Amazon Bedrock. To understand the architecture, it helps to distinguish between the engine and the endpoint: the engine is the underlying serving infrastructure, designed with Model Deployment Account isolation and zero operator access, while the bedrock-mantle endpoint is the OpenAI-compatible HTTPS API that applications call to send requests to the engine. For agencies, there's no infrastructure to provision, no GPUs to manage, and no model-deployment expertise required.

The next-generation inference engine is built on a zero operator access design. No operator, whether from AWS, the customer, or a model provider, can access customer data, such as inference prompts or completions. Combined with the AWS GovCloud (US) isolation boundary, this gives government teams a strong data-protection foundation. For the technical details, refer to Exploring the zero operator access design of Mantle.

Amazon Bedrock provides two endpoints for invoking these models. The bedrock-mantle endpoint is the OpenAI-compatible API for the next-generation inference engine, so you can call it with the OpenAI Python and TypeScript SDKs. It uses the Chat Completions and Responses APIs. The bedrock-runtime endpoint uses the Converse and InvokeModel APIs through the AWS SDK, with access to native Amazon Bedrock features such as Guardrails. Code samples for both are in the Getting started section.

Regional availability and data residency

Amazon Bedrock offers multiple options for where your inference requests are processed. In-Region inference keeps every request within a single Region. Geographic cross-Region inference routes requests across Regions within a geography for higher throughput, while your data stays within that geographic boundary. For NVIDIA Nemotron and GPT OSS models in AWS GovCloud (US), the options are as follows:

In-Region inference is available in us-gov-west-1 (AWS GovCloud (US-West)).
Geo cross-Region inference is available through a dedicated AWS GovCloud (US) cross-Region inference ID that routes requests across us-gov-west-1 and us-gov-east-1. Traffic stays within the AWS GovCloud (US) boundary, while you gain resilience across both Regions.

All inference for these models stays within the AWS GovCloud (US) boundary. Global cross-Region inference, which routes requests across commercial AWS Regions worldwide, isn't available in AWS GovCloud (US). You can choose between single-Region and Geo cross-Region inference based on your requirements.
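
Because every request for these models is expected to stay inside the AWS GovCloud (US) boundary, it can be useful to check an endpoint URL before an application sends anything to it. The following is a minimal sketch of such a guard. The function name is_govcloud_endpoint and the Region allow-list are illustrative assumptions, and the host pattern is inferred from the bedrock-mantle base URL used in this post's examples, so it does not cover bedrock-runtime hostnames.

```python
from urllib.parse import urlparse

# Assumption: the two AWS GovCloud (US) Regions named in this post.
GOVCLOUD_REGIONS = {"us-gov-west-1", "us-gov-east-1"}


def is_govcloud_endpoint(base_url: str) -> bool:
    """Return True for https URLs shaped like <service>.<gov-region>.api.aws."""
    parsed = urlparse(base_url)
    if parsed.scheme != "https":
        return False
    parts = (parsed.hostname or "").split(".")
    # Expected host shape: ["bedrock-mantle", "<region>", "api", "aws"]
    if len(parts) != 4 or parts[2:] != ["api", "aws"]:
        return False
    return parts[1] in GOVCLOUD_REGIONS
```

A check like this could run once at startup, before the client object is created, so that a misconfigured commercial-Region URL fails fast instead of receiving sensitive prompts.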

Service tiers

Amazon Bedrock offers multiple service tiers to match different workload requirements. For all three models, the Standard, Priority, and Flex tiers are supported.

Service tier	Description	Supported
Standard	Pay-per-token access with no commitment	Yes
Priority	Higher throughput for latency-sensitive traffic	Yes
Flex	Lower-cost access for flexible, non-time-sensitive workloads	Yes
Reserved	Dedicated throughput with a term commitment	Not currently available

By default, requests use on-demand inference on the Standard tier, where you pay per token without reserving capacity upfront. For latency-sensitive, customer-facing workloads, you can route individual requests to the Priority tier. For non-time-sensitive work such as model evaluations or batch summarization, the Flex tier offers a lower-cost option. For scaling guidance and how to handle throttling at production volume, refer to Scaling and throughput best practices and the Getting started section.
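
Since the tier is chosen per request, one way to keep that choice in a single place is a small helper that maps a workload label to request arguments. This is only a sketch: the service_tier parameter name is the one this post mentions in the Clean up section, while the tier strings ("priority", "flex") and the workload labels are assumptions that should be checked against the Amazon Bedrock documentation.

```python
# Assumption: tier strings accepted by the endpoint; confirm before relying on them.
TIER_BY_WORKLOAD = {
    "interactive": "priority",  # latency-sensitive, customer-facing traffic
    "batch": "flex",            # model evaluations, batch summarization
}


def tier_kwargs(workload: str) -> dict:
    """Return extra request kwargs; an empty dict leaves the default Standard tier."""
    tier = TIER_BY_WORKLOAD.get(workload)
    return {"service_tier": tier} if tier else {}
```

The result can be unpacked into a Chat Completions call, for example `create(model=..., messages=..., **tier_kwargs("batch"))`, so that unknown workload labels fall back to Standard pricing rather than failing.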

Getting started in AWS GovCloud (US)

This section walks through invoking the models, starting with the recommended bedrock-mantle endpoint. The examples use the us-gov-west-1 Region, where in-Region inference is available.

Console playground

Navigate to the Amazon Bedrock console in your AWS GovCloud (US) account.
Choose Playground from the left menu under the Test section.
Choose Select model.
Choose the provider (NVIDIA or OpenAI) from the category list, then select the model (for example, NVIDIA Nemotron 3 Super 120B or gpt-oss-120b).
Choose Apply to load the model.
Enter a prompt to test the model.

Using the bedrock-mantle endpoint (recommended)

To use these models, you need an AWS account in AWS GovCloud (US) with permissions to invoke Amazon Bedrock models. For the bedrock-mantle endpoint, you need an Amazon Bedrock API key or standard AWS credentials. The following is a sample policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "BedrockMantleInference",
            "Effect": "Allow",
            "Action": [
                "bedrock-mantle:CreateInference",
                "bedrock-mantle:Get*",
                "bedrock-mantle:List*"
            ],
            "Resource": "arn:aws-us-gov:bedrock-mantle:us-gov-west-1:111122223333:project/*"
        },
        {
            "Sid": "BedrockMantleCallWithBearerToken",
            "Effect": "Allow",
            "Action": "bedrock-mantle:CallWithBearerToken",
            "Resource": "*"
        }
    ]
}

Replace 111122223333 with your AWS account ID and scope the Region to the AWS GovCloud (US) Regions you use. The code examples in this post authenticate with a Bedrock API key, which requires bedrock-mantle:CallWithBearerToken. This action must be scoped to "Resource": "*", as shown in the second statement. To control which identities can generate or use Amazon Bedrock API keys, refer to Control permissions for generating and using Amazon Bedrock API keys. To restrict your organization to approved models only, use a service control policy (SCP).

The following example uses the OpenAI Python SDK to call the bedrock-mantle endpoint. For production workloads, use short-term API keys, which expire automatically (maximum 12 hours) and inherit the permissions of the IAM role that generated them.

import boto3
from openai import OpenAI

# Retrieve the Bedrock API key from AWS Secrets Manager
secrets_client = boto3.client("secretsmanager", region_name="us-gov-west-1")
api_key = secrets_client.get_secret_value(SecretId="bedrock-api-key")["SecretString"]

client = OpenAI(
    # Use the AWS GovCloud (US) Region in the base URL, e.g. us-gov-west-1
    base_url="https://bedrock-mantle.us-gov-west-1.api.aws/v1",
    api_key=api_key,
)

response = client.chat.completions.create(
    model="openai.gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Explain the benefits of open-weight models for regulated workloads."}
    ],
    reasoning_effort="medium",  # low | medium | high
    max_completion_tokens=512,
)

print(response.choices[0].message.content)

Note: These examples retrieve the Bedrock API key from AWS Secrets Manager. For local development, you can instead read the key from an environment variable, but avoid that pattern in production. Use AWS Secrets Manager or another secrets store.
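
The note above can be sketched as a small loader that reads a local environment variable first and otherwise defers to a secrets store. The variable name BEDROCK_API_KEY and the function load_api_key are hypothetical names chosen for illustration, and the secrets lookup is passed in as a callable so the example does not assume any particular store.

```python
import os
from typing import Callable, Mapping, Optional

# Hypothetical variable name, intended for local development only.
ENV_VAR = "BEDROCK_API_KEY"


def load_api_key(
    env: Mapping[str, str] = os.environ,
    fetch_secret: Optional[Callable[[], str]] = None,
) -> str:
    """Use the environment variable if set, else fall back to the secrets store."""
    key = env.get(ENV_VAR, "").strip()
    if key:
        return key
    if fetch_secret is None:
        raise RuntimeError(f"{ENV_VAR} is not set and no secrets store was provided")
    return fetch_secret()
```

In production, the variable would simply be left unset so that the key always comes from the callable, which could wrap the Secrets Manager lookup used in the examples in this post.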

To call NVIDIA Nemotron 3 Super 120B instead, change the model parameter to nvidia.nemotron-super-3-120b and remove the reasoning_effort parameter (reasoning effort control is specific to GPT OSS). No other code changes are required.

Controlling reasoning effort

GPT OSS models are reasoning models that expose an adjustable reasoning effort. Set the reasoning_effort parameter on the Chat Completions call to low, medium, or high to trade response latency against reasoning depth. Use low for high-volume, latency-sensitive traffic, and high for complex, multi-step reasoning or agentic planning. For reasoning models, prefer max_completion_tokens to bound the response length (the older max_tokens field is still accepted).
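
If the same code path serves both model families, the request arguments can be assembled in one helper that adds reasoning_effort only where it applies. The sketch below is one possible shape, not a prescribed pattern: the helper name build_chat_kwargs is hypothetical, and detecting GPT OSS models by the "openai.gpt-oss" ID prefix is an assumption based on the model IDs shown in this post.

```python
def build_chat_kwargs(model: str, prompt: str, effort: str = "medium",
                      max_tokens: int = 512) -> dict:
    """Build Chat Completions kwargs, adding reasoning_effort only for GPT OSS."""
    if effort not in ("low", "medium", "high"):
        raise ValueError(f"unsupported reasoning effort: {effort}")
    kwargs = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_completion_tokens": max_tokens,  # bounds the response length
    }
    # Per this post, reasoning effort control is specific to GPT OSS models.
    if model.startswith("openai.gpt-oss"):
        kwargs["reasoning_effort"] = effort
    return kwargs
```

The returned dictionary can be unpacked into the call with `**build_chat_kwargs(...)`, which keeps the Nemotron path free of the GPT OSS-specific parameter.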

Using the Responses API

In addition to Chat Completions, GPT OSS models support the Responses API, OpenAI's interface for reasoning-style interactions. It takes a single input rather than a messages array. NVIDIA Nemotron 3 Super 120B doesn't support the Responses API. Use Chat Completions, Converse, or Invoke for that model.

import boto3
from openai import OpenAI

# Retrieve the Bedrock API key from AWS Secrets Manager
secrets_client = boto3.client("secretsmanager", region_name="us-gov-west-1")
api_key = secrets_client.get_secret_value(SecretId="bedrock-api-key")["SecretString"]

client = OpenAI(
    base_url="https://bedrock-mantle.us-gov-west-1.api.aws/v1",
    api_key=api_key,
)

response = client.responses.create(
    model="openai.gpt-oss-120b",
    input="Explain the benefits of open-weight models for regulated workloads.",
)

print(response)

Streaming responses

For chat and agent use cases where you want to surface tokens to the user as they're generated, set stream=True. The response becomes an iterator of incremental delta events:

stream = client.chat.completions.create(
    model="openai.gpt-oss-120b",
    messages=[
        {"role": "user", "content": "Write a short summary of mixture-of-experts architectures."}
    ],
    stream=True,
)

# Print each text delta as it arrives; some chunks carry no text
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()

On the bedrock-runtime endpoint, the equivalent capability requires the bedrock:InvokeModelWithResponseStream permission, which the minimal policy shown later already grants.
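
When the full text is also needed after streaming on bedrock-mantle (for logging or for appending to the conversation), the deltas can be accumulated while they are consumed. The following sketch assumes chunks shaped like the Chat Completions stream events used in this section. collect_stream and fake_chunk are hypothetical helpers, and fake_chunk exists only so the logic can be exercised without a network call.

```python
from types import SimpleNamespace


def collect_stream(chunks) -> str:
    """Concatenate the text deltas from an iterable of stream chunks."""
    parts = []
    for chunk in chunks:
        if not chunk.choices:  # defensively skip chunks without choices
            continue
        delta = chunk.choices[0].delta.content
        if delta:  # skip None or empty deltas
            parts.append(delta)
    return "".join(parts)


def fake_chunk(text):
    """Stand-in for an SDK chunk object, for offline testing only."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])
```

Passing the stream object returned by the SDK to collect_stream would return the assembled reply once the stream ends.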

Tool calling

NVIDIA Nemotron and GPT OSS open-weight models are designed for agentic workflows, which makes them well suited to tool-calling scenarios. In a tool-calling workflow, you define functions (tools) that the model can invoke, the model decides when to call them based on the user's request, and your application runs the function and returns the result for the model to incorporate into its final response.

The following example demonstrates this pattern end to end. We define a get_weather tool, send a user message, let the model request the tool call, run the function with mock data, and pass the result back so the model can generate a natural-language answer.

import json
import boto3
from openai import OpenAI

# Retrieve the Bedrock API key from AWS Secrets Manager
secrets_client = boto3.client("secretsmanager", region_name="us-gov-west-1")
api_key = secrets_client.get_secret_value(SecretId="bedrock-api-key")["SecretString"]

client = OpenAI(
    base_url="https://bedrock-mantle.us-gov-west-1.api.aws/v1",
    api_key=api_key,
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a given location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "City and country (e.g., Seattle, US)"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Step 1: Send the user request with tool definitions
messages = [
    {"role": "user", "content": "What's the weather like in Seattle?"}
]

response = client.chat.completions.create(
    model="openai.gpt-oss-120b",
    messages=messages,
    tools=tools,
    tool_choice="auto",
)

assistant_message = response.choices[0].message

# Step 2: Check if the model wants to call a tool
if assistant_message.tool_calls:
    messages.append(assistant_message)

    for tool_call in assistant_message.tool_calls:
        function_name = tool_call.function.name
        arguments = json.loads(tool_call.function.arguments)

        # Step 3: Validate the function name and run it (mock data)
        if function_name == "get_weather":
            location = arguments.get("location", "Unknown")
            unit = arguments.get("unit", "fahrenheit")
            result = {
                "location": location,
                "temperature": 18 if unit == "celsius" else 64,
                "unit": unit,
                "condition": "Partly cloudy",
                "humidity": 72,
            }
        else:
            result = {"error": f"Unknown function: {function_name}"}

        # Step 4: Return the function result to the model
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": json.dumps(result),
        })

    # Step 5: Get the final response incorporating tool results
    final_response = client.chat.completions.create(
        model="openai.gpt-oss-120b",
        messages=messages,
        tools=tools,
    )

    print(final_response.choices[0].message.content)
else:
    print(assistant_message.content)

The example shown here demonstrates client-side tool calling: the model returns a tool call, your application runs the function, and you pass the result back. On bedrock-mantle, GPT OSS models support both client-side and server-side tool calling, while NVIDIA Nemotron 3 Super 120B supports client-side tool calling only. Both model families also support tool calling on the bedrock-runtime endpoint through the Converse API (using toolConfig). Refer to each model's model card for the full feature matrix.
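
As the number of client-side tools grows, the if/else check on the function name can be replaced by a registry that maps tool names to Python callables. The sketch below is one way to do that under stated assumptions: TOOL_REGISTRY and run_tool_call are hypothetical names, the mock get_weather mirrors the example above, and malformed arguments from the model are returned as an error payload instead of raising.

```python
import json


def get_weather(location: str, unit: str = "fahrenheit") -> dict:
    """Mock tool that returns fixed data, as in the example above."""
    return {"location": location, "temperature": 18 if unit == "celsius" else 64, "unit": unit}


# Only tools listed here can be run by the application.
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call) -> dict:
    """Run one tool call and return a 'tool' role message for the next request."""
    name = tool_call.function.name
    handler = TOOL_REGISTRY.get(name)
    if handler is None:
        result = {"error": f"Unknown function: {name}"}
    else:
        try:
            args = json.loads(tool_call.function.arguments or "{}")
            result = handler(**args)
        except (json.JSONDecodeError, TypeError) as exc:
            # Covers invalid JSON as well as missing or unexpected arguments.
            result = {"error": f"Invalid arguments for {name}: {exc}"}
    return {"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps(result)}
```

Each message returned by run_tool_call can be appended to the messages list before the follow-up request, in place of Steps 3 and 4 of the example.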

Using the bedrock-runtime endpoint (boto3)

For the bedrock-runtime endpoint, you need AWS credentials configured (AWS Identity and Access Management (IAM) user or role) with permission to invoke the model. The following is a sample policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "bedrock:InvokeModel",
                "bedrock:InvokeModelWithResponseStream"
            ],
            "Resource": "arn:aws-us-gov:bedrock:us-gov-west-1::foundation-model/openai.gpt-oss-120b-1:0"
        }
    ]
}

For production deployments, scope the Resource to the specific AWS GovCloud (US) Regions and model IDs that you use.

The following example sends a single-turn request using the AWS SDK for Python (boto3) with the Converse API. On the bedrock-runtime endpoint, the GPT OSS model IDs include a version suffix (for example, openai.gpt-oss-120b-1:0). Use the exact model ID from each model's model card. The response contains a reasoning block followed by a text block, so the example selects the text block when printing the answer.

import boto3

client = boto3.client("bedrock-runtime", region_name="us-gov-west-1")

response = client.converse(
    modelId="openai.gpt-oss-120b-1:0",
    messages=[{
        "role": "user",
        "content": [{"text": "What is a mixture-of-experts architecture?"}]
    }],
    inferenceConfig={"maxTokens": 2048, "temperature": 1.0, "topP": 0.95},
)

# Pick the first text block, skipping the reasoning block that precedes it
content_blocks = response["output"]["message"]["content"]
response_text = next(
    (block["text"] for block in content_blocks if "text" in block),
    None
)

if response_text:
    print(response_text)
else:
    print("No text response.")

To call NVIDIA Nemotron 3 Super 120B through bedrock-runtime, use the model ID nvidia.nemotron-super-3-120b (this model ID doesn't carry a version suffix).
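
Tool calling on bedrock-runtime goes through the Converse API's toolConfig parameter, which uses a different shape from the OpenAI-style tools list shown for bedrock-mantle. If tool definitions are maintained in the OpenAI format, a small converter can produce the Converse form. The helper name to_tool_config is hypothetical, and this sketch translates function tools only.

```python
def to_tool_config(openai_tools: list) -> dict:
    """Convert OpenAI-style function tools into a Converse API toolConfig dict."""
    specs = []
    for tool in openai_tools:
        if tool.get("type") != "function":
            continue  # only function tools are translated in this sketch
        fn = tool["function"]
        spec = {
            "name": fn["name"],
            # Converse expects the JSON Schema under inputSchema.json
            "inputSchema": {"json": fn.get("parameters", {"type": "object", "properties": {}})},
        }
        if fn.get("description"):
            spec["description"] = fn["description"]
        specs.append({"toolSpec": spec})
    return {"tools": specs}
```

The result could then be passed as `toolConfig=to_tool_config(tools)` on a converse call. Model support for specific tool features should still be confirmed on each model card.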

You can also access these models from your terminal using the AWS Command Line Interface (AWS CLI):

aws bedrock-runtime converse \
    --model-id openai.gpt-oss-120b-1:0 \
    --messages '[{"role":"user","content":[{"text":"Type_Your_Prompt_Here"}]}]' \
    --inference-config '{"maxTokens":512}' \
    --region us-gov-west-1

Scaling on-demand inference

On-demand capacity on the Standard tier is shared and allocated per AWS Region, so during periods of high regional demand a request can be briefly queued or throttled. On the bedrock-mantle endpoint, there is no requests-per-minute quota. Throughput is governed by token-based limits. These open-weight models don't currently have per-account token quotas published in the Service Quotas console, so use retry logic with exponential backoff to handle transient throttling. Amazon Bedrock surfaces two HTTP error codes that indicate when a request can't be served:

Error code	Meaning	Recommended action
429	The request was denied because it exceeded the account quotas for Amazon Bedrock.	Request a quota increase through the Service Quotas console, and apply client-side throttling.
503	The service is experiencing high demand or temporary capacity constraints.	Retry with exponential backoff and jitter. If throttling is sustained, reduce the request rate and ramp back up gradually.

For transient 503 responses, configure automatic retries in your SDK:

import boto3
from botocore.config import Config

# Up to 6 total attempts (the initial call plus retries) in standard retry mode
config = Config(retries={"total_max_attempts": 6, "mode": "standard"})
client = boto3.client("bedrock-runtime", config=config)

When ramping back up after sustained throttling, hold at a steady state for about 15 minutes between increases rather than stepping straight to the target volume. For a more detailed ramp-up procedure and additional best practices, see Scaling and throughput best practices in the Amazon Bedrock User Guide.
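
The boto3 retry configuration applies to the bedrock-runtime client. For calls made another way, such as through the OpenAI SDK on bedrock-mantle, exponential backoff with jitter can be written as a thin wrapper. The following is a generic sketch of that technique, not an official AWS pattern: backoff_delay and call_with_retries are hypothetical names, and the base delay, cap, and attempt count are illustrative values.

```python
import random
import time


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0,
                  rng=random.random) -> float:
    """Full-jitter delay: rng() scaled by min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retries(fn, is_retryable, max_attempts: int = 6, sleep=time.sleep):
    """Call fn(); on a retryable exception, wait with backoff and try again."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception as exc:
            # Give up on the last attempt or on errors that should not be retried.
            if attempt == max_attempts - 1 or not is_retryable(exc):
                raise
            sleep(backoff_delay(attempt))
```

The is_retryable callback is where the 429 versus 503 distinction from the table would go, for example by retrying only when the exception carries a 503 status code. The OpenAI Python client also accepts a max_retries setting, which may be sufficient for simpler cases.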

Clean up

These models use on-demand inference, which incurs charges only when you invoke a model, so there's no endpoint or infrastructure to tear down. To avoid unintended charges after testing:

If you generated short-term Bedrock API keys, they expire automatically (maximum 12 hours). To revoke one sooner, delete it in the Amazon Bedrock console.

If you opted in to the Priority tier for testing, return to Standard pricing for non-latency-sensitive traffic by removing the service_tier parameter from your invocations.

If you stored a Bedrock API key in AWS Secrets Manager for testing, delete the secret to avoid storage charges.

For pricing details by model and tier, refer to Amazon Bedrock pricing.

Pricing and availability

OpenAI GPT OSS and NVIDIA Nemotron models are available today on Amazon Bedrock in AWS GovCloud (US). In-Region inference is available in AWS GovCloud (US-West) (us-gov-west-1), and Geo cross-Region inference routes requests across AWS GovCloud (US-West) and AWS GovCloud (US-East) (us-gov-east-1) while keeping traffic within the AWS GovCloud (US) boundary.

Pricing is per token and varies by model and service tier. On-demand inference on the Standard tier incurs charges when you invoke a model, with no capacity to reserve and no infrastructure to tear down. For current rates, refer to Amazon Bedrock pricing.

Conclusion

OpenAI GPT OSS and NVIDIA Nemotron models are now available on Amazon Bedrock in AWS GovCloud (US), giving government customers access to advanced open-weight models within their compliance boundary. In this post, we covered the available models and their capabilities, the two endpoints for invoking them, the available service tiers, and scaling guidance. Government teams can run these open-weight models for mission workloads while keeping inference inside the AWS GovCloud (US) boundary, on AWS-operated infrastructure.

To get began:

Open the Amazon Bedrock console in your AWS GovCloud (US) account and try the models in the Playground.
Run the bedrock-mantle Python sample from this post against your own data.
Evaluate gpt-oss-120b, gpt-oss-20b, and NVIDIA Nemotron 3 Super 120B on your workloads to choose the model that fits your cost and latency profile.
For production deployment, review Scaling and throughput best practices and consider the Priority tier for latency-sensitive traffic.

