OpenAI has launched two open-weight models, gpt-oss-120b (117 billion parameters) and gpt-oss-20b (21 billion parameters), each built with a Mixture of Experts (MoE) design and a 128K context window. These models are the leading open source models, according to Artificial Analysis benchmarks, and excel at reasoning and agentic workflows. With Amazon SageMaker AI, you can fine-tune or customize models and deploy them with your choice of framework through a fully managed service. Amazon SageMaker Inference gives you the flexibility to bring your own inference code and framework without having to build and maintain your own clusters.
Although large language models (LLMs) excel at understanding language and generating content, building real-world agentic applications requires complex workflow management, tool calling capabilities, and context management. Multi-agent architectures address these challenges by breaking down complex systems into specialized components, but they introduce new complexities in agent coordination, memory management, and workflow orchestration.
In this post, we show how to deploy the gpt-oss-20b model to SageMaker managed endpoints and demonstrate a practical stock analyzer agent assistant example with LangGraph, a powerful graph-based framework that handles state management, coordinated workflows, and persistent memory systems. We then deploy our agents to Amazon Bedrock AgentCore, a unified orchestration layer that abstracts away infrastructure and allows you to securely deploy and operate AI agents at scale.
Solution overview
In this solution, we build an agentic stock analyzer with the following key components:
- The GPT OSS 20B model deployed to a SageMaker endpoint using vLLM, an open source serving framework for LLMs
- LangGraph to build a multi-agent orchestration framework
- Amazon Bedrock AgentCore to deploy the agents
The following diagram illustrates the solution architecture.
This architecture illustrates a multi-agent workflow hosted on Amazon Bedrock AgentCore Runtime running on AWS. A user submits a query, which is handled by a pipeline of specialized agents (a Data Gathering Agent, a Stock Performance Analyzer Agent, and a Stock Report Generation Agent) that are each responsible for a distinct part of the stock research process.
These agents collaborate within Amazon Bedrock AgentCore Runtime, and when language understanding or generation is required, they invoke a GPT OSS model hosted on SageMaker AI. The model processes the input and returns structured outputs that inform agent actions, enabling a fully serverless, modular, and scalable agentic system using open source models.
Prerequisites
- Make sure that you have the required quota for G6e instances to deploy the model. Request quota here if you don't have it.
- If this is your first time working with Amazon SageMaker Studio, you first need to create a SageMaker domain.
- Make sure your IAM role has the required permissions to deploy SageMaker models and endpoints. For more information, see How Amazon SageMaker AI works with IAM in the SageMaker Developer Guide.
Deploy GPT-OSS models to SageMaker Inference
Customers who want to customize their models and frameworks can use serverful deployments, but this requires access to GPUs, serving frameworks, load balancers, and infrastructure setup. SageMaker AI provides a fully managed hosting platform that takes care of provisioning the infrastructure with the necessary drivers, downloads the models, and deploys them. OpenAI's GPT-OSS models are released with a 4-bit quantization scheme (MXFP4), enabling fast inference while keeping resource usage low. These models can run on P5 (H100), P6 (H200), P4 (A100), and G6e (L40S) instances. The GPT-OSS models are sparse MoE architectures with 128 experts (120B) or 32 experts (20B), where each token is routed to 4 experts with no shared expert. Using MXFP4 for the MoE weights alone reduces the model sizes to 63 GB (120B) and 14 GB (20B), making them runnable on a single H100 GPU.
To serve these models effectively, you need a robust serving framework like vLLM. To deploy the model, we build a vLLM container with the latest version that supports GPT OSS models on SageMaker AI.
You can use the following Dockerfile and script to build the container and push it to Amazon Elastic Container Registry (Amazon ECR). The recommended approach is to do this directly from Amazon SageMaker Studio, which provides a managed JupyterLab environment with AWS CLI access where you can build and push images to Amazon ECR as part of your SageMaker workflow. Alternatively, you can also perform the same steps on an Amazon Elastic Compute Cloud (Amazon EC2) instance with Docker installed.
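The exact Dockerfile contents depend on the vLLM release you target, so the following build-and-push script is only a minimal sketch; the region, repository name, and tag are placeholder values:

```bash
# Build the vLLM serving image and push it to Amazon ECR.
# Region, repository name, and tag are placeholders; adjust as needed.
REGION=us-west-2
REPO=vllm-gptoss
ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
URI=$ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest

aws ecr create-repository --repository-name $REPO --region $REGION || true
aws ecr get-login-password --region $REGION | \
  docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

docker build -t $REPO .       # uses the Dockerfile in the current directory
docker tag $REPO:latest $URI
docker push $URI
```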
After you have built and pushed the container to Amazon ECR, you can open Amazon SageMaker Studio by going to the SageMaker AI console, as shown in the following screenshot.
You can then create a JupyterLab space or use an existing one to launch JupyterLab and run notebooks.
Clone the following notebook and run "Option 3: Deploying from HF using BYOC." Update the required parameters, such as the inference image in the notebook, with the container image. We also provide the necessary environment variables, as shown in the following code.
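The variable names below are illustrative assumptions for a vLLM-based container; match them to the configuration options your own image actually reads:

```python
# Illustrative deployment parameters; the image URI and the environment
# variable names are assumptions for this sketch, not a fixed contract.
inference_image = "<account-id>.dkr.ecr.<region>.amazonaws.com/vllm-gptoss:latest"

env = {
    "HF_MODEL_ID": "openai/gpt-oss-20b",      # model weights pulled from Hugging Face
    "OPTION_MAX_MODEL_LEN": "131072",         # 128K context window
    "OPTION_GPU_MEMORY_UTILIZATION": "0.9",   # leave headroom for the server process
}
```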
After you set up the deployment configuration, you can deploy to SageMaker AI using the following code:
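A minimal deployment sketch with the SageMaker Python SDK might look like the following; the endpoint name and instance type are illustrative choices, not requirements:

```python
import sagemaker
from sagemaker.model import Model

role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

model = Model(
    image_uri=inference_image,  # the BYOC vLLM image pushed to Amazon ECR earlier
    env=env,
    role=role,
)

# Deploy to a single G6e instance; allow extra startup time for the model download
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.2xlarge",
    endpoint_name="gpt-oss-20b-endpoint",
    container_startup_health_check_timeout=900,
)
```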
You can now run an inference example:
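For instance, assuming the container exposes an OpenAI-style chat completions schema, an invocation could look like this sketch:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Payload shape assumes an OpenAI-compatible chat API inside the container
payload = {
    "messages": [{"role": "user", "content": "What drives a stock's P/E ratio?"}],
    "max_tokens": 512,
    "temperature": 0.2,
}

response = runtime.invoke_endpoint(
    EndpointName="gpt-oss-20b-endpoint",  # name from the deploy step
    ContentType="application/json",
    Body=json.dumps(payload),
)
print(json.loads(response["Body"].read())["choices"][0]["message"]["content"])
```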
Use LangGraph to build a stock analyzer agent
For our stock analysis multi-agent system, we use LangGraph to orchestrate the workflow. The Jupyter notebook for the code is located in this GitHub repository. The system includes three specialized tools that work together to analyze stocks comprehensively (a minimal orchestration sketch follows the list):
- The gather_stock_data tool collects comprehensive stock data for a given ticker symbol, including current price, historical performance, financial metrics, and market news. It returns formatted information covering price history, company fundamentals, trading metrics, and recent news headlines.
- The analyze_stock_performance tool performs detailed technical and fundamental analysis of the stock data, calculating metrics like price trends, volatility, and overall investment scores. It evaluates multiple factors, including P/E ratios, profit margins, and dividend yields, to provide a comprehensive performance assessment.
- The generate_stock_report tool creates professional PDF reports from the gathered stock data and analysis, automatically uploading them to Amazon S3 with organized date-based folders.
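The following is a minimal sketch of how these tools could be chained in a LangGraph StateGraph; the state fields and node bodies are simplified assumptions, with the full implementation in the notebook:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class StockState(TypedDict, total=False):
    ticker: str
    stock_data: str
    analysis: str
    report_path: str

# Each node wraps one of the tools described above (bodies simplified for the sketch)
def gather_node(state: StockState) -> StockState:
    return {"stock_data": gather_stock_data.invoke(state["ticker"])}

def analyze_node(state: StockState) -> StockState:
    return {"analysis": analyze_stock_performance.invoke(state["stock_data"])}

def report_node(state: StockState) -> StockState:
    return {"report_path": generate_stock_report.invoke(state["analysis"])}

# Wire the nodes into a sequential pipeline mirroring the three agents
graph = StateGraph(StockState)
graph.add_node("gather", gather_node)
graph.add_node("analyze", analyze_node)
graph.add_node("report", report_node)
graph.add_edge(START, "gather")
graph.add_edge("gather", "analyze")
graph.add_edge("analyze", "report")
graph.add_edge("report", END)

stock_graph = graph.compile()
```

Keeping each stage as its own node makes the intermediate state inspectable and lets you swap or extend stages without touching the rest of the pipeline.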
For local testing, you can use a simplified version of the system by importing the necessary functions from your local script. For example:
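A quick local check might look like the following; the module name stock_analyzer is a placeholder for wherever your tool functions live:

```python
# Local smoke test: import the tool functions directly from the local module
from stock_analyzer import gather_stock_data, analyze_stock_performance

data = gather_stock_data.invoke("AMZN")        # fetch data for one ticker
print(analyze_stock_performance.invoke(data))  # inspect the analysis output
```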
This way, you can iterate quickly on your agent's logic before deploying it to a scalable platform, making sure each component functions correctly and the overall workflow produces the expected results for different types of stocks.
Deploy to Amazon Bedrock AgentCore
After you have developed and tested your LangGraph workflow locally, you can deploy it to Amazon Bedrock AgentCore Runtime. Amazon Bedrock AgentCore handles the heavy lifting of container orchestration, session management, scaling, and infrastructure management. It provides persistent execution environments that can maintain an agent's state across multiple invocations.
Before deploying our stock analyzer agent to Amazon Bedrock AgentCore Runtime, we need to create an AWS Identity and Access Management (IAM) role with the appropriate permissions. This role allows Amazon Bedrock AgentCore to invoke your SageMaker endpoint for GPT-OSS model inference, manage Amazon ECR repositories for storing container images, write Amazon CloudWatch logs for monitoring and debugging, access Amazon Bedrock AgentCore workload services for runtime operations, and send telemetry data to AWS X-Ray and CloudWatch for observability. See the following code:
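A minimal sketch of creating such a role with boto3 follows; the role and policy names are placeholders, and a production policy should scope the Resource fields down to specific ARNs rather than "*":

```python
import json
import boto3

iam = boto3.client("iam")
ROLE_NAME = "stock-analyzer-agentcore-role"  # illustrative name

# Trust policy that lets the Bedrock AgentCore service assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "bedrock-agentcore.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

role = iam.create_role(
    RoleName=ROLE_NAME,
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)
agentcore_role_arn = role["Role"]["Arn"]

# Inline policy covering the permissions described above (broad for the sketch)
permissions = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": [
            "sagemaker:InvokeEndpoint",
            "ecr:GetAuthorizationToken",
            "ecr:BatchGetImage",
            "ecr:GetDownloadUrlForLayer",
            "logs:CreateLogGroup",
            "logs:CreateLogStream",
            "logs:PutLogEvents",
            "xray:PutTraceSegments",
            "xray:PutTelemetryRecords",
        ],
        "Resource": "*",
    }],
}
iam.put_role_policy(
    RoleName=ROLE_NAME,
    PolicyName="stock-analyzer-agentcore-permissions",
    PolicyDocument=json.dumps(permissions),
)
```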
After creating the role, you can use the Amazon Bedrock AgentCore Starter Toolkit to deploy your agent. The toolkit simplifies the deployment process by packaging your code, creating the necessary container image, and configuring the runtime environment:
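A deployment sketch with the starter toolkit is shown below; the entry point file, agent name, and region are assumptions, and parameter names can vary across toolkit versions:

```python
from bedrock_agentcore_starter_toolkit import Runtime

runtime = Runtime()

# Point the toolkit at the file containing the BedrockAgentCoreApp definition
runtime.configure(
    entrypoint="stock_analyzer_agent.py",   # assumed file name
    execution_role=agentcore_role_arn,      # ARN of the role created above
    requirements_file="requirements.txt",
    region="us-west-2",
    agent_name="stock_analyzer_agent",
)

# Builds the container image, pushes it to Amazon ECR, and deploys the runtime
runtime.launch()
```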
If you're using BedrockAgentCoreApp, it automatically creates an HTTP server that listens on port 8080, implements the required /invocations endpoint for processing the agent's requests, implements the /ping endpoint for health checks (which is crucial for asynchronous agents), handles proper content types and response formats, and manages error handling according to AWS standards.
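A minimal application skeleton could look like the following; the payload key and the compiled stock_graph workflow are assumptions carried over from the earlier sketch:

```python
from bedrock_agentcore.runtime import BedrockAgentCoreApp

app = BedrockAgentCoreApp()

@app.entrypoint
def invoke(payload):
    # `payload` is the JSON body sent to the /invocations endpoint
    ticker = payload.get("prompt", "AMZN")
    result = stock_graph.invoke({"ticker": ticker})  # compiled LangGraph workflow
    return {"result": result}

if __name__ == "__main__":
    app.run()  # serves /invocations and /ping on port 8080
```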
After you deploy to Amazon Bedrock AgentCore Runtime, you will be able to see the status show as Ready on the Amazon Bedrock AgentCore console.
Invoke the agent
After you create the agent, you must set up the agent invocation entry point. With Amazon Bedrock AgentCore Runtime, we decorate the invocation part of our agent with the @app.entrypoint decorator and use it as the entry point for our runtime. After you deploy the agent to Amazon Bedrock AgentCore Runtime, you can invoke it using the AWS SDK:
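An invocation sketch with boto3 is shown below; the runtime ARN is a placeholder you would copy from the deployment output:

```python
import json
import boto3

client = boto3.client("bedrock-agentcore")

response = client.invoke_agent_runtime(
    # Placeholder ARN; use the value reported by the starter toolkit
    agentRuntimeArn="arn:aws:bedrock-agentcore:us-west-2:111122223333:runtime/stock_analyzer_agent-id",
    qualifier="DEFAULT",
    payload=json.dumps({"prompt": "Analyze AMZN stock"}),
)

# The response body arrives as a byte stream
raw = response["response"].read()
print(raw.decode("utf-8"))
```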
After invoking the stock analyzer agent through Amazon Bedrock AgentCore Runtime, you must parse and format the response for clear presentation. The response processing involves the following steps (a parsing sketch follows the list):
- Decode the byte stream from Amazon Bedrock AgentCore into readable text.
- Parse the JSON response containing the complete stock analysis.
- Extract three main sections using regex pattern matching:
  - Stock Data Gathering section: Extracts core stock information, including symbol, company details, current pricing, market metrics, financial ratios, trading data, and recent news headlines.
  - Performance Analysis section: Analyzes technical indicators, fundamental metrics, and volatility measures to generate a comprehensive stock analysis.
  - Stock Report Generation section: Generates a detailed PDF report with all of the stock technical analysis.
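The following sketch illustrates these steps; the section heading names and the "result" JSON key are assumptions about how the agent formats its output:

```python
import json
import re

SECTIONS = ("Stock Data Gathering", "Performance Analysis", "Stock Report Generation")

def format_analysis(raw: bytes) -> None:
    """Decode, parse, and print the stock analysis response (illustrative)."""
    text = raw.decode("utf-8")  # step 1: decode the byte stream
    try:
        payload = json.loads(text)  # step 2: parse the JSON response
    except json.JSONDecodeError as err:
        # Fall back to plain text and surface debugging context if parsing fails
        print(f"JSON parse failed ({err}); raw response follows:\n{text}")
        return
    analysis = str(payload.get("result", payload)) if isinstance(payload, dict) else str(payload)

    # Step 3: extract each section with a regex that stops at the next heading
    stop = "|".join(re.escape(s) for s in SECTIONS)
    for section in SECTIONS:
        match = re.search(rf"{re.escape(section)}.*?(?={stop}|\Z)", analysis, re.DOTALL)
        print(f"\n=== {section} ===")
        print(match.group(0).strip() if match else "(section not found)")
```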
The system also includes error handling that gracefully handles JSON parsing errors, falls back to plain text display if structured parsing fails, and provides debugging information for troubleshooting parsing issues with the stock analysis response.
This formatted output makes it easy to review the agent's decision-making process and present professional stock analysis results to stakeholders, completing the end-to-end workflow from model deployment to meaningful business output.
Clean up
You can delete the SageMaker endpoint to avoid accruing costs after your testing by running the following cells in the same notebook:
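Assuming the predictor and model objects from the deployment step are still in scope, cleanup is two calls:

```python
# Remove the endpoint, its configuration, and the model to stop incurring charges
predictor.delete_endpoint(delete_endpoint_config=True)
model.delete_model()
```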
You can also delete the Amazon Bedrock AgentCore resources using the following commands:
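For instance, with the control-plane API; the runtime ID and repository name are placeholders taken from your own deployment:

```python
import boto3

# Delete the AgentCore runtime through the control-plane client
control = boto3.client("bedrock-agentcore-control")
control.delete_agent_runtime(agentRuntimeId="stock_analyzer_agent-id")  # placeholder ID

# Optionally remove the ECR repository that held the agent image
ecr = boto3.client("ecr")
ecr.delete_repository(repositoryName="bedrock-agentcore-stock_analyzer_agent", force=True)
```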
Conclusion
In this post, we built an end-to-end solution for deploying OpenAI's open-weight models on a single G6e (L40S) GPU, creating a multi-agent stock analysis system with LangGraph, and deploying it seamlessly with Amazon Bedrock AgentCore. This implementation demonstrates how organizations can now use powerful open source LLMs cost-effectively with efficient serving frameworks such as vLLM. Beyond the technical implementation, enhancing this workflow can provide significant business value, such as reduced stock analysis processing time and increased analyst productivity from automating routine stock assessments. Moreover, by freeing analysts from repetitive tasks, organizations can redirect skilled professionals toward complex cases and relationship-building activities that drive business growth.
We invite you to try out our code samples and iterate on your agentic workflows to meet your use cases.
About the authors
Vivek Gangasani is a Worldwide Lead GenAI Specialist Solutions Architect for SageMaker Inference. He drives go-to-market (GTM) and outbound product strategy for SageMaker Inference. He also helps enterprises and startups deploy, manage, and scale their GenAI models with SageMaker and GPUs. Currently, he is focused on developing strategies and solutions for optimizing inference performance and GPU efficiency for hosting large language models. In his free time, Vivek enjoys hiking, watching movies, and trying different cuisines.
Surya Kari is a Senior Generative AI Data Scientist at AWS, specializing in developing solutions that leverage state-of-the-art foundation models. He has extensive experience working with advanced language models, including DeepSeek-R1, the Llama family, and Qwen, focusing on their fine-tuning and optimization for specific scientific applications. His expertise extends to implementing efficient training pipelines and deployment strategies using AWS SageMaker, enabling the scaling of foundation models from development to production. He collaborates with customers to design and implement generative AI solutions, helping them navigate model selection, fine-tuning approaches, and deployment strategies to achieve optimal performance for their specific use cases.