
Introducing OS Level Actions in Amazon Bedrock AgentCore Browser

admin by admin
May 10, 2026
in Artificial Intelligence


AI agents that automate web workflows operate within the browser's web layer, the DOM that Playwright and the Chrome DevTools Protocol (CDP) expose. AgentCore Browser provides a secure, isolated browser environment for this, and it works well for the vast majority of automation: navigating pages, filling forms, clicking elements, extracting content. But the web layer has a hard boundary. Anything that the operating system renders (native dialogs, security prompts, certificate choosers, context menus, even Chrome settings) sits outside the DOM entirely. CDP can't see it, and Playwright can't interact with it.

When a web application calls window.print() and a system print dialog appears, Playwright has no DOM to interact with. When a workflow requires a keyboard shortcut or a right-click context menu, CDP has no mechanism to issue those commands at the OS level. When a browser session encounters a macOS privacy dialog, a Windows Security prompt, or a certificate chooser, these are invisible to the web automation layer. Such scenarios tend to surface in production. They're triggered by specific application states, OS configurations, or user permissions, not in testing, where web content is predictable enough to validate against.

The challenge compounds for vision-enabled agents. A common architecture is to capture a screenshot, send it to a model, receive back coordinates or instructions, and execute. This loop works well for web content, but breaks the moment that native UI appears. The screenshot captures it, the model reasons about it, and then there's nothing to act with. CDP can't reach what the OS rendered. The agent sees exactly what to do and has no way to do it.

We're announcing OS Level Actions for AgentCore Browser. This new capability unblocks these scenarios by exposing direct OS control through the InvokeBrowser API, so agents can interact with content visible on the screen, not only what's accessible through the browser's web layer. By combining full-desktop screenshots with mouse and keyboard control at the OS level, agents can observe native UI, reason about it, and act on it within the same session. This post walks through how OS Level Actions work, what actions are supported, and how to get started.

How OS Level Actions work

OS Level Actions are available for new and existing browser configurations without additional setup. After a session is active, you dispatch actions through the InvokeBrowser API. Each call carries exactly one action, identified by its type and arguments, and returns a SUCCESS or FAILED status. The active session is identified using the x-amzn-browser-session-id header, which ties each OS-level action to the correct browser session.
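As a minimal sketch of the "one action per call" rule, the helper below wraps a single action type and its arguments in the request body shape that the JSON examples in this post use. The function names build_request and session_headers are hypothetical and exist only for illustration; only the x-amzn-browser-session-id header name and the {"action": {...}} shape come from this post.

```python
import json


def build_request(action_type, **arguments):
    """Wrap exactly one action in the request body shape shown in this post.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if not action_type:
        raise ValueError("action_type is required")
    return {"action": {action_type: dict(arguments)}}


def session_headers(session_id):
    """Headers that tie a call to a session (header name taken from the post)."""
    return {
        "x-amzn-browser-session-id": session_id,
        "Content-Type": "application/json",
    }


body = build_request("mouseMove", x=100, y=200)
print(json.dumps(body))
```

Signing and sending the request are left out here; in the walkthrough later in this post, the invoke helper from the companion notebook handles both.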

The expected interaction pattern is an action-screenshot-reaction loop. The agent takes an action (click, type, shortcut), captures a screenshot to observe the current state of the screen, and then decides the next action based on what it sees. This loop lets the agent react to dynamic UI, including native dialogs and OS prompts that can appear mid-workflow.

  1. Agent sends an action. This can be a mouse click, key press, or shortcut using InvokeBrowser.
  2. AgentCore executes the action on the full OS desktop and returns SUCCESS or FAILED.
  3. Agent requests a screenshot to observe the current screen state.
  4. AgentCore captures the full desktop, including native dialogs, OS modals, and UI outside the browser window, and returns a base64-encoded PNG.
  5. Agent reasons about the screenshot, sending it to a vision model to determine what happened and what to do next.
  6. Agent sends the next action based on what it observed, continuing the loop.
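The steps above can be sketched as a small driver loop. This is an illustration under stated assumptions, not the service's own client: send_action, take_screenshot, and decide are hypothetical callables that you supply (for example, thin wrappers around InvokeBrowser calls and a vision model), and the max_steps cap is an added safeguard that the post does not mention.

```python
def run_loop(send_action, take_screenshot, decide, first_action, max_steps=10):
    """Drive an action-screenshot-reaction loop.

    send_action(action) -> status string such as "SUCCESS" or "FAILED"
    take_screenshot()   -> base64-encoded PNG string
    decide(png, status) -> next action dict, or None when the workflow is done
    All three callables are assumptions supplied by the caller.
    """
    history = []
    action = first_action
    for _ in range(max_steps):
        status = send_action(action)              # steps 1-2: act
        history.append((action, status))
        screenshot_b64 = take_screenshot()        # steps 3-4: observe
        action = decide(screenshot_b64, status)   # steps 5-6: decide
        if action is None:
            break
    return history
```

Passing the callables in keeps the loop independent of any particular model or signing code, which also makes it easy to exercise with stubs before pointing it at a live session.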

Supported actions

OS Level Actions are organized into three categories: mouse control, keyboard input, and visual capture. The following table summarizes the eight actions with their fields and constraints.

Action | Required fields | Optional fields | Notes
--- | --- | --- | ---
mouseClick | (none) | x, y, button, clickCount | Defaults to current position, LEFT, single click. clickCount: 1–10.
mouseMove | x, y | (none) | Moves cursor to coordinates.
mouseDrag | endX, endY | startX, startY, button | Drags from start to end. button defaults to LEFT.
mouseScroll | (none) | x, y, deltaX, deltaY | Negative deltaY scrolls down. Range: -1000 to 1000.
keyType | text | (none) | Types a string. Max 10,000 characters.
keyPress | key | presses | Presses a key N times. presses: 1–100, defaults to 1.
keyShortcut | keys | (none) | Key combination array. Up to 5 keys, for example, ["ctrl", "a"].
screenshot | (none) | format | Captures full OS desktop. Returns base64-encoded PNG.
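The limits in the table can be checked on the client before a call is sent. The sketch below is a hypothetical pre-flight validator, not part of any AWS SDK; it assumes the -1000 to 1000 range applies to both deltaX and deltaY, and it follows the table's required-field column for mouseDrag. The service performs its own validation regardless.

```python
def validate_action(action):
    """Return a list of problems; an empty list means the payload fits the table."""
    if len(action) != 1:
        return ["exactly one action type per call"]
    (name, args), = action.items()
    problems = []

    def check_range(field, low, high):
        value = args.get(field)
        if value is not None and not (low <= value <= high):
            problems.append(f"{name}.{field} must be between {low} and {high}")

    def require(*fields):
        for field in fields:
            if field not in args:
                problems.append(f"{name}.{field} is required")

    if name == "mouseClick":
        check_range("clickCount", 1, 10)
    elif name == "mouseMove":
        require("x", "y")
    elif name == "mouseDrag":
        require("endX", "endY")  # per the table's "Required fields" column
    elif name == "mouseScroll":
        check_range("deltaX", -1000, 1000)  # assumption: range covers both axes
        check_range("deltaY", -1000, 1000)
    elif name == "keyType":
        require("text")
        if len(args.get("text", "")) > 10_000:
            problems.append("keyType.text exceeds 10,000 characters")
    elif name == "keyPress":
        require("key")
        check_range("presses", 1, 100)
    elif name == "keyShortcut":
        require("keys")
        if len(args.get("keys", [])) > 5:
            problems.append("keyShortcut.keys accepts up to 5 keys")
    elif name != "screenshot":
        problems.append(f"unknown action type: {name}")
    return problems
```

Returning a list of messages rather than raising on the first problem lets an agent log every issue with a model-generated action in one pass.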

Mouse actions

Mouse actions cover the full range of pointer interactions: clicking, moving, dragging, and scrolling. Coordinate fields are optional for mouseClick. If omitted, the click lands on the current cursor position as a single left-button click. This is useful when a prior mouseMove has already positioned the cursor. mouseDrag requires the four coordinates, start and end positions. mouseScroll accepts a position and delta values for both axes: negative deltaY scrolls down, positive scrolls up. A right-click context menu, for example, is a single mouseClick with button set to RIGHT at the target coordinates. Note that some context menu items might not function as expected because of the virtualized environment in which the browser session runs.
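Because the sign convention for deltaY is easy to get backwards, a thin wrapper can hide it. The helpers below are hypothetical conveniences (scroll_action and right_click_action are not names from the API); they only build payload dictionaries in the shape used by the examples in this post, and they clamp the delta to the -1000 to 1000 range given in the table.

```python
def scroll_action(amount_down, x=None, y=None):
    """Build a mouseScroll payload; a positive amount_down scrolls down.

    The API treats negative deltaY as "down", so the sign is flipped here
    and the result is clamped to the documented -1000..1000 range.
    """
    delta_y = max(-1000, min(1000, -amount_down))
    args = {"deltaY": delta_y}
    if x is not None and y is not None:
        args.update({"x": x, "y": y})
    return {"mouseScroll": args}


def right_click_action(x, y):
    """Build a right-click payload, for example to open a context menu."""
    return {"mouseClick": {"x": x, "y": y, "button": "RIGHT"}}
```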

Keyboard actions

The three keyboard actions cover different levels of input. keyType is for typing text. It sends characters directly and handles strings up to 10,000 characters. keyPress is for individual keys that need to be pressed repeatedly, such as tab to advance through form fields or escape to dismiss a modal. keyShortcut is for combinations: pass an array of key names and AgentCore presses them simultaneously.

Key names for keyPress and keyShortcut must be lowercase. Supported keys include single characters (a–z, 0–9) and named keys such as enter, tab, space, backspace, delete, escape, ctrl, alt, and shift.

To select all text, for example, you would use keyShortcut with ["ctrl", "a"].

{
  "action": {
    "keyShortcut": {
      "keys": ["ctrl", "a"]
    }
  }
}
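Since key names must be lowercase and a shortcut accepts at most five keys, model-generated key names such as "Ctrl" are worth normalizing before they are sent. The following sketch uses a hypothetical helper name, shortcut_action; it does not check names against the supported key list, because this post gives only examples of supported keys rather than the full set.

```python
def shortcut_action(*keys):
    """Build a keyShortcut payload with lowercase key names (1 to 5 keys)."""
    normalized = [str(key).strip().lower() for key in keys]
    if not 1 <= len(normalized) <= 5:
        raise ValueError("keyShortcut takes between 1 and 5 keys")
    if any(not key for key in normalized):
        raise ValueError("key names must not be empty")
    return {"keyShortcut": {"keys": normalized}}


print(shortcut_action("Ctrl", "A"))
```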

Screenshot

The screenshot action captures the full OS desktop and returns a base64-encoded PNG in the response. It's the only action that returns data. The other actions return only a status (SUCCESS or FAILED) and an error field on failure.

{
   "action":{
      "screenshot":{
         "format":"PNG"
      }
   }
}

Getting started

The following examples walk through the action-screenshot-reaction loop, matching the companion notebook. For the full working notebook with all eight actions demonstrated end to end, start there.

Set up clients and create a browser

You need two clients: a control plane client (bedrock-agentcore-control) for managing browser resources, and a data plane client (bedrock-agentcore) for dispatching actions during a session.

import boto3
import time

browser_boto3 = boto3.client('bedrock-agentcore-control', region_name="us-west-2")

BROWSER_NAME = "browser_with_os_actions"

Before starting a session, you need an AWS Identity and Access Management (IAM) execution role and a browser resource. The execution role requires bedrock-agentcore:InvokeBrowser, bedrock-agentcore:StartBrowserSession, and bedrock-agentcore:StopBrowserSession permissions. The companion notebook includes a helper that creates this role for you:

from helpers.utils import create_agentcore_execution_role, SAMPLE_ROLE_NAME

execution_role_arn = create_agentcore_execution_role(SAMPLE_ROLE_NAME)

With the role created, create a custom browser:

created_browser = browser_boto3.create_browser(
    name=BROWSER_NAME,
    executionRoleArn=execution_role_arn,
    networkConfiguration={
        'networkMode': 'PUBLIC'
    }
)

browser_id = created_browser['browserId']
print(f"Browser ID: {browser_id}")

Start a browser session

With the browser resource created, start a session. The viewPort sets the screen resolution. This determines the coordinate space for mouse actions and the dimensions of captured screenshots. The sessionTimeoutSeconds controls how long the session stays alive before it's automatically terminated.

# These helpers are included in the companion notebook repository
from helpers.browser import get_credentials, invoke, start_session, stop_session

creds, default_region = get_credentials()
BEDROCK_AGENTCORE_DP_ENDPOINT = f"https://bedrock-agentcore.{default_region}.amazonaws.com/"

sid = start_session(BEDROCK_AGENTCORE_DP_ENDPOINT, browser_id, region=default_region, credentials=creds)

# Wait for the session to initialize; adjust if needed for your environment
time.sleep(3)

The start_session helper sends a SigV4-signed PUT request to create the session and returns the sessionId. The invoke helper handles signing and dispatching individual actions.

Invoke an OS-level action

With the session running, you can dispatch OS-level actions through the invoke helper. Each call takes a single action, in this case a left click at coordinates (600, 370) on the screen:

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"mouseClick": {"x": 600, "y": 370, "button": "LEFT"}},
    region=default_region, credentials=creds, browser_id=browser_id
)

print(f"Mouse click status: {r.status_code}, action: {r.json()['result']}")

The response tells you whether the action succeeded or failed. Coordinates map to screen pixels: if the session viewport is 1920×1080, valid x values range from 0 to 1919 and y from 0 to 1079. Coordinates outside the screen dimensions return a ValidationException.
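A vision model can occasionally return coordinates just outside the screen, so it can help to check them locally before spending a call that would end in a ValidationException. This is a small sketch with hypothetical helper names; the viewport size is whatever you configured for the session.

```python
def in_viewport(x, y, width, height):
    """True when (x, y) is a valid pixel for a width x height viewport."""
    return 0 <= x < width and 0 <= y < height


def clamp_to_viewport(x, y, width, height):
    """Pull an out-of-range point back to the nearest valid pixel."""
    return (min(max(x, 0), width - 1), min(max(y, 0), height - 1))


print(in_viewport(1919, 1079, 1920, 1080))
print(clamp_to_viewport(2500, -5, 1920, 1080))
```

Whether to clamp or to reject is a design choice: clamping keeps the loop moving, while rejecting and asking the model again avoids clicking somewhere the model did not intend.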

Capture a screenshot

After each action, the agent must observe what happened. The screenshot action captures the full desktop and returns the image as a base64-encoded PNG:

import base64
from IPython.display import Image, display

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"screenshot": {"format": "PNG"}},
    region=default_region, credentials=creds, browser_id=browser_id
)

img_bytes = base64.b64decode(r.json()['result']['screenshot']['data'])
display(Image(img_bytes))

This is the observation step in the loop. The agent sends the screenshot to a vision model, which reasons about what's on screen and returns the next action to take. The cycle repeats until the workflow is complete.
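Because model-returned coordinates are only meaningful if the screenshot matches the session's coordinate space, it can be worth reading the image dimensions before sending it to a model. The sketch below reads the width and height from the PNG header using only the standard library; png_size is a hypothetical helper name, and it inspects only the signature and the IHDR chunk rather than validating the whole file.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_size(b64_png):
    """Return (width, height) from a base64-encoded PNG's IHDR chunk."""
    data = base64.b64decode(b64_png)
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte chunk type,
    # then IHDR data starting with width and height as big-endian uint32.
    if data[:8] != PNG_SIGNATURE or data[12:16] != b"IHDR":
        raise ValueError("not a PNG image")
    width, height = struct.unpack(">II", data[16:24])
    return width, height
```

With the response shape used above, this would be called as png_size(r.json()['result']['screenshot']['data']) and compared against the viewport configured for the session.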

Putting it together: dismissing a print dialog

Here is the action-screenshot-reaction loop in practice. Suppose the agent navigates to a page that triggers window.print(), and a native print dialog appears. The agent can't interact with it through CDP, but it can with OS Level Actions. First, the agent captures a screenshot to see the current state of the screen:

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"screenshot": {"format": "PNG"}},
    region=default_region, credentials=creds, browser_id=browser_id
)

# Send the screenshot to a vision model to identify the dialog and locate the Cancel button.
# The vision model integration depends on your agent architecture; see the Bedrock
# InvokeModel API for how to send images to Claude or other models.
# The model returns coordinates, e.g.: {"x": 410, "y": 535}

The vision model identifies the print dialog and returns the coordinates of the Cancel button. The agent clicks it:

r = invoke(
    BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
    {"mouseClick": {"x": 410, "y": 535, "button": "LEFT"}},
    region=default_region, credentials=creds, browser_id=browser_id
)

print(f"Click status: {r.status_code}, action: {r.json()['result']}")

The agent takes another screenshot to confirm that the dialog was dismissed, and the workflow continues.
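The step between the model's reply and the click can be sketched as a small parser. This assumes the model answers with a JSON object containing x and y somewhere in its text, as in the {"x": 410, "y": 535} example above; the function name click_from_model_reply is hypothetical, and a real agent would shape the prompt so the reply format is predictable.

```python
import json
import re


def click_from_model_reply(reply_text, button="LEFT"):
    """Extract {"x": ..., "y": ...} from a model reply and build a mouseClick."""
    match = re.search(r"\{[^{}]*\}", reply_text)  # first flat JSON object
    if match is None:
        raise ValueError("no JSON object found in model reply")
    coords = json.loads(match.group(0))
    x, y = int(coords["x"]), int(coords["y"])
    return {"mouseClick": {"x": x, "y": y, "button": button}}


print(click_from_model_reply('The Cancel button is at {"x": 410, "y": 535}.'))
```

The resulting dictionary has the same shape as the action passed to the invoke helper in the click example above.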

Stop the session and clean up

When the workflow is finished, stop the session and clean up resources:

stop_session(BEDROCK_AGENTCORE_DP_ENDPOINT, sid, browser_id, region=default_region, credentials=creds)
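If an action raises partway through a workflow, the session would otherwise stay alive until its timeout. One way to make the stop step unconditional is a context manager, sketched below with hypothetical start and stop callables so that it stays independent of the notebook helpers.

```python
from contextlib import contextmanager


@contextmanager
def browser_session(start, stop):
    """Start a session and always stop it, even if the body raises.

    start() -> session id; stop(session_id) -> None. Both are callables
    supplied by the caller (assumptions for this sketch).
    """
    session_id = start()
    try:
        yield session_id
    finally:
        stop(session_id)
```

With the helpers used in this post, start could be a lambda that calls start_session with the endpoint, browser ID, region, and credentials, and stop a lambda that forwards the session ID to stop_session.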

To delete the browser resource and IAM role:

browser_boto3.delete_browser(browserId=browser_id)
print(f"Browser {browser_id} deleted")

from helpers.utils import delete_agentcore_execution_role, SAMPLE_ROLE_NAME
delete_agentcore_execution_role(SAMPLE_ROLE_NAME)

These steps (act, observe, decide) form the core of the action-screenshot-reaction pattern. The companion notebook walks through the eight supported actions with a live browser session, including mouse drag, scroll, keyboard input, and shortcut combinations.

Conclusion

When we launched Amazon Bedrock AgentCore Browser, it gave AI agents a fully managed, cloud-based browser environment to interact with websites. It navigated pages, extracted content, and automated workflows at scale through Playwright and CDP. OS Level Actions extend that capability beyond the web layer to UI elements visible on the screen. Native dialogs, security prompts, keyboard shortcuts, and browser chrome are no longer blockers. Agents can now observe, reason about, and act on the full OS desktop within the same session.

Combined with AgentCore Browser's existing capabilities like visual understanding and framework integration with Playwright and Amazon Nova Act, OS Level Actions close the last gap in browser automation coverage.

To start building:


About the authors

Evandro Franco

Evandro Franco is a Sr. Data Scientist working at Amazon Web Services. He is part of the Global GTM team that helps AWS customers overcome business challenges related to AI/ML on top of AWS, mainly on Amazon Bedrock AgentCore and Strands Agents. He has more than 18 years of experience working with technology, from software development, infrastructure, and serverless to machine learning. In his free time, Evandro enjoys playing with his son, mainly building some funny Lego bricks.

Phelipe Fabres

Phelipe Fabres is a Sr. Solutions Architect for Generative AI at AWS for Startups. He is part of a global Frontier AI team with a focus on customers that are building Foundation Models/LLMs/SLMs. He has worked extensively on agentic systems and software-driven AI systems. He has more than 10 years of experience working with software development, from monolith to event-driven architectures, and holds a Ph.D. in Graph Theory. In his free time, Phelipe enjoys playing with his daughter, mainly board games and drawing princesses.

Saurav Das

Saurav Das is part of the Amazon Bedrock AgentCore Product Management team. He has more than 15 years of experience working with cloud, data, and infrastructure technologies. He has a deep interest in solving customer challenges centered around data and AI infrastructure.

Yanda Hu

Yanda Hu is a software engineer on the Amazon Bedrock AgentCore Engineering team with 5+ years of experience building machine learning and AI solutions at scale. He specializes in designing and delivering scalable agentic systems. He is passionate about the emerging agentic AI landscape, focusing on helping customers overcome real-world challenges in agentic workflows.

Cristiano Scandura

Cristiano has been in the IT industry since 1998. He joined Amazon Web Services (AWS) in 2018, where he worked on projects for enterprise clients. Currently, he specializes in GenAI and machine learning (ML) projects for all industries in AWS Worldwide Public Sector.

Joshua Samuel

Joshua Samuel is a Senior AI/ML Specialist Solutions Architect at AWS, based in Melbourne, Australia, who accelerates enterprise transformation through AI/ML and generative AI solutions. A passionate disrupter, he specializes in agentic AI and coding techniques: anything that makes developers faster and happier. Outside work, he tinkers with home automation and AI coding projects, and enjoys life with his wife, kids, and dog.
