AI agents that automate web workflows operate through the browser's web layer, the DOM that Playwright and the Chrome DevTools Protocol (CDP) expose. AgentCore Browser provides a secure, isolated browser environment for this, and it works well for the vast majority of automation: navigating pages, filling forms, clicking elements, extracting content. But the web layer has a hard boundary. Anything that the operating system renders (native dialogs, security prompts, certificate choosers, context menus, even Chrome settings) sits outside the DOM entirely. CDP can't see it, and Playwright can't interact with it.
When a web application calls window.print() and a system print dialog appears, Playwright has no DOM to interact with. When a workflow requires a keyboard shortcut or a right-click context menu, CDP has no mechanism to issue those commands at the OS level. When a browser session encounters a macOS privacy dialog, a Windows Security prompt, or a certificate chooser, they're invisible to the web automation layer. These scenarios tend to surface in production. They're triggered by specific application states, OS configurations, or user permissions, not in testing, where web content is predictable enough to validate against.
The challenge compounds for vision-enabled agents. A common architecture is to capture a screenshot, send it to a model, receive back coordinates or instructions, and execute. This loop works well for web content, but it breaks the moment native UI appears. The screenshot captures it, the model reasons about it, and then there's nothing to act with. CDP can't reach what the OS rendered. The agent sees exactly what to do and has no way to do it.
We're announcing OS Level Actions for AgentCore Browser. This new capability unblocks these scenarios by exposing direct OS control through the InvokeBrowser API, so agents can interact with content visible on the screen, not only what's accessible through the browser's web layer. By combining full-desktop screenshots with mouse and keyboard control at the OS level, agents can observe native UI, reason about it, and act on it within the same session. This post walks through how OS Level Actions work, which actions are supported, and how to get started.
How OS Level Actions work
OS Level Actions are available for new and existing browser configurations without additional setup. After a session is active, you dispatch actions through the InvokeBrowser API. Each call carries exactly one action, identified by its type and arguments, and returns a SUCCESS or FAILED status. The active session is identified using the x-amzn-browser-session-id header, which ties each OS-level action to the correct browser session.
The expected interaction pattern is an action-screenshot-reaction loop. The agent takes an action (click, type, shortcut), captures a screenshot to observe the current state of the screen, and then decides the next action based on what it sees. This loop lets the agent react to dynamic UI, including native dialogs and OS prompts that can appear mid-workflow.
- Agent sends an action. This can be a mouse click, key press, or shortcut using InvokeBrowser.
- AgentCore executes the action on the full OS desktop and returns SUCCESS or FAILED.
- Agent requests a screenshot to observe the current screen state.
- AgentCore captures the full desktop, including native dialogs, OS modals, and UI outside the browser window, and returns a base64-encoded PNG.
- Agent reasons about the screenshot, sending it to a vision model to determine what happened and what to do next.
- Agent sends the next action based on what it observed, continuing the loop.
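The loop above can be sketched in a few lines of Python. This is a minimal sketch, not the companion notebook's implementation: `invoke_action` and `decide_next_action` are hypothetical caller-supplied hooks (a wrapper around the notebook's invoke helper and a vision-model call, respectively), and the result field names follow the shapes shown later in this post.

```python
import base64

def run_loop(invoke_action, decide_next_action, max_steps=10):
    """Drive the action-screenshot-reaction loop.

    invoke_action(action) -> parsed result dict for one InvokeBrowser call;
    decide_next_action(png_bytes) -> the next action dict, or None when the
    workflow is complete. Both are hypothetical caller-supplied hooks.
    """
    for _ in range(max_steps):
        # Observe: capture the full desktop as a base64-encoded PNG
        result = invoke_action({"screenshot": {"format": "PNG"}})
        png_bytes = base64.b64decode(result["screenshot"]["data"])
        # Decide: the vision model maps the screenshot to the next action
        action = decide_next_action(png_bytes)
        if action is None:
            return True  # workflow complete
        # Act: each InvokeBrowser call carries exactly one action
        if invoke_action(action).get("status") == "FAILED":
            return False  # stop on failure so the caller can inspect state
    return False  # step budget exhausted
```

Returning a boolean lets the caller distinguish a completed workflow from one that failed or ran out of steps.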
Supported actions
OS Level Actions are organized into three categories: mouse control, keyboard input, and visual capture. The following table summarizes the eight actions with their fields and constraints.
| Action | Required fields | Optional fields | Notes |
| --- | --- | --- | --- |
| mouseClick | — | x, y, button, clickCount | Defaults to the current position, LEFT, single click. clickCount: 1–10. |
| mouseMove | x, y | — | Moves the cursor to the coordinates. |
| mouseDrag | endX, endY | startX, startY, button | Drags from start to end. button defaults to LEFT. |
| mouseScroll | — | x, y, deltaX, deltaY | Negative deltaY = scroll down. Range: -1000 to 1000. |
| keyType | text | — | Types a string. Max 10,000 characters. |
| keyPress | key | presses | Presses a key N times. presses: 1–100, defaults to 1. |
| keyShortcut | keys | — | Key combination array. Up to 5 keys, for example, ["ctrl", "a"]. |
| screenshot | — | format | Captures the full OS desktop. Returns a base64-encoded PNG. |
Mouse actions
Mouse actions cover the full range of pointer interactions: clicking, moving, dragging, and scrolling. Coordinate fields are optional for mouseClick. If omitted, the click lands at the current cursor position with a left-button single click, which is useful when a prior mouseMove has already positioned the cursor. mouseDrag takes four coordinates, the start and end positions. mouseScroll accepts a position and delta values for both axes: negative deltaY scrolls down, positive scrolls up. A right-click context menu, for example, is a single mouseClick with button set to RIGHT at the target coordinates. Note that some context menu items might not function as expected because of the virtualized environment in which the browser session runs.
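As a sketch, that right-click can be expressed as the following request body (the coordinates are illustrative):

```
{
  "action": {
    "mouseClick": {
      "x": 300,
      "y": 200,
      "button": "RIGHT"
    }
  }
}
```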
Keyboard actions
The three keyboard actions cover different levels of input. keyType is for typing text: it sends characters directly and handles strings up to 10,000 characters. keyPress is for individual keys that need to be pressed repeatedly, such as tab to advance through form fields or escape to dismiss a modal. keyShortcut is for combinations: pass an array of key names and AgentCore presses them simultaneously.
Key names for keyPress and keyShortcut must be lowercase. Supported keys include single characters (a–z, 0–9) and named keys such as enter, tab, space, backspace, delete, escape, ctrl, alt, and shift.
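For instance, pressing tab three times to advance through form fields is a keyPress with the optional presses field:

```
{
  "action": {
    "keyPress": {
      "key": "tab",
      "presses": 3
    }
  }
}
```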
To select all text, for example, you would use keyShortcut with ["ctrl", "a"]:
{
  "action": {
    "keyShortcut": {
      "keys": ["ctrl", "a"]
    }
  }
}
Screenshot
The screenshot action captures the full OS desktop and returns a base64-encoded PNG in the response. It's the only action that returns data. The other actions return only a status (SUCCESS or FAILED) and an error field on failure.
{
  "action": {
    "screenshot": {
      "format": "PNG"
    }
  }
}
Getting started
The following examples walk through the action-screenshot-reaction loop, matching the companion notebook. For the full working notebook with all eight actions demonstrated end to end, start there.
Set up clients and create a browser
You need two clients: a control plane client (bedrock-agentcore-control) for managing browser resources, and a data plane client (bedrock-agentcore) for dispatching actions during a session.
import boto3
import time
browser_boto3 = boto3.client('bedrock-agentcore-control', region_name="us-west-2")
BROWSER_NAME = "browser_with_os_actions"
Before starting a session, you need an AWS Identity and Access Management (IAM) execution role and a browser resource. The execution role requires the bedrock-agentcore:InvokeBrowser, bedrock-agentcore:StartBrowserSession, and bedrock-agentcore:StopBrowserSession permissions. The companion notebook includes a helper that creates this role for you:
from helpers.utils import create_agentcore_execution_role, SAMPLE_ROLE_NAME
execution_role_arn = create_agentcore_execution_role(SAMPLE_ROLE_NAME)
With the role created, create a custom browser:
created_browser = browser_boto3.create_browser(
name=BROWSER_NAME,
executionRoleArn=execution_role_arn,
networkConfiguration={
'networkMode': 'PUBLIC'
}
)
browser_id = created_browser['browserId']
print(f"Browser ID: {browser_id}")
Start a browser session
With the browser resource created, start a session. The viewPort sets the screen resolution, which determines the coordinate space for mouse actions and the dimensions of captured screenshots. The sessionTimeoutSeconds setting controls how long the session stays alive before it's automatically terminated.
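As a rough sketch of those two settings in a StartBrowserSession request body, assuming viewPort takes pixel width and height (the session name and values here are illustrative, not defaults):

```
{
  "name": "os-actions-demo",
  "sessionTimeoutSeconds": 3600,
  "viewPort": {
    "width": 1920,
    "height": 1080
  }
}
```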
# These helpers are included in the companion notebook repository
from helpers.browser import get_credentials, invoke, start_session, stop_session
creds, default_region = get_credentials()
BEDROCK_AGENTCORE_DP_ENDPOINT = f"https://bedrock-agentcore.{default_region}.amazonaws.com/"
sid = start_session(BEDROCK_AGENTCORE_DP_ENDPOINT, browser_id, region=default_region, credentials=creds)
# Wait for the session to initialize; adjust if needed for your environment
time.sleep(3)
The start_session helper sends a SigV4-signed PUT request to create the session and returns the sessionId. The invoke helper handles signing and dispatching individual actions.
Invoke an OS-level action
With the session running, you can dispatch OS-level actions through the invoke helper. Each call takes a single action, in this case a left click at coordinates (600, 370) on the screen:
r = invoke(
BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
{"mouseClick": {"x": 600, "y": 370, "button": "LEFT"}},
region=default_region, credentials=creds, browser_id=browser_id
)
print(f"Mouse click status: {r.status_code}, action: {r.json()['result']}")
The response tells you whether the action succeeded or failed. Coordinates map to screen pixels: if the session viewport is 1920×1080, valid x values range from 0 to 1919 and y values from 0 to 1079. Coordinates outside the screen dimensions return a ValidationException.
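Because out-of-range coordinates fail with a ValidationException, a small client-side guard can catch them before the round trip. A minimal sketch, where the helper name and default viewport are illustrative:

```python
def click_action(x, y, width=1920, height=1080):
    """Build a mouseClick action, rejecting coordinates outside the viewport.

    Valid pixels run from 0 to width-1 and 0 to height-1; anything outside
    that range would come back from InvokeBrowser as a ValidationException.
    """
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"({x}, {y}) is outside the {width}x{height} viewport")
    return {"mouseClick": {"x": x, "y": y, "button": "LEFT"}}
```

The returned dict can be passed directly as the action argument to the invoke helper.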
Capture a screenshot
After each action, the agent needs to observe what happened. The screenshot action captures the full desktop and returns the image as a base64-encoded PNG:
import base64
from IPython.display import Image, display
r = invoke(
BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
{"screenshot": {"format": "PNG"}},
region=default_region, credentials=creds, browser_id=browser_id
)
img_bytes = base64.b64decode(r.json()['result']['screenshot']['data'])
display(Image(img_bytes))
This is the observation step in the loop. The agent sends the screenshot to a vision model, which reasons about what's on screen and returns the next action to take. The cycle repeats until the workflow is complete.
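One way to implement that observation step is Amazon Bedrock's Converse API, which accepts raw image bytes in a message. This is a sketch under assumptions: the model ID and prompt are illustrative, and your agent may use a different model or SDK entirely.

```python
def build_vision_request(png_bytes, prompt,
                         model_id="anthropic.claude-3-5-sonnet-20240620-v1:0"):
    """Package a desktop screenshot and a question for the Converse API.

    The default model ID is illustrative; Converse accepts PNG bytes
    directly in an image content block alongside the text prompt.
    """
    return {
        "modelId": model_id,
        "messages": [{
            "role": "user",
            "content": [
                {"image": {"format": "png", "source": {"bytes": png_bytes}}},
                {"text": prompt},
            ],
        }],
    }

# Calling the model requires AWS credentials, so it is only sketched here:
# import boto3
# bedrock = boto3.client("bedrock-runtime")
# resp = bedrock.converse(**build_vision_request(
#     img_bytes, "Locate the Cancel button and return its x,y coordinates."))
```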
Putting it together: dismissing a print dialog
Here is the action-screenshot-reaction loop in practice. Suppose the agent navigates to a page that triggers window.print(), and a native print dialog appears. The agent can't interact with it through CDP, but it can with OS Level Actions. First, the agent captures a screenshot to see the current state of the screen:
r = invoke(
BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
{"screenshot": {"format": "PNG"}},
region=default_region, credentials=creds, browser_id=browser_id
)
# Send the screenshot to a vision model to identify the dialog and locate the Cancel button.
# The vision model integration depends on your agent architecture; see the Bedrock
# InvokeModel API for how to send images to Claude or other models.
# The model returns coordinates, e.g.: {"x": 410, "y": 535}
The vision model identifies the print dialog and returns the coordinates of the Cancel button. The agent clicks it:
r = invoke(
BEDROCK_AGENTCORE_DP_ENDPOINT, sid,
{"mouseClick": {"x": 410, "y": 535, "button": "LEFT"}},
region=default_region, credentials=creds, browser_id=browser_id
)
print(f"Click status: {r.status_code}, action: {r.json()['result']}")
The agent takes another screenshot to confirm that the dialog was dismissed, and the workflow continues.
Stop the session and clean up
When the workflow is finished, stop the session and clean up resources:
stop_session(BEDROCK_AGENTCORE_DP_ENDPOINT, sid, browser_id, region=default_region, credentials=creds)
To delete the browser resource and IAM role:
browser_boto3.delete_browser(browserId=browser_id)
print(f"Browser {browser_id} deleted")
from helpers.utils import delete_agentcore_execution_role, SAMPLE_ROLE_NAME
delete_agentcore_execution_role(SAMPLE_ROLE_NAME)
These steps (act, observe, decide) form the core of the action-screenshot-reaction pattern. The companion notebook walks through all eight supported actions with a live browser session, including mouse drag, scroll, keyboard input, and shortcut combinations.
Conclusion
When we launched Amazon Bedrock AgentCore Browser, it gave AI agents a fully managed, cloud-based browser environment for interacting with websites. It navigated pages, extracted content, and automated workflows at scale through Playwright and CDP. OS Level Actions extend that capability beyond the web layer to UI elements visible on the screen. Native dialogs, security prompts, keyboard shortcuts, and browser chrome are no longer blockers. Agents can now observe, reason about, and act on the full OS desktop within the same session.
Combined with AgentCore Browser's existing capabilities, like visual understanding and framework integration with Playwright and Amazon Nova Act, OS Level Actions close the last gap in browser automation coverage.
To start building:
About the authors

