Organizations are often inundated with video and audio content that contains valuable insights. However, extracting those insights efficiently and with high accuracy remains a challenge. This post explores an innovative solution to accelerate video and audio review workflows through a thoughtfully designed user experience that enables human and AI collaboration. By approaching the problem from the user's point of view, we can create a powerful tool that allows people to quickly find relevant information within long recordings without the risk of AI hallucinations.
Many professionals, from lawyers and journalists to content creators and medical practitioners, need to review hours of recorded content regularly to extract verifiably accurate insights. Traditional methods of manual review or simple keyword searches over transcripts are time-consuming and often miss important context. More advanced AI-powered summarization tools exist, but they risk producing hallucinations or inaccurate information, which can be dangerous in high-stakes environments like healthcare or legal proceedings.
Our solution, the Recorded Voice Insight Extraction Webapp (ReVIEW), addresses these challenges by providing a seamless way for humans to collaborate with AI, accelerating the review process while maintaining accuracy and trust in the results. The application is built on top of Amazon Transcribe and Amazon Bedrock, a fully managed service that offers a choice of high-performing foundation models (FMs) from leading AI companies like AI21 Labs, Anthropic, Cohere, Meta, Mistral AI, Stability AI, and Amazon through a single API, along with a broad set of capabilities to build generative AI applications with security, privacy, and responsible AI.
User experience
To accelerate a user's review of a long-form audio or video file while mitigating the risk of hallucinations, we introduce the concept of timestamped citations. Not only are large language models (LLMs) capable of answering a user's question based on the transcript of the file, they are also capable of identifying the timestamp (or timestamps) of the transcript during which the answer was discussed. By using a combination of transcript preprocessing, prompt engineering, and structured LLM output, we enable the user experience shown in the following screenshot, which demonstrates the conversion of LLM-generated timestamp citations into clickable buttons (shown underlined in red) that navigate to the correct portion of the source video.
The user in this example has uploaded a number of videos, including some recordings of AWS re:Invent talks. You will notice that the preceding answer actually contains a hallucination originating from an error in the transcript; the AI assistant replied that "Hyperpaths" was announced, when in reality the service is called Amazon SageMaker HyperPod.
The user in the preceding screenshot had the following journey:
- The user asks the AI assistant "What's new with SageMaker?" The assistant searches the timestamped transcripts of the uploaded re:Invent videos.
- The assistant provides an answer with citations. These citations contain both the name of the video and a timestamp, and the frontend displays buttons corresponding to the citations. Each citation can point to a different video, or to different timestamps within the same video.
- The user reads that SageMaker "Hyperpaths" was announced. They proceed to verify the accuracy of the generated answer by selecting the buttons, which auto play the source video starting at that timestamp.
- The user sees that the product is actually called Amazon SageMaker HyperPod, and can be confident that SageMaker HyperPod was the product announced at re:Invent.
This experience, which is at the heart of the ReVIEW application, allows users to efficiently get answers to questions based on uploaded audio or video files and to verify the accuracy of those answers by rewatching the source media for themselves.
Solution overview
The full code for this application is available in the GitHub repo.
The architecture of the solution is shown in the following diagram, illustrating the flow of data through the application.
The workflow consists of the following steps:
- A user accesses the application through an Amazon CloudFront distribution, which adds a custom header and forwards HTTPS traffic to an Elastic Load Balancing application load balancer. Behind the load balancer is a containerized Streamlit application running on Amazon Elastic Container Service (Amazon ECS).
- Amazon Cognito handles user logins to the frontend application and Amazon API Gateway.
- When a user uploads a media file through the frontend, a pre-signed URL is generated for the frontend to upload the file to Amazon Simple Storage Service (Amazon S3).
- The frontend posts the file to an application S3 bucket, at which point a file processing flow is initiated through a triggered AWS Lambda function. The file is sent to Amazon Transcribe and the resulting transcript is stored in Amazon S3. The transcript gets postprocessed into a text form more appropriate for use by an LLM, and an AWS Step Functions state machine syncs the transcript to a knowledge base configured in Amazon Bedrock Knowledge Bases. The knowledge base sync process handles chunking and embedding of the transcript, and storing embedding vectors and file metadata in an Amazon OpenSearch Serverless vector database.
- If a user asks a question of one specific transcript (designated by the "pick media file" dropdown menu in the UI), the entire transcript is used to generate the response, so a retrieval step using the knowledge base is not required and an LLM is called directly through Amazon Bedrock.
- If the user asks a question whose answer might appear in any number of source videos (by choosing Chat with all media files on the dropdown menu in the UI), the Amazon Bedrock Knowledge Bases RetrieveAndGenerate API is used to embed the user query, find semantically similar chunks in the vector database, insert those chunks into an LLM prompt, and generate a specially formatted response. Both query paths are sketched in the example following this list.
- Throughout the process, application data such as transcription and ingestion status tracking, mapping of user names to uploaded files, and response caching is handled with Amazon DynamoDB.
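The following is a minimal sketch, with boto3, of what the two query paths described above might look like. It is illustrative only, not the repository's code: the model ID, model ARN, and knowledge base ID are placeholders you would supply from your own deployment, and the prompt wording is simplified.

```python
import boto3

bedrock_runtime = boto3.client("bedrock-runtime")
bedrock_agent_runtime = boto3.client("bedrock-agent-runtime")


def answer_from_one_transcript(question: str, transcript: str, model_id: str) -> str:
    """Single-file path: pass the full timestamped transcript to an LLM
    directly through Amazon Bedrock, with no retrieval step."""
    prompt = (
        "Answer the question using this timestamped transcript.\n\n"
        f"{transcript}\n\nQuestion: {question}"
    )
    response = bedrock_runtime.converse(
        modelId=model_id,  # placeholder: an LLM enabled in your account
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]


def answer_from_all_media(question: str, knowledge_base_id: str, model_arn: str) -> dict:
    """All-files path: the RetrieveAndGenerate API embeds the query, finds
    semantically similar transcript chunks, and generates a response."""
    return bedrock_agent_runtime.retrieve_and_generate(
        input={"text": question},
        retrieveAndGenerateConfiguration={
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": knowledge_base_id,  # placeholder
                "modelArn": model_arn,  # placeholder
            },
        },
    )
```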
One important characteristic of the architecture is the clear separation of frontend and backend logic through an API Gateway deployed REST API. This was a design decision to enable users of this application to replace the Streamlit frontend with a custom frontend. There are instructions for replacing the frontend in the README of the GitHub repository.
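As a rough illustration of that separation, a custom frontend only needs to call the REST API with a Cognito-issued token. The base URL, route, and payload below are hypothetical placeholders; the actual routes and request shapes are defined by the API stack and documented in the repository.

```python
import requests

API_BASE_URL = "https://<api-id>.execute-api.<region>.amazonaws.com/prod"  # placeholder


def ask_backend(question: str, cognito_id_token: str) -> dict:
    """Hypothetical call from a custom frontend to the Cognito-authorized REST API."""
    response = requests.post(
        f"{API_BASE_URL}/chat",  # placeholder route; see the repository for real routes
        json={"question": question},  # placeholder payload shape
        headers={"Authorization": f"Bearer {cognito_id_token}"},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()
```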
Timestamped citations
The key to this solution lies in the prompt engineering and structured output format. When generating a response to a user's question, the LLM is instructed not only to provide an answer to the question (if possible), but also to cite its sources in a specific way.
The full prompt can be seen in the GitHub repository, but a shortened pseudo prompt (for brevity) is shown here:
You are an intelligent AI which attempts to answer questions based on retrieved chunks of automatically generated transcripts.
Below are retrieved chunks of transcript with metadata including the file name. Each chunk includes a file name and lines of a transcript, each line beginning with a timestamp.
$$ retrieved transcript chunks $$
Your answer should be in json format, including a list of partial answers, each of which has a citation. The citation should include the source file name and timestamp. Here is the user's question:
$$ user question $$
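The timestamped transcript lines referenced in this prompt come from the postprocessing step described earlier. The following is a simplified sketch only (the repository's postprocessing differs in detail), assuming the standard Amazon Transcribe JSON output, of how word-level items could be grouped into lines prefixed with the start time of their first word:

```python
import json


def transcript_to_timestamped_lines(transcribe_json_path: str, words_per_line: int = 20) -> str:
    """Sketch: convert Amazon Transcribe output into lines of text, each
    prefixed with the start time (in seconds) of its first word."""
    with open(transcribe_json_path) as f:
        items = json.load(f)["results"]["items"]

    lines, current_words, line_start = [], [], None
    for item in items:
        token = item["alternatives"][0]["content"]
        if item["type"] == "pronunciation":
            if line_start is None:
                line_start = int(float(item["start_time"]))
            current_words.append(token)
        elif current_words:
            # Punctuation items have no timestamp; attach them to the preceding word
            current_words[-1] += token
        if len(current_words) >= words_per_line:
            lines.append(f"{line_start} {' '.join(current_words)}")
            current_words, line_start = [], None

    if current_words:
        lines.append(f"{line_start} {' '.join(current_words)}")
    return "\n".join(lines)
```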
The frontend then parses the LLM response into a fixed schema data model, described with Pydantic BaseModels:
from typing import List

from pydantic import BaseModel


class Citation(BaseModel):
    """A single citation from a transcript"""
    media_name: str
    timestamp: int


class PartialQAnswer(BaseModel):
    """Part of a complete answer, to be concatenated with other partial answers"""
    partial_answer: str
    citations: List[Citation]


class FullQAnswer(BaseModel):
    """Full user query response including citations and multiple partial answers"""
    answer: List[PartialQAnswer]
This format allows the frontend to parse the response and display buttons for each citation that cue up the relevant media segment for user review.
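For illustration, the following is a simplified sketch (not the repository's frontend code) of how the parsed response could drive Streamlit citation buttons, assuming Pydantic v2 and a mapping from media names to playable URLs (for example, pre-signed S3 URLs):

```python
import streamlit as st

# Assumes the FullQAnswer model defined above is in scope (Pydantic v2).


def render_answer(llm_json: str, media_urls: dict[str, str]) -> None:
    """Sketch: parse the LLM's JSON into FullQAnswer and render each partial
    answer with citation buttons that start playback at the cited timestamp."""
    parsed = FullQAnswer.model_validate_json(llm_json)
    for i, partial in enumerate(parsed.answer):
        st.write(partial.partial_answer)
        for j, citation in enumerate(partial.citations):
            label = f"{citation.media_name} @ {citation.timestamp}s"
            if st.button(label, key=f"citation-{i}-{j}"):
                # Play the cited media file starting at the cited timestamp
                st.video(media_urls[citation.media_name], start_time=citation.timestamp)
```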
Deployment details
The solution is deployed in the form of one AWS Cloud Development Kit (AWS CDK) stack, which contains four nested stacks:
- A backend that handles transcribing uploaded media and tracking job statuses
- A Retrieval Augmented Generation (RAG) stack that handles setting up OpenSearch Serverless and Amazon Bedrock Knowledge Bases
- An API stack that stands up an Amazon Cognito authorized REST API and various Lambda functions to logically separate the frontend from the backend
- A frontend stack that consists of a containerized Streamlit application running as a load-balanced service in an ECS cluster, with a CloudFront distribution connected to the load balancer
Prerequisites
The solution requires the following prerequisites:
- You need to have an AWS account and an AWS Identity and Access Management (IAM) role and user with permissions to create and manage the necessary resources and components for this application. If you don't have an AWS account, see How do I create and activate a new Amazon Web Services account?
- You also need to request access to at least one Amazon Bedrock LLM (to generate answers to questions) and one embedding model (to find transcript chunks that are semantically similar to a user question). The following Amazon Bedrock models are the default, but can be changed using a configuration file at application deployment time, as described later in this post:
- Amazon Titan Embeddings V2 – Text
- Amazon Nova Pro
- You need a Python environment with AWS CDK dependencies installed. For instructions, see Working with the AWS CDK in Python.
- Docker is required to build the Streamlit frontend container at deployment time.
- The minimal IAM permissions needed to bootstrap and deploy the AWS CDK are described in the ReVIEW/infra/minimal-iam-policy.json file in the GitHub repository. Make sure the IAM user or role deploying the stacks has these permissions.
Clone the repository
Fork the repository, and clone it to the location of your choice. For example:
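Assuming your fork lives under your own GitHub account (the URL below is a placeholder to replace with your fork's URL):

```bash
git clone https://github.com/<your-github-username>/ReVIEW.git
cd ReVIEW
```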
Edit the deployment config file
Optionally, edit the infra/config.yaml file to provide a descriptive base name for your stack. This file is also where you can choose specific Amazon Bedrock embedding models for semantic retrieval and LLMs for response generation, and define chunking strategies for the knowledge base that will ingest transcriptions of uploaded media files. This file is also where you can reuse an existing Amazon Cognito user pool if you want to bootstrap your application with an existing user base.
Deploy the AWS CDK stacks
Deploy the AWS CDK stacks with the following code:
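The exact commands are in the repository's README; if you deploy with the AWS CDK CLI directly, they typically look like the following:

```bash
# Bootstrap is needed once per AWS account and Region before the first CDK deployment
cdk bootstrap
# Deploy the parent stack and its four nested stacks
cdk deploy
```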
You only need to run the bootstrap step one time per AWS account. The deploy command deploys the parent stack and four nested stacks. The process takes approximately 20 minutes to complete.
When the deployment is complete, a CloudFront distribution URL of the form xxx.cloudfront.net will be printed on the console screen to access the application. This URL can also be found on the AWS CloudFormation console by locating the stack whose name matches the value in the config file, then choosing the Outputs tab and locating the value associated with the key ReVIEWFrontendURL. That URL will lead you to a login screen like the following screenshot.
Create an Amazon Cognito user to access the app
To log in to the running web application, you must create an Amazon Cognito user. Complete the following steps:
- On the Amazon Cognito console, navigate to the recently created user pool.
- In the Users section under User Management, choose Create user.
- Create a user name and password to log in to the ReVIEW application deployed in the account.
When the application deployment is destroyed (as described in the cleanup section), the Amazon Cognito pool remains in order to preserve the user base. The pool can be fully removed manually using the Amazon Cognito console.
Test the application
Test the application by uploading a few audio or video files on the File Upload tab. The application supports media formats supported by Amazon Transcribe. If you are looking for a sample video, consider downloading a TED talk. After uploading, you will see the file appear on the Job Status tab. You can track processing progress through the transcription, postprocessing, and knowledge base syncing steps on this tab. After at least one file is marked Complete, you can chat with it on the Chat With Your Media tab.
The Analyze Your Media tab allows you to create and apply custom LLM template prompts to individual uploaded files. For example, you can create a basic summary template, or an extract key information template, and apply it to your uploaded files here. This functionality is not described in detail in this post.
Clean up
The deployed application will incur ongoing costs even if it isn't used, for example from OpenSearch Serverless indexing and search OCU minimums. To delete all resources created when deploying the application, run the following command:
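The exact command is in the repository's README; with the AWS CDK CLI it is typically:

```bash
# Remove the deployed stacks (the Amazon Cognito user pool is retained, as noted earlier)
cdk destroy
```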
Conclusion
The solution presented in this post demonstrates a powerful pattern for accelerating video and audio review workflows while maintaining human oversight. By combining the power of AI models in Amazon Bedrock with human expertise, you can create tools that not only boost productivity but also preserve the critical element of human judgment in important decision-making processes.
We encourage you to explore this fully open sourced solution, adapt it to your specific use cases, and provide feedback on your experiences.
For expert assistance, the AWS Generative AI Innovation Center, AWS Professional Services, and our AWS Partners are here to help.
About the Author
David Kaleko is a Senior Applied Scientist in the AWS Generative AI Innovation Center.