Power video semantic search with Amazon Nova Multimodal Embeddings

Video semantic search is unlocking new value across industries. The demand for video-first experiences is reshaping how organizations deliver content, and customers expect fast, accurate access to specific moments within video. For example, sports broadcasters need to surface the exact moment a player scored to deliver highlight clips to fans instantly. Studios need to find every scene featuring a specific actor across thousands of hours of archived content to create personalized trailers and promotional content. News organizations need to retrieve footage by mood, location, or event to publish breaking stories faster than competitors. The goal is the same: deliver video content to end users quickly, capture the moment, and monetize the experience.

Video is naturally more complex than other modalities like text or image because it combines multiple unstructured signals: the visual scene unfolding on screen, the ambient audio and sound effects, the spoken dialogue, the temporal information, and the structured metadata describing the asset. A user searching for “a tense car chase with sirens” is asking about a visual event and an audio event at the same time. A user searching for a specific athlete by name may be looking for someone who appears prominently on screen but whose name is never spoken aloud.

The dominant approach today grounds all video signals into text, whether through transcription, manual tagging, or automated captioning, and then applies text embeddings for search. While this works for dialogue-heavy content, converting video to text inevitably loses critical information. Temporal understanding disappears, and transcription errors emerge from visual and audio quality issues. What if you had a model that could process all modalities and directly map them into a single searchable representation without losing detail? Amazon Nova Multimodal Embeddings is a unified embedding model that natively processes text, documents, images, video, and audio into a shared semantic vector space. It delivers leading retrieval accuracy and cost efficiency.

In this post, we show you how to build a video semantic search solution on Amazon Bedrock using Nova Multimodal Embeddings that intelligently understands user intent and retrieves accurate video results across all signal types simultaneously. We also share a reference implementation you can deploy and explore with your own content.

Figure 1: Example screenshot from final search solution

Solution overview

We built our solution on Nova Multimodal Embeddings combined with an intelligent hybrid search architecture that fuses semantic and lexical signals across all video modalities. Lexical search matches exact keywords and phrases, while semantic search understands meaning and context. We’ll explain our choice of this hybrid approach and its performance benefits in later sections.

Figure 2: End-to-end solution architecture

The architecture consists of two phases: an ingestion pipeline (steps 1-6) that processes video into searchable embeddings, and a search pipeline (steps 7-10) that routes user queries intelligently across these representations and merges results into a ranked list. Here are details for each of the steps:

1. Upload – Videos uploaded through the browser are stored in Amazon Simple Storage Service (Amazon S3), triggering the Orchestrator AWS Lambda function to update the status in Amazon DynamoDB and start the AWS Step Functions pipeline
2. Shot segmentation – AWS Fargate uses FFmpeg scene detection to split video into semantically coherent segments
3. Parallel processing – Three concurrent branches process each segment:
   a. Embeddings: Nova Multimodal Embeddings generates 1024-dimensional vectors for visual and audio, stored in Amazon S3 Vectors
   b. Transcription: Amazon Transcribe converts speech to text, aligned to segments. Nova Multimodal Embeddings generates text embeddings stored in Amazon S3 Vectors
   c. Celebrity detection: Amazon Rekognition identifies known individuals, mapped to segments by timestamp
4. Caption and genre generation – Amazon Nova 2 Lite synthesizes segment-level captions and genre labels from visual content and transcripts
5. Merge – AWS Lambda assembles all metadata (captions, transcripts, celebrities, genre) and retrieves embeddings from Amazon S3 Vectors
6. Index – Complete segment documents with metadata and vectors are bulk-indexed into Amazon OpenSearch Service
7. Authentication – Users authenticate through Amazon Cognito and access the front end through Amazon CloudFront
8. Query processing – Amazon API Gateway routes requests to the Search Lambda function, which executes two parallel operations: intent analysis and query embedding
9. Intent analysis – Amazon Bedrock (using Anthropic Claude Haiku) assigns relevance weights (0.0-1.0) across visual, audio, transcription, and metadata modalities
10. Query embedding – Nova Multimodal Embeddings embeds the query three times for visual, audio, and transcription similarity search

This flexible architecture addresses four key design decisions that most video search systems overlook: maintaining temporal context, handling multimodal queries, scaling across massive content libraries, and optimizing retrieval accuracy. A complete reference implementation is available on GitHub, and we encourage you to follow along with the following walkthrough to see how each decision contributes to accurate, scalable search across all signal types.

Segmentation for context continuity

Before generating any embeddings, you need to divide your video into searchable units, and the boundaries you draw have a direct impact on search accuracy. Each segment becomes the atomic unit of retrieval. If a segment is too short, it loses the surrounding context that gives a moment its meaning. If it is too long, it fuses multiple topics or scenes together, diluting relevance and making it harder for the search system to surface the exact moment. For simplicity, you can start with fixed-length chunks. Nova Multimodal Embeddings supports up to 30 seconds per embedding, giving you flexibility to capture full scenes. However, be aware that fixed boundaries may arbitrarily cut a scene mid-action or split a sentence mid-thought, disrupting the semantic meaning that makes a moment retrievable, as shown in the following figure.

Figure 3: Video segmentation strategies

The goal is semantic continuity: each segment should represent a coherent unit of meaning rather than an arbitrary slice of time. Fixed 10-second blocks are simple to produce, but they ignore the natural structure of the content. A scene change mid-segment splits a visual idea across two chunks, degrading both retrieval precision and embedding quality.

To solve this, we use FFmpeg’s scene detection to identify where the visual content actually changes. FFmpeg is an open source multimedia framework widely used for video processing, format conversion, and analysis. The _detect_scenes function that follows runs ffprobe (FFmpeg’s companion tool for media inspection) against the video and returns a list of timestamps, each marking a scene boundary:

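import subprocess

SCENE_THRESHOLD = 0.3  # assumed value; the original snippet does not show this constant

def _detect_scenes(video_path):
    result = subprocess.run(
        ['ffprobe', '-v', 'quiet', '-show_entries', 'frame=pts_time', '-of', 'csv=p=0',
         '-f', 'lavfi', f"movie={video_path},select='gt(scene,{SCENE_THRESHOLD})'"],
        capture_output=True, text=True
    )
    # Assumed parsing step (the source snippet ends before the return): one timestamp per line
    return [float(line.strip(', ')) for line in result.stdout.splitlines() if line.strip(', ')]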

The output is a simple list of timestamps like 12.345, 28.901, 45.678, each marking a natural boundary where the scene shifts.

With these boundaries in hand, the segmentation algorithm snaps each cut to the nearest scene change within an acceptable window, targeting around 10 seconds with a minimum of 5 seconds and a maximum of 15 seconds from the current start. If no scene changes fall in that range, it falls back to a hard cut at the target duration. The result is a set of segments that feel natural: 8.3s, 11.1s, 9.8s, 12.4s, 7.6s, each aligned to a real scene boundary rather than a fixed interval.
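The snapping logic described above can be sketched as follows. This is a minimal illustration, not the reference implementation: the function name build_segments, its parameters, and the choice to emit whatever remains as the final segment are assumptions made for the example.

```python
from bisect import bisect_left, bisect_right

def build_segments(scene_times, duration, target=10.0, min_len=5.0, max_len=15.0):
    """Split [0, duration] into segments, snapping cuts to scene changes when possible."""
    scene_times = sorted(scene_times)
    segments = []
    start = 0.0
    # Keep cutting while more than one maximum-length segment remains
    while duration - start > max_len:
        # Scene changes that fall inside the acceptable window for this segment
        lo = bisect_left(scene_times, start + min_len)
        hi = bisect_right(scene_times, start + max_len)
        candidates = scene_times[lo:hi]
        if candidates:
            # Snap to the scene change closest to the target duration
            cut = min(candidates, key=lambda t: abs(t - (start + target)))
        else:
            # No scene change in range: hard cut at the target duration
            cut = start + target
        segments.append((round(start, 3), round(cut, 3)))
        start = cut
    # Assumption: the remainder becomes the last segment, even if shorter than min_len
    segments.append((round(start, 3), round(duration, 3)))
    return segments
```

Calling build_segments([8.3, 19.4, 29.2, 41.6], 49.2) yields five segments of 8.3, 11.1, 9.8, 12.4, and 7.6 seconds, while build_segments([], 25.0) falls back to a hard cut at 10 seconds.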

This simple shot-based segmentation makes sure segment boundaries align with natural visual transitions rather than cutting arbitrarily. The target segment duration should be calibrated based on your content type and use case: action-heavy content with frequent cuts may benefit from visual segmentation like this, while documentary or interview content with longer takes may work better with longer, topic-based segmentation. For more advanced segmentation techniques, including audio-based topic segmentation and combined visual and audio approaches, we recommend reading Media2Cloud on AWS Guidance: Scene and Ad-Break Detection and Contextual Understanding for Advertising Using Generative AI.

Generate separate embeddings for visual, audio, and transcript signals

With segments defined, the choice of embedding model is where the biggest quality gap opens between approaches. The dominant approach today grounds all video signals into text before generating embeddings, but as we established earlier, video carries far more meaning than any transcript or caption can express. Visual action, ambient sound, on-screen text, and entity context are either lost entirely or approximated by imprecise descriptions.

Nova Multimodal Embeddings changes this fundamentally because it is a video-native model that can generate embeddings in two modes. The combined mode (AUDIO_VIDEO_COMBINED) fuses visual and audio signals into a unified representation, capturing the most important signals together. This approach benefits storage cost and retrieval latency by requiring only a single embedding per segment. Alternatively, the AUDIO_VIDEO_SEPARATE mode generates distinct visual and audio embeddings. This approach provides maximum representation in modality-specific embeddings and gives you greater control over when to search visual content versus audio content.

In our implementation, we also added a third speech embedding derived from Amazon Transcribe. This embedding is created by aligning full-sentence transcripts to the embedding segment timestamps, before and after, which preserves the semantic integrity of spoken language and ensures that a complete thought is never split across two embeddings.
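One possible way to keep sentences whole is to assign each sentence to the single segment it overlaps most. The sketch below assumes that rule; the reference implementation may align sentences differently, and the function name align_sentences and the tuple layouts are hypothetical.

```python
def align_sentences(sentences, segments):
    """Assign each whole sentence to the segment it overlaps most.

    sentences: list of (start_sec, end_sec, text) tuples
    segments:  list of (start_sec, end_sec) tuples
    Returns one transcript string per segment, in segment order.
    """
    buckets = [[] for _ in segments]
    for s_start, s_end, text in sentences:
        # Overlap in seconds between this sentence and every segment
        overlaps = [max(0.0, min(s_end, e) - max(s_start, b)) for b, e in segments]
        best = max(range(len(segments)), key=overlaps.__getitem__)
        if overlaps[best] > 0:
            # The full sentence goes to exactly one segment, so it is never split
            buckets[best].append(text)
    return [" ".join(b) for b in buckets]
```

With segments (0, 8.3) and (8.3, 19.4), a sentence spoken from 6.5s to 11.5s overlaps the second segment for longer, so the whole sentence lands in the second segment's transcript.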

Figure 4: Visual, audio, and speech embedding generation per video segment

Together, these three embeddings cover the full signal space of a video segment. The visual embedding captures what the camera sees: objects, scenes, actions, colors, and spatial composition. The audio embedding captures what the microphone hears: music, sound effects, ambient noise, and the acoustic texture of a scene. The transcript embedding captures what people say, representing the semantic meaning of spoken dialogue and narration. Collapsing all three signals into a single combined embedding compresses distinct modalities into one vector. This blurs the boundaries between what is seen, heard, and spoken, and loses the fine-grained detail that makes each signal useful on its own. Keeping them separate gives you precise control to dial each modality up or down based on query intent, allowing the search pipeline to match against the modality most likely to contain the answer.

Combine metadata and embeddings for hybrid search

Even with three independent embeddings covering visual, audio, and spoken content, there is still a class of queries the system cannot answer well. Embeddings are designed to capture semantic similarity. They excel at finding a “tense crowd moment” or a “sun setting over water” because these are concepts with rich visual and audio meaning. But when a user searches for a specific name, product model number, geolocation, or a particular date, embeddings will likely fail. These are discrete entities that carry little semantic signal on their own. This is where hybrid search comes in. Rather than relying on embeddings alone, the system runs two parallel retrieval paths as shown in the following figure: a semantic path that matches against your visual, audio, and transcript embeddings to capture conceptual similarity, and a lexical path that performs exact keyword and entity matching against structured metadata.

Figure 5: Hybrid search pipeline combining semantic and lexical retrieval

How much metadata do you need? The answer depends on your content type, organization, and use case, and capturing everything upfront is impractical. For illustration purposes, we selected a few categories of metadata to represent common types of metadata in media and entertainment content.

First, we selected video title and datetime to represent technical metadata extracted directly from the content catalog or file metadata. Then we added segment captions, genre, and celebrity recognition to represent contextual metadata, generated using Amazon Nova 2 Lite and Amazon Rekognition. Captions are generated from the video and transcript of each segment, giving the model both visual and spoken context. Genre is predicted from the full video transcript across all segments, which is cheaper and more reliable than re-sending all video clips. Celebrity identification is handled by Amazon Rekognition, which recognizes known public figures appearing on screen without requiring custom training.

The following are example prompts used for caption generation and genre classification:

# Caption generation
Describe this video clip in 3-5 sentences. Include:
- What is happening, who is visible, actions, setting, and atmosphere
- Any text on screen: titles, subtitles, signs, logos, watermarks, or credits
- If the screen is mostly black or blank, state "Black frame" or "Blank screen"
Transcription: {segment_transcript}
Return ONLY the descriptive caption, nothing else.

# Genre classification
Based on all the video segments described below, classify the overall video
into exactly ONE genre from this list: Sports, News, Entertainment,
Documentary, Education, Music, Gaming, Cooking, Travel, Technology,
Business, Lifestyle, Sci-Fi, Thriller, Other

Segment descriptions:
{all_captions}

Return ONLY the genre name, nothing else.

The concept extends naturally to other metadata types. Technical metadata may include resolution or file size, while contextual metadata might include location, mood, or brand. The right balance depends on your search use case. Additionally, overlaying metadata filters during retrieval can further improve search scalability and accuracy by narrowing the search space before semantic matching.

Optimize search relevance with intent-aware query routing

Now you have three embeddings and metadata: four searchable dimensions. But how do you know when to use which for a given query? Intent is everything. To solve this, we built an intelligent intent analysis router that uses the Haiku model to analyze each incoming query and assign a weight to each modality channel: visual, audio, transcript, and metadata. See the example search query in the following figure.

“Kevin taking a phone call next to a vintage car”

Figure 6: Example query with weights intelligently assigned based on search intent

The Haiku model is prompted to return a JSON object with weights that sum to 1.0, along with a brief reasoning trace explaining the assignment. See the following prompt:

Analyze this video search query and assign weights (0.0–1.0) for 4 modalities.
Weights must sum to 1.0.

Return ONLY valid JSON in this exact format:
{"visual": 0.0, "audio": 0.0, "transcription": 0.0, "metadata": 0.0, "reasoning": "..."}

Guidelines:
- visual: appearance, colors, objects, actions, scenes
- audio: sounds, music, noise, non-speech audio
- transcription: spoken words, dialogue, narration
- metadata: person names, genre, captions, factual attributes

Examples:
- "red car driving"      → visual=0.9, metadata=0.1
- "person saying hello"  → transcription=0.5, visual=0.2, audio=0.2, metadata=0.1
- "dog barking loudly"   → audio=0.6, visual=0.3, metadata=0.1
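A language model does not always return weights that sum exactly to 1.0, so it can help to parse the reply defensively. The following is a minimal sketch of that idea; the helper name parse_weights, the key names (taken to be visual, audio, transcription, and metadata), and the equal-weight fallback are assumptions for illustration, not part of the reference implementation.

```python
import json

# Assumed channel names for the four modalities in the router's JSON reply
MODALITIES = ("visual", "audio", "transcription", "metadata")

def parse_weights(raw):
    """Parse the router's JSON reply and renormalize the weights to sum to 1.0."""
    data = json.loads(raw)
    # Missing keys count as 0.0 and negative values are clamped to 0.0
    weights = {m: max(0.0, float(data.get(m, 0.0))) for m in MODALITIES}
    total = sum(weights.values())
    if total == 0:
        # Assumed fallback: equal weights if the reply contains nothing usable
        return {m: 1.0 / len(MODALITIES) for m in MODALITIES}
    return {m: w / total for m, w in weights.items()}
```

The reasoning field is ignored here; in practice it is useful to log it alongside the query for debugging.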

The weights directly control which sub-queries execute. Any modality below a 5% weight threshold is skipped entirely, eliminating unnecessary embedding API calls and reducing search latency without sacrificing accuracy. The remaining channels execute in parallel, each searching its own index independently. Results from all active channels are then scored using a weighted arithmetic mean. BM25 scores (a lexical relevance measure based on term frequency and document length) and cosine similarity scores (a geometric measure of how closely two embedding vectors point in the same direction) live on very different scales. To handle this, each sub-query’s scores are first normalized to a 0-1 range, then combined using the router’s intent weights:

final_score = w₁ × norm_bm25 + w₂ × norm_visual + w₃ × norm_audio + w₄ × norm_transcription

We chose the weighted arithmetic mean as our reranking technique because it directly incorporates query intent through the router’s weights. Unlike Reciprocal Rank Fusion (RRF), which treats all active channels equally regardless of intent, the weighted mean amplifies channels the router deems most relevant for a given query. In our testing, this produced more accurate results for our search tasks.
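The scoring step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the reference implementation: the function names min_max and fuse are hypothetical, min-max scaling is one common way to map scores to a 0-1 range (the post does not say which normalization is used), and dividing by the sum of the active weights is an added choice so the result stays in the 0-1 range when a channel is skipped.

```python
def min_max(scores):
    """Scale a {segment_id: score} mapping to the 0-1 range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores equal: treat every hit as a full match
        return {k: 1.0 for k in scores}
    return {k: (v - lo) / (hi - lo) for k, v in scores.items()}

def fuse(channel_scores, weights, threshold=0.05):
    """Combine normalized per-channel scores using the router's intent weights."""
    # Channels below the weight threshold are skipped entirely
    active = {c: w for c, w in weights.items() if w >= threshold and c in channel_scores}
    total_w = sum(active.values())
    final = {}
    for channel, w in active.items():
        for seg_id, s in min_max(channel_scores[channel]).items():
            final[seg_id] = final.get(seg_id, 0.0) + (w / total_w) * s
    # Highest combined score first
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)
```

Here channel_scores maps each channel name to the raw scores returned by its sub-query (BM25 for metadata, cosine similarity for the embedding channels), and weights is the router output. A segment that is missing from a channel simply receives no contribution from it.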

Choose the right storage strategy for vectors and metadata

The final design decision is where and how to store it all. Each video segment produces up to three embeddings and a set of metadata fields, and how you store them determines both your search performance and your cost at scale. We split this across two services with complementary roles: Amazon S3 Vectors for vector storage, and Amazon OpenSearch Service for hybrid search.

S3 Vectors stores three vector indices per project, one for each embedding type:

nova-visual-{project_id} # visual embeddings
nova-audio-{project_id} # audio embeddings
nova-transcription-{project_id} # transcript embeddings

OpenSearch holds one index per project, where each document represents a single video segment containing both text fields for BM25 search and vector fields for k-nearest neighbors (kNN) search:

{
  "_id": "f953ceba_seg_0012",
  "start_sec": 118.45,
  "end_sec":   128.72,
  "caption":   "A presenter walks through a rice paddy in rural Jakarta, discussing how rice cultivation has shaped local civilization for thousands of years.",
  "people":    ["presenter_name"],
  "genre":     "Documentary",
  "visual_vector":        [0.023, -0.118, 0.045, ...],
  "audio_vector":         [0.045,  0.091, -0.033, ...],
  "transcription_vector": [-0.067, 0.134,  0.012, ...]
}
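An index that holds documents of this shape needs text fields for BM25 and knn_vector fields for kNN. The sketch below builds such an index body as a Python dictionary; the function name build_index_body is hypothetical, the dimension of 1024 comes from the embedding size mentioned earlier, and the remaining metadata fields and any engine or HNSW settings are left out because the post does not specify them.

```python
EMBEDDING_DIM = 1024  # matches the 1024-dimensional vectors described in the ingestion steps

def build_index_body(dim=EMBEDDING_DIM):
    """Return an OpenSearch index body with a BM25 text field and three kNN vector fields."""
    def vector_field():
        return {"type": "knn_vector", "dimension": dim}
    return {
        # Enables k-NN search on this index
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "start_sec": {"type": "float"},
                "end_sec": {"type": "float"},
                "caption": {"type": "text"},  # analyzed text, scored with BM25
                "visual_vector": vector_field(),
                "audio_vector": vector_field(),
                "transcription_vector": vector_field(),
            }
        },
    }
```

The resulting dictionary can be passed as the request body when creating the per-project index with your OpenSearch client of choice.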

We chose S3 Vectors for its cost-to-performance benefits. Amazon S3 Vectors reduces the cost of storing and querying vectors by up to 90% compared to other specialized solutions. If search latency isn’t critical for your use case, S3 Vectors is a strong default choice. If you need the lowest possible latency, we recommend keeping vectors in memory with the OpenSearch Hierarchical Navigable Small World (HNSW) engine.

Finally, it’s worth calling out that some use cases require searching within longer, semantically dense video segments such as a full interview, a multi-minute documentary scene, or an extended product demonstration. Most multimodal embedding models, including Nova Multimodal Embeddings, have a maximum input duration of 30 seconds, which means a 3-minute clip can’t be embedded as a single unit. Attempting to do so would either fail or force chunking that loses the broader context.

The nested vector support in OpenSearch addresses this by allowing a single document to contain multiple sub-segment embeddings:

{
  "_id": "f953ceba_scene_003",
  "start_sec": 118.45,
  "end_sec":   298.10,
  "sub_segments": [
    { "start_sec": 118.45, "end_sec": 128.72, "visual_vector": [...] },
    { "start_sec": 128.72, "end_sec": 139.10, "visual_vector": [...] },
    { "start_sec": 139.10, "end_sec": 150.30, "visual_vector": [...] }
  ]
}

At query time, OpenSearch scores the document based on the best-matching sub-segment rather than a single averaged representation, so a long scene can match a specific visual moment within it while still being returned as one coherent result.
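The difference between best-match scoring and an averaged representation can be illustrated with a toy example in plain Python. This does not call OpenSearch; the helper names cosine and score_scene are hypothetical and the two-dimensional vectors are made up for the demonstration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_scene(query_vec, sub_segments):
    """Return (best_score, best_sub_segment), taking the max over sub-segments."""
    scored = [(cosine(query_vec, s["visual_vector"]), s) for s in sub_segments]
    return max(scored, key=lambda pair: pair[0])

# Toy scene with three sub-segments (made-up 2-dimensional vectors)
scene = [
    {"start_sec": 118.45, "end_sec": 128.72, "visual_vector": [1.0, 0.0]},
    {"start_sec": 128.72, "end_sec": 139.10, "visual_vector": [0.0, 1.0]},
    {"start_sec": 139.10, "end_sec": 150.30, "visual_vector": [0.6, 0.8]},
]
best_score, best_sub = score_scene([0.0, 1.0], scene)
```

In this toy case the query matches the second sub-segment exactly, whereas a single vector averaged over all three sub-segments would score noticeably lower against the same query. The start and end times of the best sub-segment also tell you where in the scene the match occurs.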

Performance results: How the optimized approach outperforms the baseline

To validate our design decisions, we benchmarked the optimized hybrid search against a Nova Multimodal Embeddings baseline in AUDIO_VIDEO_COMBINED mode using 10 internal long-form videos (5-20 minutes), evaluated across 20 queries spanning visual, audio, transcript, and metadata-focused searches. The baseline uses a single unified vector per 10-second segment with one index and one kNN query. Our optimized approach generates separate visual, audio, and transcript embeddings, enriches segments with structured metadata, and applies intent-aware routing that dynamically weights modality channels. The following figure shows results across four standard retrieval metrics:

Figure 7: Performance Comparison Across Retrieval Metrics for Hybrid Search with Nova MME vs. Baseline

The following table captures key metrics:

Approach                                         Recall@5   Recall@10   MRR   NDCG@10
Hybrid search with Nova Multimodal Embeddings    90%        95%         90%   88%
Baseline                                         51%        64%         48%   54%

Key metrics explained:

Recall@5: Of all relevant segments, what fraction appears in the top 5 results? This indicates the coverage of the search results.
Recall@10: Of all relevant segments, what fraction appears in the top 10 results? This indicates the coverage of the search results.
MRR (Mean Reciprocal Rank): 1/rank of the first relevant result, averaged across queries. This measures how quickly you find something relevant.
NDCG@10: Normalized Discounted Cumulative Gain rewards relevant results ranked higher and penalizes those ranked lower. This is a standard ranking quality metric.
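For readers who want to reproduce these metrics on their own query sets, the definitions above translate into a few short functions. This is a minimal sketch that assumes binary relevance (a segment is either relevant or not); the function names are hypothetical and the evaluation code used for the benchmark is not shown in this post.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of relevant items that appear in the top k results."""
    rel = set(relevant)
    if not rel:
        return 0.0
    return len(set(ranked[:k]) & rel) / len(rel)

def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant result, or 0.0 if none is retrieved."""
    rel = set(relevant)
    for rank, item in enumerate(ranked, start=1):
        if item in rel:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary relevance: DCG of the ranking divided by the ideal DCG."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 1)
              for i, item in enumerate(ranked[:k], start=1) if item in rel)
    ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(rel), k) + 1))
    return dcg / ideal if ideal else 0.0

def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked, relevant) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs) if runs else 0.0
```

Each function takes the ranked list of segment IDs returned for a query and the set of segment IDs judged relevant for that query; averaging the per-query values over all queries gives the figures reported in a table like the one above.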

The results show substantial improvements across all metrics. The optimized hybrid search achieved 90% Recall@5 and 95% Recall@10 versus 51% and 64% for the baseline (a lift of 39 and 31 percentage points in coverage, respectively). MRR jumped from 48% to 90%, and NDCG@10 rose from 54% to 88%. These gains of roughly 30 to 40 percentage points validate our core architectural decisions: semantic segmentation preserves content continuity, separate embeddings provide precise search control, metadata enrichment captures factual entities, and intent-aware routing makes sure the right signals drive each query. By treating each modality independently while intelligently combining them based on query intent, the system adapts to diverse search patterns and delivers consistently relevant results as your video archive scales.

Clean up

To avoid incurring future charges, delete the resources used in this solution by removing the AWS CloudFormation stack. For detailed instructions, refer to the GitHub repository.

Conclusion

In this post, we showed how to build a video semantic search solution on AWS using Nova Multimodal Embeddings, covering four key design decisions: segmentation for semantic continuity, multimodal embeddings that capture visual, audio, and speech signals independently, metadata that fills the precision gap for entity-specific queries, and a data structure that organizes everything for efficient retrieval at scale. Together with an intelligent intent analysis router and weighted reranking, these decisions transform a fragmented set of signals into a unified, accurate search experience that understands video. Additional optimizations can be applied to further tune search accuracy, including model customization for the intent routing layer. Read Part 2 to go deeper on these techniques. For a production-ready implementation of this video search and metadata management technique at scale, see the Guidance for a Media Lake on AWS.