
The idea behind Deep Search
Deep Search takes a natural language request and returns matching clips. The difference is how it gets there. Instead of running a single search and hoping it lands, we convert the request into a structured Plan. We run multiple targeted searches across different indexes, combine the results, and verify whether the clips actually match the intent. If the clips look good, we show them. If the clips look wrong, we do not ask you to start over. We revise the Plan and try again. Sometimes that happens automatically. Sometimes we ask a short question when a missing detail blocks the search. Once results are shown, you can steer with follow-ups like:
- keep the same character, but switch to outdoor scenes
- more like clip 2
- show only the parts where they are actually speaking
- find the version where the camera stays wide
Indexing: how we make clips searchable

Step 1: turn a video into scenes
We ingest the video and audio into VideoDB, then run its scene detection. This gives us consistent clip boundaries for everything that follows.

Step 2: generate base signals per clip
For each clip, we generate foundational signals:
- Transcript generation using Gemini 2.5 Flash (128 Thinking Budget)
- Object detection using RT-DETR-V2
- Face detection using YOLO Face
These base signals serve two purposes:
- They are directly searchable in some cases.
- They provide structured inputs for higher-level semantic extraction.
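To make the two roles concrete, here is a minimal sketch of what a per-clip base-signal record might look like. The field names and values are illustrative, not the actual VideoDB schema:

```python
# Hypothetical shape of the Step 2 base signals for one clip.
# Field names are illustrative, not the actual VideoDB schema.
clip_signals = {
    "clip_id": "clip_0042",
    "start": 118.4,   # seconds
    "end": 124.9,
    "transcript": "Where is the elevator?",                        # Gemini 2.5 Flash
    "objects": ["suitcase", "elevator door"],                      # RT-DETR-V2
    "faces": [{"track_id": 7, "bbox": [0.41, 0.18, 0.55, 0.36]}],  # YOLO Face
}

def is_directly_searchable(signals: dict) -> bool:
    # Base signals can be queried directly when they carry text or labels;
    # either way, they feed the Step 3 semantic extraction as structured input.
    return bool(signals.get("transcript") or signals.get("objects"))
```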
Step 3: extract structured meaning with a VLM
For each clip, we run OpenAI o3. The model fuses:
- clip frames (in order)
- the full transcript
- detected objects and faces
| Extraction entity | What it captures |
|---|---|
| location | Setting and environment. Interior or exterior, style, time of day, weather cues, and scene scale. |
| action | Dominant actions and interactions. Key verbs, motion, and actor object interactions. |
| scene description | Broad visual description. Costumes, colors, ambience, staging, plus on-screen text when present. |
| character description | Appearance and identity traits. Age cues, clothing, accessories, distinguishing features, and body language. |
| shot type | Dominant camera framing over the clip. For example wide, close-up, establishing. |
| emotion | Primary emotion signal for the clip with confidence and evidence source. |
| topic | What is being discussed or sung about, not the exact words. |
| transcript | The exact spoken words in the clip. |
| object description | Main objects and their attributes. Condition, color, distinctive markings, and relevance in the scene. |
| sound effects | Short, nameable audio events when present. For example gunshots, footsteps, sirens. |
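The table above can be read as a schema. A sketch of that schema as a Python dataclass, with example values for the running hotel-corridor query, might look like this (the real extraction JSON is internal to Deep Search; field names here simply mirror the table):

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative container for the Step 3 extraction output.
# Fields mirror the extraction-entity table; values are made up.
@dataclass
class ClipExtraction:
    location: str                 # setting and environment
    action: str                   # dominant actions and interactions
    scene_description: str        # broad visual description
    character_description: str    # appearance and identity traits
    shot_type: str                # e.g. "wide", "close-up", "establishing"
    emotion: str                  # primary emotion signal
    topic: str                    # what is discussed, not the exact words
    transcript: str               # the exact spoken words
    object_description: str       # main objects and their attributes
    sound_effects: List[str] = field(default_factory=list)  # nameable audio events

clip = ClipExtraction(
    location="interior, hotel corridor, night",
    action="man walking while talking on the phone",
    scene_description="dim corridor, warm sconce lighting, patterned carpet",
    character_description="male, dark suit, short hair, phone held to ear",
    shot_type="wide",
    emotion="tense",
    topic="arranging a meeting",
    transcript="I'll be there in five minutes.",
    object_description="mobile phone, numbered room doors, wall sconces",
    sound_effects=["footsteps"],
)
```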
Step 4: video level structure
Once we have validated clip JSON objects across the timeline, we generate higher-level structure:
- subplot summaries that break the video into contiguous story segments
- a final summary that describes the full arc
Step 5: build separate semantic indexes
Finally, we create semantic indexes per field. Deep Search can then choose the best index for a given intent, instead of forcing everything through one embedding space. With these indexes in place, a user request is no longer a single vector search. Deep Search interprets the request, picks the right mix of indexes and filters, and turns it into an executable plan. The next step is how that plan gets executed, how results from multiple indexes are combined, and how the system decides what to show.

Orchestrating Retrieval in Deep Search
Deep Search is not a single function call. It is a stateful orchestrator that runs a controlled loop. It continues execution until one of two conditions is met:
- It has clips that are good enough to show
- It needs one missing piece of input to continue
The graph structure
Deep Search consists of the following nodes:
- PlanInit: convert user intent into a structured Plan
- SearchJoin: execute the Plan and combine subquery results
- Validator: evaluate whether candidate clips satisfy intent
- NoneAnalyzer: handle empty results by broadening or clarifying
- Interpreter: convert feedback or follow-ups into controlled Plan edits
- Rerank: reorder accepted clips for display
- PreviewPage and ClarifyPause: the only two pause states, used either to show the ranked clips or to ask a clarification question before continuing
Every run enters the graph in one of two ways:
- a fresh request (start at PlanInit), or
- a resumed session (start at Interpreter with prior state)
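The node list and the two entry points can be sketched as a small routing function. This is a simplification for illustration, not the actual orchestrator code; the transition conditions (`candidates`, `accepted`, `needs_input`) are assumed state flags:

```python
# Minimal sketch of the Deep Search graph as a routing function.
# Node names come from the list above; transition logic is simplified.
def next_node(current: str, state: dict) -> str:
    if current == "PlanInit":
        return "SearchJoin"
    if current == "SearchJoin":
        # empty join -> NoneAnalyzer; otherwise verify the candidates
        return "Validator" if state.get("candidates") else "NoneAnalyzer"
    if current == "Validator":
        # all rejected -> Interpreter turns the feedback into Plan edits
        return "Rerank" if state.get("accepted") else "Interpreter"
    if current in ("NoneAnalyzer", "Interpreter"):
        # either revise the Plan and retry, or pause to ask the user
        return "ClarifyPause" if state.get("needs_input") else "SearchJoin"
    if current == "Rerank":
        return "PreviewPage"
    raise ValueError(f"unknown node: {current}")

def entry_node(resumed: bool) -> str:
    # fresh requests start at PlanInit; resumed sessions at Interpreter
    return "Interpreter" if resumed else "PlanInit"
```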
The outer loop
The outer loop governs when execution pauses for the user. There are two pause states:
- PreviewPage: show ranked clips
- ClarifyPause: ask a short clarification question
At either pause:
- The current graph state is persisted
- The system waits for user input
- Execution resumes from the saved state
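The persist-wait-resume cycle can be sketched with an in-memory session store. This is a hypothetical simplification (the real persistence layer is not described here); the point is that a pause saves the node and state, and a resume hands the user's input back to the graph:

```python
# Hypothetical pause/resume persistence, using an in-memory store.
SESSIONS = {}

def pause(session_id: str, node: str, state: dict) -> None:
    # Persist the current graph position and state at a pause point.
    SESSIONS[session_id] = {"node": node, "state": state}

def resume(session_id: str, user_input: str):
    # Restore the saved state and attach the new user input so the
    # resumed session can start at Interpreter with prior context.
    saved = SESSIONS[session_id]
    saved["state"]["user_input"] = user_input
    return saved["node"], saved["state"]
```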
The inner loop (internal retries)
When results are missing or rejected, the system retries internally without involving the user:
- Execute the current Plan in SearchJoin
- If candidates exist, send them to Validator
- If the join is empty or Validator rejects the candidates, revise the Plan
- Retry execution
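The four steps above can be sketched as a bounded retry loop. `search_join`, `validate`, and `revise_plan` are hypothetical helpers standing in for the nodes described in this post; the 12-step cap matches the recursion limit below:

```python
# Sketch of the inner retry loop. search_join, validate, and revise_plan
# are hypothetical stand-ins for the SearchJoin, Validator, and
# NoneAnalyzer/Interpreter nodes.
MAX_STEPS = 12  # hard cap on inner-loop iterations

def run_inner_loop(plan, search_join, validate, revise_plan):
    for _ in range(MAX_STEPS):
        candidates = search_join(plan)            # execute the current Plan
        if candidates:
            accepted, feedback = validate(candidates, plan)
            if accepted:
                return accepted                   # good enough to show
            plan = revise_plan(plan, feedback)    # Validator rejected all
        else:
            plan = revise_plan(plan, None)        # empty join: broaden
    raise TimeoutError("inner-loop recursion limit reached")
```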
Recursion limit
We cap the inner loop at 12 steps. This limit exists because latency increases with each retry and the cost compounds. If the cap is reached, we stop and surface a timeout instead of looping indefinitely.

Let’s understand the internals using one example query and follow how it moves through these nodes. Query: “Find clips where Tom Cruise is walking through a hotel corridor while talking on the phone.”
PlanInit: Turning a request into a Plan
The first thing Deep Search does is PlanInit. This node converts the request into something executable. The output is a Plan object that answers three questions:
- What to search for (expressed as a small set of subqueries)
- Where to search (meaning which indexes each subquery should target)
- How strict to be (meaning filters and how results should be combined)
Concretely, the Plan has four parts:
- Subqueries: Each subquery has an id, a query string, and a list of indexes to search
- JoinPlan: How to combine subquery result sets, AND for intersection or OR for union
- Metadata Filters: Faceted filters applied to every search call, like actors, characters, shot_type, emotion, objects
- Fallback Order: The order to relax constraints when results are poor
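For the running example query, a Plan with those four parts might look like the following. The exact schema is internal; field names and index names here are illustrative:

```python
# Hypothetical Plan for: "Find clips where Tom Cruise is walking through
# a hotel corridor while talking on the phone." Field names illustrative.
plan = {
    "subqueries": [
        {"id": "sq1",
         "query": "walking through a hotel corridor",
         "indexes": ["location", "action"]},
        {"id": "sq2",
         "query": "talking on the phone",
         "indexes": ["action", "object description"]},
    ],
    "join_plan": "AND",                           # both subqueries must match
    "metadata_filters": {"actors": ["Tom Cruise"]},
    "fallback_order": ["shot type", "emotion"],   # relax these first if results are poor
}
```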
How fallback is decided
Fallback is not a fixed priority list. Instead, we treat the Plan as a hierarchy of constraints. Some constraints define the core moment (for example, location + action). Others refine or enrich it (for example, shot style or emotional tone). When we broaden, we relax constraints that are most likely to be over-restrictive in context, based on:
- which subqueries returned zero hits
- which constraints caused empty intersections
- Validator feedback
- session history
SearchJoin: Executing the Plan and combining results
Once PlanInit produces a Plan, the next node is SearchJoin. This is where the plan turns into clip candidates. SearchJoin does two things:
- execute each subquery against the right indexes
- combine the result sets using the JoinPlan
Step 1: run subqueries in parallel
Each subquery targets one or more indexes. SearchJoin runs them independently. Before it queries an index, it generates a few alternative phrasings of the subquery. This improves recall because the wording that matches the indexed text is not always the wording the user typed. By default, we generate a small number of variants per subquery. If the subquery targets dialogue indexes like transcript or topic, we also keep the original phrasing as an extra variant, since exact wording often matters for dialogue. For each variant, we call the VideoDB search API with the plan’s metadata_filters applied. That means the actor filter from the example is active for every call. Each subquery produces a set of clip hits with scores. If the same clip appears across multiple variants for the same subquery, we fuse those hits into one clip entry and keep the best score. At the end of this step, we have one result set per subquery.

Step 2: boolean join across subqueries
Now SearchJoin applies the join_plan. If the join_plan uses AND, we take the intersection: a clip survives only if it appears in every subquery result set. If the join_plan uses OR, we take the union: a clip survives if it appears in any subquery result set. The join step produces JoinedShot objects. Each JoinedShot keeps:
- the clip boundary (video_id, start, end)
- which subquery contributed the highest score for that clip, called the primary subquery
- which other subqueries also matched that clip, called support subqueries
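Both SearchJoin steps can be sketched in a few lines. The data shapes here are illustrative stand-ins for the real hit and JoinedShot objects: `fuse_variant_hits` merges hits across phrasing variants of one subquery (keeping the best score), and `boolean_join` applies the join_plan and labels each clip's primary and support subqueries:

```python
# Sketch of SearchJoin's two steps, with illustrative data shapes.
def fuse_variant_hits(variant_hits):
    """variant_hits: one list of (clip_id, score) hits per phrasing variant.
    Returns {clip_id: best_score} for this subquery."""
    fused = {}
    for hits in variant_hits:
        for clip_id, score in hits:
            if score > fused.get(clip_id, float("-inf")):
                fused[clip_id] = score  # same clip across variants: keep best score
    return fused

def boolean_join(result_sets, mode="AND"):
    """result_sets: {subquery_id: {clip_id: score}}. Returns joined shots
    with the primary (highest-scoring) and support subqueries per clip."""
    all_clips = set().union(*result_sets.values())
    joined = []
    for clip_id in sorted(all_clips):
        matching = {sq: scores[clip_id]
                    for sq, scores in result_sets.items() if clip_id in scores}
        if mode == "AND" and len(matching) < len(result_sets):
            continue  # AND: the clip must appear in every subquery result set
        primary = max(matching, key=matching.get)
        joined.append({"clip_id": clip_id, "primary": primary,
                       "support": sorted(sq for sq in matching if sq != primary)})
    return joined
```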
Where we are in the flow
At this point, PlanInit has produced a Plan and SearchJoin has executed it. From here the graph splits based on whether we got any candidate clips.
Validator: Verifying candidate clips
Retrieval similarity alone is not enough. A clip can score well and still miss the intent, for example when:
- the action matches but the location does not
- the character appears but is not performing the requested action
- dialogue contains similar wording but refers to something else
What Validator does
For each candidate clip, Validator asks: Does this clip satisfy the user’s intent under the current Plan? Validator is LLM-based. For each batch (up to 8 clips), we provide:
- the original user query
- the current Plan snapshot (subqueries, joins, filters)
- structured clip-level signals (location, action, transcript snippet, shot type, etc.)
- relevant session history
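A minimal sketch of how that evidence might be bundled into batches. The payload shape is illustrative, not the actual prompt format; only the batch size of 8 comes from the description above:

```python
# Hypothetical batching of Validator evidence. Payload keys are
# illustrative; only BATCH_SIZE comes from the system description.
BATCH_SIZE = 8  # Validator evaluates up to 8 clips per LLM call

def build_validator_batches(query, plan, clips, history):
    """Yield one evidence bundle per batch of candidate clips."""
    for i in range(0, len(clips), BATCH_SIZE):
        yield {
            "query": query,                    # original user query
            "plan": plan,                      # current Plan snapshot
            "clips": clips[i:i + BATCH_SIZE],  # structured clip-level signals
            "history": history,                # relevant session context
        }
```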
Evidence grounding
Validator does not reprocess video frames. It operates only on previously extracted structured signals and transcript snippets. This constrains decisions to known evidence and reduces hallucinated matches.

Verdict structure
Each clip receives one of three labels:
- Pass: strong alignment with the full intent
- Ambiguous: partial alignment or missing evidence
- Fail: contradiction or mismatch
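In routing terms, Pass and Ambiguous clips are usable while an all-Fail batch triggers recovery. A sketch of that decision, with hypothetical route names:

```python
# Sketch of post-validation routing. Route names are illustrative.
def route_after_validation(verdicts):
    """verdicts: {clip_id: 'pass' | 'ambiguous' | 'fail'}.
    Pass/Ambiguous clips proceed to reranking; an all-Fail batch
    means the Plan must be revised before retrying."""
    accepted = [c for c, v in verdicts.items() if v in ("pass", "ambiguous")]
    if accepted:
        return ("rerank", accepted)
    return ("revise_plan", [])
```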
When all candidates fail
If every candidate is labeled Fail, Validator produces a structured feedback object describing the mismatch pattern, for example:
- “location constraint satisfied but action missing”
- “dialogue matched but no speaking action detected”
- “actor filter overly restrictive”
NoneAnalyzer: When the join returns nothing
NoneAnalyzer runs only in one situation: SearchJoin produced zero candidate clips. This usually means the Plan is too strict somewhere. The join may be intersecting signals that rarely co-occur. A filter may be narrowing too hard. A subquery may be phrased in a way that does not match the collection’s vocabulary. NoneAnalyzer looks at the current query, the current Plan, and what has already been tried in the session, then chooses one of two outcomes:
- revise the Plan to broaden recall, then retry SearchJoin
- pause and ask a short clarification question, because a missing detail is blocking the search
Broadening can take several forms:
- relaxing low-priority constraints first, like objects or emotion
- weakening the join strategy, for example switching part of an AND into an OR (if it makes sense)
- rewording a subquery to be less specific
- adding an extra index to a subquery to give it another source of evidence
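A sketch of one broadening step, covering the first two tactics above (relaxing low-priority filters, then weakening the join). It assumes the Plan shape used earlier in this post; the real NoneAnalyzer also reasons over session history and can reword subqueries or add indexes:

```python
# Sketch of one broadening step over the Plan shape used earlier.
# Covers two of the tactics: relax a low-priority filter, then
# weaken AND to OR as a last resort.
def broaden_plan(plan):
    plan = dict(plan)  # shallow copy is enough for this sketch
    filters = dict(plan.get("metadata_filters", {}))
    for key in plan.get("fallback_order", []):
        if key in filters:
            del filters[key]  # relax the lowest-priority filter first
            plan["metadata_filters"] = filters
            return plan
    if plan.get("join_plan") == "AND":
        plan["join_plan"] = "OR"  # weaken the join when no filters remain to relax
    return plan
```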
Where we are in the flow
At this point we have covered how the plan is built, how it is executed, and how we branch based on results. Next is the Interpreter, the junction that converts feedback into the next attempt.
Interpreter: Turning feedback into plan edits
The Interpreter is the node that makes the loop move. It runs in two situations:
- after a pause, when the user sends a follow-up or answers a clarification question
- after Validator rejects all candidates and returns feedback
What the interpreter reads
The Interpreter looks at the full context it needs to make a good decision:
- the original query
- the current Plan
- the most recent results shown to the user, if any
- the user input, if we are resuming after a pause
- Validator feedback, if we are in the all-rejected path
- the accumulated history of plan changes and question-and-answer exchanges in the session
What the interpreter outputs
The Interpreter produces one of these outcomes:
- a batch of plan edits
- a clarification question, if it still needs a missing detail
How the loop reconnects to SearchJoin
Any time the system decides the Plan needs to change, it routes back to SearchJoin. That includes:
- NoneAnalyzer broadening the Plan after an empty join
- Interpreter applying user follow-ups after a pause
- Interpreter applying Validator feedback when all candidates are rejected
What happens when Validator accepts
When Validator returns at least one candidate as Pass or Ambiguous, the system has something usable. At that point the flow stops being about recovery and becomes about presentation. The next node is Rerank. Rerank takes the accepted candidates and reorders them into a final ranked list for display. The input to Rerank is not only retrieval scores. It includes the original query, the current Plan, and the session history, so reranking can reflect intent and preferences, not just similarity. Rerank returns a permutation of clip ids. After Rerank, the graph pauses at PreviewPage and returns the ranked clips.

The Entire Flow
