Multimodal Search: Quickstart

Introduction

Let’s first look at the example query that we want to unlock in our video library.
📸🗣️ Show me where the narrator discusses the formation of the solar system and visualize the Milky Way galaxy

Implementing this multimodal search query involves the following steps with VideoDB:
🎬 Upload and Index the Video:
Upload the video and get the video object.
Use the index_scenes function to index the visual content of the video.
Use the index_spoken_words function to index the narration and enable search over the spoken content.
🧩 Query Transformation: Divide query into two parts that can be used with respective scene and spoken indexes.
🔎 Perform Search: Using the queries search relevant segments in the indexes.
🔀 Combine Search Results of Both Modalities: Integrating the results from both indexes for precise video segment identification.
🪄 Stream the Footage: Generate and play a video stream from the combined segments.

Setup

📦 Installing packages

%pip install openai
%pip install videodb

🔑 API Keys

Before proceeding, ensure you have access to an OpenAI API key and a VideoDB API key. If not, sign up for API access on the respective platforms.
Get your VideoDB API key from the VideoDB console. (Free for the first 50 uploads, no credit card required) 🎉
import os

os.environ["OPENAI_API_KEY"] = ""
os.environ["VIDEO_DB_API_KEY"] = ""


📋 Step 0: Connect to VideoDB

Gear up by establishing a connection to VideoDB:
from videodb import connect

conn = connect()
coll = conn.get_collection()
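If you'd rather not rely on the environment variable, connect() also accepts the key directly. A minimal alternative:
# Alternative: pass the API key explicitly instead of reading VIDEO_DB_API_KEY
conn = connect(api_key=os.environ["VIDEO_DB_API_KEY"])
coll = conn.get_collection()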

🎬 Step 1: Upload and Index the Video

Let's upload our sample educational video about the solar system:
# Upload a video by URL
video = coll.upload(url="https://www.youtube.com/watch?v=libKVRa01L8")

Now, let's index both the spoken content and scene content:
from videodb import SceneExtractionType

# Index spoken content
video.index_spoken_words()

# Index scene content
index_id = video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 2, "select_frames": ['first', 'last']},
    prompt="Describe the scene in detail",
)
video.get_scene_index(index_id)
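
The scene index itself is just a list of time ranges with the description generated for each one. A quick sanity check (a sketch assuming each entry exposes start, end, and description keys, as the scene index returned by the SDK does):

scene_index = video.get_scene_index(index_id)
# Print the first few indexed scenes to confirm the descriptions look reasonable
for scene in scene_index[:3]:
    print(scene["start"], scene["end"], scene["description"][:80])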


🧩 Step 2: Query Transformation

⚡️ Query transformation or processing is a crucial aspect of enhancing RAG pipelines, especially when dealing with multimodal information. By breaking down queries into their spoken and visual components, you can create more targeted and efficient search capabilities. ⚡️
While manual breakdown is a good starting point, automating this process with LLMs can greatly improve scalability and accuracy, making your systems more powerful and user-friendly.
# Manual query breaking

spoken_query = "Show me where the narrator discusses the formation of the solar system"
visual_query = "Visualize the Milky Way galaxy"

# Using an LLM to transform the query

from openai import OpenAI

transformation_prompt = """
Divide the following query into two distinct parts: one for spoken content and one for visual content. The spoken content should refer to any narration, dialogue, or verbal explanations. The visual content should refer to any images, videos, or graphical representations. Format the response strictly as:\nSpoken: <spoken_query>\nVisual: <visual_query>\n\nQuery: {query}
"""

# Initialize OpenAI client
client = OpenAI()


def divide_query(query):
    # Use the OpenAI client to create a chat completion with a structured prompt
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": transformation_prompt.format(query=query)}
        ],
    )

    message = response.choices[0].message.content
    divided_query = message.strip().split("\n")
    spoken_query = divided_query[0].replace("Spoken:", "").strip()
    visual_query = divided_query[1].replace("Visual:", "").strip()

    return spoken_query, visual_query


# Test the query
query = "Show me the footage where the narrator talks about the terrestrial planets and Mercury, Venus, Earth are visible on the screen"


spoken_query, visual_query = divide_query(query)
print(f"Spoken Query: {spoken_query}")
print(f"Visual Query: {visual_query}")


🔎 Step 3: Perform Searches

Now that we have divided the query, let's perform searches on both the spoken-word and scene indexes:
from videodb import SearchType, IndexType

# Perform the search using the spoken query
spoken_results = video.search(
    query=spoken_query,
    index_type=IndexType.spoken_word,
    search_type=SearchType.semantic,
)

# Perform the search using the visual query, adjusting the default scoring parameters
scene_results = video.search(
    query=visual_query,
    index_type=IndexType.scene,
    search_type=SearchType.semantic,
    score_threshold=0.1,
    dynamic_score_percentage=100,
)

# Optionally, you can play the results to see what was found
spoken_results.play()
scene_results.play()
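
Before combining anything, it can help to look at the raw matches from each modality. A small sketch that lists the matched time ranges using the same get_shots() accessor the combining step below relies on:

# Inspect the matched segments from each modality before combining them
for shot in spoken_results.get_shots():
    print(f"Spoken match: {shot.start:.1f}s - {shot.end:.1f}s")
for shot in scene_results.get_shots():
    print(f"Scene match: {shot.start:.1f}s - {shot.end:.1f}s")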

🔀 Step 4: Combine Search Results of Both Modalities

Each search result provides a list of timestamps relevant to its query in the corresponding modality (spoken and visual, in this case).
There are two ways to combine these search results:
Union: This method takes all the timestamps from both search results, creating a comprehensive list that includes every relevant time range, even if it appears in only one result.
Intersection: This method keeps only the timestamps that appear in both search results, producing a smaller list of time ranges that are relevant to both queries.
Depending on the method you prefer, you can pass the appropriate argument to the combine_results() function below.
def process_shots(l1, l2, operation):
    def merge_intervals(intervals):
        if not intervals:
            return []
        intervals.sort(key=lambda x: x[0])
        # Copy each interval into a list so its end time can be extended in place
        merged = [list(intervals[0])]
        for interval in intervals[1:]:
            if interval[0] <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], interval[1])
            else:
                merged.append(list(interval))
        return merged

    def intersection(intervals1, intervals2):
        i, j = 0, 0
        result = []
        while i < len(intervals1) and j < len(intervals2):
            low = max(intervals1[i][0], intervals2[j][0])
            high = min(intervals1[i][1], intervals2[j][1])
            if low < high:
                result.append([low, high])
            if intervals1[i][1] < intervals2[j][1]:
                i += 1
            else:
                j += 1
        return result

    if operation.lower() == "intersection":
        return intersection(merge_intervals(l1), merge_intervals(l2))
    elif operation.lower() == "union":
        return merge_intervals(l1 + l2)
    else:
        raise ValueError("Invalid operation. Please choose 'intersection' or 'union'.")


def combine_results(spoken_results, scene_results, operation):
    spoken_timestamps = [(shot.start, shot.end) for shot in spoken_results.get_shots()]
    scene_timestamps = [(shot.start, shot.end) for shot in scene_results.get_shots()]
    print("Spoken Results : ", spoken_timestamps)
    print("Scene Results : ", scene_timestamps)
    result = process_shots(spoken_timestamps, scene_timestamps, operation)
    return result

# Get intersection points
results = combine_results(spoken_results, scene_results, "intersection")
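
To see the difference between the two modes on concrete numbers, here is a tiny illustration with made-up intervals (values are hypothetical, purely for demonstration):

# Hypothetical intervals in seconds, purely to illustrate the two modes
spoken = [(10, 30), (50, 60)]
visual = [(20, 40)]
print(process_shots(spoken, visual, "union"))         # [[10, 40], [50, 60]]
print(process_shots(spoken, visual, "intersection"))  # [[20, 30]]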

🪄 Step 5: Stream the Footage

Finally, let's generate a stream of the intersecting segments and watch it:
from videodb import play_stream
print(f"Multimodal Query: {query}")
stream_link = video.generate_stream(results)
play_stream(stream_link)
This would play a video stream containing only the segments where both the spoken content and visual content match our original multimodal query.

Conclusion

Congratulations 🙌🏼 You've successfully implemented a multimodal search workflow. This powerful technique allows for precise identification of video segments that match both spoken and visual criteria, opening up new possibilities for:
Law Enforcement: Quickly retrieve crucial evidence from large volumes of surveillance and news footage.
Media and Journalism: Locate specific segments within hours of broadcast footage, aiding efficient reporting and fact-checking.
Public Safety: Help authorities identify and share relevant content quickly when disseminating important information to the public.
... and much more!

There are other ways to enable multimodal search queries, and we'll be adding a detailed guide for each of them.
Remember, the key to mastering this technique is experimentation. Try different queries, adjust your search parameters, and see how you can fine-tune the results for your specific use case.
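For example, a follow-up run of the same pipeline might tighten the scene search and switch the combination mode; the parameter values below are illustrative, not recommendations:

# Illustrative only: raise the scene score threshold and take the union instead of the intersection
scene_results = video.search(
    query=visual_query,
    index_type=IndexType.scene,
    search_type=SearchType.semantic,
    score_threshold=0.2,
)
results = combine_results(spoken_results, scene_results, "union")
play_stream(video.generate_stream(results))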
To learn more about Scene Index, explore the guides in the Visual Search and Indexing section of these docs.