Multimodal Search: Quickstart

Introduction

Let’s first look at the example query that we want to unlock in our video library.
📸🗣️ Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy

Implementing this multimodal search query involves the following steps with VideoDB:
🎬 Upload and Index the Video:
Upload the video and get the video object.
index_scenes function to index the visual content of the footage, such as shots of the Milky Way galaxy, as scene descriptions.
index_spoken_words function to index the narrator's spoken words to enable search over the transcript.
🧩 Query Transformation: Divide the query into two parts that can be used with the scene and spoken indexes respectively.
🔎 Perform Search: Using the queries, search for relevant segments in each index.
🔀 Combine Search Results of Both Modalities: Integrate the results from both indexes for precise video segment identification.
🎥 Stream the Footage: Generate and play video streams built from the combined segments.

Setup

📦 Installing packages

%pip install openai
%pip install videodb

🔑 API Keys

Before proceeding, ensure you have access to an OpenAI API key and a VideoDB API key. If not, sign up for API access on the respective platforms.
Get your VideoDB API key from the VideoDB console. (Free for the first 50 uploads, no credit card required) 🎉
import os

os.environ["OPENAI_API_KEY"] = ""
os.environ["VIDEO_DB_API_KEY"] = ""


📋 Step 0: Connect to VideoDB

Gear up by establishing a connection to VideoDB
from videodb import connect

conn = connect()
coll = conn.get_collection()
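connect() picks up the VIDEO_DB_API_KEY environment variable set above. If you prefer, the key can also be passed explicitly; a minimal sketch, assuming the SDK's api_key parameter:
# Pass the API key explicitly instead of relying on the environment variable
conn = connect(api_key=os.environ["VIDEO_DB_API_KEY"])
coll = conn.get_collection()  # default collection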

🎬 Step 1: Upload and Index the Video

Let's upload our sample educational video about the solar system:
# Upload a video by URL
video = coll.upload(url="https://www.youtube.com/watch?v=libKVRa01L8")

Now, let's index both the spoken content and scene content:
from videodb import SceneExtractionType

# Index spoken content
video.index_spoken_words()

# Index scene content
index_id = video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 2, "select_frames": ["first", "last"]},
    prompt="Describe the scene in detail",
)
video.get_scene_index(index_id)
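The call above returns the scene index; to sanity-check the generated descriptions, you can capture it and print a few entries. A minimal sketch, assuming each entry is a dict with "start", "end", and "description" keys (the exact field names depend on your SDK version):
# Capture and inspect the first few indexed scenes
scene_index = video.get_scene_index(index_id)
for scene in scene_index[:3]:
    print(f"{scene['start']}s - {scene['end']}s: {scene['description']}")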


🧩 Step 2: Query Transformation

⚡️ Query transformation or processing is a crucial aspect of enhancing RAG pipelines, especially when dealing with multimodal information. By breaking down queries into their spoken and visual components, you can create more targeted and efficient search capabilities. ⚡️
While manual breakdown is a good starting point, automating this process with LLMs can greatly improve scalability and accuracy, making your systems more powerful and user-friendly.
# Manual query breakdown

spoken_query = "Show me where the narrator discusses the formation of the solar system"
visual_query = "Visualize the Milky Way galaxy"

# Using an LLM to transform the query

from openai import OpenAI

transformation_prompt = """
Divide the following query into two distinct parts: one for spoken content and one for visual content. The spoken content should refer to any narration, dialogue, or verbal explanations, and the visual content should refer to any images, videos, or graphical representations. Format the response strictly as:\nSpoken: <spoken_query>\nVisual: <visual_query>\n\nQuery: {query}
"""

# Initialize OpenAI client
client = OpenAI()


def divide_query(query):
    # Use the OpenAI client to create a chat completion with a structured prompt
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": transformation_prompt.format(query=query)}
        ],
    )

    message = response.choices[0].message.content
    divided_query = message.strip().split("\n")
    spoken_query = divided_query[0].replace("Spoken:", "").strip()
    visual_query = divided_query[1].replace("Visual:", "").strip()

    return spoken_query, visual_query


# Test the query
query = "Show me the footage where the narrator talks about the terrestrial planets and Mercury, Venus, Earth are visible on the screen"


spoken_query, visual_query = divide_query(query)
print(f"Spoken Query: {spoken_query}")
print(f"Visual Query: {visual_query}")


🔎 Step 3: Perform Searches

Now that we have divided the query, let's perform searches on both the spoken-word and scene indexes:
from videodb import SearchType, IndexType

# Perform the search using the spoken query
spoken_results = video.search(
    query=spoken_query,
    index_type=IndexType.spoken_word,
    search_type=SearchType.semantic,
)

# Perform the search using the visual query, adjusting the default scoring parameters.
scene_results = video.search(
    query=visual_query,
    index_type=IndexType.scene,
    search_type=SearchType.semantic,
    score_threshold=0.1,
    dynamic_score_percentage=100,
)

# Optionally, you can play the results to see what was found
spoken_results.play()
scene_results.play()
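Before combining the two result sets, it helps to look at the raw timestamps each search returned. The sketch below assumes the SDK's SearchResult exposes a get_shots() method whose shots carry start and end attributes; verify against your SDK version:
# Peek at the matched segments from each modality
spoken_timestamps = [[shot.start, shot.end] for shot in spoken_results.get_shots()]
scene_timestamps = [[shot.start, shot.end] for shot in scene_results.get_shots()]
print("Spoken segments:", spoken_timestamps)
print("Scene segments:", scene_timestamps)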

🔀 Step 4: Combine Search Results of Both Modalities

Each search result provides a list of timestamps that are relevant to the query in its own modality (spoken/semantic and scene/visual, in this case).
There are two ways to combine these search results:
Union: This method takes all the timestamps from every search result, creating a comprehensive list that includes every relevant time, even if some timestamps appear in only one result.
Intersection: This method only includes timestamps that appear in both search results, producing a smaller list of times that are relevant to both queries.
Depending on the method you prefer, you can pass the appropriate argument to the combine_results() function below.
def process_shots(l1, l2, operation):
    def merge_intervals(intervals):
        if not intervals:
            return []
        intervals.sort(key=lambda x: x[0])
        merged = [intervals[0]]
        for interval in intervals[1:]:
            if interval[0] <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], interval[1])
            else:
                merged.append(interval)
        return merged

    def intersection(intervals1, intervals2):
        i, j = 0, 0
        result = []
        while i < len(intervals1) and j < len(intervals2):
            start = max(intervals1[i][0], intervals2[j][0])
            end = min(intervals1[i][1], intervals2[j][1])
            if start < end:
                result.append([start, end])
            if intervals1[i][1] < intervals2[j][1]:
                i += 1
            else:
                j += 1
        return result

    if operation == "intersection":
        return intersection(merge_intervals(l1), merge_intervals(l2))
    return merge_intervals(l1 + l2)  # union of both timestamp lists
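With process_shots in place, a small combine_results() helper can pull the [start, end] pairs out of each SearchResult, merge them with the operation you prefer, and turn the combined timeline into a playable stream. This is a minimal sketch, not the definitive implementation: it assumes get_shots() on search results and generate_stream(timeline=...) / play_stream() from the SDK behave as in the VideoDB quickstart examples.
from videodb import play_stream


def combine_results(spoken_results, scene_results, operation="intersection"):
    # Convert each result set into [start, end] pairs before merging
    spoken_shots = [[shot.start, shot.end] for shot in spoken_results.get_shots()]
    scene_shots = [[shot.start, shot.end] for shot in scene_results.get_shots()]
    return process_shots(spoken_shots, scene_shots, operation)


# Combine both modalities and stream the matching footage
timeline = combine_results(spoken_results, scene_results, operation="intersection")
stream_url = video.generate_stream(timeline=timeline)
play_stream(stream_url)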