videodb
VideoDB Documentation
Multimodal Guide

Multimodal Search: Quickstart

Introduction

Let’s first look at the example query we want to unlock in our video library:
📸🗣️ Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy

Implementing this multimodal search query involves the following steps with VideoDB:
🎬 Upload and Index the Video:
Upload the video and get the video object.
Use the index_scenes function to detect and describe visual events within the video.
Use the index_spoken_words function to index the narration and enable keyword search.
🧩 Query Transformation: Divide the query into two parts that can be used with the scene and spoken indexes, respectively.
🔎 Perform Search: Search for relevant segments in each index using the transformed queries.
🔀 Combine Search Results of Both Modalities: Integrate the results from both indexes for precise video segment identification.
🪄 Stream the Footage: Generate and play video streams using the matched segments.

Setup

📦 Installing packages

%pip install openai
%pip install videodb

🔑 API Keys

Before proceeding, ensure you have OpenAI and VideoDB API keys. If not, sign up for API access on the respective platforms.
Get your VideoDB API key from the VideoDB console. (Free for first 50 uploads, no credit card required) 🎉
import os

os.environ["OPENAI_API_KEY"] = ""
os.environ["VIDEO_DB_API_KEY"] = ""


📋 Step 0: Connect to VideoDB

Gear up by establishing a connection to VideoDB:
from videodb import connect

conn = connect()
coll = conn.get_collection()

🎬 Step 1: Upload and Index the Video

Let's upload our sample educational video about the solar system:
# Upload a video by URL
video = coll.upload(url="https://www.youtube.com/watch?v=libKVRa01L8")

Now, let's index both the spoken content and scene content:
from videodb import SceneExtractionType

# Index spoken content
video.index_spoken_words()

# Index scene content
index_id = video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 2, "select_frames": ["first", "last"]},
    prompt="Describe the scene in detail",
)
video.get_scene_index(index_id)
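For reference, get_scene_index returns the stored scene descriptions. Here is a minimal sketch of iterating over them; the exact entry shape (a dict with start, end, and description fields) and the sample descriptions are assumptions for illustration, not taken from a real indexing run:

```python
# Hypothetical scene index entries; the field names ("start", "end",
# "description") and contents are assumed for illustration.
scene_index = [
    {"start": 0.0, "end": 2.0, "description": "A spinning view of the Milky Way galaxy"},
    {"start": 2.0, "end": 4.0, "description": "The Sun forming from a collapsing cloud of gas"},
]

def format_scene(scene):
    # Render one entry as "start-end s: description" for quick inspection.
    return f"{scene['start']:.1f}-{scene['end']:.1f}s: {scene['description']}"

for scene in scene_index:
    print(format_scene(scene))
```

Skimming these descriptions is a quick way to sanity-check that the prompt produced useful scene text before you start searching.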


🧩 Step 2: Query Transformation

⚡️ Query transformation or processing is a crucial aspect of enhancing RAG pipelines, especially when dealing with multimodal information. By breaking down queries into their spoken and visual components, you can create more targeted and efficient search capabilities. ⚡️
While manual breakdown is a good starting point, automating this process with LLMs can greatly improve scalability and accuracy, making your systems more powerful and user-friendly.
# Manual query breaking

spoken_query = "Show me where the narrator discusses the formation of the solar system"
visual_query = "Visualize the Milky Way galaxy"

#Using LLM to transform the query

from openai import OpenAI

transformation_prompt = """
Divide the following query into two distinct parts: one for spoken content and one for visual content. The spoken content should refer to any narration, dialogue, or verbal explanations, and the visual content should refer to any images, videos, or graphical representations. Format the response strictly as:\nSpoken: <spoken_query>\nVisual: <visual_query>\n\nQuery: {query}
"""

# Initialize OpenAI client
client = OpenAI()


def divide_query(query):
    # Use the OpenAI client to create a chat completion with a structured prompt
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": transformation_prompt.format(query=query)}
        ],
    )

    message = response.choices[0].message.content
    divided_query = message.strip().split("\n")
    spoken_query = divided_query[0].replace("Spoken:", "").strip()
    visual_query = divided_query[1].replace("Visual:", "").strip()

    return spoken_query, visual_query


# Test the query
query = "Show me the footage where the narrator talks about the terrestrial planets and Mercury, Venus, Earth are visible on the screen"


spoken_query, visual_query = divide_query(query)
print(f"Spoken Query: {spoken_query}")
print(f"Visual Query: {visual_query}")
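Note that the parsing inside divide_query assumes the model replies in the exact Spoken:/Visual: format requested by the prompt. Here is an offline sketch of that parsing step, using a hypothetical model response:

```python
# Hypothetical model response in the format the prompt requests.
message = (
    "Spoken: narrator talks about the terrestrial planets\n"
    "Visual: Mercury, Venus and Earth visible on screen"
)

# Same parsing logic as divide_query: split on newlines, strip the labels.
divided_query = message.strip().split("\n")
spoken_query = divided_query[0].replace("Spoken:", "").strip()
visual_query = divided_query[1].replace("Visual:", "").strip()

print(spoken_query)  # narrator talks about the terrestrial planets
print(visual_query)  # Mercury, Venus and Earth visible on screen
```

If the model deviates from this format, the index-based split will fail, so production code may want a more defensive parse (or a structured-output feature of your LLM provider).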


🔎 Step 3: Perform Searches

Now that we have divided the query, let's perform searches on both the spoken word and scene indexes:
from videodb import SearchType, IndexType

# Perform the search using the spoken query
spoken_results = video.search(
    query=spoken_query,
    index_type=IndexType.spoken_word,
    search_type=SearchType.semantic,
)

# Perform the search using the visual query, changing some default parameters.
scene_results = video.search(
    query=visual_query,
    index_type=IndexType.scene,
    search_type=SearchType.semantic,
    score_threshold=0.1,
    dynamic_score_percentage=100,
)

# Optionally, you can play the results to see what was found
spoken_results.play()
scene_results.play()

🔀 Step 4: Combine Search Results of Both Modalities

Each search result provides a list of timestamps relevant to the query in its modality (spoken and visual, in this case).
There are two ways to combine these search results:
Union: This method takes all the timestamps from every search result, creating a comprehensive list that includes every relevant time, even if a timestamp appears in only one result.
Intersection: This method only includes timestamps that appear in both search results, yielding a smaller list of times that are relevant to both queries.
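As a quick illustration of the difference, here is a self-contained sketch of the two operations on hypothetical timestamp lists (the interval values are made up for this example):

```python
# Hypothetical search timestamps (start, end) in seconds.
spoken = [(10, 30), (50, 70)]  # where the narration matched
scene = [(20, 40), (65, 90)]   # where the visuals matched

def union(a, b):
    # Merge all intervals from both results into one combined list.
    merged = []
    for start, end in sorted(a + b):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def intersection(a, b):
    # Keep only the overlapping parts of intervals from both results.
    return [
        (max(s1, s2), min(e1, e2))
        for s1, e1 in a
        for s2, e2 in b
        if max(s1, s2) < min(e1, e2)
    ]

print(union(spoken, scene))        # [(10, 40), (50, 90)]
print(intersection(spoken, scene)) # [(20, 30), (65, 70)]
```

Union gives you everything either modality found; intersection keeps only the moments where both the narration and the visuals match.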
Depending on the method you prefer, you can pass the appropriate argument to the combine_results() function below.
def process_shots(l1, l2, operation):
    def merge_intervals(intervals):
        if not intervals:
            return []
        # Work on mutable copies so tuple inputs are not mutated in place.
        intervals = sorted((list(iv) for iv in intervals), key=lambda x: x[0])
        merged = [intervals[0]]
        for interval in intervals[1:]:
            if interval[0] <= merged[-1][1]:
                merged[-1][1] = max(merged[-1][1], interval[1])
            else:
                merged.append(interval)
        return merged

    def intersection(intervals1, intervals2):
        i, j = 0, 0
        result = []
        while i < len(intervals1) and j < len(intervals2):
            low = max(intervals1[i][0], intervals2[j][0])
            high = min(intervals1[i][1], intervals2[j][1])
            if low < high:
                result.append([low, high])
            if intervals1[i][1] < intervals2[j][1]:
                i += 1
            else:
                j += 1
        return result

    if operation.lower() == "intersection":
        return intersection(merge_intervals(l1), merge_intervals(l2))
    elif operation.lower() == "union":
        return merge_intervals(l1 + l2)
    else:
        raise ValueError("Invalid operation. Please choose 'intersection' or 'union'.")


def combine_results(spoken_results, scene_results, operation):
    spoken_timestamps = [(shot.start, shot.end) for shot in spoken_results.get_shots()]
    scene_timestamps = [(shot.start, shot.end) for shot in scene_results.get_shots()]
    print("Spoken Results : ", spoken_timestamps)
    print("Scene Results : ", scene_timestamps)
    result = process_shots(spoken_timestamps, scene_timestamps, operation)
    return result

# Get intersection points
results = combine_results(spoken_results, scene_results, "intersection")

🪄 Step 5: Stream the Footage

Finally, let's generate a stream of the intersecting segments and watch it:
from videodb import play_stream

print(f"Multimodal Query: {query}")
stream_link = video.generate_stream(results)
play_stream(stream_link)
This would play a video stream containing only the segments where both the spoken content and visual content match our original multimodal query.

Conclusion

Congratulations 🙌🏼 You've successfully implemented a multimodal search workflow. This powerful technique allows for precise identification of video segments that match both spoken and visual criteria, opening up new possibilities for:
Law Enforcement: Helps in quickly retrieving crucial evidence from vast amounts of surveillance and news footage.
Media and Journalism: Facilitates the process of locating specific segments within hours of news broadcasts, aiding in efficient reporting and fact-checking.
Public Safety: Enhances the ability of authorities to disseminate important information to the public by quickly identifying and sharing relevant content.
... and much more!

There are more ways to enable multimodal search queries, and we’ll be adding detailed guides for each method.
Remember, the key to mastering this technique is experimentation. Try different queries, adjust your search parameters, and see how you can fine-tune the results for your specific use case.
To learn more about Scene Index, explore the Scene Index guides in the VideoDB documentation.