Multimodal Search: Quickstart

Introduction

Let’s first look at an example query that we want to unlock in our video library:
📸🗣️ Show me where the narrator discusses the formation of the solar system and visualize the Milky Way galaxy

Implementing this multimodal search query involves the following steps with VideoDB:
🎬 Upload and Index the Video:
Upload the video and get the video object.
Use the index_scenes function to index visual information in the video frames, such as shots of the Milky Way galaxy.
Use the index_spoken_words function to index the narrator's speech and enable search over the spoken content.
🧩 Query Transformation: Divide the query into two parts that can be used with the respective scene and spoken indexes.
🔎 Perform Search: Use the sub-queries to find relevant segments in each index (see the sketch after this list).
🔀 Combine Search Results of Both Modalities: Integrate the results from both indexes for precise video segment identification.
▶️ Stream the Footage: Generate and play video streams using the identified segments.
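To make the outline concrete, here is a compact sketch of the search, combine, and stream steps. It assumes a video that has already been uploaded and indexed as in Step 1, and it uses a simple time-overlap intersection to combine results; treat it as a preview under those assumptions, not the final implementation:

from videodb import IndexType, play_stream

# Search each modality with its part of the query (indexes are built in Step 1)
spoken_results = video.search(
    query="narrator discusses the formation of the solar system",
    index_type=IndexType.spoken_word,
)
scene_results = video.search(
    query="the Milky Way galaxy",
    index_type=IndexType.scene,
)

# Collect the matched time ranges (shots) from each modality
spoken_shots = [(s.start, s.end) for s in spoken_results.get_shots()]
scene_shots = [(s.start, s.end) for s in scene_results.get_shots()]

# Combine: keep only ranges where both modalities overlap in time
combined = [
    (max(a, c), min(b, d))
    for a, b in spoken_shots
    for c, d in scene_shots
    if max(a, c) < min(b, d)
]

# Stream the combined footage
stream_url = video.generate_stream(timeline=combined)
play_stream(stream_url)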

Setup

📦 Installing packages

%pip install openai
%pip install videodb

🔑 API Keys

Before proceeding, ensure you have access to an OpenAI API key and a VideoDB API key. If not, sign up for API access on the respective platforms.
Get your VideoDB API key from the VideoDB console. (Free for first 50 uploads, no credit card required) 🎉
import os

os.environ["OPENAI_API_KEY"] = ""
os.environ["VIDEO_DB_API_KEY"] = ""


📋 Step 0: Connect to VideoDB

Gear up by establishing a connection to VideoDB:
from videodb import connect

# connect() picks up VIDEO_DB_API_KEY from the environment
conn = connect()
coll = conn.get_collection()  # default collection

🎬 Step 1: Upload and Index the Video

Let's upload our sample educational video about the solar system:
# Upload a video by URL
video = coll.upload(url="https://www.youtube.com/watch?v=libKVRa01L8")
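The upload returns a Video object. As an optional sanity check, you can inspect its basic metadata and preview the stream (handy in notebooks; play() opens the video in the VideoDB player):

# Optional: confirm the upload by checking metadata and previewing the stream
print(video.id, video.name, video.length)  # length is in seconds
video.play()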

Now, let's index both the spoken content and scene content:
from videodb import SceneExtractionType

# Index spoken content
video.index_spoken_words()

# Index scene content
index_id = video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 2, "select_frames": ["first", "last"]},
    prompt="Describe the scene in detail",
)
video.get_scene_index(index_id)
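Scene indexing runs server-side; once it completes, get_scene_index returns the described scenes. A quick way to eyeball the result (a small sketch, assuming each scene record carries start, end, and description fields as in the scene-index output):

# Inspect the first few scene descriptions to verify indexing quality
scene_index = video.get_scene_index(index_id)
for scene in scene_index[:5]:
    print(f"{scene['start']}-{scene['end']}s: {scene['description']}")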


🧩 Step 2: Query Transformation

⚡️ Query transformation or processing is a crucial aspect of enhancing RAG pipelines, especially when dealing with multimodal information. By breaking down queries into their spoken and visual components, you can create more targeted and efficient search capabilities. ⚡️
While manual breakdown is a good starting point, automating this process with LLMs can greatly improve scalability and accuracy, making your systems more powerful and user-friendly.
# Manual query breakdown

spoken_query = "Show me where the narrator discusses the formation of the solar system"
visual_query = "Visualize the Milky Way galaxy"

# Using an LLM to transform the query

from openai import OpenAI

transformation_prompt = """
Divide the following query into two distinct parts: one for spoken content and one for visual content. The spoken content should refer to any narration, dialogue, or verbal explanations, and the visual content should refer to any images, videos, or graphical representations. Format the response strictly as:\nSpoken: <spoken_query>\nVisual: <visual_query>\n\nQuery: {query}
"""

# Initialize OpenAI client
client = OpenAI()


def divide_query(query):
    # Use the OpenAI client to create a chat completion with a structured prompt
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "user", "content": transformation_prompt.format(query=query)}
        ],
    )

    message = response.choices[0].message.content
    divided_query = message.strip().split("\n")
    spoken_query = divided_query[0].replace("Spoken:", "").strip()
    visual_query = divided_query[1].replace("Visual:", "").strip()

    return spoken_query, visual_query


# Test the query transformation
query = "Show me the footage where the narrator talks about the terrestrial planets and Mercury, Venus, Earth are visible on the screen"

spoken_query, visual_query = divide_query(query)
print(f"Spoken: {spoken_query}")
print(f"Visual: {visual_query}")

