videodb
VideoDB Documentation
Multimodal Guide

Conference Slide Scraper with VideoDB

When watching a talk or presentation, it's common to take notes or share interesting points with others, and often it is the content on the slides that captures our attention. At VideoDB, we keep up with engineering practices from leading teams and regularly take notes from talks and conferences, which we share internally on our Slack channel.
However, when trying to recall a specific part of a talk later, often only a few keywords come to mind. To address this, we built an internal tool that stores these talks in VideoDB, so that a search query returns the on-screen content as text, ready to find and share.

Let's explore the problem:
What was on the screen when the speaker discussed the "hard and fast rule" in the following video?

🔔 This notebook is a step towards creating a Slack bot that posts valuable engineering practices from top tech talks daily. Stay Tuned!

Introduction

In this tutorial, we'll explore an advanced yet accessible technique for retrieving visual information from video content based on what the speaker was discussing. Specifically, we'll focus on finding the information shown on slides in a video recording of a talk.
As video content continues to grow in volume and importance, being able to quickly find specific information within videos becomes crucial. Imagine being able to locate a particular statistic mentioned in an hour-long presentation without watching the entire video. That's the power of multimodal video search!
This approach combines VideoDB's powerful scene indexing capabilities with spoken word search to create a robust, multimodal search pipeline. Don't worry if these terms sound complex - we'll break everything down step by step!

Setup

📦 Installing packages

%pip install videodb

🔑 API Keys

Before proceeding, ensure you have access to VideoDB. If not, sign up for API access.
Get your API key from the VideoDB console. (Free for the first 50 uploads, no credit card required) 🎉
import os
os.environ["VIDEO_DB_API_KEY"] = ""
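An empty key only shows up as an error once a request is made, so it can help to check it up front. The following is a minimal sketch, assuming the key is stored in the `VIDEO_DB_API_KEY` environment variable as above; `require_api_key` is a hypothetical helper written for this guide, not part of the videodb package.

```python
import os


def require_api_key(env=None, name="VIDEO_DB_API_KEY"):
    """Return the API key from the environment, or raise if it is missing or blank."""
    # `env` defaults to the real process environment; a plain dict can be passed for testing.
    env = os.environ if env is None else env
    key = (env.get(name) or "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; add your key before calling connect().")
    return key
```

Calling `require_api_key()` just before `connect()` turns a blank key into an immediate, readable error.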

📋 Step 1: Connect to VideoDB

Gear up by establishing a connection to VideoDB
from videodb import connect

conn = connect()
coll = conn.get_collection()

🎬 Step 2: Upload the Video

Next, let's upload our sample video:
# Upload a video by URL
video = coll.upload(url="https://www.youtube.com/watch?v=libKVRa01L8")

📸🗣️ Step 3: Index the Video on different Modalities

Now comes the exciting part - we're going to index our video in two ways:
Indexing spoken content (what's being said in the video)
Indexing visual content (what's being shown in the video)

🗣️ Indexing Spoken Content

# Index spoken content

video.index_spoken_words()
This function transcribes the speech in the video and indexes it, making it searchable.
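To make "searchable" concrete, here is a toy illustration of keyword search over timestamped transcript segments. It is only a sketch of the idea, not VideoDB's implementation: the segment structure, the `keyword_search` helper, and the sample sentences are all assumptions made for this example.

```python
def keyword_search(segments, query):
    """Return (start, end) for every transcript segment whose text contains the query."""
    needle = query.lower()
    return [
        (seg["start"], seg["end"])
        for seg in segments
        if needle in seg["text"].lower()
    ]


# Hypothetical transcript segments (times in seconds); not taken from the sample video.
segments = [
    {"start": 0.0, "end": 6.5, "text": "Welcome everyone to the talk."},
    {"start": 6.5, "end": 14.0, "text": "There is no hard and fast rule for this."},
    {"start": 14.0, "end": 20.0, "text": "Let's look at the details."},
]

matches = keyword_search(segments, "hard and fast rule")
print(matches)
```

The useful output of such a search is the time range of each match, which is what the pipeline later uses to look up the scenes on screen at that moment.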

📸️ Find Right Configuration for Scene Indexing

To learn more about Scene Index, explore the following guides:
The quickstart guide provides a step-by-step introduction to Scene Index. It's ideal for getting started quickly and understanding the primary functions.
The scene extraction options guide delves deeper into the various options available for scene extraction within Scene Index. It covers advanced settings, customization features, and tips for optimizing scene extraction based on different needs and preferences.
1. Finding the Best Configuration for Scene Extraction
from IPython.display import Image, display
import requests


# Helper function that will help us view the Scene Collection Images
def display_scenes(scenes, images=True):
    for scene in scenes:
        print(f"{scene.id} : {scene.start}-{scene.end}")
        if images:
            for frame in scene.frames:
                im = Image(requests.get(frame.url, stream=True).content)
                display(im)
        print("----")


scene_collection_default = video.extract_scenes()
display_scenes(scene_collection_default.scenes)

For conference videos, we want to lower the threshold so that every slide change is captured. Let’s run the scene extraction again and see the results.
from videodb import SceneExtractionType

scene_collection = video.extract_scenes(
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={
        "threshold": 10,
    },
)
display_scenes(scene_collection.scenes)
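Why does a lower threshold yield more scenes? The toy model below may help build intuition. It is a sketch under stated assumptions, not a description of how VideoDB detects shots: the per-frame difference scores and the `split_into_shots` helper are invented for illustration, and the scale of the real `threshold` setting may differ.

```python
def split_into_shots(frame_diffs, threshold):
    """Split frames into shots, cutting wherever a difference score exceeds threshold.

    frame_diffs[i] is a made-up score for how much frame i+1 differs from frame i.
    Returns a list of (first_frame, last_frame) index pairs.
    """
    shots = []
    start = 0
    for i, diff in enumerate(frame_diffs):
        if diff > threshold:
            # The change between frame i and frame i+1 is big enough to count as a cut.
            shots.append((start, i))
            start = i + 1
    shots.append((start, len(frame_diffs)))
    return shots


# Hypothetical scores: the large jump is a camera cut, the mid-sized ones are slide changes.
diffs = [1, 2, 15, 1, 40, 2, 12, 1]

print(len(split_into_shots(diffs, threshold=20)))
print(len(split_into_shots(diffs, threshold=10)))
```

In this toy data the slide changes produce smaller jumps than the camera cut, so only the lower threshold separates them into their own scenes.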

✍️ 2. Finding the Right Prompt for Indexing
Testing the prompt on some sample scenes first is a sensible approach. It allows you to experiment and make adjustments without committing to the entire video, which can help manage costs.
The prompt guides the visual model to identify and describe the content of slides in each scene, outputting "None" if no slides are visible. This targeted testing can help fine-tune the model's performance before applying it to the entire video.
for scene in scene_collection.scenes[20:23]:
    description = scene.describe(
        "Give the content written on the slides, output None if it isn't the slides."
    )
    print(f"{scene.id} : {scene.start}-{scene.end}")
    print(description)
    print("-----")
Now that we have found the right configuration and prompt for Scene Indexing, it's like we've found the perfect match, so let's commit to indexing those scenes ✨!

🎥 Index Scenes With The Finalized Config and Prompt

This function fits all the steps above into a single cell and processes the entire video accordingly:
It breaks the video into scenes using a shot-based approach.
For each scene, it analyzes the visual content based on the given prompt.
It creates an index of these scene descriptions.
# Helper function to view the Scene Index
def display_scene_index(scene_index):
    for scene in scene_index:
        print(f"{scene['start']} - {scene['end']}")
        print(scene["description"])
        print("----")


scene_index_id = video.index_scenes(
    prompt="Give the content written on the slides, output None if it isn't the slides.",
    name="slides_index",
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={
        "threshold": 10,
    },
)
print(scene_index_id)
scene_index = video.get_scene_index(scene_index_id)
display_scene_index(scene_index)
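Because the prompt asks for "None" on scenes without slides, the index will usually contain many entries that carry no slide text. A small sketch for viewing only the slide scenes is shown below. It assumes each entry is a dict with `start`, `end`, and `description` keys, as read by `display_scene_index`; `slide_entries` and the sample data are hypothetical.

```python
def slide_entries(scene_index):
    """Keep only scenes whose description is real slide text rather than 'None'."""
    kept = []
    for scene in scene_index:
        # Treat a missing description, blank text, and any casing of "none" the same way.
        description = (scene.get("description") or "").strip()
        if description and description.lower() != "none":
            kept.append(scene)
    return kept


# Hypothetical index entries in the same shape that display_scene_index reads.
sample_index = [
    {"start": 0.0, "end": 4.2, "description": "None"},
    {"start": 4.2, "end": 9.8, "description": "IT'S ALL IN THE DETAILS"},
    {"start": 9.8, "end": 12.0, "description": " none "},
]

for scene in slide_entries(sample_index):
    print(f"{scene['start']} - {scene['end']}: {scene['description']}")
```

Scanning this shorter list is a quick way to judge whether the prompt and threshold are picking up every slide.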

🔍 Step 4: Search Pipeline Implementation

The heart of this approach is the search pipeline, which combines spoken word search with scene indexing.
The pipeline does the following:
Performs a keyword search on the spoken word index.
Extracts time ranges from the search results.
Retrieves the scenes.
Filters scenes based on overlaps with the time ranges from the spoken word search.
Returns the descriptions of these scenes (our slide content) and their time ranges.
def simple_filter_scenes(time_ranges, scene_dicts):
    def is_in_range(scene, range_start, range_end):
        scene_start = scene["start"]
        scene_end = scene["end"]
        return (
            (range_start <= scene_start <= range_end)
            or (range_start <= scene_end <= range_end)
            or (scene_start <= range_start and scene_end >= range_end)
        )

    filtered_scenes = []
    for start, end in time_ranges:
        filtered_scenes.extend(
            [scene for scene in scene_dicts if is_in_range(scene, start, end)]
        )

    # Remove duplicates while preserving order
    seen = set()
    return [
        scene
        for scene in filtered_scenes
        if not (tuple(scene.items()) in seen or seen.add(tuple(scene.items())))
    ]
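The three clauses in `is_in_range` cover a scene that starts inside the range, ends inside it, or fully contains it. For intervals whose start is not after their end, this is the same as the usual two-comparison overlap test. The sketch below checks that equivalence by brute force on small integer intervals; `overlaps` and `three_clause` are names introduced here for comparison only.

```python
def overlaps(scene_start, scene_end, range_start, range_end):
    """Closed-interval overlap: each interval starts no later than the other ends."""
    return scene_start <= range_end and range_start <= scene_end


def three_clause(scene_start, scene_end, range_start, range_end):
    """The same test spelled out as in is_in_range, for comparison."""
    return (
        (range_start <= scene_start <= range_end)
        or (range_start <= scene_end <= range_end)
        or (scene_start <= range_start and scene_end >= range_end)
    )


# Compare both forms on every well-formed pair of small integer intervals.
points = range(6)
pairs = [(a, b) for a in points for b in points if a <= b]
agree = all(
    overlaps(s0, s1, r0, r1) == three_clause(s0, s1, r0, r1)
    for (s0, s1) in pairs
    for (r0, r1) in pairs
)
print(agree)
```

Note that both forms treat intervals that merely touch at an endpoint as overlapping, so a scene ending exactly where a spoken match begins is still included.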
from videodb import IndexType, SearchType


def search_pipeline(query, video):
    # Search Query in Spoken Word Index
    search_result = video.search(
        query=query,
        index_type=IndexType.spoken_word,
        search_type=SearchType.keyword
    )
    time_ranges = [(shot.start, shot.end) for shot in search_result.get_shots()]

    scenes = scene_index

    for scene in scenes:
        scene["start"] = float(scene["start"])
        scene["end"] = float(scene["end"])

    # Filter Scene on the basis of Spoken results
    final_result = simple_filter_scenes(time_ranges, scenes)

    # Return Scene descriptions and Video Timelines of result
    result_text = "\n\n".join(
        result_entry["description"]
        for result_entry in final_result
        if result_entry.get("description", "").lower().strip() != "none"
    )
    result_timeline = [
        (result_entry.get("start"), result_entry.get("end"))
        for result_entry in final_result
    ]

    return result_text, result_timeline

👀 Step 5: Viewing the Search Results

Finally, let's use our search pipeline:
from videodb import play_stream

query = "hard and fast rule"

result_text, result_timeline = search_pipeline(query, video)

stream_link = video.generate_stream(result_timeline)
play_stream(stream_link)

print(result_text)
It returns scenes where the spoken words match your query, along with the content of any slides visible in those scenes.
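When a query matches several neighbouring scenes, the timeline can contain back-to-back or repeated ranges. An optional tidy-up step is to merge them before building the stream. This is a sketch only: the guide does not say whether `generate_stream` needs merged ranges, and `merge_ranges` with its `gap` parameter is a hypothetical helper.

```python
def merge_ranges(timeline, gap=0.0):
    """Merge (start, end) ranges that overlap or sit within `gap` seconds of each other."""
    merged = []
    for start, end in sorted(timeline):
        if merged and start <= merged[-1][1] + gap:
            # Extend the previous range instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical timeline with one duplicate, one adjacent pair, and one separate range.
timeline = [(30.0, 42.5), (12.0, 18.0), (18.0, 25.0), (30.0, 42.5)]
print(merge_ranges(timeline))
```

The merged list could then be passed in place of `result_timeline`, giving one continuous clip per stretch of the talk rather than one per scene.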


Here’s the result for this particular search query “hard and fast rule”:

The content written on the slide is:


"IT'S ALL IN THE DETAILS
Want to print your doc?
This is not the way.
Try clicking the ⋯ next to your doc name or using a keyboard shortcut (CtrlP) instead.