Conference Slide Scraper with VideoDB

When watching a talk or presentation, it's common to take notes or share interesting points with others. Often, the content on the slides captures our attention. At VideoDB, we follow many top engineering processes and regularly take notes from talks and conferences, which we share internally on our Slack channel.

However, when trying to recall specific parts of these talks, only a few keywords might come to mind. To address this, we built an internal tool that stores all these talks in VideoDB, allowing us to find and share the content on the screen in text form using a search query.

Let's explore the problem:

What was on the screen when the speaker discussed the "hard and fast rule" in the talk used throughout this tutorial?

🔔 This notebook is a step towards creating a Slack bot that posts valuable engineering practices from top tech talks daily. Stay Tuned!

Introduction

In this tutorial, we'll explore an advanced yet accessible technique for retrieving visual information from video content based on what the speaker was discussing. Specifically, we'll focus on finding information on slides in a video recording of a speech.

As video content continues to grow in volume and importance, being able to quickly find specific information within videos becomes crucial. Imagine being able to locate a particular statistic mentioned in an hour-long presentation without watching the entire video. That's the power of multimodal video search!

This approach combines VideoDB's powerful scene indexing capabilities with spoken word search to create a robust, multimodal search pipeline. Don't worry if these terms sound complex - we'll break everything down step by step!

Setup

📦 Installing packages

%pip install videodb

🔑 API Keys

Before proceeding, make sure you have access to VideoDB. If you don't, sign up for API access.

Get your API key from the VideoDB Console. (Free for the first 50 uploads, no credit card required.) 🎉

import os

os.environ["VIDEO_DB_API_KEY"] = ""
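The empty string above is a placeholder for your own key. A minimal sketch of a guard that stops early if the key was never filled in is shown here. `require_api_key` is a hypothetical helper for illustration, not part of the VideoDB SDK.

```python
import os

def require_api_key(environ=os.environ, name="VIDEO_DB_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or blank."""
    key = environ.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; paste your key from the VideoDB Console")
    return key
```

Calling `require_api_key()` at the top of the notebook surfaces a missing key immediately instead of at the first API request.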

📋 Step 1: Connect to VideoDB

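Gear up by establishing a connection to VideoDB. Called without arguments, connect() is expected to pick up the key from the VIDEO_DB_API_KEY environment variable set above.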

from videodb import connect

conn = connect()

coll = conn.get_collection()

🎬 Step 2: Upload the Video

Next, let's upload our sample video:

# Upload a video by URL

video = coll.upload(url="https://www.youtube.com/watch?v=libKVRa01L8")

📸🗣️ Step 3: Index the Video on different Modalities

Now comes the exciting part - we're going to index our video in two ways:

Indexing spoken content (what's being said in the video)

Indexing visual content (what's being shown in the video)

🗣️ Indexing Spoken Content

# Index spoken content

video.index_spoken_words()

This function transcribes the speech in the video and indexes it, making it searchable.

📸️ Find Right Configuration for Scene Indexing

To learn more about Scene Index, explore the following guides:

The Quickstart Guide provides a step-by-step introduction to Scene Index. It's ideal for getting started quickly and understanding the primary functions.

The Scene Extraction Options Guide delves deeper into the various options available for scene extraction within Scene Index. It covers advanced settings, customization features, and tips for optimizing scene extraction based on different needs and preferences.

1. Finding the Best Configuration for Scene Extraction

from IPython.display import Image, display
import requests

# Helper function to view the images in a scene collection
def display_scenes(scenes, images=True):
    for scene in scenes:
        print(f"{scene.id} : {scene.start}-{scene.end}")
        if images:
            for frame in scene.frames:
                im = Image(requests.get(frame.url, stream=True).content)
                display(im)
        print("----")

scene_collection_default = video.extract_scenes()
display_scenes(scene_collection_default.scenes)

For conference videos, we want a lower threshold so that every slide change is detected as a new scene. Let's run the scene extraction again and see the results.

from videodb import SceneExtractionType

scene_collection = video.extract_scenes(
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={
        "threshold": 10,
    },
)
display_scenes(scene_collection.scenes)
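To build intuition for why a lower threshold captures more slides, here is a toy sketch. It is not VideoDB's actual shot-detection algorithm: it assumes each pair of consecutive frames has a made-up "difference score" and starts a new scene whenever that score reaches the threshold. `split_into_scenes` and the scores are hypothetical.

```python
def split_into_scenes(diff_scores, threshold):
    """Group frame indices into scenes, cutting wherever the difference
    between consecutive frames is at least `threshold`.

    diff_scores[i] is a made-up difference score between frame i and frame i + 1.
    """
    scenes = [[0]]
    for i, score in enumerate(diff_scores):
        if score >= threshold:
            scenes.append([i + 1])    # big change: start a new scene
        else:
            scenes[-1].append(i + 1)  # small change: the current scene continues
    return scenes

# A slide change is usually a smaller visual change than a camera cut.
scores = [2, 12, 3, 25, 1, 11]
print(len(split_into_scenes(scores, threshold=20)))  # 2
print(len(split_into_scenes(scores, threshold=10)))  # 4
```

With the higher threshold only the largest jump counts as a cut, so the moderate changes (here standing in for slide transitions) are merged into their neighbours.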

✍️ 2. Finding the Right prompt for Indexing

Testing the prompt on some sample scenes first is a sensible approach. It allows you to experiment and make adjustments without committing to the entire video, which can help manage costs.

The prompt guides the visual model to identify and describe the content of slides in each scene, outputting "None" if no slides are visible. This targeted testing can help fine-tune the model's performance before applying it to the entire video.

# Try the prompt on three sample scenes before indexing the whole video
for scene in scene_collection.scenes[20:23]:
    description = scene.describe(
        "Give the content written on the slides, output None if it isn't the slides."
    )
    print(f"{scene.id} : {scene.start}-{scene.end}")
    print(description)
    print("-----")
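Because the prompt asks for "None" on scenes without slides, any later step has to recognize that sentinel. Language models do not always return it verbatim, so a slightly tolerant check can help. The sketch below assumes a few variants you might encounter (quotes, a trailing period, different casing); that is a guess, not documented VideoDB behavior, and `is_slide_description` is a hypothetical helper.

```python
def is_slide_description(description):
    """Return True if the description looks like real slide content
    rather than the "None" sentinel requested by the prompt."""
    if description is None:
        return False
    # Strip whitespace, then surrounding quotes and periods, e.g. '"None."'
    cleaned = description.strip().strip("\"'.").strip().lower()
    return cleaned not in ("", "none")
```

A description that merely starts with the word, such as "None of the text is legible", is still treated as content, since only an exact sentinel is filtered out.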

Now that we have found the right configuration for Scene Indexing, it's like we've found the perfect match, so let's commit to indexing those scenes ✨!

🎥 Index Scenes With The Finalized Config and Prompt

The index_scenes function combines all the steps above in a single call and processes the entire video accordingly:

It breaks the video into scenes using a shot-based approach.

For each scene, it analyzes the visual content based on the given prompt.

It creates an index of these scene descriptions.

# Helper function to view the scene index
def display_scene_index(scene_index):
    for scene in scene_index:
        print(f"{scene['start']} - {scene['end']}")
        print(scene["description"])
        print("----")

scene_index_id = video.index_scenes(
    prompt="Give the content written on the slides, output None if it isn't the slides.",
    name="slides_index",
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={
        "threshold": 10,
    },
)
print(scene_index_id)

scene_index = video.get_scene_index(scene_index_id)
display_scene_index(scene_index)
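Since `display_scene_index` reads each record's `start`, `end`, and `description` keys, the index can also be scanned locally, for example to check which slides mention a term before building the full pipeline. This is a minimal sketch on made-up records of that shape; `find_slides` is a hypothetical helper, not an SDK call.

```python
def find_slides(scene_index, keyword):
    """Return (start, end, description) for records whose description
    contains `keyword`, ignoring case."""
    needle = keyword.lower()
    return [
        (float(scene["start"]), float(scene["end"]), scene["description"])
        for scene in scene_index
        if needle in scene.get("description", "").lower()
    ]

# Made-up records shaped like the ones printed by display_scene_index
sample_index = [
    {"start": "0.0", "end": "12.5", "description": "None"},
    {"start": "12.5", "end": "40.0", "description": "API REVIEW CHECKLIST"},
]
print(find_slides(sample_index, "checklist"))
```

This only matches text that appears on a slide. Finding a slide by what the speaker said while it was on screen needs the spoken word index as well.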

🔍 Step 4: Search Pipeline Implementation

The heart of this approach is the search pipeline, which combines spoken word search with scene indexing. The pipeline does the following:

Performs a keyword search on the spoken word index.

Extracts time ranges from the search results.

Retrieves the scenes.

Filters scenes based on overlaps with the time ranges from the spoken word search.

Returns the descriptions of these scenes (our slide content) and their time ranges.

def simple_filter_scenes(time_ranges, scene_dicts):
    def is_in_range(scene, range_start, range_end):
        scene_start = scene["start"]
        scene_end = scene["end"]
        return (
            (range_start <= scene_start <= range_end)
            or (range_start <= scene_end <= range_end)
            or (scene_start <= range_start and scene_end >= range_end)
        )

    filtered_scenes = []
    for start, end in time_ranges:
        filtered_scenes.extend(
            [scene for scene in scene_dicts if is_in_range(scene, start, end)]
        )

    # Remove duplicates while preserving order
    seen = set()
    return [
        scene
        for scene in filtered_scenes
        if not (tuple(scene.items()) in seen or seen.add(tuple(scene.items())))
    ]
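The three conditions in `is_in_range` together test whether two closed intervals overlap. Assuming every scene and every range has start <= end, the same test can be written as a single comparison. The sketch below restates both forms on plain numbers and brute-force checks that they agree on a small grid; `overlaps` is a hypothetical name.

```python
from itertools import product

def is_in_range(scene_start, scene_end, range_start, range_end):
    """The three-way check from simple_filter_scenes, on plain numbers."""
    return (
        (range_start <= scene_start <= range_end)
        or (range_start <= scene_end <= range_end)
        or (scene_start <= range_start and scene_end >= range_end)
    )

def overlaps(scene_start, scene_end, range_start, range_end):
    """Two closed intervals overlap when each one starts before the other ends."""
    return scene_start <= range_end and range_start <= scene_end

# Compare both forms on every pair of valid intervals over a small grid.
points = range(6)
intervals = [(a, b) for a, b in product(points, points) if a <= b]
assert all(
    is_in_range(s0, s1, r0, r1) == overlaps(s0, s1, r0, r1)
    for (s0, s1), (r0, r1) in product(intervals, intervals)
)
```

Both forms treat touching endpoints as an overlap, so a scene that ends exactly where a spoken-word match begins is still included.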

from videodb import IndexType, SearchType

def search_pipeline(query, video):
    # Search the query in the spoken word index
    search_result = video.search(
        query=query,
        index_type=IndexType.spoken_word,
        search_type=SearchType.keyword,
    )
    time_ranges = [(shot.start, shot.end) for shot in search_result.get_shots()]

    # Uses the notebook-level scene_index fetched in Step 3
    scenes = scene_index
    for scene in scenes:
        scene["start"] = float(scene["start"])
        scene["end"] = float(scene["end"])

    # Filter scenes on the basis of the spoken word results
    final_result = simple_filter_scenes(time_ranges, scenes)

    # Return scene descriptions and video timelines of the result
    result_text = "\n\n".join(
        result_entry["description"]
        for result_entry in final_result
        if result_entry.get("description", "").lower().strip() != "none"
    )
    result_timeline = [
        (result_entry.get("start"), result_entry.get("end"))
        for result_entry in final_result
    ]
    return result_text, result_timeline
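`search_pipeline` returns one (start, end) pair per matching scene, so neighbouring scenes come back as back-to-back ranges. If you prefer one continuous clip per stretch of video, the ranges can be merged first. This is an optional sketch: the tutorial does not say whether stream generation needs merged input, and `merge_timeline` is a hypothetical helper.

```python
def merge_timeline(timeline):
    """Merge overlapping or touching (start, end) ranges into sorted, disjoint ranges."""
    merged = []
    for start, end in sorted(timeline):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(merge_timeline([(30.0, 45.0), (10.0, 20.0), (20.0, 30.0)]))  # [(10.0, 45.0)]
```

Sorting first keeps the merge a single pass, and using `max` on the end time handles a short range nested inside a longer one.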

👀 Step 5: Viewing the Search Results

Finally, let's use our search pipeline:

from videodb import play_stream

query = "hard and fast rule"

result_text, result_timeline = search_pipeline(query, video)

stream_link = video.generate_stream(result_timeline)

play_stream(stream_link)

print(result_text)

It returns scenes where the spoken words match your query, along with the content of any slides visible in those scenes.

Here’s the result for this particular search query “hard and fast rule”:

The content written on the slide is:

"IT'S ALL IN THE DETAILS

- Prefer American English for naming
- Avoid payment-industry jargon
- Timestamp fields should use <verbed_at>
- Amount properties should also provide a currency
- API resources with IDs are top-level
- New API resources should be retrieved and listed one way
- API resource mutations should be reflected in API responses
- Use nested structures for future extensibility
- Prefer enums to booleans for new properties
- Use a type field for polymorphic objects
- Use verbs for properties with side effects
- Use top-level namespaces for product APIs
- Evaluate new features in the Dashboard before building an API
- Use simple, unambiguous language
- Always paginate unbounded lists
- Iterate on designs with beta users with the feature behind a gate"

Here are some other query outputs using the same search pipeline:

Search for “stripe api review”

API REVIEW CHECKLIST

**API Review: [Insert Title Here]**_Ziec: Gavel jar, link for API review join creation_

**Gavel block**To ping PM when Q/A and other stakeholders have

**Summary**

(Please include a short description of the change you would like to make. You may want to include what is and what they are impacted. Consider what the changes you are making into any broader system or related changes. Also, are there additional  reviews we need any updated?)

If you are adding new parameters or fields, please include the descriptions as you will see in the API ref docs.

Please include example API request and response for your API change.

**Type of Change**_(Deprecation, Change to existing feature, New feature)_

**Rollout Plan**

(Please define the timeline of your change, and how you would like to roll it out to users. Is the change something we can roll out to 100% of the Bentley?  Will this be something in a migration?)

**Interaction With Other API Products**(If application, please comments on how your change does/doesn’t work with:• APIs• linea• APIs• APIs)

**User Impact**
