When watching a talk or presentation, it's common to take notes or share interesting points with others, and often it's the content on the slides that captures our attention. At VideoDB, we closely follow engineering best practices and regularly take notes from talks and conferences, which we share internally on our Slack channel.
However, when trying to recall specific parts of these talks, only a few keywords might come to mind. To address this, we built an internal tool that stores all these talks in VideoDB, letting us find and share on-screen content as text using a simple search query.
Let's explore the problem:
What was on the screen when the speaker discussed the "hard and fast rule" in the following video?
🔔 This notebook is a step towards creating a Slack bot that posts valuable engineering practices from top tech talks daily. Stay Tuned!
Introduction
In this tutorial, we'll explore an advanced yet accessible technique for retrieving visual information from video content based on what the speaker was discussing. Specifically, we'll focus on finding information on slides in a video recording of a talk.
As video content continues to grow in volume and importance, being able to quickly find specific information within videos becomes crucial. Imagine being able to locate a particular statistic mentioned in an hour-long presentation without watching the entire video. That's the power of multimodal video search!
This approach combines VideoDB's powerful scene indexing capabilities with spoken word search to create a robust, multimodal search pipeline. Don't worry if these terms sound complex - we'll break everything down step by step!
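To make the "multimodal" idea concrete, here is a minimal local sketch of how results from a spoken-word search and a scene (visual) search can be combined: keep only the moments where the time ranges from both modalities overlap. The timestamps below are made up for illustration; the real pipeline gets these ranges from VideoDB's search results.

```python
# Sketch: intersect the time ranges returned by a spoken-word search with
# those returned by a scene (visual) search, keeping only moments where
# both modalities match. The sample timestamps below are hypothetical.

def intersect_ranges(spoken, visual):
    """Return (start, end) ranges where a spoken-word hit overlaps a scene hit."""
    matches = []
    for s_start, s_end in spoken:
        for v_start, v_end in visual:
            start, end = max(s_start, v_start), min(s_end, v_end)
            if start < end:  # keep only non-empty overlaps
                matches.append((start, end))
    return matches

# Hypothetical timestamps (in seconds) from the two searches
spoken_hits = [(120.0, 135.0), (400.0, 420.0)]
scene_hits = [(130.0, 150.0), (600.0, 620.0)]

print(intersect_ranges(spoken_hits, scene_hits))
```

Only the 130–135 s window survives here, since it is the only span where the speaker's words and the on-screen content both match.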
The next section delves deeper into the various options available for scene extraction within Scene Index. It covers advanced settings, customization features, and tips for optimizing scene extraction for different needs and preferences.
Finding the Best Configuration for Scene Extraction
from IPython.display import Image, display
import requests

# Helper function to view the images in a scene collection
def display_scenes(scenes, images=True):
    for scene in scenes:
        print(f"{scene.id} : {scene.start}-{scene.end}")
        if images:
            for frame in scene.frames:
                im = Image(requests.get(frame.url, stream=True).content)
                display(im)
        print("----")
scene_collection_default = video.extract_scenes()
display_scenes(scene_collection_default.scenes)
For conference videos, we'd like to lower the threshold so that every slide change is captured. Let's run the scene extraction again and compare the results.
from videodb import SceneExtractionType
scene_collection = video.extract_scenes(
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={
        "threshold": 10,
    },
)
display_scenes(scene_collection.scenes)
✍️ 2. Finding the Right Prompt for Indexing
Testing the prompt on some sample scenes first is a sensible approach. It allows you to experiment and make adjustments without committing to the entire video, which can help manage costs.
The prompt guides the visual model to identify and describe the content of slides in each scene, outputting "None" if no slides are visible. This targeted testing can help fine-tune the model's performance before applying it to the entire video.
for scene in scene_collection.scenes[20:23]:
    description = scene.describe(
        "Give the content written on the slides, output None if it isn't the slides."
    )
    print(f"{scene.id} : {scene.start}-{scene.end}")
    print(description)
    print("-----")
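Because the prompt returns "None" for scenes without slides, it can be useful to drop those entries before browsing or searching the descriptions. Below is a small local sketch of such a post-filter; the sample scene data is made up for illustration and mirrors the `start`/`end`/`description` shape used later in this notebook.

```python
# Sketch: drop scene entries whose description is "None", i.e. scenes the
# visual model judged to contain no slides. Sample data is hypothetical.

def filter_slide_scenes(scene_index):
    """Keep only scenes whose description contains actual slide content."""
    return [
        scene for scene in scene_index
        if scene["description"].strip().lower() != "none"
    ]

sample_index = [
    {"start": 0.0, "end": 12.5, "description": "None"},
    {"start": 12.5, "end": 40.0, "description": "Agenda: 1. Testing 2. CI/CD"},
    {"start": 40.0, "end": 55.0, "description": "none"},
]

print(filter_slide_scenes(sample_index))
```

Only the agenda slide survives the filter, which keeps later search results free of empty "None" entries.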
Now that we have found the right configuration and prompt for scene indexing, let's commit and index the full video ✨!
🎥 Index Scenes With The Finalized Config and Prompt
The following cell combines all the steps above and processes the entire video:
It breaks the video into scenes using a shot-based approach.
For each scene, it analyzes the visual content based on the given prompt.
It creates an index of these scene descriptions.
# Helper function to view the scene index
def display_scene_index(scene_index):
    for scene in scene_index:
        print(f"{scene['start']} - {scene['end']}")
        print(scene["description"])
        print("----")
scene_index_id = video.index_scenes(
    prompt="Give the content written on the slides, output None if it isn't the slides.",