Conference Slide Scraper with VideoDB

When watching a talk or presentation, it's common to take notes or share interesting points with others. Often, it's the content on the slides that captures our attention. At VideoDB, we follow many top engineering practices and regularly take notes from talks and conferences, which we share internally on our Slack channel.
However, when trying to recall specific parts of these talks, only a few keywords might come to mind. To address this, we built an internal tool that stores all these talks in VideoDB, letting us find and share the on-screen content in text form using a search query.

Let's explore the problem:
What was on the screen when the speaker discussed the "hard and fast rule" in the following video?

🔔 This notebook is a step towards creating a Slack bot that posts valuable engineering practices from top tech talks daily. Stay Tuned!

Introduction

In this tutorial, we'll explore an advanced yet accessible technique for retrieving visual information from video content based on what the speaker was discussing. Specifically, we'll focus on finding information on slides in a video recording of a talk.
As video content continues to grow in volume and importance, being able to quickly find specific information within videos becomes crucial. Imagine being able to locate a particular statistic mentioned in an hour-long presentation without watching the entire video. That's the power of multimodal video search!
This approach combines VideoDB's powerful scene indexing capabilities with spoken word search to create a robust, multimodal search pipeline. Don't worry if these terms sound complex - we'll break everything down step by step!

Setup

📦 Installing packages

%pip install videodb

🔑 API Keys

Before proceeding, ensure you have access to VideoDB. If not, sign up for API access.
Get your API key from the VideoDB console. (Free for first 50 uploads, no credit card required) 🎉
import os
os.environ["VIDEO_DB_API_KEY"] = ""

📋 Step 1: Connect to VideoDB

Gear up by establishing a connection to VideoDB
from videodb import connect

conn = connect()
coll = conn.get_collection()

🎬 Step 2: Upload the Video

Next, let's upload our sample video:
# Upload a video by URL
video = coll.upload(url="https://www.youtube.com/watch?v=libKVRa01L8")

📸🗣️ Step 3: Index the Video on Different Modalities

Now comes the exciting part - we're going to index our video in two ways:
Indexing spoken content (what's being said in the video)
Indexing visual content (what's being shown in the video)

🗣️ Indexing Spoken Content

# Index spoken content

video.index_spoken_words()
This function transcribes the speech in the video and indexes it, making it searchable.

📸️ Find the Right Configuration for Scene Indexing

To learn more about Scene Index, explore the following guides:
The quickstart guide provides a step-by-step introduction to Scene Index. It's ideal for getting started quickly and understanding the primary functions.
The scene extraction guide delves deeper into the various options available for scene extraction within Scene Index. It covers advanced settings, customization features, and tips for optimizing scene extraction based on different needs and preferences.
Finding the Best Configuration for Scene Extraction
from IPython.display import Image, display
import requests


# Helper function that will help us view the Scene Collection images
def display_scenes(scenes, images=True):
    for scene in scenes:
        print(f"{scene.id} : {scene.start}-{scene.end}")
        if images:
            for frame in scene.frames:
                im = Image(requests.get(frame.url, stream=True).content)
                display(im)
        print("----")


scene_collection_default = video.extract_scenes()
display_scenes(scene_collection_default.scenes)

For conference videos, we'd like a lower threshold so that every slide change is captured. Let's run the scene extraction again and compare the results.
from videodb import SceneExtractionType

scene_collection = video.extract_scenes(
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={
        "threshold": 10,
    },
)
display_scenes(scene_collection.scenes)

✍️ Finding the Right Prompt for Indexing
Testing the prompt on some sample scenes first is a sensible approach. It allows you to experiment and make adjustments without committing to the entire video, which can help manage costs.
The prompt guides the visual model to identify and describe the content of slides in each scene, outputting "None" if no slides are visible. This targeted testing can help fine-tune the model's performance before applying it to the entire video.
for scene in scene_collection.scenes[20:23]:
    description = scene.describe(
        "Give the content written on the slides, output None if it isn't the slides."
    )
    print(f"{scene.id} : {scene.start}-{scene.end}")
    print(description)
    print("-----")
Now that we have found the right configuration for Scene Indexing, it's like we've found the perfect match—let's commit to indexing those scenes ✨!

🎥 Index Scenes With The Finalized Config and Prompt

This function fits all the steps above into a single cell and processes the entire video accordingly:
It breaks the video into scenes using a shot-based approach.
For each scene, it analyzes the visual content based on the given prompt.
It creates an index of these scene descriptions.
# Helper function to view the scene index
def display_scene_index(scene_index):
    for scene in scene_index:
        print(f"{scene['start']} - {scene['end']}")
        print(scene["description"])
        print("----")


scene_index_id = video.index_scenes(
    prompt="Give the content written on the slides, output None if it isn't the slides.",
    name="slides_index",
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={
        "threshold": 10,
    },
)
print(scene_index_id)
scene_index = video.get_scene_index(scene_index_id)
display_scene_index(scene_index)

🔍 Step 4: Search Pipeline Implementation

The heart of this approach is the search pipeline, which combines spoken word search with scene indexing:
This pipeline does the following:
Performs a keyword search on the spoken word index.
Extracts time ranges from the search results.
Retrieves the scenes.
Filters scenes based on overlaps with the time ranges from the spoken word search.
Returns the descriptions of these scenes (our slide content) and their time ranges.
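Before looking at the implementation, the interval-overlap check at the heart of the filter can be illustrated on toy numbers. This is a minimal standalone sketch; `overlaps` is a hypothetical helper that mirrors the condition used inside the filter below.

```python
# A scene matches a spoken-word time range if the two intervals overlap at all.
def overlaps(scene_start, scene_end, range_start, range_end):
    return (
        (range_start <= scene_start <= range_end)                   # scene starts inside the range
        or (range_start <= scene_end <= range_end)                  # scene ends inside the range
        or (scene_start <= range_start and scene_end >= range_end)  # scene fully covers the range
    )

print(overlaps(10, 20, 15, 25))  # True: scene 10-20s overlaps spoken match 15-25s
print(overlaps(30, 40, 15, 25))  # False: the intervals don't touch
print(overlaps(5, 50, 15, 25))   # True: the scene spans the whole match
```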
def simple_filter_scenes(time_ranges, scene_dicts):
    def is_in_range(scene, range_start, range_end):
        scene_start = scene["start"]
        scene_end = scene["end"]
        return (
            (range_start <= scene_start <= range_end)
            or (range_start <= scene_end <= range_end)
            or (scene_start <= range_start and scene_end >= range_end)
        )

    filtered_scenes = []
    for start, end in time_ranges:
        filtered_scenes.extend(
            [scene for scene in scene_dicts if is_in_range(scene, start, end)]
        )

    # Remove duplicates while preserving order
    seen = set()
    return [
        scene
        for scene in filtered_scenes
        if not (tuple(scene.items()) in seen or seen.add(tuple(scene.items())))
    ]
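The de-duplication step above relies on a subtle idiom: `set.add` returns `None` (falsy), so the expression `x in seen or seen.add(x)` is falsy exactly the first time `x` appears and truthy on every repeat. A standalone toy check of the same trick, using made-up scene dicts:

```python
# `seen.add(key)` returns None, so the `or` chain is falsy only on first sight,
# which keeps the first occurrence of each dict and drops later duplicates.
items = [{"id": 1}, {"id": 2}, {"id": 1}]
seen = set()
unique = [
    d for d in items
    if not (tuple(d.items()) in seen or seen.add(tuple(d.items())))
]
print(unique)  # [{'id': 1}, {'id': 2}]
```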
from videodb import IndexType, SearchType


def search_pipeline(query, video):
    # Search the query in the spoken word index
    search_result = video.search(
        query=query,
        index_type=IndexType.spoken_word,
        search_type=SearchType.keyword,
    )
    time_ranges = [(shot.start, shot.end) for shot in search_result.get_shots()]

    # Scene index created in Step 3
    scenes = scene_index

    for scene in scenes:
        scene["start"] = float(scene["start"])
        scene["end"] = float(scene["end"])

    # Filter scenes on the basis of the spoken-word results
    final_result = simple_filter_scenes(time_ranges, scenes)

    # Return scene descriptions and video timelines of the result
    result_text = "\n\n".join(
        result_entry["description"]
        for result_entry in final_result
        if result_entry.get("description", "").lower().strip() != "none"
    )
    result_timeline = [
        (result_entry.get("start"), result_entry.get("end"))
        for result_entry in final_result
    ]

    return result_text, result_timeline

👀 Step 5: Viewing the Search Results

Finally, let's use our search pipeline:
from videodb import play_stream

query = "hard and fast rule"

result_text, result_timeline = search_pipeline(query, video)

stream_link = video.generate_stream(result_timeline)
play_stream(stream_link)

print(result_text)
It returns scenes where the spoken words match your query, along with the content of any slides visible in those scenes.
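The assembly of `result_text` can also be seen in isolation: scene descriptions equal to "None" (the prompt's signal that no slide was visible) are dropped before joining. A toy sketch with made-up scene dicts:

```python
# Made-up filtered scenes, in the shape the pipeline produces
final_result = [
    {"start": 10.0, "end": 12.5, "description": "Slide: Testing pyramid"},
    {"start": 12.5, "end": 15.0, "description": "None"},
    {"start": 15.0, "end": 18.0, "description": "Slide: Hard and fast rules"},
]

# Drop "None" descriptions (scenes with no visible slide), then join the rest
result_text = "\n\n".join(
    e["description"] for e in final_result
    if e.get("description", "").lower().strip() != "none"
)
print(result_text)
```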


Here’s the result for this particular search query “hard and fast rule”:

The content written on the slide is:


"IT'S ALL IN THE DETAILS"