Quick Example

from videodb import SceneExtractionType

# Index spoken content
video.index_spoken_words()

# Index visual content with extraction strategy
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 5, "frame_count": 3},
    prompt="Describe the scene, people, and any visible text"
)

Extraction Strategies

Time-Based Extraction

Split the video into fixed-length intervals. Simple and predictable.
from videodb import SceneExtractionType

video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={
        "time": 10,           # Scene length in seconds
        "frame_count": 2      # Frames to analyze per scene
    },
    prompt="Describe what's happening"
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| time | int | 10 | Scene interval in seconds |
| frame_count | int | 1 | Frames analyzed per scene |
| select_frames | list | ["first"] | Which frames to use: "first", "middle", "last" |
Use either frame_count or select_frames, not both.
Best for:
  • Surveillance and monitoring
  • Live streams
  • Content with no clear scene boundaries
  • Consistent sampling across long videos
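
Because time-based extraction is deterministic, the indexing workload is easy to estimate up front. A minimal sketch (the helper name is ours, not part of the SDK):

```python
import math

def estimate_time_based_scenes(duration_s: float, time: int = 10, frame_count: int = 1):
    """Estimate how many scenes and frames time-based extraction produces."""
    scenes = math.ceil(duration_s / time)   # one scene per fixed interval
    frames = scenes * frame_count           # frames sent to the vision model
    return scenes, frames

# A 10-minute video at 10-second intervals, 2 frames per scene:
scenes, frames = estimate_time_based_scenes(600, time=10, frame_count=2)
# → 60 scenes, 120 frames analyzed
```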

Shot-Based Extraction

Detect visual transitions (cuts, fades) to identify natural scene boundaries.
from videodb import SceneExtractionType

video.index_scenes(
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={
        "threshold": 20,      # Sensitivity (lower = more sensitive)
        "frame_count": 1      # Frames per detected shot
    },
    prompt="Describe the scene"
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | int | 20 | Detection sensitivity (lower = more sensitive) |
| frame_count | int | 1 | Frames analyzed per detected shot |
Best for:
  • Movies and TV shows
  • Edited content with clear cuts
  • Music videos
  • Commercials
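
To build intuition for the threshold parameter: shot detection conceptually compares consecutive frames and places a boundary wherever the difference exceeds the threshold. A toy illustration over scalar frame differences (not VideoDB's actual detection algorithm):

```python
def detect_cuts(frame_diffs, threshold=20):
    """Return indices where the inter-frame difference exceeds the threshold.
    Lower thresholds flag smaller changes, so they produce more cuts."""
    return [i for i, diff in enumerate(frame_diffs) if diff > threshold]

diffs = [3, 5, 42, 4, 2, 61, 7]   # e.g. mean pixel difference per frame pair
detect_cuts(diffs, threshold=20)  # → [2, 5]: two shot boundaries
detect_cuts(diffs, threshold=4)   # more sensitive: also picks up small changes
```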

Prompt Engineering

The prompt shapes what gets extracted. Think of it as telling the vision model what to look for.

Basic Prompts

# General description
prompt = "Describe what's happening in this scene"

# Object-focused
prompt = "Identify all objects and people visible"

# Action-focused
prompt = "Describe the activities and movements"

Domain-Specific Prompts

# Retail / E-commerce
video.index_scenes(
    prompt="Identify products, brands, and pricing visible on screen"
)

# Sports
video.index_scenes(
    prompt="Describe the play, players involved, and outcome"
)

# Security
video.index_scenes(
    prompt="Identify people, vehicles, and any unusual activity"
)

# Education
video.index_scenes(
    prompt="Describe the topic being taught and any diagrams or text shown"
)

Structured Output Prompts

Guide the model to produce consistent, parseable output:
prompt = """
Describe this scene with the following structure:
- Setting: Where is this taking place?
- People: Who is present and what are they doing?
- Objects: What notable items are visible?
- Action: What is happening?
"""

Frame Selection Strategy

More frames = more detail but higher cost. Choose based on your content.

Static Content (1 frame)

For content where a single frame captures the scene:
# One frame is enough for static shots
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 10, "frame_count": 1},
    prompt="Describe the scene"
)

Motion and Activity (3-5 frames)

For understanding movement and temporal changes:
# Multiple frames to capture motion
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 5, "frame_count": 5},
    prompt="Describe the activity and how it progresses"
)

Key Moment Selection

Select specific frames within each scene:
# First and last frames only
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 10, "select_frames": ["first", "last"]},
    prompt="Describe how the scene changes from start to end"
)

Combining Modalities

Index both spoken and visual content, then search across both:
from videodb import IndexType

# Index both modalities
video.index_spoken_words()
video.index_scenes(prompt="Describe the visual content")

# Search spoken content
spoken_results = video.search(
    query="discusses climate change",
    index_type=IndexType.spoken_word
)

# Search visual content
visual_results = video.search(
    query="shows melting glaciers",
    index_type=IndexType.scene
)
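
One way to combine the two result sets is to intersect their time ranges, keeping only the moments where the narration and the visuals agree. A sketch over plain (start, end) tuples in seconds; extracting those tuples from the SDK's result objects is assumed, not shown:

```python
def overlapping_ranges(spoken, visual):
    """Intersect two lists of (start, end) time ranges in seconds."""
    overlaps = []
    for s_start, s_end in spoken:
        for v_start, v_end in visual:
            start, end = max(s_start, v_start), min(s_end, v_end)
            if start < end:  # ranges actually overlap
                overlaps.append((start, end))
    return overlaps

spoken = [(10, 40), (90, 120)]   # e.g. matches for "discusses climate change"
visual = [(30, 60), (100, 110)]  # e.g. matches for "shows melting glaciers"
overlapping_ranges(spoken, visual)  # → [(30, 40), (100, 110)]
```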

Extraction Examples

Traffic Monitoring

# Detect vehicle colors (single frame sufficient)
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 1, "frame_count": 1},
    prompt="Identify the color and type of each vehicle"
)

# Detect stopped vehicles (need multiple frames)
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 4, "frame_count": 5},
    prompt="Identify if any vehicle has stopped or is moving slowly"
)
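
The second config needs multiple frames because "stopped" is a property of change over time, not of any single frame. A toy illustration of the idea using per-frame positions (not what the vision model does internally):

```python
def is_stopped(positions, tolerance=2.0):
    """A vehicle is 'stopped' if its position barely changes across sampled frames."""
    return max(positions) - min(positions) <= tolerance

is_stopped([100.0, 100.5, 101.0, 100.8, 100.3])  # → True: barely moved
is_stopped([100.0, 112.0, 124.5, 137.0, 149.0])  # → False: moving steadily
```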

Educational Content

# Combine visual and spoken indexing
video.index_spoken_words()

video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 30, "select_frames": ["first", "middle", "last"]},
    prompt="Describe diagrams, equations, or visual aids shown"
)

Next Steps