Quick Example
from videodb import SceneExtractionType
# Assumes `video` is an existing videodb Video object
# Index spoken content
video.index_spoken_words()
# Index visual content with an extraction strategy
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 5, "frame_count": 3},
    prompt="Describe the scene, people, and any visible text",
)
Time-Based Extraction
Split the video into fixed-length intervals. Simple and predictable.
from videodb import SceneExtractionType
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={
        "time": 10,        # Scene length in seconds
        "frame_count": 2,  # Frames to analyze per scene
    },
    prompt="Describe what's happening",
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| time | int | 10 | Interval in seconds |
| frame_count | int | 1 | Frames per scene |
| select_frames | list | ["first"] | Which frames: "first", "middle", "last" |
Use either frame_count or select_frames, not both.
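The either/or rule above can be checked before calling index_scenes. The following is a minimal sketch, assuming a hypothetical helper named build_time_config (it is not part of videodb) that assembles the extraction_config dictionary and rejects invalid combinations:

```python
def build_time_config(time=10, frame_count=None, select_frames=None):
    """Build an extraction_config dict for time-based extraction.

    Hypothetical helper, not part of videodb. It only enforces the rule
    that frame_count and select_frames are mutually exclusive.
    """
    if frame_count is not None and select_frames is not None:
        raise ValueError("Use either frame_count or select_frames, not both")

    config = {"time": time}
    if select_frames is not None:
        # Only the positions listed in the parameter table are accepted.
        allowed = {"first", "middle", "last"}
        unknown = [name for name in select_frames if name not in allowed]
        if unknown:
            raise ValueError(f"Unknown frame positions: {unknown}")
        config["select_frames"] = list(select_frames)
    elif frame_count is not None:
        config["frame_count"] = frame_count
    # If neither is given, the library defaults apply.
    return config
```

The returned dictionary can then be passed as extraction_config, for example build_time_config(time=5, frame_count=3).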
Best for:
- Surveillance and monitoring
- Live streams
- Content with no clear scene boundaries
- Consistent sampling across long videos
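To make the fixed-interval idea concrete, here is a small sketch of how a video's duration could be divided into equal windows. The function fixed_intervals is hypothetical and only illustrates the concept; in particular, how videodb handles a final partial interval is an assumption here (the sketch shortens it to fit the video's length).

```python
def fixed_intervals(duration, time=10):
    """Return (start, end) windows of `time` seconds covering `duration`.

    Illustrative only: the last window is shortened to end at `duration`.
    """
    if time <= 0:
        raise ValueError("time must be positive")
    windows = []
    start = 0
    while start < duration:
        windows.append((start, min(start + time, duration)))
        start += time
    return windows


# A 25-second video with 10-second scenes yields three windows.
print(fixed_intervals(25, time=10))  # [(0, 10), (10, 20), (20, 25)]
```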
Shot-Based Extraction
Detect visual transitions (cuts, fades) to identify natural scene boundaries.
from videodb import SceneExtractionType
video.index_scenes(
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={
        "threshold": 20,   # Sensitivity (lower = more sensitive)
        "frame_count": 1,  # Frames per detected shot
    },
    prompt="Describe the scene",
)
| Parameter | Type | Default | Description |
|---|---|---|---|
| threshold | int | 20 | Detection sensitivity |
| frame_count | int | 1 | Frames per shot |
Best for:
- Movies and TV shows
- Edited content with clear cuts
- Music videos
- Commercials
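Why a lower threshold is more sensitive can be seen in a toy model of shot detection. This sketch is not videodb's detector: the function detect_cuts and the frame-difference scores are invented for illustration, assuming only that a cut is reported wherever the change between consecutive frames exceeds the threshold.

```python
def detect_cuts(frame_diffs, threshold=20):
    """Return the indices where the frame-to-frame change exceeds threshold.

    Toy model: frame_diffs[i] is a made-up score for how much frame i+1
    differs from frame i.
    """
    return [i for i, diff in enumerate(frame_diffs) if diff > threshold]


# Hypothetical difference scores for nine consecutive frames.
diffs = [2, 3, 45, 4, 18, 5, 60, 3]

print(detect_cuts(diffs, threshold=20))  # [2, 6]
print(detect_cuts(diffs, threshold=10))  # [2, 4, 6]
```

Lowering the threshold from 20 to 10 turns the milder change at index 4 into an extra boundary, so the same footage is split into more shots.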
Prompt Engineering
The prompt shapes what gets extracted. Think of it as telling the vision model what to look for.
Basic Prompts
# General description
prompt = "Describe what's happening in this scene"
# Object-focused
prompt = "Identify all objects and people visible"
# Action-focused
prompt = "Describe the activities and movements"
Domain-Specific Prompts
# Retail / E-commerce
video.index_scenes(
    prompt="Identify products, brands, and pricing visible on screen"
)
# Sports
video.index_scenes(
    prompt="Describe the play, players involved, and outcome"
)
# Security
video.index_scenes(
    prompt="Identify people, vehicles, and any unusual activity"
)
# Education
video.index_scenes(
    prompt="Describe the topic being taught and any diagrams or text shown"
)
Structured Output Prompts
Guide the model to produce consistent, parseable output:
prompt = """
Describe this scene with the following structure:
- Setting: Where is this taking place?
- People: Who is present and what are they doing?
- Objects: What notable items are visible?
- Action: What is happening?
"""
Frame Selection Strategy
More frames per scene give the vision model more detail but raise processing cost. Choose the frame count based on your content.
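A rough way to compare configurations is to count how many frames each one sends to the vision model. This is a back-of-the-envelope sketch for time-based extraction; estimate_frames is a hypothetical helper, and it assumes cost scales with the number of frames analyzed, which may not match actual pricing.

```python
import math


def estimate_frames(duration, time, frame_count):
    """Estimate total frames analyzed for time-based extraction.

    duration and time are in seconds; a final partial interval is
    counted as a full scene.
    """
    scenes = math.ceil(duration / time)
    return scenes * frame_count


# A 10-minute video under two configurations.
print(estimate_frames(600, time=10, frame_count=1))  # 60
print(estimate_frames(600, time=5, frame_count=5))   # 600
```

Halving the interval and using five frames per scene analyzes ten times as many frames for the same video.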
Static Content (1 frame)
For content where a single frame captures the scene:
# One frame is enough for static shots
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 10, "frame_count": 1},
    prompt="Describe the scene",
)
Motion and Activity (3-5 frames)
For understanding movement and temporal changes:
# Multiple frames to capture motion
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 5, "frame_count": 5},
    prompt="Describe the activity and how it progresses",
)
Key Moment Selection
Select specific frames within each scene:
# First and last frames only
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 10, "select_frames": ["first", "last"]},
    prompt="Describe how the scene changes from start to end",
)
Combining Modalities
Index both spoken and visual content, then search each index separately:
from videodb import IndexType
# Index both modalities
video.index_spoken_words()
video.index_scenes(prompt="Describe the visual content")
# Search spoken content
spoken_results = video.search(
    query="discusses climate change",
    index_type=IndexType.spoken_word,
)
# Search visual content
visual_results = video.search(
    query="shows melting glaciers",
    index_type=IndexType.scene,
)
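Because the two searches return separate result sets, combining them is left to the caller. One possible approach, sketched here under the assumption that each result has already been reduced to a list of (start, end) times in seconds, is to merge overlapping ranges into a single timeline. The function merge_ranges and the sample timestamps are illustrative and not part of videodb.

```python
def merge_ranges(*range_lists):
    """Merge overlapping (start, end) ranges from any number of lists."""
    all_ranges = sorted(r for ranges in range_lists for r in ranges)
    merged = []
    for start, end in all_ranges:
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical match times from the spoken and visual searches.
spoken = [(10, 25), (90, 100)]
visual = [(20, 40), (200, 210)]
print(merge_ranges(spoken, visual))  # [(10, 40), (90, 100), (200, 210)]
```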
Traffic Monitoring
# Detect vehicle colors (single frame sufficient)
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 1, "frame_count": 1},
    prompt="Identify the color and type of each vehicle",
)
# Detect stopped vehicles (need multiple frames)
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 4, "frame_count": 5},
    prompt="Identify if any vehicle has stopped or is moving slowly",
)
Educational Content
# Combine visual and spoken indexing
video.index_spoken_words()
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 30, "select_frames": ["first", "middle", "last"]},
    prompt="Describe diagrams, equations, or visual aids shown",
)
Next Steps