Indexes turn raw video into structured, searchable data. Create a spoken word index for dialogue and narration, or a scene index for visual content.

Quick Example

import videodb

conn = videodb.connect()
coll = conn.get_collection()
video = coll.get_video("m-xxx")

# Index spoken content (dialogue, narration)
video.index_spoken_words()

# Index visual content (scenes, objects, actions)
scene_index_id = video.index_scenes(
    prompt="Describe what's happening in the scene"
)

# Search both
results = video.search("car chase through the city")
results.play()

Spoken Word Index

A spoken word index transcribes audio into timestamped text using automatic speech recognition (ASR).
video.index_spoken_words()
What it captures:
  • Dialogue and conversations
  • Narration and voiceovers
  • Lectures and presentations
  • Interviews and podcasts

Language Support

Major languages are auto-detected. For others, pass the language code:
# Auto-detect (English, Spanish, French, German, Italian, Portuguese, Dutch)
video.index_spoken_words()

# Explicit language code
video.index_spoken_words(language_code="hi")  # Hindi
video.index_spoken_words(language_code="ja")  # Japanese
video.index_spoken_words(language_code="zh")  # Chinese
Language              Code
English (Global)      en
English (US/UK/AU)    en_us, en_uk, en_au
Spanish               es
French                fr
German                de
Hindi                 hi
Japanese              ja
Chinese               zh
Korean                ko
Russian               ru
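If language names come from user input or file metadata, it can help to map them to a code before calling index_spoken_words(). The sketch below is a minimal, hypothetical helper: the codes are copied from the table above, while resolve_language_code and its error behavior are assumptions and not part of the videodb SDK. The table may not be exhaustive, so a rejected value only means the code is not listed here.

```python
# Codes copied from the table above. The helper itself is hypothetical,
# not a videodb SDK function.
LANGUAGE_CODES = {
    "english": "en",
    "spanish": "es",
    "french": "fr",
    "german": "de",
    "hindi": "hi",
    "japanese": "ja",
    "chinese": "zh",
    "korean": "ko",
    "russian": "ru",
}
REGIONAL_ENGLISH = {"en_us", "en_uk", "en_au"}


def resolve_language_code(value: str) -> str:
    """Map a language name or code to a code listed in the table.

    Raises ValueError when the value is not listed.
    """
    key = value.strip().lower().replace("-", "_")
    if key in LANGUAGE_CODES:
        return LANGUAGE_CODES[key]
    if key in LANGUAGE_CODES.values() or key in REGIONAL_ENGLISH:
        return key
    raise ValueError(f"Language not listed in the table: {value!r}")
```

The result can then be passed along, for example video.index_spoken_words(language_code=resolve_language_code("Hindi")).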
Russianru

Scene Index

A scene index analyzes video frames using vision models to describe visual content.
scene_index_id = video.index_scenes(
    prompt="Describe the scene in detail"
)
What it captures:
  • Objects and people
  • Actions and activities
  • Environments and settings
  • Visual transitions

Prompt Shapes the Index

The prompt you provide determines what gets indexed:
# Focus on people
video.index_scenes(prompt="Describe the people and their actions")

# Focus on environment
video.index_scenes(prompt="Describe the location and setting")

# Focus on specific objects
video.index_scenes(prompt="Identify all vehicles and their colors")
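Because each prompt produces a differently focused index, one way to compare them is to create one index per prompt and keep the returned IDs together. The sketch below assumes only what the examples above show, namely that index_scenes(prompt=...) returns a scene index ID. index_with_prompts and _DryRunVideo are hypothetical names; the stand-in class exists only so the snippet runs without an API key.

```python
def index_with_prompts(video, prompts):
    """Create one scene index per prompt and return {label: scene_index_id}."""
    index_ids = {}
    for label, prompt in prompts.items():
        index_ids[label] = video.index_scenes(prompt=prompt)
    return index_ids


class _DryRunVideo:
    """Offline stand-in for a video object. NOT part of the videodb SDK."""

    def __init__(self):
        self.calls = []

    def index_scenes(self, prompt):
        # Record the prompt and hand back a fake ID instead of calling the API.
        self.calls.append(prompt)
        return f"dry-run-index-{len(self.calls)}"


PROMPTS = {
    "people": "Describe the people and their actions",
    "environment": "Describe the location and setting",
    "vehicles": "Identify all vehicles and their colors",
}

index_ids = index_with_prompts(_DryRunVideo(), PROMPTS)
print(index_ids)
```

With a real video object in place of _DryRunVideo(), the returned dictionary gives a labeled ID for each prompt-specific index.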

Extraction Configuration

Control how frames are sampled by choosing between frame segmentation (time-based, at regular intervals) and scene segmentation (shot-based, at automatically detected transitions).
[Figure: Comparison of frame segmentation and scene segmentation extraction types]
from videodb import SceneExtractionType

# Time-based: every N seconds
video.index_scenes(
    extraction_type=SceneExtractionType.time_based,
    extraction_config={"time": 10, "frame_count": 2},
    prompt="Describe the scene"
)

[Figure: Time-based extraction example showing consistent frame sampling at regular intervals]

# Shot-based: detect visual transitions
video.index_scenes(
    extraction_type=SceneExtractionType.shot_based,
    extraction_config={"threshold": 20, "frame_count": 1},
    prompt="Describe the scene"
)
Method        Best For
Time-based    Consistent sampling, dynamic content
Shot-based    Edited videos with clear scene changes
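To get a feel for how many windows and frames a time-based configuration produces, the sampling can be approximated locally. This is an illustrative sketch only: it assumes "time" is the window length in seconds and "frame_count" is the number of frames taken per window, and it spaces frames evenly inside each window. How VideoDB actually selects frames within a window is not stated here, and time_based_windows is a hypothetical helper.

```python
def time_based_windows(duration, time, frame_count):
    """Split [0, duration) into windows of `time` seconds.

    Returns a list of (start, end, frame_timestamps) tuples, with
    `frame_count` evenly spaced timestamps per window (an assumption,
    not VideoDB's documented behavior).
    """
    if time <= 0 or frame_count < 1:
        raise ValueError("time must be > 0 and frame_count must be >= 1")
    duration = float(duration)
    windows = []
    start = 0.0
    while start < duration:
        end = min(start + time, duration)
        step = (end - start) / frame_count
        # Take the midpoint of each of the frame_count sub-intervals.
        frames = [round(start + step * (i + 0.5), 3) for i in range(frame_count)]
        windows.append((start, end, frames))
        start = end
    return windows


# Same values as the time-based example: {"time": 10, "frame_count": 2}
for start, end, frames in time_based_windows(25, time=10, frame_count=2):
    print(f"{start:>5.1f}-{end:<5.1f} frames at {frames}")
```

Under these assumptions, the total frame count is roughly the video duration divided by "time", multiplied by "frame_count", which is a useful estimate when trading index detail against processing time.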

Managing Indexes

List All Scene Indexes

indexes = video.list_scene_index()
for idx in indexes:
    print(f"{idx.id}: {idx.name} - {idx.status}")
[Figure: List of scene indexes showing id, name, and status]

Get Index Details

scene_index = video.get_scene_index(scene_index_id)
for scene in scene_index:
    print(f"{scene.start}-{scene.end}: {scene.description}")
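Once scene records are available, a small amount of post-processing makes them easier to scan. The sketch below assumes each record exposes start, end, and description, as in the loop above. The Scene dataclass is a hypothetical stand-in with sample data so the snippet runs offline, and format_timestamp and find_scenes are illustrative helpers, not SDK functions.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    """Hypothetical stand-in mirroring the fields used in the loop above."""
    start: float
    end: float
    description: str


def format_timestamp(seconds):
    """Render a second offset as MM:SS."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def find_scenes(scenes, keyword):
    """Return scenes whose description contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [s for s in scenes if needle in s.description.lower()]


# Made-up sample records, for illustration only.
sample = [
    Scene(0, 10, "A red car drives through a city street"),
    Scene(10, 75, "Two people talk inside a cafe"),
    Scene(75, 90, "The car stops at a traffic light"),
]

for scene in find_scenes(sample, "car"):
    print(f"{format_timestamp(scene.start)}-{format_timestamp(scene.end)}: {scene.description}")
```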

Delete an Index

video.delete_scene_index(scene_index_id)

Async Processing with Callbacks

For long videos, use callbacks to get notified when indexing completes:
scene_index_id = video.index_scenes(
    prompt="Describe the scene",
    callback_url="https://your-backend.com/webhooks/index-complete"
)
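On the receiving side, the callback URL needs an endpoint that accepts the notification. The sketch below is a minimal receiver built on Python's standard library. It assumes the callback arrives as an HTTP POST with a JSON body; the payload fields are not described here, so the handler only logs whatever it receives. parse_callback, IndexCallbackHandler, and serve are hypothetical names.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def parse_callback(body: bytes) -> dict:
    """Decode a JSON callback body; return {} if it is not a JSON object."""
    try:
        payload = json.loads(body.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return {}
    return payload if isinstance(payload, dict) else {}


class IndexCallbackHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Assumption: the notification is a POST request with a JSON body.
        length = int(self.headers.get("Content-Length", 0))
        payload = parse_callback(self.rfile.read(length))
        # The payload schema is not documented here, so log it as-is.
        print("Indexing callback received:", payload)
        self.send_response(200)
        self.end_headers()


def serve(port: int = 8000) -> None:
    """Run the receiver until interrupted."""
    HTTPServer(("", port), IndexCallbackHandler).serve_forever()
```

Calling serve() starts the listener. In production this would typically sit behind HTTPS and verify that the request actually came from the indexing service before acting on it.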
