MP4 was designed in 1998. Its job is simple: pack frames and audio into a file that plays sequentially from start to finish. That’s perfect for Netflix. It’s terrible for AI.

What MP4 Gives You

An MP4 file is a container. Inside:
  • Compressed video frames (H.264, H.265, etc.)
  • Compressed audio tracks (AAC, MP3, etc.)
  • Timing information for synchronization
  • Metadata (duration, resolution, codec info)
To access any content, you:
  1. Decode the video stream
  2. Extract individual frames
  3. Process each frame through your model
  4. Repeat for every second of footage
This works for short clips. It falls apart at scale.
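For a sense of scale, here is a minimal sketch of that loop using OpenCV; the file name is a stand-in for your own footage, and the per-frame model call is left as a comment:
import cv2

cap = cv2.VideoCapture("meeting.mp4")  # hypothetical file name
n_frames = 0
while True:
    ok, frame = cap.read()  # decode one compressed frame into pixels
    if not ok:
        break
    n_frames += 1  # a vision-model call would go here, once per frame
cap.release()
print(n_frames)  # roughly 108,000 for an hour at 30fps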

The Problem with Frames

Say you have a 1-hour video at 30fps. That’s 108,000 frames. To answer “what happened at 23:45?”, your options are:
  1. Decode and process all 108,000 frames (expensive, slow)
  2. Sample frames and hope you don’t miss anything (lossy, unreliable)
  3. Process in real-time as the video plays (1 hour to process 1 hour)
None of these let you instantly query the content. Compare this to a database:
SELECT * FROM meetings WHERE topic = 'pricing' AND timestamp > '23:40'
Instant. Indexed. Queryable. MP4 doesn’t give you this. It gives you a blob.
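To be precise, MP4 does support random access by time; what it lacks is random access by content. A quick OpenCV sketch of the distinction, with the same hypothetical file:
import cv2

cap = cv2.VideoCapture("meeting.mp4")
cap.set(cv2.CAP_PROP_POS_MSEC, (23 * 60 + 45) * 1000)  # seek to 23:45
ok, frame = cap.read()
cap.release()
# You now hold the pixels at 23:45, but nothing that says what they show.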

What AI Actually Needs

AI agents don’t watch videos. They query them. The questions agents ask:
  • “What was said about the budget?”
  • “Show me the moment the error appeared on screen”
  • “When did the person enter the frame?”
  • “What happened between 10:30 and 10:45?”
These are queries, not playback commands. They need:
Capability                 MP4        What AI Needs
Random access by content   No         Yes
Semantic search            No         Yes
Timestamped results        Limited    Precise
Multi-index queries        No         Yes
Instant answers            No         Yes

The Transcoding Trap

The common workaround: transcode everything.
  1. Extract all frames
  2. Run each through a vision model
  3. Store the descriptions in a vector database
  4. Query the database
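Sketched in code, with a stub describe() standing in for a real captioning model and a plain list standing in for a vector database:
import cv2

def describe(frame):
    # Stand-in for a vision/captioning model call, not a real API
    return f"mean brightness {frame.mean():.1f}"

records = []  # stand-in for a vector database
cap = cv2.VideoCapture("meeting.mp4")
while True:
    ok, frame = cap.read()
    if not ok:
        break
    t = cap.get(cv2.CAP_PROP_POS_MSEC) / 1000  # timestamp in seconds
    records.append((t, describe(frame)))  # one model call per frame
cap.release()

# Only after all of that can you query, and you query text, not video
hits = [(t, d) for t, d in records if "brightness" in d]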
This “works” but:
  • Cost: Processing every frame is expensive
  • Latency: Hours of processing before you can query
  • Storage: Frame embeddings multiply storage costs
  • Staleness: Live content can’t be pre-processed
  • Loss: Descriptions lose visual fidelity
You’re converting video into text, then querying text. The video itself becomes a liability - something you keep around for playback but can’t actually use.

Indexes as the Right Primitive

What if the primitive wasn’t a file, but an index? An index is:
  • Prompt-defined - You specify what to extract
  • Timestamped - Every result maps to exact moments
  • Searchable - Natural language queries, instant results
  • Composable - Multiple indexes on the same media
  • Playable - Results link back to verifiable video
# Create an index with a prompt
index = video.index_scenes(prompt="Identify product demonstrations")

# Query it with natural language
results = video.search("demo of the new feature")

# Get timestamped, playable results
for shot in results.shots:
    print(f"{shot.start}s: {shot.text}")
    shot.play()  # Verify by watching
The video file still exists. But you don’t interact with it directly. You interact with indexes - semantic layers that make the content queryable.

Multiple Perspectives

The power of indexes: you can create multiple indexes on the same video.
# Same video, different questions
safety_index = video.index_scenes(prompt="Identify safety violations")
activity_index = video.index_scenes(prompt="Track person movements")
text_index = video.index_scenes(prompt="Extract on-screen text")
Each index is a different lens on the same content. Query them separately or together. Try doing that with an MP4.

Beyond Files

The same approach works for live streams:
rtstream.index_visuals(prompt="Describe what user is doing")
rtstream.start_transcript()
No files, no pre-processing, no waiting. Indexes build in real-time as media flows.

The Shift

Old Model                  New Model
File is the primitive      Index is the primitive
Process then query         Query without processing
Static, batch              Dynamic, real-time
One representation         Multiple perspectives
Playback-oriented          Query-oriented
MP4 isn’t going away. But for AI, it’s the wrong level of abstraction.
