What MP4 Gives You
An MP4 file is a container. Inside:
- Compressed video frames (H.264, H.265, etc.)
- Compressed audio tracks (AAC, MP3, etc.)
- Timing information for synchronization
- Metadata (duration, resolution, codec info)
To get at the content, you have to:
- Decode the video stream
- Extract individual frames
- Process each frame through your model
- Repeat for every second of footage
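To make the "container" point concrete, here is a minimal sketch that walks the top-level boxes of an MP4 byte string. It relies on the box header layout of the ISO base media file format (a 32-bit big-endian size followed by a four-character type). The function names and the synthetic sample are illustrative assumptions, not part of any particular library.

```python
import struct
from typing import Iterator, Tuple


def iter_top_level_boxes(data: bytes) -> Iterator[Tuple[str, int, int]]:
    """Yield (box_type, offset, size) for each top-level box in MP4 bytes."""
    offset = 0
    while offset + 8 <= len(data):
        # Every box starts with a 32-bit big-endian size and a 4-character type.
        size, raw_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # size == 1 means a 64-bit size follows the type field.
            if offset + 16 > len(data):
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # size == 0 means the box runs to the end of the data.
            size = len(data) - offset
        if size < header:
            break  # malformed box; stop rather than loop forever
        yield raw_type.decode("latin-1"), offset, size
        offset += size


def make_box(box_type: bytes, payload: bytes) -> bytes:
    """Build a minimal box (hypothetical helper, used only for the demo)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


if __name__ == "__main__":
    # Synthetic two-box sample, not a playable file.
    sample = make_box(b"ftyp", b"isom\x00\x00\x02\x00") + make_box(b"mdat", b"\x00" * 16)
    for box_type, offset, size in iter_top_level_boxes(sample):
        print(f"{box_type} at offset {offset}, {size} bytes")
```

The walker tells you where the bytes live and what kind of box holds them. It says nothing about what is in the picture or what was said, which is the gap the rest of this piece is about.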
The Problem with Frames
Say you have a 1-hour video at 30 fps. That’s 108,000 frames. To answer “what happened at 23:45?”, your options are:
- Decode and process all 108,000 frames (expensive, slow)
- Sample frames and hope you don’t miss anything (lossy, unreliable)
- Process in real time as the video plays (1 hour to process 1 hour)
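The arithmetic behind those numbers, as a quick sketch. The function names are hypothetical, and the 5-second sampling interval in the usage note is an assumed figure chosen only to illustrate the trade-off.

```python
def frame_count(duration_s: float, fps: float) -> int:
    """Total frames in a clip of the given duration."""
    return round(duration_s * fps)


def timestamp_to_frame(timestamp: str, fps: float) -> int:
    """Convert 'MM:SS' or 'HH:MM:SS' to a zero-based frame index."""
    seconds = 0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + int(part)
    return round(seconds * fps)


def sampled_count(duration_s: int, every_s: int) -> int:
    """Frames processed when keeping one frame every `every_s` seconds."""
    return len(range(0, duration_s, every_s))


def can_be_missed(event_duration_s: float, every_s: float) -> bool:
    """An event shorter than the sampling interval can fall between samples."""
    return event_duration_s < every_s
```

For the 1-hour, 30 fps example, `frame_count(3600, 30)` gives the 108,000 frames mentioned above, and `timestamp_to_frame("23:45", 30)` locates the frame at 23:45. Sampling every 5 seconds cuts the work to `sampled_count(3600, 5)` frames, but `can_be_missed(2, 5)` shows why a 2-second event may never be seen.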
What AI Actually Needs
AI agents don’t watch videos. They query them. The questions agents ask:
- “What was said about the budget?”
- “Show me the moment the error appeared on screen”
- “When did the person enter the frame?”
- “What happened between 10:30 and 10:45?”
| Capability | MP4 | What AI Needs |
|---|---|---|
| Random access by content | No | Yes |
| Semantic search | No | Yes |
| Timestamped results | Limited | Precise |
| Multi-index queries | No | Yes |
| Instant answers | No | Yes |
The Transcoding Trap
The common workaround: transcode everything.
- Extract all frames
- Run each through a vision model
- Store the descriptions in a vector database
- Query the database
It works, but at a price:
- Cost: Processing every frame is expensive
- Latency: Hours of processing before you can query
- Storage: Frame embeddings multiply storage costs
- Staleness: Live content can’t be pre-processed
- Loss: Descriptions lose visual fidelity
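A back-of-the-envelope sketch of the latency and storage points. The per-frame inference time and the embedding size are assumptions picked for illustration, not measurements of any real model, and the names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PipelineEstimate:
    frames: int
    processing_hours: float
    embedding_bytes: int


def estimate_pipeline(
    duration_s: float,
    fps: float,
    seconds_per_frame: float,  # assumed vision-model time per frame
    embedding_dim: int,        # assumed embedding width
    bytes_per_value: int = 4,  # float32
) -> PipelineEstimate:
    """Estimate the cost of describing and embedding every frame."""
    frames = round(duration_s * fps)
    processing_hours = frames * seconds_per_frame / 3600
    embedding_bytes = frames * embedding_dim * bytes_per_value
    return PipelineEstimate(frames, processing_hours, embedding_bytes)


if __name__ == "__main__":
    # 1 hour at 30 fps, assuming 0.1 s per frame and 768-dim float32 embeddings.
    est = estimate_pipeline(3600, 30, seconds_per_frame=0.1, embedding_dim=768)
    print(f"{est.frames} frames")
    print(f"{est.processing_hours:.1f} hours of sequential processing")
    print(f"{est.embedding_bytes / 1e6:.0f} MB of embeddings")
```

Under those assumed figures, one hour of footage takes about three hours to process sequentially and produces a few hundred megabytes of embeddings, before a single query has been answered.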
Indexes as the Right Primitive
What if the primitive wasn’t a file, but an index? An index is:
- Prompt-defined - You specify what to extract
- Timestamped - Every result maps to exact moments
- Searchable - Natural language queries, instant results
- Composable - Multiple indexes on the same media
- Playable - Results link back to verifiable video
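One way to picture these properties is as a small data structure. This is a sketch under stated assumptions: every class and function name is hypothetical, the entries are written by hand rather than produced by a model, and a plain keyword match stands in for semantic search. The `#t=start,end` suffix borrows the temporal syntax of W3C Media Fragments purely to illustrate "playable".

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Entry:
    start_s: int  # timestamped: every result maps to a span of the video
    end_s: int
    text: str


@dataclass
class Index:
    name: str
    prompt: str  # prompt-defined: what this index was asked to extract
    entries: List[Entry] = field(default_factory=list)

    def search(self, query: str) -> List[Entry]:
        # Keyword match as a stand-in for semantic search (assumption).
        words = query.lower().split()
        return [e for e in self.entries if all(w in e.text.lower() for w in words)]


def search_all(indexes: List[Index], query: str) -> List[Tuple[str, Entry]]:
    """Composable: query several indexes on the same media, ordered by time."""
    hits = [(ix.name, e) for ix in indexes for e in ix.search(query)]
    return sorted(hits, key=lambda pair: pair[1].start_s)


def playback_link(video_url: str, entry: Entry) -> str:
    """Playable: point back at the exact span so the result can be verified."""
    return f"{video_url}#t={entry.start_s},{entry.end_s}"


if __name__ == "__main__":
    spoken = Index("spoken", "Transcribe what is said", [
        Entry(630, 645, "We need to cut the budget by ten percent"),
    ])
    visual = Index("visual", "Describe what appears on screen", [
        Entry(700, 710, "A budget spreadsheet is shown"),
        Entry(1425, 1440, "An error dialog appears on screen"),
    ])
    for name, entry in search_all([spoken, visual], "budget"):
        print(name, playback_link("video.mp4", entry), entry.text)
```

The design choice worth noting is that the query never touches frames. It runs over timestamped entries, and the link back to the video is there for verification rather than for retrieval.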
Multiple Perspectives
The power of indexes: you can create multiple on the same video.
Beyond Files
The same model works for live streams.
The Shift
| Old Model | New Model |
|---|---|
| File is the primitive | Index is the primitive |
| Process then query | Query without processing |
| Static, batch | Dynamic, real-time |
| One representation | Multiple perspectives |
| Playback-oriented | Query-oriented |