Your agents can read text and static images. But the real world is live, continuous, and always changing. To operate with real context, your agent needs real-time access to video calls, camera feeds, screen recordings, and live internet streams. VideoDB is the perception layer that lets agents see, hear, remember, and act on continuous media.

Most AI development focuses on text and static images; video remains a significant hurdle because of its density and lack of structure. VideoDB turns raw pixel data into structured context that agents can query, reason about, and act on in real time. For agents to move beyond text boxes and interact with the physical or digital world via screens and cameras, they need a way to parse continuous visual and auditory data. VideoDB provides this through a specialized database that indexes video at the scene level, making it possible for an agent to “recall” specific events or “see” real-time occurrences without excessive compute cost.

Documentation Index
Fetch the complete documentation index at: https://docs.videodb.io/llms.txt
Use this file to discover all available pages before exploring further.
Quickstart
Give your agent perception in 5 minutes
Core Concepts
Understand the platform architecture
How It Works
The platform operates through three stages: See, Understand, and Act.

| Stage | What Happens |
|---|---|
| See | Capture SDK or live stream integration takes in media from files, desktops, or cameras |
| Understand | Build specialized indexes for transcripts, visual scenes, or custom prompts |
| Act | Query, search, edit, and export - agents can generate summaries or clips based on findings |
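
In SDK terms, the three stages map onto a handful of calls. The sketch below follows the Python quickstart pattern (`connect`, `upload`, `index_spoken_words`, `search`); the media URL and query are placeholders, and exact return shapes should be confirmed against the SDK reference.

```python
import videodb

# See: connect and ingest media (placeholder API key and URL)
conn = videodb.connect(api_key="YOUR_API_KEY")
video = conn.upload(url="https://example.com/standup.mp4")

# Understand: build a transcript index over the spoken audio
video.index_spoken_words()

# Act: query for moments and get back timestamped, playable results
result = video.search(query="decisions about the launch date")
for shot in result.get_shots():
    print(shot.start, shot.end, shot.text)
```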
Skills: Native Agent Experiences
Since VideoDB handles server-side video processing, indexing, and retrieval, developers can use skills to create agent workflows that feel native to their environment. Skills give agents like Claude Code and Codex structured perception primitives - capture, search, edit, stream - without writing infrastructure code.

What You Can Build
Desktop Agents
Stream screen, mic, and camera. Get real-time context about what the user is doing and saying.
Call.md →
Video Search
Search across hours of meetings, lectures, or archives. Get timestamped moments with playable evidence.
Multimodal Search →
Real-time Monitoring
Connect RTSP cameras and drones. Detect events as they happen. Trigger alerts and automations.
Intrusion Detection →
Media Automation
Compose videos with code. Generate voice, music, and images. Export to any format.
Faceless Video Creator →
Agent Skills
Add real-time perception to coding assistants and autonomous agents. Screen capture, audio indexing, and searchable context.
Agent Skills →
Browse All Examples
Explore examples across AI Copilots, Video Search, Live Intelligence, Content Factory, and more
Example: Real-time Alerting
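
The alerting flow looks roughly like the sketch below. It is a sketch under stated assumptions: the calls `connect_rtstream`, `index_scenes`, `create_event`, and `create_alert` are modeled on VideoDB's real-time stream pattern, but their exact names and signatures, the RTSP URL, and the callback endpoint are placeholders to verify against the SDK reference.

```python
# Hedged sketch of real-time alerting: connect a live camera, index scenes
# as they arrive, and fire a webhook when a matching event is detected.
# Call names and parameters below are assumptions; check the SDK docs.
import videodb

conn = videodb.connect(api_key="YOUR_API_KEY")
coll = conn.get_collection()

# See: attach a live RTSP feed (placeholder URL)
rtstream = coll.connect_rtstream(
    name="loading-dock-cam",
    url="rtsp://camera.local/stream",
)

# Understand: index incoming scenes with a detection prompt
scene_index = rtstream.index_scenes(
    prompt="Describe the scene. Note any person entering the frame.",
)

# Act: define the event and route matches to a webhook
event_id = conn.create_event(
    event_prompt="A person enters the restricted area",
    label="intrusion",
)
scene_index.create_alert(event_id, callback_url="https://example.com/alerts")
```

Because the alert arrives as a webhook at the callback URL, the agent can react to events as they happen instead of polling the stream.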
Install the SDK
Python SDK
GitHub, PyPI, and setup guide
Node.js SDK
npm, TypeScript, and setup guide
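
For the Python SDK, setup is typically `pip install videodb` followed by a quick connection check. A minimal sketch, assuming the API key is passed directly (the setup guide also covers supplying it via an environment variable):

```python
# Minimal install check for the Python SDK.
# Install first: pip install videodb
import videodb

conn = videodb.connect(api_key="YOUR_API_KEY")  # placeholder key
coll = conn.get_collection()                    # default collection
print(coll.get_videos())                        # list media already uploaded
```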
Philosophy
Why perception is the next frontier for AI agents.

Why AI Agents Are Blind Today
The gap between human perception and agent perception
Perception Is the Missing Layer
The stack that gives agents eyes and ears
MP4 Is the Wrong Primitive
Why video files don’t work for AI
What Episodic Memory Means for Agents
Remember experiences, not just facts