LLMs gave us reasoning. Vector databases gave us retrieval. Tool calling gave us action. But when you look at the modern agent stack, there’s a glaring gap: perception.

The Current Agent Stack

Here’s what a typical agent architecture looks like:
User Input (text)
    ↓
LLM (reasoning)
    ↓
Tools (retrieval, actions)
    ↓
Output (text)
Every layer is text-centric. Even “multimodal” models that accept images treat them as one-shot inputs - a single frame, processed once, discarded. There’s no:
  • Continuous media processing
  • Real-time event detection
  • Temporal understanding
  • Persistent perceptual memory

What Perception Actually Means

Perception isn’t just “can process an image.” Perception is:
  1. Continuous - Always on, not one-shot
  2. Temporal - Understands time, sequences, causality
  3. Multi-source - Video, audio, screen, mic, sensors
  4. Searchable - Can be queried after the fact
  5. Actionable - Triggers responses in real-time
When you perceive a meeting, you’re not taking a screenshot. You’re maintaining awareness of a time-evolving stream of visual and audio information, extracting meaning, and building memory.
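To make those five properties concrete, here's a rough sketch of what a single unit of perception could look like as a data structure (the names are illustrative, not any SDK's API):
# Illustrative sketch, not an SDK type: one unit of perception.
from dataclasses import dataclass

@dataclass
class PerceptEvent:
    source: str        # multi-source: "screen", "mic", "camera", "rtsp://..."
    channel: str       # e.g. "transcript", "scene_index", "alert"
    start: float       # temporal: seconds from stream start
    end: float
    text: str          # searchable: indexed for later recall
    confidence: float  # actionable: thresholds can trigger responses
A continuous stream of these, rather than a single processed frame, is what perception means in practice.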

The Perception Stack

Here’s what a perception-enabled agent stack looks like:
Continuous Media (screen, mic, camera, RTSP, files)
    ↓
Perception Layer (VideoDB)
    ├── Indexes (searchable understanding)
    ├── Events (real-time triggers)
    └── Memory (episodic recall)
    ↓
Agent (reasoning + action)
    ↓
Output (grounded in observable evidence)
The perception layer sits between raw media and agent logic. It converts streams into structured context.
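In code terms, you can think of the layer as exposing three surfaces to the agent. This is a sketch with illustrative names, not VideoDB's actual interface:
# Illustrative interface, not VideoDB's actual API: the three surfaces a
# perception layer exposes on top of a raw media stream.
from typing import Callable, List


class PerceptionLayer:
    def index(self, stream_id: str) -> None:
        """Build searchable understanding (transcripts, scene descriptions)."""

    def on_event(self, label: str, handler: Callable[[dict], None]) -> None:
        """Register a real-time trigger, e.g. 'budget_mention'."""

    def recall(self, query: str) -> List[dict]:
        """Episodic memory: return timestamped moments matching a query."""
        return []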

Three Input Modes

Perception works across different input types:
Mode             | Source               | Example
Files            | Uploaded recordings  | Meeting archives, training videos
Live Streams     | RTSP, RTMP, cameras  | Security feeds, drones, IoT
Desktop Capture  | Screen, mic, camera  | User sessions, support calls
Same architecture, same APIs, same mental model. Your agent can perceive a recorded file or a live stream the same way.
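Here's a sketch of that uniformity, using placeholder types rather than real SDK calls:
# Placeholder types, not an SDK: the point is that files, live streams, and
# desktop capture all yield the same perceivable-stream object downstream.
from dataclasses import dataclass

@dataclass
class MediaSource:
    kind: str   # "file" | "rtsp" | "desktop"
    uri: str

    def index(self) -> None:
        print(f"indexing {self.kind} source: {self.uri}")

    def on_event(self, label: str, handler) -> None:
        print(f"watching {self.kind} source for '{label}' events")

# Downstream code is identical regardless of where the media came from.
sources = [
    MediaSource("file", "standup.mp4"),        # uploaded recording
    MediaSource("rtsp", "rtsp://cam-1/feed"),  # live camera feed
    MediaSource("desktop", "screen+mic"),      # user session capture
]
for source in sources:
    source.index()                   # same searchable understanding
    source.on_event("alert", print)  # same real-time triggers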

From Batch to Real-time

Traditional video AI is batch-oriented:
  1. Upload file
  2. Wait for processing
  3. Get results
Perception is real-time:
  1. Stream continuously
  2. Receive structured events as they happen
  3. Act immediately
# Events arrive in real-time
{"channel": "transcript", "text": "Let's talk about the budget..."}
{"channel": "scene_index", "text": "User opened the pricing spreadsheet"}
{"channel": "alert", "label": "budget_mention", "confidence": 0.95}
Your agent receives context as the world unfolds - not after processing completes.
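On the agent side, consuming these events is just a dispatch loop. A minimal sketch (the transport - websocket, SDK callback, etc. - depends on your setup):
# Minimal dispatcher sketch over newline-delimited JSON events like the ones
# above. The delivery mechanism (websocket, SDK callback) is up to your stack.
import json

def handle_event(event: dict) -> None:
    channel = event.get("channel")
    if channel == "transcript":
        pass  # append to the running meeting context
    elif channel == "scene_index":
        pass  # update what the agent believes is on screen
    elif channel == "alert" and event.get("confidence", 0) >= 0.9:
        print(f"act now: {event['label']}")  # trigger the agent immediately

# Stand-in for a live stream of events:
raw_events = [
    '{"channel": "transcript", "text": "Let\'s talk about the budget..."}',
    '{"channel": "alert", "label": "budget_mention", "confidence": 0.95}',
]
for line in raw_events:
    handle_event(json.loads(line))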

Searchable Memory

Perception includes memory. Not just current awareness, but the ability to recall.
# What happened in this meeting about pricing?
results = video.search("pricing discussion")

for shot in results.shots:
    print(f"{shot.start}s - {shot.end}s: {shot.text}")
    shot.play()  # Play the exact moment
Every search result links to playable evidence. Your agent doesn’t just claim something happened - it can show you.
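That makes grounding straightforward: format the shots into context the agent can cite. A sketch reusing the shot fields from the snippet above:
# Sketch: turn search results into grounded, citable context for the agent,
# reusing the shot fields (.start, .end, .text) from the snippet above.
def grounded_context(shots) -> str:
    lines = []
    for shot in shots:
        # Each line carries a timestamp the agent can cite and the UI can
        # turn into a playable clip.
        lines.append(f"[{shot.start:.0f}s-{shot.end:.0f}s] {shot.text}")
    return "\n".join(lines)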

Why This Matters Now

Three trends are converging:
  1. Agents are going mainstream - Not research demos, but production systems
  2. Edge devices have cameras - Every laptop, phone, robot, and IoT device
  3. Users expect awareness - “Why doesn’t my AI know what I’m looking at?”
The agents that win will be the ones that can perceive. Text-only agents will feel blind in comparison.

The Promise

When perception becomes a first-class layer:
  • Desktop agents understand what you’re doing, not just what you type
  • Support agents see the user’s screen, not just their description
  • Monitoring agents react to events as they happen, not hours later
  • Meeting agents know what was said AND shown, with timestamps
The future of AI agents is perception-first.

What’s Next