When you remember a meeting, you don’t recall a JSON object. You remember the moment - the room, the voice, the pause before someone made a key point. Humans have episodic memory. AI agents don’t. That’s about to change.

Two Kinds of Memory

Cognitive science distinguishes between two kinds of memory:
Semantic Memory - Facts and concepts
  • “The capital of France is Paris”
  • “Water boils at 100°C”
  • Timeless, context-free, declarative
Episodic Memory - Experienced events
  • “I remember the meeting where we discussed the budget”
  • “That call where the client mentioned timeline concerns”
  • Time-stamped, contextual, experiential
Most AI memory systems are semantic. Vector databases store embeddings of facts. RAG retrieves documents. But agents that perceive need episodic memory. They need to remember what they saw and heard, when it happened, and what the context was.
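One way to picture the difference is as two record shapes. The sketch below is illustrative only: `SemanticFact`, `Episode`, and the `recording_001` identifier are hypothetical names, not part of any SDK. A fact stands alone, while an episode carries its time range, source, and modalities with it.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SemanticFact:
    """A timeless, context-free statement."""
    statement: str


@dataclass(frozen=True)
class Episode:
    """An experienced event: what was perceived, when, and from which source."""
    description: str
    start: float                 # seconds from the start of the recording
    end: float                   # seconds from the start of the recording
    source: str                  # hypothetical recording identifier
    modalities: Tuple[str, ...] = ("audio", "visual")


def duration(episode: Episode) -> float:
    """Length of the episode in seconds."""
    return episode.end - episode.start


fact = SemanticFact("The capital of France is Paris")
episode = Episode(
    description="Client mentions timeline concerns",
    start=872.0,
    end=905.5,
    source="recording_001",  # placeholder value for illustration
)

print(fact.statement)
print(f"{episode.description} ({duration(episode)}s from {episode.source})")
```

The fact has no notion of when or where. The episode is meaningless without them.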

Why Episodic Matters

Consider these queries:
| Query | Memory Type | What’s Needed |
| --- | --- | --- |
| “What is our pricing model?” | Semantic | Retrieved from docs |
| “What did the client say about pricing last Tuesday?” | Episodic | Retrieved from recordings |
| “How many people attended the meeting?” | Episodic | Visual memory of the event |
| “What was on screen when they mentioned the deadline?” | Episodic | Multimodal temporal context |
Semantic memory can’t answer episodic questions. You need memory of experiences - not just facts.

Video as Natural Episodic Memory

Video is inherently episodic:
  • Time-indexed - Every frame has a timestamp
  • Multi-sensory - Visual + audio together
  • Contextual - Shows the environment, not just content
  • Continuous - Captures the flow of events
When you record a meeting, you’re creating episodic memory. The challenge is making it retrievable.
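The time-indexed property comes down to simple arithmetic. As a minimal sketch, assuming a constant frame rate (the helper name `frame_timestamp` is ours, not an SDK call), a frame's timestamp is its index divided by the frame rate:

```python
from fractions import Fraction


def frame_timestamp(frame_index: int, fps: Fraction) -> float:
    """Seconds from the start of the video for a zero-based frame index."""
    return float(frame_index / fps)


# 25 fps: frame 750 sits 30 seconds into the recording.
print(frame_timestamp(750, Fraction(25)))

# NTSC-style rates are fractional, so Fraction avoids rounding drift.
print(frame_timestamp(30000, Fraction(30000, 1001)))
```

Variable-frame-rate recordings store a presentation timestamp per frame instead, so this division only applies when the rate is fixed.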

The Memory Problem

Raw recordings aren’t queryable. You can’t ask an MP4 file “what happened?” Traditional approaches each fall short:
  1. Full transcription - Converts audio to text, loses visual context
  2. Frame extraction - Expensive, loses temporal flow
  3. Manual notes - Doesn’t scale, subjective
  4. Just store it - Recording exists but no one can find anything
None of these create true episodic memory. They create archives.

Indexed Episodic Memory

The solution: indexes that understand what happened and when.
# Create episodic memory from a video
video.index_spoken_words()  # What was said
video.index_scenes(prompt="Describe activities and events")  # What happened

# Query episodic memory
results = video.search("budget discussion")

for shot in results.shots:
    print(f"At {shot.start}s: {shot.text}")
    shot.play()  # Relive the moment
The index is the memory. It captures:
  • What happened (semantic content)
  • When it happened (timestamps)
  • Evidence (playable links)
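To make those three parts concrete, here is a toy in-memory index. It is a sketch of the idea, not how any real indexing service works: `IndexedMoment`, `EpisodicIndex`, and the `#t=start,end` locator are hypothetical, and plain keyword matching stands in for real semantic search.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IndexedMoment:
    text: str       # what happened
    start: float    # when it started, in seconds
    end: float      # when it ended, in seconds
    video_id: str   # which recording holds the evidence

    def evidence(self) -> str:
        # Hypothetical locator; a real system would return a playable link.
        return f"{self.video_id}#t={self.start:g},{self.end:g}"


class EpisodicIndex:
    def __init__(self) -> None:
        self._moments: List[IndexedMoment] = []

    def add(self, moment: IndexedMoment) -> None:
        self._moments.append(moment)

    def search(self, query: str) -> List[IndexedMoment]:
        # Keep moments whose text contains every query word, in time order.
        terms = query.lower().split()
        hits = [m for m in self._moments
                if all(term in m.text.lower() for term in terms)]
        return sorted(hits, key=lambda m: m.start)


index = EpisodicIndex()
index.add(IndexedMoment("Team reviews the hiring plan", 300.0, 340.0, "video_a"))
index.add(IndexedMoment("Budget discussion for next quarter", 120.0, 135.0, "video_a"))

for moment in index.search("budget discussion"):
    print(f"At {moment.start:g}s: {moment.text} ({moment.evidence()})")
```

Every hit carries its own timestamp and a pointer back to the source, which is what separates an index from a plain transcript.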

Ephemeral vs Persistent

Not all perception needs permanent memory.
Ephemeral - Process but don’t store
  • Real-time event detection
  • Privacy-sensitive contexts
  • Temporary sessions
rtstream.index_visuals(
    prompt="Detect safety issues",
    ephemeral=True  # Don't persist
)
Persistent - Store for later recall
  • Meeting recordings
  • Training content
  • Compliance archives
video.index_spoken_words()  # Stored by default
You control what your agent remembers.
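The distinction can be sketched in a few lines. This is an illustration under our own assumptions, not the SDK's internals: `PerceptionMemory` and its methods are hypothetical. Every event reaches the real-time handler, but only non-ephemeral events are kept for recall.

```python
from typing import Callable, List, Tuple

Event = Tuple[float, str]  # (timestamp in seconds, description)


class PerceptionMemory:
    def __init__(self, on_event: Callable[[float, str], None]) -> None:
        self._on_event = on_event
        self._stored: List[Event] = []

    def observe(self, timestamp: float, description: str,
                ephemeral: bool = False) -> None:
        # Every event is processed as it arrives...
        self._on_event(timestamp, description)
        # ...but only non-ephemeral events are kept for later recall.
        if not ephemeral:
            self._stored.append((timestamp, description))

    def recall(self) -> List[Event]:
        return list(self._stored)


alerts: List[Event] = []
memory = PerceptionMemory(on_event=lambda t, d: alerts.append((t, d)))

memory.observe(12.0, "Worker without helmet", ephemeral=True)  # react, then forget
memory.observe(45.0, "Weekly sync begins")                     # react and remember

print(len(alerts), "events processed;", len(memory.recall()), "remembered")
```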

Desktop as Continuous Input

Desktop capture creates continuous episodic input:
cap = conn.create_capture_session(end_user_id="user_123")

# What the agent "experiences":
# - Screen content (visual)
# - Microphone (spoken)
# - System audio (ambient)
The agent perceives the user’s experience in real-time. With indexing, it builds memory. Later:
# Agent recall. The user asks:
# "Remember when I was debugging that error? What file was I looking at?"

results = cap.search("debugging error")
for shot in results.shots:
    shot.play()  # Show the moment

Multi-Session Memory

Episodic memory spans sessions:
# Search across all recordings
results = coll.search("product roadmap discussions")

# Results from any video in the collection
for shot in results.shots:
    print(f"Video: {shot.video_id}, Time: {shot.start}s")
    print(f"Content: {shot.text}")
The agent doesn’t just remember one meeting. It remembers all meetings.
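Cross-session results are easier to reason about once they are grouped back into per-recording timelines. A minimal sketch, assuming each result is a plain dict with `video_id`, `start`, and `text` keys (a stand-in shape for illustration; `group_by_video` is a hypothetical helper):

```python
from collections import defaultdict
from typing import Dict, List


def group_by_video(shots: List[dict]) -> Dict[str, List[dict]]:
    """Group shots by recording and order each group by start time."""
    grouped: Dict[str, List[dict]] = defaultdict(list)
    for shot in shots:
        grouped[shot["video_id"]].append(shot)
    for video_shots in grouped.values():
        video_shots.sort(key=lambda s: s["start"])
    return dict(grouped)


# Placeholder results, in the order a ranked search might return them.
shots = [
    {"video_id": "video_b", "start": 95.0, "text": "Roadmap: Q3 priorities"},
    {"video_id": "video_a", "start": 410.0, "text": "Roadmap risks"},
    {"video_id": "video_a", "start": 62.0, "text": "Roadmap overview"},
]

timeline = group_by_video(shots)
for video_id, video_shots in timeline.items():
    print(video_id, [s["start"] for s in video_shots])
```

Recordings keep the order in which they first appeared in the results, so the most relevant meeting still comes first.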

Grounded Answers

Episodic memory enables grounded responses.
Without episodic memory:
“I believe the pricing discussion happened last week…”
With episodic memory:
“At 14:32 in yesterday’s meeting, Sarah said ‘We need to revisit the enterprise tier pricing.’ Here’s the clip: [play]”
The difference is trust. Episodic memory provides verifiable evidence.
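Turning a retrieved moment into a grounded answer is mostly formatting. The sketch below assumes a result dict with `start`, `speaker`, `text`, and `clip_url` keys. Those field names and both helper functions are hypothetical, and the URL is a placeholder. When nothing was retrieved, it says so instead of guessing.

```python
from typing import Optional


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS, or H:MM:SS for recordings over an hour."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    if hours:
        return f"{hours}:{minutes:02d}:{secs:02d}"
    return f"{minutes:02d}:{secs:02d}"


def grounded_answer(shot: Optional[dict]) -> str:
    """Quote the evidence with its timestamp, or admit there is none."""
    if shot is None:
        return "I could not find that moment in the recordings."
    stamp = format_timestamp(shot["start"])
    speaker, text, url = shot["speaker"], shot["text"], shot["clip_url"]
    return f'At {stamp}, {speaker} said "{text}" Clip: {url}'


shot = {
    "start": 872.0,
    "speaker": "Sarah",
    "text": "We need to revisit the enterprise tier pricing.",
    "clip_url": "https://example.com/clip",  # placeholder link
}

print(grounded_answer(shot))
print(grounded_answer(None))
```

The fallback branch matters as much as the happy path: without evidence, the agent makes no claim.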

The Future

The agents we’re building will:
  • Perceive continuously (screens, mics, cameras)
  • Index what they perceive (spoken, visual, events)
  • Remember across sessions (episodic recall)
  • Answer with evidence (playable proof)
This isn’t science fiction. The architecture exists today.

What’s Next