
1. Backend Setup

Install

pip install videodb

Create a Capture Session

Your backend creates a session and generates a short-lived token for the desktop client:
import videodb

conn = videodb.connect()

# Create session for a user
cap = conn.create_capture_session(
    end_user_id="user_abc",
    callback_url="https://your-backend.com/webhooks/videodb",
    metadata={"app": "my-ai-copilot"}
)

# Generate token for desktop client (never share API key)
token = conn.generate_client_token(expires_in=600)

# Send session ID and token to desktop client
print(f"Session: {cap.id}, Token: {token}")

2. Client Setup

Install

pip install "videodb[capture]"

Start Capture

The desktop client uses the token to stream screen and audio:
import asyncio
from videodb.capture import CaptureClient

async def capture(capture_session_id: str, client_token: str):
    client = CaptureClient(client_token=client_token)

    # Request permissions
    await client.request_permission("microphone")
    await client.request_permission("screen_capture")

    # Discover available sources
    channels = await client.list_channels()
    mic = channels.mics.default
    display = channels.displays.primary or channels.displays[1]
    system_audio = channels.system_audio.default
    selected = [c for c in [mic, display, system_audio] if c]

    # Start capture
    await client.start_session(
        capture_session_id=capture_session_id,
        channels=selected,
        primary_video_channel_id=display.name if display else None
    )

    # Listen for events
    async for ev in client.events():
        print(f"{ev.event}: {ev.payload}")
        if ev.event in ("recording-complete", "error"):
            break

    await client.stop_session()
    await client.shutdown()

# Run the capture
if __name__ == "__main__":
    asyncio.run(capture(
        capture_session_id="cap-xxx",  # From backend
        client_token="token-xxx"        # From backend
    ))

3. Backend Starts AI

When capture begins, your backend receives a webhook and starts AI processing:
def on_webhook(payload: dict):
    if payload["event"] == "capture_session.active":
        cap_id = payload["capture_session_id"]
        cap = conn.get_capture_session(cap_id)

        # Get RTStreams (one per channel)
        mics = cap.get_rtstream("mic")
        displays = cap.get_rtstream("display")

        # Start real-time AI processing
        if mics:
            mic = mics[0]
            mic.start_transcript()
            mic.index_audio(prompt="Extract key decisions and action items")

        if displays:
            display = displays[0]
            display.index_visuals(prompt="Describe what the user is doing")
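
A minimal sketch of wiring this handler to the callback_url registered above. Flask is an illustrative choice; the only requirements are parsing the JSON body and acknowledging quickly (webhooks are delivered at least once, so keep the handler idempotent):

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhooks/videodb", methods=["POST"])
def videodb_webhook():
    # Hand the parsed payload to on_webhook() defined above
    on_webhook(request.get_json())
    return "", 204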

4. What You Get

Your backend receives AI-ready events in real time:
{"type": "transcript", "text": "Let's schedule the meeting for Thursday", "is_final": true}
{"type": "index", "index_type": "visual", "text": "User is viewing a Slack conversation with 3 unread messages"}
{"type": "index", "index_type": "audio", "text": "Discussion about scheduling a team meeting"}
{"type": "alert", "label": "sensitive_content", "triggered": true, "confidence": 0.92}
Build with these:
  • Screen-aware AI agents
  • Live meeting copilots
  • In-call assistance
  • Semantic search and replay

Architecture

[Architecture diagram]
  1. Backend creates a CaptureSession and mints a short-lived token
  2. Desktop client uses the token to stream screen + audio (never sees API key)
  3. VideoDB creates RTStreams (one per channel) when capture starts
  4. Backend receives webhook, starts transcript and indexing on RTStreams
  5. AI events flow back via WebSocket (real-time) or can be polled

Two Runtimes

Backend            | Desktop Client
Holds API key      | Receives session token
Creates sessions   | Captures media
Runs AI pipelines  | Streams to VideoDB
Receives events    | Emits local UX events
Rule of thumb: Webhooks for correctness (durable, at-least-once). WebSocket for live UI (best-effort).

5. Example Applications

Complete example applications, with source code, are collected in the GitHub repository linked under Explore More below.

6. Core Concepts

CaptureSession (cap-xxx)

The lifecycle container for one capture run. Created by backend, activated by desktop client. States: created → starting → active → stopping → stopped → exported
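
A quick sketch of checking where a session sits in that lifecycle from the backend. get_capture_session is shown in the webhook example above; the status attribute name is an assumption for illustration:

cap = conn.get_capture_session("cap-xxx")
if cap.status == "active":          # assumed attribute name; capture is live
    print("RTStreams are available")
elif cap.status in ("stopped", "exported"):
    print("Capture run has finished")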

RTStream (rts-xxx)

A real-time media stream, one per captured channel. This is where you run AI:
rtstream.start_transcript()                                    # live speech-to-text
rtstream.index_audio(prompt="Extract key decisions")           # semantic indexing of audio
rtstream.index_visuals(prompt="Describe what user is doing")   # semantic indexing of the screen
rtstream.search("budget discussion")                           # query indexed content

Channel

A recordable source on the desktop:
Channel               | Description
mic:default           | Default microphone
system_audio:default  | System audio output
display:1, display:2  | Connected displays
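
A short sketch of enumerating these from the desktop client, using the same list_channels() call as in the capture example. Iterating each group and reading .name in the type:identifier format shown above are assumptions for illustration:

from videodb.capture import CaptureClient

async def print_channels(client: CaptureClient):
    channels = await client.list_channels()
    # Assumes each group is iterable and each channel exposes .name
    for group in (channels.mics, channels.displays, channels.system_audio):
        for ch in group:
            print(ch.name)   # e.g. "mic:default", "display:1"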

Explore More

View All Examples on GitHub

Complete source code with quickstart guides, example apps, and implementation patterns