1. Backend Setup
Install
pip install videodb
Create a Capture Session
Your backend creates a session and generates a short-lived token for the desktop client:
import videodb

conn = videodb.connect()

# Create session for a user
cap = conn.create_capture_session(
    end_user_id="user_abc",
    callback_url="https://your-backend.com/webhooks/videodb",
    metadata={"app": "my-ai-copilot"},
)

# Generate token for desktop client (never share API key)
token = conn.generate_client_token(expires_in=600)

# Send session ID and token to desktop client
print(f"Session: {cap.id}, Token: {token}")
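In practice you would hand these values to the desktop client over your own API rather than printing them. A minimal sketch of that handoff, assuming a FastAPI backend (the framework, route path, and response shape are illustrative choices, not part of the VideoDB SDK):

import videodb
from fastapi import FastAPI

app = FastAPI()
conn = videodb.connect()

@app.post("/capture/start")  # Illustrative route; your client calls this
def start_capture(end_user_id: str):
    # Same calls as above; pass callback_url and metadata as needed
    cap = conn.create_capture_session(end_user_id=end_user_id)
    token = conn.generate_client_token(expires_in=600)
    # The desktop client receives only the session ID and token, never the API key
    return {"capture_session_id": cap.id, "client_token": token}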
2. Client Setup
Install
pip install "videodb[capture]"
Start Capture
The desktop client uses the token to stream screen and audio:
import asyncio
from videodb.capture import CaptureClient

async def capture(capture_session_id: str, client_token: str):
    client = CaptureClient(client_token=client_token)

    # Request permissions
    await client.request_permission("microphone")
    await client.request_permission("screen_capture")

    # Discover available sources
    channels = await client.list_channels()
    mic = channels.mics.default
    display = channels.displays.primary or channels.displays[1]
    system_audio = channels.system_audio.default
    selected = [c for c in [mic, display, system_audio] if c]

    # Start capture
    await client.start_session(
        capture_session_id=capture_session_id,
        channels=selected,
        primary_video_channel_id=display.name if display else None,
    )

    # Listen for events
    async for ev in client.events():
        print(f"{ev.event}: {ev.payload}")
        if ev.event in ("recording-complete", "error"):
            break

    await client.stop_session()
    await client.shutdown()

# Run the capture
if __name__ == "__main__":
    asyncio.run(capture(
        capture_session_id="cap-xxx",  # From backend
        client_token="token-xxx",      # From backend
    ))
3. Backend Starts AI
When capture begins, your backend receives a webhook and starts AI processing:
def on_webhook(payload: dict):
    if payload["event"] == "capture_session.active":
        cap_id = payload["capture_session_id"]
        cap = conn.get_capture_session(cap_id)

        # Get RTStreams (one per channel)
        mics = cap.get_rtstream("mic")
        displays = cap.get_rtstream("display")

        # Start real-time AI processing
        if mics:
            mic = mics[0]
            mic.start_transcript()
            mic.index_audio(prompt="Extract key decisions and action items")
        if displays:
            display = displays[0]
            display.index_visuals(prompt="Describe what the user is doing")
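To receive this webhook you need an HTTP endpoint at the callback_url registered in step 1. A minimal wiring sketch, assuming FastAPI (the framework choice is an assumption; only the payload fields shown above come from VideoDB):

from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhooks/videodb")  # Must match the callback_url from step 1
async def videodb_webhook(request: Request):
    payload = await request.json()
    on_webhook(payload)   # Handler defined above
    return {"ok": True}   # Acknowledge quickly so delivery is not retried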
4. What You Get
Your backend receives AI-ready events in real-time:
{ "type" : "transcript" , "text" : "Let's schedule the meeting for Thursday" , "is_final" : true }
{ "type" : "index" , "index_type" : "visual" , "text" : "User is viewing a Slack conversation with 3 unread messages" }
{ "type" : "index" , "index_type" : "audio" , "text" : "Discussion about scheduling a team meeting" }
{ "type" : "alert" , "label" : "sensitive_content" , "triggered" : true , "confidence" : 0.92 }
Build with these:
Screen-aware AI agents
Live meeting copilots
In-call assistance
Semantic search and replay
Architecture
Backend creates a CaptureSession and mints a short-lived token
Desktop client uses the token to stream screen + audio (never sees API key)
VideoDB creates RTStreams (one per channel) when capture starts
Backend receives webhook, starts transcript and indexing on RTStreams
AI events flow back via WebSocket (real-time) or can be polled
Two Runtimes
Backend              Desktop Client
Holds API key        Receives session token
Creates sessions     Captures media
Runs AI pipelines    Streams to VideoDB
Receives events      Emits local UX events
Rule of thumb: Webhooks for correctness (durable, at-least-once). WebSocket for live UI (best-effort).
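Because webhook delivery is at-least-once, your handler should be idempotent. A minimal dedupe sketch, assuming each delivery carries a unique identifier (the event_id field is an assumption; use whatever dedupe key your payloads actually provide):

processed_ids = set()  # Use a database or cache in production

def on_webhook_idempotent(payload: dict):
    event_id = payload.get("event_id")  # Assumed field; adapt to your payloads
    if event_id in processed_ids:
        return  # Duplicate delivery; already handled
    processed_ids.add(event_id)
    on_webhook(payload)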
5. Example Applications
6. Core Concepts
CaptureSession (cap-xxx)
The lifecycle container for one capture run. Created by backend, activated by desktop client.
States: created → starting → active → stopping → stopped → exported
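If you prefer polling over webhooks, you can re-fetch the session until it reaches the state you care about. A sketch reusing conn.get_capture_session from step 3; reading the state as a .state attribute is an assumption:

import time

def wait_for_state(cap_id: str, target: str = "active", timeout: float = 60.0):
    deadline = time.time() + timeout
    while time.time() < deadline:
        cap = conn.get_capture_session(cap_id)
        if cap.state == target:  # .state is an assumed attribute name
            return cap
        time.sleep(2)
    raise TimeoutError(f"Session {cap_id} did not reach {target!r}")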
RTStream (rts-xxx)
A real-time media stream, one per captured channel. This is where you run AI:
rtstream.start_transcript()
rtstream.index_audio(prompt="Extract key decisions")
rtstream.index_visuals(prompt="Describe what user is doing")
rtstream.search("budget discussion")
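Once indexing has run, search returns the matching moments. A usage sketch; the shape of the results and the start/text attributes are assumptions, not documented API:

results = rtstream.search("budget discussion")
for segment in results:  # Result shape is an assumption
    print(segment.start, segment.text)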
Channel
A recordable source on the desktop:
Channel               Description
mic:default           Default microphone
system_audio:default  System audio output
display:1, display:2  Connected displays
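Channel IDs follow the source:index naming above, so you can pick a specific source instead of the defaults. A small sketch that belongs inside the async capture() function from step 2 (treating channels.displays as iterable with a .name attribute, as the start_session call above suggests):

# Inside the async capture() function from step 2
channels = await client.list_channels()
# Pick the second display explicitly instead of the primary one
second_display = next((d for d in channels.displays if d.name == "display:2"), None)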
Explore More
View All Examples on GitHub: complete source code with quickstart guides, example apps, and implementation patterns.