This playbook will guide you through the process of creating your own agents within the Director framework. You'll learn:
- How to plan and structure an effective agent
- Best practices for development and integration
- Techniques for handling user communication and errors
- Ways to leverage Director's powerful video processing capabilities
🏛️ Understanding the Architecture
Before diving into agent creation, let's understand how Director works as a system. Director follows a modular architecture that enables seamless interaction between users and AI-powered video processing capabilities.
System Overview
Director's architecture is designed around three core principles:
- Modularity: Each component has a specific responsibility and can be developed or modified independently.
- Scalability: The system can handle multiple requests and complex video operations efficiently.
- Extensibility: New agents and tools can be easily added to expand functionality.
Looking at the system architecture diagram below, you can see how these principles come together:
Understanding the Director Framework
Director consists of several key components working together:
- Reasoning Engine: The brain of the system, which:
  - Interprets natural language commands
  - Coordinates multiple agents
  - Maintains conversation context
  - Manages workflows and decision-making
- Agents: Specialized workers that handle specific tasks. For example, agents that find specific content within videos or generate preview images.
- Tools: Reusable functions that agents can leverage, such as:
  - Integrations (OpenAI, Anthropic, etc.)
  - External API connections
- Session Management: Handles state and context across interactions.
The Reasoning Engine in Detail
The Reasoning Engine is the orchestrator of all agent activities. As shown in the architecture diagram, it:
- Understands natural language requests
- Maintains conversation history and context
- Determines the required actions and their sequence
- Selects appropriate agents from the pool
- Coordinates multiple agents for complex tasks
- Manages dependencies between agent tasks
- Provides real-time progress updates
- Returns formatted responses to the user
- Tracks ongoing operations
- Ensures context persistence
Bringing It All Together
The architecture enables powerful workflows like:
1. A user requests a video summary through the chat interface.
2. The Flask server processes the request and routes it to the Reasoning Engine.
3. The Reasoning Engine coordinates multiple agents to analyze and process the video.
4. Real-time updates flow back to the user through WebSocket connections.
5. The final result is presented in the video player.
This architecture is what makes Director extensible: when you create a new agent, it becomes part of this ecosystem and can use all of these capabilities to perform its tasks.
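To make steps 2 to 4 concrete, here is a toy, in-process sketch. Everything in it is hypothetical (ToyReasoningEngine, the transcribe and summarize agents, and a plain list standing in for the WebSocket channel). It leaves out the Flask layer and the natural-language planning the real Reasoning Engine performs, and only illustrates agent selection from a pool, chaining of dependent tasks, progress updates, and history tracking.

```python
from typing import Callable, Dict, List

# A toy "agent": takes the previous step's output plus an emit() callback.
Agent = Callable[[str, Callable[[str], None]], str]


class ToyReasoningEngine:
    """Hypothetical stand-in for the Reasoning Engine (not Director's code)."""

    def __init__(self) -> None:
        self.agents: Dict[str, Agent] = {}  # pool of available agents
        self.history: List[str] = []        # conversation context
        self.updates: List[str] = []        # stands in for the WebSocket channel

    def register(self, name: str, agent: Agent) -> None:
        self.agents[name] = agent

    def emit(self, update: str) -> None:
        # Director pushes real-time updates to the client; here we just record them.
        self.updates.append(update)

    def handle(self, request: str, plan: List[str], payload: str) -> str:
        # A real engine derives `plan` from the natural-language request;
        # in this toy it is passed in explicitly.
        self.history.append(request)
        for name in plan:
            self.emit(f"running {name}")
            # Each agent consumes the previous agent's output (a task dependency).
            payload = self.agents[name](payload, self.emit)
        self.history.append(payload)
        return payload


def transcribe(video_id: str, emit: Callable[[str], None]) -> str:
    emit("transcribing")
    return f"transcript of {video_id}"


def summarize(transcript: str, emit: Callable[[str], None]) -> str:
    emit("summarizing")
    return f"summary of {transcript}"


engine = ToyReasoningEngine()
engine.register("transcribe", transcribe)
engine.register("summarize", summarize)
result = engine.handle("Summarize this video", ["transcribe", "summarize"], "vid_123")
print(result)  # summary of transcript of vid_123
```

Passing each agent's output to the next is the simplest way to model a dependency; independent agents could just as well run side by side.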
With the fundamentals covered, let’s start building! 🚀
✍️ Planning Phase
The success of an agent heavily depends on thorough planning and requirements gathering. Before writing any code:
- Compile a comprehensive list of questions covering all aspects of the agent.
- Include edge cases and potential future requirements.
- Consider integration points with other agents and systems.
MUST-HAVE (v1)
- Core functionality requirements
- Essential error handling
- Basic user feedback
SHOULD-HAVE (v2)
- Enhanced features
- Performance optimizations
- Additional provider support
NICE-TO-HAVE (v3)
- Advanced customization
- Extra integration points
- Optional enhancements
Finally, document the technical limits your agent must respect:
TECHNICAL_LIMITS = {
"max_input_length": "Clear limits",
"rate_limits": "API constraints",
"storage_requirements": "Resource needs",
"performance_expectations": "Response times"
}
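The TECHNICAL_LIMITS template above only names the categories. As a minimal sketch, assuming hypothetical numbers and a renamed rate_limit_per_minute key, two of those entries can be turned into checks an agent runs before doing any work: an input-length guard and a sliding-window rate limiter.

```python
import time
from collections import deque
from typing import Deque, Optional

# Hypothetical values: replace them with numbers from your own API quotas and benchmarks.
TECHNICAL_LIMITS = {
    "max_input_length": 2000,      # characters accepted in a prompt
    "rate_limit_per_minute": 30,   # calls the upstream API allows per minute
}


def check_input_length(text: str) -> Optional[str]:
    """Return a user-facing error message if the input is too long, else None."""
    limit = TECHNICAL_LIMITS["max_input_length"]
    if len(text) > limit:
        return f"Input is {len(text)} characters; the limit is {limit}."
    return None


class RateLimiter:
    """Sliding-window limiter: allow at most `limit` calls per 60 seconds."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.calls: Deque[float] = deque()  # timestamps of accepted calls

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop timestamps that have left the 60-second window
        while self.calls and now - self.calls[0] >= 60:
            self.calls.popleft()
        if len(self.calls) < self.limit:
            self.calls.append(now)
            return True
        return False


limiter = RateLimiter(TECHNICAL_LIMITS["rate_limit_per_minute"])
print(check_input_length("a" * 2500))  # Input is 2500 characters; the limit is 2000.
print(all(limiter.allow(now=0.0) for _ in range(30)))  # True
print(limiter.allow(now=1.0))   # False: 31st call inside the window
print(limiter.allow(now=60.0))  # True: the window has rolled over
```

Writing limits as numbers rather than descriptions makes them testable, which is exactly the "clear testing boundaries" benefit listed below.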
Investing time in this planning phase:
- Prevents scope creep during development
- Ensures clear alignment with team expectations
- Makes code structure more maintainable
- Helps predict potential issues before they arise
- Creates clear testing boundaries
- Provides systematic upgrade paths for future versions
Remember: It's easier to adjust plans than refactor code. Take time to ask questions and challenge assumptions before implementation begins.
☑️ Pre-Development Checklist
1. Purpose
Define a clear, single-responsibility purpose. Some examples from the codebase:
- An agent that finds and retrieves specific content within videos
- An agent that generates preview images from video frames
- An agent that handles media uploads with format validation and processing
2. Background Check
- Review existing agents for similar functionality.
- Consider extending an existing agent instead of writing a new one, by:
  - Optimizing its core functionality or improving performance
  - Enhancing integration points (adding support for better tools, platforms, or models)
3. Agent Architecture
The I/O Contract defines how your agent interacts with the system, including the expected inputs, outputs, and how they are structured. It ensures consistency and clarity in communication between the agent, the system, and the end user. This contract is critical for integrating your agent with the infrastructure and enabling seamless interaction with other components.
Input Contract
The input contract specifies the parameters your agent expects to receive. These parameters can be simple (e.g., a string or number) or complex (e.g., a nested JSON object). The input contract is defined in two parts:
- Function Signature: The run method of your agent defines the expected parameters.
- JSON Schema: For complex inputs, a JSON schema describes the structure and constraints of the input data.
Simple Input Example
For agents with straightforward inputs, you can define the parameters directly in the run method. Here’s an example for a Slack agent:
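As a rough sketch of such a simple contract (SlackAgentSketch, its name, and its parameters are hypothetical, and it returns a plain dict instead of an AgentResponse so it runs standalone), the run signature alone can serve as the input contract: required parameters have no default, optional ones do, and the docstring documents each one.

```python
from typing import Any, Dict


class SlackAgentSketch:
    """Hypothetical agent: its input contract is simply the run() signature."""

    agent_name = "slack_sketch"  # hypothetical name
    description = "Posts a text message to a Slack channel"

    def run(self, message: str, channel: str = "#general") -> Dict[str, Any]:
        """Post a message to a Slack channel.

        :param message: Text to post (required).
        :param channel: Channel to post to (optional, defaults to #general).
        """
        # Validate inputs at the boundary of the contract
        if not message.strip():
            return {"status": "error", "message": "Message must not be empty."}
        # A real agent would call the Slack API here; this sketch only echoes.
        return {
            "status": "success",
            "message": f"Posted to {channel}",
            "data": {"channel": channel, "text": message},
        }


print(SlackAgentSketch().run("Render finished", channel="#video-team")["message"])
# Posted to #video-team
```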
Complex Input Example
For agents requiring structured inputs (e.g., video generation), you define a JSON schema. This schema is used to validate the input and provide clear documentation for API consumers. Here’s an example for a video generation agent:
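A schema consistent with the field descriptions that follow might look like this. It is a sketch: the types and descriptions for duration and style are assumptions, AGENT_PARAMETERS reuses the name from the parameter-handling note later in this playbook, and the tiny validate() helper checks only required and type. In practice you would likely use a complete JSON Schema validator such as the jsonschema package.

```python
from typing import Any, Dict, List

# `prompt` is a required string; `config` is an optional object.
# The types for `duration` (number) and `style` (string) are assumptions.
AGENT_PARAMETERS: Dict[str, Any] = {
    "type": "object",
    "properties": {
        "prompt": {"type": "string", "description": "What the video should show"},
        "config": {
            "type": "object",
            "properties": {
                "duration": {"type": "number", "description": "Length in seconds"},
                "style": {"type": "string", "description": "Visual style"},
            },
        },
    },
    "required": ["prompt"],
}

# Map the JSON Schema type names used above to Python types
_TYPES = {"string": str, "number": (int, float), "object": dict}


def validate(payload: Dict[str, Any], schema: Dict[str, Any]) -> List[str]:
    """Illustrative validator covering only `required` and `type`, recursively."""
    errors: List[str] = []
    for key in schema.get("required", []):
        if key not in payload:
            errors.append(f"missing required field: {key}")
    for key, sub in schema.get("properties", {}).items():
        if key not in payload:
            continue
        value = payload[key]
        # JSON Schema does not treat booleans as numbers, so reject bool explicitly
        if isinstance(value, bool) or not isinstance(value, _TYPES[sub["type"]]):
            errors.append(f"{key} should be of type {sub['type']}")
        elif sub["type"] == "object":
            errors.extend(f"{key}.{e}" for e in validate(value, sub))
    return errors


print(validate({"prompt": "A sunrise over Lisbon", "config": {"duration": 8}},
               AGENT_PARAMETERS))  # []
print(validate({"config": {"duration": "long"}}, AGENT_PARAMETERS))
# ['missing required field: prompt', 'config.duration should be of type number']
```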
In this example:
- The prompt field is a required string.
- The config field is an optional object with nested properties (duration and style).
- The required keyword ensures mandatory fields are validated.
Output Contract
The output contract defines how your agent communicates results, errors, and progress updates. This includes:
- AgentResponse: A standardized response format for success or failure.
- Progress Updates: Real-time updates using the output_message object.
- Frontend Content Handling: Structured content for rendering in the frontend (e.g., text, video, images).
AgentResponse
The AgentResponse object is used to return the result of the agent's execution. It includes:
- status: Indicates success (AgentStatus.SUCCESS) or failure (AgentStatus.ERROR).
- message: A human-readable message describing the result.
- data: Additional data returned by the agent (e.g., generated content).
Example:
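As an illustration, the sketch below uses stand-in AgentStatus and AgentResponse classes that mirror only the three fields listed above (Director's real classes may differ, and the "success"/"error" values are assumptions), plus a hypothetical summarize_agent() body that can return either outcome.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict


# Stand-ins mirroring the three fields above; Director's real classes may differ.
class AgentStatus(Enum):
    SUCCESS = "success"
    ERROR = "error"


@dataclass
class AgentResponse:
    status: AgentStatus
    message: str                                         # human-readable result
    data: Dict[str, Any] = field(default_factory=dict)  # machine-usable payload


def summarize_agent(transcript: str) -> AgentResponse:
    """Hypothetical agent body that can succeed or fail."""
    if not transcript.strip():
        return AgentResponse(
            status=AgentStatus.ERROR,
            message="This video has no transcript to summarize.",
        )
    summary = transcript.split(".")[0] + "."  # placeholder for real summarization
    return AgentResponse(
        status=AgentStatus.SUCCESS,
        message="Summary generated.",
        data={"summary": summary},
    )


ok = summarize_agent("Director overview. Then a live demo.")
print(ok.status.value, ok.data)  # success {'summary': 'Director overview.'}
```

Keeping message human-readable and data structured lets the same response serve both the user and any agent that consumes the result.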
Progress Updates
Use the output_message object to provide real-time updates during the agent's execution. For example:
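A minimal sketch of that pattern, using a hypothetical OutputMessageStub that mirrors only the actions list and push_update() call seen in the error-handling example later in this playbook. Instead of streaming to the client, its push_update() hands a snapshot of the actions to a callback.

```python
from typing import Callable, List


class OutputMessageStub:
    """Hypothetical stand-in for output_message: only `actions` and push_update()."""

    def __init__(self, send: Callable[[List[str]], None]) -> None:
        self.actions: List[str] = []
        self._send = send

    def push_update(self) -> None:
        # Director streams the update to the client; this stub passes a copy
        # of the current action list to a callback instead.
        self._send(list(self.actions))


sent: List[List[str]] = []
output_message = OutputMessageStub(sent.append)

for step in ["Fetching video...", "Extracting frames...", "Generating thumbnail..."]:
    output_message.actions.append(step)
    output_message.push_update()  # the user sees each step as it starts

print(sent[-1])  # ['Fetching video...', 'Extracting frames...', 'Generating thumbnail...']
```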
Frontend Content Handling
Agents can return different types of content (e.g., text, video, images) using the output_message.content list. Each content type is represented by a specific model (e.g., TextContent, VideoContent). Example:
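A sketch under stated assumptions: the MsgStatus, TextContent, VideoContent, and OutputMessage classes below are simplified stand-ins, and field names such as text and video_url, as well as the progress and success statuses, are hypothetical. It shows a video block that first appears in a progress state and is then updated in place, followed by a text block.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Union


class MsgStatus(Enum):
    # `error` appears in this playbook's error-handling example; the other
    # members are assumptions.
    progress = "progress"
    success = "success"
    error = "error"


@dataclass
class TextContent:
    text: str = ""
    status: MsgStatus = MsgStatus.progress
    status_message: str = ""


@dataclass
class VideoContent:
    video_url: Optional[str] = None  # hypothetical field name
    status: MsgStatus = MsgStatus.progress
    status_message: str = ""


@dataclass
class OutputMessage:
    content: List[Union[TextContent, VideoContent]] = field(default_factory=list)


msg = OutputMessage()
video = VideoContent(status_message="Your video is being generated...")
msg.content.append(video)  # the frontend can render a video block in its progress state

# ...later, once generation finishes, update the same block in place:
video.video_url = "https://example.com/generated.mp4"
video.status = MsgStatus.success
video.status_message = "Here is your video."
msg.content.append(TextContent(text="Generated a 10-second clip.", status=MsgStatus.success))

print([type(c).__name__ for c in msg.content])  # ['VideoContent', 'TextContent']
```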
Plan and define the steps that take your agent from input → output. Keep the following factors in mind:
- Input validation & preprocessing
- Result formatting & cleanup
- Look for composition opportunities with existing agents or tools:
  - Example 1: ComparisonAgent leveraging VideoGenerationAgent
  - Example 2: AudioGenerationAgent using the ElevenLabs tool
4. Session Management
- The session parameter maintains context across multiple interactions, so the agent does not handle each request in isolation, without state.
- Assigning a unique name and description aids debugging and log analysis, making it easier to track agent behavior.
- Flexible parameter handling: Your agent can either read its settings on the fly from user input (dynamic parameters) or use pre-defined parameters that you specify upfront. Simple agents often work best with dynamic parameters, while complex agents usually need pre-defined configurations.
# BaseAgent and Session are provided by the Director framework
class NewAgent(BaseAgent):
    def __init__(self, session: Session, **kwargs):
        self.agent_name = "unique_name"  # unique identifier; makes logs easy to trace
        self.description = "Clear, specific description"  # what the agent does
        self.parameters = self.get_parameters()  # input contract (see the note below)
        super().__init__(session=session, **kwargs)  # hand the session to BaseAgent
Note on parameter handling: how self.parameters is set depends on how complex the agent's parameters are:
# Simple parameters, described by the run() method's docstring
self.parameters = self.get_parameters()
# Complex parameters, defined as a dictionary (e.g., a JSON schema)
self.parameters = AGENT_PARAMETERS
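To illustrate the "simple" path, here is a hypothetical helper that builds a JSON-schema-like dictionary from a run() method's signature and docstring using Python's inspect module. It is not Director's get_parameters(); it only shows why a plain signature is enough for simple agents, while nested structures (like a video generation config) are easier to write by hand as a dictionary.

```python
import inspect
from typing import Any, Callable, Dict, List

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def parameters_from_signature(run: Callable[..., Any]) -> Dict[str, Any]:
    """Hypothetical helper: derive a schema-like dict from run()'s signature.

    Parameters without defaults become required; unannotated ones default to "string".
    """
    props: Dict[str, Any] = {}
    required: List[str] = []
    for name, param in inspect.signature(run).parameters.items():
        # Skip `self` and catch-all *args / **kwargs
        if name == "self" or param.kind in (
            inspect.Parameter.VAR_POSITIONAL,
            inspect.Parameter.VAR_KEYWORD,
        ):
            continue
        props[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "description": inspect.getdoc(run) or "",
        "type": "object",
        "properties": props,
        "required": required,
    }


class ThumbnailAgentSketch:  # hypothetical agent
    def run(self, video_id: str, timestamp: float = 0.0, *args, **kwargs):
        """Generate a thumbnail for video_id at the given timestamp (seconds)."""


print(parameters_from_signature(ThumbnailAgentSketch.run)["required"])  # ['video_id']
```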
Key Points of User Communication:
- Director's Log (Showing Steps): Think of this like the progress bar or status updates you see when installing software. It tells users what's happening behind the scenes.
- Progress Updates (Showing Content Responses): This handles the actual content (videos, images, text) and its current state. Think of it like uploading a file to Google Drive: you see both the file and its upload status.
  This shows up in the chat as:
  - A message saying "Your video is being generated..."
- Final Cut (Returning Results): When the agent finishes, it returns three things:
  - status: Did it work? (success/error)
  - message: What happened? (a user-friendly explanation)
  - data: The actual results (video URLs, text, etc.)
5. Error Handling
Clear error responses help the reasoning engine interpret failures, guide users effectively, and determine the next appropriate actions in the workflow.
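import logging

logger = logging.getLogger(__name__)

# Inside your agent's run() method. `content` (e.g., a TextContent already added to
# self.output_message.content) must exist before `try` so the except branch can use it.
try:
    # Main logic: report each step to the user as it happens
    self.output_message.actions.append("Current action...")
    self.output_message.push_update()
except Exception as e:
    logger.exception(f"Error in {self.agent_name}")  # full traceback for developers
    content.status = MsgStatus.error  # mark the content block as failed in the UI
    content.status_message = "User-friendly error message"
    self.output_message.publish()
    # Give the reasoning engine the underlying reason so it can decide what to do next
    return AgentResponse(status=AgentStatus.ERROR, message=str(e))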
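Put together, the pattern can be exercised end to end with stand-in classes. FlakyAgent, its known_videos set, and the stub types are hypothetical; the point is the shape: create the content object before try, report progress inside it, and on failure log the traceback, mark the content as errored, publish, and return an ERROR response whose message the reasoning engine can act on.

```python
import logging
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


# Stand-ins for the Director types used in the pattern above.
class AgentStatus(Enum):
    SUCCESS = "success"
    ERROR = "error"


class MsgStatus(Enum):
    progress = "progress"
    success = "success"
    error = "error"


@dataclass
class AgentResponse:
    status: AgentStatus
    message: str
    data: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Content:
    status: MsgStatus = MsgStatus.progress
    status_message: str = ""


@dataclass
class OutputMessage:
    actions: List[str] = field(default_factory=list)
    content: List[Content] = field(default_factory=list)
    published: bool = False

    def push_update(self) -> None:
        pass  # the real object streams the update to the client

    def publish(self) -> None:
        self.published = True  # the real object sends the final message


class FlakyAgent:
    """Hypothetical agent that fails for unknown video ids."""

    agent_name = "flaky_sketch"
    known_videos = {"m-123"}

    def __init__(self) -> None:
        self.output_message = OutputMessage()

    def run(self, video_id: str) -> AgentResponse:
        # Create content before `try` so the except branch can update it
        content = Content()
        self.output_message.content.append(content)
        try:
            self.output_message.actions.append(f"Loading {video_id}...")
            self.output_message.push_update()
            if video_id not in self.known_videos:
                raise ValueError(f"Unknown video id: {video_id}")
            content.status = MsgStatus.success
            content.status_message = "Video loaded."
            self.output_message.publish()
            return AgentResponse(AgentStatus.SUCCESS, "Video loaded.", {"video_id": video_id})
        except Exception as e:
            logger.exception(f"Error in {self.agent_name}")  # traceback goes to the log
            content.status = MsgStatus.error
            content.status_message = "Sorry, that video could not be found."
            self.output_message.publish()
            return AgentResponse(AgentStatus.ERROR, message=str(e))


agent = FlakyAgent()
response = agent.run("m-999")
print(response.status.value, response.message)  # error Unknown video id: m-999
```

The user sees the friendly status_message, while the reasoning engine receives the technical reason in message; keeping the two separate is what lets the engine decide whether to retry, ask the user, or try another agent.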