VideoDB Documentation

Make Your Video Sound Studio Quality with Voice Cloning

Imagine watching or sharing your videos, but instead of the original low-quality audio, you hear your voice rendered in crystal-clear, studio-quality sound.

A cloned voice can breathe new life into your videos. In this blog, I’ll show you how to build a Voice Replacement Agent in under an hour using Director.

🎯 Game Plan for Voice Replacement Agent

The steps are fairly simple:

User uploads voice samples (to clone) and a video (whose voice needs to be replaced).

Clone the voice from the provided samples.

Extract the transcript from the video.

Generate an audio clip in the cloned voice, narrating the transcript.

Overlay the new audio onto the video, replacing the original voice.

⭕️ Core Architecture

The Voice Replacement Agent is built on Director's extensible framework, leveraging its session management and state tracking capabilities while adding specialised voice replacement functionality. This is what the flow from the user’s input to the video output looks like.

⁠

Required Inputs

The agent needs the following parameters. To learn why an agent’s parameters matter, check out Marketing Content · Agent Creation Playbook.

sample_video: An object containing the video_id and the start and end times of the sample.

name_of_voice: Name to assign to the cloned voice.

description: Description of how the voice sounds (e.g., "elderly" or "child-like", the accent, etc.).

is_authorized_to_clone_voice: Flag indicating whether the user is authorized to clone the voice.

video_ids: List of unique IDs of the videos stored in VideoDB.

collection_id: Collection ID for storing the generated audio.


💡 The collection_id parameter refers to the ID of the VideoDB collection where your videos are stored and where the generated (synthesised) audio file can be stored for future use. To learn more, explore VideoDB Collections.

From ElevenLabs:

You’ll need the following two key methods from the ElevenLabs SDK:

clone: Used to clone a voice based on the provided audio samples.

generate: Generates synthesised audio from text using the cloned voice.

You will need to wrap both of these methods in Director's ElevenLabs tool so that the agent can call them.

⚒️ Setup

VideoDB and Director Setup

Get your API key from VideoDB Console. Install the latest SDK.

Follow the instructions at Setup Director Locally.

⚙️ Create ElevenLabs methods

For the voice cloning feature, the clone and generate methods provided by ElevenLabs need to be wrapped. You can do this in the existing ElevenLabs tool in the /backend/director/tools folder. Let's open the ElevenLabs tool in elevenlabs.py.

Define the required methods in the ElevenLabsTool.

For cloning, create the clone_audio method

    def clone_audio(self, audio_files: list[str], name_of_voice, description):
        # Create a cloned voice from the local sample audio files
        voice = self.client.clone(
            name=name_of_voice,
            files=audio_files,
            description=description,
        )
        return voice

And for generating the audio, create the synthesise_text method

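    # Voice is the ElevenLabs voice type: from elevenlabs import Voice
    def synthesise_text(self, voice: Voice, text_to_synthesis: str):
        # Narrate the text in the cloned voice
        audio = self.client.generate(text=text_to_synthesis, voice=voice, model="eleven_multilingual_v2")
        return audio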

🤖 Building the Agent

1. Import the required components

Create a voice_replacement.py file inside the backend/director/agents folder and add all the required imports to it.

2. Define the parameters for the agent

Referring to the Game Plan, create a JSON schema for these parameters.
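The parameter list from the Required Inputs section could be expressed as a JSON schema along the following lines. This is a sketch rather than the agent's actual schema: the constant name VOICE_REPLACEMENT_AGENT_PARAMETERS, the field types, the start_time and end_time key names inside sample_video, and the choice of required fields are all assumptions.

```python
# Hypothetical JSON schema for the agent's parameters.
# Field types, nested key names and the "required" lists are assumptions.
VOICE_REPLACEMENT_AGENT_PARAMETERS = {
    "type": "object",
    "properties": {
        "sample_video": {
            "type": "object",
            "description": "Video and time range to use as the voice sample",
            "properties": {
                "video_id": {"type": "string"},
                "start_time": {"type": "number"},
                "end_time": {"type": "number"},
            },
            "required": ["video_id", "start_time", "end_time"],
        },
        "name_of_voice": {
            "type": "string",
            "description": "Name to assign to the cloned voice",
        },
        "description": {
            "type": "string",
            "description": "How the voice sounds (age, accent, and so on)",
        },
        "is_authorized_to_clone_voice": {
            "type": "boolean",
            "description": "Whether the user is authorized to clone the voice",
        },
        "video_ids": {
            "type": "array",
            "items": {"type": "string"},
            "description": "IDs of the VideoDB videos to process",
        },
        "collection_id": {
            "type": "string",
            "description": "Collection in which to store the generated audio",
        },
    },
    "required": [
        "sample_video",
        "name_of_voice",
        "is_authorized_to_clone_voice",
        "video_ids",
        "collection_id",
    ],
}
```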

3. Implement Agent Class

We will now create the agent class. The parameters set here (self.agent_name, self.description and self.parameters) determine how the agent interacts with the reasoning engine.

4. Implement the core logic of the voice replacement agent

1. Declare a run method.

We will need to implement a run method in the agent’s class. This is the heart of the agent as this is the method that runs when the agent is called.

In the run method, define the required parameters which will be used to implement the agent.

2. Check Authorisation: If the user isn't authorised, return an error response

3. Initialise Tools: Check for the ElevenLabs API key and initialise the required tools.

4. Save the audio files locally: We need to extract the sample audio from the video stored in VideoDB. This can be broken into the following steps:

Generate a stream of the video for the sample based on start and end time of sample audio

Download the video based on the stream

Extract the audio file from the video

For generating the stream, we will use the existing get_video_stream method from the VideoDBTool :

For downloading the video, first we will get the download link of the above generated stream

Now, we will write a _download_video_file method which will download the video via the download_url and give the path where it is saved
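A download helper of this kind could look roughly like the sketch below. It is written as a standalone function named download_video_file and uses only the standard library (urllib), which is an assumption made to keep the example self-contained; the local_path argument is also hypothetical.

```python
import shutil
import urllib.request
from pathlib import Path


def download_video_file(download_url: str, local_path: str) -> str:
    """Stream the file at download_url to local_path and return the saved path."""
    destination = Path(local_path)
    # Make sure the target folder exists before writing
    destination.parent.mkdir(parents=True, exist_ok=True)
    # Copy the response body to disk in chunks instead of loading it into memory
    with urllib.request.urlopen(download_url) as response, open(destination, "wb") as out_file:
        shutil.copyfileobj(response, out_file)
    return str(destination)
```

Streaming the response to disk keeps memory use flat even when the sample clip is large.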

Now, let’s use the method to get the video_path

We will now extract the audio from the video using VideoDB. Start by defining a method _extract_audio_from_video, which takes a video file path as input and extracts audio from it.

Now, modify the existing get_audio method inside the tools/videodb-tool.py file to include a url field in the returned object. This field provides a direct link to the extracted audio file.

Create the _download_audio_file method, which takes this URL and downloads the audio file.

Finally, use this method to get the audio file.

5. Clone the Voice: Call ElevenLabs' clone_audio method to create the cloned voice.

6. Start processing all videos: With the cloned voice ready, we can now create overlays for the videos. We will process each video_id in the video_ids list.

7. Extract transcript from video: For the generation of audio for the video, we will need to extract the transcript from the video. For this, we will make a method in the agent which will take the video_id and return the transcript

And now, you can get the transcript from the video in the run method.
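A transcript helper might follow the pattern sketched below: try to fetch the transcript, and if that fails, index the spoken words and try once more. The function name get_transcript_text is hypothetical, and the sketch assumes the VideoDB tool exposes get_transcript(video_id) and index_spoken_words(video_id) and raises an exception when no transcript exists yet.

```python
def get_transcript_text(videodb_tool, video_id: str) -> str:
    """Return the transcript for video_id, indexing the video first if needed."""
    try:
        return videodb_tool.get_transcript(video_id)
    except Exception:
        # Assumption: a missing transcript surfaces as an exception.
        # Index the spoken words, then fetch the transcript again.
        videodb_tool.index_spoken_words(video_id)
        return videodb_tool.get_transcript(video_id)
```

Retrying only once keeps a genuine failure (for example, a video with no speech) from looping forever; the second exception simply propagates to the caller.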

8. Synthesise Text: Use ElevenLabs' synthesise_text method to generate audio from the input text

💡 To communicate the steps that the agent is taking, you can use self.output_message.actions together with the self.output_message.push_update method to send updates to the client. This lets you tell the user what the agent is doing at any given time.

9. Save the Audio File: Store the generated audio file locally
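Saving the generated audio can be as small as the sketch below. It assumes the synthesised audio arrives either as a single bytes object or as an iterable of byte chunks; the function name save_audio_chunks and the output_path argument are hypothetical.

```python
from pathlib import Path
from typing import Iterable, Union


def save_audio_chunks(audio: Union[bytes, Iterable[bytes]], output_path: str) -> str:
    """Write synthesised audio to output_path and return the saved path."""
    path = Path(output_path)
    # Create the download folder on first use
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as audio_file:
        if isinstance(audio, (bytes, bytearray)):
            audio_file.write(audio)
        else:
            for chunk in audio:
                # Skip empty chunks that a streaming response may yield
                if chunk:
                    audio_file.write(chunk)
    return str(path)
```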

10. Upload to VideoDB: We will upload the generated audio file to VideoDB and retrieve its unique audio ID.

11. Overlay audio onto the video: We will use VideoDB’s timeline feature to overlay the cloned voice onto the video.

💡 To learn more about timelines and audio overlays, see our doc: Audio overlay + Video + Timeline

For this, make a method for adding the overlay using the video_id and audio_id which returns a stream link so that we can stream the video.
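Such a method could be sketched with the VideoDB SDK's Timeline, VideoAsset and AudioAsset classes as shown below. The function name add_audio_overlay and the conn argument (a VideoDB connection) are assumptions, and the agent may obtain its timeline through the VideoDB tool instead. Setting disable_other_tracks=True is what mutes the original audio so that only the cloned voice is heard.

```python
def add_audio_overlay(conn, video_id: str, audio_id: str) -> str:
    """Overlay the audio on the video and return a stream link."""
    # Imported inside the function so this sketch loads without the videodb package
    from videodb.asset import AudioAsset, VideoAsset
    from videodb.timeline import Timeline

    timeline = Timeline(conn)
    # The original video forms the base (inline) track
    timeline.add_inline(VideoAsset(asset_id=video_id))
    # The cloned-voice audio starts at 0 seconds and replaces the other audio tracks
    audio_asset = AudioAsset(asset_id=audio_id, disable_other_tracks=True)
    timeline.add_overlay(start=0, asset=audio_asset)
    return timeline.generate_stream()
```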

Now, in the run method, before adding the overlay to the video, we will use VideoContent to display the video output in the Director’s UI to show that adding the audio overlay is in progress.

We will now add the overlay and pass the generated stream link to the video_content so that we can watch it in the Director’s UI.

12. Publish the updates and return an AgentResponse to end the agent's process

Once all the videos are processed, we will use the publish method to save the messages as a final step.

Also, you can send certain information such as the cloned_voice_id and audio_id as a response so that any subsequent chat requests will be able to use them to generate further responses.

13. Implement Error Handling

Ensure robust error handling to manage failures gracefully.
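One way to structure this is to wrap the body of run in a single try/except, so that any failure becomes an error response instead of an unhandled exception. The sketch below is self-contained, so AgentStatus and AgentResponse are simplified stand-ins for Director's classes of the same names, and run_voice_replacement and its process_videos argument are hypothetical; in the real agent this logic would live inside the run method.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable


class AgentStatus(str, Enum):
    """Simplified stand-in for Director's AgentStatus (assumption)."""
    SUCCESS = "success"
    ERROR = "error"


@dataclass
class AgentResponse:
    """Simplified stand-in for Director's AgentResponse (assumption)."""
    status: AgentStatus
    message: str = ""
    data: dict = field(default_factory=dict)


def run_voice_replacement(
    is_authorized_to_clone_voice: bool,
    process_videos: Callable[[], dict],
) -> AgentResponse:
    """Guard on authorisation, then turn any failure into an error response."""
    if not is_authorized_to_clone_voice:
        return AgentResponse(
            status=AgentStatus.ERROR,
            message="Not authorised to clone this voice.",
        )
    try:
        # process_videos is a hypothetical callable covering steps 3 to 11
        data = process_videos()
    except Exception as exc:
        return AgentResponse(
            status=AgentStatus.ERROR,
            message=f"Voice replacement failed: {exc}",
        )
    return AgentResponse(
        status=AgentStatus.SUCCESS,
        message="Voice replacement completed.",
        data=data,
    )
```

Returning a response object on every path means the caller never has to distinguish between a refused request and a crashed one; both arrive as an error status with a readable message.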

5. Register the Agent

To use the agent, open the backend/director/handler.py file and import the agent.

Then add the agent to the self.agents list.

And that's it! Writing an agent is this straightforward. Now you can try the agent out in Director locally and explore ElevenLabs' voice cloning and VideoDB's seamless video overlay feature to breathe new life into your videos!

🚀 Using the agent

Simply go to the frontend at http://localhost:8080 and refresh the page to see the agent available in the options for use.

💡 Conclusion

The Voice Replacement Agent demonstrates the power and flexibility of Director's agent framework. By combining VideoDB's video overlay feature and database system with ElevenLabs' voice cloning capabilities, we've created a secure, scalable solution for your voice cloning and video overlay needs.

Key takeaways:

Simple integration with Director's framework

Robust error handling and security measures

Scalable architecture for audio processing

Seamless way of adding audio overlays on videos

Ethical considerations built into the design

Creating an agent in VideoDB Director is incredibly easy, allowing you to build powerful and customised solutions quickly. What will you build? 🚀

⁠