
Introduction

With an endless stream of new video content on our feeds, engaging the audience with dynamic visual elements can make educational and promotional videos much more impactful. VideoDB’s suite of features allows you to enhance videos with programmatic editing. In this tutorial, we’ll explore how to create a video that visually counts and displays instances of a specified word as it’s spoken. We’ll use VideoDB’s Keyword Search to index spoken words, and then apply audio and text overlays to show a counter updating in real-time with synchronized audio cues.

Setup

Installing packages

!pip install videodb

API Keys

Before proceeding, ensure you have access to VideoDB. Get your API key from the VideoDB Console. (Free for your first 50 uploads, no credit card required.)

Steps

Step 1: Connect to VideoDB

Establish a connection for uploading videos. Import the necessary modules from the VideoDB library to access its functionality.
import videodb

# Set your API key
api_key = "your_api_key"

# Connect to VideoDB
conn = videodb.connect(api_key=api_key)
coll = conn.get_collection()

Step 2: Upload Video

Upload and play the video to ensure it has loaded correctly. We'll use this video throughout the tutorial.
video = coll.upload(url="https://www.youtube.com/watch?v=Js4rTM2Z1Eg")
video.play()

Step 3: Indexing Spoken Words

Index the video to identify and timestamp all spoken words.
video.index_spoken_words()
Step 4: Search for the Keyword

Search within the video for the keyword ("education" in this example) and note each occurrence.
from videodb import SearchType

result = video.search(query="education", search_type=SearchType.keyword)
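Before building the timeline, it can help to sanity-check what the search returned. Here is a minimal sketch of deriving the counter labels, where `shot_starts` is a hypothetical stand-in for the start times of `result.shots` used later in this tutorial:

```python
# Stand-in start times (seconds) for the matched shots; in the real
# flow these would come from result.shots returned by video.search(...).
shot_starts = [12.4, 47.9, 103.2, 188.0]

# The on-screen counter increments at each occurrence, so the label
# displayed from occurrence i onward is "Count-{i + 1}".
labels = [f"Count-{i + 1}" for i in range(len(shot_starts))]
print(labels)  # ['Count-1', 'Count-2', 'Count-3', 'Count-4']
```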

Step 5: Set Up the Timeline and Audio

Initialize the timeline and prepare an audio asset to use for each word occurrence.
from videodb.editor import Timeline, Track, Clip, AudioAsset, VideoAsset, TextAsset
from videodb.editor import Font, Background, Alignment, HorizontalAlignment, VerticalAlignment, Position, Offset
from videodb import MediaType

timeline = Timeline(conn)

# Upload the twink sound effect
audio = coll.upload(url="https://github.com/video-db/videodb-cookbook-assets/raw/main/audios/twink.mp3", media_type=MediaType.audio)

Step 6: Overlay Text and Audio

Add text and audio overlays at each instance where the word is spoken, using the Track and Clip pattern. Note: adding 'padding' is an optional step. It provides a little more context around each identified instance, resulting in a better compiled output.
video_duration = min(300, int(video.length))  # First 5 minutes only
audio_offset = 1  # Delay audio/text update by 1 second for better sync

# Create tracks for the base video, text overlays, and audio cues
# (the timeline itself was initialized in Step 5)
video_track = Track()
text_track = Track()
audio_track = Track()

# Add video clip (first 5 minutes)
video_clip = Clip(
    asset=VideoAsset(id=video.id, start=0),
    duration=video_duration,
)
video_track.add_clip(0, video_clip)

# Filter shots within our duration
shots_in_range = [s for s in result.shots if int(s.start) + audio_offset < video_duration]

# Add text overlays that update at each word occurrence
for i, shot in enumerate(shots_in_range):
    trigger_time = int(shot.start) + audio_offset

    # Initial "Count-0" from start until first word
    if i == 0 and trigger_time > 0:
        text_asset = TextAsset(
            text="Count-0",
            font=Font(family="Do Hyeon", size=72, color="#000100"),
            background=Background(color="#F702A4", opacity=1.0),
            alignment=Alignment(horizontal=HorizontalAlignment.right,
                                vertical=VerticalAlignment.top),
        )
        text_clip = Clip(
            asset=text_asset,
            duration=trigger_time,
            position=Position.top_right,
            offset=Offset(x=-0.05, y=0.05),
        )
        text_track.add_clip(0, text_clip)

    # Duration until next word or end of video
    if i + 1 < len(shots_in_range):
        next_trigger = int(shots_in_range[i + 1].start) + audio_offset
    else:
        next_trigger = video_duration

    text_dur = next_trigger - trigger_time

    # Text overlay with updated count
    text_asset = TextAsset(
        text=f"Count-{i + 1}",
        font=Font(family="Do Hyeon", size=72, color="#000100"),
        background=Background(color="#F702A4", opacity=1.0),
        alignment=Alignment(horizontal=HorizontalAlignment.right,
                            vertical=VerticalAlignment.top),
    )
    text_clip = Clip(
        asset=text_asset,
        duration=text_dur,
        position=Position.top_right,
        offset=Offset(x=-0.05, y=0.05),
    )
    text_track.add_clip(trigger_time, text_clip)

    # Audio cue at same trigger time
    if trigger_time < video_duration - 2:
        audio_clip = Clip(asset=AudioAsset(id=audio.id), duration=2)
        audio_track.add_clip(trigger_time, audio_clip)

# Add all tracks to timeline
timeline.add_track(video_track)
timeline.add_track(text_track)
timeline.add_track(audio_track)
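The duration arithmetic inside the loop can be checked in isolation. In this sketch, `shot_starts` is a hypothetical stand-in for the `result.shots` start times; each counter label is displayed from its trigger time until the next trigger, or until the end of the clip for the last occurrence:

```python
video_duration = 300
audio_offset = 1
shot_starts = [12.4, 47.9, 103.2]  # stand-ins for result.shots start times

# Trigger times mirror the loop above: int(shot.start) + audio_offset
triggers = [int(s) + audio_offset for s in shot_starts]

# Each label's (start, duration) runs until the next trigger,
# or until the end of the clip for the last occurrence.
intervals = []
for i, t in enumerate(triggers):
    nxt = triggers[i + 1] if i + 1 < len(triggers) else video_duration
    intervals.append((t, nxt - t))

print(intervals)  # [(13, 35), (48, 56), (104, 196)]
```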

Step 7: Generate and Play the Stream

Finally, generate a streaming URL for your edited video and play it.
from videodb import play_stream

stream_url = timeline.generate_stream()
play_stream(stream_url)
Here’s a preview showing each occurrence of the word “education”:

Conclusion

This tutorial showcases VideoDB’s capabilities to create a video that programmatically counts and displays the frequency of a specific keyword spoken throughout the video. This method can be adapted for various applications where dynamic text overlays add significant value to video content.

Tips and Tricks

  • Use different text styles and positions based on your video’s theme.
  • Add background sounds or effects to enhance the viewer’s experience.

Explore Full Notebook

Open the complete implementation in Google Colab with all code examples.