The Challenge

You have a beautiful music video, but it’s 4 minutes long. Social media wants something different: TikTok, Reels, and Shorts favor 15-60 second vertical clips that stop people mid-scroll. Manually identifying the best segment, extracting it, generating backgrounds, adding lyrics, and exporting in a vertical format is tedious and time-consuming. What if AI could do all of that for you?

What You’ll Build

Turn any music video into viral-ready vertical clips optimized for social media. This system automates the complete workflow:
  • AI identifies the catchiest segments (chorus with build-up)
  • Generates vertical 9:16 backgrounds for mobile viewing
  • Creates mobile-optimized captions with high contrast
  • Adds pre-roll, CTA, and post-roll for professional polish
  • Outputs 15-60 second clips ready to upload
All of this is powered by VideoDB’s Editor SDK, built for social media from the ground up.

Setup

Install Dependencies

pip install videodb

Connect to VideoDB

import videodb

# Connect to VideoDB
api_key = "your_api_key"
conn = videodb.connect(api_key=api_key)
coll = conn.get_collection()
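
Hard-coding the key is fine for a quick test, but in a shared notebook it is usually safer to read it from the environment. A minimal sketch, assuming the key is stored in an environment variable named `VIDEO_DB_API_KEY`; both that variable name and the `load_api_key` helper are illustrative and not part of the original walkthrough:

```python
import os


def load_api_key(env_var="VIDEO_DB_API_KEY"):
    """Return the API key stored in env_var, or raise a clear error."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


# Hypothetical usage: conn = videodb.connect(api_key=load_api_key())
```

Failing early with a readable message tends to be easier to debug than an authentication error raised later in the pipeline.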

Implementation

Step 1: Upload Your Music Video

# Upload from URL
video = coll.upload(url="https://www.youtube.com/watch?v=gtnD0eCcCC8")

Step 2: Generate Transcript with Word-Level Timestamps

# Generate transcript
transcript = video.generate_transcript(force=True)

# Retrieve timed transcript (word-level timestamps)
transcript_timed = video.get_transcript()

# Retrieve plain text transcript
transcript_text = video.get_transcript_text()
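
Before handing the transcript to the model, it can help to spot-check the word timings. The sketch below assumes each entry of the timed transcript is a dict with `start`, `end`, and `text` fields (the shape described in the Step 3 prompt); the `words_in_window` helper and the sample data are made up for illustration and are not real SDK output:

```python
def words_in_window(timed_words, start, end):
    """Return the text of words that lie fully inside [start, end] seconds."""
    return [
        w["text"]
        for w in timed_words
        if float(w["start"]) >= start and float(w["end"]) <= end
    ]


# Made-up sample in the assumed shape (not real transcript output)
sample = [
    {"start": 0.0, "end": 0.4, "text": "ocean"},
    {"start": 0.4, "end": 0.9, "text": "waves"},
    {"start": 5.0, "end": 5.5, "text": "rolling"},
]
print(words_in_window(sample, 0.0, 1.0))  # ['ocean', 'waves']
```

Calling it on `transcript_timed` with the start and end of a suspected chorus is a quick way to confirm that the timestamps line up with what you hear.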

Step 3: Identify the Best Segment

The AI analyzes the transcript to find the catchiest chorus with proper build-up and wind-down:
import json

# Define visual aesthetic
user_request = "Calm and serene vibe. Ocean, beach, sunsets and peace"

prompt = f"""
You are a viral social media content strategist specializing in short-form vertical video. Your task is to identify the most engaging, catchy, and shareable segments from a music video to create TikTok/Reels/Shorts content.

# INPUT DATA

1. Full Transcript Text (may contain ASR errors/missing punctuation):
{transcript_text}

2. Word-Level Timed Transcript (JSON array with start/end/text/speaker fields):
{json.dumps(transcript_timed, ensure_ascii=False)}

3. Video Metadata:
   - Name: {video.name}
   - Duration: {video.length} seconds
   - User Style Request: {user_request}

# SEGMENT SELECTION CRITERIA

Identify **1-3 high-quality segments** (prioritize quality over quantity). Each segment should:

## What Makes a Segment "Catchy"?

**PRIMARY FOCUS: Chorus/Hook with Proper Framing**

Segments MUST be built around the chorus with these components:
1. **Build-up (Pre-Chorus/Verse End)**: 2-4 lines leading into the chorus that create anticipation
2. **The Chorus/Hook**: The main catchy, repetitive, memorable section
3. **Wind-down (Post-Chorus)**: 2-4 lines after the chorus that provide resolution

The segment structure should be: **Ramp Up → Peak (Chorus) → Ramp Down**

Additional qualities that enhance a chorus segment:
- **Emotional Peak**: Most intense, emotionally charged moment (climax, drop, powerful vocals)
- **Quotable Lines**: Lyrics that are relatable, funny, profound, or highly shareable
- **Viral Potential**: Lyrics that could inspire trends, dances, duets, or memes
- **Energy Shift**: Dramatic beat drop, tempo change, or dynamic transition in/around the chorus
- **Memorable Moment**: Distinctive vocal run, ad-lib, or production element that makes the chorus special

**Non-Negotiable Rules:**
- The chorus is the centerpiece - always include it
- Never start directly on the first chorus line - include the approach
- Never end directly after the last chorus line - include the exit
- Build-up and wind-down are MANDATORY for professional feel

## Segment Requirements
- **Duration**: 15-60 seconds per segment (optimal: 20-45 seconds for retention)
- **Focus on Chorus**: Segments should CENTER around the chorus/hook - this is the primary target
- **Build-up REQUIRED**: MUST include the build-up/lead-in before the chorus (last 2-4 lines of pre-chorus or verse)
- **Wind-down REQUIRED**: MUST include the wind-down/resolution after the chorus (first 2-4 lines after chorus ends)
- **No Abrupt Starts**: Never start directly on the main chorus line - include runway space before it
- **No Abrupt Ends**: Never cut immediately when the chorus ends - include graceful exit
- **Ramp Up & Ramp Down**: The segment should feel like a complete emotional arc with natural entry and exit
- **Completeness**: Each segment should feel like a complete mini-story with beginning, climax (chorus), and resolution
- **Standalone Quality**: Segment must work independently and feel professionally edited, not chopped
- **Quality Over Quantity**: If only 1 truly excellent segment exists, return only that one

## Lyric Processing

### Text Handling
- Lightly fix obvious ASR errors ONLY when strongly implied by context
- Do NOT invent new lyrics beyond what is clearly present
- Merge words into natural phrases suitable for vertical video display
- Target 3-8 words per line (shorter than full videos due to vertical format)
- Split long lines for mobile readability

### Timing
- Each stanza's start = earliest word start time
- Each stanza's end = latest word end time
- Each line must have precise start/end timestamps from word-level data
- Ensure NO time gaps within a segment

### Speaker Handling
- Preserve speaker labels
- If stanza mixes speakers, set speaker="Mixed"
- Assign consistent hex colors per speaker

# IMAGE GENERATION REQUIREMENTS

For EACH stanza within EACH segment, create an "image_prompt" following these rules:

## Short-Form Specific Considerations
- **Vertical aspect ratio**: 9:16 (portrait orientation for mobile)
- **Mobile-first design**: Images should look compelling on small phone screens
- **Attention-grabbing**: Short-form content needs more visual impact than full videos
- **Text readability**: Even more critical in vertical format with limited screen space

## Independence & Self-Containment
- Each prompt must be FULLY SELF-CONTAINED with complete scene descriptions
- NEVER use references like "previous image", "same as before", "similar to", or "continue from"
- Each prompt must work standalone if generated in isolation

## Content & Style
- Honor the user's style request: {user_request}
- Reflect the mood, theme, and emotional tone of the current stanza
- Create visual progression within each segment
- **CRITICAL**: Prompts must NEVER include text, lyrics, words, letters, names, or any written language in the generated image

## Composition for Vertical Video Text Overlay
- **Vertical framing**: Design for 9:16 portrait orientation
- Keep **center vertical strip** relatively clear for text overlay
- Text typically appears in middle-to-upper-middle area on mobile
- Avoid busy details in the central vertical zone
- Can have more detail at top/bottom edges where text is less likely
- Use balanced, mobile-optimized compositions
- Images should be visually striking but NOT overly distracting

## Visual Consistency Within Each Segment
Each segment should have internal visual consistency:
- **Color palette**: Harmonious colors within the segment
- **Artistic style**: Consistent style throughout the segment
- **Mood/atmosphere**: Unified emotional tone
- **Composition approach**: Similar framing strategy

## Prompt Structure
Each image_prompt should specify:
1. Scene/subject matter relevant to the stanza
2. Emotional mood and atmosphere matching the segment's energy
3. Lighting conditions and color tone
4. Artistic style (aligned with user_request)
5. **Vertical composition notes** (e.g., "portrait orientation, centered vertical negative space, detailed top and bottom thirds")
6. Mobile-optimized visual impact

# FONT COLOR SELECTION

For EACH stanza, select a "font_color" ensuring optimal mobile readability:

## Available Colors
Choose from this list ONLY:
- White: #FFFFFF (for dark backgrounds)
- Black: #000000 (for light backgrounds)
- Yellow: #FFFF00 (for dark backgrounds, extremely high visibility on mobile)
- Dark Blue: #00008B (for light backgrounds)
- Dark Green: #006400 (for light backgrounds)

## Selection Guidelines
- **Light/bright backgrounds** → Black (#000000), Dark Blue (#00008B), or Dark Green (#006400)
- **Dark/dim backgrounds** → White (#FFFFFF) or Yellow (#FFFF00)
- **Mobile priority**: Ensure MAXIMUM CONTRAST for small screen readability
- Consider that users often view in bright outdoor lighting or dim rooms
- Yellow (#FFFF00) works exceptionally well for high-energy viral content on dark backgrounds

# SEGMENT METADATA

For each segment, provide:

## segment_title
- A catchy, descriptive title for the segment (3-6 words)
- Should hint at what makes this segment special
- Examples: "Explosive Chorus Drop", "Emotional Bridge Moment", "Viral Hook Section"

## start_time & end_time
- Precise timestamps from the original full music video
- These will be used to extract the video segment

# OUTPUT FORMAT

Return ONLY a valid JSON array with this exact structure:

[
  {{
    "segment_title": "string (catchy 3-6 word title)",
    "start_time": float (seconds from original video start),
    "end_time": float (seconds from original video start),
    "duration": float (end_time - start_time, for verification),
    "why_catchy": "string (1-2 sentence explanation of what makes this segment viral-worthy)",
    "stanzas": [
      {{
        "stanza_start": float,
        "stanza_end": float,
        "image_prompt": "string (detailed, self-contained, vertical 9:16 image generation prompt)",
        "font_color": "string (hex code from approved list)",
        "lines": [
          {{
            "text": "string (lyric line)",
            "start": float,
            "end": float
          }}
        ]
      }}
    ]
  }}
]

# CRITICAL REMINDERS

1. **CHORUS-CENTERED STRUCTURE**: Every segment must be built around a chorus with proper build-up and wind-down
2. **NO ABRUPT STARTS**: Always include 2-4 lines BEFORE the chorus begins (pre-chorus/verse ending)
3. **NO ABRUPT ENDS**: Always include 2-4 lines AFTER the chorus ends (post-chorus beginning)
4. **PROFESSIONAL RAMP UP/DOWN**: The segment should feel like a complete emotional journey, not a choppy cut
5. **Quality over quantity**: 1 amazing segment > 3 mediocre ones
6. **Each image_prompt is independent**: No cross-references between prompts
7. **Vertical format**: All image prompts must specify 9:16 portrait orientation
8. **No text in images**: Never include lyrics, words, or letters in image generation prompts
9. **Complete coverage**: No time gaps within each segment
10. **Mobile-first**: Optimize everything for small screen viewing
11. **Honor style request**: All creative decisions must align with: {user_request}
12. **Line length**: Each text line must not exceed 20 characters, even if that means fewer words than the 3-8 word target (highly important)

Output the JSON array only, no additional text.
"""

# Ask the model for the segment plan; the parsed JSON is under "output"
response = coll.generate_text(
    prompt=prompt,
    model_name="pro",
    response_type="json",
)

segments = response.get("output")
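
Model output does not always respect every rule in the prompt, so a quick sanity check before generating images can save time. The following is a minimal sketch, assuming the segments follow the JSON structure requested above; `validate_segment`, its thresholds, and the sample data are illustrative and not part of the SDK:

```python
ALLOWED_COLORS = {"#FFFFFF", "#000000", "#FFFF00", "#00008B", "#006400"}


def validate_segment(segment, min_len=15.0, max_len=60.0, max_chars=20):
    """Return a list of human-readable problems found in one segment."""
    problems = []
    duration = segment["end_time"] - segment["start_time"]
    if not min_len <= duration <= max_len:
        problems.append(f"duration {duration:.1f}s outside {min_len}-{max_len}s")
    for s_idx, stanza in enumerate(segment.get("stanzas", [])):
        if stanza.get("font_color", "").upper() not in ALLOWED_COLORS:
            problems.append(f"stanza {s_idx}: font_color not in approved list")
        if (stanza["stanza_start"] < segment["start_time"]
                or stanza["stanza_end"] > segment["end_time"]):
            problems.append(f"stanza {s_idx}: outside segment bounds")
        for line in stanza.get("lines", []):
            if len(line["text"]) > max_chars:
                problems.append(f"stanza {s_idx}: line too long: {line['text']!r}")
    return problems


# Made-up segment for illustration: the second lyric line breaks the 20-character rule
sample_segment = {
    "start_time": 60.0,
    "end_time": 90.0,
    "stanzas": [{
        "stanza_start": 60.0,
        "stanza_end": 75.0,
        "font_color": "#FFFFFF",
        "lines": [
            {"text": "Waves keep calling", "start": 60.0, "end": 64.0},
            {"text": "This lyric line is far too long", "start": 64.0, "end": 75.0},
        ],
    }],
}
for problem in validate_segment(sample_segment):
    print(problem)
```

If a segment reports problems, one option is to rerun the prompt; another is to drop that segment and keep the clean ones.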

Step 4: Generate Vertical Background Images with AI

Create AI-generated background images for each lyric stanza:
from concurrent.futures import ThreadPoolExecutor, as_completed

# Helper: merge very short "[Music...]" stanzas (under 2 seconds) into the
# previous stanza so brief instrumental gaps do not get their own background
def optimize_segment_stanzas(stanzas):
    if not stanzas:
        return []
    optimized = []
    for i, current in enumerate(stanzas):
        lines = current.get("lines")
        is_music = bool(lines) and "[Music...]" in lines[0].get("text", "")
        if is_music and i > 0:
            prev = optimized[-1]
            music_duration = current["stanza_end"] - current["stanza_start"]
            if music_duration < 2.0:
                # Extend the previous stanza (and its last line) over the gap
                prev["stanza_end"] = current["stanza_end"]
                if prev.get("lines"):
                    prev["lines"][-1]["end"] = current["stanza_end"]
                continue
        optimized.append(current)
    return optimized

# Prepare image generation tasks
image_tasks = []
for seg_idx, segment in enumerate(segments):
    segment["optimized_stanzas"] = optimize_segment_stanzas(segment["stanzas"])

    for stan_idx, s in enumerate(segment["optimized_stanzas"]):
        image_tasks.append({
            "seg_idx": seg_idx,
            "stan_idx": stan_idx,
            "prompt": s["image_prompt"]
        })

def generate_worker(task):
    img = coll.generate_image(prompt=task["prompt"], aspect_ratio="9:16")
    return task["seg_idx"], task["stan_idx"], img.id

# Generate images in parallel
MAX_WORKERS = 10
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = [executor.submit(generate_worker, task) for task in image_tasks]

    for future in as_completed(futures):
        seg_i, stan_i, img_id = future.result()
        segments[seg_i]["optimized_stanzas"][stan_i]["image_id"] = img_id
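
In the loop above, `future.result()` re-raises any exception thrown inside a worker, so a single failed image stops the whole loop. One possible way to keep going and retry only the failures is sketched below; `run_all` and `fake_worker` are hypothetical names, and the stand-in worker replaces the real `generate_image` call so the example runs on its own:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_all(tasks, worker, max_workers=4):
    """Run worker(task) for every task; return (results, failures)."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_task = {executor.submit(worker, t): t for t in tasks}
        for future in as_completed(future_to_task):
            task = future_to_task[future]
            try:
                results.append(future.result())
            except Exception as exc:  # keep going; record what failed
                failures.append((task, exc))
    return results, failures


# Stand-in worker for illustration; a real one would call generate_image
def fake_worker(task):
    if task["prompt"] == "":
        raise ValueError("empty prompt")
    return task["seg_idx"], task["stan_idx"], f"img-{task['stan_idx']}"


demo_tasks = [
    {"seg_idx": 0, "stan_idx": 0, "prompt": "calm beach at sunset"},
    {"seg_idx": 0, "stan_idx": 1, "prompt": ""},
]
done, failed = run_all(demo_tasks, fake_worker)
print(len(done), "succeeded;", len(failed), "failed")
```

The `failed` list keeps the original task dicts, so they can be passed back to `run_all` for another attempt.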

Step 5: Build Multi-Layer Timeline

Create a timeline with audio, backgrounds, and animated lyrics:
import math
from videodb.editor import (
    VideoAsset, ImageAsset, TextAsset, Font,
    Clip, Track, Timeline, Transition,
    Position, Fit, Offset, Alignment, HorizontalAlignment, VerticalAlignment, Border
)

def calculate_y_offsets(num_lines, line_height=45, timeline_height=1080):
    offsets = []
    start_pixel_offset = -((num_lines - 1) * line_height) / 2
    half_height = timeline_height / 2
    for i in range(num_lines):
        pixel_y = start_pixel_offset + (i * line_height)
        offsets.append(pixel_y / half_height)
    return offsets

# Timeline configuration
TIMELINE_WIDTH = 608
TIMELINE_HEIGHT = 1080
PRE_ROLL = 5.0
POST_ROLL = 5.0
CTA_DURATION = 5.0

# Render each segment
for idx, segment in enumerate(segments):
    # Calculate timing
    actual_seg_start = segment["start_time"]
    actual_seg_end = segment["end_time"]

    timeline_start_og = max(0, actual_seg_start - PRE_ROLL)
    intro_duration = actual_seg_start - timeline_start_og
    total_audio_duration = (actual_seg_end - timeline_start_og) + CTA_DURATION + POST_ROLL

    stanza_items = segment["optimized_stanzas"]

    # Initialize timeline
    timeline = Timeline(conn)
    timeline.resolution = f"{TIMELINE_WIDTH}x{TIMELINE_HEIGHT}"
    timeline.background = "#000000"

    # Layer 1: Audio Track (opacity=0 for audio only)
    audio_track = Track(z_index=0)
    audio_track.add_clip(
        start=0.0,
        clip=Clip(
            asset=VideoAsset(id=video.id, start=timeline_start_og, volume=1.0),
            duration=total_audio_duration,
            fit=Fit.crop,
            position=Position.center,
            opacity=0.0,
            transition=Transition(in_="fade", out="fade", duration=5.0)
        ),
    )
    timeline.add_track(audio_track)

    # Layer 2: Background Images Track
    images_track = Track(z_index=1)
    for i, s in enumerate(stanza_items):
        local_start = s["stanza_start"] - timeline_start_og
        local_end = s["stanza_end"] - timeline_start_og

        if i == 0:
            local_start = 0.0
            trans_in = Transition(in_="fade", duration=intro_duration)
        else:
            trans_in = Transition(in_="fade", out="fade", duration=0.35)

        duration = max(0.1, local_end - local_start)

        images_track.add_clip(
            start=local_start,
            clip=Clip(
                asset=ImageAsset(id=s["image_id"]),
                duration=duration,
                fit=Fit.crop,
                position=Position.center,
                transition=trans_in
            ),
        )
    timeline.add_track(images_track)

    # Layer 3: Lyrics Track
    lyrics_track = Track(z_index=2)
    LINE_HEIGHT = 45
    lyrics_border = Border(color="#000000", width=1.5)

    for s in stanza_items:
        lines = s.get("lines", [])
        y_offsets = calculate_y_offsets(len(lines), LINE_HEIGHT, TIMELINE_HEIGHT)
        local_stanza_end = s["stanza_end"] - timeline_start_og

        for l_idx, line in enumerate(lines):
            local_line_start = line["start"] - timeline_start_og
            line_duration = max(0.1, local_stanza_end - local_line_start)

            text_asset = TextAsset(
                text=line["text"],
                # Use the per-stanza color chosen in Step 3; fall back to white
                font=Font(family="Bebas Neue", size=48, color=s.get("font_color", "#FFFFFF"), weight=700),
                border=lyrics_border,
                alignment=Alignment(horizontal=HorizontalAlignment.center, vertical=VerticalAlignment.center)
            )

            lyrics_track.add_clip(
                start=local_line_start,
                clip=Clip(
                    asset=text_asset,
                    duration=line_duration,
                    position=Position.center,
                    offset=Offset(x=0, y=y_offsets[l_idx]),
                    transition=Transition(in_="fade", duration=0.5)
                )
            )

    # Layer 4: CTA (Call to Action)
    cta_start_time = (actual_seg_end - timeline_start_og)
    cta_lines = ["Visit the channel", "for the full video"]
    cta_y_offsets = calculate_y_offsets(len(cta_lines), line_height=60, timeline_height=TIMELINE_HEIGHT)

    for c_idx, cta_text in enumerate(cta_lines):
        cta_asset = TextAsset(
            text=cta_text,
            font=Font(family="Bebas Neue", size=42, color="#FFFFFF", weight=700),
            border=lyrics_border,
            alignment=Alignment(horizontal=HorizontalAlignment.center, vertical=VerticalAlignment.center)
        )

        lyrics_track.add_clip(
            start=cta_start_time,
            clip=Clip(
                asset=cta_asset,
                duration=CTA_DURATION,
                position=Position.center,
                offset=Offset(x=0, y=cta_y_offsets[c_idx]),
                transition=Transition(in_="fade", duration=0.5)
            )
        )

    timeline.add_track(lyrics_track)

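    # Generate the final stream and print its URL so each clip can be opened
    stream_url = timeline.generate_stream()
    print(f"Segment {idx + 1} ({segment.get('segment_title', 'untitled')}): {stream_url}")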
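
The timing and layout arithmetic in the render loop can be checked in isolation. This sketch mirrors those formulas with plain numbers; `plan_timing` and `y_offsets` are illustrative names (the second repeats the logic of `calculate_y_offsets`), and the sample times are made up:

```python
def plan_timing(seg_start, seg_end, pre_roll=5.0, cta=5.0, post_roll=5.0):
    """Mirror the timing arithmetic from the render loop."""
    timeline_start = max(0.0, seg_start - pre_roll)  # source time where the clip begins
    intro = seg_start - timeline_start               # shorter than pre_roll near t=0
    cta_start = seg_end - timeline_start             # local time where the CTA appears
    total = cta_start + cta + post_roll              # total length of the audio clip
    return {"timeline_start": timeline_start, "intro": intro,
            "cta_start": cta_start, "total": total}


def y_offsets(num_lines, line_height=45, timeline_height=1080):
    """Centered vertical offsets as fractions of half the frame height."""
    first = -((num_lines - 1) * line_height) / 2
    return [(first + i * line_height) / (timeline_height / 2) for i in range(num_lines)]


print(plan_timing(62.0, 92.0))  # segment well inside the video
print(plan_timing(3.0, 33.0))   # segment near the start: intro is clamped
print(y_offsets(2))             # two lines, placed symmetrically around center
```

With the default constants, a 30-second segment starting at 62 s produces a 45-second timeline, and a 60-second segment produces 75 seconds. If a platform enforces a 60-second limit, the pre-roll, CTA, and post-roll constants may need trimming.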

What You Get

A professional-looking short-form video ready for social media:
  • Perfectly timed to the music
  • AI-generated matching visuals
  • Synced lyrics on screen
  • Professional pre-roll and CTA
  • Vertical format (608x1080)
  • Ready to upload to TikTok, Reels, or Shorts

The Result

With this system, you can:
  • Process music videos in minutes instead of hours
  • Generate multiple vertical clips from one source video (the prompt above requests 1-3 segments per run)
  • Maintain consistency across all clips with the same visual style
  • Stay on top of social media trends with fast production
No more manual editing. Just music, lyrics, AI-generated visuals, and viral potential.

Explore the Full Notebook

Open the complete implementation with parallel processing, custom effects, and advanced styling options.