The Challenge
You have a beautiful music video, but it’s 4 minutes long. Social media demands something different: TikTok, Reels, and Shorts want 15-60 second vertical clips that stop people mid-scroll. Manually identifying the best segment, extracting it, generating backgrounds, adding lyrics, and exporting for vertical format is tedious and time-consuming. What if AI could do all of that for you?
What You’ll Build
Turn any music video into viral-ready vertical clips optimized for social media. This system automates the complete workflow:
- AI identifies the catchiest segments (chorus with build-up)
- Generates vertical 9:16 backgrounds for mobile viewing
- Creates mobile-optimized captions with high contrast
- Adds pre-roll, CTA, and post-roll for professional polish
- Outputs 15-60 second clips ready to upload
Setup
Install Dependencies
pip install videodb
Connect to VideoDB
import videodb
# Connect to VideoDB
api_key = "your_api_key"
conn = videodb.connect(api_key=api_key)
coll = conn.get_collection()
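Hard-coding the key works for a quick demo, but it is easy to leak in a shared notebook. A minimal sketch of reading it from the environment instead is shown below. The helper name `load_api_key` and the variable name `VIDEO_DB_API_KEY` are assumptions made for illustration; use whatever name your setup prefers.

```python
import os


def load_api_key(env=os.environ, var_name="VIDEO_DB_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    `load_api_key` and `var_name` are hypothetical names for this sketch.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first")
    return key


# Usage (requires the videodb package and a real key):
# conn = videodb.connect(api_key=load_api_key())
```

Passing `env` as a parameter keeps the helper easy to test without touching the real environment.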
Implementation
Step 1: Upload Your Music Video
# Upload from URL
video = coll.upload(url="https://www.youtube.com/watch?v=gtnD0eCcCC8")
Step 2: Generate Transcript with Word-Level Timestamps
# Generate transcript
transcript = video.generate_transcript(force=True)
# Retrieve timed transcript (word-level timestamps)
transcript_timed = video.get_transcript()
# Retrieve plain text transcript
transcript_text = video.get_transcript_text()
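To get a feel for what word-level timestamps make possible, here is a small sketch that packs timed words into short caption lines by hand. It assumes each transcript item is a dict with `start`, `end`, and `text` keys, as the prompt in the next step describes; the helper `words_to_lines` and the sample words are made up for illustration and are not part of the pipeline, which delegates this grouping to the model.

```python
def words_to_lines(words, max_chars=20):
    """Greedily pack timed words into caption lines of at most max_chars.

    A single word longer than max_chars still gets a line of its own.
    """
    lines = []
    current = None
    for w in words:
        text = w["text"].strip()
        if not text:
            continue  # skip empty entries
        candidate = text if current is None else f'{current["text"]} {text}'
        if current is not None and len(candidate) > max_chars:
            # The word does not fit: close the current line and start a new one
            lines.append(current)
            current = None
            candidate = text
        if current is None:
            current = {"text": candidate, "start": w["start"], "end": w["end"]}
        else:
            current["text"] = candidate
            current["end"] = w["end"]  # line ends with its latest word
    if current is not None:
        lines.append(current)
    return lines


# Made-up sample in the assumed transcript shape
sample_words = [
    {"start": 0.0, "end": 0.4, "text": "Waves"},
    {"start": 0.4, "end": 0.8, "text": "keep"},
    {"start": 0.8, "end": 1.3, "text": "rolling"},
    {"start": 1.3, "end": 1.6, "text": "over"},
    {"start": 1.6, "end": 2.0, "text": "me"},
]
for line in words_to_lines(sample_words):
    print(line)
```

Each line starts at its earliest word and ends at its latest word, which is the same timing rule the prompt asks the model to follow.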
Step 3: Identify the Best Segment
The AI analyzes the transcript to find the catchiest chorus with proper build-up and wind-down:
import json
# Define visual aesthetic
user_request = "Calm and serene vibe. Ocean, beach, sunsets and peace"
prompt = f"""
You are a viral social media content strategist specializing in short-form vertical video. Your task is to identify the most engaging, catchy, and shareable segments from a music video to create TikTok/Reels/Shorts content.
# INPUT DATA
1. Full Transcript Text (may contain ASR errors/missing punctuation):
{transcript_text}
2. Word-Level Timed Transcript (JSON array with start/end/text/speaker fields):
{json.dumps(transcript_timed, ensure_ascii=False)}
3. Video Metadata:
- Name: {video.name}
- Duration: {video.length} seconds
- User Style Request: {user_request}
# SEGMENT SELECTION CRITERIA
Identify **1-3 high-quality segments** (prioritize quality over quantity). Each segment should:
## What Makes a Segment "Catchy"?
**PRIMARY FOCUS: Chorus/Hook with Proper Framing**
Segments MUST be built around the chorus with these components:
1. **Build-up (Pre-Chorus/Verse End)**: 2-4 lines leading into the chorus that create anticipation
2. **The Chorus/Hook**: The main catchy, repetitive, memorable section
3. **Wind-down (Post-Chorus)**: 2-4 lines after the chorus that provide resolution
The segment structure should be: **Ramp Up → Peak (Chorus) → Ramp Down**
Additional qualities that enhance a chorus segment:
- **Emotional Peak**: Most intense, emotionally charged moment (climax, drop, powerful vocals)
- **Quotable Lines**: Lyrics that are relatable, funny, profound, or highly shareable
- **Viral Potential**: Lyrics that could inspire trends, dances, duets, or memes
- **Energy Shift**: Dramatic beat drop, tempo change, or dynamic transition in/around the chorus
- **Memorable Moment**: Distinctive vocal run, ad-lib, or production element that makes the chorus special
**Non-Negotiable Rules:**
- The chorus is the centerpiece - always include it
- Never start directly on the first chorus line - include the approach
- Never end directly after the last chorus line - include the exit
- Build-up and wind-down are MANDATORY for professional feel
## Segment Requirements
- **Duration**: 15-60 seconds per segment (optimal: 20-45 seconds for retention)
- **Focus on Chorus**: Segments should CENTER around the chorus/hook - this is the primary target
- **Build-up REQUIRED**: MUST include the build-up/lead-in before the chorus (last 2-4 lines of pre-chorus or verse)
- **Wind-down REQUIRED**: MUST include the wind-down/resolution after the chorus (first 2-4 lines after chorus ends)
- **No Abrupt Starts**: Never start directly on the main chorus line - include runway space before it
- **No Abrupt Ends**: Never cut immediately when the chorus ends - include graceful exit
- **Ramp Up & Ramp Down**: The segment should feel like a complete emotional arc with natural entry and exit
- **Completeness**: Each segment should feel like a complete mini-story with beginning, climax (chorus), and resolution
- **Standalone Quality**: Segment must work independently and feel professionally edited, not chopped
- **Quality Over Quantity**: If only 1 truly excellent segment exists, return only that one
## Lyric Processing
### Text Handling
- Lightly fix obvious ASR errors ONLY when strongly implied by context
- Do NOT invent new lyrics beyond what is clearly present
- Merge words into natural phrases suitable for vertical video display
- Target 3-8 words per line (shorter than full videos due to vertical format)
- Split long lines for mobile readability
### Timing
- Each stanza's start = earliest word start time
- Each stanza's end = latest word end time
- Each line must have precise start/end timestamps from word-level data
- Ensure NO time gaps within a segment
### Speaker Handling
- Preserve speaker labels
- If stanza mixes speakers, set speaker="Mixed"
- Assign consistent hex colors per speaker
# IMAGE GENERATION REQUIREMENTS
For EACH stanza within EACH segment, create an "image_prompt" following these rules:
## Short-Form Specific Considerations
- **Vertical aspect ratio**: 9:16 (portrait orientation for mobile)
- **Mobile-first design**: Images should look compelling on small phone screens
- **Attention-grabbing**: Short-form content needs more visual impact than full videos
- **Text readability**: Even more critical in vertical format with limited screen space
## Independence & Self-Containment
- Each prompt must be FULLY SELF-CONTAINED with complete scene descriptions
- NEVER use references like "previous image", "same as before", "similar to", or "continue from"
- Each prompt must work standalone if generated in isolation
## Content & Style
- Honor the user's style request: {user_request}
- Reflect the mood, theme, and emotional tone of the current stanza
- Create visual progression within each segment
- **CRITICAL**: Prompts must NEVER include text, lyrics, words, letters, names, or any written language in the generated image
## Composition for Vertical Video Text Overlay
- **Vertical framing**: Design for 9:16 portrait orientation
- Keep **center vertical strip** relatively clear for text overlay
- Text typically appears in middle-to-upper-middle area on mobile
- Avoid busy details in the central vertical zone
- Can have more detail at top/bottom edges where text is less likely
- Use balanced, mobile-optimized compositions
- Images should be visually striking but NOT overly distracting
## Visual Consistency Within Each Segment
Each segment should have internal visual consistency:
- **Color palette**: Harmonious colors within the segment
- **Artistic style**: Consistent style throughout the segment
- **Mood/atmosphere**: Unified emotional tone
- **Composition approach**: Similar framing strategy
## Prompt Structure
Each image_prompt should specify:
1. Scene/subject matter relevant to the stanza
2. Emotional mood and atmosphere matching the segment's energy
3. Lighting conditions and color tone
4. Artistic style (aligned with user_request)
5. **Vertical composition notes** (e.g., "portrait orientation, centered vertical negative space, detailed top and bottom thirds")
6. Mobile-optimized visual impact
# FONT COLOR SELECTION
For EACH stanza, select a "font_color" ensuring optimal mobile readability:
## Available Colors
Choose from this list ONLY:
- White: #FFFFFF (for dark backgrounds)
- Black: #000000 (for light backgrounds)
- Yellow: #FFFF00 (for dark backgrounds, extremely high visibility on mobile)
- Dark Blue: #00008B (for light backgrounds)
- Dark Green: #006400 (for light backgrounds)
## Selection Guidelines
- **Light/bright backgrounds** → Black (#000000), Dark Blue (#00008B), or Dark Green (#006400)
- **Dark/dim backgrounds** → White (#FFFFFF) or Yellow (#FFFF00)
- **Mobile priority**: Ensure MAXIMUM CONTRAST for small screen readability
- Consider that users often view in bright outdoor lighting or dim rooms
- Yellow (#FFFF00) works exceptionally well for high-energy viral content on dark backgrounds
# SEGMENT METADATA
For each segment, provide:
## segment_title
- A catchy, descriptive title for the segment (3-6 words)
- Should hint at what makes this segment special
- Examples: "Explosive Chorus Drop", "Emotional Bridge Moment", "Viral Hook Section"
## start_time & end_time
- Precise timestamps from the original full music video
- These will be used to extract the video segment
# OUTPUT FORMAT
Return ONLY a valid JSON array with this exact structure:
[
  {{
    "segment_title": "string (catchy 3-6 word title)",
    "start_time": float (seconds from original video start),
    "end_time": float (seconds from original video start),
    "duration": float (end_time - start_time, for verification),
    "why_catchy": "string (1-2 sentence explanation of what makes this segment viral-worthy)",
    "stanzas": [
      {{
        "stanza_start": float,
        "stanza_end": float,
        "image_prompt": "string (detailed, self-contained, vertical 9:16 image generation prompt)",
        "font_color": "string (hex code from approved list)",
        "lines": [
          {{
            "text": "string (lyric line)",
            "start": float,
            "end": float
          }}
        ]
      }}
    ]
  }}
]
# CRITICAL REMINDERS
1. **CHORUS-CENTERED STRUCTURE**: Every segment must be built around a chorus with proper build-up and wind-down
2. **NO ABRUPT STARTS**: Always include 2-4 lines BEFORE the chorus begins (pre-chorus/verse ending)
3. **NO ABRUPT ENDS**: Always include 2-4 lines AFTER the chorus ends (post-chorus beginning)
4. **PROFESSIONAL RAMP UP/DOWN**: The segment should feel like a complete emotional journey, not a choppy cut
5. **Quality over quantity**: 1 amazing segment > 3 mediocre ones
6. **Each image_prompt is independent**: No cross-references between prompts
7. **Vertical format**: All image prompts must specify 9:16 portrait orientation
8. **No text in images**: Never include lyrics, words, or letters in image generation prompts
9. **Complete coverage**: No time gaps within each segment
10. **Mobile-first**: Optimize everything for small screen viewing
11. **Honor style request**: All creative decisions must align with: {user_request}
12. Each text line must not exceed 20 characters [Highly Important]
Output the JSON array only, no additional text.
"""
stanzas = coll.generate_text(
    prompt=prompt,
    model_name="pro",
    response_type="json",
)
segments = stanzas.get("output")
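Model output does not always follow the rules in the prompt, so it can be worth checking the result before spending image-generation calls on it. The sketch below is one possible check, not part of the original notebook: `validate_segments` is a hypothetical helper that verifies the duration range, the approved font colors, and the 20-character line limit stated in the prompt. It also assumes the response may arrive either as a parsed list or as a JSON string.

```python
import json

# Colors copied from the "Available Colors" list in the prompt
ALLOWED_COLORS = {"#FFFFFF", "#000000", "#FFFF00", "#00008B", "#006400"}


def validate_segments(raw, min_len=15.0, max_len=60.0, max_chars=20):
    """Return (segments, problems) for a model response (hypothetical helper)."""
    segments = json.loads(raw) if isinstance(raw, str) else raw
    problems = []
    for i, seg in enumerate(segments):
        duration = seg["end_time"] - seg["start_time"]
        if not (min_len <= duration <= max_len):
            problems.append(
                f"segment {i}: duration {duration:.1f}s outside {min_len}-{max_len}s"
            )
        for j, stanza in enumerate(seg.get("stanzas", [])):
            if stanza.get("font_color", "").upper() not in ALLOWED_COLORS:
                problems.append(f"segment {i} stanza {j}: unexpected font_color")
            for line in stanza.get("lines", []):
                if len(line["text"]) > max_chars:
                    problems.append(
                        f"segment {i} stanza {j}: line too long: {line['text']!r}"
                    )
    return segments, problems


# Usage with the pipeline variable from this step:
# segments, problems = validate_segments(segments)
# for p in problems:
#     print("WARNING:", p)
```

Returning a list of problems instead of raising lets you decide whether to retry the prompt, trim the segment, or continue anyway.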
Step 4: Generate Vertical Background Images with AI
Create AI-generated background images for each lyric stanza:
from concurrent.futures import ThreadPoolExecutor, as_completed
# Helper function to optimize segment stanzas:
# short instrumental stanzas (under 2 seconds) are merged into the previous stanza
def optimize_segment_stanzas(stanzas):
    if not stanzas:
        return []
    optimized = []
    for i in range(len(stanzas)):
        current = stanzas[i]
        is_music = current.get("lines") and "[Music...]" in current["lines"][0].get("text", "")
        if is_music and i > 0:
            prev = optimized[-1]
            music_duration = current["stanza_end"] - current["stanza_start"]
            if music_duration < 2.0:
                # Extend the previous stanza (and its last line) over the short gap
                prev["stanza_end"] = current["stanza_end"]
                if prev.get("lines"):
                    prev["lines"][-1]["end"] = current["stanza_end"]
                continue
        optimized.append(current)
    return optimized
# Prepare image generation tasks
image_tasks = []
for seg_idx, segment in enumerate(segments):
    segment["optimized_stanzas"] = optimize_segment_stanzas(segment["stanzas"])
    for stan_idx, s in enumerate(segment["optimized_stanzas"]):
        image_tasks.append({
            "seg_idx": seg_idx,
            "stan_idx": stan_idx,
            "prompt": s["image_prompt"]
        })
def generate_worker(task):
    img = coll.generate_image(prompt=task["prompt"], aspect_ratio="9:16")
    return task["seg_idx"], task["stan_idx"], img.id
# Generate images in parallel
MAX_WORKERS = 10
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as executor:
    futures = [executor.submit(generate_worker, task) for task in image_tasks]
    for future in as_completed(futures):
        seg_i, stan_i, img_id = future.result()
        segments[seg_i]["optimized_stanzas"][stan_i]["image_id"] = img_id
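One thing to keep in mind with the loop above: `future.result()` re-raises any exception from the worker, so a single failed image stops the whole batch. The sketch below shows one way to collect failures separately so they can be retried. The names `run_tasks` and `fake_worker` are hypothetical, and `fake_worker` is only a stand-in for the real image-generation call so the example runs without an API key.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_tasks(tasks, worker, max_workers=10):
    """Run worker(task) in threads; collect results and failures separately."""
    results, failures = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to its task so failures can be reported or retried
        future_to_task = {executor.submit(worker, t): t for t in tasks}
        for future in as_completed(future_to_task):
            task = future_to_task[future]
            try:
                results.append(future.result())
            except Exception as exc:
                failures.append((task, exc))
    return results, failures


def fake_worker(task):
    """Stand-in for generate_worker: fails on an empty prompt."""
    if not task["prompt"]:
        raise ValueError("empty prompt")
    return task["seg_idx"], task["stan_idx"], f"img-{task['seg_idx']}-{task['stan_idx']}"


demo_tasks = [
    {"seg_idx": 0, "stan_idx": 0, "prompt": "calm beach at sunset, portrait 9:16"},
    {"seg_idx": 0, "stan_idx": 1, "prompt": ""},
    {"seg_idx": 1, "stan_idx": 0, "prompt": "gentle ocean waves, portrait 9:16"},
]
done, failed = run_tasks(demo_tasks, fake_worker, max_workers=3)
print(sorted(done))
print([task["stan_idx"] for task, _ in failed])
```

In the real pipeline you would pass `generate_worker` and `image_tasks` instead, then re-submit whatever ends up in the failure list.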
Step 5: Build Multi-Layer Timeline
Create a timeline with audio, backgrounds, and animated lyrics:
import math
from videodb.editor import (
    VideoAsset, ImageAsset, TextAsset, Font,
    Clip, Track, Timeline, Transition,
    Position, Fit, Offset, Alignment, HorizontalAlignment, VerticalAlignment, Border
)
def calculate_y_offsets(num_lines, line_height=45, timeline_height=1080):
    # Center the block of lines vertically; each offset is a pixel distance
    # from the center divided by half the timeline height
    offsets = []
    start_pixel_offset = -((num_lines - 1) * line_height) / 2
    half_height = timeline_height / 2
    for i in range(num_lines):
        pixel_y = start_pixel_offset + (i * line_height)
        offsets.append(pixel_y / half_height)
    return offsets
# Timeline configuration
TIMELINE_WIDTH = 608
TIMELINE_HEIGHT = 1080
PRE_ROLL = 5.0
POST_ROLL = 5.0
CTA_DURATION = 5.0
# Render each segment
for idx, segment in enumerate(segments):
    # Calculate timing
    actual_seg_start = segment["start_time"]
    actual_seg_end = segment["end_time"]
    timeline_start_og = max(0, actual_seg_start - PRE_ROLL)
    intro_duration = actual_seg_start - timeline_start_og
    total_audio_duration = (actual_seg_end - timeline_start_og) + CTA_DURATION + POST_ROLL
    stanza_items = segment["optimized_stanzas"]
    # Initialize timeline
    timeline = Timeline(conn)
    timeline.resolution = f"{TIMELINE_WIDTH}x{TIMELINE_HEIGHT}"
    timeline.background = "#000000"
    # Layer 1: Audio Track (opacity=0 for audio only)
    audio_track = Track(z_index=0)
    audio_track.add_clip(
        start=0.0,
        clip=Clip(
            asset=VideoAsset(id=video.id, start=timeline_start_og, volume=1.0),
            duration=total_audio_duration,
            fit=Fit.crop,
            position=Position.center,
            opacity=0.0,
            transition=Transition(in_="fade", out="fade", duration=5.0)
        ),
    )
    timeline.add_track(audio_track)
    # Layer 2: Background Images Track
    images_track = Track(z_index=1)
    for i, s in enumerate(stanza_items):
        local_start = s["stanza_start"] - timeline_start_og
        local_end = s["stanza_end"] - timeline_start_og
        if i == 0:
            # The first image also covers the pre-roll and fades in over it
            local_start = 0.0
            trans_in = Transition(in_="fade", duration=intro_duration)
        else:
            trans_in = Transition(in_="fade", out="fade", duration=0.35)
        duration = max(0.1, local_end - local_start)
        images_track.add_clip(
            start=local_start,
            clip=Clip(
                asset=ImageAsset(id=s["image_id"]),
                duration=duration,
                fit=Fit.crop,
                position=Position.center,
                transition=trans_in
            ),
        )
    timeline.add_track(images_track)
    # Layer 3: Lyrics Track
    lyrics_track = Track(z_index=2)
    LINE_HEIGHT = 45
    lyrics_border = Border(color="#000000", width=1.5)
    for s in stanza_items:
        lines = s.get("lines", [])
        y_offsets = calculate_y_offsets(len(lines), LINE_HEIGHT, TIMELINE_HEIGHT)
        local_stanza_end = s["stanza_end"] - timeline_start_og
        for l_idx, line in enumerate(lines):
            # Each line appears at its own start time and stays until the stanza ends
            local_line_start = line["start"] - timeline_start_og
            line_duration = max(0.1, local_stanza_end - local_line_start)
            text_asset = TextAsset(
                text=line["text"],
                font=Font(family="Bebas Neue", size=48, color="#FFFFFF", weight=700),
                border=lyrics_border,
                alignment=Alignment(horizontal=HorizontalAlignment.center, vertical=VerticalAlignment.center)
            )
            lyrics_track.add_clip(
                start=local_line_start,
                clip=Clip(
                    asset=text_asset,
                    duration=line_duration,
                    position=Position.center,
                    offset=Offset(x=0, y=y_offsets[l_idx]),
                    transition=Transition(in_="fade", duration=0.5)
                )
            )
    # Layer 4: CTA (Call to Action), shown right after the segment ends
    cta_start_time = (actual_seg_end - timeline_start_og)
    cta_lines = ["Visit the channel", "for the full video"]
    cta_y_offsets = calculate_y_offsets(len(cta_lines), line_height=60, timeline_height=TIMELINE_HEIGHT)
    for c_idx, cta_text in enumerate(cta_lines):
        cta_asset = TextAsset(
            text=cta_text,
            font=Font(family="Bebas Neue", size=42, color="#FFFFFF", weight=700),
            border=lyrics_border,
            alignment=Alignment(horizontal=HorizontalAlignment.center, vertical=VerticalAlignment.center)
        )
        lyrics_track.add_clip(
            start=cta_start_time,
            clip=Clip(
                asset=cta_asset,
                duration=CTA_DURATION,
                position=Position.center,
                offset=Offset(x=0, y=cta_y_offsets[c_idx]),
                transition=Transition(in_="fade", duration=0.5)
            )
        )
    timeline.add_track(lyrics_track)
    # Generate the final stream for this segment
    # (stream_url is overwritten on each iteration; store or print it if you need every URL)
    stream_url = timeline.generate_stream()
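The timing and text-placement arithmetic in the render loop is easier to follow with concrete numbers. The sketch below restates both formulas as standalone functions so they can be run without a VideoDB connection. The function names `segment_timing` and `y_offsets` and the sample timestamps are made up for illustration; the formulas themselves are copied from the code above.

```python
def segment_timing(seg_start, seg_end, pre_roll=5.0, cta=5.0, post_roll=5.0):
    """Mirror the timing arithmetic from the render loop (all values in seconds)."""
    timeline_start = max(0.0, seg_start - pre_roll)  # clamp at the start of the video
    intro = seg_start - timeline_start               # pre-roll actually available
    cta_start = seg_end - timeline_start             # CTA begins when the segment ends
    total = cta_start + cta + post_roll              # full length of the rendered clip
    return {
        "timeline_start": timeline_start,
        "intro": intro,
        "cta_start": cta_start,
        "total": total,
    }


def y_offsets(num_lines, line_height=45, timeline_height=1080):
    """Same formula as calculate_y_offsets: pixel offsets divided by half the height."""
    first = -((num_lines - 1) * line_height) / 2
    return [(first + i * line_height) / (timeline_height / 2) for i in range(num_lines)]


# A segment in the middle of the song: the full 5 s pre-roll is available
print(segment_timing(62.0, 95.0))
# A segment near the start: only 2 s of pre-roll exists, so the intro is shorter
print(segment_timing(2.0, 30.0))
# Three lyric lines: one above center, one at center, one below
print(y_offsets(3))
```

With the default settings, a 33-second segment (62 s to 95 s) renders as a 48-second clip, because pre-roll, CTA, and post-roll add up to 15 seconds. If the final clip has to stay within 60 seconds, keep the selected segment at 45 seconds or less, or shorten those constants.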
What You Get
A professional-looking short-form video ready for social media:
- Perfectly timed to the music
- AI-generated matching visuals
- Synced lyrics on screen
- Professional pre-roll and CTA
- Vertical format (608x1080)
- Ready to upload to TikTok, Reels, or Shorts
The Result
With this system, you can:
- Process music videos in minutes instead of hours
- Generate multiple vertical clips from one source video
- Maintain consistency across all clips with the same visual style
- Stay on top of social media trends with fast production
Explore the Full Notebook
Open the complete implementation with parallel processing, custom effects, and advanced styling options.