VideoDB Documentation

Deep Dive into Prompt Engineering: Mastering Video Scene Indexing

Introduction:

As developers working on video processing, we often face challenges in accurately indexing and describing complex scenes. This blog post explores how strategic prompt engineering can significantly enhance our ability to extract detailed information from video frames, opening up new possibilities for advanced video search and analysis.

Goal of the Experiment:

Our primary objective was to demonstrate how refined prompts can significantly improve search results and information extraction from video content. We aimed to create a system capable of accurately identifying objects, actions, and even emotions in various video scenes. For this particular experiment, we used video footage from a dog show, featuring various breeds walking down a runway with their handlers, surrounded by spectators and photographers. Our goal was to create prompts that could answer detailed queries like "Show me the happiest moments featuring a Golden Retriever" with high precision.

Frame-Level vs Scene-Level Prompting:

In our experiment, we explored both frame-level and scene-level prompting:

Frame-level prompts focus on extracting information from individual frames.

Scene-level prompts analyze a series of frames to describe the overall action.

Important Considerations:

Computational Cost: Frame-level descriptions, while providing granular detail, are computationally heavy and potentially costly. It's not always necessary or efficient to use them for every use case.

Strategic Approach: A recommended strategy is to use frame prompts as a tuning mechanism. By testing and refining frame-level prompts, we can identify the most effective way to extract information from the vision model. Once optimized, we can incorporate these insights into scene-level prompts, potentially achieving high accuracy without the computational overhead of frame-by-frame analysis.
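To make the cost trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes that frame-level prompting issues one vision-model request per sampled frame and that scene-level prompting issues one request per group of consecutive frames. The function name `vision_calls` and the sampling numbers are illustrative assumptions, not values from our experiment.

```python
import math


def vision_calls(duration_s, sample_every_s, frames_per_scene):
    """Estimate request counts for frame-level and scene-level prompting.

    Assumption: one request per sampled frame at frame level, and one request
    per group of `frames_per_scene` consecutive frames at scene level.
    """
    if duration_s <= 0 or sample_every_s <= 0 or frames_per_scene < 1:
        raise ValueError("duration and interval must be positive, frames_per_scene >= 1")
    frames = math.ceil(duration_s / sample_every_s)
    scenes = math.ceil(frames / frames_per_scene)
    return {
        "frames_sampled": frames,
        "frame_level_calls": frames,
        "scene_level_calls": scenes,
    }


# Illustrative numbers only: a 2-minute clip, one frame every 2 s, 3 frames per scene.
estimate = vision_calls(duration_s=120, sample_every_s=2, frames_per_scene=3)
print(estimate)
# {'frames_sampled': 60, 'frame_level_calls': 60, 'scene_level_calls': 20}
```

Request count is only a proxy. A scene-level request carries several images, so the saving in tokens or latency may be smaller than the saving in requests.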

Let's walk through our prompt iterations and their outputs:

Frame-level Prompts:

✍️ Frame Prompt 1: Basic animal identification

frame_prompt = """

You will be provided with an image. Your task is to identify and describe the animals in the image.

1. Identify Animals: List distinct animals in the image.

2. Describe animals: Provide a brief description of each animal, including breed, color, and any other notable features.

"""

Output:

[

{

"name": "dog",

"context": "a grey dog is being led down the green carpet by a person wearing a blue jacket."

}

]

This output lacked specificity in breed identification and environmental context. Our next prompt aims to address these issues.
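While iterating, a gap like this can be flagged automatically rather than by eye. The sketch below parses the output above and lists any `name` that gives a species but no breed. The `GENERIC_LABELS` set and the `generic_names` helper are hypothetical, introduced purely for illustration.

```python
import json

# Hypothetical list of labels considered too generic for breed-level search.
GENERIC_LABELS = {"animal", "dog", "cat", "bird"}


def generic_names(raw_output):
    """Return the `name` values that identify a species but no breed."""
    records = json.loads(raw_output)
    return [r["name"] for r in records if r["name"].strip().lower() in GENERIC_LABELS]


# The output of Frame Prompt 1, as shown above.
frame_output = """
[
  {
    "name": "dog",
    "context": "a grey dog is being led down the green carpet by a person wearing a blue jacket."
  }
]
"""
flagged = generic_names(frame_output)
print(flagged)  # ['dog']
```

An empty list would mean every detected animal carries more than a bare species label, which is the behaviour we want from the next prompt.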

✍️ Frame Prompt 2: Enhanced breed identification and spatial information

frame_prompt = """

You will be provided with an image. Your task is to identify the animals and their breeds in the image.

1. Identify Animals: List distinct animals and their breed in the image.

2. Describe the environment: Provide a brief description of the interaction between the animals and the objects or the environment around them.

"""

Output:

[

{

"name": "Dog - Poodle",

"context": "A Poodle being led down a carpeted path by a handler in the green dress, participating in what appears to be a dog show."

},

{

"name": "Dog - Russell Terrier",

"context": "A Russell Terrier being guided by a handler in black attire on the same path, also taking part in the dog show."

}

]

This output significantly improved breed identification and provided more environmental context. With this satisfactory frame-level output, we're now ready to incorporate these learnings into scene-level prompts.

However, let's first examine what a generic scene-level prompt can achieve without the added context from our frame-level experiments.

Scene-level Prompts:

✏️ Scene Prompt 1: Basic scene-level prompt

scene_prompt = """

You will be provided with a series of images. Your task is to view all images together and describe the overall story or scene in the best possible way.

Expected Output:

- A detailed story or scene description.

- A list of objects and actions in each image.

Example Output:

{

"scene_story": "A person is cooking in the kitchen and then someone rings the doorbell.",

"images": [

{"description": "Someone is cooking in the kitchen."},

{"description": "Someone rings the doorbell."}

]

}

"""

Output:

{

"scene_story": "A dog show competition where handlers are presenting their dogs on a green runway surrounded by an audience and judges.",

"images": [

{"description": "Handlers walking dogs down a green carpeted runway."},

{"description": "Spectators and photographers observing the dog show."},

{"description": "Judges evaluating the dogs as they pass by."}

]

}

Note: This generic scene-level prompt provided a basic structure but lacked the detailed breed identification and specific actions we achieved with our frame-level prompts. Our next iteration aims to incorporate these learnings.

✏️ Scene Prompt 2: Combining frame-level specifications in scene-level prompt

scene_prompt = """

You will be provided with a series of images. Your task is to view all images together and describe the overall story or scene in the best possible way.

For each image, your task is to identify the animals and their breeds in the image.

1. Identify the animals present in the frame with specifications about their colour and breed, and any other notable features.

2. Describe the environment: Provide a brief description of the interaction between the animals and the objects or the environment around them.

"""

Output:

{

"scene_story": "A dog show is ongoing where handlers are showcasing their dogs on a designated runway, surrounded by spectators and event staff.",

"images": [

{

"description": "Three handlers are walking their dogs on a green carpeted runway. The runway is fenced and lined with purple banners. There are numerous spectators and photographers capturing the event. One dog is a white and brown breed, possibly a Jack Russell Terrier, and the other is a medium-sized, grey wavy-haired breed, potentially a Poodle."

},

{

"description": "The scene is similar with the dogs and handlers moving further along the runway. The same dogs, a Jack Russell Terrier and a grey Poodle, are being showcased. The audience continues to watch and photograph the event."

},

{

"description": "The handlers and dogs are closer to the end of the runway. The dogs remain the primary focus, with one being a Jack Russell Terrier and the other a grey Poodle. The spectators and photographers are actively engaged in observing the dogs."

}

]

}

Note: This output successfully incorporated breed identification and more detailed actions, but still lacked some specificity in describing multiple animals in a single frame. Our next iteration addresses this.

✏️ Scene Prompt 3: Ensuring correct breed identification in context of environment

scene_prompt = """

You will be provided with a series of images. Your task is to describe the scene based on these sequential images. Stay focused on the animals in the frame.

For each image, your task is to:

1. Identify the animals present in the frame define their colour, breed, and any other notable features. If there are multiple animals present, define the same characteristics for all those in focus.

2. Describe the environment in context of the animals: Provide a brief description of the interaction between the animals and the objects or the environment around them. Be specific about the actions of every animal if there are multiple animals detected in a single frame.

"""

Output:

{

"scene_story": "A lively dog show takes center stage as two enthusiastic canines race down a carpeted aisle, cheered on by an eager crowd and accompanied by their handlers.",

"images": [

{

"description": "A white and brown Jack Russell Terrier runs energetically on a green carpet, escorted by a woman in a blue jacket. Behind them, a grey poodle closely follows while being guided by another handler in maroon clothing. The carpeted path is bordered by purple barriers adorned with yellow and blue flowers. Spectators and photographers on the left capture the moment."

},

{

"description": "The Jack Russell Terrier continues to lead, bounding forward with its handler. The grey poodle and its handler have gained some ground but maintain their position behind the terrier. The handlers maintain a brisk pace alongside the dogs as the crowd continues to observe and document the event."

},

{

"description": "In the final image, the Jack Russell Terrier keeps its lead, with the grey poodle still following closely. The handlers are encouraging their dogs with focused enthusiasm. The spectators remain engaged, some with cameras ready to capture the exciting finish."

}

]

}

Note: This prompt successfully captured both the specific breeds and the overall scene dynamics, providing a detailed and accurate description. However, the format could be more structured for easier parsing and use in applications. Our final iteration addresses this.

✏️ Scene Prompt 4: Structured JSON output with emotional states

scene_prompt = """

You will be provided with a series of images from a dog show. Your task is to describe the scene based on these sequential images. Focus on identifying the breeds and describing the key actions.

For each image, your task is to:

1. Identify the animals present in the frame, including their breed, color, and any notable features.

2. Describe the actions of the animals and any interactions with the environment or other animals.

3. Highlight any emotional expressions or notable moments.

Output should be a structured JSON with the following format:

{

"scene_story": "Brief overview of the scene",

"images": [

{

"frame_time": "Time of the frame in seconds",

"breeds": [{"breed": "Golden Retriever", "color": "golden"}],

"actions": "Description of the actions and interactions",

"emotion": "Observed emotion or notable moment"

},

...

]

}

"""

Output:

{

"scene_story": "A dog show competition where handlers are walking their dogs down the green carpeted runway while photographers and spectators watch and take photos.",

"images": [

{

"frame_time": "0 seconds",

"breeds": [

{"breed": "Wire Fox Terrier", "color": "white and tan"},

{"breed": "Standard Poodle", "color": "gray"}

],

"actions": "Handlers are walking their dogs down a green carpeted runway. The Wire Fox Terrier on the right is being walked by a handler wearing a dark outfit. A Standard Poodle in the middle is being walked by a handler in a teal blue blazer. Photographers are taking pictures and spectators are observing.",

"emotion": "The Wire Fox Terrier looks attentive and focused."

},

{

"frame_time": "2 seconds",

"breeds": [

{"breed": "Wire Fox Terrier", "color": "white and tan"},

{"breed": "Standard Poodle", "color": "gray"}

],

"actions": "The Wire Fox Terrier is still being walked by the handler in a dark outfit. The Standard Poodle is also being walked down the runway, slightly ahead of the Wire Fox Terrier. Both handlers and dogs are making their way to the front end of the runway while onlookers take pictures and watch.",

"emotion": "The dogs are showing a calm and focused demeanor."

}

]

}

Note: This final iteration provides a comprehensive, structured output that's easy to work with programmatically and captures all the key information we set out to extract, including breed identification, actions, and even emotional states.
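As a hedged illustration of working with this output programmatically, the sketch below loads it into typed records. `FrameRecord` and `parse_scene` are names made up for this example, and the sample JSON is an abridged copy of the output above. Because the model returned `frame_time` as text such as "2 seconds", the parser assumes the first number in that string is the time in seconds.

```python
import json
import re
from dataclasses import dataclass
from typing import List


@dataclass
class FrameRecord:
    time_s: float
    breeds: List[str]
    actions: str
    emotion: str


def parse_scene(raw):
    """Turn the structured scene JSON into typed records, failing loudly on gaps."""
    data = json.loads(raw)
    records = []
    for image in data["images"]:
        # "frame_time" arrives as text such as "2 seconds"; keep the number only.
        match = re.search(r"\d+(?:\.\d+)?", str(image["frame_time"]))
        if match is None:
            raise ValueError(f"no numeric time in {image['frame_time']!r}")
        records.append(
            FrameRecord(
                time_s=float(match.group()),
                breeds=[b["breed"] for b in image["breeds"]],
                actions=image["actions"],
                emotion=image["emotion"],
            )
        )
    return records


# Abridged from the output above; the story and "actions" text are shortened for space.
raw = """
{
  "scene_story": "A dog show competition.",
  "images": [
    {"frame_time": "0 seconds",
     "breeds": [{"breed": "Wire Fox Terrier", "color": "white and tan"},
                {"breed": "Standard Poodle", "color": "gray"}],
     "actions": "Handlers are walking their dogs down a green carpeted runway.",
     "emotion": "The Wire Fox Terrier looks attentive and focused."},
    {"frame_time": "2 seconds",
     "breeds": [{"breed": "Wire Fox Terrier", "color": "white and tan"},
                {"breed": "Standard Poodle", "color": "gray"}],
     "actions": "The Standard Poodle is slightly ahead of the Wire Fox Terrier.",
     "emotion": "The dogs are showing a calm and focused demeanor."}
  ]
}
"""
records = parse_scene(raw)
print([(r.time_s, r.breeds) for r in records])
```

A missing key raises `KeyError` and an unparseable time raises `ValueError`, so a malformed model response is caught at ingestion instead of surfacing later as a bad search result.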

With a refined prompt in action, we can get some pretty interesting results! Here’s the result for our initial query: "Show me the happiest moments featuring a Golden Retriever".
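How a query like this maps onto the structured fields can be sketched with a plain keyword filter. This is only a stand-in under our own assumptions: a real search would rely on a semantic index rather than substring matching, and `find_moments` is a hypothetical helper. The sample frames reuse fields from the structured output above.

```python
def find_moments(frames, breed, emotion_terms):
    """Return frame times where `breed` appears and the emotion text mentions any term.

    A plain keyword match standing in for real semantic search.
    """
    terms = [t.lower() for t in emotion_terms]
    hits = []
    for frame in frames:
        has_breed = any(b["breed"].lower() == breed.lower() for b in frame["breeds"])
        emotion = frame["emotion"].lower()
        if has_breed and any(t in emotion for t in terms):
            hits.append(frame["frame_time"])
    return hits


# Fields copied from the structured output above ("actions" omitted for brevity).
frames = [
    {"frame_time": "0 seconds",
     "breeds": [{"breed": "Wire Fox Terrier", "color": "white and tan"},
                {"breed": "Standard Poodle", "color": "gray"}],
     "emotion": "The Wire Fox Terrier looks attentive and focused."},
    {"frame_time": "2 seconds",
     "breeds": [{"breed": "Wire Fox Terrier", "color": "white and tan"},
                {"breed": "Standard Poodle", "color": "gray"}],
     "emotion": "The dogs are showing a calm and focused demeanor."},
]

print(find_moments(frames, "Standard Poodle", ["calm"]))     # ['2 seconds']
print(find_moments(frames, "Golden Retriever", ["happy"]))   # []
```

One limitation shows up immediately: the `emotion` field is per frame, not per dog, so a frame whose emotion text describes the terrier would also match a query about the poodle. Asking the model for an emotion inside each `breeds` entry would be one way to tighten this.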

Conclusion:

Through this experiment, we've demonstrated how iterative prompt engineering can significantly improve the accuracy and detail of video scene indexing. We progressed from basic animal identification to detailed breed recognition, action description, and even emotional state detection. This approach can be adapted to various video processing tasks beyond dog shows, opening up new possibilities in video indexing and search applications.

Key Takeaways:

Frame-level prompts are excellent for detailed, specific information about individual moments.

Scene-level prompts provide a cohesive narrative and capture actions spanning multiple frames.

Structured outputs (like JSON) make the extracted information more readily usable in downstream applications.

Iterative refinement of prompts is crucial to achieving the desired level of detail and accuracy.

Consider the trade-off between accuracy and computational cost when deciding between frame-level and scene-level analysis.

Use frame-level prompts as a tuning mechanism to inform and improve scene-level prompts.

Look for ways to better integrate frame-level insights into scene-level descriptions for more comprehensive and efficient video indexing.

Future Directions:

Moving forward, we should explore ways to more effectively bridge the gap between frame-level and scene-level analyses. This could involve developing algorithms that can aggregate frame-level insights to inform scene-level descriptions, or creating more sophisticated prompts that can extract frame-level details while operating at a scene level.
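One possible shape for such an aggregation step, sketched under our own assumptions: count how often each breed is reported across frame-level results, keep the ones that appear in at least half the frames, and turn them into a hint that could be prepended to a scene-level prompt. The record layout, the `min_share` threshold, and the hint wording are all hypothetical and have not been tested against a vision model.

```python
from collections import Counter


def recurring_breeds(frame_records, min_share=0.5):
    """Return breeds reported in at least `min_share` of the frames, sorted by name."""
    counts = Counter()
    for record in frame_records:
        counts.update(set(record["breeds"]))  # count each breed once per frame
    threshold = min_share * len(frame_records)
    return sorted(b for b, n in counts.items() if n >= threshold)


def scene_hint(breeds):
    """Build a sentence that could be prepended to a scene-level prompt."""
    if not breeds:
        return ""
    return (
        "Frame-level analysis suggests these breeds are present: "
        + ", ".join(breeds)
        + ". Confirm or correct them when describing the scene."
    )


# Made-up frame-level results in which the terrier label is inconsistent.
frame_records = [
    {"breeds": ["Poodle", "Russell Terrier"]},
    {"breeds": ["Poodle"]},
    {"breeds": ["Poodle", "Jack Russell Terrier"]},
]
breeds = recurring_breeds(frame_records)
hint = scene_hint(breeds)
print(hint)
```

With this sample only "Poodle" clears the threshold, because the two terrier labels split the vote. That is a reminder that any aggregation of this kind would need some label normalization before it could be trusted.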

By keeping these considerations in mind, we can continue to refine our approach to video indexing, balancing the need for detailed information with computational efficiency. This balance will be crucial as we scale our solutions to handle larger volumes of video data across diverse use cases.

Remember, the key to effective prompt engineering lies in clearly defining your information needs and iteratively refining your approach based on the outputs.

Happy coding!

⭐️ Bonus: Challenges in Prompt Refinement

During our experiment, we encountered challenges in refining our prompts. Two intermediate prompts (Scene Prompts 3 and 4 in our original numbering, not the ones shown above) didn't yield the improvements we expected. These challenges highlight the iterative nature of prompt engineering:

1. Over-specification: We found that adding too many specific instructions sometimes led to inconsistent results, with the model focusing on certain aspects while neglecting others.

2. Balancing detail and generalization: Striking the right balance between detailed instructions and allowing the model enough flexibility to generalize was a key challenge.

3. Prompt length: Very long, detailed prompts sometimes resulted in the model losing focus on the primary task.
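A lightweight guard against the first and third issues is to assemble prompts from a short list of instructions and check their size before sending them. The sketch below is one way to do that. `build_prompt` is a hypothetical helper, and the 120-word budget is an arbitrary placeholder rather than a limit we measured.

```python
def build_prompt(task, instructions, max_words=120):
    """Assemble a numbered prompt and reject it if it exceeds a word budget.

    The 120-word default is an arbitrary placeholder, not a measured limit.
    """
    lines = [task.strip(), ""]
    lines += [f"{i}. {text.strip()}" for i, text in enumerate(instructions, start=1)]
    prompt = "\n".join(lines)
    word_count = len(prompt.split())
    if word_count > max_words:
        raise ValueError(f"prompt has {word_count} words, budget is {max_words}")
    return prompt


# Instructions taken from Frame Prompt 2 above.
prompt = build_prompt(
    "You will be provided with an image. Your task is to identify the animals and their breeds in the image.",
    [
        "Identify Animals: List distinct animals and their breed in the image.",
        "Describe the environment: Provide a brief description of the interaction "
        "between the animals and the objects or the environment around them.",
    ],
)
print(prompt)
```

Keeping instructions in a list also makes it easy to add or drop one instruction at a time, so each iteration changes a single variable.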
