Introduction:
As developers working on video processing, we often face challenges in accurately indexing and describing complex scenes. This blog post explores how strategic prompt engineering can significantly enhance our ability to extract detailed information from video frames, opening up new possibilities for advanced video search and analysis.
Goal of the Experiment:
Our primary objective was to demonstrate how refined prompts can significantly improve search results and information extraction from video content. We aimed to create a system capable of accurately identifying objects, actions, and even emotions in various video scenes. For this particular experiment, we used video footage from a dog show, featuring various breeds walking down a runway with their handlers, surrounded by spectators and photographers. Our goal was to create prompts that could answer detailed queries like "Show me the happiest moments featuring a Golden Retriever" with high precision.
Frame-Level vs Scene-Level Prompting:
In our experiment, we explored both frame-level and scene-level prompting:
Frame-level prompts focus on extracting information from individual frames.
Scene-level prompts analyze a series of frames to describe the overall action.
Important Considerations:
Computational Cost: Frame-level descriptions, while providing granular detail, are computationally heavy and potentially costly. It's not always necessary or efficient to use them for every use case.
Strategic Approach: A recommended strategy is to use frame prompts as a tuning mechanism. By testing and refining frame-level prompts, we can identify the most effective way to extract information from the vision model. Once optimized, we can incorporate these insights into scene-level prompts, potentially achieving high accuracy without the computational overhead of frame-by-frame analysis.
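To make the cost trade-off concrete, here is a rough sketch that counts model calls under each approach. It assumes one call per frame for frame-level prompting and one call per group of frames for scene-level prompting. The helper names and the sample numbers (60 sampled frames, 3 frames per scene) are illustrative assumptions, not figures from our experiment.

```python
import math


def chunk_frames(frames, frames_per_scene):
    """Group an ordered list of frames into consecutive scenes."""
    if frames_per_scene < 1:
        raise ValueError("frames_per_scene must be at least 1")
    return [
        frames[i:i + frames_per_scene]
        for i in range(0, len(frames), frames_per_scene)
    ]


def count_model_calls(num_frames, frames_per_scene):
    """Return (frame_level_calls, scene_level_calls) for a clip.

    Assumption: frame-level prompting makes one call per frame and
    scene-level prompting makes one call per group of frames.
    """
    frame_calls = num_frames
    scene_calls = math.ceil(num_frames / frames_per_scene)
    return frame_calls, scene_calls


# Illustrative numbers only: 60 sampled frames, 3 frames per scene.
frame_ids = list(range(60))
scenes = chunk_frames(frame_ids, frames_per_scene=3)
frame_calls, scene_calls = count_model_calls(len(frame_ids), frames_per_scene=3)
print(f"frame-level calls: {frame_calls}, scene-level calls: {scene_calls}")
```

Call count is only a rough proxy for cost, since each scene-level call carries several images, but it shows why running frame-level prompts over every frame adds up quickly.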
Let's walk through our prompt iterations and their outputs:
Frame-level Prompts:
✍️ Frame Prompt 1: Basic animal identification
frame_prompt = """
You will be provided with an image. Your task is to identify and describe the animals in the image.
1. Identify Animals: List distinct animals in the image.
2. Describe animals: Provide a brief description of each animal, including breed, color, and any other notable features.
"""
Output:
[
{
"name": "dog",
"context": "a grey dog is being led down the green carpet by a person wearing a blue jacket."
}
]
This output lacked specificity in breed identification and environmental context. Our next prompt aims to address these issues.
✍️ Frame Prompt 2: Enhanced breed identification and spatial information
frame_prompt = """
You will be provided with an image. Your task is to identify the animals and their breeds in the image.
1. Identify Animals: List distinct animals and their breed in the image.
2. Describe the environment: Provide a brief description of the interaction between the animals and the objects or the environment around them.
"""
Output:
[
{
"name": "Dog - Poodle",
"context": "A Poodle being led down a carpeted path by a handler in the green dress, participating in what appears to be a dog show."
},
{
"name": "Dog - Russell Terrier",
"context": "A Russell Terrier being guided by a handler in black attire on the same path, also taking part in the dog show."
}
]
This output significantly improved breed identification and provided more environmental context. With this satisfactory frame-level output, we're now ready to incorporate these learnings into scene-level prompts.
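As an aside, output in this shape is already usable for simple indexing. The sketch below parses the Frame Prompt 2 output shown above and builds a breed-to-context lookup. It assumes the model keeps returning names in the "Species - Breed" pattern, which is simply what we observed here and is not guaranteed. `split_name` and `index_by_breed` are hypothetical helper names.

```python
import json
from collections import defaultdict

# Frame Prompt 2 output, copied from above.
raw_output = """
[
  {"name": "Dog - Poodle",
   "context": "A Poodle being led down a carpeted path by a handler in the green dress, participating in what appears to be a dog show."},
  {"name": "Dog - Russell Terrier",
   "context": "A Russell Terrier being guided by a handler in black attire on the same path, also taking part in the dog show."}
]
"""


def split_name(name):
    """Split 'Dog - Poodle' into ('dog', 'poodle'); breed is None if absent."""
    species, sep, breed = name.partition(" - ")
    return species.strip().lower(), (breed.strip().lower() if sep else None)


def index_by_breed(records):
    """Map each breed to the context strings of the records naming it."""
    index = defaultdict(list)
    for record in records:
        _, breed = split_name(record["name"])
        if breed is not None:
            index[breed].append(record["context"])
    return dict(index)


breed_index = index_by_breed(json.loads(raw_output))
print(sorted(breed_index))
```

A record such as the Frame Prompt 1 output, whose name is just "dog", yields no breed and is skipped, which is one quick way to spot prompts that under-specify.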
However, let's first examine what a generic scene-level prompt can achieve without the added context from our frame-level experiments.
Scene-level Prompts:
✏️ Scene Prompt 1: Basic scene-level prompt
scene_prompt = """
You will be provided with a series of images. Your task is to view all images together and describe the overall story or scene in the best possible way.
Expected Output:
- A detailed story or scene description.
- A list of objects and actions in each image.
Example Output:
{
"scene_story": "A person is cooking in the kitchen and then someone rings the doorbell.",
"images": [
{"description": "Someone is cooking in the kitchen."},
{"description": "Someone rings the doorbell."}
]
}
"""
Output:
{
"scene_story": "A dog show competition where handlers are presenting their dogs on a green runway surrounded by an audience and judges.",
"images": [
{"description": "Handlers walking dogs down a green carpeted runway."},
{"description": "Spectators and photographers observing the dog show."},
{"description": "Judges evaluating the dogs as they pass by."}
]
}
Note: This generic scene-level prompt provided a basic structure but lacked the detailed breed identification and specific actions we achieved with our frame-level prompts. Our next iteration aims to incorporate these learnings.
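One way to carry tuned frame-level instructions into a scene-level prompt is to keep them as separate strings and assemble the prompt from parts. This is only a sketch of that idea: the prompts in this post were written by hand, and `build_scene_prompt` is a hypothetical helper. The instruction text is taken from the prompts shown earlier.

```python
# Preamble copied from the scene-level prompt above.
SCENE_PREAMBLE = (
    "You will be provided with a series of images. Your task is to view all "
    "images together and describe the overall story or scene in the best "
    "possible way."
)

# Per-image instructions carried over from Frame Prompt 2.
FRAME_INSTRUCTIONS = [
    "Identify Animals: List distinct animals and their breed in the image.",
    "Describe the environment: Provide a brief description of the interaction "
    "between the animals and the objects or the environment around them.",
]


def build_scene_prompt(preamble, per_image_instructions):
    """Join a scene preamble with numbered per-image instructions."""
    numbered = [
        f"{number}. {text}"
        for number, text in enumerate(per_image_instructions, start=1)
    ]
    return "\n".join([preamble, "For each image, your task is to:", *numbered])


scene_prompt = build_scene_prompt(SCENE_PREAMBLE, FRAME_INSTRUCTIONS)
print(scene_prompt)
```

Keeping the per-image instructions in one place means a wording change that worked at the frame level can be reused in the scene prompt without retyping it.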
✏️ Scene Prompt 2: Combining frame-level specifications in scene-level prompt
scene_prompt = """
You will be provided with a series of images. Your task is to view all images together and describe the overall story or scene in the best possible way.
For each image, your task is to identify the animals and their breeds in the image.
1. Identify the animals present in the frame with specifications about their colour and breed, and any other notable features.
2. Describe the environment: Provide a brief description of the interaction between the animals and the objects or the environment around them.
"""
Output:
{
"scene_story": "A dog show is ongoing where handlers are showcasing their dogs on a designated runway, surrounded by spectators and event staff.",
"images": [
{
"description": "Three handlers are walking their dogs on a green carpeted runway. The runway is fenced and lined with purple banners. There are numerous spectators and photographers capturing the event. One dog is a white and brown breed, possibly a Jack Russell Terrier, and the other is a medium-sized, grey wavy-haired breed, potentially a Poodle."
},
{
"description": "The scene is similar with the dogs and handlers moving further along the runway. The same dogs, a Jack Russell Terrier and a grey Poodle, are being showcased. The audience continues to watch and photograph the event."
},
{
"description": "The handlers and dogs are closer to the end of the runway. The dogs remain the primary focus, with one being a Jack Russell Terrier and the other a grey Poodle. The spectators and photographers are actively engaged in observing the dogs."
}
]
}
Note: This output successfully incorporated breed identification and more detailed actions, but still lacked some specificity in describing multiple animals in a single frame. Our next iteration addresses this.
✏️ Scene Prompt 3: Ensuring correct breed identification in context of environment
scene_prompt = """
You will be provided with a series of images. Your task is to describe the scene based on these sequential images. Stay focused on the animals in the frame.
For each image, your task is to:
1. Identify the animals present in the frame define their colour, breed, and any other notable features. If there are multiple animals present, define the same characteristics for all those in focus.
2. Describe the environment in context of the animals: Provide a brief description of the interaction between the animals and the objects or the environment around them. Be specific about the actions of every animal if there are multiple animals detected in a single frame.
"""
Output:
{
"scene_story": "A lively dog show takes center stage as two enthusiastic canines race down a carpeted aisle, cheered on by an eager crowd and accompanied by their handlers.",
"images": [
{
"description": "A white and brown Jack Russell Terrier runs energetically on a green carpet, escorted by a woman in a blue jacket. Behind them, a grey poodle closely follows while being guided by another handler in maroon clothing. The carpeted path is bordered by purple barriers adorned with yellow and blue flowers. Spectators and photographers on the left capture the moment."
},
{
"description": "The Jack Russell Terrier continues to lead, bounding forward with its handler. The grey poodle and its handler have gained some ground but maintain their position behind the terrier. The handlers maintain a brisk pace alongside the dogs as the crowd continues to observe and document the event."
},
{
"description": "In the final image, the Jack Russell Terrier keeps its lead, with the grey poodle still following closely. The handlers are encouraging their dogs with focused enthusiasm. The spectators remain engaged, some with cameras ready to capture the exciting finish."
}
]
}
Note: This prompt successfully captured both the specific breeds and the overall scene dynamics, providing a detailed and accurate description. However, the format could be more structured for easier parsing and use in applications. Our final iteration addresses this.
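To illustrate why structure matters for downstream use, here is a minimal sketch of checking and searching a structured response. It assumes each frame record carries the `frame_time`, `breeds`, `actions`, and `emotion` fields requested in Scene Prompt 4. The sample data is made up for illustration and is not output from our experiment. `validate_scene` and `find_frames` are hypothetical helpers.

```python
import json

REQUIRED_FRAME_KEYS = {"frame_time", "breeds", "actions", "emotion"}


def validate_scene(scene):
    """Return a list of problems found in a parsed scene response."""
    problems = []
    if not isinstance(scene.get("scene_story"), str):
        problems.append("scene_story missing or not a string")
    images = scene.get("images")
    if not isinstance(images, list):
        return problems + ["images missing or not a list"]
    for i, image in enumerate(images):
        if not isinstance(image, dict):
            problems.append(f"images[{i}] is not an object")
            continue
        missing = REQUIRED_FRAME_KEYS - image.keys()
        if missing:
            problems.append(f"images[{i}] missing keys: {sorted(missing)}")
    return problems


def find_frames(scene, breed, emotion_keyword):
    """Return frames naming the breed whose emotion text contains the keyword."""
    matches = []
    for image in scene["images"]:
        breeds = {b.get("breed", "").lower() for b in image.get("breeds", [])}
        emotion = image.get("emotion", "").lower()
        if breed.lower() in breeds and emotion_keyword.lower() in emotion:
            matches.append(image)
    return matches


# Made-up sample response for illustration, NOT output from the experiment.
sample = json.loads("""
{
  "scene_story": "Placeholder story",
  "images": [
    {"frame_time": "1",
     "breeds": [{"breed": "Golden Retriever", "color": "golden"}],
     "actions": "Trots beside its handler",
     "emotion": "Happy, tail wagging"},
    {"frame_time": "2",
     "breeds": [{"breed": "Poodle", "color": "grey"}],
     "actions": "Walks calmly",
     "emotion": "Calm"}
  ]
}
""")

problems = validate_scene(sample)
happy = find_frames(sample, "Golden Retriever", "happy")
print(problems, [frame["frame_time"] for frame in happy])
```

Keyword matching on the emotion text is deliberately crude. A real search layer would likely normalize emotions to a fixed vocabulary or use embeddings, but even this simple filter is not possible against free-text descriptions without extra parsing.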
✏️ Scene Prompt 4: Structured JSON output with emotional states
scene_prompt = """
You will be provided with a series of images from a dog show. Your task is to describe the scene based on these sequential images. Focus on identifying the breeds and describing the key actions.
For each image, your task is to:
1. Identify the animals present in the frame, including their breed, color, and any notable features.
2. Describe the actions of the animals and any interactions with the environment or other animals.
3. Highlight any emotional expressions or notable moments.
Output should be a structured JSON with the following format:
{
"scene_story": "Brief overview of the scene",
"images": [
{
"frame_time": "Time of the frame in seconds",
"breeds": [{"breed": "Golden Retriever", "color": "golden"}],
"actions": "Description of the actions and interactions",
"emotion": "Observed emotion or notable moment"
},
...
]
}
"""
Output:
{
"scene_story": "A dog show competition where handlers are walking their dogs down the green carpeted runway while photographers and spectators watch and take photos.",