A Scene object describes a unique event in the video. From a timeline perspective, it’s a timestamp range.
video_id : id of the video object
start : start time in seconds
end : end time in seconds
description : string description
Each Scene object has a frames attribute, which holds a list of Frame objects.
Frame
Each Scene can be described by a list of frames. Each Frame object primarily holds the URL of the frame image and its description.
id : ID of the frame object
url : URL of the image
frame_time : Timestamp of the frame in the video
description : string description
video_id : id of the video object
scene_id : id of the scene object
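For example, here is a minimal sketch of walking these objects, assuming scenes were produced by the extraction step described below:
# Inspect each scene and its frames (attribute names as listed above)
for scene in scenes:
    print(scene.video_id, scene.start, scene.end, scene.description)
    for frame in scene.frames:
        print(frame.id, frame.url, frame.frame_time, frame.description)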
We provide you with easy-to-use objects and functions to bring flexibility to the design of your visual understanding pipeline. With these tools, you have the freedom to:
Extract scenes according to your use case.
Go down to the frame-level abstraction.
Assign a label and a custom model description to each frame.
Use multiple models and prompts per scene or frame to convert visual information to text (see the sketch after this list).
Send multiple frames to a vision model for better temporal or activity understanding.
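As an illustration of the multi-prompt point, a minimal sketch assuming a hypothetical describe() helper that wraps whichever vision model you use (it is not part of this SDK):
# Combine answers from multiple prompts into one frame description
prompts = ["List the objects visible in this image.", "What action is taking place?"]
for frame in scene.frames:
    frame.description = " ".join(describe(frame.url, p) for p in prompts)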
extract_scenes()
This function accepts the extraction_type and extraction_config parameters and returns a list of Scene objects.
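A minimal sketch of the call; the extraction_type and extraction_config values below are illustrative assumptions, not a definitive reference:
scenes = video.extract_scenes(
    extraction_type="time_based",    # hypothetical extraction strategy
    extraction_config={"time": 10},  # hypothetical config for that strategy
)
print(f"Extracted {len(scenes)} scenes")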
Vision models excel at describing images, but videos add complexity because the information changes over time. With our pipeline, you can maintain image-level understanding in frames and combine them using LLMs at the scene level to capture temporal or activity-related understanding.
You have the freedom to iterate through every scene and frame and describe the information for indexing purposes:
# Describe each frame with text from your own sources or pipeline
for frame in scene.frames:
    frame.description = "bring text from external sources/pipeline"
Create Scene by custom annotation
These annotations can come from your application, or from an external vision model if you extract descriptions using a vision LLM.
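A minimal sketch, assuming the Scene class can be instantiated directly with the attributes listed above; the timestamps and description are placeholders for your own annotation:
scene = Scene(
    video_id=video.id,
    start=0,   # seconds
    end=10,    # seconds
    description="A person unboxes a package at a desk",  # your custom annotation
)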
Once frames are described, you can roll their descriptions up into a scene-level summary:
for scene in scenes:
    scene.description = "summary of frame level description"
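For instance, a hedged sketch of producing that summary with an LLM; this assumes OpenAI's chat API as the summarizer, and the model and prompt are illustrative choices:
from openai import OpenAI

client = OpenAI()

for scene in scenes:
    # Gather the frame-level descriptions produced earlier
    frame_text = "\n".join(f.description for f in scene.frames)
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable text LLM works here
        messages=[{
            "role": "user",
            "content": f"Summarize the activity across these frame descriptions:\n{frame_text}",
        }],
    )
    scene.description = response.choices[0].message.content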
Using this pipeline, you have the freedom to design your own flow. In the example above, we’ve described each frame in the scene independently, but some vision models accept multiple images in a single call. Feel free to customise the flow to your needs:
Experiment with sending multiple frames to a vision model (see the sketch after this list).
Use prompts to describe multiple frames at once, then assign those descriptions to the scene.
Integrate your own vision model into the pipeline.
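As a sketch of the first two points, assuming OpenAI's multimodal chat API as the vision model; the model name, prompt, and frame count are illustrative assumptions:
from openai import OpenAI

client = OpenAI()

# Send several frames from one scene in a single request so the model
# can reason about the activity across them
content = [{"type": "text", "text": "Describe the activity across these frames."}]
for frame in scene.frames[:4]:
    content.append({"type": "image_url", "image_url": {"url": frame.url}})

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
scene.description = response.choices[0].message.content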
We’ll soon be adding more details and strategies for effective, advanced multimodal search. We welcome your input on which strategies have worked best in your specific use cases.