⚡️ Query transformation is a crucial step in enhancing RAG pipelines, especially when dealing with multimodal information. By breaking a query down into its spoken and visual components, you can create more targeted and efficient searches. ⚡️
While a manual breakdown is a good starting point, automating the process with an LLM greatly improves scalability and accuracy, making your systems more powerful and user-friendly.
# Manually breaking down the query
spoken_query = "Show me where the narrator discusses the formation of the solar system"
visual_query = "Visualize the Milky Way galaxy"
# Using an LLM to transform the query
from openai import OpenAI

transformation_prompt = """
Divide the following query into two distinct parts: one for spoken content and one for visual content. The spoken content should refer to any narration, dialogue, or verbal explanations, and the visual content should refer to any images, videos, or graphical representations. Format the response strictly as:\nSpoken: <spoken_query>\nVisual: <visual_query>\n\nQuery: {query}
"""

# Initialize OpenAI client
client = OpenAI()

def divide_query(query):
    # Use the OpenAI client to create a chat completion with a structured prompt
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat-capable model works here
        messages=[{"role": "user", "content": transformation_prompt.format(query=query)}],
    )
    message = response.choices[0].message.content.strip()
    # Parse the "Spoken: ..." and "Visual: ..." lines from the model's reply
    spoken_line, visual_line = message.split("\n")[:2]
    spoken_query = spoken_line.replace("Spoken:", "").strip()
    visual_query = visual_line.replace("Visual:", "").strip()
    return spoken_query, visual_query
query ="Show me the footage where the narrator talks about the terrestrial planets and Mercury, Venus, Earth are visible on the screen"
spoken_query, visual_query =divide_query(query)
print(f"Spoken Query: {spoken_query}")
print(f"Visual Query: {visual_query}")
🔎 Step 3: Perform Searches
Now that we have divided the query, let's perform searches on both the spoken-word and scene indexes:
from videodb import SearchType, IndexType
# Perform the search using the spoken query
spoken_results = video.search(
query=spoken_query,
index_type=IndexType.spoken_word,
search_type=SearchType.semantic
)
# Perform the search using the visual query, overriding some default scoring parameters
scene_results = video.search(
query=visual_query,
index_type=IndexType.scene,
search_type=SearchType.semantic,
score_threshold=0.1,
dynamic_score_percentage=100,
)
# Optionally, you can play the results to see what was found
spoken_results.play()
scene_results.play()
🔀 Step 4: Combine Search Results of Both Modalities
Each search result provides a list of timestamps relevant to the query in its own modality (spoken/semantic and scene/visual, in this case).
There are two ways to combine these search results:
Union: This method takes all the timestamps from every search result, creating a comprehensive list that includes every relevant time, even if some timestamps appear in only one result.
Intersection: This method includes only the timestamps that appear in both search results, yielding a smaller list of times that are relevant to both queries.
Depending on the method you prefer, you can pass the appropriate argument to the combine_results() function below.
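The combine_results() helper isn't shown in this snippet, so here is a minimal sketch of what it might look like, assuming each SearchResult exposes its matched segments as shots with start and end timestamps via get_shots(); it produces the results timeline used in the next step:

def combine_results(spoken_results, scene_results, method="intersection"):
    # Pull (start, end) timestamp pairs out of each result set
    spoken_spans = [(s.start, s.end) for s in spoken_results.get_shots()]
    scene_spans = [(s.start, s.end) for s in scene_results.get_shots()]
    if method == "union":
        # Union: merge every span from both modalities into one overlap-free list
        merged = []
        for start, end in sorted(spoken_spans + scene_spans):
            if merged and start <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], end))
            else:
                merged.append((start, end))
        return merged
    # Intersection: keep only the portions of time covered by both result sets
    overlaps = []
    for s_start, s_end in spoken_spans:
        for v_start, v_end in scene_spans:
            start, end = max(s_start, v_start), min(s_end, v_end)
            if start < end:
                overlaps.append((start, end))
    return sorted(overlaps)

results = combine_results(spoken_results, scene_results, method="intersection")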
Finally, let's generate a stream of the intersecting segments and watch it:
from videodb import play_stream
print(f"Multimodal Query: {query}")
stream_link = video.generate_stream(results)
play_stream(stream_link)
This would play a video stream containing only the segments where both the spoken content and visual content match our original multimodal query.
Conclusion
Congratulations 🙌🏼 You've successfully implemented a multimodal search workflow. This powerful technique allows for precise identification of video segments that match both spoken and visual criteria, opening up new possibilities for:
Law Enforcement: Helps in quickly retrieving crucial evidence from vast amounts of surveillance and news footage.
Media and Journalism: Simplifies locating specific segments within hours of news broadcasts, aiding efficient reporting and fact-checking.
Public Safety: Enhances the ability of authorities to disseminate important information to the public by quickly identifying and sharing relevant content.
... and much more!
There are more methods for enabling multimodal search queries, and we'll be adding a detailed guide for each of them.
Remember, the key to mastering this technique is experimentation. Try different queries, adjust your search parameters, and see how you can fine-tune the results for your specific use case.
To learn more about Scene Index, explore the following guides: