When you index your data and retrieve it with certain parameters, how do you measure the effectiveness of your search? This is where search evaluation comes in. By using test data, queries, and their results, you can assess the performance of indexes, search parameters, and other related factors. This evaluation helps you understand how well your search system is working and identify areas for improvement.
We can think of the information indexed from a video as documents of the form “timestamp + some textual description of the visuals” (there is no audio in this video).
We can use the following structure:
timestamp: (start, end), description: "string"
So, if we use the index_scenes function, the indexed documents look like this:
At (1, 2) - 29 seconds is displayed
At (2, 3) - 28 seconds is displayed
...
This continues until:
At (29, 30) - 1 second is displayed
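The document structure above can be sketched in plain Python. The data below is illustrative, mirroring the scene descriptions just listed (the real descriptions would come from scene indexing):

```python
# Each scene is a (start, end) timestamp pair plus a textual description
# of the visuals. Illustrative data only.
documents = [
    {"timestamp": (start, start + 1),
     "description": f"{30 - start} second{'s' if 30 - start != 1 else ''} is displayed"}
    for start in range(1, 30)
]

print(documents[0])   # first scene: (1, 2), "29 seconds is displayed"
print(documents[-1])  # last scene: (29, 30), "1 second is displayed"
```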
Ground Truth
It is the ideal expected result. To evaluate search performance, we need some test queries and their expected results.
Let's say for the query "Six" the expected result documents are at the following timestamps:
We will call this list of timestamps our ground truth for the query "Six."
Evaluation Metrics
To evaluate the effectiveness of our search functionality, we can experiment with our query "Six" using various search parameters. 📊
The search results can be categorized as follows:
Retrieved Documents 🔍:
Retrieved Relevant Documents: Matches our ground truth ✅
Retrieved Irrelevant Documents: Don't match our ground truth ❌
Non-Retrieved Documents 🚫:
Non-Retrieved Relevant Documents: In our ground truth but not in results 😕
Non-Retrieved Irrelevant Documents: Neither in ground truth nor results 👍
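These four categories can be made concrete with a few lines of Python. The ground-truth and retrieved timestamps below are hypothetical, chosen only to illustrate the classification:

```python
# Hypothetical data: ground-truth timestamps for a query and the timestamps
# a search run returned. All values are illustrative.
ground_truth = {(5, 6), (24, 25)}                   # assumed relevant documents
retrieved    = {(5, 6), (14, 15)}                   # assumed search results
all_docs     = {(s, s + 1) for s in range(1, 30)}   # every indexed document

true_positives  = retrieved & ground_truth             # retrieved relevant
false_positives = retrieved - ground_truth             # retrieved irrelevant
false_negatives = ground_truth - retrieved             # non-retrieved relevant
true_negatives  = all_docs - retrieved - ground_truth  # non-retrieved irrelevant
```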
We can further classify these categories in terms of search accuracy:
💡 This classification helps us assess the precision and recall of our search algorithm, enabling further optimization.
Accuracy
Accuracy measures how well our search algorithm retrieves relevant documents while excluding irrelevant ones. It can be calculated as follows:

Accuracy = (Retrieved Relevant + Non-Retrieved Irrelevant) / Total Documents
In other words, accuracy is the ratio of correctly classified documents (both retrieved relevant and non-retrieved irrelevant) to the total number of documents. 📊
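As a quick sketch, using hypothetical counts (1 retrieved relevant, 1 retrieved irrelevant, 1 missed relevant, and 26 correctly ignored, out of 29 documents):

```python
# Accuracy = correctly classified documents / total documents.
# Counts are illustrative, not measured results.
tp, fp, fn, tn = 1, 1, 1, 26
accuracy = (tp + tn) / (tp + fp + fn + tn)
print(round(accuracy, 3))  # 0.931
```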
To get a more comprehensive evaluation of search performance, it's crucial to consider other metrics such as precision, recall, and F1-score in addition to accuracy. 💡🔬
Precision and Recall
Precision is the percentage of relevant retrieved documents out of all retrieved documents. It answers the question: "Of the documents our search returned, how many were actually relevant?"
Recall indicates the percentage of relevant documents that were successfully retrieved. It addresses the question: "Out of all the relevant documents, how many did our search find?" 🔍
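Using the same style of hypothetical counts, precision, recall, and the F1-score (their harmonic mean) can be computed directly:

```python
# Illustrative counts: 1 retrieved relevant, 1 retrieved irrelevant,
# 1 relevant document missed.
tp, fp, fn = 1, 1, 1
precision = tp / (tp + fp)  # of retrieved docs, how many were relevant
recall    = tp / (tp + fn)  # of relevant docs, how many were retrieved
f1 = 2 * precision * recall / (precision + recall)
print(precision, recall, round(f1, 2))  # 0.5 0.5 0.5
```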
The Precision-Recall Trade-off
These metrics often have an inverse relationship, leading to a trade-off:
Recall 📈:
Measures the model's ability to find all relevant cases in a dataset.
Increases or remains constant as more documents are retrieved.
Never decreases with an increase in retrieved documents.
Precision 📉:
Refers to the proportion of correct positive identifications.
Typically decreases as more documents are retrieved.
Drops due to increased likelihood of including false positives.
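The trade-off can be demonstrated with a tiny simulation over a hypothetical ranked result list: as k (the number of retrieved documents) grows, recall never decreases, while precision tends to drop once false positives creep in:

```python
# Hypothetical ranked search results and ground truth, for illustration only.
ranked = [(5, 6), (14, 15), (24, 25), (8, 9), (19, 20)]
relevant = {(5, 6), (24, 25)}

for k in range(1, len(ranked) + 1):
    top_k = set(ranked[:k])
    hits = len(top_k & relevant)
    print(f"k={k}  precision={hits / k:.2f}  recall={hits / len(relevant):.2f}")
```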
Search in VideoDB
Let’s understand the search interface provided by VideoDB and measure results with the metrics above.
This function performs a search on video content with various customizable parameters:
query: The search query string.
search_type: Determines the search method.
SearchType.keyword: Keyword search at the single-video level; returns all documents that match the query text.
SearchType.semantic (default): For question-answering queries; works across thousands of videos in a collection.
This interface allows for flexible and precise searching of video content, with options to fine-tune result filtering based on relevance scores and dynamic thresholds.
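To make the keyword mode concrete, here is a minimal mock of keyword-style matching over the indexed documents. This is illustrative only; it is not the VideoDB SDK, just a sketch of what "match the query text against each document" means:

```python
# Mock keyword search: return every document whose description contains
# the query as a whole word. Documents are hypothetical.
def keyword_search(query, documents):
    return [d for d in documents
            if query.lower() in d["description"].lower().split()]

documents = [
    {"timestamp": (5, 6),   "description": "six seconds is displayed"},
    {"timestamp": (14, 15), "description": "sixteen seconds is displayed"},
    {"timestamp": (20, 21), "description": "ten seconds is displayed"},
]

print(keyword_search("six", documents))  # only the (5, 6) document matches
```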
Experiment
Follow this notebook to explore experiments on fine-tuning search results and gain a deeper understanding of the methods involved.
Here’s a basic outcome of the default settings for both search types on the query "six" for the above video:
1. Semantic Search Default:
2. Keyword Search:
Outcome
As you can see, keyword search is best suited for queries like "teen" and "six." However, if the queries are in natural language, such as "find me a 6," then semantic search is more appropriate; keyword search would struggle to find relevant results for such queries.
Search + LLM
For complex queries like "Find me all the numbers greater than six," a basic search will not work effectively, since it merely matches the query with documents in vector space and returns the matching documents.
In such cases, you can apply a loose filter to get all the documents that match the query. However, you will need to add an additional layer of intelligence using a Large Language Model (LLM). The matched documents can then be passed to the LLM to curate a response that accurately answers the query.
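This pattern can be sketched as a retrieval step followed by prompt construction; the LLM call itself is left out, and the helper name and document format below are hypothetical:

```python
# Sketch of the Search + LLM pattern: take loosely matched documents and
# build a prompt for an LLM to reason over. build_llm_prompt is a
# hypothetical helper, not part of any SDK.
def build_llm_prompt(query, matched_docs):
    context = "\n".join(
        f"- ({d['timestamp'][0]}s-{d['timestamp'][1]}s): {d['description']}"
        for d in matched_docs
    )
    return (
        f"Using only the video scenes below, answer the query.\n"
        f"Query: {query}\n"
        f"Scenes:\n{context}"
    )

docs = [
    {"timestamp": (22, 23), "description": "seven seconds is displayed"},
    {"timestamp": (21, 22), "description": "eight seconds is displayed"},
]
prompt = build_llm_prompt("Find me all the numbers greater than six", docs)
print(prompt)  # this string would be sent to the LLM of your choice
```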