Evaluate search results
This feature is in early access and is not intended for production usage.
This page shows you how to use the Pinecone Evals API to evaluate the relevance of search results, using an LLM to approximate human judgement. Result evaluation is helpful when experimenting with various aspects of your retrieval pipeline, such as embedding models, chunking strategies, and search techniques.
You can also try Pinecone’s experimental Python Evals SDK, a simple interface for comparing and measuring different search approaches with interactive reports.
How it works
1. You provide a query and a list of search results.
2. Pinecone prompts an LLM to evaluate the results.
3. Pinecone returns an evaluation for each result.
Generate evaluations
To evaluate the relevance of search results, use the eval API endpoint with the following request parameters:
Parameter | Description |
---|---|
query.inputs.text | The query text. |
eval.fields | The fields in each search result to evaluate. |
eval.mode | The mode of the prompt sent to the LLM. Accepted values: "search" or "rag". This determines how the LLM evaluates and scores the relevance of results. For more details, see Relevance scores. |
hits | The search results to evaluate. |
For example, the following request evaluates 10 results as relevant input for generating a response to the query, “What are some interesting facts about human biology?”:
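As a rough sketch, a request body built from the parameters documented above might look like the following. The hit shape (a "fields" object containing a text field named "chunk_text") is an assumption for illustration, and the 10 hits from the example are abbreviated to two; consult the Evals API reference for the exact wire format.

```python
import json

# Sketch of an Evals request body using only the parameters
# documented on this page. The hit structure is an assumption.
payload = {
    "query": {
        "inputs": {"text": "What are some interesting facts about human biology?"}
    },
    "eval": {
        "fields": ["chunk_text"],  # which field of each hit the LLM should judge
        "mode": "rag",             # judge hits as input for generating a response
    },
    "hits": [
        {"fields": {"chunk_text": "The human body contains roughly 37 trillion cells."}},
        {"fields": {"chunk_text": "The Eiffel Tower was completed in 1889."}},
    ],
}

print(json.dumps(payload, indent=2))
```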
The response includes a true/false judgment of relevance for each result. For details on how the LLM evaluates relevance, see Relevance scores.
Include evaluation details
To include more detailed scoring and justification in the response, set eval.debug to true:
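As a hedged sketch (field names beyond those documented on this page, such as "chunk_text", are illustrative assumptions), the request body differs from the basic case only by the debug flag:

```python
# Sketch of an Evals request body with detailed scoring enabled.
payload = {
    "query": {
        "inputs": {"text": "What are some interesting facts about human biology?"}
    },
    "eval": {
        "fields": ["chunk_text"],  # assumed field name for illustration
        "mode": "search",
        "debug": True,             # ask for per-result scores and justifications
    },
    "hits": [
        {"fields": {"chunk_text": "An adult human body has 206 bones."}},
    ],
}
```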
The expanded response includes a relevance score and justification for each result. For details on how the LLM evaluates relevance, see Relevance scores.
Set a relevance threshold
By default, any result given a relevance score of 2 (moderately relevant) or greater by the LLM is considered relevant. However, you can change this by setting the eval.relevance_threshold parameter. For example, to set a threshold of 3 (highly relevant), use the following request:
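A sketch of such a request, plus a small helper that mirrors the thresholding rule as documented (a result is relevant when its score meets or exceeds the threshold). Everything beyond the parameters named on this page is an illustrative assumption:

```python
# Sketch of an Evals request body with a stricter relevance threshold.
payload = {
    "query": {
        "inputs": {"text": "What are some interesting facts about human biology?"}
    },
    "eval": {
        "fields": ["chunk_text"],   # assumed field name for illustration
        "mode": "search",
        "debug": True,              # so scores appear alongside the judgments
        "relevance_threshold": 3,   # only highly relevant (score 3) hits pass
    },
    "hits": [
        {"fields": {"chunk_text": "An adult human body has 206 bones."}},
    ],
}

# The documented rule: relevant when score >= threshold (default 2).
def is_relevant(score: int, threshold: int = 2) -> bool:
    return score >= threshold

print(is_relevant(2), is_relevant(2, threshold=3))  # → True False
```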
Notice that only the one result with a score of 3 (highly relevant) now shows "relevant": true in the response:
Relevance scores
The relevance scores assigned to search results depend on the mode parameter in the request.
When the mode parameter is set to "search", the LLM is prompted to evaluate each result as a direct response to the query and give each result a score between 0 and 3 based on the following criteria:
Score | Description |
---|---|
3 | Highly relevant: Passage precisely addresses the core query, provides comprehensive and directly applicable information, contains minimal irrelevant content, and delivers factually accurate insights. |
2 | Moderately relevant: Passage addresses a substantial portion of the query but may miss some elements, provides useful information that lacks some depth or comprehensiveness, and contains only minor irrelevant details. |
1 | Partially relevant: Passage touches on query-related aspects but lacks depth or covers only a small part, contains notable irrelevant content, or requires additional context to be useful. |
0 | Not relevant: Passage fails to address the query, contains primarily irrelevant or off-topic content, or provides no meaningful insight for the query. |
When the mode parameter is set to "rag", the LLM is prompted to evaluate each result as input for generating a response to the query (e.g., RAG) and give each result a score between 0 and 3 based on the following criteria:
Score | Description |
---|---|
3 | Highly relevant: Passage contains specific information about key concepts in the query, provides detailed context that would be valuable for constructing a comprehensive response, contains minimal irrelevant content, and offers unique insights or facts directly related to query topics. |
2 | Moderately relevant: Passage addresses some query concepts with useful information, provides context that would contribute to a response but may lack some specificity or depth, and contains only minor irrelevant details. |
1 | Partially relevant: Passage mentions query-related concepts but with limited depth or specificity, provides general information that has some connection to the query, or contains notable irrelevant content alongside some useful information. |
0 | Not relevant: Passage contains no information about key concepts in the query, provides no context that would help construct a meaningful response, or consists primarily of irrelevant content. |