Evaluate search results

This feature is in early access and is not intended for production usage.

This page shows you how to use the Pinecone Evals API to evaluate the relevance of search results, using an LLM to approximate human judgement. Result evaluation is helpful when experimenting with various aspects of your retrieval pipeline, such as embedding models, chunking strategies, and search techniques. You can also try Pinecone’s experimental Python Evals SDK, a simple interface for comparing and measuring different search approaches with interactive reports.

How it works

1. You provide a query and a list of search results.
2. Pinecone prompts an LLM to evaluate the results.
3. Pinecone returns an evaluation for each result.

Generate evaluations

To evaluate the relevance of search results, use the eval API endpoint with the following request parameters:

Parameter	Description
`query.inputs.text`	The query text.
`eval.fields`	The field or fields in each search result to evaluate, given as an array of field names.
`eval.mode`	The mode of the prompt sent to the LLM. Accepted values: `"search"` or `"rag"`. This determines how the LLM evaluates and scores the relevance of results. For more details, see Relevance scores.
`hits`	The search results to evaluate.

For example, the following request evaluates three results as relevant input for generating a response to the query, “What are some interesting facts about the human body?”:

curl

curl "https://api.pinecone.io/evals" \
  -H "Content-Type: application/json" \
  -H "Api-Key: YOUR_API_KEY" \
  -d '{
    "query": {
      "inputs": {
        "text": "What are some interesting facts about the human body?"
      }
    },
    "eval": {
      "fields": ["chunk_text"],
      "mode": "rag"
    },
    "hits": [
      { "chunk_text": "The human body is made up of about 60% water." },
      { "chunk_text": "The cornea is the only part of the body with no blood supply; it gets oxygen directly from the air." },
      { "chunk_text": "Dogs have an incredible sense of smell, much stronger than humans." }
    ]
  }'
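The same request can also be built and sent from Python. The following is a minimal sketch using only the standard library. It assumes the endpoint URL, headers, and body shape shown in the curl example above; the helper names `build_eval_request` and `send_eval_request` are hypothetical and are not part of any Pinecone SDK.

```python
import json
import urllib.request

EVALS_URL = "https://api.pinecone.io/evals"  # endpoint taken from the curl example


def build_eval_request(query_text, hits, fields, mode="rag"):
    """Build the JSON body for an evaluation request (hypothetical helper)."""
    if mode not in ("search", "rag"):
        raise ValueError('mode must be "search" or "rag"')
    return {
        "query": {"inputs": {"text": query_text}},
        "eval": {"fields": list(fields), "mode": mode},
        "hits": list(hits),
    }


def send_eval_request(body, api_key):
    """POST the body to the Evals endpoint and return the parsed JSON response."""
    request = urllib.request.Request(
        EVALS_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "Api-Key": api_key},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = build_eval_request(
    "What are some interesting facts about the human body?",
    hits=[
        {"chunk_text": "The human body is made up of about 60% water."},
        {"chunk_text": "Dogs have an incredible sense of smell, much stronger than humans."},
    ],
    fields=["chunk_text"],
)
print(json.dumps(body, indent=2))
# To send it (requires a real key): result = send_eval_request(body, "YOUR_API_KEY")
```

Keeping the body construction separate from the network call makes it possible to inspect or log the payload before any tokens are spent on evaluation.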

The response includes a true/false judgment of relevance for each result. For details on how the LLM evaluates relevance, see Relevance scores.

curl

{
  "hits": [
    {
      "index": 0,
      "fields": {
        "chunk_text": "The human body is made up of about 60% water."
      },
      "relevant": true
    },
    {
      "index": 1,
      "fields": {
        "chunk_text": "The cornea is the only part of the body with no blood supply; it gets oxygen directly from the air."
      },
      "relevant": true
    },
    {
      "index": 2,
      "fields": {
        "chunk_text": "Dogs have an incredible sense of smell, much stronger than humans."
      },
      "relevant": false
    }
  ],
  "metrics": {
    "ndcg": 0.8339912323981488,
    "map": 1.0,
    "mrr": 1.0
  },
  "usage": {
    "evaluation_input_tokens": 1066,
    "evaluation_output_tokens": 261
  }
}
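Once the response is parsed, the relevant results can be pulled out with a few lines of Python. This is a minimal sketch that assumes the response shape shown above; `relevant_hits` is a hypothetical helper name, and the `response` dictionary is an abbreviated copy of the sample.

```python
def relevant_hits(response, field="chunk_text"):
    """Return (index, text) pairs for the hits marked as relevant."""
    return [
        (hit["index"], hit["fields"][field])
        for hit in response["hits"]
        if hit["relevant"]
    ]


# Abbreviated copy of the sample response above.
response = {
    "hits": [
        {"index": 0, "fields": {"chunk_text": "The human body is made up of about 60% water."}, "relevant": True},
        {"index": 1, "fields": {"chunk_text": "The cornea is the only part of the body with no blood supply."}, "relevant": True},
        {"index": 2, "fields": {"chunk_text": "Dogs have an incredible sense of smell."}, "relevant": False},
    ],
    "usage": {"evaluation_input_tokens": 1066, "evaluation_output_tokens": 261},
}

for index, text in relevant_hits(response):
    print(index, text)

# Total tokens consumed by the evaluation, as reported in "usage".
total_tokens = sum(response["usage"].values())
print("tokens used:", total_tokens)
```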

Include evaluation details

To include more detailed scoring and justification in the response, set `eval.debug` to `true`:

curl

curl "https://api.pinecone.io/evals" \
  -H "Content-Type: application/json" \
  -H "Api-Key: YOUR_API_KEY" \
  -d '{
    "query": {
      "inputs": {
        "text": "What are some interesting facts about the human body?"
      }
    },
    "eval": {
      "fields": ["chunk_text"],
      "mode": "rag",
      "debug": true
    },
    "hits": [
      { "chunk_text": "The human body is made up of about 60% water." },
      { "chunk_text": "The cornea is the only part of the body with no blood supply; it gets oxygen directly from the air." },
      { "chunk_text": "Dogs have an incredible sense of smell, much stronger than humans." }
    ]
  }'

The expanded response includes a relevance score and justification for each result. For details on how the LLM evaluates relevance, see Relevance scores.

curl

{
  "hits": [
    {
      "index": 0,
      "fields": {
        "chunk_text": "The human body is made up of about 60% water."
      },
      "relevant": true,
      "score": 2,
      "justification": "The passage provides a specific, factual statement about the human body's water composition, which directly addresses the query about interesting human body facts. While brief and lacking depth (it's only one fact), it is precisely the type of information requested and would be useful in constructing a response. The fact is accurate and relevant, though it represents only a single data point rather than comprehensive information."
    },
    {
      "index": 1,
      "fields": {
        "chunk_text": "The cornea is the only part of the body with no blood supply; it gets oxygen directly from the air."
      },
      "relevant": true,
      "score": 3,
      "justification": "The passage provides a specific, factual detail about the human cornea having no blood supply and getting oxygen directly from the air. This is precisely the type of interesting biological fact that would be relevant to include in a response about interesting human body facts. The information is concise but contains a complete and specific anatomical fact."
    },
    {
      "index": 2,
      "fields": {
        "chunk_text": "Dogs have an incredible sense of smell, much stronger than humans."
      },
      "relevant": false,
      "score": 0,
      "justification": "The passage is about dogs' sense of smell, which has no connection to the query about human body facts. The passage mentions humans only as a comparison point for dogs' superior smell, but provides no information about the human body itself that would be useful for answering the query."
    }
  ],
  "metrics": {
    "ndcg": 0.8339912323981488,
    "map": 1.0,
    "mrr": 1.0
  },
  "usage": {
    "evaluation_input_tokens": 1066,
    "evaluation_output_tokens": 268
  }
}
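This page does not define the values in `metrics`. The numbers in the sample response are consistent with common textbook definitions: NDCG with an exponential gain of `2**score - 1` over the graded scores, and, for a single query, average precision and reciprocal rank over the `relevant` flags. The sketch below reproduces the sample values under that assumption; it is not a statement of how Pinecone computes them.

```python
import math


def ndcg(scores):
    """NDCG with exponential gain (2**score - 1) and a log2 rank discount."""
    def dcg(values):
        return sum(
            (2 ** s - 1) / math.log2(rank + 1)
            for rank, s in enumerate(values, start=1)
        )

    ideal = dcg(sorted(scores, reverse=True))
    return dcg(scores) / ideal if ideal > 0 else 0.0


def average_precision(relevant):
    """Mean of precision@k over the ranks k that hold a relevant result."""
    found, precisions = 0, []
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            found += 1
            precisions.append(found / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def reciprocal_rank(relevant):
    """1 / rank of the first relevant result, or 0.0 if there is none."""
    for rank, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


scores = [2, 3, 0]              # "score" values from the response above
relevant = [True, True, False]  # "relevant" flags from the response above
print(ndcg(scores), average_precision(relevant), reciprocal_rank(relevant))
```

With these inputs, the three functions give approximately 0.834, 1.0, and 1.0, in line with `ndcg`, `map`, and `mrr` in the sample response. NDCG falls below 1 because the result scored 3 is ranked second rather than first.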

Set a relevance threshold

By default, any result that the LLM scores 2 (moderately relevant) or higher is considered relevant. You can change this by setting the `eval.relevance_threshold` parameter. For example, the following request sets a threshold of 3 (highly relevant):

curl

curl "https://api.pinecone.io/evals" \
  -H "Content-Type: application/json" \
  -H "Api-Key: YOUR_API_KEY" \
  -d '{
    "query": {
      "inputs": {
        "text": "What are some interesting facts about the human body?"
      }
    },
    "eval": {
      "fields": ["chunk_text"],
      "mode": "rag",
      "debug": true,
      "relevance_threshold": 3
    },
    "hits": [
      { "chunk_text": "The human body is made up of about 60% water." },
      { "chunk_text": "The cornea is the only part of the body with no blood supply; it gets oxygen directly from the air." },
      { "chunk_text": "Dogs have an incredible sense of smell, much stronger than humans." }
    ]
  }'

Notice that only the result with a score of 3 (highly relevant) now shows `"relevant": true` in the response, and that `map` and `mrr` drop from 1.0 to 0.5 as a result:

curl

{
  "hits": [
    {
      "index": 0,
      "fields": {
        "chunk_text": "The human body is made up of about 60% water."
      },
      "relevant": false,
      "score": 2,
      "justification": "The passage provides a specific, factual statement about the human body's water composition (60%), which directly addresses the query asking for interesting facts about the human body. While this is just a single fact without elaboration or additional context, it is precisely the type of information requested and would be useful in constructing a response. However, it lacks depth as it's only one brief fact without further explanation."
    },
    {
      "index": 1,
      "fields": {
        "chunk_text": "The cornea is the only part of the body with no blood supply; it gets oxygen directly from the air."
      },
      "relevant": true,
      "score": 3,
      "justification": "The passage provides a specific, factual detail about the human cornea having no blood supply and getting oxygen directly from the air. This is precisely the type of interesting biological fact that would be relevant to include in a response about interesting human body facts. The information is concise but contains a complete and specific anatomical insight."
    },
    {
      "index": 2,
      "fields": {
        "chunk_text": "Dogs have an incredible sense of smell, much stronger than humans."
      },
      "relevant": false,
      "score": 0,
      "justification": "The passage discusses a fact about dogs' sense of smell compared to humans, but the query specifically asks about facts about the human body. While the passage makes a brief comparative reference to humans, it is primarily about dogs and provides no substantive information about the human body itself. The passage would not contribute meaningful information to answer a query about human body facts."
    }
  ],
  "metrics": {
    "ndcg": 0.8339912323981488,
    "map": 0.5,
    "mrr": 0.5
  },
  "usage": {
    "evaluation_input_tokens": 1066,
    "evaluation_output_tokens": 299
  }
}
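The effect of the threshold can be reproduced client-side from the `score` values in a debug response, which may be useful for comparing thresholds without re-running the evaluation. This is a minimal sketch that assumes the rule described above (a result is relevant when its score is greater than or equal to the threshold); `apply_threshold` is a hypothetical helper name.

```python
DEFAULT_RELEVANCE_THRESHOLD = 2  # default described in this section


def apply_threshold(scores, threshold=DEFAULT_RELEVANCE_THRESHOLD):
    """Mark each score as relevant when it meets or exceeds the threshold."""
    return [score >= threshold for score in scores]


scores = [2, 3, 0]  # "score" values from the responses above

print(apply_threshold(scores))               # flags under the default threshold
print(apply_threshold(scores, threshold=3))  # flags under the stricter threshold
```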

Relevance scores

The relevance scores assigned to search results depend on the `eval.mode` parameter in the request.

When `eval.mode` is set to `"search"`, the LLM is prompted to evaluate each result as a direct response to the query and to give each result a score from 0 to 3 based on the following criteria:

Score	Description
3	Highly relevant: Passage precisely addresses the core query, provides comprehensive and directly applicable information, contains minimal irrelevant content, and delivers factually accurate insights.
2	Moderately relevant: Passage addresses a substantial portion of the query but may miss some elements, provides useful information that lacks some depth or comprehensiveness, and contains only minor irrelevant details.
1	Partially relevant: Passage touches on query-related aspects but lacks depth or covers only a small part, contains notable irrelevant content, or requires additional context to be useful.
0	Not relevant: Passage fails to address the query, contains primarily irrelevant or off-topic content, or provides no meaningful insight for the query.
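When reviewing many evaluated results, it can help to summarize how the scores are distributed across these levels. The sketch below assumes hits shaped like those in a debug response (each with a `score` field) and uses the labels from the table above, which describes `"search"` mode. `SCORE_LABELS` and `summarize_scores` are hypothetical names.

```python
from collections import Counter

# Labels taken from the score table above.
SCORE_LABELS = {
    3: "Highly relevant",
    2: "Moderately relevant",
    1: "Partially relevant",
    0: "Not relevant",
}


def summarize_scores(hits):
    """Count how many hits fall into each relevance level, highest first."""
    counts = Counter(hit["score"] for hit in hits)
    return {
        SCORE_LABELS[score]: counts.get(score, 0)
        for score in sorted(SCORE_LABELS, reverse=True)
    }


hits = [{"score": 2}, {"score": 3}, {"score": 0}]
for label, count in summarize_scores(hits).items():
    print(f"{label}: {count}")
```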
