Filter with metadata
You can limit your vector search based on metadata. Pinecone lets you attach metadata key-value pairs to vectors in an index, and specify filter expressions when you query the index.
Searches with metadata filters retrieve exactly the number of nearest-neighbor results that match the filters. In most cases, search latency is even lower than for unfiltered searches.
Searches without metadata filters do not consider metadata. To combine keywords with semantic search, see sparse-dense embeddings.
Supported metadata types
You can associate a metadata payload with each vector in an index, as key-value pairs in a JSON object where keys are strings and values are one of:
- String
- Number (integer or floating point; converted to a 64-bit floating point)
- Boolean (true, false)
- List of strings
Null metadata values are not supported. Instead of setting a key to a null value, we recommend you remove that key from the metadata payload.
For example, the following would be valid metadata payloads:
{
"genre": "action",
"year": 2020,
"length_hrs": 1.5
}
{
"color": "blue",
"fit": "straight",
"price": 29.99,
"is_jeans": true
}
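As a rough illustration of the rules above, the following sketch checks a payload against the supported types and drops null values before upserting. The helper names (`drop_nulls`, `is_supported_value`, `validate_metadata`) are hypothetical and not part of any Pinecone client; this is a local pre-check under those assumptions, not how Pinecone validates requests.

```python
def drop_nulls(metadata):
    """Return a copy of the payload without keys whose value is None."""
    return {key: value for key, value in metadata.items() if value is not None}


def is_supported_value(value):
    """Check one value against the supported types listed above."""
    # Strings, numbers, and booleans are accepted as-is.
    if isinstance(value, (str, bool, int, float)):
        return True
    # Lists are accepted only when every item is a string.
    if isinstance(value, list):
        return all(isinstance(item, str) for item in value)
    return False


def validate_metadata(metadata):
    """Raise ValueError naming every key with an unsupported key or value."""
    bad = [
        repr(key)
        for key, value in metadata.items()
        if not isinstance(key, str) or not is_supported_value(value)
    ]
    if bad:
        raise ValueError("unsupported metadata for keys: " + ", ".join(bad))
    return metadata


payload = drop_nulls({"genre": "action", "year": 2020, "director": None})
validate_metadata(payload)  # passes: the null "director" key was removed first
```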
Supported metadata size
Pinecone supports 40KB of metadata per vector.
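To catch oversized payloads before sending them, you could estimate the payload size locally. The sketch below assumes that "40KB" means 40,960 bytes and that size is measured on the compact UTF-8 JSON serialization; both are assumptions, and the service's own accounting may differ. The function names are hypothetical.

```python
import json

# Assumption: "40KB" is read as 40 * 1024 bytes.
MAX_METADATA_BYTES = 40 * 1024


def metadata_size_bytes(metadata):
    """Approximate the payload size as compact UTF-8 encoded JSON."""
    encoded = json.dumps(metadata, separators=(",", ":"), ensure_ascii=False)
    return len(encoded.encode("utf-8"))


def fits_metadata_limit(metadata, limit=MAX_METADATA_BYTES):
    """Return True when the estimated size is within the limit."""
    return metadata_size_bytes(metadata) <= limit
```

Treat the result as an estimate and leave some headroom rather than relying on it at the exact boundary.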
Metadata query language
Pinecone’s filtering query language is based on MongoDB’s query and projection operators. Pinecone currently supports a subset of those selectors:
Filter | Description | Supported types |
---|---|---|
$eq | Matches vectors with metadata values that are equal to a specified value. | Number, string, boolean |
$ne | Matches vectors with metadata values that are not equal to a specified value. | Number, string, boolean |
$gt | Matches vectors with metadata values that are greater than a specified value. | Number |
$gte | Matches vectors with metadata values that are greater than or equal to a specified value. | Number |
$lt | Matches vectors with metadata values that are less than a specified value. | Number |
$lte | Matches vectors with metadata values that are less than or equal to a specified value. | Number |
$in | Matches vectors with metadata values that are in a specified array. | String, number |
$nin | Matches vectors with metadata values that are not in a specified array. | String, number |
$exists | Matches vectors with the specified metadata field. | Boolean |
For example, the following vector has a "genre" metadata field with a list of strings:
{ "genre": ["comedy", "documentary"] }
This means "genre" takes on both values, and queries with the following filters will match the vector:
{"genre":"comedy"}
{"genre": {"$in":["documentary","action"]}}
{"$and": [{"genre": "comedy"}, {"genre":"documentary"}]}
However, queries with the following filter will not match the vector:
{ "$and": [{ "genre": "comedy" }, { "genre": "drama" }] }
Additionally, queries with the following filters will not match the vector because they are invalid. They will result in a query compilation error:
# INVALID QUERY:
{"genre": ["comedy", "documentary"]}
# INVALID QUERY:
{"genre": {"$eq": ["comedy", "documentary"]}}
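The matching rules above can be sketched as a small local evaluator. This is an illustrative model written for this page, not Pinecone's implementation: `matches` is a hypothetical function, it treats multiple top-level keys as an implicit AND (an assumption), and it deliberately leaves out `$ne`, `$nin`, and `$exists` because their behavior for missing fields is not spelled out here.

```python
COMPARISONS = {
    "$gt": lambda a, b: a > b,
    "$gte": lambda a, b: a >= b,
    "$lt": lambda a, b: a < b,
    "$lte": lambda a, b: a <= b,
}


def _is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def _values(stored):
    """A list-of-strings field takes on every value in the list."""
    return stored if isinstance(stored, list) else [stored]


def _field_matches(stored, condition):
    # A bare value is shorthand for {"$eq": value}.
    if not isinstance(condition, dict):
        condition = {"$eq": condition}
    for op, operand in condition.items():
        if op == "$eq":
            if isinstance(operand, list):
                raise ValueError("a list is not a valid operand for equality")
            ok = any(value == operand for value in _values(stored))
        elif op == "$in":
            if not isinstance(operand, list):
                raise ValueError("$in expects a list")
            ok = any(value in operand for value in _values(stored))
        elif op in COMPARISONS:
            ok = _is_number(stored) and COMPARISONS[op](stored, operand)
        else:
            raise ValueError(f"operator not covered by this sketch: {op}")
        if not ok:
            return False
    return True


def matches(metadata, flt):
    """Return True when the metadata payload satisfies the filter."""
    for key, condition in flt.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in condition)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in condition)
        else:
            ok = key in metadata and _field_matches(metadata[key], condition)
        if not ok:
            return False
    return True


vector_metadata = {"genre": ["comedy", "documentary"]}
print(matches(vector_metadata, {"genre": "comedy"}))  # True
print(matches(vector_metadata, {"$and": [{"genre": "comedy"}, {"genre": "drama"}]}))  # False
```

Passing either of the invalid filters shown above raises `ValueError` in this sketch, mirroring the query compilation error.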
Query an index with metadata filters
Metadata filter expressions can be included with queries to limit the search to only vectors matching the filter expression.
When querying with top_k over 1000, avoid returning vector data (include_values=True) or metadata (include_metadata=True).
Use the filter parameter to specify the metadata filter expression. For example, to search for a movie in the “documentary” genre:
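A minimal sketch of such a query with the Pinecone Python client might look like the following. It assumes `index` is an index handle obtained from the client (for example `Pinecone(api_key=...).Index("example-index")`, where the index name is hypothetical) and that the query vector has the same dimension as the index. The wrapper function `search_documentaries` is illustrative only.

```python
def search_documentaries(index, query_vector, top_k=1):
    """Return the nearest neighbors whose genre is "documentary"."""
    return index.query(
        vector=query_vector,
        # Only vectors whose metadata matches this expression are considered.
        filter={"genre": {"$eq": "documentary"}},
        top_k=top_k,
        # Include each match's metadata in the response.
        include_metadata=True,
    )
```

Passing the index in as an argument keeps the filter logic separate from connection setup, so the same function can be exercised against a stub in tests.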
Additional filter examples
Filter | Example | Description |
---|---|---|
$eq | {"genre": {"$eq": "documentary"}} | Matches vectors with the genre “documentary”. |
$ne | {"genre": {"$ne": "drama"}} | Matches vectors with a genre other than “drama”. |
$gt | {"year": {"$gt": 2019}} | Matches vectors with a year greater than 2019. |
$gte | {"year": {"$gte": 2020}} | Matches vectors with a year greater than or equal to 2020. |
$lt | {"year": {"$lt": 2020}} | Matches vectors with a year less than 2020. |
$lte | {"year": {"$lte": 2020}} | Matches vectors with a year less than or equal to 2020. |
$in | {"genre": {"$in": ["comedy", "documentary"]}} | Matches vectors with the genre “comedy” or “documentary”. |
$nin | {"genre": {"$nin": ["comedy", "documentary"]}} | Matches vectors with a genre other than “comedy” or “documentary”. |
$exists | {"genre": {"$exists": true}} | Matches vectors with the “genre” field. |
Combine filters
The metadata filters can be combined by using the $and and $or operators:
Operator | Example | Description |
---|---|---|
$and | {"$and": [{"genre": {"$eq": "drama"}}, {"year": {"$gte": 2020}}]} | Matches vectors with the genre “drama” and a year greater than or equal to 2020. |
$or | {"$or": [{"genre": {"$eq": "drama"}}, {"year": {"$gte": 2020}}]} | Matches vectors with the genre “drama” or a year greater than or equal to 2020. |
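When filter criteria are optional, it can help to assemble the combined expression programmatically. The following sketch uses a hypothetical helper, `build_filter`, with the example "genre" and "year" fields from this page; it only builds the filter dictionary and makes no API call.

```python
def build_filter(genre=None, min_year=None, max_year=None):
    """Combine the criteria that were provided into one filter expression."""
    clauses = []
    if genre is not None:
        clauses.append({"genre": {"$eq": genre}})
    if min_year is not None:
        clauses.append({"year": {"$gte": min_year}})
    if max_year is not None:
        clauses.append({"year": {"$lte": max_year}})

    if not clauses:
        return None  # no criteria: the caller should omit the filter
    if len(clauses) == 1:
        return clauses[0]  # a single clause needs no $and wrapper
    return {"$and": clauses}


print(build_filter(genre="drama", min_year=2020))
# {'$and': [{'genre': {'$eq': 'drama'}}, {'year': {'$gte': 2020}}]}
```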
Insert metadata into an index
Metadata can be included in upsert requests as you insert vectors. For example, the following code inserts vectors with metadata into an index:
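A minimal sketch of such an upsert with the Pinecone Python client is shown below. It assumes `index` is an index handle obtained from the client and that the index has dimension 4; the IDs, values, and the wrapper function `upsert_examples` are hypothetical.

```python
# Each record carries an ID, its vector values, and a metadata payload.
EXAMPLE_VECTORS = [
    {
        "id": "A",
        "values": [0.1, 0.1, 0.1, 0.1],
        "metadata": {"genre": "comedy", "year": 2020},
    },
    {
        "id": "B",
        "values": [0.2, 0.2, 0.2, 0.2],
        "metadata": {"genre": "documentary", "year": 2019},
    },
    {
        "id": "C",
        "values": [0.3, 0.3, 0.3, 0.3],
        # A list of strings lets one field take on several values.
        "metadata": {"genre": ["comedy", "documentary"]},
    },
]


def upsert_examples(index, vectors=EXAMPLE_VECTORS):
    """Insert (or overwrite) the given records, including their metadata."""
    return index.upsert(vectors=vectors)
```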
Delete vectors by metadata filter
Serverless indexes do not support deleting by metadata. You can delete records by ID prefix instead.
In pod-based indexes, you can select vectors for deletion by metadata: pass a metadata filter expression to the delete operation. This deletes all vectors matching the metadata filter expression.
For example, to delete all vectors with genre “documentary” and year 2019 from an index, use the following code:
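A minimal sketch of that delete with the Pinecone Python client follows. It assumes `index` is a handle to a pod-based index obtained from the client; the wrapper function `delete_2019_documentaries` is hypothetical, and the filter uses the explicit `$and` form described earlier on this page.

```python
def delete_2019_documentaries(index):
    """Delete every vector with genre "documentary" and year 2019."""
    return index.delete(
        filter={
            "$and": [
                {"genre": {"$eq": "documentary"}},
                {"year": {"$eq": 2019}},
            ]
        }
    )
```

Because the delete applies to every matching vector, it is worth running the same filter in a query first to confirm what it selects.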
Manage high cardinality in pod-based indexes
For pod-based indexes, Pinecone indexes all metadata by default. When metadata contains many unique values, pod-based indexes will consume significantly more memory, which can lead to performance issues, pod fullness, and a reduction in the number of possible vectors that fit per pod.
To avoid indexing high-cardinality metadata that is not needed for filtering, use selective metadata indexing. It lets you specify which fields are indexed and which are not, which reduces the overall cardinality of the metadata index while keeping the necessary fields available for filtering.
Since high-cardinality metadata does not cause high memory utilization in serverless indexes, selective metadata indexing is not supported for serverless indexes.
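For pod-based indexes, the selective indexing setting is typically expressed as a `metadata_config` object with an `indexed` list supplied when the index is created. The sketch below shows only that assumed shape, with hypothetical field names, plus a hypothetical helper that reports which payload fields would be stored but not filterable; verify the exact parameter against your client version.

```python
# Assumed shape of the selective metadata indexing setting.
# Only the listed fields would be indexed for filtering.
metadata_config = {"indexed": ["genre", "year"]}


def unindexed_fields(metadata, config):
    """List payload keys that would be stored but not usable in filters."""
    indexed = set(config["indexed"])
    return sorted(key for key in metadata if key not in indexed)


payload = {"genre": "drama", "year": 2020, "session_id": "a81f"}
print(unindexed_fields(payload, metadata_config))  # ['session_id']
```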
Considerations for serverless indexes
For each serverless index, Pinecone clusters records that are likely to be queried together. When you query a serverless index with a metadata filter, Pinecone first uses internal metadata statistics to exclude clusters that do not have records matching the filter and then chooses the most relevant remaining clusters.
Note the following considerations:
- When filtering by numeric metadata that cannot be ordered in a meaningful way (e.g., IDs as opposed to dates or prices), the chosen clusters may not be accurate. This is because the metadata statistics for each cluster reflect the min and max metadata values in the cluster, and min and max are not helpful when there is no meaningful order.
  In such cases, it is best to store the metadata as strings instead of numbers. When filtering by string metadata, the chosen clusters will be more accurate, with a low false-positive rate, because the string metadata statistics for each cluster reflect the actual string values, compressed for space-efficiency.
- When you use a highly selective metadata filter (i.e., a filter that rejects the vast majority of records in the index), the chosen clusters may not contain enough matching records to satisfy the designated top_k.
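The first consideration can be illustrated with a toy model. The sketch below is not Pinecone's implementation: the cluster layout, field names, and functions are hypothetical, and real string statistics are compressed (with a low false-positive rate) rather than stored as exact sets. It shows why min and max statistics exclude few clusters when numeric IDs are scattered, while value-based statistics narrow the search.

```python
# Hypothetical clusters: numeric IDs are spread across all of them,
# so every cluster's [min, max] range covers almost any ID.
clusters = [
    {"name": "c1", "id_min": 3, "id_max": 950, "id_values": {"3", "417", "950"}},
    {"name": "c2", "id_min": 10, "id_max": 880, "id_values": {"10", "502", "880"}},
    {"name": "c3", "id_min": 1, "id_max": 999, "id_values": {"1", "640", "999"}},
]


def candidates_by_range(clusters, wanted):
    """Numeric statistics: keep clusters whose [min, max] contains the value."""
    return [c["name"] for c in clusters if c["id_min"] <= wanted <= c["id_max"]]


def candidates_by_values(clusters, wanted):
    """String statistics: keep clusters that actually hold the value."""
    return [c["name"] for c in clusters if str(wanted) in c["id_values"]]


print(candidates_by_range(clusters, 502))   # ['c1', 'c2', 'c3']
print(candidates_by_values(clusters, 502))  # ['c2']
```

In this model the numeric range check cannot rule out any cluster for ID 502, whereas storing the ID as a string leaves a single candidate.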
For more details about query execution, see Serverless architecture.