Deleting by metadata filter can be a very expensive operation for any database. By using a hierarchical naming convention for vector IDs, you can avoid it and delete by ID instead. This is more efficient: it reduces the load on compute resources, minimizes the effect on query latency, and maintains a more consistent user experience.
1. Upsert
- Generate a hierarchical naming convention for vector IDs.
  - One recommended pattern is `parentId-chunkId`, where `parentId` is the ID of the document and `chunkId` is an integer that starts at 0 and runs up to one less than the total number of chunks.
- While capturing embeddings and preparing upserts for Pinecone, capture the total number of chunks for each `parentId`.
- Append the `chunkCount` to the metadata field of the `parentId-0` vector, or append it to all chunks if desired. This should be an integer, and its cardinality will naturally be low.
- Upsert the vectors with the `parentId-chunkId` as the ID.
- Reverse lookups can be created where you find a chunk and want to find the parent document or sibling chunks.
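
The upsert steps above can be sketched in Python as follows. This is a minimal illustration, not an official implementation: the helper names `build_records` and `split_id` are hypothetical, and the record layout (`id`, `values`, `metadata`) is assumed to match what your Pinecone client accepts for upserts.

```python
def build_records(parent_id, embeddings, count_on_all_chunks=False):
    """Build upsert-ready records whose IDs follow parentId-chunkId.

    embeddings is a list with one vector per chunk, in chunk order.
    chunkCount is stored on chunk 0 only, unless count_on_all_chunks is True.
    """
    chunk_count = len(embeddings)
    records = []
    for chunk_id, values in enumerate(embeddings):
        record = {"id": f"{parent_id}-{chunk_id}", "values": values}
        if chunk_id == 0 or count_on_all_chunks:
            record["metadata"] = {"chunkCount": chunk_count}
        records.append(record)
    return records


def split_id(vector_id):
    """Reverse lookup: recover (parentId, chunkId) from a vector ID.

    Splits on the last hyphen, so a parentId that itself contains
    hyphens is still handled.
    """
    parent_id, _, chunk_id = vector_id.rpartition("-")
    return parent_id, int(chunk_id)


records = build_records("doc1", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
# With a Pinecone index object, these could then be sent with
# something like index.upsert(vectors=records).
```

Splitting on the last hyphen is a design choice that keeps the reverse lookup working even when document IDs contain hyphens of their own.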
2. Delete by ID (to avoid delete by metadata filter)
- Identify the `parentId`.
  - This could be an internal process to identify documents that have been modified or deleted.
  - Or, this could be an end-user initiated process to delete a document based on a query that finds a sibling chunk or `parentId`.
- Once the `parentId` is identified, use the `fetch` endpoint to retrieve the `chunkCount` from the metadata field by sending the `parentId-0` vector ID.
- Build a list of IDs using the pattern of `parentId` and `chunkCount`.
- Batch these together and send them to the `delete` endpoint using the IDs of the vectors.
- You may then upsert the new version of the document with the new vectors and metadata or, if it is a delete-only process, you are finished.
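
As a rough sketch of the delete flow described above, the Python below fetches `chunkCount` from the `parentId-0` record, rebuilds the full ID list, and deletes it in batches. The helper names are hypothetical. The code assumes an `index` object whose `fetch(ids=...)` returns a response with a `vectors` mapping of ID to record (each exposing `.metadata`) and whose `delete(ids=...)` accepts a list of IDs; the default batch size of 1000 is also an assumption, so check the current limits for your index.

```python
def chunk_ids(parent_id, chunk_count):
    """List every vector ID for a document: parentId-0 .. parentId-(chunkCount-1)."""
    return [f"{parent_id}-{i}" for i in range(chunk_count)]


def batches(ids, batch_size):
    """Yield successive slices of at most batch_size IDs."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def delete_document(index, parent_id, batch_size=1000):
    """Delete all chunks of one document by ID; return the number of IDs sent."""
    first_id = f"{parent_id}-0"
    # Assumed response shape: response.vectors maps ID -> record with .metadata
    response = index.fetch(ids=[first_id])
    record = response.vectors.get(first_id)
    if record is None:
        return 0  # nothing stored under this parentId
    # Numeric metadata may come back as a float, so cast to int
    chunk_count = int(record.metadata["chunkCount"])
    ids = chunk_ids(parent_id, chunk_count)
    for batch in batches(ids, batch_size):
        index.delete(ids=batch)
    return len(ids)
```

Because the ID list is derived from `chunkCount` alone, no metadata filter is evaluated at delete time.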
3. Updates
- Updates are intended to apply small changes to a record whether that means updating the vector, or more commonly, the metadata.
- In cases where you are chunking data, you will more likely need to delete and re-upsert using the steps above.
- If you are only performing very small changes to a small number of vectors, the update process is ideal.
- If you are updating a large number of vectors, you may want to consider batching and slowing down the updates to avoid rate limiting or affecting query latency and response times.
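
One way to batch and slow down a large number of updates is sketched below. The function `apply_in_batches` is a hypothetical helper, and the batch size and pause length are placeholder values to tune against your own rate limits and latency targets.

```python
import time


def apply_in_batches(items, apply_one, batch_size=100, pause_seconds=0.5,
                     sleep=time.sleep):
    """Apply apply_one to each item, pausing between batches.

    The pause is skipped after the final batch. The sleep argument is
    injectable so the pacing can be tested without real waiting.
    """
    for start in range(0, len(items), batch_size):
        for item in items[start:start + batch_size]:
            apply_one(item)
        if start + batch_size < len(items):
            sleep(pause_seconds)


# Hypothetical usage with a Pinecone index object, assuming each update
# is a dict holding an ID and the metadata fields to change:
#
#   apply_in_batches(
#       updates,
#       lambda u: index.update(id=u["id"], set_metadata=u["set_metadata"]),
#   )
```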