1. Upsert
- Generate a hierarchical naming convention for vector IDs.
  - One recommended pattern is `parentId-chunkId`, where `parentId` is the ID of the document and `chunkId` is an integer that runs from 0 to one less than the total number of chunks.
- While capturing embeddings and preparing upserts for Pinecone, capture the total number of chunks for each `parentId`.
- Append the `chunkCount` to the metadata field of the `parentId-0` vector, or to all chunks if desired. This should be an integer, and its cardinality will naturally be low.
- Upsert the vectors with the `parentId-chunkId` as the ID.
- The ID pattern also supports reverse lookups: once you find a chunk, you can derive its parent document and sibling chunks from the ID.
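
The upsert preparation above can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper names (`make_vector_id`, `split_vector_id`, `build_records`) and the `count_on_all_chunks` flag are hypothetical, and the record layout (`id`, `values`, `metadata`) is assumed to match what your Pinecone client accepts for upserts.

```python
from typing import Dict, List, Sequence, Tuple


def make_vector_id(parent_id: str, chunk_index: int) -> str:
    """Build an ID following the parentId-chunkId pattern."""
    return f"{parent_id}-{chunk_index}"


def split_vector_id(vector_id: str) -> Tuple[str, int]:
    """Reverse lookup: recover (parentId, chunkId) from a vector ID.

    Splits on the last hyphen, so a parentId may itself contain hyphens.
    """
    parent_id, _, chunk_index = vector_id.rpartition("-")
    return parent_id, int(chunk_index)


def build_records(
    parent_id: str,
    embeddings: Sequence[Sequence[float]],
    count_on_all_chunks: bool = False,
) -> List[Dict]:
    """Turn one document's chunk embeddings into upsert-ready records."""
    chunk_count = len(embeddings)
    records = []
    for chunk_index, values in enumerate(embeddings):
        record = {
            "id": make_vector_id(parent_id, chunk_index),
            "values": list(values),
        }
        # chunkCount goes on chunk 0 by default, or on every chunk if requested.
        if chunk_index == 0 or count_on_all_chunks:
            record["metadata"] = {"chunkCount": chunk_count}
        records.append(record)
    return records


records = build_records("doc1", [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
print([r["id"] for r in records])  # ['doc1-0', 'doc1-1', 'doc1-2']
print(split_vector_id("doc1-2"))   # ('doc1', 2)
# Assumption: with an existing index handle named `index`,
# the records could then be sent with index.upsert(vectors=records).
```

Storing `chunkCount` only on chunk 0 keeps metadata small, while putting it on every chunk means any sibling can answer the question without an extra fetch.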
2. Delete by ID (to avoid delete by metadata filter)
- Identify the `parentId`.
  - This could be an internal process that identifies documents that have been modified or deleted.
  - Or, this could be an end-user initiated process to delete a document based on a query that finds a sibling chunk or `parentId`.
- Once the `parentId` is identified, use the `fetch` endpoint to retrieve the `chunkCount` from the metadata field by sending the `parentId-0` vector ID.
- Build a list of IDs using the pattern of `parentId` and `chunkCount`.
- Batch these together and send them to the `delete` endpoint using the IDs of the vectors.
- You may then upsert the new version of the document with the new vectors and metadata. If it is a delete-only process, you are finished.
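
A rough sketch of the ID-building and batched delete steps follows. The function names and the default batch size of 1000 are assumptions for illustration; check the current per-request limit for your Pinecone plan. The `index` argument is assumed to be a Pinecone index handle exposing `delete(ids=...)`. Reading `chunkCount` from the fetched `parentId-0` record is left to the caller because the fetch response shape varies across client versions.

```python
from typing import Iterator, List


def sibling_ids(parent_id: str, chunk_count: int) -> List[str]:
    """List every chunk ID for a document: parentId-0 .. parentId-(chunkCount - 1)."""
    return [f"{parent_id}-{i}" for i in range(chunk_count)]


def batched(ids: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size IDs."""
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def delete_document(index, parent_id: str, chunk_count: int, batch_size: int = 1000) -> int:
    """Delete all chunks of one document by ID and return how many IDs were sent.

    chunk_count is the value stored in the metadata of the parentId-0 vector.
    """
    sent = 0
    for batch in batched(sibling_ids(parent_id, chunk_count), batch_size):
        index.delete(ids=batch)  # assumed client call: delete by explicit ID list
        sent += len(batch)
    return sent


print(sibling_ids("doc1", 3))  # ['doc1-0', 'doc1-1', 'doc1-2']
```

Because the ID list is derived from `parentId` and `chunkCount` alone, no metadata filter is needed to find the vectors to delete.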
3. Updates
- Updates are intended to apply small changes to a record, whether that means updating the vector or, more commonly, the metadata.
- In cases where you are chunking data, you will more likely need to delete and re-upsert using the steps above.
- If you are only performing very small changes to a small number of vectors, the update process is ideal.
- If you are updating a large number of vectors, you may want to consider batching and slowing down the updates to avoid rate limiting or affecting query latency and response times.
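
One way to batch and slow down a large run of metadata updates is sketched below. The function name, the batch size of 50, and the one-second pause are illustrative assumptions, not recommended values. The `index` argument is assumed to be a Pinecone index handle exposing `update(id=..., set_metadata=...)`.

```python
import time
from typing import Callable, Dict


def apply_metadata_updates(
    index,
    updates: Dict[str, dict],
    batch_size: int = 50,
    pause_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Apply one metadata update per vector ID, pausing between batches.

    updates maps vector ID -> metadata fields to set.
    Returns the number of updates sent.
    """
    applied = 0
    for vector_id, metadata in updates.items():
        # Pause after every full batch so queries are not starved.
        if applied and applied % batch_size == 0:
            sleep(pause_seconds)
        index.update(id=vector_id, set_metadata=metadata)  # assumed client call
        applied += 1
    return applied
```

The `sleep` parameter is injectable so the pacing can be tested or replaced with a backoff strategy without changing the loop.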