Record format
When you upsert raw text for Pinecone to convert to vectors automatically, each record consists of the following:
- ID: A unique string identifier for the record.
- Text: The raw text for Pinecone to convert to a dense vector for semantic search or a sparse vector for lexical search, depending on the embedding model integrated with the index. This field name must match the `embed.field_map` defined in the index.
- Metadata (optional): All additional fields are stored as record metadata. You can filter by metadata when searching or deleting records.
Upserting raw text is supported only for indexes with integrated embedding.
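As a rough illustration of the record format described above, the sketch below builds one record as a plain Python dict. The `_id` key and the `chunk_text` field name are assumptions here: the text field must be whatever your index's `embed.field_map` names, and `make_record` is a hypothetical helper, not part of any SDK.

```python
from typing import Any, Dict


def make_record(record_id: str, text: str, text_field: str = "chunk_text",
                **metadata: Any) -> Dict[str, Any]:
    """Build one record: an ID, a text field, and flat metadata fields."""
    if "_id" in metadata or text_field in metadata:
        raise ValueError("metadata must not reuse the ID or text field names")
    record: Dict[str, Any] = {"_id": record_id, text_field: text}
    # Every additional field is stored as record metadata.
    record.update(metadata)
    return record


record = make_record(
    "document1#chunk1",
    "First chunk of the document...",
    document_id="document1",
    chunk_number=1,
)
```

Keeping the text field name as a parameter makes it easier to stay in sync with the index configuration if the field map changes.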
Use structured IDs
Use a structured, human-readable format for record IDs, including ID prefixes that reflect the type of data you’re storing, for example:
- Document chunks: `document_id#chunk_number`
- User data: `user_id#data_type#item_id`
- Multi-tenant data: `tenant_id#document_id#chunk_id`
Common delimiter choices include:
- `document1#chunk1`: Using hash delimiter
- `document1_chunk1`: Using underscore delimiter
- `document1:chunk1`: Using colon delimiter
Structured IDs provide the following benefits:
- Efficiency: Applications can quickly identify which records they should operate on.
- Clarity: Developers can easily understand what they’re looking at when examining records.
- Flexibility: ID prefixes enable list operations for fetching and updating records.
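The ID patterns above can be sketched with a couple of small helpers. `make_id` and `chunk_prefix` are hypothetical names introduced for illustration; the only idea taken from the text is joining ID parts with a delimiter such as `#`.

```python
DELIMITER = "#"  # any delimiter works as long as it never appears inside a part


def make_id(*parts: str, delimiter: str = DELIMITER) -> str:
    """Join ID parts, e.g. ("document1", "chunk1") -> "document1#chunk1"."""
    for part in parts:
        if not part:
            raise ValueError("ID parts must be non-empty")
        if delimiter in part:
            raise ValueError(f"ID part {part!r} contains the delimiter {delimiter!r}")
    return delimiter.join(parts)


def chunk_prefix(document_id: str, delimiter: str = DELIMITER) -> str:
    """Prefix shared by every chunk ID of one document, e.g. "document1#"."""
    return make_id(document_id, delimiter=delimiter) + delimiter


chunk_id = make_id("document1", "chunk1")
tenant_chunk_id = make_id("tenant1", "document1", "chunk3")
prefix = chunk_prefix("document1")
```

Including the trailing delimiter in the prefix avoids accidental matches: `document1#` does not match IDs that belong to `document10`.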
Include metadata
Include metadata key-value pairs that support your application’s key operations, for example:
- Enable query-time filtering: Add fields for time ranges, categories, or other criteria so that filtered searches return more accurate and relevant results.
- Link related chunks: Use fields like `document_id` and `chunk_number` to keep track of related records and enable efficient chunk deletion and document updates.
- Link back to original data: Include `chunk_text` or `document_url` for traceability and user display.
Supported metadata value types:
- String
- Number (integer or floating point, converted to a 64-bit floating point)
- Boolean (true, false)
- List of strings
Pinecone supports up to 40 KB of metadata per record.
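A client-side check along these lines can catch unsupported metadata before an upsert. This is only a sketch: `check_metadata` is a hypothetical helper, and both the reading of 40 KB as 40 × 1024 bytes and the use of the JSON-encoded size as the measure are assumptions, since the text does not say how the limit is computed.

```python
import json
from typing import Any, Dict

# Assumption: 40 KB = 40 * 1024 bytes, measured on the JSON encoding.
MAX_METADATA_BYTES = 40 * 1024


def check_metadata(metadata: Dict[str, Any]) -> int:
    """Validate value types and approximate size; return the size in bytes."""
    for key, value in metadata.items():
        # bool is a subclass of int, so it is covered by the tuple below.
        if isinstance(value, (str, int, float)):
            continue
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        raise TypeError(f"unsupported metadata type for {key!r}: {type(value).__name__}")
    size = len(json.dumps(metadata, ensure_ascii=False).encode("utf-8"))
    if size > MAX_METADATA_BYTES:
        raise ValueError(f"metadata is {size} bytes, above the {MAX_METADATA_BYTES} byte limit")
    return size


size = check_metadata({
    "document_id": "document1",
    "chunk_number": 1,
    "is_public": True,
    "tags": ["guide", "setup"],
})
```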
Example
This example demonstrates how to manage document chunks in Pinecone using structured IDs and comprehensive metadata. It covers the complete lifecycle of chunked documents: upserting, searching, fetching, updating, and deleting chunks, and updating an entire document.
Upsert chunks
When upserting documents that have been split into chunks, combine structured IDs with comprehensive metadata.
Upserting raw text is supported only for indexes with integrated embedding.
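A minimal sketch of this step, assuming the Pinecone Python SDK's `index.upsert_records(namespace, records)` method and an index whose `embed.field_map` points at a field named `chunk_text`. The helper names, the namespace, the `chunk{n}` ID scheme, and the default batch size of 96 are illustrative assumptions; check the batch limit against the current documentation.

```python
from typing import Any, Dict, List


def build_chunk_records(document_id: str, chunks: List[str],
                        document_url: str) -> List[Dict[str, Any]]:
    """Give every chunk a structured ID plus metadata linking it to its document."""
    return [
        {
            "_id": f"{document_id}#chunk{number}",
            "chunk_text": text,          # assumed to match embed.field_map
            "document_id": document_id,  # lets you filter by document
            "chunk_number": number,      # preserves chunk ordering
            "document_url": document_url,
        }
        for number, text in enumerate(chunks, start=1)
    ]


def upsert_chunks(index: Any, namespace: str, records: List[Dict[str, Any]],
                  batch_size: int = 96) -> None:
    """Upsert records in batches (batch size is an assumption, not a documented value here)."""
    for start in range(0, len(records), batch_size):
        index.upsert_records(namespace, records[start:start + batch_size])


# Usage against a real index (requires the `pinecone` package and an API key):
#   from pinecone import Pinecone
#   index = Pinecone(api_key="YOUR_API_KEY").Index(host="INDEX_HOST")
#   records = build_chunk_records("document1", ["First...", "Second..."],
#                                 "https://example.com/document1")
#   upsert_chunks(index, "example-namespace", records)
```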
Search chunks
To search the chunks of a document, use a metadata filter expression that limits the search appropriately.
Searching with text is supported only for indexes with integrated embedding.
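One way to express such a filtered search, sketched under the assumption that the index object exposes the SDK's `index.search(namespace=..., query=...)` method and that records carry a `document_id` metadata field. The helper names and the namespace are hypothetical.

```python
from typing import Any, Dict


def document_search_query(text: str, document_id: str, top_k: int = 5) -> Dict[str, Any]:
    """Build a text query restricted to the chunks of one document."""
    return {
        "inputs": {"text": text},
        "top_k": top_k,
        "filter": {"document_id": {"$eq": document_id}},
    }


def search_document(index: Any, namespace: str, text: str,
                    document_id: str, top_k: int = 5) -> Any:
    """Run the filtered search and return the SDK response unchanged."""
    return index.search(
        namespace=namespace,
        query=document_search_query(text, document_id, top_k),
    )


query = document_search_query("installation steps", "document1", top_k=3)
# With a real index: search_document(index, "example-namespace", "installation steps", "document1")
```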
Fetch chunks
To retrieve all chunks for a specific document, first list the record IDs using the document prefix, and then fetch the complete records.
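The list-then-fetch pattern might look like the sketch below. It assumes the SDK's `index.list(prefix=..., namespace=...)` yields pages of ID strings and that `index.fetch(ids=..., namespace=...)` returns the matching records. The helper names and the fetch batch size of 100 are assumptions chosen for illustration.

```python
from typing import Any, List


def list_chunk_ids(index: Any, namespace: str, document_id: str) -> List[str]:
    """Collect every record ID that starts with the document prefix."""
    ids: List[str] = []
    for page in index.list(prefix=f"{document_id}#", namespace=namespace):
        ids.extend(page)  # each page is assumed to be a list of ID strings
    return ids


def fetch_document_chunks(index: Any, namespace: str, document_id: str,
                          batch_size: int = 100) -> List[Any]:
    """Fetch the listed records in batches; returns one SDK response per batch."""
    ids = list_chunk_ids(index, namespace, document_id)
    responses = []
    for start in range(0, len(ids), batch_size):
        responses.append(index.fetch(ids=ids[start:start + batch_size], namespace=namespace))
    return responses
```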
Pinecone is eventually consistent, so it’s possible that a write (upsert, update, or delete) followed immediately by a read (query, list, or fetch) may not return the latest version of the data. If your use case requires retrieving data immediately, consider implementing a small delay or retry logic after writes.
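The retry logic mentioned above can be sketched generically. `read_with_retry` is a hypothetical helper: it takes any read callable and a freshness check supplied by the caller, so it makes no assumption about the shape of Pinecone responses.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_retry(read: Callable[[], T], is_fresh: Callable[[T], bool],
                    attempts: int = 5, delay_seconds: float = 0.5) -> T:
    """Repeat a read until is_fresh accepts the result or attempts run out.

    Returns the last result either way, so the caller should still check it.
    """
    result = read()
    for _ in range(attempts - 1):
        if is_fresh(result):
            return result
        time.sleep(delay_seconds)
        result = read()
    return result


# Example with a stand-in read that only "sees" the write on the third call.
_responses = iter([[], [], ["document1#chunk1"]])
ids = read_with_retry(lambda: next(_responses), is_fresh=bool, delay_seconds=0)
```

A fixed delay keeps the sketch short; exponential backoff is a common alternative when writes may take longer to become visible.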
Update chunks
To update specific chunks within a document, first list the chunk IDs, and then update individual records.
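A sketch of the list-then-update flow, assuming the SDK's `index.list(prefix=..., namespace=...)` and `index.update(id=..., set_metadata=..., namespace=...)` methods. The helper name and the metadata field used in the usage comment are hypothetical.

```python
from typing import Any, Dict, List


def update_chunk_metadata(index: Any, namespace: str, document_id: str,
                          metadata: Dict[str, Any]) -> List[str]:
    """Set the given metadata on every chunk of a document; return the updated IDs."""
    updated: List[str] = []
    for page in index.list(prefix=f"{document_id}#", namespace=namespace):
        for record_id in page:
            # Each record is updated individually by ID.
            index.update(id=record_id, set_metadata=metadata, namespace=namespace)
            updated.append(record_id)
    return updated


# With a real index:
#   update_chunk_metadata(index, "example-namespace", "document1", {"reviewed": True})
```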
Delete chunks
To delete chunks of a document, use a metadata filter expression that limits the deletion appropriately.
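As an illustration, the sketch below builds a filter for either all chunks of a document or only the chunks from a given number onward, then passes it to the SDK's `index.delete(filter=..., namespace=...)`. It assumes records carry `document_id` and `chunk_number` metadata fields; the helper names are hypothetical.

```python
from typing import Any, Dict, Optional


def chunk_filter(document_id: str, min_chunk_number: Optional[int] = None) -> Dict[str, Any]:
    """Filter matching a document's chunks, optionally only chunk_number >= a minimum."""
    by_document = {"document_id": {"$eq": document_id}}
    if min_chunk_number is None:
        return by_document
    return {"$and": [by_document, {"chunk_number": {"$gte": min_chunk_number}}]}


def delete_chunks(index: Any, namespace: str, document_id: str,
                  min_chunk_number: Optional[int] = None) -> None:
    """Delete the matching chunks in one filtered delete call."""
    index.delete(filter=chunk_filter(document_id, min_chunk_number), namespace=namespace)


all_chunks = chunk_filter("document1")
tail_chunks = chunk_filter("document1", min_chunk_number=3)
```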
Update an entire document
When the number of chunks or the ordering of chunks for a document changes, the recommended approach is to first delete all chunks using a metadata filter, and then upsert the new chunks.
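The delete-then-upsert sequence could be sketched as follows, assuming the SDK's `index.delete(filter=..., namespace=...)` and `index.upsert_records(namespace, records)` methods, a text field named `chunk_text`, and a `document_id` metadata field on every chunk. `replace_document` and the `chunk{n}` ID scheme are hypothetical.

```python
from typing import Any, List


def replace_document(index: Any, namespace: str, document_id: str,
                     new_chunks: List[str]) -> int:
    """Delete every existing chunk of a document, then upsert the new chunks."""
    # Step 1: remove all old chunks, regardless of how many there were.
    index.delete(filter={"document_id": {"$eq": document_id}}, namespace=namespace)
    # Step 2: upsert the new chunks with fresh structured IDs and metadata.
    records = [
        {
            "_id": f"{document_id}#chunk{number}",
            "chunk_text": text,
            "document_id": document_id,
            "chunk_number": number,
        }
        for number, text in enumerate(new_chunks, start=1)
    ]
    if records:
        index.upsert_records(namespace, records)
    return len(records)
```

Deleting first means stale chunks cannot linger when the new version has fewer chunks than the old one. Because reads are eventually consistent, a search issued between the two steps may briefly return no results for the document.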