Understanding Pinecone Inference
Pinecone Inference is a service that gives you access to embedding and reranking models hosted on Pinecone’s infrastructure.
Pinecone currently hosts models in the US only.
Workflows
You can use Pinecone Inference as a standalone service or integrated with Pinecone’s database operations.
Standalone inference
When you use Pinecone Inference as a standalone service, you generate embeddings and rerank results as steps separate from other database operations like upsert and query. The workflow (sketched in code after this list) is:
1. Embed data
2. Create an index
3. Upsert embeddings
4. Embed queries
5. Search the index
6. Rerank results
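Here is a minimal sketch of the standalone workflow using the Python SDK. The index name, cloud/region, and sample texts are placeholders, and the calls assume a recent SDK version:

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key

# 1. Embed data with a hosted embedding model
data = ["The Eiffel Tower is in Paris.", "The Colosseum is in Rome."]
doc_embeds = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=data,
    parameters={"input_type": "passage", "truncate": "END"},
)

# 2. Create an index that matches the model's dimension and metric
pc.create_index(
    name="example-index",  # placeholder name
    dimension=1024,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("example-index")

# 3. Upsert the embeddings
index.upsert(
    vectors=[
        {"id": str(i), "values": e["values"], "metadata": {"text": t}}
        for i, (e, t) in enumerate(zip(doc_embeds, data))
    ]
)

# 4. Embed the query, then 5. search the index
query = "Where is the Eiffel Tower?"
q_embed = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[query],
    parameters={"input_type": "query"},
)
results = index.query(vector=q_embed[0]["values"], top_k=3, include_metadata=True)

# 6. Rerank the retrieved documents
reranked = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query=query,
    documents=[m["metadata"]["text"] for m in results["matches"]],
    top_n=3,
)
```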
Integrated inference
This feature is in public preview.
When you use integrated inference, embedding and reranking happen as part of database operations, with no extra steps. The workflow (sketched in code after this list) is:
1. Create an index configured for a specific embedding model
2. Upsert data with integrated embedding
3. Search the index with integrated embedding and reranking
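A sketch of the integrated workflow with the Python SDK follows. The method names (`create_index_for_model`, `upsert_records`, `search`) and the `chunk_text` field reflect recent SDK releases and may differ by version; the index name, namespace, and region are placeholders:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

# 1. Create an index configured for a specific embedding model
pc.create_index_for_model(
    name="integrated-index",  # placeholder name
    cloud="aws",
    region="us-east-1",
    embed={
        "model": "multilingual-e5-large",
        "field_map": {"text": "chunk_text"},  # which record field to embed
    },
)
index = pc.Index("integrated-index")

# 2. Upsert raw text; Pinecone embeds it for you
index.upsert_records(
    "example-namespace",
    [
        {"_id": "rec1", "chunk_text": "The Eiffel Tower is in Paris."},
        {"_id": "rec2", "chunk_text": "The Colosseum is in Rome."},
    ],
)

# 3. Search with a text query; embedding and reranking happen server-side
results = index.search(
    namespace="example-namespace",
    query={"inputs": {"text": "Where is the Eiffel Tower?"}, "top_k": 3},
    rerank={"model": "bge-reranker-v2-m3", "top_n": 3, "rank_fields": ["chunk_text"]},
)
```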
Embedding models
The following embedding models are hosted by Pinecone and available for standalone or integrated inference:
multilingual-e5-large
`multilingual-e5-large` is a high-performance dense embedding model trained on a mixture of multilingual datasets. It works well on messy data and short queries expected to return medium-length passages of text (1-2 paragraphs).
Details
- Vector type: Dense
- Modality: Text
- Dimension: 1024
- Recommended similarity metric: Cosine
- Max input tokens per sequence: 507
- Max sequences per batch: 96
Parameters
The `multilingual-e5-large` model supports the following parameters:
| Parameter | Type | Required/Optional | Description | Default |
| --- | --- | --- | --- | --- |
| `input_type` | string | Required | The type of input data. Accepted values: `query` or `passage`. | |
| `truncate` | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: `END` or `NONE`. `END` truncates the input sequence at the input token limit; `NONE` returns an error when the input exceeds the input token limit. | `END` |
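For example, a standalone embed request that sets both parameters might look like this sketch (the API key and input are placeholders):

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

# input_type="query" embeds search queries; use "passage" for documents.
# truncate="NONE" makes the call fail loudly instead of silently cutting
# inputs at the 507-token limit.
embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=["Where is the Eiffel Tower?"],
    parameters={"input_type": "query", "truncate": "NONE"},
)
print(len(embeddings[0]["values"]))  # 1024
```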
Rate limits
Rate limits are defined at the project level and vary based on pricing plan and input type.
| Input type | Starter plan | Paid plans |
| --- | --- | --- |
| `passage` | 250k tokens per minute | 1M tokens per minute |
| `query` | 50k tokens per minute | 250k tokens per minute |
| Combined | 5M tokens per month | Unlimited tokens per month |
pinecone-sparse-english-v0
This feature is in public preview.
`pinecone-sparse-english-v0` is a sparse embedding model for converting text to sparse vectors for keyword or hybrid semantic/keyword search. Built on the innovations of the DeepImpact architecture, the model directly estimates the lexical importance of tokens by leveraging their context, unlike traditional retrieval models like BM25, which rely solely on term frequency.
Details
- Vector type: Sparse
- Modality: Text
- Recommended similarity metric: Dotproduct
- Max input tokens per sequence: 512
- Max sequences per batch: 96
Parameters
The `pinecone-sparse-english-v0` model supports the following parameters:
| Parameter | Type | Required/Optional | Description | Default |
| --- | --- | --- | --- | --- |
| `input_type` | string | Required | The type of input data. Accepted values: `query` or `passage`. | |
| `truncate` | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: `END` or `NONE`. `END` truncates the input sequence at the input token limit; `NONE` returns an error when the input exceeds the input token limit. | `END` |
| `return_tokens` | boolean | Optional | Whether to return the string tokens. | `False` |
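A sketch of a sparse embedding call that also returns the tokens (the API key and input are placeholders; the exact response fields may vary by SDK version):

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

sparse = pc.inference.embed(
    model="pinecone-sparse-english-v0",
    inputs=["The quick brown fox jumps over the lazy dog."],
    parameters={"input_type": "passage", "return_tokens": True},
)
# Each result holds sparse indices and values (and, with return_tokens=True,
# the string tokens) rather than a fixed-dimension dense vector.
print(sparse[0])
```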
Rate limits
Rate limits are defined at the project level and vary based on pricing plan.
| Limit type | Starter plan | Paid plans |
| --- | --- | --- |
| Tokens per minute | 250K | 1M |
| Tokens per month | 5M | Unlimited |
Reranking models
The following reranking models are hosted by Pinecone and available for standalone or integrated inference:
bge-reranker-v2-m3
`bge-reranker-v2-m3` is a high-performance, multilingual reranking model that works well on messy data and short queries expected to return medium-length passages of text (1-2 paragraphs).
Details
- Modality: Text
- Max tokens per query and document pair: 1024
- Max documents: 100
Parameters
The `bge-reranker-v2-m3` model supports the following parameters:
| Parameter | Type | Required/Optional | Description | Default |
| --- | --- | --- | --- | --- |
| `truncate` | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: `END` or `NONE`. `END` truncates the input sequence at the input token limit; `NONE` returns an error when the input exceeds the input token limit. | `NONE` |
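A standalone rerank call with this model might look like the sketch below (placeholder key, query, and documents); note the `parameters` argument overriding the `NONE` default:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

result = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query="What is the capital of France?",
    documents=[
        "Paris is the capital and largest city of France.",
        "France is a country in Western Europe.",
    ],
    top_n=2,
    return_documents=True,
    parameters={"truncate": "END"},  # truncate instead of erroring on long pairs
)
for row in result.data:
    print(row.index, row.score)
```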
Rate limits
Rate limits are defined at the project level and vary based on pricing plan.
| Limit type | Starter plan | Paid plans |
| --- | --- | --- |
| Requests per minute | 60 | 60 |
| Requests per month | 500 | Unlimited |
To request a rate increase, contact Support.
pinecone-rerank-v0
This feature is in public preview.
`pinecone-rerank-v0` is a state-of-the-art reranking model that outperforms competitors on widely accepted benchmarks. It can handle chunks up to 512 tokens (1-2 paragraphs).
Details
- Modality: Text
- Max tokens per query and document pair: 512
- Max documents: 100
Parameters
The `pinecone-rerank-v0` model supports the following parameters:
| Parameter | Type | Required/Optional | Description | Default |
| --- | --- | --- | --- | --- |
| `truncate` | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: `END` or `NONE`. `END` truncates the input sequence at the input token limit; `NONE` returns an error when the input exceeds the input token limit. | `END` |
Rate limits
Rate limits are defined at the project level and vary based on pricing plan.
| Limit type | Starter plan | Paid plans |
| --- | --- | --- |
| Requests per minute | 60 | 60 |
| Requests per month | 500 | Unlimited |
cohere-rerank-3.5
This feature is available only on Standard and Enterprise plans.
`cohere-rerank-3.5` is Cohere’s leading reranking model, balancing performance and latency for a wide range of enterprise search applications.
Details
- Modality: Text
- Max tokens per query and document pair: 40,000
- Max documents: 200
Parameters
The `cohere-rerank-3.5` model supports the following parameters:
| Parameter | Type | Required/Optional | Description |
| --- | --- | --- | --- |
| `max_chunks_per_doc` | integer | Optional | Long documents are automatically truncated to the specified number of chunks. Accepted range: 1–3072. |
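A sketch of a rerank call that caps chunking for long documents (placeholder key, query, and document; requires a Standard or Enterprise plan):

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

result = pc.inference.rerank(
    model="cohere-rerank-3.5",
    query="enterprise search relevance",
    documents=["A very long document that may be split into chunks..."],
    top_n=1,
    parameters={"max_chunks_per_doc": 5},  # score at most 5 chunks per document
)
print(result.data[0].score)
```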
Rate limits
Rate limits are defined at the project level and vary based on pricing plan.
| Limit type | Starter plan | Paid plans |
| --- | --- | --- |
| Requests per minute | N/A | 300 |
| Requests per month | N/A | Unlimited |
SDK support
Standalone inference operations (`embed` and `rerank`) are supported by all Pinecone SDKs.
Integrated inference operations (`create_for_model`, `records/upsert`, and `records/search`) are supported by the latest Python SDK plus the `pinecone-plugin-records` plugin. Install the latest SDK and the plugin as follows:
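```shell
# Install the latest Python SDK and the records plugin
# (package names inferred from the SDK and plugin names above)
pip install --upgrade pinecone pinecone-plugin-records
```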
The `pinecone-plugin-records` plugin is not currently compatible with the `pinecone[grpc]` version of the Python SDK.
Cost
Inference billing is based on tokens used. To learn more, see Understanding cost.