Understanding Pinecone Inference API
Pinecone Inference is an API that gives you access to embedding and reranking models hosted on Pinecone’s infrastructure.
Pinecone currently hosts models in the US only.
Overview
The API currently provides two endpoints: `embed`, which generates vector embeddings for text, and `rerank`, which scores documents by their relevance to a query. You can call both endpoints directly or through a supported Pinecone SDK.
Prerequisites
To use the Inference API, you need a Pinecone account and a Pinecone API key.
Models
Embed
The `embed` endpoint generates embeddings for text data, such as queries or passages, using a specified embedding model.
The following embedding models are available:
| Model | Dimension | Max input tokens | Max batch size | Parameters |
|---|---|---|---|---|
| `multilingual-e5-large` | 1024 | 507 | 96 | `input_type`: `"query"` or `"passage"`; `truncate`: `"END"` or `"NONE"` |
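For illustration, here is a minimal sketch of calling the endpoint through the Python SDK. It assumes a recent SDK version that exposes the `pc.inference` client and uses a placeholder API key:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key

# Embed a batch of passages with multilingual-e5-large.
embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=["The quick brown fox jumps over the lazy dog."],
    parameters={"input_type": "passage", "truncate": "END"},
)

print(len(embeddings[0].values))  # dimension of the first embedding: 1024
```

Use `input_type: "passage"` when embedding documents for storage and `input_type: "query"` when embedding search queries.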
Rerank
This feature is available only on Standard and Enterprise plans.
The `rerank` endpoint takes a query and a set of documents and scores each document by its relevance to the query.
Rerankers are used to increase retrieval quality in two-stage retrieval systems: a first-stage vector search retrieves a broad set of candidates, and the reranker reorders them before they are returned or passed downstream.
The following reranking models are available:
| Model | Max query tokens | Max query + doc tokens | Max documents | Parameters |
|---|---|---|---|---|
| `bge-reranker-v2-m3` | 256 | 1024 | 100 | `truncate`: `"END"` or `"NONE"` |
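As above, a minimal Python SDK sketch, assuming a recent SDK version with the `pc.inference` client and a placeholder API key:

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key

# Score two documents against the query; higher scores mean greater relevance.
result = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query="Tell me about the tech company known as Apple",
    documents=[
        "Apple is a popular fruit known for its sweetness and crisp texture.",
        "Apple Inc. is known for its innovative products like the iPhone.",
    ],
    top_n=2,
    return_documents=True,
)

for row in result.data:
    print(row.index, round(row.score, 4))
```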
SDK support
You can access the `embed` and `rerank` endpoints directly or using a supported Pinecone SDK:
| SDK | `embed` support | `rerank` support |
|---|---|---|
| Python | Yes | Yes |
| Node.js | Yes | No |
| Go | Yes | No |
To install the latest version of an SDK, run the relevant command:
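For example (package names here assume the current public releases of each SDK):

```shell
# Python
pip install pinecone-client

# Node.js
npm install @pinecone-database/pinecone

# Go
go get github.com/pinecone-io/go-pinecone/pinecone
```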
If you already have an SDK, upgrade to the latest version as follows:
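Again assuming the current package names:

```shell
# Python
pip install --upgrade pinecone-client

# Node.js
npm install @pinecone-database/pinecone@latest

# Go
go get -u github.com/pinecone-io/go-pinecone/pinecone
```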
Rate limits
Rate limits ensure fair usage of the Inference API. They are measured in requests per minute (RPM) and tokens per minute (TPM), and they vary by model and by pricing plan.
Rate limits are defined at the project level.
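Because limits apply to the whole project, concurrent workloads can hit them together. Below is a minimal client-side backoff sketch, assuming throttled requests fail with an exception that exposes an HTTP 429 status code (check your SDK version for the exact exception type):

```python
import random
import time

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key


def embed_with_backoff(inputs, max_retries=5):
    """Call the embed endpoint, retrying with exponential backoff when throttled."""
    for attempt in range(max_retries):
        try:
            return pc.inference.embed(
                model="multilingual-e5-large",
                inputs=inputs,
                parameters={"input_type": "passage", "truncate": "END"},
            )
        except Exception as exc:
            # Assumed: throttled requests surface a 429 status on the exception.
            if getattr(exc, "status", None) != 429 or attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())  # 1s, 2s, 4s, ... plus jitter
```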
Starter plan
| Model | RPM | TPM | Max usage per month |
|---|---|---|---|
| `multilingual-e5-large` | 500 | 250K | 5M tokens |
Paid plan
To request a rate increase, contact Support.
| Model | RPM | TPM |
|---|---|---|
| `multilingual-e5-large` | 500 | 1M |
| `bge-reranker-v2-m3` | 60 | – |
Cost
Inference billing is based on tokens used. To learn more, see Understanding cost.