Understanding Pinecone Inference API
Overview
Pinecone Inference is an API that gives you access to models hosted on Pinecone’s infrastructure.
This feature is in public preview.
Prerequisites
To use the Inference API, you need a Pinecone account and a Pinecone API key.
Models
Embed
The `embed` endpoint generates embeddings for text data, such as queries or passages, using a specified embedding model.
The following embedding models are available:
Model | Dimension | Max input tokens | Max batch size | Parameters |
---|---|---|---|---|
multilingual-e5-large | 1024 | 507 | 96 | `input_type`: "query" or "passage"; `truncation`: "END" or "NONE" |
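As a minimal sketch of how a request to the embed endpoint might be assembled, the helper below builds a JSON body using the model name and parameters from the table above. The exact payload shape (field names such as `inputs` and `parameters`) is an assumption for illustration; check it against the current API reference before use.

```python
import json

# Hypothetical request builder; the payload shape is assumed, not official.
def build_embed_request(texts, input_type="passage", truncation="END"):
    return {
        "model": "multilingual-e5-large",
        "parameters": {"input_type": input_type, "truncation": truncation},
        "inputs": [{"text": t} for t in texts],
    }

# Embedding a query rather than a passage: set input_type accordingly.
body = build_embed_request(["What is vector search?"], input_type="query")
print(json.dumps(body, indent=2))
```

Note that `input_type` matters for retrieval quality with multilingual-e5-large: embed stored documents as "passage" and search text as "query".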
Rerank
The `rerank` endpoint takes a query and a set of documents and scores each document by its relevance to the query.
Rerankers are used to increase retrieval quality as part of two-stage retrieval systems: a fast first stage retrieves candidates, and the reranker reorders them by relevance.
The following reranking models are available:
Model | Max query tokens | Max query + doc tokens | Max documents |
---|---|---|---|
bge-reranker-v2-m3 | 256 | 1024 | 100 |
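Analogously to the embed example, here is a sketch of a rerank request body within the limits from the table above. The field names (`query`, `documents`, `top_n`) are assumptions for illustration, not the confirmed wire format.

```python
# Hypothetical rerank request builder; field names are assumed.
def build_rerank_request(query, documents, top_n=None):
    body = {
        "model": "bge-reranker-v2-m3",
        "query": query,
        # Up to 100 documents per request, per the table above.
        "documents": [{"text": d} for d in documents],
    }
    if top_n is not None:
        body["top_n"] = top_n  # return only the top_n highest-scoring docs
    return body

req = build_rerank_request(
    "What is the capital of France?",
    ["Paris is the capital of France.", "Berlin is in Germany."],
    top_n=1,
)
```

In a two-stage setup, the `documents` list would typically be the top results of a first-stage vector search.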
Rate Limits
Rate limits ensure fair usage of the Inference API. They are measured in requests per minute (RPM) and tokens per minute (TPM), vary by model and by free or paid tier, and are defined at the project level.
Starter plan
Model | RPM | TPM | Starter tier usage |
---|---|---|---|
multilingual-e5-large | 500 | 250K | 5M tokens |
Paid Tiers
To request a rate increase, contact Support.
Model | RPM | TPM |
---|---|---|
multilingual-e5-large | 500 | 1M |
bge-reranker-v2-m3 | 60 | - |
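When a request exceeds these limits, a common client-side pattern is to retry with exponential backoff. The sketch below assumes the client surfaces rate limiting as an exception (here a stand-in `RateLimitError`); adapt it to however your SDK or HTTP client reports a 429.

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for an HTTP 429 (rate limited) response."""

def call_with_backoff(call, max_retries=5, base_delay=1.0):
    # Retry with exponential backoff plus a little jitter so that
    # concurrent clients don't all retry at the same instant.
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimitError:
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
    return call()  # final attempt; let any error propagate
```

Wrapping each embed or rerank call in `call_with_backoff` keeps a bursty workload within the per-minute limits without hand-tuning request pacing.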
Cost
Inference billing is based on tokens used. To learn more, see Understanding cost.