Pinecone Inference is an API that gives you access to embedding and reranking models hosted on Pinecone’s infrastructure.

Pinecone currently hosts models in the US only.

Prerequisites

To use the Inference API, you need a Pinecone account and a Pinecone API key.

Embedding models

The embed endpoint generates embeddings for text data, such as queries or passages, using a specified embedding model.

The following embedding models are available:

multilingual-e5-large

multilingual-e5-large is a high-performance text embedding model trained on a mixture of multilingual datasets. It works well on messy data and short queries expected to return medium-length passages of text (1-2 paragraphs).

Details

  • Dimension: 1024
  • Vector type: Dense
  • Recommended similarity metric: Cosine
  • Max input tokens per sequence: 507
  • Max sequences per batch: 96

Parameters

The multilingual-e5-large model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
|-----------|------|-------------------|-------------|---------|
| input_type | string | Required | The type of input data. Accepted values: query or passage. | |
| truncate | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: END or NONE. END truncates the input sequence at the input token limit; NONE returns an error when the input exceeds the input token limit. | END |
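
As a minimal sketch, assuming the Python SDK and a placeholder API key, embedding a batch of passages with these parameters might look like this:

```python
from pinecone import Pinecone

# Placeholder API key; use your own from the Pinecone console.
pc = Pinecone(api_key="YOUR_API_KEY")

# Embed a batch of passages. input_type is required;
# truncate defaults to "END" for this model.
embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[
        "The quick brown fox jumps over the lazy dog.",
        "Pack my box with five dozen liquor jugs.",
    ],
    parameters={"input_type": "passage", "truncate": "END"},
)

for e in embeddings:
    print(len(e["values"]))  # 1024: the model's dimension
```

When embedding search queries rather than stored passages, set input_type to query so both sides of retrieval are embedded consistently.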

Rate limits

Rate limits are defined at the project level and vary based on pricing plan and input type.

| Input type | Starter plan | Paid plans |
|------------|--------------|------------|
| passage | 250k tokens per minute | 1M tokens per minute |
| query | 50k tokens per minute | 250k tokens per minute |
| Combined | 5M tokens per month | Unlimited tokens per month |

Reranking models

The rerank endpoint takes documents and scores them by their relevance to a query. Rerankers are used to increase retrieval quality as part of two-stage retrieval systems.

The following reranking models are available:

bge-reranker-v2-m3

bge-reranker-v2-m3 is a high-performance, multilingual reranking model that works well on messy data and short queries expected to return medium-length passages of text (1-2 paragraphs).

Details

  • Max tokens per query and document pair: 1024
  • Max documents: 100

Parameters

The bge-reranker-v2-m3 model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
|-----------|------|-------------------|-------------|---------|
| truncate | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: END or NONE. END truncates the input sequence at the input token limit; NONE returns an error when the input exceeds the input token limit. | NONE |
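
A minimal sketch of a rerank call, again assuming the Python SDK; the query and documents are placeholder data:

```python
from pinecone import Pinecone

# Placeholder API key; use your own from the Pinecone console.
pc = Pinecone(api_key="YOUR_API_KEY")

# Score candidate documents against a query. truncate defaults
# to "NONE" for this model, so over-long query/document pairs
# return an error unless you pass parameters={"truncate": "END"}.
result = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query="What is the capital of France?",
    documents=[
        "Paris is the capital and most populous city of France.",
        "The Eiffel Tower is a landmark in Paris.",
        "Berlin is the capital of Germany.",
    ],
    top_n=2,
    return_documents=True,
)

for row in result.data:
    print(row.index, row.score)
```

Setting return_documents=True includes the document text alongside each score, which is convenient when passing results straight to a downstream step.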

Rate limits

Rate limits are defined at the project level and vary based on pricing plan.

| Limit type | Starter plan | Paid plans |
|------------|--------------|------------|
| Requests per minute | 60 | 60 |
| Requests per month | 500 | Unlimited |

To request a rate increase, contact Support.

pinecone-rerank-v0

This feature is in early access and is not intended for production use.

pinecone-rerank-v0 is a state-of-the-art reranking model that outperforms competitors on widely accepted benchmarks. It can handle chunks up to 512 tokens (1-2 paragraphs).

Details

  • Max tokens per query and document pair: 512
  • Max documents: 100

Parameters

The pinecone-rerank-v0 model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
|-----------|------|-------------------|-------------|---------|
| truncate | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: END or NONE. END truncates the input sequence at the input token limit; NONE returns an error when the input exceeds the input token limit. | END |

Rate limits

Rate limits are defined at the project level and vary based on pricing plan.

| Limit type | Starter plan | Paid plans |
|------------|--------------|------------|
| Requests per minute | 60 | 60 |
| Requests per month | 500 | Unlimited |

SDK support

You can access the embed and rerank endpoints directly or using a supported Pinecone SDK:

| SDK | embed support | rerank support |
|---------|-----|-----|
| Python | Yes | Yes |
| Node.js | Yes | Yes |
| Java | Yes | Yes |
| Go | Yes | Yes |
| .NET | Yes | Yes |

To install the latest SDK version, run the following command:
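
For example, for the Python SDK (assuming the current PyPI package name, pinecone):

```shell
# Install the Pinecone Python SDK (formerly published as pinecone-client)
pip install pinecone
```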

If you already have an SDK, upgrade to the latest version as follows:
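
Again using the Python SDK as an example:

```shell
# Upgrade an existing installation to the latest version
pip install --upgrade pinecone
```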

Cost

Inference billing is based on tokens used. To learn more, see Understanding cost.