Pinecone Inference is an API that gives you access to embedding and reranking models hosted on Pinecone’s infrastructure.

Pinecone currently hosts models in the US only.

Prerequisites

To use the Inference API, you need a Pinecone account and a Pinecone API key.

Models

Embed

The embed endpoint generates embeddings for text data, such as queries or passages, using a specified embedding model.

The following embedding models are available:

| Model | Dimension | Max input tokens | Max batch size | Parameters |
|---|---|---|---|---|
| multilingual-e5-large | 1024 | 507 | 96 | input_type: "query" or "passage"; truncate: "END" or "NONE" |
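
For reference, here is a minimal sketch of an embed call using the Python SDK. The model name and parameters come from the table above; `YOUR_API_KEY` is a placeholder, and the exact response type may vary by SDK version.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

# Embed a batch of passages with multilingual-e5-large.
# input_type distinguishes queries from passages; truncate="END"
# cuts off inputs that exceed the 507-token limit.
embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[
        "Apple is a popular fruit known for its sweetness and crisp texture.",
        "The Eiffel Tower was completed in 1889.",
    ],
    parameters={"input_type": "passage", "truncate": "END"},
)

print(embeddings)  # one 1024-dimensional vector per input
```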

Rerank

This feature is available only on Standard and Enterprise plans.

The rerank endpoint takes documents and scores them by their relevance to a query. Rerankers are used to increase retrieval quality as part of two-stage retrieval systems.

The following reranking models are available:

| Model | Max query tokens | Max query + doc tokens | Max documents | Parameters |
|---|---|---|---|---|
| bge-reranker-v2-m3 | 256 | 1024 | 100 | truncate: "END" or "NONE" |
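
As with embedding, you can call the rerank endpoint through an SDK. Here is a minimal sketch using the Python SDK; the model name and `truncate` parameter come from the table above, and `top_n` (how many scored documents to return) is assumed from the SDK's rerank interface.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

# Score three candidate documents against a query and keep the top two.
result = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query="What is the capital of France?",
    documents=[
        "Paris is the capital and largest city of France.",
        "The Eiffel Tower is a famous landmark in Paris.",
        "Berlin is the capital of Germany.",
    ],
    top_n=2,
    return_documents=True,
    parameters={"truncate": "END"},
)

print(result)  # documents with relevance scores, highest first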

SDK support

You can access the embed and rerank endpoints directly or through a supported Pinecone SDK:

| SDK | embed support | rerank support |
|---|---|---|
| Python | Yes | Yes |
| Node.js | Yes | No |
| Go | Yes | No |

To install the latest SDK version, run the following command:
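
For example, for the Python SDK (assuming the package is published as `pinecone`; older releases used the name `pinecone-client`):

```shell
pip install pinecone
```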

If you already have an SDK, upgrade to the latest version as follows:
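
Again using the Python SDK as the example:

```shell
pip install --upgrade pinecone
```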

Rate limits

Rate limits are in place to ensure fair usage of the Inference API. Rate limits are measured in requests per minute (RPM) and tokens per minute (TPM) and vary based on the model you use and the pricing plan you are on.

Rate limits are defined at the project level.

Starter plan

| Model | RPM | TPM | Max usage per month |
|---|---|---|---|
| multilingual-e5-large | 500 | 250K | 5M tokens |

To request a rate increase, contact Support.

Standard and Enterprise plans

| Model | RPM | TPM |
|---|---|---|
| multilingual-e5-large | 500 | 1M |
| bge-reranker-v2-m3 | 60 | - |

Cost

Inference billing is based on tokens used. To learn more, see Understanding cost.