Overview

Pinecone Inference is an API that gives you access to embedding and reranking models hosted on Pinecone’s infrastructure.

This feature is in public preview.

Prerequisites

To use the Inference API, you need a Pinecone account and a Pinecone API key.

Models

Embed

The embed endpoint generates embeddings for text data, such as queries or passages, using a specified embedding model. The following embedding models are available:

Model                   Dimension   Max input tokens   Max batch size   Parameters
multilingual-e5-large   1024        507                96               input_type: "query" or "passage"; truncate: "END" or "NONE"
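For example, the following sketch embeds a small batch of passages with the Pinecone Python SDK; the API key placeholder and sample texts are illustrative:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")

    # Embed a batch of passages for indexing. For multilingual-e5-large,
    # a batch holds at most 96 inputs of up to 507 tokens each.
    passages = [
        "Apple is a popular fruit known for its sweetness.",
        "The tech company Apple designs consumer electronics.",
    ]
    embeddings = pc.inference.embed(
        model="multilingual-e5-large",
        inputs=passages,
        parameters={"input_type": "passage", "truncate": "END"},
    )

    for e in embeddings:
        print(len(e["values"]))  # 1024-dimensional vectors

Use input_type: "query" when embedding search queries; e5-family models embed queries and passages differently, so matching the input type to the data improves retrieval quality.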

Rerank

The rerank endpoint takes a query and a set of documents and scores each document by its relevance to the query. Rerankers are used to increase retrieval quality in two-stage retrieval systems.

The following reranking models are available:

Model                Max query tokens   Max query + doc tokens   Max documents
bge-reranker-v2-m3   256                1024                     100
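As a sketch, the following reranks three hypothetical documents against a query with the Python SDK; top_n limits how many results come back:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")

    result = pc.inference.rerank(
        model="bge-reranker-v2-m3",
        query="What is the capital of France?",
        documents=[
            "Paris is the capital and largest city of France.",
            "The Eiffel Tower is a landmark in Paris.",
            "Berlin is the capital of Germany.",
        ],
        top_n=2,
        return_documents=True,
    )

    # Rows come back ordered by descending relevance score.
    for row in result.data:
        print(row.index, round(row.score, 4))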

Rate limits

We have rate limits in place to ensure fair usage of the Inference API. Rate limits are measured in requests per minute (RPM) and tokens per minute (TPM). The rate limits vary based on the model you use and whether you are on the free or paid tier. Rate limits are defined at the project level.

Starter plan

Model                   RPM   TPM    Starter tier usage
multilingual-e5-large   500   250K   5M tokens

Paid plans

Model                   RPM   TPM
multilingual-e5-large   500   1M
bge-reranker-v2-m3      60    -

To request a rate increase, contact Support.
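When a request exceeds a limit, it fails rather than queueing. A common client-side pattern is exponential backoff; the sketch below assumes rate-limited requests surface as an exception carrying an HTTP 429 status code, which is an assumption about the SDK's error reporting rather than documented behavior:

    import time

    def embed_with_backoff(pc, inputs, retries=5):
        # Retry on rate-limit errors (assumed here to carry status 429),
        # waiting 1s, 2s, 4s, ... between attempts.
        for attempt in range(retries):
            try:
                return pc.inference.embed(
                    model="multilingual-e5-large",
                    inputs=inputs,
                    parameters={"input_type": "passage", "truncate": "END"},
                )
            except Exception as exc:
                if getattr(exc, "status", None) != 429 or attempt == retries - 1:
                    raise
                time.sleep(2 ** attempt)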

Cost

Inference billing is based on tokens used. To learn more, see Understanding cost.
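Embedding responses report how many tokens a request consumed, which is useful for cost tracking. The usage field in the sketch below reflects the Python SDK's response object as we understand it and should be verified against the API reference:

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")

    res = pc.inference.embed(
        model="multilingual-e5-large",
        inputs=["How is Inference usage billed?"],
        parameters={"input_type": "query", "truncate": "END"},
    )
    # The usage field reports tokens consumed by this request,
    # e.g. {'total_tokens': 8}.
    print(res.usage)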