Pinecone Inference is a service that gives you access to embedding and reranking models hosted on Pinecone’s infrastructure.

Pinecone currently hosts models in the US only.

Workflows

You can use Pinecone Inference as a standalone service or integrated with Pinecone’s database operations.

Standalone inference

When you use Pinecone Inference as a standalone service, you generate embeddings and rerank results as steps separate from other database operations such as upsert and query. The workflow has six steps, sketched in code after the list:

1. Embed data
2. Create an index
3. Upsert embeddings
4. Embed queries
5. Search the index
6. Rerank results
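The following is a minimal sketch of this flow with the Python SDK. The index name, region, and sample data are illustrative; the models are the hosted models described later on this page.

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# 1. Embed data with a hosted model.
data = ["Apple is a popular fruit.", "The Eiffel Tower is in Paris."]
embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=data,
    parameters={"input_type": "passage", "truncate": "END"},
)

# 2. Create an index whose dimension matches the model output (1024).
pc.create_index(
    name="example-index",  # hypothetical index name
    dimension=1024,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("example-index")

# 3. Upsert the embeddings alongside the source text.
index.upsert(vectors=[
    {"id": str(i), "values": e["values"], "metadata": {"text": t}}
    for i, (t, e) in enumerate(zip(data, embeddings))
])

# 4. Embed the query (note the different input_type).
query = "Which landmark is in France?"
query_embedding = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[query],
    parameters={"input_type": "query"},
)

# 5. Search the index.
results = index.query(
    vector=query_embedding[0]["values"], top_k=3, include_metadata=True
)

# 6. Rerank the retrieved passages against the query.
reranked = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query=query,
    documents=[m["metadata"]["text"] for m in results["matches"]],
    top_n=3,
)
```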

Integrated inference

This feature is in public preview.

When you use integrated inference, embedding and reranking happen as part of the database operations themselves, so no separate inference steps are required. The workflow has three steps, sketched in code after the list:

1. Create an index configured for a specific embedding model
2. Upsert data with integrated embedding
3. Search the index with integrated embedding and reranking
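Because this feature is in public preview, method names may shift between SDK versions. The sketch below assumes the Python SDK with the pinecone-plugin-records plugin (see SDK support below); the index name, namespace, and field names are illustrative.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

# 1. Create an index configured for a specific embedding model.
pc.create_index_for_model(
    name="integrated-index",  # hypothetical index name
    cloud="aws",
    region="us-east-1",
    embed={
        "model": "multilingual-e5-large",
        "field_map": {"text": "chunk_text"},  # embed the "chunk_text" field of each record
    },
)
index = pc.Index("integrated-index")

# 2. Upsert raw text; Pinecone embeds it server-side.
index.upsert_records(
    "example-namespace",
    [
        {"_id": "rec1", "chunk_text": "Apple is a popular fruit."},
        {"_id": "rec2", "chunk_text": "The Eiffel Tower is in Paris."},
    ],
)

# 3. Search with text; the query is embedded, and results are reranked, in one call.
results = index.search_records(
    namespace="example-namespace",
    query={"inputs": {"text": "Which landmark is in France?"}, "top_k": 3},
    rerank={"model": "bge-reranker-v2-m3", "top_n": 3, "rank_fields": ["chunk_text"]},
)
```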

Embedding models

The following embedding models are hosted by Pinecone and available for standalone or integrated inference:

multilingual-e5-large

multilingual-e5-large is a high-performance dense embedding model trained on a mixture of multilingual datasets. It works well on messy data and short queries expected to return medium-length passages of text (1-2 paragraphs).

Details

  • Vector type: Dense
  • Modality: Text
  • Dimension: 1024
  • Recommended similarity metric: Cosine
  • Max input tokens per sequence: 507
  • Max sequences per batch: 96

Parameters

The multilingual-e5-large model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
| --- | --- | --- | --- | --- |
| input_type | string | Required | The type of input data. Accepted values: query or passage. | |
| truncate | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: END or NONE. END truncates the input sequence at the input token limit; NONE returns an error when the input exceeds the input token limit. | END |
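For example, a standalone embed request passes these parameters as a dictionary. This is a sketch; the input text is illustrative.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

# truncate="END" (the default) cuts inputs off at the 507-token limit
# instead of raising an error.
embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=["The quick brown fox jumps over the lazy dog."],
    parameters={"input_type": "passage", "truncate": "END"},
)
print(len(embeddings[0]["values"]))  # 1024
```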

Rate limits

Rate limits are defined at the project level and vary based on pricing plan and input type.

| Input type | Starter plan | Paid plans |
| --- | --- | --- |
| passage | 250k tokens per minute | 1M tokens per minute |
| query | 50k tokens per minute | 250k tokens per minute |
| Combined | 5M tokens per month | Unlimited |

pinecone-sparse-english-v0

This feature is in public preview.

pinecone-sparse-english-v0 is a sparse embedding model for converting text to sparse vectors for keyword or hybrid semantic/keyword search. Built on the innovations of the DeepImpact architecture, the model directly estimates the lexical importance of tokens by leveraging their context, unlike traditional retrieval models like BM25, which rely solely on term frequency.

Details

  • Vector type: Sparse
  • Modality: Text
  • Recommended similarity metric: Dotproduct
  • Max input tokens per sequence: 512
  • Max sequences per batch: 96

Parameters

The pinecone-sparse-english-v0 model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
| --- | --- | --- | --- | --- |
| input_type | string | Required | The type of input data. Accepted values: query or passage. | |
| truncate | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: END or NONE. END truncates the input sequence at the input token limit; NONE returns an error when the input exceeds the input token limit. | END |
| return_tokens | boolean | Optional | Whether to return the string tokens. | False |
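The following sketch shows a sparse embedding request with return_tokens enabled. The response field names shown in the comments are assumptions and may vary by SDK version.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

sparse = pc.inference.embed(
    model="pinecone-sparse-english-v0",
    inputs=["The quick brown fox jumps over the lazy dog."],
    parameters={"input_type": "passage", "return_tokens": True},
)

emb = sparse[0]
print(emb["sparse_indices"][:5])  # token ids with nonzero weight
print(emb["sparse_values"][:5])   # the learned importance weights
print(emb["sparse_tokens"][:5])   # string tokens; present because return_tokens=True
```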

Rate limits

Rate limits are defined at the project level and vary based on pricing plan.

| Limit type | Starter plan | Paid plans |
| --- | --- | --- |
| Tokens per minute | 250K | 1M |
| Tokens per month | 5M | Unlimited |

Reranking models

The following reranking models are hosted by Pinecone and available for standalone or integrated inference:

bge-reranker-v2-m3

bge-reranker-v2-m3 is a high-performance, multilingual reranking model that works well on messy data and short queries expected to return medium-length passages of text (1-2 paragraphs).

Details

  • Modality: Text
  • Max tokens per query and document pair: 1024
  • Max documents: 100

Parameters

The bge-reranker-v2-m3 model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
| --- | --- | --- | --- | --- |
| truncate | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: END or NONE. END truncates the input sequence at the input token limit; NONE returns an error when the input exceeds the input token limit. | NONE |
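A standalone rerank request looks like the following sketch; the query and documents are illustrative. Note that, unlike the embedding models above, this model defaults truncate to NONE.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

result = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query="What is the capital of France?",
    documents=[
        "Paris is the capital of France.",
        "Berlin is the capital of Germany.",
    ],
    top_n=2,
    return_documents=True,
    parameters={"truncate": "NONE"},  # error on over-length pairs rather than cutting them off
)
for row in result.data:
    print(row.index, row.score)
```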

Rate limits

Rate limits are defined at the project level and vary based on pricing plan.

| Limit type | Starter plan | Paid plans |
| --- | --- | --- |
| Requests per minute | 60 | 60 |
| Requests per month | 500 | Unlimited |

To request a rate increase, contact Support.

pinecone-rerank-v0

This feature is in public preview.

pinecone-rerank-v0 is a state-of-the-art reranking model that outperforms competitors on widely accepted benchmarks. It can handle chunks of up to 512 tokens (1-2 paragraphs).

Details

  • Modality: Text
  • Max tokens per query and document pair: 512
  • Max documents: 100

Parameters

The pinecone-rerank-v0 model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
| --- | --- | --- | --- | --- |
| truncate | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: END or NONE. END truncates the input sequence at the input token limit; NONE returns an error when the input exceeds the input token limit. | END |

Rate limits

Rate limits are defined at the project level and vary based on pricing plan.

| Limit type | Starter plan | Paid plans |
| --- | --- | --- |
| Requests per minute | 60 | 60 |
| Requests per month | 500 | Unlimited |

cohere-rerank-3.5

This feature is available only on Standard and Enterprise plans.

cohere-rerank-3.5 is Cohere’s leading reranking model, balancing performance and latency for a wide range of enterprise search applications.

Details

  • Modality: Text
  • Max tokens per query and document pair: 40,000
  • Max documents: 200

Parameters

The cohere-rerank-3.5 model supports the following parameters:

| Parameter | Type | Required/Optional | Description |
| --- | --- | --- | --- |
| max_chunks_per_doc | integer | Optional | Long documents are automatically truncated to the specified number of chunks. Accepted range: 1-3072. |
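The following sketch passes max_chunks_per_doc, assuming it is supplied through the same parameters dictionary as the truncate option used by the other rerankers; the query and document are illustrative.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

long_doc = " ".join(["enterprise search infrastructure"] * 5000)  # stand-in for a long document

result = pc.inference.rerank(
    model="cohere-rerank-3.5",
    query="low-latency enterprise search",
    documents=[long_doc],
    top_n=1,
    parameters={"max_chunks_per_doc": 512},  # truncate each document to at most 512 chunks
)
```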

Rate limits

Rate limits are defined at the project level and vary based on pricing plan.

| Limit type | Starter plan | Paid plans |
| --- | --- | --- |
| Requests per minute | N/A | 300 |
| Requests per month | N/A | Unlimited |

SDK support

Standalone inference operations (embed and rerank) are supported by all Pinecone SDKs.

Integrated inference operations (create_for_model, records/upsert, and records/search) are supported by the latest Python SDK plus the pinecone-plugin-records plugin. Install the latest SDK and the plugin as follows:

```shell
pip install --upgrade pinecone pinecone-plugin-records
```

The pinecone-plugin-records plugin is not currently compatible with the pinecone[grpc] version of the Python SDK.

Cost

Inference billing is based on tokens used. To learn more, see Understanding cost.