Pinecone Inference is a service that gives you access to embedding and reranking models hosted on Pinecone’s infrastructure.

Pinecone currently hosts models in the US only.

Workflows

You can use Pinecone Inference as a standalone service or integrated with Pinecone’s database operations.

Standalone inference

When you use Pinecone Inference as a standalone service, you generate embeddings and rerank results as distinct steps from other database operations like upsert and query.

1. Embed data
2. Create an index
3. Upsert embeddings
4. Embed queries
5. Search the index
6. Rerank results
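
A minimal end-to-end sketch of these six steps with the Python SDK. The index name, namespace, and sample text are placeholders; index-readiness polling and error handling are omitted.

```python
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")

# 1. Embed data as "passage" inputs.
passages = ["Apples are a popular fruit.", "The Eiffel Tower is in Paris."]
embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=passages,
    parameters={"input_type": "passage", "truncate": "END"},
)

# 2. Create an index whose dimension matches the model (1024 here).
pc.create_index(
    name="standalone-example",  # placeholder index name
    dimension=1024,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
index = pc.Index("standalone-example")

# 3. Upsert the embeddings, keeping the source text as metadata.
index.upsert(
    vectors=[
        {"id": f"vec{i}", "values": e["values"], "metadata": {"text": t}}
        for i, (e, t) in enumerate(zip(embeddings, passages))
    ],
    namespace="example-namespace",
)

# 4. Embed the query as a "query" input.
question = "Where is the Eiffel Tower?"
query_embedding = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=[question],
    parameters={"input_type": "query"},
)[0]

# 5. Search the index.
results = index.query(
    namespace="example-namespace",
    vector=query_embedding["values"],
    top_k=2,
    include_metadata=True,
)

# 6. Rerank the matches with a hosted reranking model.
reranked = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query=question,
    documents=[{"id": m["id"], "text": m["metadata"]["text"]} for m in results["matches"]],
    top_n=2,
)
```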

Integrated inference

When you use integrated inference, embedding and reranking are integrated with database operations and do not require extra steps.

1. Create an index configured for a specific embedding model
2. Upsert data with integrated embedding
3. Search the index with integrated embedding and reranking

Indexes with integrated embedding do not support updating or importing with text.
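
A minimal sketch of the same three steps with the latest Python SDK. The index name, namespace, field names, and sample records are placeholders.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

# 1. Create an index configured for a specific embedding model.
#    field_map tells Pinecone which record field to embed.
pc.create_index_for_model(
    name="integrated-example",  # placeholder index name
    cloud="aws",
    region="us-east-1",
    embed={"model": "multilingual-e5-large", "field_map": {"text": "chunk_text"}},
)
index = pc.Index("integrated-example")

# 2. Upsert raw text; Pinecone embeds the mapped field automatically.
index.upsert_records(
    "example-namespace",
    [
        {"_id": "rec1", "chunk_text": "Apples are a popular fruit."},
        {"_id": "rec2", "chunk_text": "The Eiffel Tower is in Paris."},
    ],
)

# 3. Search with text; embedding and reranking happen server-side.
results = index.search(
    namespace="example-namespace",
    query={"inputs": {"text": "Where is the Eiffel Tower?"}, "top_k": 2},
    rerank={"model": "bge-reranker-v2-m3", "rank_fields": ["chunk_text"], "top_n": 2},
)
```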

Embedding models

The following embedding models are hosted by Pinecone and available for standalone or integrated inference:

multilingual-e5-large

multilingual-e5-large is an efficient dense embedding model trained on a mixture of multilingual datasets. It works well on messy data and short queries expected to return medium-length passages of text (1-2 paragraphs).

Details

  • Vector type: Dense
  • Modality: Text
  • Dimension: 1024
  • Recommended similarity metric: Cosine
  • Max input tokens per sequence: 507
  • Max sequences per batch: 96

Parameters

The multilingual-e5-large model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
|-----------|------|-------------------|-------------|---------|
| input_type | string | Required | The type of input data. Accepted values: query or passage. | |
| truncate | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: END or NONE. END truncates the input sequence at the input token limit. NONE returns an error when the input exceeds the input token limit. | END |
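
For example, an embed request that sets both parameters, assuming a Pinecone client pc as in the sketches above:

```python
embeddings = pc.inference.embed(
    model="multilingual-e5-large",
    inputs=["Sample passage to embed."],
    parameters={"input_type": "passage", "truncate": "END"},
)
print(len(embeddings[0]["values"]))  # 1024
```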

Quotas

Quotas are defined at the project level and vary based on pricing plan and input type.

| Input type | Starter plan | Paid plans |
|------------|--------------|------------|
| passage | 250k tokens per minute | 1M tokens per minute |
| query | 50k tokens per minute | 250k tokens per minute |
| Combined | 5M tokens per month | Unlimited tokens per month |

llama-text-embed-v2

llama-text-embed-v2 is a high-performance dense embedding model optimized for text retrieval and ranking tasks. It is trained on a diverse range of text corpora and provides strong performance on longer passages and structured documents.

This feature is in public preview.

Details

  • Vector type: Dense
  • Modality: Text
  • Dimension: 1024 (default), 2048, 768, 512, 384
  • Recommended similarity metric: Cosine
  • Max input tokens per sequence: 2048
  • Max sequences per batch: 96

Parameters

The llama-text-embed-v2 model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
|-----------|------|-------------------|-------------|---------|
| input_type | string | Required | The type of input data. Accepted values: query or passage. | |
| truncate | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: END or NONE. END truncates the input sequence at the input token limit. NONE returns an error when the input exceeds the input token limit. | END |
| dimension | integer | Optional | Dimension of the vector to return. | 1024 |
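
For example, to request 512-dimensional vectors instead of the default 1024 (client setup as above):

```python
embeddings = pc.inference.embed(
    model="llama-text-embed-v2",
    inputs=["Sample passage to embed."],
    parameters={"input_type": "passage", "truncate": "END", "dimension": 512},
)
print(len(embeddings[0]["values"]))  # 512
```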

Quotas

Quotas are defined at the project level and vary based on pricing plan and input type.

| Limit type | Starter plan | Paid plans |
|------------|--------------|------------|
| Tokens per minute (combined) | 250k | 1M |
| Tokens per month (combined) | 5M | Unlimited |

pinecone-sparse-english-v0

This feature is in public preview.

pinecone-sparse-english-v0 is a sparse embedding model for converting text to sparse vectors for keyword or hybrid semantic/keyword search. Built on the innovations of the DeepImpact architecture, the model directly estimates the lexical importance of tokens by leveraging their context, unlike traditional retrieval models like BM25, which rely solely on term frequency.

Details

  • Vector type: Sparse
  • Modality: Text
  • Recommended similarity metric: Dotproduct
  • Max input tokens per sequence: 512
  • Max sequences per batch: 96

Parameters

The pinecone-sparse-english-v0 model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
|-----------|------|-------------------|-------------|---------|
| input_type | string | Required | The type of input data. Accepted values: query or passage. | |
| truncate | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: END or NONE. END truncates the input sequence at the input token limit. NONE returns an error when the input exceeds the input token limit. | END |
| return_tokens | boolean | Optional | Whether to return the string tokens. | False |
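
A sketch of a sparse embed request with token strings returned (client setup as above; the sparse_* field names follow the embed response format for sparse models):

```python
sparse = pc.inference.embed(
    model="pinecone-sparse-english-v0",
    inputs=["The quick brown fox jumps over the lazy dog."],
    parameters={"input_type": "passage", "truncate": "END", "return_tokens": True},
)
emb = sparse[0]
# A sparse embedding is a set of index/value pairs rather than a dense vector;
# with return_tokens=True, the corresponding string tokens are included too.
print(emb["sparse_indices"][:5])
print(emb["sparse_values"][:5])
print(emb["sparse_tokens"][:5])
```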

Quotas

Quotas are defined at the project level and vary based on pricing plan.

| Limit type | Starter plan | Paid plans |
|------------|--------------|------------|
| Tokens per minute | 250k | 1M |
| Tokens per month | 5M | Unlimited |

Reranking models

The following reranking models are hosted by Pinecone and available for standalone or integrated inference:

cohere-rerank-3.5

This feature is available only on Standard and Enterprise plans.

cohere-rerank-3.5 is Cohere’s leading reranking model, balancing performance and latency for a wide range of enterprise search applications.

Details

  • Modality: Text
  • Max tokens per query and document pair: 40,000
  • Max documents: 200

Parameters

The cohere-rerank-3.5 model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
|-----------|------|-------------------|-------------|---------|
| max_chunks_per_doc | integer | Optional | Long documents will be automatically truncated to the specified number of chunks. Accepted range: 1 - 3072. | |
| rank_fields | array of strings | Optional | The fields to use for reranking. The model reranks based on the order of the fields specified (e.g., ["field1", "field2", "field3"]). | ["text"] |
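
For example, a rerank request over a custom field, passing the model-specific max_chunks_per_doc through the parameters dict (documents are placeholders; client setup as above):

```python
reranked = pc.inference.rerank(
    model="cohere-rerank-3.5",
    query="What did Apple announce this quarter?",
    documents=[
        {"id": "a", "summary": "Apple announced new products this quarter."},
        {"id": "b", "summary": "Markets were volatile amid rate uncertainty."},
    ],
    rank_fields=["summary"],  # rerank on "summary" instead of the default "text"
    top_n=2,
    parameters={"max_chunks_per_doc": 4},
)
for row in reranked.data:
    print(row.index, row.score)
```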

Quotas

Quotas are defined at the project level and vary based on pricing plan. To request a rate increase, contact Support.

| Limit type | Starter plan | Paid plans |
|------------|--------------|------------|
| Requests per minute | N/A | 300 |
| Requests per month | N/A | Unlimited |

bge-reranker-v2-m3

bge-reranker-v2-m3 is a high-performance, multilingual reranking model that works well on messy data and short queries expected to return medium-length passages of text (1-2 paragraphs).

Details

  • Modality: Text
  • Max tokens per query and document pair: 1024
  • Max documents: 100

Parameters

The bge-reranker-v2-m3 model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
|-----------|------|-------------------|-------------|---------|
| truncate | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: END or NONE. END truncates the input sequence at the input token limit. NONE returns an error when the input exceeds the input token limit. | NONE |
| rank_fields | array of strings | Optional | The field to use for reranking. The model supports only a single rerank field. | ["text"] |
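
Note the default of NONE: over-long query/document pairs are rejected unless you opt into truncation, e.g. (client setup as above):

```python
reranked = pc.inference.rerank(
    model="bge-reranker-v2-m3",
    query="Tell me about Apple's products",
    documents=[{"id": "doc1", "text": "Apple designs phones, tablets, and laptops."}],
    parameters={"truncate": "END"},  # the default NONE errors on over-long pairs
)
```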

Quotas

Quotas are defined at the project level and vary based on pricing plan. To request a rate increase, contact Support.

| Limit type | Starter plan | Paid plans |
|------------|--------------|------------|
| Requests per minute | 60 | 60 |
| Requests per month | 500 | Unlimited |

pinecone-rerank-v0

This feature is in public preview.

pinecone-rerank-v0 is a state-of-the-art reranking model that outperforms competitors on widely accepted benchmarks. It can handle chunks up to 512 tokens (1-2 paragraphs).

Details

  • Modality: Text
  • Max tokens per query and document pair: 512
  • Max documents: 100

Parameters

The pinecone-rerank-v0 model supports the following parameters:

| Parameter | Type | Required/Optional | Description | Default |
|-----------|------|-------------------|-------------|---------|
| truncate | string | Optional | How to handle inputs longer than those supported by the model. Accepted values: END or NONE. END truncates the input sequence at the input token limit. NONE returns an error when the input exceeds the input token limit. | END |
| rank_fields | array of strings | Optional | The field to use for reranking. The model supports only a single rerank field. | ["text"] |
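
A minimal request that returns the reranked documents themselves (documents are placeholders; client setup as above):

```python
reranked = pc.inference.rerank(
    model="pinecone-rerank-v0",
    query="What is the capital of France?",
    documents=[
        {"id": "doc1", "text": "Paris is the capital of France."},
        {"id": "doc2", "text": "Berlin is the capital of Germany."},
    ],
    top_n=1,
    return_documents=True,  # include the document bodies in the response
)
```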

Quotas

Quotas are defined at the project level and vary based on pricing plan. To request a rate increase, contact Support.

| Limit type | Starter plan | Paid plans |
|------------|--------------|------------|
| Requests per minute | 60 | 60 |
| Requests per month | 500 | Unlimited |

SDK support

Standalone inference operations (embed and rerank) are supported by all Pinecone SDKs.

Integrated inference operations (create_for_model, records/upsert, and records/search) are supported by the latest Python SDK. Install the latest SDK as follows:

```shell
pip install --upgrade pinecone
```

Cost

Inference billing is based on tokens used. To learn more, see Understanding cost.