This section provides some tips for getting the best performance out of Pinecone.

Basic performance checklist

  • Switch to a cloud environment. For example: EC2, GCE, Google Colab, GCP AI Platform Notebook, or SageMaker Notebook. If you experience slow uploads or high query latencies, it might be because you are accessing Pinecone from your home network.
  • Deploy your application and your Pinecone service in the same region. Contact us if you need a dedicated deployment.
  • Reuse connections. Use the same pc.Index() instance for all upserts and queries against a given index.
  • Stay within quotas and rate limits, and check the known limitations.

Increasing throughput

Batch upserts

When upserting large amounts of data, send it in batches of 100-500 vectors across multiple upsert requests. Batching significantly reduces the total time it takes to process the data.
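A minimal sketch of this batching, assuming the vectors are available as an iterable of records that `index.upsert(vectors=...)` accepts. The helper names `chunks` and `upsert_in_batches` are illustrative and not part of the Pinecone client, and the default batch size of 200 is just one value inside the recommended range.

```python
import itertools


def chunks(iterable, batch_size=200):
    """Yield successive tuples of at most batch_size items from iterable."""
    it = iter(iterable)
    chunk = tuple(itertools.islice(it, batch_size))
    while chunk:
        yield chunk
        chunk = tuple(itertools.islice(it, batch_size))


def upsert_in_batches(index, vectors, batch_size=200):
    """Upsert vectors one batch per request; return the number of requests sent.

    `index` is assumed to be the object returned by pc.Index(...).
    """
    requests_sent = 0
    for batch in chunks(vectors, batch_size):
        index.upsert(vectors=list(batch))
        requests_sent += 1
    return requests_sent
```

With a real client this might be called as `upsert_in_batches(pc.Index("example-index"), vectors)`. Because `chunks` consumes the input lazily, the full data set does not need to be held in a list.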

Send upserts in parallel

Pinecone is thread-safe, so you can send multiple read and write requests in parallel to help increase throughput. You can read more about high-throughput optimizations on our blog.

For serverless indexes, reads and writes follow independent paths, so you can send multiple read and write requests in parallel to improve throughput.

For pod-based indexes, multiple reads can be performed in parallel, and multiple writes can be performed in parallel, but reads and writes cannot be performed in parallel with each other. Therefore, write batches may affect query latency, and read batches may affect write throughput.
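One way to send upserts in parallel is a thread pool from the Python standard library. The sketch below assumes `index` is a `pc.Index(...)` instance and `batches` is a list of already-prepared batches. The function name `parallel_upsert` and the worker count of 8 are illustrative choices, not recommendations from the client documentation.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_upsert(index, batches, max_workers=8):
    """Send each batch as its own upsert request from a thread pool.

    Returns the responses in the same order as `batches`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(index.upsert, vectors=batch) for batch in batches]
        # result() blocks until the request finishes and re-raises any
        # exception raised in the worker thread.
        return [future.result() for future in futures]
```

For a pod-based index, running this while the index is also serving queries may raise query latency, for the reason given above.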

Scale pod-based indexes

This guidance applies to pod-based indexes only. With serverless indexes, you don’t configure any compute or storage resources, and you don’t manually manage those resources to meet demand, save on cost, or ensure high availability. Instead, serverless indexes scale automatically based on usage.

To increase throughput (QPS) for pod-based indexes, increase the number of replicas for your index. See the configure_index API reference for more details.


The following example increases the number of replicas for example-index to 4.

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

pc.configure_index("example-index", replicas=4)

Decreasing latency

Use namespaces

When you use namespaces to partition records within a single index, you can limit queries to specific namespaces to reduce the number of records scanned. For more details, see Namespaces.
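As a rough illustration, a query can be restricted to one namespace by passing the `namespace` argument to `index.query`. The wrapper name `query_in_namespace` is hypothetical, and `index` is assumed to be a `pc.Index(...)` instance.

```python
def query_in_namespace(index, vector, namespace, top_k=10):
    """Query only the records stored in the given namespace."""
    return index.query(vector=vector, top_k=top_k, namespace=namespace)
```

For example, `query_in_namespace(index, [0.1, 0.2, 0.3], "example-namespace")` would scan only the records in `example-namespace`, a placeholder name used here for illustration.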

Use metadata filtering

When you attach metadata key-value pairs to records, you can filter queries to retrieve only records that match the metadata filter. For more details, see Metadata filtering.

For p2 pod-based indexes, metadata filters can increase query latency.
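A filtered query might look like the sketch below. The wrapper name `query_with_filter` and the `genre` metadata field are made up for illustration. The calls assumed to exist are `index.query` with its `filter` argument and the `$eq` filter operator.

```python
def query_with_filter(index, vector, metadata_filter, top_k=10, namespace=None):
    """Query an index, returning only records that match metadata_filter."""
    kwargs = {"vector": vector, "top_k": top_k, "filter": metadata_filter}
    # Only pass a namespace when the caller asked for one.
    if namespace is not None:
        kwargs["namespace"] = namespace
    return index.query(**kwargs)


# Hypothetical filter: match records whose "genre" field equals "documentary".
example_filter = {"genre": {"$eq": "documentary"}}
```

A call such as `query_with_filter(index, [0.1, 0.2, 0.3], example_filter)` would then return only matching records.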

Avoid network calls to fetch index hosts

When you target an index by name, the Python and Node.js clients make a network call to fetch the host where the index is deployed. In production, you can avoid this additional round trip by specifying the host of the index directly:

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")

index = pc.Index(host="INDEX_HOST")

import { Pinecone } from '@pinecone-database/pinecone';

const pc = new Pinecone({ apiKey: 'YOUR_API_KEY' });

// For the Node.js client, you must specify both the index name and host.
const index = pc.index("INDEX_NAME", "INDEX_HOST");

You can get the host of an index using the Pinecone console or using the describe_index operation. For more details, see Get an index endpoint.
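If the host is not known ahead of time, one option is to look it up once with `describe_index` and reuse it afterwards. The following is a sketch under that assumption. The function name `get_index_by_host` and the `host_cache` dictionary are illustrative, and the code relies on the description returned by `pc.describe_index(...)` exposing a `host` attribute.

```python
def get_index_by_host(pc, index_name, host_cache):
    """Return an index handle, fetching the host only on first use.

    `pc` is assumed to be a Pinecone(...) client and `host_cache` a plain
    dict that the caller keeps alive for the lifetime of the application.
    """
    if index_name not in host_cache:
        host_cache[index_name] = pc.describe_index(index_name).host
    return pc.Index(host=host_cache[index_name])
```

In a deployed service, the host could instead be read from configuration so that no lookup happens at all.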