This page describes the Pinecone architecture for serverless indexes.

Overview

Pinecone serverless runs as a managed service on the AWS cloud platform, with support for GCP and Azure cloud platforms coming soon. Within a given cloud region, client requests go through an API gateway to either a control plane or data plane. All vector data is written to highly efficient, distributed blob storage.

API gateway

Requests to Pinecone serverless contain an API key assigned to a specific project. Each incoming request is load-balanced through an edge proxy to an authentication service that verifies that the API key is valid for the targeted project. If so, the proxy routes the request to either the control plane or the data plane, depending on the type of work to be performed.
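For example, when using the Python SDK, the project API key is supplied once when the client is created and is then sent with every request; the gateway authenticates the key before routing the request to the appropriate plane. The index name below is a placeholder.

    from pinecone import Pinecone

    # The API key identifies the project; it accompanies every request
    # and is checked by the authentication service at the gateway.
    pc = Pinecone(api_key="YOUR_API_KEY")

    pc.list_indexes()                                    # routed to the control plane
    pc.Index("example-index").describe_index_stats()     # routed to the data plane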

Control plane

The control plane handles requests to manage organizational objects, such as projects, indexes, and API keys. The control plane uses a dedicated database as the source of truth about these objects. Other services, such as the authentication service, also cache control plane data locally for performance optimization.
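For instance, creating a serverless index is a control plane operation. A minimal sketch with the Python SDK (the index name, dimension, and region are placeholders):

    from pinecone import Pinecone, ServerlessSpec

    pc = Pinecone(api_key="YOUR_API_KEY")

    # Index creation is handled by the control plane, which records the
    # new index in its database as the source of truth.
    pc.create_index(
        name="example-index",
        dimension=8,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )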

Data plane

The data plane handles requests to write and read records in indexes. Indexes are partitioned into one or more logical namespaces, and all write and read requests are scoped by namespace.

Writes and reads follow separate paths, with compute resources auto-scaling independently based on demand. The separation of compute resources ensures that queries never impact write throughput and writes never impact query latency. The auto-scaling of compute resources, combined with highly efficient blob storage, reduces cost, as you pay only for what you use.
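For example, upserts (write path) and queries (read path) are both data plane operations, and both are scoped to a namespace. A minimal sketch with the Python SDK (names and values are placeholders):

    from pinecone import Pinecone

    pc = Pinecone(api_key="YOUR_API_KEY")
    index = pc.Index("example-index")

    # Write path: upsert records into a namespace.
    index.upsert(
        vectors=[{"id": "rec-1", "values": [0.1] * 8, "metadata": {"genre": "news"}}],
        namespace="example-namespace",
    )

    # Read path: query the same namespace.
    index.query(vector=[0.1] * 8, top_k=3, namespace="example-namespace")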

Blob storage

For each namespace in a serverless index, Pinecone clusters records that are likely to be queried together and identifies a centroid dense vector to represent each cluster. These clusters and centroids are stored as data files in distributed blob storage that provides virtually limitless data scalability and guaranteed high availability.

Clustering is based on the dense vector values of records. There is no distinct clustering for sparse vector values. For more details, see Sparse vector considerations.
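Conceptually, this resembles centroid-based clustering such as k-means. The sketch below only illustrates the idea of grouping similar dense vectors and keeping one centroid per group; Pinecone's actual clustering algorithm and file format are not described here.

    import numpy as np
    from sklearn.cluster import KMeans

    # Toy dense vectors standing in for the records of one namespace.
    vectors = np.random.default_rng(0).normal(size=(1000, 8))

    # Group similar vectors and keep one centroid per cluster.
    kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(vectors)
    centroids = kmeans.cluster_centers_   # one representative vector per cluster
    assignments = kmeans.labels_          # cluster id for each record

    # Each cluster's records, together with its centroid, would then be
    # persisted as a data file in blob storage.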

Write path

When the data plane receives a request to add, update, or delete records in the namespace of an index, the following takes place:

Log writers

A writer commits the raw data to blob storage and records a log sequence number (LSN) for the commit in a write-ahead log (WAL). LSNs are monotonically increasing, which ensures that operations are applied in the order they are received and, if necessary, can be replayed in that same order.

At this point, Pinecone returns a 200 OK response to the client, guaranteeing the durability of the write, and two processes begin in parallel: the index is rebuilt, and the records are added to a separate freshness layer.
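The following is a minimal sketch of the WAL behavior described above, assuming a simple in-process log (placeholder names, not Pinecone's implementation):

    import itertools

    class WriteAheadLog:
        """Appends operations with monotonically increasing LSNs."""

        def __init__(self):
            self._next_lsn = itertools.count(1)
            self._entries = []

        def append(self, operation: dict) -> int:
            # The commit is durable before the client receives 200 OK.
            lsn = next(self._next_lsn)
            self._entries.append((lsn, operation))
            return lsn

        def replay(self):
            # Entries replay in LSN order, i.e. the order they were received.
            return sorted(self._entries, key=lambda entry: entry[0])

    wal = WriteAheadLog()
    wal.append({"op": "upsert", "id": "rec-1"})
    wal.append({"op": "delete", "id": "rec-2"})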

Index builder

The index builder takes the raw data from blob storage, identifies the relevant clusters based on cluster centroids, and then writes the changes to the cluster data files in blob storage. It takes around 10 minutes for the changes to be visible to queries.
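A sketch of the nearest-centroid step, assuming cosine similarity (function and variable names are illustrative only):

    import numpy as np

    def nearest_cluster(vector: np.ndarray, centroids: np.ndarray) -> int:
        """Return the index of the centroid most similar to the vector."""
        v = vector / np.linalg.norm(vector)
        c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
        return int(np.argmax(c @ v))   # cosine similarity against each centroid

    # The index builder would append the record to the data file of the
    # chosen cluster and write the updated file back to blob storage.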

Freshness layer

Parallel to index building, the write is sent to the freshness layer. The freshness layer ensures that data that hasn’t yet been clustered by the index builder is available to be searched in seconds rather than minutes. It adds the new records to an in-memory index that is periodically flushed based on the index builder's progress.

The freshness layer can hold up to 2 million records in memory for each namespace in an index. If more than 2 million records are upserted into a namespace, then until the index builder applies the changes, clients see the last state of the namespace plus the 2 million records in the freshness layer.
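A sketch of that behavior, assuming a simple bounded in-memory buffer keyed by record ID and trimmed as the index builder catches up; the capacity handling and eviction shown here are assumptions for illustration only.

    from collections import OrderedDict

    FRESHNESS_CAPACITY = 2_000_000   # per-namespace limit described above

    class FreshnessBuffer:
        def __init__(self, capacity: int = FRESHNESS_CAPACITY):
            self.capacity = capacity
            self.records = OrderedDict()   # record id -> (lsn, vector)

        def add(self, record_id: str, lsn: int, vector) -> None:
            self.records[record_id] = (lsn, vector)
            self.records.move_to_end(record_id)
            while len(self.records) > self.capacity:
                self.records.popitem(last=False)   # drop the oldest entry

        def flush_up_to(self, applied_lsn: int) -> None:
            # Remove records the index builder has already clustered.
            stale = [rid for rid, (lsn, _) in self.records.items() if lsn <= applied_lsn]
            for record_id in stale:
                del self.records[record_id]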

Read path

When the data plane receives a query to search records in the namespace of an index, the following takes place:

Query planners

A query planner identifies the most relevant clusters to search and sends a logical query plan to a sharded backend of stateless query executors that dynamically auto-scale based on system load.

The query planner chooses clusters based on the similarity of their centroid vectors to the dense vector value in the query. If the query includes a metadata filter, the query planner first uses internal metadata statistics to exclude clusters that do not contain records matching the filter and then ranks the remaining clusters in the same way.
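A sketch of cluster selection, assuming cosine similarity and a caller-supplied predicate standing in for the internal metadata statistics (all names here are placeholders):

    import numpy as np

    def plan_query(query_vector, centroids, n_clusters,
                   metadata_filter=None, cluster_may_match=None):
        """Return the ids of the clusters most likely to contain the best matches."""
        candidates = list(range(len(centroids)))
        if metadata_filter is not None and cluster_may_match is not None:
            # Prune clusters that cannot contain records matching the filter.
            candidates = [c for c in candidates if cluster_may_match(c, metadata_filter)]

        q = query_vector / np.linalg.norm(query_vector)
        norms = np.linalg.norm(centroids, axis=1)
        scores = {c: float(centroids[c] @ q / norms[c]) for c in candidates}
        return sorted(scores, key=scores.get, reverse=True)[:n_clusters]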

Query executors

Query executors run the logical query plan. Whenever possible, the plan is run against a local cache of the chosen cluster data files. Otherwise, the executors fetch the chosen cluster data files from blob storage, run the query against the files, and cache the cluster files locally for future queries.

Query executors rank records based on the similarity of their dense vector value to the dense vector value in the query and then choose the top_k records. If the query includes a metadata filter, records that don’t match the filter are excluded before records are ranked. If the query includes a sparse vector value (i.e., hybrid search), the sparse value is considered when ranking records.

In parallel, records in the freshness layer are ranked based on the similarity of their dense vector value to the dense vector value in the query and the top_k records are chosen.
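Both rankings follow the same pattern. The sketch below applies the metadata filter before ranking and omits hybrid (sparse) scoring; the record layout and names are assumptions.

    import numpy as np

    def rank_records(records, query_vector, top_k, metadata_filter=None):
        # records: dicts with "id", "values" (dense vector), and "metadata".
        if metadata_filter is not None:
            records = [r for r in records
                       if all(r["metadata"].get(k) == v
                              for k, v in metadata_filter.items())]

        q = np.asarray(query_vector)
        q = q / np.linalg.norm(q)

        def score(record):
            v = np.asarray(record["values"])
            return float(v @ q / np.linalg.norm(v))   # cosine similarity

        return sorted(records, key=score, reverse=True)[:top_k]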

The top_k records from the cluster data files and from the freshness layer are then merged and deduplicated, and final results are returned to the client.
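A sketch of that final merge, assuming each hit carries an "id" and a "score" and that a record appearing in both sources is counted once, with the freshness-layer copy taking precedence:

    def merge_results(clustered_hits, fresh_hits, top_k):
        # Deduplicate by record id; the freshness layer reflects the most
        # recent write for an id, so its copy replaces the clustered one.
        merged = {hit["id"]: hit for hit in clustered_hits}
        merged.update({hit["id"]: hit for hit in fresh_hits})
        return sorted(merged.values(), key=lambda h: h["score"], reverse=True)[:top_k]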