Import data
This page shows you how to use the start_import, list_imports, describe_import, and cancel_import endpoints to import data into an index and interact with the import.
To learn how to format data for importing and other concepts related to imports, see Understanding imports. To run through this guide in your browser, use the Bulk import Colab notebook.
This feature is in public preview and available only on Standard and Enterprise plans.
Before you import
Before you can import data, ensure you have the following:
- An ID for your Amazon S3 integration (not needed for importing from a public bucket). The ID is found on the Storage integrations page of the Pinecone console.
- Data formatted in a Parquet file and uploaded to the Amazon S3 bucket.
- A serverless index on AWS to import records into. The index cannot have existing namespaces with the same name as the namespaces defined in your file directory structure.
Import records into an index
Import is only available for serverless indexes on AWS.
Use the start_import operation to start an asynchronous import of vectors from object storage into an index.
To import from a private bucket, specify the Integration ID (integration) of the Amazon S3 integration you created. The ID is found on the Storage integrations page of the Pinecone console. An ID is not needed to import from a public bucket.
The operation returns an id that you can use to check the status of the import.
- Each import request can import up to 1 TB of data, or 100,000,000 records into a maximum of 100 namespaces, whichever limit is met first.
- You cannot import data into existing namespaces. For more information, see Directory structure.
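As a reference, here is a minimal sketch of starting an import with the Python SDK. The bucket path, integration ID, and index name are placeholders, and the exact parameter names (uri, integration_id, error_mode) are assumptions to verify against the SDK reference for your client version.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("example-index")  # a serverless index on AWS

# Start an asynchronous import from a Parquet directory in Amazon S3.
# integration_id is only needed for a private bucket; omit it for a public bucket.
response = index.start_import(
    uri="s3://BUCKET_NAME/PATH/TO/DIR",
    integration_id="a12b3d4c-47d2-492c-a97a-dd98c8dbefde",  # hypothetical integration ID
    error_mode="CONTINUE",  # or "ABORT" to stop the import on the first error
)

# The response includes the import's ID, which you can use to check its status.
print(response)
```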
{
"operation_id": "101"
}
Once all the data is loaded, the index builder indexes the records, which usually takes at least 10 minutes. During this indexing process, the expected job status is InProgress, even though the import reports 100.0 percent complete. Once all the imported records are indexed and fully available for querying, the import operation is set to Completed.
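If you need to wait until the imported records are queryable, one option is to poll the import with the describe_import operation (covered below) until it reaches a terminal status. A rough sketch, assuming the returned object exposes status and percent_complete fields; apart from Completed, the exact terminal status strings are assumptions.

```python
import time
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("example-index")

import_id = "101"  # the ID returned by start_import

while True:
    import_job = index.describe_import(import_id)
    print(f"{import_job.status}: {import_job.percent_complete}% complete")
    # Stop polling once the import reaches a terminal state.
    if import_job.status in ("Completed", "Failed", "Cancelled"):
        break
    time.sleep(60)  # indexing usually takes at least 10 minutes
```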
You can start a new import using the Pinecone console. Find the index you want to import into, and click the ellipsis (...) menu > Import data.
List recent and ongoing imports
Use the list_imports operation to list all of the recent and ongoing imports. Whenever there are additional imports to return, the response includes a pagination_token for fetching the next page of imports.
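For example, a minimal sketch with the Python SDK. It assumes list_imports can be iterated over and that each item exposes id, status, and percent_complete fields; whether pagination is handled for you depends on the client, so check the SDK reference.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("example-index")

# Iterate over recent and ongoing imports for this index.
for import_job in index.list_imports():
    print(import_job.id, import_job.status, import_job.percent_complete)
```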
{
"data": [
{
"id": "1",
"uri": "s3://BUCKET_NAME/PATH/TO/DIR",
"status": "Pending",
"started_at": "2024-08-19T20:49:00.754Z",
"finished_at": "2024-08-19T20:49:00.754Z",
"percent_complete": 42.2,
"records_imported": 1000000
}
],
"pagination": {
"next": "Tm90aGluZyB0byBzZWUgaGVyZQo="
}
}
You can view the list of imports for an index in the Pinecone console. Select the index and navigate to the Imports tab.
Manual pagination
When using the REST API to list recent and ongoing imports, you must manually fetch each page of results. To view the next page of results, include the paginationToken provided in the response of the previous list_imports / GET request.
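The sketch below shows one way to do this with Python's requests library. It assumes the list imports endpoint is GET /bulk/imports on your index host and that the token is passed as a paginationToken query parameter; confirm both, along with the API version header value, against the API reference.

```python
import requests

INDEX_HOST = "https://example-index-abc123.svc.example.pinecone.io"  # placeholder index host
headers = {
    "Api-Key": "YOUR_API_KEY",
    "X-Pinecone-API-Version": "2024-10",  # assumed version that supports bulk import
}

pagination_token = None
while True:
    params = {"limit": 100}
    if pagination_token:
        params["paginationToken"] = pagination_token
    page = requests.get(f"{INDEX_HOST}/bulk/imports", headers=headers, params=params).json()

    for import_job in page.get("data", []):
        print(import_job["id"], import_job["status"])

    # Stop when the response no longer includes a pagination token.
    pagination_token = (page.get("pagination") or {}).get("next")
    if not pagination_token:
        break
```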
Describe an import
Use the describe_import operation to get details about a specific import.
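For example, a minimal sketch with the Python SDK; the import ID is a placeholder for the ID returned by start_import.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("example-index")

# Look up a single import by its ID.
import_details = index.describe_import("101")
print(import_details)
```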
{
"id": "101",
"uri": "s3://BUCKET_NAME/PATH/TO/DIR",
"status": "Pending",
"created_at": "2024-08-19T20:49:00.754Z",
"finished_at": "2024-08-19T20:49:00.754Z",
"percent_complete": 42.2,
"records_imported": 1000000
}
You can view the details of your import using the Pinecone console.
Cancel an ongoing import
The cancel_import operation cancels an import if it is not yet finished. It has no effect if the import is already complete.
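For example, a minimal sketch with the Python SDK; the import ID is a placeholder for the ID returned by start_import.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("example-index")

# Cancel the import if it is still in progress; this has no effect once it has completed.
index.cancel_import("101")
```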
{}
You can cancel your import using the Pinecone console. To cancel an ongoing import, select the index you are importing into and navigate to the Imports tab. Then, click the ellipsis (...) menu > Cancel.