Import records
This page shows you how to import records from object storage into an index and interact with the import. Importing from object storage is the most efficient and cost-effective way to load large numbers of records into an index.
To run through this guide in your browser, see the Bulk import Colab notebook.
This feature is in public preview and available only on Standard and Enterprise plans.
Before you import
Before you can import records, ensure you have a serverless index, a storage integration, and data formatted as Parquet files and uploaded to an Amazon S3 bucket.
Create an index
Create a serverless index for your data.
- Import does not support integrated embedding, so make sure your index is not associated with an integrated embedding model.
- Import only supports AWS S3 as a data source, so make sure your index is also on AWS.
- You cannot import records into existing namespaces, so make sure your index does not have namespaces with the same name as the namespaces you want to import into.
Add a storage integration
To import records from a secure data source, you must create an integration to allow Pinecone access to data in your object storage. For information on how to add, edit, and delete a storage integration, see Manage storage integrations.
To import records from a public data source, a storage integration is not required.
Prepare your data
For each namespace you want to import into, create a Parquet file and upload it to object storage.
Dense index
To import into a dense index, the Parquet file must contain the following columns:
Column name | Parquet type | Description |
---|---|---|
id | STRING | Required. The unique identifier for each record. |
values | LIST<FLOAT> | Required. A list of floating-point values that make up the dense vector embedding. |
metadata | STRING | Optional. Additional metadata for each record. To omit from specific rows, use NULL. |
The Parquet file cannot contain additional columns.
For example:
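A minimal sketch of preparing dense records for the three columns above. It assumes the metadata string is JSON-encoded (the table only says STRING) and that pyarrow is available for the final write. The helper names build_dense_rows and write_dense_parquet are hypothetical, not part of any Pinecone SDK.

```python
import json


def build_dense_rows(records):
    """Turn (id, vector, metadata-or-None) tuples into row dicts.

    Only the three allowed columns are produced: id, values, metadata.
    """
    rows = []
    for record_id, vector, metadata in records:
        rows.append({
            "id": str(record_id),
            "values": [float(x) for x in vector],
            # Assumption: metadata is stored as a JSON string; None becomes NULL.
            "metadata": json.dumps(metadata) if metadata is not None else None,
        })
    return rows


def write_dense_parquet(rows, path):
    """Write the rows to a Parquet file (requires the pyarrow package)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    schema = pa.schema([
        ("id", pa.string()),
        ("values", pa.list_(pa.float32())),
        ("metadata", pa.string()),
    ])
    pq.write_table(pa.Table.from_pylist(rows, schema=schema), path)


rows = build_dense_rows([
    ("1", [0.1, 0.2, 0.3], {"genre": "drama"}),
    ("2", [0.4, 0.5, 0.6], None),  # no metadata for this row
])
```

Calling write_dense_parquet(rows, "0.parquet") would then produce a file ready to upload under the namespace prefix.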
Sparse index
To import into a sparse index, the Parquet file must contain the following columns:
Column name | Parquet type | Description |
---|---|---|
id | STRING | Required. The unique identifier for each record. |
sparse_values | LIST<INT> and LIST<FLOAT> | Required. A list of integer values (sparse indices) and a list of floating-point values (sparse values) that make up the sparse vector embedding. |
metadata | STRING | Optional. Additional metadata for each record. To omit from specific rows, use NULL. |
The Parquet file cannot contain additional columns.
For example:
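A rough sketch of a sparse row. The table above does not spell out how the two lists are nested inside sparse_values, so this assumes a struct with fields named indices and values, with unsigned 32-bit indices. Treat that layout, and the helper names, as assumptions to verify against your own import.

```python
def build_sparse_row(record_id, indices, values, metadata_json=None):
    """Build one row with the id, sparse_values, and metadata columns."""
    if len(indices) != len(values):
        raise ValueError("indices and values must have the same length")
    return {
        "id": str(record_id),
        # Assumed layout: one struct holding both lists.
        "sparse_values": {
            "indices": [int(i) for i in indices],
            "values": [float(v) for v in values],
        },
        "metadata": metadata_json,  # a string, or None for NULL
    }


def write_sparse_parquet(rows, path):
    """Write the rows to a Parquet file (requires the pyarrow package)."""
    import pyarrow as pa
    import pyarrow.parquet as pq

    sparse_type = pa.struct([
        ("indices", pa.list_(pa.uint32())),  # assumed integer width
        ("values", pa.list_(pa.float32())),
    ])
    schema = pa.schema([
        ("id", pa.string()),
        ("sparse_values", sparse_type),
        ("metadata", pa.string()),
    ])
    pq.write_table(pa.Table.from_pylist(rows, schema=schema), path)


row = build_sparse_row("1", [10, 45, 300], [0.5, 0.25, 0.125])
```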
In object storage, your directory structure determines which Pinecone namespaces your data is imported into. The files associated with each namespace must be in a separate prefix (or sub-directory). The namespace cannot be the same as an existing namespace in the index you are importing into.
To import records, specify the URI of the prefix containing the namespace and Parquet files you want to import. For example:
Pinecone then finds all .parquet files inside the namespace prefix and imports them into the namespace. All other file types are ignored.
In the example above, the import is located at the top-level URL of s3://BUCKET_NAME/PATH/TO/NAMESPACES/. When scanning this directory, Pinecone finds the namespace example_namespace, which contains four .parquet files and one .log file. Pinecone ignores the .log file.
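To check a layout before starting an import, the scanning rule described above can be approximated locally. This is only an illustrative sketch, not Pinecone's implementation: group_parquet_by_namespace is a hypothetical helper, and the file names are made up to match the described layout of four .parquet files and one .log file.

```python
from collections import defaultdict


def group_parquet_by_namespace(root_uri, object_uris):
    """Map each namespace prefix under root_uri to its .parquet files."""
    root = root_uri.rstrip("/") + "/"
    namespaces = defaultdict(list)
    for uri in object_uris:
        if not uri.startswith(root):
            continue  # outside the import root
        namespace, sep, filename = uri[len(root):].partition("/")
        if not sep or not filename.endswith(".parquet"):
            continue  # not inside a namespace prefix, or not a Parquet file
        namespaces[namespace].append(filename)
    return dict(namespaces)


root = "s3://BUCKET_NAME/PATH/TO/NAMESPACES/"
# Hypothetical object names for illustration only.
names = ("0.parquet", "1.parquet", "2.parquet", "3.parquet", "import.log")
objects = [root + "example_namespace/" + name for name in names]

found = group_parquet_by_namespace(root, objects)
```

Here found contains a single key, example_namespace, listing the four Parquet files; the .log file is skipped.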
Each import request can import up to 1TB of data or 100,000,000 records into a maximum of 100 namespaces, whichever limit is met first.
Import records into an index
Review current limitations before starting an import.
Use the start_import operation to start an asynchronous import of vectors from object storage into an index.
To import from a private bucket, specify the Integration ID (integration) of the Amazon S3 integration you created. The ID is found on the Storage integrations page of the Pinecone console. An ID is not needed to import from a public bucket.
The operation returns an id that you can use to check the status of the import.
If you set the import to continue on error, the operation skips records that fail to import and continues with the next record. The operation completes, but you are not notified which records, if any, failed to import. To see how many records were successfully imported, use the describe_import operation.
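A minimal sketch of starting an import from Python. It assumes index is a Pinecone index client whose start_import method accepts uri, integration_id, and error_mode keyword arguments, that error_mode takes the strings "CONTINUE" or "ABORT", and that the response exposes the import ID as .id. Check those names against the SDK version you use. The wrapper start_bulk_import is hypothetical.

```python
def start_bulk_import(index, uri, integration_id=None, continue_on_error=True):
    """Start an asynchronous import and return its ID.

    index: an index client exposing start_import (assumed signature).
    uri: the prefix containing the namespace sub-directories.
    integration_id: only needed for private buckets.
    """
    kwargs = {
        "uri": uri,
        # Assumed values: "CONTINUE" skips failed records, "ABORT" stops.
        "error_mode": "CONTINUE" if continue_on_error else "ABORT",
    }
    if integration_id is not None:
        kwargs["integration_id"] = integration_id
    response = index.start_import(**kwargs)
    return response.id
```

For a public bucket, start_bulk_import(index, "s3://BUCKET_NAME/PATH/TO/NAMESPACES/") would be enough; pass integration_id for a private one.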
Once all the data is loaded, the index builder indexes the records, which usually takes at least 10 minutes. During this indexing process, the job status remains InProgress even though the import is reported as 100.0 percent complete. Once all the imported records are indexed and fully available for querying, the import operation is set to Completed.
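Because an import can sit at 100.0 percent while still InProgress, a waiting loop should key off the status rather than the percentage. The sketch below assumes describe_import takes an id keyword and returns an object with a status attribute. Only InProgress and Completed are named in this guide, so the Failed and Cancelled statuses are assumptions, as is the helper name wait_for_import.

```python
import time

# "Completed" comes from this guide; the other two are assumed terminal states.
TERMINAL_STATUSES = {"Completed", "Failed", "Cancelled"}


def wait_for_import(index, import_id, poll_seconds=60.0, timeout_seconds=7200.0):
    """Poll describe_import until the import reaches a terminal status."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        info = index.describe_import(id=import_id)
        if info.status in TERMINAL_STATUSES:
            return info
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"import {import_id} still {info.status} after {timeout_seconds}s"
            )
        time.sleep(poll_seconds)
```

The default poll interval is deliberately long, since indexing alone usually takes at least 10 minutes.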
You can also start a new import using the Pinecone console. Find the index you want to import into, and click the ellipsis (...) menu > Import data.
Manage imports
List imports
Use the list_imports operation to list all of the recent and ongoing imports. By default, the operation returns up to 100 imports per page. If the limit parameter is passed, the operation returns up to that number of imports per page instead. For example, if limit=3, up to 3 imports are returned per page. Whenever there are additional imports to return, the response includes a pagination_token for fetching the next page of imports.
When using the Python SDK, list_imports paginates automatically.
You can view the list of imports for an index in the Pinecone console. Select the index and navigate to the Imports tab.
When using the Node.js SDK, Go SDK, .NET SDK, or REST API to list recent and ongoing imports, you must manually fetch each page of results. To view the next page of results, include the paginationToken provided in the response of the list_imports / GET request.
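The manual pagination loop can be sketched as follows. This is a generic pattern, not an SDK call: fetch_page is a hypothetical callable you supply (for example, a thin wrapper around the GET request), and the response shape used here, a data list plus a pagination object holding the next token, is an assumption to adapt to the actual response.

```python
def iter_all_imports(fetch_page, limit=100):
    """Yield every import by following pagination tokens page by page.

    fetch_page(limit, pagination_token) must return a dict for one page.
    """
    token = None
    while True:
        page = fetch_page(limit=limit, pagination_token=token)
        yield from page.get("data", [])
        # Assumed shape: {"pagination": {"next": "<token>"}} when more pages exist.
        token = (page.get("pagination") or {}).get("next")
        if not token:
            break
```

Wrapping the loop in a generator keeps callers from handling tokens themselves, which mirrors the automatic pagination described for the Python SDK.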
Describe an import
Use the describe_import operation to get details about a specific import.
You can view the details of your import using the Pinecone console.
Cancel an import
The cancel_import operation cancels an import if it is not yet finished. It has no effect if the import is already complete.
You can also cancel your import using the Pinecone console. To cancel an ongoing import, select the index you are importing into and navigate to the Imports tab. Then, click the ellipsis (...) menu > Cancel.
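A small sketch of cancelling from code. It assumes the index client exposes describe_import and cancel_import methods that each take an id keyword, and that a finished import reports the status Completed. The status check is optional, since cancelling a completed import has no effect, but it makes the outcome explicit to the caller. cancel_if_running is a hypothetical helper.

```python
def cancel_if_running(index, import_id):
    """Cancel the import unless it already completed.

    Returns True if a cancel request was sent, False otherwise.
    """
    info = index.describe_import(id=import_id)
    if info.status == "Completed":
        return False  # nothing to cancel
    index.cancel_import(id=import_id)
    return True
```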