This feature is in public preview and available only on Standard and Enterprise plans.
Before you import
Before you can import records, ensure you have a serverless index, a storage integration, and data formatted in Parquet files and uploaded to an Amazon S3 bucket, Google Cloud Storage bucket, or Azure Blob Storage container.
Create an index
Create a serverless index for your data. Be sure to create your index on a cloud that supports importing from the object storage you want to use:

Index location | AWS S3 | Google Cloud Storage | Azure Blob Storage |
---|---|---|---|
AWS | ✅ | ✅ | ✅ |
GCP | ❌ | ✅ | ✅ |
Azure | ❌ | ✅ | ✅ |
Add a storage integration
To import records from a public data source, a storage integration is not required. However, to import records from a secure data source, you must create an integration to allow Pinecone access to data in your object storage. See the integration guide for your object storage provider.
Prepare your data
1. In your Amazon S3 bucket, Google Cloud Storage bucket, or Azure Blob Storage container, create an import directory containing a subdirectory for each namespace you want to import into. The namespaces must not yet exist in your index.

   For example, to import data into the namespaces `example_namespace1` and `example_namespace2`, your import directory would contain one subdirectory for each of them.

   To import into the default namespace, use a subdirectory called `__default__`. The default namespace must be empty.
2. For each namespace, create one or more Parquet files defining the records to import.

   Parquet files must contain specific columns, depending on the index type. To import into a namespace in a dense index, the Parquet file must contain the following columns:

   Column name | Parquet type | Description |
   ---|---|---|
   id | STRING | Required. The unique identifier for each record. |
   values | LIST<FLOAT> | Required. A list of floating-point values that make up the dense vector embedding. |
   metadata | STRING | Optional. Additional metadata for each record. To omit from specific rows, use NULL. |

   The Parquet file cannot contain additional columns.
3. Upload the Parquet files into the relevant namespace subdirectory. For example, if you have subdirectories for the namespaces `example_namespace1` and `example_namespace2` and upload 4 Parquet files into each, each of those subdirectories would contain 4 Parquet files after the upload.
Import records into an index
Review current limitations before starting an import.
Use the `start_import` operation to start an asynchronous import of vectors from object storage into an index.
- For `uri`, specify the URI of the bucket and import directory containing the namespaces and Parquet files you want to import. For example:
  - Amazon S3: `s3://BUCKET_NAME/IMPORT_DIR`
  - Google Cloud Storage: `gs://BUCKET_NAME/IMPORT_DIR`
  - Azure Blob Storage: `https://STORAGE_ACCOUNT.blob.core.windows.net/CONTAINER_NAME/IMPORT_DIR`
- For `integration_id`, specify the Integration ID of the Amazon S3, Google Cloud Storage, or Azure Blob Storage integration you created. The ID is found on the Storage integrations page of the Pinecone console. An Integration ID is not needed to import from a public bucket.
- For `error_mode`, use `CONTINUE` or `ABORT`.
  - With `ABORT`, the operation stops if any records fail to import.
  - With `CONTINUE`, the operation continues on error, but there is no notification about which records, if any, failed to import. To see how many records were successfully imported, use the `describe_import` operation.
The operation returns an `id` that you can use to check the status of the import. While records are still being indexed, the import's status is `InProgress`, even when it shows `100.0` percent complete. Once all the imported records are indexed and fully available for querying, the import operation is set to `Completed`.
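The request described above can be sketched in Python. The `build_import_request` helper below is hypothetical (not part of the Pinecone SDK); it only validates and assembles the keyword arguments, and the commented-out SDK call shows where they would be used.

```python
from typing import Optional

def build_import_request(uri: str,
                         integration_id: Optional[str] = None,
                         error_mode: str = "CONTINUE") -> dict:
    # Hypothetical helper: validate and assemble kwargs for start_import.
    if error_mode not in ("CONTINUE", "ABORT"):
        raise ValueError("error_mode must be CONTINUE or ABORT")
    if not uri.startswith(("s3://", "gs://", "https://")):
        raise ValueError("uri must be an S3, GCS, or Azure Blob Storage URI")
    kwargs = {"uri": uri, "error_mode": error_mode}
    if integration_id is not None:
        # Omit integration_id entirely when importing from a public bucket.
        kwargs["integration_id"] = integration_id
    return kwargs

# Sketch of the SDK call itself (requires a live index and credentials):
# from pinecone import Pinecone
# pc = Pinecone(api_key="YOUR_API_KEY")
# index = pc.Index(host="INDEX_HOST")
# resp = index.start_import(**build_import_request(
#     "s3://BUCKET_NAME/IMPORT_DIR", integration_id="INTEGRATION_ID"))
# print(resp.id)  # use this ID to track the import
```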
You can start a new import using the Pinecone console. Find the index you want to import into, and click the ellipsis (...) menu > Import data.
Track import progress
The amount of time required for an import depends on various factors, including:
- The number of records to import
- The number of namespaces to import, and the number of records in each
- The total size (in bytes) of the import
Use the `describe_import` operation with the import ID. The response includes the import's `status`, `percent_complete`, and `records_imported`. If the import fails, the response includes an `error` field with the reason for the failure. See the Troubleshooting section for more information.
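A simple way to wait for completion is to poll `describe_import` until the status is terminal. The helper below is a sketch, assuming the response exposes a `status` field and that `Completed`, `Failed`, and `Cancelled` are the terminal states.

```python
import time

def wait_for_import(index, import_id, poll_seconds=30.0):
    # Sketch: poll describe_import until the import reaches a terminal state.
    # `index` is any object exposing describe_import(id=...).
    while True:
        desc = index.describe_import(id=import_id)
        status = desc["status"] if isinstance(desc, dict) else desc.status
        if status in ("Completed", "Failed", "Cancelled"):
            return desc
        time.sleep(poll_seconds)
```

Because imports take at least 10 minutes, a polling interval of 30 seconds or more is reasonable.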
Manage imports
List imports
Use the `list_imports` operation to list all of the recent and ongoing imports. By default, the operation returns up to 100 imports per page. If the `limit` parameter is passed, the operation returns up to that number of imports per page instead. For example, if `limit=3`, up to 3 imports are returned per page. Whenever there are additional imports to return, the response includes a `pagination_token` for fetching the next page of imports.
When using the Python SDK, `list_imports` paginates automatically.
You can view the list of imports for an index in the Pinecone console. Select the index and navigate to the Imports tab.
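For clients that do not auto-paginate, the loop below sketches manual pagination. It assumes a list response shaped like `{"data": [...], "pagination": {"next": "<token>"}}`; this is an assumption about the wire format, not a guaranteed contract.

```python
def list_all_imports(client, limit=100):
    # Sketch: follow pagination_token until no further pages remain.
    # `client` is any object exposing list_imports(limit=..., pagination_token=...).
    imports, token = [], None
    while True:
        page = client.list_imports(limit=limit, pagination_token=token)
        imports.extend(page.get("data", []))
        token = (page.get("pagination") or {}).get("next")
        if not token:
            return imports
```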
Cancel an import
The `cancel_import` operation cancels an import if it is not yet finished. It has no effect if the import is already complete.
You can cancel your import using the Pinecone console. To cancel an ongoing import, select the index you are importing into and navigate to the Imports tab. Then, click the ellipsis (...) menu > Cancel.
Import limits
If your import exceeds these limits, you’ll get an `Exceeds system limit` error. Pinecone can help unblock these imports quickly; contact Pinecone support for assistance.

Metric | Limit |
---|---|
Max namespaces per import | 10,000 |
Max size per namespace | 500 GB |
Max files per import | 100,000 |
Max size per file | 10 GB |
- You cannot import data from an AWS S3 bucket into a Pinecone index hosted on GCP or Azure.
- You cannot import data from S3 Express One Zone storage.
- You cannot import data into an existing namespace.
- When importing data into the `__default__` namespace of an index, the default namespace must be empty.
- Each import takes at least 10 minutes to complete.
- When importing into an index with integrated embedding, records must contain vectors, not text. To add records with text, you must use upsert.
Troubleshooting
When an import fails, you’ll see an error message with the reason for the failure in the Pinecone console or in the response to the `describe_import` operation.

Namespace already exists
You cannot import data into an existing namespace. If your import directory structure contains a folder with the name of an existing namespace in your index, the import will fail with a `Namespace already exists` error. To fix this, rename the folder to use a namespace name that does not yet exist.
No namespace found
In object storage, every Parquet file must be nested under a namespace subdirectory of the import directory. If a Parquet file is not nested under a namespace subdirectory, the import will fail with a `No namespace found` error. To fix this, move the Parquet file to a namespace subdirectory.
Parquet files not found
Each namespace subdirectory must contain Parquet files with data to import. If a namespace subdirectory does not include Parquet files, the import will fail with a `Parquet files not found` error. To fix this, add Parquet files to the namespace subdirectory.
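The two failure modes above can be caught before uploading with a local pre-flight check. This hypothetical helper mirrors the layout rules: no Parquet file at the top level of the import directory, and at least one Parquet file in each namespace subdirectory.

```python
from pathlib import Path

def validate_import_dir(import_dir):
    # Sketch: flag layout problems that would cause "No namespace found"
    # or "Parquet files not found" errors during import.
    problems = []
    root = Path(import_dir)
    for f in sorted(root.glob("*.parquet")):
        problems.append(f"{f.name}: not nested under a namespace subdirectory")
    for sub in sorted(p for p in root.iterdir() if p.is_dir()):
        if not any(sub.glob("*.parquet")):
            problems.append(f"{sub.name}: no Parquet files found")
    return problems
```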
Invalid import URI
In your start import request, the import `uri` must specify only the bucket and import directory containing the namespaces and Parquet files you want to import. If the `uri` also contains a namespaces directory or a Parquet filename, the import will fail with an `Invalid import URI` error. To fix this, remove the namespaces directory or Parquet filename from the `uri`.

Invalid Parquet files
When a Parquet file is not formatted correctly, the import will fail with a message describing the problem, such as a file schema error, a file corruption error, or a type error. These errors are returned for both `CONTINUE` and `ABORT` error modes. To fix these errors, check the specific error message and follow the instructions in the Prepare your data section.

Invalid records
When the `error_mode` is `ABORT` and a file contains invalid records, the import will stop processing on the first invalid record and return an error message identifying the file name and row, followed by an error message identifying the specific issue, such as missing values, invalid metadata, or invalid vectors.

When the `error_mode` is `CONTINUE`, the import will skip individual invalid records. However, if all records are invalid and skipped (for example, the vector type in the file does not match the vector type of the index), the import will fail with a general message. To fix these errors, check the specific error message and follow the instructions in the Prepare your data section.