An import is a long-running operation that asynchronously imports large numbers of records from Parquet files in object storage into a Pinecone serverless index.

This feature is in public preview and available only on Standard and Enterprise plans.

Object storage

To import data from a secure data source, you must create an integration that grants Pinecone access to the data in your object storage. For information on how to add, edit, and delete a storage integration, see Manage storage integrations.

See the following sections, Directory structure and Parquet file format, for information on how to structure your data.

To import data from a public data source, a storage integration is not required.

Directory structure

The directory structure in your object storage determines which Pinecone namespaces your data is imported into. The files for each namespace must be in a separate prefix (or sub-directory), and the namespace cannot be the same as an existing namespace in the index you are importing into.

To import data, specify the URI of the prefix containing the namespace and Parquet files you want to import. For example:

s3://BUCKET_NAME/PATH/TO/NAMESPACES
--/example_namespace/
----0.parquet
----1.parquet
----2.parquet
----3.parquet
----.log
--/example_namespace2/
----4.parquet
----5.parquet
----6.parquet
----7.parquet
----.log

Pinecone then finds all .parquet files inside the namespace prefix and imports them into the namespace. All other file types are ignored.

In the example above, the import is rooted at the top-level URI s3://BUCKET_NAME/PATH/TO/NAMESPACES/. When scanning this prefix, Pinecone finds the namespaces example_namespace and example_namespace2, each containing four .parquet files and one .log file. Pinecone imports the .parquet files and ignores the .log files.
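
Once the directory is structured this way, a single request starts the import. The following is a minimal sketch using the Python SDK's start_import operation; the API key, index host, and integration ID are placeholders, and the integration_id argument can be omitted for public buckets:

from pinecone import Pinecone, ImportErrorMode

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index(host="INDEX_HOST")  # host of the target serverless index

# Start an asynchronous import from the top-level prefix. Each
# sub-prefix (example_namespace, example_namespace2) becomes a new
# namespace in the index.
index.start_import(
    uri="s3://BUCKET_NAME/PATH/TO/NAMESPACES",
    integration_id="YOUR_INTEGRATION_ID",  # omit when importing from a public bucket
    error_mode=ImportErrorMode.CONTINUE,   # or ImportErrorMode.ABORT to stop on the first error
)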

Each import request can import up to 1 TB of data, or 100,000,000 records into a maximum of 100 namespaces, whichever limit is reached first.

Parquet file format

Your data must be stored in Parquet files in your object storage. File names do not matter, but each file must have the .parquet extension.

The Parquet file must contain the following columns:

Column name | Parquet type | Description
------------|--------------|--------------------------------------------------------------------------------
id          | STRING       | Required. The unique identifier for each record.
values      | LIST<FLOAT>  | Required. A list of floating-point values that make up the vector embedding.
metadata    | STRING       | Optional. Additional metadata for each record. To omit from specific rows, use NULL.

The Parquet file cannot contain additional columns.

For example:

id | values                   | metadata
---|--------------------------|-------------------------------------------------------------------------------------------
1  | [ 3.82  2.48 -4.15 ... ] | {"year": 1984, "month": 6, "source": "source1", "title": "Example1", "text": "When ..."}
2  | [ 1.82  3.48 -2.15 ... ] | {"year": 1990, "month": 4, "source": "source2", "title": "Example2", "text": "Who ..."}
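
One way to produce a file with this schema is with pyarrow; the snippet below is a sketch rather than a required method, and writes the two example records shown above:

import json
import pyarrow as pa
import pyarrow.parquet as pq

# Build each column with the types Pinecone expects:
# id -> STRING, values -> LIST<FLOAT>, metadata -> JSON string or NULL.
ids = pa.array(["1", "2"], type=pa.string())
values = pa.array(
    [[3.82, 2.48, -4.15], [1.82, 3.48, -2.15]],
    type=pa.list_(pa.float32()),
)
metadata = pa.array(
    [
        json.dumps({"year": 1984, "month": 6, "source": "source1", "title": "Example1", "text": "When ..."}),
        json.dumps({"year": 1990, "month": 4, "source": "source2", "title": "Example2", "text": "Who ..."}),
    ],
    type=pa.string(),
)

table = pa.Table.from_arrays([ids, values, metadata], names=["id", "values", "metadata"])
pq.write_table(table, "0.parquet")

To omit metadata for a row, pass None in place of its JSON string; pyarrow writes it as NULL.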

Limitations

  • Import is only available for serverless indexes on AWS.
  • You cannot import data from S3 Express One Zone storage.
  • You cannot import data into existing namespaces. You must create a new namespace during the import operation.
  • Each import request can import up to 1 TB of data, or 100,000,000 records into a maximum of 100 namespaces, whichever limit is reached first.
  • Sparse vectors cannot be imported.
  • Every import takes at least 10 minutes to complete, so poll for status rather than waiting on the request (see the sketch below).
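
Because imports are asynchronous, the usual pattern is to start the operation and then poll until it finishes. The following sketch assumes the Python SDK's describe_import operation and an index handle like the one created earlier; the status values and the one-minute poll interval are assumptions, so verify them against your SDK version:

import time

# start_import returns immediately with an id for the long-running operation.
operation = index.start_import(uri="s3://BUCKET_NAME/PATH/TO/NAMESPACES")

while True:
    description = index.describe_import(id=operation.id)
    if description.status in ("Completed", "Failed", "Cancelled"):
        break
    time.sleep(60)  # imports take at least 10 minutes; poll once a minute

print(description.status, description.percent_complete)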

Pricing
