An import is a long-running operation that asynchronously imports large numbers of records from Parquet files in object storage into a Pinecone serverless index.

This feature is in public preview and available only on Standard and Enterprise plans.

Object storage

To import data from a secure data source, you must create an integration to allow Pinecone access to data in your object storage. For information on how to add, edit, and delete a storage integration, see Manage storage integrations.

See the following sections, Directory structure and Parquet file format, for information on how to structure your data.

To import data from a public data source, a storage integration is not required.
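For context, starting an import with the Python SDK looks roughly like the following sketch. The index name, URI, and integration ID are placeholders, and the exact method signature may vary by SDK version; omit integration_id when importing from a public data source.

from pinecone import Pinecone

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("example-index")  # placeholder index name

# Start an asynchronous import from the prefix containing your namespaces.
# error_mode="CONTINUE" skips records that fail to import; "ABORT" stops
# the import on the first failure.
import_op = index.start_import(
    uri="s3://BUCKET_NAME/PATH/TO/NAMESPACES",
    integration_id="YOUR_INTEGRATION_ID",  # placeholder; omit for public buckets
    error_mode="CONTINUE",
)
print(import_op.id)  # keep the import ID to check status later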

Directory structure

The directory structure in your object storage determines which Pinecone namespaces your data is imported into. The files for each namespace must be in a separate prefix (or sub-directory), and each namespace cannot be the same as an existing namespace in the index you are importing into.

To import data, specify the URI of the prefix containing the namespace and Parquet files you want to import. For example:

s3://BUCKET_NAME/PATH/TO/NAMESPACES
--/example_namespace/
----0.parquet
----1.parquet
----2.parquet
----3.parquet
----.log
--/example_namespace2/
----4.parquet
----5.parquet
----6.parquet
----7.parquet
----.log

Pinecone then finds all .parquet files inside the namespace prefix and imports them into the namespace. All other file types are ignored.

In the example above, the import is rooted at the URI s3://BUCKET_NAME/PATH/TO/NAMESPACES/. When scanning this directory, Pinecone finds two namespaces, example_namespace and example_namespace2, each containing four .parquet files and one .log file. Pinecone imports the .parquet files and ignores the .log files.
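To preview which files Pinecone would pick up, you can list the objects under the import prefix yourself. The following sketch uses boto3 (an assumption; any S3 client works) with placeholder bucket and prefix names, and groups .parquet files by their immediate sub-prefix, mirroring the scanning behavior described above.

import boto3

s3 = boto3.client("s3")
prefix = "PATH/TO/NAMESPACES/"  # placeholder import prefix

# Walk every object under the prefix; keep only .parquet files and group
# them by their first path segment, which becomes the namespace.
namespaces = {}
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="BUCKET_NAME", Prefix=prefix):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(".parquet"):
            continue  # .log and other file types are ignored
        namespace = key[len(prefix):].split("/", 1)[0]
        namespaces.setdefault(namespace, []).append(key)

for namespace, files in sorted(namespaces.items()):
    print(namespace, len(files), "parquet files")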

Each import request can import up to 1 TB of data or 100,000,000 records into a maximum of 100 namespaces, whichever limit is reached first.

Parquet file format

Your data must be stored in a Parquet file in your object storage. The name of the file does not matter, but it must have the .parquet extension.

Dense index

When importing into a dense index, the Parquet file must contain the following columns:

Column name | Parquet type | Description
------------|--------------|------------
id          | STRING       | Required. The unique identifier for each record.
values      | LIST<FLOAT>  | Required. A list of floating-point values that make up the dense vector embedding.
metadata    | STRING       | Optional. Additional metadata for each record. To omit from specific rows, use NULL.

The Parquet file cannot contain additional columns.

For example:

id | values                   | metadata
---|--------------------------|------------------------------------------------------------------------------------------
1  | [ 3.82  2.48 -4.15 ... ] | {"year": 1984, "month": 6, "source": "source1", "title": "Example1", "text": "When ..."}
2  | [ 1.82  3.48 -2.15 ... ] | {"year": 1990, "month": 4, "source": "source2", "title": "Example2", "text": "Who ..."}
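A conforming dense-index file can be written with pyarrow, for example. This is a minimal sketch under the schema above; the three-dimensional vectors and metadata fields are illustrative only.

import json

import pyarrow as pa
import pyarrow.parquet as pq

# Build a table with exactly the three allowed columns: id, values, metadata.
table = pa.table({
    "id": pa.array(["1", "2"], type=pa.string()),
    "values": pa.array(
        [[3.82, 2.48, -4.15], [1.82, 3.48, -2.15]],  # illustrative 3-dim embeddings
        type=pa.list_(pa.float32()),
    ),
    "metadata": pa.array(
        [
            json.dumps({"year": 1984, "month": 6, "source": "source1"}),
            None,  # metadata is optional; NULL omits it for this row
        ],
        type=pa.string(),
    ),
})

pq.write_table(table, "0.parquet")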

Sparse index

When importing into a sparse index, the Parquet file must contain the following columns:

Column name   | Parquet type              | Description
--------------|---------------------------|------------
id            | STRING                    | Required. The unique identifier for each record.
sparse_values | LIST<INT> and LIST<FLOAT> | Required. A list of integer values (sparse indices) and a list of floating-point values (sparse values) that make up the sparse vector embedding.
metadata      | STRING                    | Optional. Additional metadata for each record. To omit from specific rows, use NULL.

The Parquet file cannot contain additional columns.

For example:

id | sparse_values                                                                                       | metadata
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
1  | {"indices": [ 822745112 1009084850 1221765879 ... ], "values": [1.7958984 0.41577148 2.828125 ...]} | {"year": 1984, "month": 6, "source": "source1", "title": "Example1", "text": "When ..."}
2  | {"indices": [ 504939989 1293001993 3201939490 ... ], "values": [1.4383747 0.72849722 1.384775 ...]} | {"year": 1990, "month": 4, "source": "source2", "title": "Example2", "text": "Who ..."}

Limitations

  • Import is only available for serverless indexes on AWS.
  • You cannot import data from S3 Express One Zone storage.
  • You cannot import data into existing namespaces. You must create a new namespace during the import operation.
  • Each import request can import up to 1 TB of data into a maximum of 100 namespaces. You cannot import more than 10 GB per file or more than 100,000 files per import.
  • Every import will take at least 10 minutes to complete.
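Because imports are long-running and asynchronous, you typically poll for completion rather than wait on the request. The following sketch reuses the index object from the earlier example; the import ID is a placeholder, and the exact status strings and field names may vary by SDK version.

import time

# Poll the import until it reaches a terminal state.
while True:
    status = index.describe_import(id="101")  # placeholder import ID
    print(status.status, status.percent_complete)
    if status.status in ("Completed", "Failed", "Cancelled"):
        break
    time.sleep(60)  # imports take at least 10 minutes, so poll slowly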

Pricing
