Creating a dataset
Dependencies
The Pinecone datasets project uses poetry for dependency management and supports Python versions 3.8+.
To install poetry, run the following command from the project root directory:
Shell
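The command itself is not reproduced here; a minimal sketch, assuming poetry is installed from PyPI and that the project defines a dev dependency group:

pip install poetry
poetry install --with dev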
Dataset metadata
To create a public dataset, you may need to generate dataset metadata.

Example

The following example creates a metadata object meta containing metadata for a dataset test_dataset.
Python
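The snippet is not reproduced here; a sketch using the DatasetMetadata model from the pinecone_datasets package (the import path and the field values are illustrative assumptions):

from pinecone_datasets.catalog import DatasetMetadata

# Metadata describing the dataset; values below are placeholders.
meta = DatasetMetadata(
    name="test_dataset",
    created_at="2023-06-10 10:30:00",
    documents=2,                    # number of document rows
    queries=2,                      # number of query rows
    source="manual",
    bucket="LOCAL",
    task="unittests",
    dense_model={"name": "bert", "dimension": 3},
    sparse_model={"name": "bm25"},
)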
Viewing dataset schema
To see the complete schema, run the following command:
Python
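The command is not reproduced here; assuming the meta object from the previous example is a pydantic model, a likely call is its schema helper:

# Print the full metadata schema as a JSON-schema dictionary.
meta.schema()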
Running tests
To run tests locally, run the following command:
Shell
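The command is not reproduced here; a likely invocation, assuming pytest is the project's test runner and is executed through poetry:

poetry run pytest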
Uploading and listing a dataset
Pinecone datasets can load a dataset from any storage bucket where it has access, using the default access controls for S3, GCS, or local permissions. Pinecone datasets expects data to be uploaded with the following directory structure:

Figure 1: Expected directory structure for Pinecone datasets

├── base_path                     # path to where all datasets
│   ├── dataset_id                # name of dataset
│   │   ├── metadata.json         # dataset metadata (optional, only for listed)
│   │   ├── documents             # dataset documents
│   │   │   ├── file1.parquet
│   │   │   └── file2.parquet
│   │   ├── queries               # dataset queries
│   │   │   ├── file1.parquet
│   │   │   └── file2.parquet
└── ...

Pinecone datasets scans storage and lists every dataset that has a metadata file.

Example

The following shows the format of an example S3 bucket address for a dataset's metadata file:
s3://my-bucket/my-dataset/metadata.json
Using your own dataset
By default, the Pinecone SDK uses Pinecone's public datasets bucket on GCS. You can use your own bucket by setting the PINECONE_DATASETS_ENDPOINT environment variable.
Example
The following export command changes the default dataset storage endpoint to gs://my-bucket. Calling list_datasets or load_dataset now scans that bucket and lists all of its datasets.
Python
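The snippet is not reproduced here; a sketch of the idea, setting the endpoint from Python (the shell equivalent is an export of the same variable) and then listing datasets — the bucket name is a placeholder:

import os

# Equivalent shell command: export PINECONE_DATASETS_ENDPOINT="gs://my-bucket"
os.environ["PINECONE_DATASETS_ENDPOINT"] = "gs://my-bucket"

from pinecone_datasets import list_datasets, load_dataset

# list_datasets and load_dataset now scan gs://my-bucket instead of the
# default public bucket.
print(list_datasets())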
Use s3:// as a prefix to your bucket address to access an S3 bucket.
Authenticating to your own storage bucket
Pinecone Datasets supports GCS and S3 storage buckets, using default authentication as provided by the fsspec implementation: gcsfs for GCS and s3fs for AWS.

To authenticate to an AWS S3 bucket using the key/secret method, follow these steps:

- Set a new endpoint by setting the environment variable PINECONE_DATASETS_ENDPOINT to the S3 address for your bucket.
Shell
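A plausible form of the command, with a placeholder bucket name:

export PINECONE_DATASETS_ENDPOINT="s3://my-bucket"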
- Use the key and secret parameters to pass your credentials to the list_datasets and load_dataset functions.
Python
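The snippet is not reproduced here; a sketch that reads the credentials from environment variables (the variable names and dataset name are placeholders):

import os

from pinecone_datasets import list_datasets, load_dataset

# Pass AWS credentials explicitly through the key/secret parameters.
datasets = list_datasets(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
)

ds = load_dataset(
    "test_dataset",  # placeholder dataset name
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
)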
Accessing a non-listed dataset
To access a non-listed dataset, load it directly using the Dataset constructor.

Example

The following loads the dataset non-listed-dataset.
Python
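The snippet is not reproduced here; a minimal sketch, assuming the Dataset class is importable from the top-level package and takes the dataset name directly:

from pinecone_datasets import Dataset

# Load a dataset that has no metadata.json entry in the catalog.
dataset = Dataset("non-listed-dataset")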
What’s next
- Learn more about using datasets with the Pinecone Python library