Use public Pinecone datasets
This page lists the catalog of public Pinecone datasets and shows you how to work with them using the Python pinecone-datasets library.
To create, upload, and list your own dataset for use by other Pinecone users, see Creating datasets.
Available public datasets
name | documents | source | bucket | task | dense model (dimensions) | sparse model |
---|---|---|---|---|---|---|
ANN_DEEP1B_d96_angular | 9,990,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_DEEP1B_d96_angular | ANN | ANN benchmark (96) | None |
ANN_Fashion-MNIST_d784_euclidean | 60,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_Fashion-MNIST_d784_euclidean | ANN | ANN benchmark (784) | None |
ANN_GIST_d960_euclidean | 1,000,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GIST_d960_euclidean | ANN | ANN benchmark (960) | None |
ANN_GloVe_d100_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d100_angular | ANN | ANN benchmark (100) | None |
ANN_GloVe_d200_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d200_angular | ANN | ANN benchmark (200) | None |
ANN_GloVe_d25_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d25_angular | ANN | ANN benchmark (25) | None |
ANN_GloVe_d50_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d50_angular | ANN | ANN benchmark (50) | None |
ANN_GloVe_d64_angular | 292,385 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d64_angular | ANN | ANN benchmark (64) | None |
ANN_MNIST_d784_euclidean | 60,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_MNIST_d784_euclidean | ANN | ANN benchmark (784) | None |
ANN_NYTimes_d256_angular | 290,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_NYTimes_d256_angular | ANN | ANN benchmark (256) | None |
ANN_SIFT1M_d128_euclidean | 1,000,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_SIFT1M_d128_euclidean | ANN | ANN benchmark (128) | None |
amazon_toys_quora_all-MiniLM-L6-bm25 | 10,000 | https://www.kaggle.com/datasets/PromptCloudHQ/toy-products-on-amazon | gs://pinecone-datasets-dev/amazon_toys_quora_all-MiniLM-L6-bm25 | QA | sentence-transformers/all-MiniLM-L6-v2 (384) | bm25 |
it-threat-data-test | 1,042,965 | https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv | | | it_threat_model.model (128) | None |
it-threat-data-train | 1,042,867 | https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv | | | it_threat_model.model (128) | None |
langchain-python-docs-text-embedding-ada-002 | 3,476 | https://huggingface.co/datasets/jamescalam/langchain-docs-23-06-27 | | | text-embedding-ada-002 (1536) | None |
movielens-user-ratings | 970,582 | https://huggingface.co/datasets/pinecone/movielens-recent-ratings | gs://pinecone-datasets-dev/movielens-user-ratings | classification | pinecone/movie-recommender-user-model (32) | None |
msmarco-v1-bm25-allMiniLML6V2 | 8,841,823 | | | | all-minilm-l6-v2 (384) | bm25-k0.9-b0.4 |
quora_all-MiniLM-L6-bm25-100K | 100,000 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-bm25 | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
quora_all-MiniLM-L6-bm25 | 522,931 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-bm25 | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
quora_all-MiniLM-L6-v2_Splade-100K | 100,000 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-v2_Splade | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
quora_all-MiniLM-L6-v2_Splade | 522,931 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-v2_Splade | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
squad-text-embedding-ada-002 | 18,891 | https://huggingface.co/datasets/squad | | | text-embedding-ada-002 (1536) | None |
wikipedia-simple-text-embedding-ada-002-100K | 100,000 | wikipedia | gs://pinecone-datasets-dev/wikipedia-simple-text-embedding-ada-002-100K | multiple | text-embedding-ada-002 (1536) | None |
wikipedia-simple-text-embedding-ada-002 | 283,945 | wikipedia | gs://pinecone-datasets-dev/wikipedia-simple-text-embedding-ada-002 | multiple | text-embedding-ada-002 (1536) | None |
youtube-transcripts-text-embedding-ada-002 | 38,950 | youtube | gs://pinecone-datasets-dev/youtube-transcripts-text-embedding-ada-002 | multiple | text-embedding-ada-002 (1536) | None |
Install the `pinecone-datasets` library
Pinecone provides a Python library for working with public Pinecone datasets. To install the library, run the following command:
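```shell
pip install pinecone-datasets
```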
List public datasets
To list the available public Pinecone datasets as an object, use the `list_datasets()` method:
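```python
from pinecone_datasets import list_datasets

list_datasets()
# ['ANN_DEEP1B_d96_angular', 'ANN_Fashion-MNIST_d784_euclidean', ...]  (illustrative, truncated output)
```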
To list the available datasets as a pandas DataFrame, pass the `as_df=True` argument:
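```python
from pinecone_datasets import list_datasets

# Returns the catalog as a pandas DataFrame instead of a list
list_datasets(as_df=True)
```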
Load a dataset
To load a dataset into memory, use the `load_dataset()` method. You can use this method to load a Pinecone public dataset or your own dataset.
Example
The following example loads the `quora_all-MiniLM-L6-bm25` Pinecone public dataset:
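```python
from pinecone_datasets import load_dataset

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

# Preview the first few documents (id, values, sparse_values, metadata, blob)
dataset.head()
```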
Iterate over datasets
You can iterate over the vector data in a dataset using the `iter_documents()` method. This is useful for upserting or updating vectors, automating benchmarking, and other tasks.
Example
The following example loads the `quora_all-MiniLM-L6-bm25` dataset, iterates over its documents in batches of 100, and upserts the vector data to a Pinecone serverless index named `example-index`.
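A minimal sketch, assuming the Pinecone Python SDK is installed and that `YOUR_API_KEY`, the cloud, and the region are replaced with your own values:

```python
from pinecone import Pinecone, ServerlessSpec
from pinecone_datasets import load_dataset

# Load the public dataset
dataset = load_dataset("quora_all-MiniLM-L6-bm25")

pc = Pinecone(api_key="YOUR_API_KEY")

# Create a serverless index whose dimension matches the dataset's dense model (384)
pc.create_index(
    name="example-index",
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)

index = pc.Index("example-index")

# Upsert the documents in batches of 100
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(vectors=batch)
```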
Upsert a dataset as a dataframe
To quickly ingest data when using the Python SDK, use the `upsert_from_dataframe` method. It includes retry logic and a `batch_size` parameter, and it performs especially well with Parquet file datasets.
The following example upserts the `quora_all-MiniLM-L6-bm25` dataset as a dataframe.
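A minimal sketch, assuming an existing index named `example-index` with dimension 384; the raw `blob` column is dropped (if present) because only the `id`, `values`, `sparse_values`, and `metadata` columns map to vector records:

```python
from pinecone import Pinecone
from pinecone_datasets import load_dataset

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

pc = Pinecone(api_key="YOUR_API_KEY")
index = pc.Index("example-index")

# upsert_from_dataframe batches the rows and retries failed batches
index.upsert_from_dataframe(
    dataset.documents.drop(columns=["blob"], errors="ignore"),
    batch_size=100,
)
```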
See also