This page lists the catalog of public Pinecone datasets and shows you how to work with them using the Python pinecone-datasets library.

To create, upload, and list your own dataset for use by other Pinecone users, see Creating datasets.

Available public datasets

| name | documents | source | bucket | task | dense model (dimensions) | sparse model |
| --- | --- | --- | --- | --- | --- | --- |
| ANN_DEEP1B_d96_angular | 9,990,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_DEEP1B_d96_angular | ANN | ANN benchmark (96) | None |
| ANN_Fashion-MNIST_d784_euclidean | 60,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_Fashion-MNIST_d784_euclidean | ANN | ANN benchmark (784) | None |
| ANN_GIST_d960_euclidean | 1,000,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GIST_d960_euclidean | ANN | ANN benchmark (960) | None |
| ANN_GloVe_d100_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d100_angular | ANN | ANN benchmark (100) | None |
| ANN_GloVe_d200_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d200_angular | ANN | ANN benchmark (200) | None |
| ANN_GloVe_d25_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d25_angular | ANN | ANN benchmark (25) | None |
| ANN_GloVe_d50_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d50_angular | ANN | ANN benchmark (50) | None |
| ANN_GloVe_d64_angular | 292,385 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d64_angular | ANN | ANN benchmark (65) | None |
| ANN_MNIST_d784_euclidean | 60,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_MNIST_d784_euclidean | ANN | ANN benchmark (784) | None |
| ANN_NYTimes_d256_angular | 290,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_NYTimes_d256_angular | ANN | ANN benchmark (256) | None |
| ANN_SIFT1M_d128_euclidean | 1,000,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_SIFT1M_d128_euclidean | ANN | ANN benchmark (128) | None |
| amazon_toys_quora_all-MiniLM-L6-bm25 | 10,000 | https://www.kaggle.com/datasets/PromptCloudHQ/toy-products-on-amazon | gs://pinecone-datasets-dev/amazon_toys_quora_all-MiniLM-L6-bm25 | QA | sentence-transformers/all-MiniLM-L6-v2 (384) | bm25 |
| it-threat-data-test | 1,042,965 | https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv | | | it_threat_model.model (128) | None |
| it-threat-data-train | 1,042,867 | https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv | | | it_threat_model.model (128) | None |
| langchain-python-docs-text-embedding-ada-002 | 3476 | https://huggingface.co/datasets/jamescalam/langchain-docs-23-06-27 | | | text-embedding-ada-002 (1536) | None |
| movielens-user-ratings | 970,582 | https://huggingface.co/datasets/pinecone/movielens-recent-ratings | gs://pinecone-datasets-dev/movielens-user-ratings | classification | pinecone/movie-recommender-user-model (32) | None |
| msmarco-v1-bm25-allMiniLML6V2 | 8,841,823 | | | | all-minilm-l6-v2 (384) | bm25-k0.9-b0.4 |
| quora_all-MiniLM-L6-bm25-100K | 100,000 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-bm25 | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
| quora_all-MiniLM-L6-bm25 | 522,931 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-bm25 | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
| quora_all-MiniLM-L6-v2_Splade-100K | 100,000 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-v2_Splade | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
| quora_all-MiniLM-L6-v2_Splade | 522,931 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-v2_Splade | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
| squad-text-embedding-ada-002 | 18,891 | https://huggingface.co/datasets/squad | | | text-embedding-ada-002 (1536) | None |
| wikipedia-simple-text-embedding-ada-002-100K | 100,000 | wikipedia | gs://pinecone-datasets-dev/wikipedia-simple-text-embedding-ada-002-100K | multiple | text-embedding-ada-002 (1536) | None |
| wikipedia-simple-text-embedding-ada-002 | 283,945 | wikipedia | gs://pinecone-datasets-dev/wikipedia-simple-text-embedding-ada-002 | multiple | text-embedding-ada-002 (1536) | None |
| youtube-transcripts-text-embedding-ada-002 | 38,950 | youtube | gs://pinecone-datasets-dev/youtube-transcripts-text-embedding-ada-002 | multiple | text-embedding-ada-002 (1536) | None |
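
The ANN dataset names in the catalog appear to follow the pattern ANN_<source>_d<dimension>_<distance>. As a minimal sketch, assuming that pattern holds, the helper below extracts the dimension and a matching index metric from a name. The function parse_ann_name, the AnnDatasetInfo type, and the mapping of "angular" to the cosine metric are illustrative assumptions, not part of the pinecone-datasets library.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern, inferred from the catalog names: ANN_<source>_d<dimension>_<distance>
_ANN_NAME = re.compile(
    r"^ANN_(?P<source>.+)_d(?P<dim>\d+)_(?P<distance>angular|euclidean)$"
)

# Assumption: the ann-benchmarks "angular" distance corresponds to the cosine metric.
_METRIC_BY_DISTANCE = {"angular": "cosine", "euclidean": "euclidean"}


class AnnDatasetInfo(NamedTuple):
    source: str
    dimension: int
    metric: str


def parse_ann_name(name: str) -> Optional[AnnDatasetInfo]:
    """Return the parts encoded in an ANN dataset name, or None if it does not match."""
    match = _ANN_NAME.match(name)
    if match is None:
        return None
    return AnnDatasetInfo(
        source=match["source"],
        dimension=int(match["dim"]),
        metric=_METRIC_BY_DISTANCE[match["distance"]],
    )


print(parse_ann_name("ANN_SIFT1M_d128_euclidean"))
# AnnDatasetInfo(source='SIFT1M', dimension=128, metric='euclidean')
print(parse_ann_name("quora_all-MiniLM-L6-bm25"))
# None
```

Treat the parsed dimension as a hint only. The catalog lists ANN_GloVe_d64_angular with 65 dimensions, so the dimensions column is the safer source when creating an index.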

Install the pinecone-datasets library

Pinecone provides a Python library for working with public Pinecone datasets. To install the library, run the following command:

Shell
pip install pinecone-datasets

List public datasets

To list the names of the available public Pinecone datasets as a Python list, call list_datasets():

Python
from pinecone_datasets import list_datasets

list_datasets()

# Response:
# ['ANN_DEEP1B_d96_angular', 'ANN_Fashion-MNIST_d784_euclidean', 'ANN_GIST_d960_euclidean', 'ANN_GloVe_d100_angular', 'ANN_GloVe_d200_angular', 'ANN_GloVe_d25_angular', 'ANN_GloVe_d50_angular', 'ANN_LastFM_d64_angular', 'ANN_MNIST_d784_euclidean', 'ANN_NYTimes_d256_angular', 'ANN_SIFT1M_d128_euclidean', 'amazon_toys_quora_all-MiniLM-L6-bm25', 'it-threat-data-test', 'it-threat-data-train', 'langchain-python-docs-text-embedding-ada-002', 'movielens-user-ratings', 'msmarco-v1-bm25-allMiniLML6V2', 'quora_all-MiniLM-L6-bm25-100K', 'quora_all-MiniLM-L6-bm25', 'quora_all-MiniLM-L6-v2_Splade-100K', 'quora_all-MiniLM-L6-v2_Splade', 'squad-text-embedding-ada-002', 'wikipedia-simple-text-embedding-ada-002-100K', 'wikipedia-simple-text-embedding-ada-002', 'youtube-transcripts-text-embedding-ada-002']
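
Because the response is a plain list of strings, ordinary Python filtering is enough to narrow it down. The following is a minimal sketch; filter_names is a hypothetical helper, not a pinecone-datasets function, and the catalog list is a hand-copied subset of the response above so that the snippet runs without network access.

```python
from typing import Iterable, List


def filter_names(names: Iterable[str], *, contains: str = "", suffix: str = "") -> List[str]:
    """Keep names that include `contains` and end with `suffix` (case-insensitive)."""
    contains, suffix = contains.lower(), suffix.lower()
    return [
        name for name in names
        if contains in name.lower() and name.lower().endswith(suffix)
    ]


# A few names copied from the list_datasets() response above.
catalog = [
    "ANN_SIFT1M_d128_euclidean",
    "quora_all-MiniLM-L6-bm25-100K",
    "quora_all-MiniLM-L6-bm25",
    "wikipedia-simple-text-embedding-ada-002-100K",
]

print(filter_names(catalog, contains="quora"))
# ['quora_all-MiniLM-L6-bm25-100K', 'quora_all-MiniLM-L6-bm25']
print(filter_names(catalog, suffix="-100K"))
# ['quora_all-MiniLM-L6-bm25-100K', 'wikipedia-simple-text-embedding-ada-002-100K']
```

In practice you would pass the result of list_datasets() instead of the hand-copied list.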

To list the available datasets as a pandas DataFrame, pass the as_df=True argument:

Python
from pinecone_datasets import list_datasets

list_datasets(as_df=True)

# Response:
#                                             name                    created_at  documents  ...  description  tags  args
# 0                         ANN_DEEP1B_d96_angular    2023-03-10 14:17:01.481785    9990000  ...         None  None  None
# 1               ANN_Fashion-MNIST_d784_euclidean    2023-03-10 14:17:01.481785      60000  ...         None  None  None
# 2                        ANN_GIST_d960_euclidean    2023-03-10 14:17:01.481785    1000000  ...         None  None  None
# 3                         ANN_GloVe_d100_angular    2023-03-10 14:17:01.481785    1183514  ...         None  None  None
# 4                         ANN_GloVe_d200_angular    2023-03-10 14:17:01.481785    1183514  ...         None  None  None
# 5                          ANN_GloVe_d25_angular    2023-03-10 14:17:01.481785    1183514  ...         None  None  None
# ...

Load a dataset

To load a dataset into memory, use the load_dataset() method. You can load a Pinecone public dataset or your own dataset.

Example

The following example loads the quora_all-MiniLM-L6-bm25 Pinecone public dataset.

Python
from pinecone_datasets import list_datasets, load_dataset

list_datasets()
# ["quora_all-MiniLM-L6-bm25", ... ]

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

dataset.head()

# Response:
# ┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐
# │ id  ┆ values                    ┆ sparse_values                       ┆ metadata          ┆ blob │
# │     ┆                           ┆                                     ┆                   ┆      │
# │ str ┆ list[f32]                 ┆ struct[2]                           ┆ struct[3]         ┆      │
# ╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡
# │ 0   ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,"other"} ┆ .... │
# │     ┆ 0.0060...                 ┆                                     ┆                   ┆      │
# └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘
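
Before upserting a loaded dataset, it can help to confirm that the dense vectors match the dimension of the target index. The following is a minimal sketch in plain Python, assuming each document is available as a dict with id and values keys that mirror the columns in the head() output above. The helper find_dimension_mismatches and the toy records are hypothetical and are not part of pinecone-datasets.

```python
from typing import Any, Dict, Iterable, List


def find_dimension_mismatches(
    records: Iterable[Dict[str, Any]], expected_dim: int
) -> List[str]:
    """Return the ids of records whose dense vector length differs from expected_dim."""
    return [
        str(record["id"])
        for record in records
        if len(record["values"]) != expected_dim
    ]


# Hypothetical toy records with 3-dimensional vectors, for illustration only.
records = [
    {"id": "0", "values": [0.1, 0.2, 0.3]},
    {"id": "1", "values": [0.4, 0.5]},
]

print(find_dimension_mismatches(records, expected_dim=3))  # ['1']
```

For the quora_all-MiniLM-L6-bm25 dataset, the expected dimension would be 384, as listed in the catalog.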

Iterate over datasets

You can iterate over the vector data in a dataset using the iter_documents() method, for example to upsert or update vectors or to automate benchmarking.

Example

The following example loads the quora_all-MiniLM-L6-bm25 dataset, iterates over its documents in batches of 100, and upserts each batch to a Pinecone serverless index named example-index.

Python
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
from pinecone_datasets import load_dataset

pc = Pinecone(api_key="API_KEY")

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

# The index dimension must match the dataset's dense model (384 for this dataset).
pc.create_index(
  name="example-index",
  dimension=384,
  metric="cosine",
  spec=ServerlessSpec(
    cloud="aws",
    region="us-east-1"
  )
)

index = pc.Index("example-index")

# Upsert the documents in batches of 100.
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(vectors=batch)
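
To make the batching behavior concrete, here is a minimal sketch of the general pattern: splitting any iterable into consecutive lists of at most batch_size items. It illustrates the idea only and is not the implementation used by iter_documents(); the batched helper is a hypothetical name.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of up to batch_size items; the last may be shorter."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 250 items in batches of 100 produce two full batches and one partial batch.
print([len(batch) for batch in batched(range(250), batch_size=100)])  # [100, 100, 50]
```

Smaller batches keep each request light, while larger batches reduce the number of round trips; 100 is the value used in the example above.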

Upsert a dataset as a dataframe

To quickly ingest data when using the Python SDK, use the upsert_from_dataframe method. The method includes retry logic and a batch_size parameter, and performs especially well with Parquet datasets.

The following example upserts the quora_all-MiniLM-L6-bm25 dataset as a dataframe.

Python
from pinecone import Pinecone, ServerlessSpec
from pinecone_datasets import load_dataset

pc = Pinecone(api_key="API_KEY")

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

pc.create_index(
  name="example-index",
  dimension=384,
  metric="cosine",
  spec=ServerlessSpec(
    cloud="aws",
    region="us-east-1"
  )
)

# To get the unique host for an index, 
# see https://docs.pinecone.io/guides/data/target-an-index
index = pc.Index(host="INDEX_HOST")

index.upsert_from_dataframe(dataset.documents.drop(columns=["blob"]))
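
The retry logic mentioned above can be pictured as follows. This is a minimal sketch of a generic retry loop with exponential backoff, not the SDK's actual implementation: upsert_with_retry, its parameters, and the flaky_upsert stand-in are all hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def upsert_with_retry(
    upsert: Callable[[Sequence[T]], object],
    batch: Sequence[T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call upsert(batch), retrying on any exception with exponential backoff.

    Returns the number of attempts used; re-raises the last error if all fail.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            upsert(batch)
            return attempt
        except Exception:
            if attempt == max_attempts:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")


# Stand-in for index.upsert that fails twice before succeeding (illustration only).
attempts = {"count": 0}


def flaky_upsert(batch):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient failure")


print(upsert_with_retry(flaky_upsert, ["a", "b"], sleep=lambda _: None))  # 3
```

The sleep parameter is injectable so that the backoff can be skipped in tests; in real use the default time.sleep applies.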

See also