This page lists the catalog of public Pinecone datasets and shows you how to work with them using the Python pinecone-datasets library. To create, upload, and list your own dataset for use by other Pinecone users, see Creating datasets.

Available public datasets

| name | documents | source | bucket | task | dense model (dimensions) | sparse model |
| --- | --- | --- | --- | --- | --- | --- |
| ANN_DEEP1B_d96_angular | 9,990,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_DEEP1B_d96_angular | ANN | ANN benchmark (96) | None |
| ANN_Fashion-MNIST_d784_euclidean | 60,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_Fashion-MNIST_d784_euclidean | ANN | ANN benchmark (784) | None |
| ANN_GIST_d960_euclidean | 1,000,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GIST_d960_euclidean | ANN | ANN benchmark (960) | None |
| ANN_GloVe_d100_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d100_angular | ANN | ANN benchmark (100) | None |
| ANN_GloVe_d200_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d200_angular | ANN | ANN benchmark (200) | None |
| ANN_GloVe_d25_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d25_angular | ANN | ANN benchmark (25) | None |
| ANN_GloVe_d50_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d50_angular | ANN | ANN benchmark (50) | None |
| ANN_GloVe_d64_angular | 292,385 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d64_angular | ANN | ANN benchmark (65) | None |
| ANN_MNIST_d784_euclidean | 60,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_MNIST_d784_euclidean | ANN | ANN benchmark (784) | None |
| ANN_NYTimes_d256_angular | 290,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_NYTimes_d256_angular | ANN | ANN benchmark (256) | None |
| ANN_SIFT1M_d128_euclidean | 1,000,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_SIFT1M_d128_euclidean | ANN | ANN benchmark (128) | None |
| amazon_toys_quora_all-MiniLM-L6-bm25 | 10,000 | https://www.kaggle.com/datasets/PromptCloudHQ/toy-products-on-amazon | gs://pinecone-datasets-dev/amazon_toys_quora_all-MiniLM-L6-bm25 | QA | sentence-transformers/all-MiniLM-L6-v2 (384) | bm25 |
| it-threat-data-test | 1,042,965 | https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv | | | it_threat_model.model (128) | None |
| it-threat-data-train | 1,042,867 | https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv | | | it_threat_model.model (128) | None |
| langchain-python-docs-text-embedding-ada-002 | 3,476 | https://huggingface.co/datasets/jamescalam/langchain-docs-23-06-27 | | | text-embedding-ada-002 (1536) | None |
| movielens-user-ratings | 970,582 | https://huggingface.co/datasets/pinecone/movielens-recent-ratings | gs://pinecone-datasets-dev/movielens-user-ratings | classification | pinecone/movie-recommender-user-model (32) | None |
| msmarco-v1-bm25-allMiniLML6V2 | 8,841,823 | | | | all-minilm-l6-v2 (384) | bm25-k0.9-b0.4 |
| quora_all-MiniLM-L6-bm25-100K | 100,000 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-bm25 | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
| quora_all-MiniLM-L6-bm25 | 522,931 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-bm25 | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
| quora_all-MiniLM-L6-v2_Splade-100K | 100,000 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-v2_Splade | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
| quora_all-MiniLM-L6-v2_Splade | 522,931 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-v2_Splade | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
| squad-text-embedding-ada-002 | 18,891 | https://huggingface.co/datasets/squad | | | text-embedding-ada-002 (1536) | None |
| wikipedia-simple-text-embedding-ada-002-100K | 100,000 | wikipedia | gs://pinecone-datasets-dev/wikipedia-simple-text-embedding-ada-002-100K | multiple | text-embedding-ada-002 (1536) | None |
| wikipedia-simple-text-embedding-ada-002 | 283,945 | wikipedia | gs://pinecone-datasets-dev/wikipedia-simple-text-embedding-ada-002 | multiple | text-embedding-ada-002 (1536) | None |
| youtube-transcripts-text-embedding-ada-002 | 38,950 | youtube | gs://pinecone-datasets-dev/youtube-transcripts-text-embedding-ada-002 | multiple | text-embedding-ada-002 (1536) | None |
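
The number in parentheses in the dense model (dimensions) column is the value an index needs as its dimension when you upsert that dataset. As a minimal sketch, the helper below splits such a cell into a model name and an integer dimension. The function name parse_dense_model is hypothetical and is not part of the pinecone-datasets library.

```python
import re


def parse_dense_model(cell: str) -> tuple[str, int]:
    """Split a 'dense model (dimensions)' cell into (model_name, dimension)."""
    match = re.fullmatch(r"\s*(.+?)\s*\((\d+)\)\s*", cell)
    if match is None:
        raise ValueError(f"unrecognized dense model cell: {cell!r}")
    return match.group(1), int(match.group(2))


# Cell copied from the amazon_toys_quora_all-MiniLM-L6-bm25 row of the catalog
model, dimension = parse_dense_model("sentence-transformers/all-MiniLM-L6-v2 (384)")
print(model, dimension)
```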

Install the pinecone-datasets library

Pinecone provides a Python library for working with public Pinecone datasets. To install the library, run the following command:
Python
pip install pinecone-datasets

List public datasets

To list the names of the available public Pinecone datasets as a Python list, use the list_datasets() function:
Python
from pinecone_datasets import list_datasets

list_datasets()

# Response:
# ['ANN_DEEP1B_d96_angular', 'ANN_Fashion-MNIST_d784_euclidean', 'ANN_GIST_d960_euclidean', 'ANN_GloVe_d100_angular', 'ANN_GloVe_d200_angular', 'ANN_GloVe_d25_angular', 'ANN_GloVe_d50_angular', 'ANN_LastFM_d64_angular', 'ANN_MNIST_d784_euclidean', 'ANN_NYTimes_d256_angular', 'ANN_SIFT1M_d128_euclidean', 'amazon_toys_quora_all-MiniLM-L6-bm25', 'it-threat-data-test', 'it-threat-data-train', 'langchain-python-docs-text-embedding-ada-002', 'movielens-user-ratings', 'msmarco-v1-bm25-allMiniLML6V2', 'quora_all-MiniLM-L6-bm25-100K', 'quora_all-MiniLM-L6-bm25', 'quora_all-MiniLM-L6-v2_Splade-100K', 'quora_all-MiniLM-L6-v2_Splade', 'squad-text-embedding-ada-002', 'wikipedia-simple-text-embedding-ada-002-100K', 'wikipedia-simple-text-embedding-ada-002', 'youtube-transcripts-text-embedding-ada-002']
To list the available datasets as a pandas DataFrame, pass the as_df=True argument:
Python
from pinecone_datasets import list_datasets

list_datasets(as_df=True)

# Response:
#                                             name                    created_at  documents  ...  description  tags  args
# 0                         ANN_DEEP1B_d96_angular    2023-03-10 14:17:01.481785    9990000  ...         None  None  None
# 1               ANN_Fashion-MNIST_d784_euclidean    2023-03-10 14:17:01.481785      60000  ...         None  None  None
# 2                        ANN_GIST_d960_euclidean    2023-03-10 14:17:01.481785    1000000  ...         None  None  None
# 3                         ANN_GloVe_d100_angular    2023-03-10 14:17:01.481785    1183514  ...         None  None  None
# 4                         ANN_GloVe_d200_angular    2023-03-10 14:17:01.481785    1183514  ...         None  None  None
# 5                          ANN_GloVe_d25_angular    2023-03-10 14:17:01.481785    1183514  ...         None  None  None
# ...
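
Because list_datasets() returns plain dataset names, ordinary list operations are enough to narrow the catalog. The sketch below is illustrative only: filter_names is a hypothetical helper, and the names list is a hand-copied subset of the response shown above rather than a live call to the library.

```python
def filter_names(names, prefix="", suffix=""):
    """Return the dataset names that start with prefix and end with suffix."""
    return [n for n in names if n.startswith(prefix) and n.endswith(suffix)]


# Hand-copied subset of the names returned by list_datasets()
names = [
    "ANN_DEEP1B_d96_angular",
    "ANN_SIFT1M_d128_euclidean",
    "quora_all-MiniLM-L6-bm25-100K",
    "quora_all-MiniLM-L6-bm25",
    "wikipedia-simple-text-embedding-ada-002-100K",
]

ann_benchmarks = filter_names(names, prefix="ANN_")   # ANN benchmark datasets
samples_100k = filter_names(names, suffix="-100K")    # 100K-document samples
print(ann_benchmarks)
print(samples_100k)
```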

Load a dataset

To load a dataset into memory, use the load_dataset() function. You can load a Pinecone public dataset or your own dataset.

Example

The following example loads the quora_all-MiniLM-L6-bm25 Pinecone public dataset.
Python
from pinecone_datasets import list_datasets, load_dataset

list_datasets()
# ["quora_all-MiniLM-L6-bm25", ... ]

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

dataset.head()

# Response:
# ┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐
# │ id  ┆ values                    ┆ sparse_values                       ┆ metadata          ┆ blob │
# │     ┆                           ┆                                     ┆                   ┆      │
# │ str ┆ list[f32]                 ┆ struct[2]                           ┆ struct[3]         ┆      │
# ╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡
# │ 0   ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,"other"} ┆ .... │
# │     ┆ 0.0060...                 ┆                                     ┆                   ┆      │
# └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘
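
Each row pairs an id with dense values and, for hybrid datasets, sparse_values. Before upserting, it can help to confirm that a record fits the target index. The following is a minimal sketch under stated assumptions: check_record is a hypothetical helper, the sample record is made up, and the indices and values keys inside sparse_values are assumed to follow Pinecone's sparse vector format.

```python
def check_record(record: dict, dimension: int) -> None:
    """Raise ValueError if a record does not fit an index of the given dimension."""
    record_id = record.get("id")
    if not isinstance(record_id, str) or not record_id:
        raise ValueError("record needs a non-empty string id")

    values = record.get("values")
    if values is None or len(values) != dimension:
        raise ValueError(f"record {record_id!r}: expected {dimension} dense values")

    # Assumed layout: sparse_values holds parallel 'indices' and 'values' lists
    sparse = record.get("sparse_values")
    if sparse is not None and len(sparse["indices"]) != len(sparse["values"]):
        raise ValueError(f"record {record_id!r}: sparse indices and values differ in length")


# Made-up record with a 4-dimensional dense vector, for illustration only
sample = {
    "id": "0",
    "values": [0.1, -0.2, 0.3, 0.4],
    "sparse_values": {"indices": [7, 42], "values": [0.5, 1.5]},
}
check_record(sample, dimension=4)  # passes silently
```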

Iterate over datasets

You can iterate over vector data in a dataset using the iter_documents() method. You can use this method to upsert or update vectors, automate benchmarking, or perform other tasks.

Example

The following example loads the quora_all-MiniLM-L6-bm25 dataset, iterates over its documents in batches of 100, and upserts each batch to a Pinecone serverless index named docs-example.
Python
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
from pinecone_datasets import load_dataset

# Initialize the client with your API key
pc = Pinecone(api_key="API_KEY")

# Load the public dataset into memory
dataset = load_dataset("quora_all-MiniLM-L6-bm25")

# Create a serverless index whose dimension matches the dataset's dense vectors (384)
pc.create_index(
  name="docs-example",
  dimension=384,
  metric="cosine",
  spec=ServerlessSpec(
    cloud="aws",
    region="us-east-1"
  )
)

index = pc.Index("docs-example")

# Upsert the documents in batches of 100
for batch in dataset.iter_documents(batch_size=100):
    index.upsert(vectors=batch)
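
To make the batching pattern concrete, here is a minimal sketch of how a sequence of records can be split into fixed-size batches. It is not the library's implementation of iter_documents(). The batched helper and the placeholder records are assumptions for illustration.

```python
from itertools import islice


def batched(records, batch_size):
    """Yield successive lists of at most batch_size records."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# 250 placeholder records stand in for a dataset's documents
records = [{"id": str(i), "values": [0.0]} for i in range(250)]
batch_sizes = [len(batch) for batch in batched(records, batch_size=100)]
print(batch_sizes)
```

The last batch is smaller whenever the record count is not a multiple of the batch size, so any per-batch logic should not assume a fixed length.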

Upsert a dataset as a dataframe

To quickly ingest data when using the Python SDK, use the upsert_from_dataframe method. The method includes retry logic and a batch_size parameter, and it performs especially well with datasets stored as Parquet files. The following example upserts the quora_all-MiniLM-L6-bm25 dataset as a dataframe.
Python
from pinecone import Pinecone, ServerlessSpec
from pinecone_datasets import load_dataset

pc = Pinecone(api_key="API_KEY")

dataset = load_dataset("quora_all-MiniLM-L6-bm25")

pc.create_index(
  name="docs-example",
  dimension=384,
  metric="cosine",
  spec=ServerlessSpec(
    cloud="aws",
    region="us-east-1"
  )
)

# To get the unique host for an index, 
# see https://docs.pinecone.io/guides/manage-data/target-an-index
index = pc.Index(host="INDEX_HOST")

index.upsert_from_dataframe(dataset.drop(columns=["blob"]))
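
For readers who upsert batches themselves, the idea behind retry logic can be sketched as a small wrapper with exponential backoff. This is a generic illustration and not how upsert_from_dataframe is implemented. The call_with_retries helper, its parameters, and the flaky_upsert stand-in are all hypothetical.

```python
import time


def call_with_retries(func, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on any exception with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Stand-in for an upsert call that fails twice before succeeding
attempts = {"count": 0}

def flaky_upsert():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


# Record the delays instead of sleeping so the demo finishes instantly
delays = []
result = call_with_retries(flaky_upsert, sleep=delays.append)
print(result, delays)
```

In real use you would wrap a call such as a single batch upsert in a function and catch only the exception types that are safe to retry.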

See also