Use public Pinecone datasets
This page lists the catalog of public Pinecone datasets and shows you how to work with them using the Python pinecone-datasets library.
To create, upload, and list your own dataset for use by other Pinecone users, see Creating datasets.
Available public datasets
name | documents | source | bucket | task | dense model (dimensions) | sparse model |
---|---|---|---|---|---|---|
ANN_DEEP1B_d96_angular | 9,990,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_DEEP1B_d96_angular | ANN | ANN benchmark (96) | None |
ANN_Fashion-MNIST_d784_euclidean | 60,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_Fashion-MNIST_d784_euclidean | ANN | ANN benchmark (784) | None |
ANN_GIST_d960_euclidean | 1,000,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GIST_d960_euclidean | ANN | ANN benchmark (960) | None |
ANN_GloVe_d100_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d100_angular | ANN | ANN benchmark (100) | None |
ANN_GloVe_d200_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d200_angular | ANN | ANN benchmark (200) | None |
ANN_GloVe_d25_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d25_angular | ANN | ANN benchmark (25) | None |
ANN_GloVe_d50_angular | 1,183,514 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_GloVe_d50_angular | ANN | ANN benchmark (50) | None |
ANN_LastFM_d64_angular | 292,385 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_LastFM_d64_angular | ANN | ANN benchmark (65) | None |
ANN_MNIST_d784_euclidean | 60,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_MNIST_d784_euclidean | ANN | ANN benchmark (784) | None |
ANN_NYTimes_d256_angular | 290,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_NYTimes_d256_angular | ANN | ANN benchmark (256) | None |
ANN_SIFT1M_d128_euclidean | 1,000,000 | https://github.com/erikbern/ann-benchmarks | gs://pinecone-datasets-dev/ANN_SIFT1M_d128_euclidean | ANN | ANN benchmark (128) | None |
amazon_toys_quora_all-MiniLM-L6-bm25 | 10,000 | https://www.kaggle.com/datasets/PromptCloudHQ/toy-products-on-amazon | gs://pinecone-datasets-dev/amazon_toys_quora_all-MiniLM-L6-bm25 | QA | sentence-transformers/all-MiniLM-L6-v2 (384) | bm25 |
it-threat-data-test | 1,042,965 | https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv | | | it_threat_model.model (128) | None |
it-threat-data-train | 1,042,867 | https://cse-cic-ids2018.s3.ca-central-1.amazonaws.com/Processed%20Traffic%20Data%20for%20ML%20Algorithms/Thursday-22-02-2018_TrafficForML_CICFlowMeter.csv | | | it_threat_model.model (128) | None |
langchain-python-docs-text-embedding-ada-002 | 3,476 | https://huggingface.co/datasets/jamescalam/langchain-docs-23-06-27 | | | text-embedding-ada-002 (1536) | None |
movielens-user-ratings | 970,582 | https://huggingface.co/datasets/pinecone/movielens-recent-ratings | gs://pinecone-datasets-dev/movielens-user-ratings | classification | pinecone/movie-recommender-user-model (32) | None |
msmarco-v1-bm25-allMiniLML6V2 | 8,841,823 | | | | all-minilm-l6-v2 (384) | bm25-k0.9-b0.4 |
quora_all-MiniLM-L6-bm25-100K | 100,000 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-bm25 | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
quora_all-MiniLM-L6-bm25 | 522,931 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-bm25 | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
quora_all-MiniLM-L6-v2_Splade-100K | 100,000 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-v2_Splade | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
quora_all-MiniLM-L6-v2_Splade | 522,931 | https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs | gs://pinecone-datasets-dev/quora_all-MiniLM-L6-v2_Splade | similar questions | sentence-transformers/msmarco-MiniLM-L6-cos-v5 (384) | naver/splade-cocondenser-ensembledistil |
squad-text-embedding-ada-002 | 18,891 | https://huggingface.co/datasets/squad | | | text-embedding-ada-002 (1536) | None |
wikipedia-simple-text-embedding-ada-002-100K | 100,000 | wikipedia | gs://pinecone-datasets-dev/wikipedia-simple-text-embedding-ada-002-100K | multiple | text-embedding-ada-002 (1536) | None |
wikipedia-simple-text-embedding-ada-002 | 283,945 | wikipedia | gs://pinecone-datasets-dev/wikipedia-simple-text-embedding-ada-002 | multiple | text-embedding-ada-002 (1536) | None |
youtube-transcripts-text-embedding-ada-002 | 38,950 | youtube | gs://pinecone-datasets-dev/youtube-transcripts-text-embedding-ada-002 | multiple | text-embedding-ada-002 (1536) | None |
Install the pinecone-datasets library
Pinecone provides a Python library for working with public Pinecone datasets. To install the library, run the following command:
pip install pinecone-datasets
List public datasets
To list the available public Pinecone datasets as a Python list, use the list_datasets() method:
from pinecone_datasets import list_datasets
list_datasets()
# Response:
# ['ANN_DEEP1B_d96_angular', 'ANN_Fashion-MNIST_d784_euclidean', 'ANN_GIST_d960_euclidean', 'ANN_GloVe_d100_angular', 'ANN_GloVe_d200_angular', 'ANN_GloVe_d25_angular', 'ANN_GloVe_d50_angular', 'ANN_LastFM_d64_angular', 'ANN_MNIST_d784_euclidean', 'ANN_NYTimes_d256_angular', 'ANN_SIFT1M_d128_euclidean', 'amazon_toys_quora_all-MiniLM-L6-bm25', 'it-threat-data-test', 'it-threat-data-train', 'langchain-python-docs-text-embedding-ada-002', 'movielens-user-ratings', 'msmarco-v1-bm25-allMiniLML6V2', 'quora_all-MiniLM-L6-bm25-100K', 'quora_all-MiniLM-L6-bm25', 'quora_all-MiniLM-L6-v2_Splade-100K', 'quora_all-MiniLM-L6-v2_Splade', 'squad-text-embedding-ada-002', 'wikipedia-simple-text-embedding-ada-002-100K', 'wikipedia-simple-text-embedding-ada-002', 'youtube-transcripts-text-embedding-ada-002']
To list the available datasets as a pandas DataFrame, pass the as_df=True argument:
from pinecone_datasets import list_datasets
list_datasets(as_df=True)
# Response:
# name created_at documents ... description tags args
# 0 ANN_DEEP1B_d96_angular 2023-03-10 14:17:01.481785 9990000 ... None None None
# 1 ANN_Fashion-MNIST_d784_euclidean 2023-03-10 14:17:01.481785 60000 ... None None None
# 2 ANN_GIST_d960_euclidean 2023-03-10 14:17:01.481785 1000000 ... None None None
# 3 ANN_GloVe_d100_angular 2023-03-10 14:17:01.481785 1183514 ... None None None
# 4 ANN_GloVe_d200_angular 2023-03-10 14:17:01.481785 1183514 ... None None None
# 5 ANN_GloVe_d25_angular 2023-03-10 14:17:01.481785 1183514 ... None None None
# ...
Load a dataset
To load a dataset into memory, use the load_dataset() method. You can load a Pinecone public dataset or your own dataset.
Example
The following example loads the quora_all-MiniLM-L6-bm25 Pinecone public dataset.
from pinecone_datasets import list_datasets, load_dataset
list_datasets()
# ["quora_all-MiniLM-L6-bm25", ... ]
dataset = load_dataset("quora_all-MiniLM-L6-bm25")
dataset.head()
# Response:
# ┌─────┬───────────────────────────┬─────────────────────────────────────┬───────────────────┬──────┐
# │ id ┆ values ┆ sparse_values ┆ metadata ┆ blob │
# │ ┆ ┆ ┆ ┆ │
# │ str ┆ list[f32] ┆ struct[2] ┆ struct[3] ┆ │
# ╞═════╪═══════════════════════════╪═════════════════════════════════════╪═══════════════════╪══════╡
# │ 0 ┆ [0.118014, -0.069717, ... ┆ {[470065541, 52922727, ... 22364... ┆ {2017,12,"other"} ┆ .... │
# │ ┆ 0.0060... ┆ ┆ ┆ │
# └─────┴───────────────────────────┴─────────────────────────────────────┴───────────────────┴──────┘
Iterate over datasets
You can iterate over vector data in a dataset using the iter_documents() method. This is useful for upserting or updating vectors, automating benchmarking, and similar tasks.
Example
The following example loads the quora_all-MiniLM-L6-bm25 dataset, iterates over its documents in batches of 100, and upserts the vector data to a Pinecone serverless index named example-index.
from pinecone.grpc import PineconeGRPC as Pinecone
from pinecone import ServerlessSpec
from pinecone_datasets import list_datasets, load_dataset
pinecone = Pinecone(api_key="API_KEY")
dataset = load_dataset("quora_all-MiniLM-L6-bm25")
pinecone.create_index(
name="example-index",
dimension=384,
metric="cosine",
spec=ServerlessSpec(
cloud="aws",
region="us-east-1"
)
)
index = pinecone.Index("example-index")
for batch in dataset.iter_documents(batch_size=100):
index.upsert(vectors=batch)
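To make the batching behavior concrete, here is a minimal, self-contained sketch of the kind of fixed-size chunking that iter_documents() performs. This is a hypothetical illustration, not the pinecone-datasets implementation:

```python
# Hypothetical sketch: chunk a sequence of records into fixed-size batches,
# similar in spirit to what iter_documents(batch_size=...) yields.
def iter_batches(records, batch_size=100):
    """Yield successive lists of at most batch_size records."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

vectors = [{"id": str(i), "values": [0.0] * 384} for i in range(250)]
batches = list(iter_batches(vectors, batch_size=100))
print([len(b) for b in batches])  # [100, 100, 50]
```

Note that the final batch may be smaller than batch_size, which is why upserting per batch (rather than assuming a fixed count) is the safe pattern.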
Upsert a dataset as a dataframe
To quickly ingest data when using the Python SDK, use the upsert_from_dataframe() method. The method includes retry logic and a batch_size parameter, and it performs especially well with Parquet file datasets.
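To illustrate the retry behavior, here is a simplified, hypothetical sketch of retrying a single batch upsert with exponential backoff. This is not the SDK's actual implementation, and flaky_upsert is a stand-in for a real upsert call:

```python
import time

# Hypothetical sketch: retry a batch operation with exponential backoff,
# the kind of retry logic a method like upsert_from_dataframe provides.
def upsert_with_retries(upsert_fn, batch, max_retries=3, backoff=0.1):
    for attempt in range(max_retries):
        try:
            return upsert_fn(batch)
        except Exception:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            time.sleep(backoff * (2 ** attempt))  # exponential backoff

# Demo: a fake upsert that fails twice with a transient error, then succeeds.
attempts = []
def flaky_upsert(batch):
    attempts.append(len(batch))
    if len(attempts) < 3:
        raise RuntimeError("transient error")
    return {"upserted_count": len(batch)}

result = upsert_with_retries(flaky_upsert, [{"id": "0"}, {"id": "1"}], backoff=0)
print(result)  # {'upserted_count': 2}
```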
The following example upserts the quora_all-MiniLM-L6-bm25 dataset as a dataframe.
from pinecone import Pinecone, ServerlessSpec
from pinecone_datasets import list_datasets, load_dataset
pc = Pinecone(api_key="API_KEY")
dataset = load_dataset("quora_all-MiniLM-L6-bm25")
pc.create_index(
name="example-index",
dimension=384,
metric="cosine",
spec=ServerlessSpec(
cloud="aws",
region="us-east-1"
)
)
index = pc.Index("example-index")
index.upsert_from_dataframe(dataset.drop(columns=["blob"]))
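The drop(columns=["blob"]) call above removes the raw-text blob column so only the fields the index needs are sent. As a small pandas illustration (the sample data here is hypothetical):

```python
import pandas as pd

# Hypothetical sample data: dropping the "blob" column keeps the upsert
# payload to just the id and vector values.
df = pd.DataFrame({
    "id": ["0", "1"],
    "values": [[0.1, 0.2], [0.3, 0.4]],
    "blob": ["original text 0", "original text 1"],
})
slim = df.drop(columns=["blob"])
print(list(slim.columns))  # ['id', 'values']
```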