Spark-Pinecone connector
Use the `spark-pinecone` connector to efficiently create, ingest, and update vector embeddings at scale with Databricks and Pinecone.
In this guide, you’ll create embeddings based on the sentence-transformers/all-MiniLM-L6-v2 model from Hugging Face, but the approach demonstrated here should work with any other model and dataset.
Before you begin
Ensure you have the following:
- A Pinecone account and API key.
- A Databricks cluster.
1. Install the Spark-Pinecone connector
- Install the Spark-Pinecone connector as a library.
- Configure the library as follows:
  - Select File path/S3 as the Library Source.
  - Enter the S3 URI for the Pinecone assembly JAR file. Databricks platform users must use the Pinecone assembly JAR to ensure that the proper dependencies are installed.
  - Click Install.
2. Load the dataset into partitions
As your example dataset, use a collection of news articles from Hugging Face’s datasets library:
- Create a new notebook attached to your cluster.
- Install dependencies:
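The exact dependency list isn't shown in this excerpt; at a minimum, this guide needs the Hugging Face datasets and transformers libraries plus PyTorch, which you can install with a notebook-scoped command (a sketch):

```python
# Notebook-scoped installation in Databricks (assumed dependency set).
%pip install datasets transformers torch
```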
- Load the dataset:
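The specific news dataset isn't named in this excerpt; the sketch below assumes the AG News corpus from the Hugging Face Hub, but any text dataset works the same way:

```python
from datasets import load_dataset

# Load a news-article dataset ("ag_news" is an illustrative choice).
dataset = load_dataset("ag_news", split="train")
```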
- Convert the dataset from the Hugging Face format and repartition it:
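A minimal sketch, assuming the dataset loaded above and an illustrative partition count of eight (in practice, match the number of worker cores in your cluster):

```python
# `spark` (a SparkSession) is predefined in Databricks notebooks.
# Hugging Face dataset -> pandas DataFrame -> Spark DataFrame.
df = spark.createDataFrame(dataset.to_pandas())

# Spread the rows evenly across partitions (8 is an assumed value).
df = df.repartition(8)
```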
Once the repartition is complete, you get back a DataFrame, which is a distributed collection of the data organized into named columns. It is conceptually equivalent to a table in a relational database or a dataframe in R/Python, but with richer optimizations under the hood. As mentioned above, each partition in the dataframe holds an equal share of the original data.
- The dataset doesn’t have identifiers associated with each document, so add them:
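One way to do this, consistent with the description below, is with `monotonically_increasing_id`:

```python
from pyspark.sql.functions import monotonically_increasing_id

# Add an increasing integer identifier and cast it to a string,
# since Pinecone expects string vector IDs.
df = df.withColumn("id", monotonically_increasing_id().cast("string"))
```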
As its name suggests, `withColumn` adds a column to the dataframe; here it contains a simple increasing identifier that you cast to a string.
3. Create the vector embeddings
- Create a UDF (User-Defined Function) to create the embeddings, using the AutoTokenizer and AutoModel classes from the Hugging Face transformers library:
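A sketch of such a UDF, assuming the dataset's text lives in a `text` column and using mean pooling over the model's last hidden state (a common pooling choice for this model); the function name `create_embeddings` is illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"

def create_embeddings(rows):
    """Embed the `text` column of every row in one partition."""
    # Load the tokenizer and model once per partition, not once per row.
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModel.from_pretrained(MODEL_NAME)
    for row in rows:
        inputs = tokenizer(row.text, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # Mean-pool the token embeddings into one 384-dimensional vector.
        embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)
        yield (row.id, embedding.tolist())
```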
- Apply the UDF to the data:
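Because `create_embeddings` is written as a per-partition generator, you can pass it straight to `mapPartitions`:

```python
# Apply the UDF to each partition of the DataFrame's underlying RDD.
embeddings_rdd = df.rdd.mapPartitions(create_embeddings)
```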
A dataframe in Spark is a higher-level abstraction built on top of a more fundamental building block called a resilient distributed dataset (RDD). Here, you use the `mapPartitions` function, which provides finer control over the execution of the UDF by explicitly applying it to each partition of the RDD.
- Convert the resulting RDD back into a dataframe with the schema required by Pinecone:
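At a minimum, the connector expects a string `id` column and an array-of-floats `values` column; a sketch (see the connector's documentation for optional columns such as metadata):

```python
from pyspark.sql.types import (ArrayType, FloatType, StringType,
                               StructField, StructType)

# Minimal schema for the Spark-Pinecone connector: string IDs plus
# float vector values.
schema = StructType([
    StructField("id", StringType(), nullable=False),
    StructField("values", ArrayType(FloatType()), nullable=False),
])

embeddings_df = spark.createDataFrame(embeddings_rdd, schema=schema)
```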
4. Save the embeddings in Pinecone
- Initialize the connection to Pinecone:
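A sketch using the official Pinecone Python client; replace the placeholder with your own API key (for example, read it from a Databricks secret):

```python
from pinecone import Pinecone

# "YOUR_API_KEY" is a placeholder; never hard-code real keys.
pc = Pinecone(api_key="YOUR_API_KEY")
```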
- Create an index for your embeddings:
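all-MiniLM-L6-v2 outputs 384-dimensional vectors, so the index dimension must be 384; the index name, metric, and serverless cloud/region below are illustrative:

```python
from pinecone import ServerlessSpec

index_name = "news-articles"  # illustrative name

if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=384,   # output dimension of all-MiniLM-L6-v2
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )
```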
- Use the Spark-Pinecone connector to save the embeddings to your index:
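A sketch of the write, using the connector's data source name and options as documented for recent releases:

```python
(
    embeddings_df.write
    .format("io.pinecone.spark.pinecone.Pinecone")
    .option("pinecone.apiKey", "YOUR_API_KEY")   # placeholder
    .option("pinecone.indexName", index_name)
    .mode("append")
    .save()
)
```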
The process of writing the embeddings to Pinecone should take approximately 15 seconds. When the cell completes without errors, the process finished successfully and the embeddings have been stored in Pinecone.
- Perform a similarity search using the embeddings you loaded into Pinecone by providing a set of vector values or a vector ID. The query endpoint will return the IDs of the most similar records in the index, along with their similarity scores:
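A sketch of a query by vector ID; the ID "0" and top_k of 3 are illustrative:

```python
index = pc.Index(index_name)

# Return the 3 records most similar to the stored vector with ID "0",
# along with their similarity scores.
results = index.query(id="0", top_k=3)
print(results)
```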
If you want to make a query with a text string (e.g., "Summarize this article"), use the `search` endpoint via integrated inference.