Databricks
Using Databricks and Pinecone to create and index vector embeddings at scale
Databricks is a Unified Analytics Platform built on top of Apache Spark. The primary advantage of using Spark is its ability to distribute workloads across a cluster of machines. By adding more machines or increasing the number of cores on each machine, you can horizontally scale a cluster to handle computationally intensive tasks like vector embedding, where parallelization can save many hours of computation time and resources. Leveraging GPUs with Spark can produce even better results: combining fast GPU computation with parallelization ensures optimal performance.
Efficiently create, ingest, and update vector embeddings at scale with Databricks and Pinecone.
Setup guide
In this guide, you’ll create embeddings based on the sentence-transformers/all-MiniLM-L6-v2 model from Hugging Face, but the approach demonstrated here should work with any other model and dataset.
Before you begin
Ensure you have the following:
- A Pinecone account and API key.
- A Databricks workspace with a cluster you can attach libraries and notebooks to.
1. Install the Spark-Pinecone connector
- Install the Spark-Pinecone connector as a library.
- Configure the library as follows:
- Select File path/S3 as the Library Source.
- Enter the S3 URI for the Pinecone assembly JAR file. Databricks platform users must use the Pinecone assembly JAR to ensure that the proper dependencies are installed.
- Click Install.
2. Load the dataset into partitions
As your example dataset, use a collection of news articles from Hugging Face’s datasets library:
- Create a new notebook attached to your cluster.
- Install dependencies:
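For example (the exact package list is an assumption based on the libraries used later in this guide):

```python
# Run in a notebook cell. Package names are assumptions; adjust versions to
# match your cluster's Databricks Runtime.
%pip install datasets transformers torch pinecone
```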
- Load the dataset:
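A minimal sketch of this step; the specific dataset (ag_news) is an illustrative assumption, since any collection of news articles will do:

```python
from datasets import load_dataset

# Load a collection of news articles from the Hugging Face Hub.
news_dataset = load_dataset("ag_news", split="train")
```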
- Convert the dataset from the Hugging Face format and repartition it:
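A sketch of this conversion, assuming the dataset loaded above and the SparkSession that Databricks exposes as spark (the partition count is an illustrative choice):

```python
# Convert the Hugging Face dataset to a Spark DataFrame and repartition it so
# the embedding work is spread evenly across the cluster.
news_df = spark.createDataFrame(news_dataset.to_pandas()).repartition(32)
```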
Once the repartition is complete, you get back a DataFrame, which is a distributed collection of the data organized into named columns. It is conceptually equivalent to a table in a relational database or a dataframe in R/Python, but with richer optimizations under the hood. After repartitioning, each partition in the dataframe holds a roughly equal share of the original data.
- The dataset doesn’t have identifiers associated with each document, so add them:
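A sketch of this step, assuming Spark’s monotonically_increasing_id as the source of the identifier:

```python
from pyspark.sql.functions import monotonically_increasing_id

# Add an "id" column with an increasing identifier, cast to a string so it can
# be used as the Pinecone record ID.
news_df = news_df.withColumn("id", monotonically_increasing_id().cast("string"))
```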
As its name suggests, withColumn adds a column to the dataframe, containing a simple increasing identifier that you cast to a string.
3. Create the vector embeddings
- Create a UDF (User-Defined Function) to create the embeddings, using the AutoTokenizer and AutoModel classes from the Hugging Face transformers library:
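A sketch of such a UDF, assuming the all-MiniLM-L6-v2 model mentioned earlier, a text column named "text", and simple mean pooling over token embeddings (all of which are assumptions rather than fixed requirements):

```python
import torch
from transformers import AutoTokenizer, AutoModel

def create_embeddings(partition):
    # Load the tokenizer and model once per partition rather than once per row.
    model_name = "sentence-transformers/all-MiniLM-L6-v2"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    for row in partition:
        inputs = tokenizer(row["text"], padding=True, truncation=True, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        # Mean-pool the token embeddings into a single vector per document.
        embedding = outputs.last_hidden_state.mean(dim=1).squeeze().tolist()
        yield (row["id"], embedding)
```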
- Apply the UDF to the data:
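Carried over from the sketches above, applying the function with mapPartitions might look like:

```python
# Apply the embedding UDF to each partition of the DataFrame's underlying RDD.
embeddings_rdd = news_df.rdd.mapPartitions(create_embeddings)
```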
A dataframe in Spark is a higher-level abstraction built on top of a more fundamental building block called a resilient distributed dataset (RDD). Here, you use the mapPartitions function, which provides finer control over the execution of the UDF by explicitly applying it to each partition of the RDD.
- Convert the resulting RDD back into a dataframe with the schema required by Pinecone:
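A sketch of that conversion; the id/values schema reflects the columns the Spark-Pinecone connector expects, but check the connector’s documentation for the exact schema of the version you installed:

```python
from pyspark.sql.types import ArrayType, FloatType, StringType, StructField, StructType

# Build a DataFrame with an "id" column and a "values" column holding the vector.
schema = StructType([
    StructField("id", StringType(), False),
    StructField("values", ArrayType(FloatType()), False),
])
embeddings_df = spark.createDataFrame(embeddings_rdd, schema=schema)
```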
4. Save the embeddings in Pinecone
- Initialize the connection to Pinecone:
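A sketch using the Pinecone Python SDK; the hard-coded API key is only a placeholder (on Databricks you would typically read it from a secret scope):

```python
from pinecone import Pinecone

# Initialize the Pinecone client. Replace the placeholder with your own key,
# ideally retrieved from a Databricks secret scope.
api_key = "YOUR_API_KEY"
pc = Pinecone(api_key=api_key)
```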
- Create an index for your embeddings:
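A sketch of index creation sized for all-MiniLM-L6-v2’s 384-dimensional vectors; the index name, cloud, and region are assumptions:

```python
from pinecone import ServerlessSpec

# all-MiniLM-L6-v2 produces 384-dimensional embeddings.
pc.create_index(
    name="news-embeddings",  # illustrative name
    dimension=384,
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
```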
- Use the Spark-Pinecone connector to save the embeddings to your index:
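A sketch of the write, assuming the DataFrame built above; the format string and option names follow the Spark-Pinecone connector’s documented usage, but verify them against the connector version you installed:

```python
# Write the embeddings to the Pinecone index through the Spark-Pinecone connector.
(
    embeddings_df.write
    .format("io.pinecone.spark.pinecone.Pinecone")
    .option("pinecone.apiKey", api_key)
    .option("pinecone.indexName", "news-embeddings")
    .mode("append")
    .save()
)
```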
The process of writing the embeddings to Pinecone should take approximately 15 seconds. When it completes, you’ll see output confirming that the process completed successfully and that the embeddings have been stored in Pinecone.
- Perform a similarity search using the embeddings you loaded into Pinecone by providing a set of vector values or a vector ID. The query endpoint will return the IDs of the most similar records in the index, along with their similarity scores:
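A sketch of such a query with the Pinecone Python SDK, here querying by the ID of a record that was just upserted (the index name and ID are assumptions):

```python
# Target the index and ask for the five records most similar to vector "0".
index = pc.Index("news-embeddings")
results = index.query(id="0", top_k=5)
print(results)
```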
If you want to make a query with a text string (e.g., "Summarize this article"), use the search endpoint via integrated inference.