Voyage AI
Using Voyage AI and Pinecone to generate and index high-quality vector embeddings
Voyage AI provides cutting-edge embedding models and rerankers. Voyage AI's generalist embedding models continually top the MTEB leaderboard, and its domain-specific embedding models significantly enhance retrieval quality for enterprise use cases.
Setup guide
In this guide, we use the Voyage Embedding API endpoint to generate text embeddings for terms of service and consumer contract documents, and then index those embeddings in the Pinecone vector database.
This is a powerful and common combination for building retrieval-augmented generation (RAG), semantic search, question-answering, code assistants, and other applications that rely on NLP and search over a large corpus of text data.
1. Set up the environment
Start by installing the Voyage AI and Pinecone clients and Hugging Face Datasets for downloading the LegalBench: Consumer Contracts QA (mteb/legalbench_consumer_contracts_qa) dataset used in this guide:
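A minimal, notebook-style install sketch; the package names assume the current client releases (the Pinecone package was previously published as pinecone-client):

```python
# Install the Voyage AI client, the Pinecone client, and Hugging Face Datasets.
!pip install -qU voyageai pinecone datasets
```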
2. Create embeddings
Sign up for an API key at Voyage AI and then use it to initialize your connection.
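For example, using the voyageai Python client:

```python
import voyageai

# The client reads VOYAGE_API_KEY from the environment if api_key is omitted.
vo = voyageai.Client(api_key="<YOUR_VOYAGE_API_KEY>")
```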
Load the LegalBench: Consumer Contracts QA dataset, which contains 154 consumer contract documents and 396 labeled queries about these documents.
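A possible loading sketch; the "corpus" config and split names follow the usual MTEB dataset layout and should be checked against the dataset card:

```python
from datasets import load_dataset

# Load the 154-document corpus; each record carries the contract text.
dataset = load_dataset("mteb/legalbench_consumer_contracts_qa", "corpus", split="corpus")
docs = dataset["text"]
```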
Each document in mteb/legalbench_consumer_contracts_qa contains a text field, which we will embed using the Voyage AI client.
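A sketch of the embedding call, using the docs list loaded above; the per-request cap of 128 texts is an assumption based on current Voyage API limits:

```python
# Embed with the legal-domain model; input_type="document" marks corpus texts.
embeds = []
batch_size = 128
for i in range(0, len(docs), batch_size):
    result = vo.embed(
        docs[i : i + batch_size], model="voyage-law-2", input_type="document"
    )
    embeds.extend(result.embeddings)
```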
Check the dimensionality of the returned vectors. Save this value, as you will need it when initializing your Pinecone index later.
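For example:

```python
import numpy as np

# Each row is one document embedding.
print(np.array(embeds).shape)
# (154, 1024)
```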
In this example, you can see that for each of the 154 documents, we created a 1024-dimensional embedding with the Voyage AI voyage-law-2 model.
3. Store the embeddings
Now that you have your embeddings, you can move on to indexing them in the Pinecone vector database. For this, you need a Pinecone API key. Sign up for one here.
First, initialize your connection to Pinecone and create a new index called voyageai-pinecone-legalbench for storing the embeddings. When creating the index, specify the cosine similarity metric to align with Voyage AI's embeddings, and pass the embedding dimensionality of 1024.
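A sketch using the current Pinecone Python SDK; the environment and pod type shown are assumptions (newer projects may use a serverless spec instead):

```python
from pinecone import Pinecone, PodSpec

pc = Pinecone(api_key="<YOUR_PINECONE_API_KEY>")

index_name = "voyageai-pinecone-legalbench"

# Create the index with cosine similarity and the dimensionality of
# voyage-law-2 embeddings.
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1024,
        metric="cosine",
        spec=PodSpec(environment="us-east-1-aws", pod_type="p1.x1"),
    )

index = pc.Index(index_name)
```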
Now you can begin populating the index with your embeddings. Pinecone expects a list of tuples in the format (id, vector, metadata), where metadata is an optional dictionary in which you can store any extra fields. For this example, you will store the original text of each embedding.
When uploading your data, batch the upserts to avoid pushing too much data in a single request.
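A batched upsert sketch (the batch size of 100 is arbitrary):

```python
batch_size = 100

for i in range(0, len(docs), batch_size):
    i_end = min(i + batch_size, len(docs))
    # (id, vector, metadata) tuples, storing the original text as metadata.
    to_upsert = [
        (str(idx), embeds[idx], {"text": docs[idx]}) for idx in range(i, i_end)
    ]
    index.upsert(vectors=to_upsert)
```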
You can see from index.describe_index_stats that you have a 1024-dimensional index populated with 154 embeddings. The indexFullness metric tells you how full your index is; at the moment, it is close to zero. Using the default of one p1 pod, you can fit around 750K embeddings before indexFullness reaches capacity. The Usage Estimator can be used to identify the number of pods required for a given number of n-dimensional embeddings.
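For example (the output shown is indicative, not exact):

```python
index.describe_index_stats()
# {'dimension': 1024,
#  'index_fullness': 0.0,
#  'namespaces': {'': {'vector_count': 154}},
#  'total_vector_count': 154}
```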
4. Semantic search
Now that you have your indexed vectors, you can perform a few search queries. When searching, you will first embed your query using voyage-law-2, and then search using the returned vector in Pinecone.
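A sketch of one query; the question itself is just an illustrative example:

```python
query = (
    "Will Google help me if I think someone has taken and used content "
    "I created without my permission?"
)

# input_type="query" tells the model this text is a search query.
xq = vo.embed([query], model="voyage-law-2", input_type="query").embeddings[0]

# Retrieve the three most similar documents along with their metadata.
res = index.query(vector=xq, top_k=3, include_metadata=True)
```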
The response from Pinecone includes your original text in the metadata field. Let's print out the top_k most similar documents and their respective similarity scores.
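For example:

```python
for match in res.matches:
    print(f"{match.score:.2f}: {match.metadata['text'][:100]}")
```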
The semantic search pipeline with Voyage AI and Pinecone is able to identify the relevant consumer contract documents to answer the user query.