Using Voyage AI and Pinecone to generate and index high-quality vector embeddings
The `mteb/legalbench_consumer_contracts_qa` dataset used in this guide contains a `text` field, which we will embed using the Voyage AI client.
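Loading the dataset might look like the following sketch using the Hugging Face `datasets` library; the config and split names below are assumptions based on common `mteb` dataset layouts, so check the dataset card if they differ:

```python
from datasets import load_dataset

# Load the corpus; the "corpus" config/split names are assumptions --
# adjust them to match the dataset card on the Hugging Face Hub.
dataset = load_dataset("mteb/legalbench_consumer_contracts_qa", "corpus", split="corpus")
texts = dataset["text"]
print(len(texts))  # expect 154 documents
```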
For each of the 154 documents, we created a 1024-dimensional embedding with the Voyage AI `voyage-law-2` model.
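A minimal sketch of the embedding step with the official `voyageai` Python client; the batch size of 128 is an assumption about the API's per-request limit, and the client reads your API key from the environment:

```python
import voyageai

vo = voyageai.Client()  # assumes VOYAGE_API_KEY is set in the environment

# Embed the documents in batches; input_type="document" tells the model
# these are corpus passages rather than search queries.
embeddings = []
for i in range(0, len(texts), 128):  # 128 is an assumed per-request limit
    result = vo.embed(texts[i : i + 128], model="voyage-law-2", input_type="document")
    embeddings.extend(result.embeddings)
```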
Next, you create an index named `voyageai-pinecone-legalbench` for storing the embeddings. When creating the index, you specify the cosine similarity metric to align with Voyage AI's embeddings, and pass the embedding dimensionality of 1024.
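With the Pinecone Python SDK, creating the pod-based index could look like the sketch below; the environment name is a placeholder for your own project's environment:

```python
from pinecone import Pinecone, PodSpec

pc = Pinecone(api_key="YOUR_PINECONE_API_KEY")

# Cosine metric and dimension=1024 match the voyage-law-2 embeddings;
# the environment value is a placeholder for your own.
pc.create_index(
    name="voyageai-pinecone-legalbench",
    dimension=1024,
    metric="cosine",
    spec=PodSpec(environment="us-east-1-aws", pod_type="p1.x1"),
)
index = pc.Index("voyageai-pinecone-legalbench")
```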
Pinecone expects you to provide vectors in the form (`id`, `vector`, `metadata`), where `metadata` is an optional extra field in which you can store anything you want in dictionary format. For this example, you will store the original text of the embeddings.
While uploading your data, you will batch everything to avoid pushing too much data in one go.
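A sketch of the batched upsert, assuming the `texts` and `embeddings` lists from the earlier steps; the batch size of 100 is arbitrary:

```python
batch_size = 100  # arbitrary batch size to keep each request small

for i in range(0, len(texts), batch_size):
    batch = [
        (str(j), embeddings[j], {"text": texts[j]})  # (id, vector, metadata)
        for j in range(i, min(i + batch_size, len(texts)))
    ]
    index.upsert(vectors=batch)
```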
You can see from `index.describe_index_stats` that you have a 1024-dimensional index populated with 154 embeddings. The `indexFullness` metric tells you how full your index is; with only 154 vectors, it is still effectively zero. Using the default value of one p1 pod, you can fit around 750K embeddings before `indexFullness` reaches capacity. The Usage Estimator can be used to identify the number of pods required for a given number of n-dimensional embeddings.
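Checking the stats is a one-liner; the output below is an abridged example, and field names may appear camelCased (`indexFullness`) or snake_cased depending on your client version:

```python
print(index.describe_index_stats())
# Abridged example output:
# {'dimension': 1024,
#  'index_fullness': 0.0,
#  'total_vector_count': 154}
```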
To search, you first embed the query with `voyage-law-2`, and then search Pinecone using the returned vector.
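A sketch of the query step; the question text is a hypothetical example:

```python
query = "Can the provider change the terms of the agreement?"  # hypothetical query

# Embed the query with input_type="query", then search the index.
xq = vo.embed([query], model="voyage-law-2", input_type="query").embeddings[0]
response = index.query(vector=xq, top_k=5, include_metadata=True)
```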
The original text of each match is returned in the `metadata` field. Let's print out the `top_k` most similar questions and their respective similarity scores.
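Printing the matches might look like this, assuming the `response` object from the query above:

```python
for match in response.matches:
    print(f"{match.score:.3f}: {match.metadata['text']}")
```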