OpenAI
In this guide you will learn how to use the OpenAI Embedding API to generate language embeddings, and then index those embeddings in the Pinecone vector database for fast and scalable vector search.
This is a powerful and common combination for building semantic search, question-answering, threat-detection, and other applications that rely on NLP and search over a large corpus of text data.
The basic workflow looks like this:
- Embed and index
  - Use the OpenAI Embedding API to generate vector embeddings of your documents (or any text data).
  - Upload those vector embeddings into Pinecone, which can store and index millions/billions of these vector embeddings, and search through them at ultra-low latencies.
- Search
  - Pass your query text or document through the OpenAI Embedding API again.
  - Take the resulting vector embedding and send it as a query to Pinecone.
  - Get back semantically similar documents, even if they don't share any keywords with the query.
Let's get started...
Environment Setup
We start by installing the OpenAI and Pinecone clients. We also need HuggingFace Datasets to download the TREC dataset that we use in this guide.
pip install -U openai pinecone-client datasets
Creating Embeddings
To create embeddings, we first initialize our connection to OpenAI. This requires an API key, which you can get by signing up at OpenAI.
import openai
openai.organization = "<<YOUR_ORG_KEY>>"
# get this from top-right dropdown on OpenAI under organization > settings
openai.api_key = "<<YOUR_API_KEY>>"
# get API key from top-right dropdown on OpenAI website
openai.Engine.list() # check we have authenticated
The openai.Engine.list() function should return a list of models that we can use. We will use OpenAI's Babbage model.
MODEL = "text-similarity-babbage-001"

res = openai.Embedding.create(
    input=[
        "Sample document text goes here",
        "there will be several phrases in each batch"
    ], engine=MODEL
)
In res we should find a JSON-like object containing two 2048-dimensional embeddings; these are the vector representations of the two inputs provided above. To access the embeddings directly, we can write:
# extract embeddings to a list
embeds = [record['embedding'] for record in res['data']]
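As a quick optional sanity check (not part of the original walkthrough), we can confirm that we received one embedding per input and that each has the expected dimensionality:

# optional sanity check: one 2048-dimensional vector per input phrase
print(len(embeds))     # 2
print(len(embeds[0]))  # 2048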
We will use this logic when creating our embeddings for the Text REtrieval Conference (TREC) question classification dataset later.
Initializing a Pinecone Index
Next, we initialize an index to store the vector embeddings. For this we need a Pinecone API key; you can sign up for one at app.pinecone.io.
import pinecone
# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(
    api_key="<<YOUR_API_KEY>>",
    environment="us-west1-gcp"
)

# check if 'openai' index already exists (only create index if not)
if 'openai' not in pinecone.list_indexes():
    pinecone.create_index('openai', dimension=len(embeds[0]))

# connect to index
index = pinecone.Index('openai')
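Before populating the index, we can optionally confirm that it was created with the expected configuration. This is just a sanity check, not a required step:

# optional: check the index configuration and status
print(pinecone.describe_index('openai'))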
Populating the Index
With both OpenAI and Pinecone connections initialized, we can move on to populating the index. For this, we need the TREC dataset.
from datasets import load_dataset
# load the first 1K rows of the TREC dataset
trec = load_dataset('trec', split='train[:1000]')
Then we create a vector embedding for each question using OpenAI (as demonstrated earlier), and upsert the ID, vector embedding, and original text for each phrase to Pinecone.
Warning
High-cardinality metadata values (like the unique text values we use here)
can reduce the number of vectors that fit on a single pod. See
Limits for more.
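One way to mitigate this, sketched below rather than used in this guide, is selective metadata indexing on pod-based indexes: list only the metadata fields you need to filter on when creating the index, so high-cardinality fields like our raw text are stored but not indexed for filtering. Check the Pinecone documentation for the metadata_config behavior in your client version.

# sketch only: create the index with selective metadata indexing so that
# no metadata fields are indexed for filtering (they are still stored and
# returned with query results)
if 'openai' not in pinecone.list_indexes():
    pinecone.create_index(
        'openai',
        dimension=len(embeds[0]),
        metadata_config={'indexed': []}
    )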
from tqdm.auto import tqdm  # this is our progress bar

batch_size = 32  # process everything in batches of 32

for i in tqdm(range(0, len(trec['text']), batch_size)):
    # set end position of batch
    i_end = min(i+batch_size, len(trec['text']))
    # get batch of lines and IDs
    lines_batch = trec['text'][i:i_end]
    ids_batch = [str(n) for n in range(i, i_end)]
    # create embeddings
    res = openai.Embedding.create(input=lines_batch, engine=MODEL)
    embeds = [record['embedding'] for record in res['data']]
    # prep metadata and upsert batch
    meta = [{'text': line} for line in lines_batch]
    to_upsert = zip(ids_batch, embeds, meta)
    # upsert to Pinecone
    index.upsert(vectors=list(to_upsert))
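Once the loop finishes, it is worth verifying that the expected number of vectors landed in the index. This check is an optional addition to the original flow:

# optional: confirm how many vectors are now stored in the index
index.describe_index_stats()
# we expect a total vector count of 1000 (the number of TREC rows we embedded)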
Querying
With our data indexed, we're now ready to move on to performing searches. This follows a similar process to indexing. We start with a text query that we would like to use to find similar sentences. As before, we encode it with OpenAI's text similarity Babbage model to create a query vector xq. We then use xq to query the Pinecone index.
query = "What caused the 1929 Great Depression?"
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']
Now we query.
res = index.query([xq], top_k=5, include_metadata=True)
The response from Pinecone includes our original text in the metadata field. Let's print out the top_k most similar questions and their respective similarity scores.
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
[Out]:
0.95: Why did the world enter a global depression in 1929 ?
0.87: When was `` the Great Depression '' ?
0.86: What crop failure caused the Irish Famine ?
0.82: What caused the Lynmouth floods ?
0.79: What caused Harry Houdini 's death ?
Looks good. Let's make it harder and replace "depression" with the incorrect term "recession".
query = "What was the cause of the major recession in the early 20th century?"
# create the query embedding
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']
# query, returning the top 5 most similar results
res = index.query([xq], top_k=5, include_metadata=True)
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
[Out]:
0.92: Why did the world enter a global depression in 1929 ?
0.85: What crop failure caused the Irish Famine ?
0.83: When was `` the Great Depression '' ?
0.82: What are some of the significant historical events of the 1990s ?
0.82: What is considered the costliest disaster the insurance industry has ever faced ?
Let's perform one final search, this time describing a depression rather than using the word itself or related terms.
query = "Why was there a long-term economic downturn in the early 20th century?"
# create the query embedding
xq = openai.Embedding.create(input=query, engine=MODEL)['data'][0]['embedding']
# query, returning the top 5 most similar results
res = index.query([xq], top_k=5, include_metadata=True)
for match in res['matches']:
    print(f"{match['score']:.2f}: {match['metadata']['text']}")
[Out]:
0.93: Why did the world enter a global depression in 1929 ?
0.83: What crop failure caused the Irish Famine ?
0.82: When was `` the Great Depression '' ?
0.82: How did serfdom develop in and then leave Russia ?
0.80: Why were people recruited for the Vietnam War ?
It's clear from these examples that the semantic search pipeline is able to identify the meaning shared across each of our queries. Using these embeddings with Pinecone allows us to return the most semantically similar questions from the already-indexed TREC dataset.
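If you are done experimenting and no longer need the stored vectors, you can optionally delete the index to free up resources:

# optional cleanup: remove the index once you no longer need it
pinecone.delete_index('openai')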