Hugging Face Inference Endpoints
Using Hugging Face Inference Endpoints and Pinecone to generate and index high-quality vector embeddings
Hugging Face Inference Endpoints offers a secure production solution to easily deploy any Hugging Face Transformers, Sentence-Transformers and Diffusion models from the Hub on dedicated and autoscaling infrastructure managed by Hugging Face.
Coupled with Pinecone, you can use Hugging Face to generate and index high-quality vector embeddings with ease.
Setup guide
Hugging Face Inference Endpoints allows access to straightforward model inference. Coupled with Pinecone we can generate and index high-quality vector embeddings with ease.
Let’s get started by initializing an Inference Endpoint for generating vector embeddings.
Create an endpoint
We start by heading over to the Hugging Face Inference Endpoints homepage and signing up for an account if needed. After, we should find ourselves on this page:
We click on Create new endpoint, choose a model repository (eg name of the model), endpoint name (this can be anything), and select a cloud environment. Before moving on it is very important that we set the Task to Sentence Embeddings (found within the Advanced configuration settings).
Other important options include the Instance Type, by default this uses CPU which is cheaper but also slower. For faster processing we need a GPU instance. And finally, we set our privacy setting near the end of the page.
After setting our options we can click Create Endpoint at the bottom of the page. This action should take use to the next page where we will see the current status of our endpoint.
Once the status has moved from Building to Running (this can take some time), we’re ready to begin creating embeddings with it.
Create embeddings
Each endpoint is given an Endpoint URL, it can be found on the endpoint Overview page. We need to assign this endpoint URL to the endpoint_url
variable.
endpoint = "<ENDPOINT_URL>"
We will also need the organization API token, we find this via the organization settings on Hugging Face (https://huggingface.co/organizations/<ORG_NAME>/settings/profile
). This is assigned to the api_org
variable.
api_org = "<API_ORG_TOKEN>"
Now we’re ready to create embeddings via Inference Endpoints. Let’s start with a toy example.
import requests
# add the api org token to the headers
headers = {
'Authorization': f'Bearer {api_org}'
}
# we add sentences to embed like so
json_data = {"inputs": ["a happy dog", "a sad dog"]}
# make the request
res = requests.post(
endpoint,
headers=headers,
json=json_data
)
We should see a 200
response.
res
<Response [200]>
Inside the response we should find two embeddings…
len(res.json()['embeddings'])
2
We can also see the dimensionality of our embeddings like so:
dim = len(res.json()['embeddings'][0])
dim
768
We will need more than two items to search through, so let’s download a larger dataset. For this we will use Hugging Face datasets.
from datasets import load_dataset
snli = load_dataset("snli", split='train')
snli
Downloading: 100%|██████████| 1.93k/1.93k [00:00<00:00, 992kB/s]
Downloading: 100%|██████████| 1.26M/1.26M [00:00<00:00, 31.2MB/s]
Downloading: 100%|██████████| 65.9M/65.9M [00:01<00:00, 57.9MB/s]
Downloading: 100%|██████████| 1.26M/1.26M [00:00<00:00, 43.6MB/s]
Dataset({
features: ['premise', 'hypothesis', 'label'],
num_rows: 550152
})
SNLI contains 550K sentence pairs, many of these include duplicate items so we will take just one set of these (the hypothesis) and deduplicate them.
passages = list(set(snli['hypothesis']))
len(passages)
480042
We will drop to 50K sentences so that the example is quick to run, if you have time, feel free to keep the full 480K.
passages = passages[:50_000]
Create a Pinecone index
With our endpoint and dataset ready, all that we’re missing is a vector database. For this, we need to initialize our connection to Pinecone, this requires a free API key.
import pinecone
# initialize connection to pinecone (get API key at app.pinecone.io)
pinecone.init(api_key="YOUR_API_KEY", environment="YOUR_ENVIRONMENT")
Now we create a new index called 'hf-endpoints'
, the name isn’t important but the dimension
must align to our endpoint model output dimensionality (we found this in dim
above) and the model metric (typically cosine
is okay, but not for all models).
index_name = 'hf-endpoints'
# check if the hf-endpoints index exists
if index_name not in pinecone.list_indexes():
# create the index if it does not exist
pinecone.create_index(
index_name,
dimension=dim,
metric="cosine"
)
# connect to hf-endpoints index we created
index = pinecone.Index(index_name)
Create and index embeddings
Now we have all of our components ready; endpoints, dataset, and Pinecone. Let’s go ahead and create our dataset embeddings and index them within Pinecone.
from tqdm.auto import tqdm
# we will use batches of 64
batch_size = 64
for i in tqdm(range(0, len(passages), batch_size)):
# find end of batch
i_end = min(i+batch_size, len(passages))
# extract batch
batch = passages[i:i_end]
# generate embeddings for batch via endpoints
res = requests.post(
endpoint,
headers=headers,
json={"inputs": batch}
)
emb = res.json()['embeddings']
# get metadata (just the original text)
meta = [{'text': text} for text in batch]
# create IDs
ids = [str(x) for x in range(i, i_end)]
# add all to upsert list
to_upsert = list(zip(ids, emb, meta))
# upsert/insert these records to pinecone
_ = index.upsert(vectors=to_upsert)
# check that we have all vectors in index
index.describe_index_stats()
100%|██████████| 782/782 [11:02<00:00, 1.18it/s]
{'dimension': 768,
'index_fullness': 0.1,
'namespaces': {'': {'vector_count': 50000}},
'total_vector_count': 50000}
With everything indexed we can begin querying. We will take a few examples from the premise column of the dataset.
query = snli['premise'][0]
print(f"Query: {query}")
# encode with HF endpoints
res = requests.post(endpoint, headers=headers, json={"inputs": query})
xq = res.json()['embeddings']
# query and return top 5
xc = index.query(xq, top_k=5, include_metadata=True)
# iterate through results and print text
print("Answers:")
for match in xc['matches']:
print(match['metadata']['text'])
Query: A person on a horse jumps over a broken down airplane.
Answers:
The horse jumps over a toy airplane.
a lady rides a horse over a plane shaped obstacle
A person getting onto a horse.
person rides horse
A woman riding a horse jumps over a bar.
These look good, let’s try a couple more examples.
query = snli['premise'][100]
print(f"Query: {query}")
# encode with HF endpoints
res = requests.post(endpoint, headers=headers, json={"inputs": query})
xq = res.json()['embeddings']
# query and return top 5
xc = index.query(xq, top_k=5, include_metadata=True)
# iterate through results and print text
print("Answers:")
for match in xc['matches']:
print(match['metadata']['text'])
Query: A woman is walking across the street eating a banana, while a man is following with his briefcase.
Answers:
A woman eats a banana and walks across a street, and there is a man trailing behind her.
A woman eats a banana split.
A woman is carrying two small watermelons and a purse while walking down the street.
The woman walked across the street.
A woman walking on the street with a monkey on her back.
And one more…
query = snli['premise'][200]
print(f"Query: {query}")
# encode with HF endpoints
res = requests.post(endpoint, headers=headers, json={"inputs": query})
xq = res.json()['embeddings']
# query and return top 5
xc = index.query(xq, top_k=5, include_metadata=True)
# iterate through results and print text
print("Answers:")
for match in xc['matches']:
print(match['metadata']['text'])
Query: People on bicycles waiting at an intersection.
Answers:
A pair of people on bikes are waiting at a stoplight.
Bike riders wait to cross the street.
people on bicycles
Group of bike riders stopped in the street.
There are bicycles outside.
All of these results look excellent. If you are not planning on running your endpoint and vector DB beyond this tutorial, you can shut down both.
Clean up
Shut down the endpoint by navigating to the Inference Endpoints Overview page and selecting Delete endpoint. Delete the Pinecone index with:
pinecone.delete_index(index_name)