Hugging Face Inference Endpoints

Hugging Face Inference Endpoints allows access to straightforward model inference. Coupled with Pinecone we can generate and index high-quality vector embeddings with ease.

Let's get started by initializing an Inference Endpoint for generating vector embeddings.

Endpoints

We start by heading over to the Hugging Face Inference Endpoints homepage and signing up for an account if needed. After, we should find ourselves on this page:

endpoints 0endpoints 0

We click on Create new endpoint, choose a model repository (eg name of the model), endpoint name (this can be anything), and select a cloud environment. Before moving on it is very important that we set the Task to Sentence Embeddings (found within the Advanced configuration settings).

endpoints 1endpoints 1

endpoints 2endpoints 2

Other important options include the Instance Type, by default this uses CPU which is cheaper but also slower. For faster processing we need a GPU instance. And finally, we set our privacy setting near the end of the page.

After setting our options we can click Create Endpoint at the bottom of the page. This action should take use to the next page where we will see the current status of our endpoint.

endpoints 3endpoints 3

Once the status has moved from Building to Running (this can take some time), we're ready to begin creating embeddings with it.

Creating Embeddings

Each endpoint is given an Endpoint URL, it can be found on the endpoint Overview page. We need to assign this endpoint URL to the endpoint_url variable.

endpoints 4endpoints 4

endpoint = "<<ENDPOINT_URL>>"

We will also need the organization API token, we find this via the organization settings on Hugging Face (https://huggingface.co/organizations/<ORG_NAME>/settings/profile). This is assigned to the api_org variable.

endpoints 5endpoints 5

api_org = "<<API_ORG_TOKEN>>"

Now we're ready to create embeddings via Inference Endpoints. Let's start with a toy example.

import requests

# add the api org token to the headers
headers = {
    'Authorization': f'Bearer {api_org}'
}
# we add sentences to embed like so
json_data = {"inputs": ["a happy dog", "a sad dog"]}
# make the request
res = requests.post(
    endpoint,
    headers=headers,
    json=json_data
)

We should see a 200 response.

res
<Response [200]>

Inside the response we should find two embeddings...

len(res.json()['embeddings'])
2

We can also see the dimensionality of our embeddings like so:

dim = len(res.json()['embeddings'][0])
dim
768

We will need more than two items to search through, so let's download a larger dataset. For this we will use Hugging Face datasets.

from datasets import load_dataset

snli = load_dataset("snli", split='train')
snli
Downloading: 100%|██████████| 1.93k/1.93k [00:00<00:00, 992kB/s]
Downloading: 100%|██████████| 1.26M/1.26M [00:00<00:00, 31.2MB/s]
Downloading: 100%|██████████| 65.9M/65.9M [00:01<00:00, 57.9MB/s]
Downloading: 100%|██████████| 1.26M/1.26M [00:00<00:00, 43.6MB/s]

Dataset({
    features: ['premise', 'hypothesis', 'label'],
    num_rows: 550152
})

SNLI contains 550K sentence pairs, many of these include duplicate items so we will take just one set of these (the hypothesis) and deduplicate them.

passages = list(set(snli['hypothesis']))
len(passages)
480042

We will drop to 50K sentences so that the example is quick to run, if you have time, feel free to keep the full 480K.

passages = passages[:50_000]

Vector DB

With our endpoint and dataset ready, all that we're missing is a vector database. For this, we need to initialize our connection to Pinecone, this requires a free API key.

import pinecone

# connect to pinecone environment
pinecone.init(
    api_key="<<YOUR_API_KEY>>",
    environment="us-west1-gcp"
)

Now we create a new index called 'hf-endpoints', the name isn't important but the dimension must align to our endpoint model output dimensionality (we found this in dim above) and the model metric (typically cosine is okay, but not for all models).

index_name = 'hf-endpoints'

# check if the hf-endpoints index exists
if index_name not in pinecone.list_indexes():
    # create the index if it does not exist
    pinecone.create_index(
        index_name,
        dimension=dim,
        metric="cosine"
    )

# connect to hf-endpoints index we created
index = pinecone.Index(index_name)

Create and Index Embeddings

Now we have all of our components ready; endpoints, dataset, and Pinecone. Let's go ahead and create our dataset embeddings and index them within Pinecone.

from tqdm.auto import tqdm

# we will use batches of 64
batch_size = 64

for i in tqdm(range(0, len(passages), batch_size)):
    # find end of batch
    i_end = min(i+batch_size, len(passages))
    # extract batch
    batch = passages[i:i_end]
    # generate embeddings for batch via endpoints
    res = requests.post(
        endpoint,
        headers=headers,
        json={"inputs": batch}
    )
    emb = res.json()['embeddings']
    # get metadata (just the original text)
    meta = [{'text': text} for text in batch]
    # create IDs
    ids = [str(x) for x in range(i, i_end)]
    # add all to upsert list
    to_upsert = list(zip(ids, emb, meta))
    # upsert/insert these records to pinecone
    _ = index.upsert(vectors=to_upsert)

# check that we have all vectors in index
index.describe_index_stats()
100%|██████████| 782/782 [11:02<00:00,  1.18it/s]


{'dimension': 768,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 50000}},
 'total_vector_count': 50000}

With everything indexed we can begin querying. We will take a few examples from the premise column of the dataset.

query = snli['premise'][0]
print(f"Query: {query}")
# encode with HF endpoints
res = requests.post(endpoint, headers=headers, json={"inputs": query})
xq = res.json()['embeddings']
# query and return top 5
xc = index.query(xq, top_k=5, include_metadata=True)
# iterate through results and print text
print("Answers:")
for match in xc['matches']:
    print(match['metadata']['text'])
Query: A person on a horse jumps over a broken down airplane.
Answers:
The horse jumps over a toy airplane.
a lady rides a horse over a plane shaped obstacle
A person getting onto a horse.
person rides horse
A woman riding a horse jumps over a bar.

These look good, let's try a couple more examples.

query = snli['premise'][100]
print(f"Query: {query}")
# encode with HF endpoints
res = requests.post(endpoint, headers=headers, json={"inputs": query})
xq = res.json()['embeddings']
# query and return top 5
xc = index.query(xq, top_k=5, include_metadata=True)
# iterate through results and print text
print("Answers:")
for match in xc['matches']:
    print(match['metadata']['text'])
Query: A woman is walking across the street eating a banana, while a man is following with his briefcase.
Answers:
A woman eats a banana and walks across a street, and there is a man trailing behind her.
A woman eats a banana split.
A woman is carrying two small watermelons and a purse while walking down the street.
The woman walked across the street.
A woman walking on the street with a monkey on her back.

And one more...

query = snli['premise'][200]
print(f"Query: {query}")
# encode with HF endpoints
res = requests.post(endpoint, headers=headers, json={"inputs": query})
xq = res.json()['embeddings']
# query and return top 5
xc = index.query(xq, top_k=5, include_metadata=True)
# iterate through results and print text
print("Answers:")
for match in xc['matches']:
    print(match['metadata']['text'])
Query: People on bicycles waiting at an intersection.
Answers:
A pair of people on bikes are waiting at a stoplight.
Bike riders wait to cross the street.
people on bicycles
Group of bike riders stopped in the street.
There are bicycles outside.

All of these results look excellent. If you are not planning on running your endpoint and vector DB beyond this tutorial, you can shut down both.

Once the index is deleted, you cannot use it again.

Shut down the endpoint by navigating to the Inference Endpoints Overview page and selecting Delete endpoint. Delete the Pinecone index with:

pinecone.delete_index(index_name)