Retrieval Augmentation

Large Language Models (LLMs) have a data freshness problem. The most powerful LLMs in the world, like GPT-4, have no idea about recent world events.

The world of LLMs is frozen in time. Their world exists as a static snapshot of the world as it was within their training data.

A solution to this problem is retrieval augmentation. The idea behind this is that we retrieve relevant information from an external knowledge base and give that information to our LLM. In this notebook we will learn how to do that.
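
At a high level, the pattern is simple: embed the user's question, retrieve the most relevant chunks of text from the knowledge base, and place those chunks in the prompt alongside the question. Below is a minimal sketch of that flow; the retrieve and generate helpers are hypothetical stand-ins for the Pinecone and OpenAI calls we set up later in this notebook.

def answer_with_retrieval(question, retrieve, generate, k=3):
    # retrieve(question, k) -> list of k relevant text chunks (hypothetical helper)
    contexts = retrieve(question, k)
    # "augment" the prompt with the retrieved context
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(contexts) +
        f"\n\nQuestion: {question}\nAnswer:"
    )
    # generate(prompt) -> LLM completion (hypothetical helper)
    return generate(prompt)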


To begin, we must install the prerequisite libraries that we will be using in this notebook.

!pip install -qU \
  langchain==0.0.162 \
  openai==0.27.7 \
  tiktoken==0.4.0 \
  "pinecone-client[grpc]"==2.2.1 \

šŸšØ Note: the above pip install is formatted for Jupyter notebooks. If running elsewhere you may need to drop the !.

Building the Knowledge Base

We will download a pre-embedded dataset from pinecone-datasets, allowing us to skip the embedding and preprocessing steps. If you'd rather work through those steps yourself, you can find them in the full notebook.

import pinecone_datasets

dataset = pinecone_datasets.load_dataset('wikipedia-simple-text-embedding-ada-002')
# preview the first few rows
dataset.head()
id values sparse_values metadata blob
0 1-0 [-0.011254455894231796, -0.01698738895356655, ... None None {'chunk': 0, 'source': 'https://simple.wikiped...
1 1-1 [-0.0015197008615359664, -0.007858820259571075... None None {'chunk': 1, 'source': 'https://simple.wikiped...
2 1-2 [-0.009930099360644817, -0.012211072258651257,... None None {'chunk': 2, 'source': 'https://simple.wikiped...
3 1-3 [-0.011600767262279987, -0.012608098797500134,... None None {'chunk': 3, 'source': 'https://simple.wikiped...
4 1-4 [-0.026462381705641747, -0.016362832859158516,... None None {'chunk': 4, 'source': 'https://simple.wikiped...

We'll format the dataset for upsert and reduce it to a subset of the full dataset.

# we drop sparse_values as they are not needed for this example
dataset.documents.drop(['sparse_values', 'metadata'], axis=1, inplace=True)
dataset.documents.rename(columns={'blob': 'metadata'}, inplace=True)
# we will use rows of the dataset up to index 30_000
dataset.documents.drop(dataset.documents.index[30_000:], inplace=True)
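
As a quick sanity check before upserting, we can confirm the subset has the expected number of rows (this just inspects the dataframe we modified above):

# confirm we kept only the first 30,000 rows
len(dataset.documents)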

Now we move on to initializing our Pinecone vector database.

Vector Database

To create our vector database we first need a free API key from Pinecone. Then we initialize like so:

import os
import pinecone

# find your API key in the Pinecone console
# find your ENV (cloud region) next to the API key in the console
PINECONE_API_KEY = os.getenv('PINECONE_API_KEY') or 'YOUR_API_KEY'
PINECONE_ENV = os.getenv('PINECONE_ENVIRONMENT') or 'YOUR_ENVIRONMENT'

pinecone.init(
    api_key=PINECONE_API_KEY,
    environment=PINECONE_ENV
)

index_name = 'langchain-retrieval-augmentation-fast'

if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='dotproduct',
        dimension=1536,  # 1536 dim of text-embedding-ada-002
        metadata_config={'indexed': ['wiki-id', 'title']}
    )

Then we connect to the new index:

import time

index = pinecone.GRPCIndex(index_name)
# wait a moment for the index to be fully initialized
time.sleep(1)

index.describe_index_stats()
{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

We should see that the new Pinecone index has a total_vector_count of 0, as we haven't added any vectors yet.

Now we upsert the data to Pinecone:

index.upsert_from_dataframe(dataset.documents, batch_size=100)
sending upsert requests:   0%|          | 0/30000 [00:00<?, ?it/s]

collecting async responses:   0%|          | 0/300 [00:00<?, ?it/s]

upserted_count: 30000

We've now indexed everything. We can check the number of vectors in our index like so:

index.describe_index_stats()
{'dimension': 1536,
 'index_fullness': 0.1,
 'namespaces': {'': {'vector_count': 30000}},
 'total_vector_count': 30000}

Creating a Vector Store and Querying

Now that we've built our index, we can switch over to LangChain. We need to initialize a LangChain vector store using the same index we just built. For this we will also need a LangChain embedding object, which we initialize like so:

from langchain.embeddings.openai import OpenAIEmbeddings

# get your OpenAI API key from the OpenAI platform
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY') or 'YOUR_OPENAI_API_KEY'

model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)
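
As a quick check (assuming a valid OpenAI API key is set), we can embed a short string and confirm the vector has the same 1536 dimensions as the index we created:

# embed a test string and check the vector dimensionality
vec = embed.embed_query("hello world")
len(vec)  # should be 1536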

Now initialize the vector store:

from langchain.vectorstores import Pinecone

text_field = "text"

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)

Now we can query the vector store directly using vectorstore.similarity_search:

query = "who was Benito Mussolini?"

    query,  # our search query
    k=3  # return 3 most relevant docs
[Document(page_content='Benito Amilcare Andrea Mussolini KSMOM GCTE (29 July 1883 ā€“ 28 April 1945) was an Italian politician and journalist. He was also the Prime Minister of Italy from 1922 until 1943. He was the leader of the National Fascist Party.\n\nBiography\n\nEarly life\nBenito Mussolini was named after Benito Juarez, a Mexican opponent of the political power of the Roman Catholic Church, by his anticlerical (a person who opposes the political interference of the Roman Catholic Church in secular affairs) father. Mussolini\'s father was a blacksmith. Before being involved in politics, Mussolini was a newspaper editor (where he learned all his propaganda skills) and elementary school teacher.\n\nAt first, Mussolini was a socialist, but when he wanted Italy to join the First World War, he was thrown out of the socialist party. He \'invented\' a new ideology, Fascism, much out of Nationalist\xa0and Conservative views.\n\nRise to power and becoming dictator\nIn 1922, he took power by having a large group of men, "Black Shirts," march on Rome and threaten to take over the government. King Vittorio Emanuele III gave in, allowed him to form a government, and made him prime minister. In the following five years, he gained power, and in 1927 created the OVRA, his personal secret police force. Using the agency to arrest, scare, or murder people against his regime, Mussolini was dictator\xa0of Italy by the end of 1927. Only the King and his own Fascist party could challenge his power.', metadata={'chunk': 0.0, 'source': '', 'title': 'Benito Mussolini', 'wiki-id': '6754'}),
 Document(page_content='Fascism as practiced by Mussolini\nMussolini\'s form of Fascism, "Italian Fascism"- unlike Nazism, the racist ideology that Adolf Hitler followed- was different and less destructive than Hitler\'s. Although a believer in the superiority of the Italian nation and national unity, Mussolini, unlike Hitler, is quoted "Race? It is a feeling, not a reality. Nothing will ever make me believe that biologically pure races can be shown to exist today".\n\nMussolini wanted Italy to become a new Roman Empire. In 1923, he attacked the island of Corfu, and in 1924, he occupied the city state of Fiume. In 1935, he attacked the African country Abyssinia (now called Ethiopia). His forces occupied it in 1936. Italy was thrown out of the League of Nations because of this aggression. In 1939, he occupied the country Albania. In 1936, Mussolini signed an alliance with Adolf Hitler, the dictator of Germany.\n\nFall from power and death\nIn 1940, he sent Italy into the Second World War on the side of the Axis countries. Mussolini attacked Greece, but he failed to conquer it. In 1943, the Allies landed in Southern Italy. The Fascist party and King Vittorio Emanuel III deposed Mussolini and put him in jail, but he was set free by the Germans, who made him ruler of the Italian Social Republic puppet state which was in a small part of Central Italy. When the war was almost over, Mussolini tried to escape to Switzerland with his mistress, Clara Petacci, but they were both captured and shot by partisans. Mussolini\'s dead body was hanged upside-down, together with his mistress and some of Mussolini\'s helpers, on a pole at a gas station in the village of Millan, which is near the border  between Italy and Switzerland.', metadata={'chunk': 1.0, 'source': '', 'title': 'Benito Mussolini', 'wiki-id': '6754'}),
 Document(page_content='Veneto was made part of Italy in 1866 after a war with Austria. Italian soldiers won Latium in 1870. That was when they took away the Pope\'s power. The Pope, who was angry, said that he was a prisoner to keep Catholic people from being active in politics. That was the year of Italian unification.\n\nItaly participated in World War I. It was an ally of Great Britain, France, and Russia against the Central Powers. Almost all of Italy\'s fighting was on the Eastern border, near Austria. After the "Caporetto defeat", Italy thought they would lose the war. But, in 1918, the Central Powers surrendered. Italy gained the Trentino-South Tyrol, which once was owned by Austria.\n\nFascist Italy \nIn 1922, a new Italian government started. It was ruled by Benito Mussolini, the leader of Fascism in Italy. He became head of government and dictator, calling himself "Il Duce" (which means "leader" in Italian). He became friends with German dictator Adolf Hitler. Germany, Japan, and Italy became the Axis Powers. In 1940, they entered World War II together against France, Great Britain, and later the Soviet Union. During the war, Italy controlled most of the Mediterranean Sea.', metadata={'chunk': 5.0, 'source': '', 'title': 'Italy', 'wiki-id': '363'})]

All of these are good, relevant results. But what can we do with them? There are many tasks; one of the most interesting (and well supported by LangChain) is called "Generative Question-Answering", or GQA.

Generative Question-Answering

In GQA we treat the query as a question to be answered by an LLM, but the LLM must answer the question based on the information returned from the vectorstore.

To do this we initialize a RetrievalQA object like so:

from langchain.chat_models import ChatOpenAI
from langchain.chains import RetrievalQA

# completion llm
llm = ChatOpenAI(openai_api_key=OPENAI_API_KEY, model_name='gpt-3.5-turbo', temperature=0.0)

qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" the retrieved docs into the prompt
    retriever=vectorstore.as_retriever()
)

qa.run(query)
'Benito Mussolini was an Italian politician and journalist who served as the Prime Minister of Italy from 1922 until 1943. He was the leader of the National Fascist Party and was also a newspaper editor and elementary school teacher before being involved in politics. Mussolini was a dictator of Italy by the end of 1927, and his form of Fascism, "Italian Fascism," was different and less destructive than Hitler\'s Nazism. He wanted Italy to become a new Roman Empire and attacked several countries, including Abyssinia (now called Ethiopia) and Greece. Mussolini was captured and shot by partisans in 1945.'
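
For intuition, the "stuff" chain above is roughly equivalent to retrieving the documents ourselves and stuffing them into a single prompt for the chat model. The sketch below shows that equivalent flow; it is illustrative only and does not reproduce LangChain's exact prompt template.

from langchain.schema import HumanMessage

# retrieve the top-3 most relevant chunks, as before
docs = vectorstore.similarity_search(query, k=3)

# "stuff" all retrieved chunks into one prompt alongside the question
context = "\n---\n".join(doc.page_content for doc in docs)
prompt = (
    "Answer the question using the context below.\n\n"
    f"Context:\n{context}\n\n"
    f"Question: {query}\nAnswer:"
)

# ask the chat model directly
print(llm([HumanMessage(content=prompt)]).content)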

We can also include the sources of information that the LLM is using to answer our question. We can do this using a slightly different version of RetrievalQA called RetrievalQAWithSourcesChain:

from langchain.chains import RetrievalQAWithSourcesChain

qa_with_sources = RetrievalQAWithSourcesChain.from_chain_type(
    llm=llm, chain_type="stuff", retriever=vectorstore.as_retriever()
)
qa_with_sources(query)
{'question': 'who was Benito Mussolini?',
 'answer': 'Benito Mussolini was an Italian politician and journalist who was the Prime Minister of Italy from 1922 until 1943. He was the leader of the National Fascist Party and created his own form of Fascism called "Italian Fascism". He wanted Italy to become a new Roman Empire and attacked several countries, including Abyssinia (now called Ethiopia) and Greece. He was dictator of Italy by the end of 1927 and was deposed in 1943. He was later captured and shot by partisans. Mario Draghi is the current head of government of Italy. Italy was not a state before 1861 and was a group of separate states ruled by other countries. In 1860, Giuseppe Garibaldi took control of Sicily, creating the Kingdom of Italy in 1861. Victor Emmanuel II was made the king. \n',
 'sources': ','}

Now we both answer the question being asked and return the sources of the information used by the LLM to produce that answer.

Once done, we can delete the index to save resources.
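
With the Pinecone client initialized above, deleting the index is a single call:

# delete the index once you're finished with it to free resources
pinecone.delete_index(index_name)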