LangChain Retrieval Agent

Open In Colab Open nbviewer Open Github

Chatbots can struggle with data freshness, knowledge about specific domains, or accessing internal documentation. By coupling agents with retrieval augmentation tools we no longer have these problems.

One the other side, using "naive" retrieval augmentation without the use of an agent means we will retrieve contexts with every query. Again, this isn't always ideal as not every query requires access to external knowledge.

Merging these methods gives us the best of both worlds. In this notebook we'll learn how to do this.

(See our LangChain Handbook for more on LangChain).

To begin, we must install the prerequisite libraries that we will be using in this notebook.

!pip install -qU openai pinecone-client[grpc] langchain==0.0.162 tiktoken datasets

Building the Knowledge Base

We start by constructing our knowledge base. We'll use a mostly prepared dataset called Stanford Question-Answering Dataset (SQuAD) hosted on Hugging Face Datasets. We download it like so:

from datasets import load_dataset

data = load_dataset('squad', split='train')
data
Dataset({
    features: ['id', 'title', 'context', 'question', 'answers'],
    num_rows: 87599
})

The dataset does contain duplicate contexts, which we can remove like so:

data = data.to_pandas()
data.head()
id title context question answers
0 5733be284776f41900661182 University_of_Notre_Dame Architecturally, the school has a Catholic cha... To whom did the Virgin Mary allegedly appear i... {'text': ['Saint Bernadette Soubirous'], 'answ...
1 5733be284776f4190066117f University_of_Notre_Dame Architecturally, the school has a Catholic cha... What is in front of the Notre Dame Main Building? {'text': ['a copper statue of Christ'], 'answe...
2 5733be284776f41900661180 University_of_Notre_Dame Architecturally, the school has a Catholic cha... The Basilica of the Sacred heart at Notre Dame... {'text': ['the Main Building'], 'answer_start'...
3 5733be284776f41900661181 University_of_Notre_Dame Architecturally, the school has a Catholic cha... What is the Grotto at Notre Dame? {'text': ['a Marian place of prayer and reflec...
4 5733be284776f4190066117e University_of_Notre_Dame Architecturally, the school has a Catholic cha... What sits on top of the Main Building at Notre... {'text': ['a golden statue of the Virgin Mary'...
data.drop_duplicates(subset='context', keep='first', inplace=True)
data.head()
id title context question answers
0 5733be284776f41900661182 University_of_Notre_Dame Architecturally, the school has a Catholic cha... To whom did the Virgin Mary allegedly appear i... {'text': ['Saint Bernadette Soubirous'], 'answ...
5 5733bf84d058e614000b61be University_of_Notre_Dame As at most other universities, Notre Dame's st... When did the Scholastic Magazine of Notre dame... {'text': ['September 1876'], 'answer_start': [...
10 5733bed24776f41900661188 University_of_Notre_Dame The university is the major seat of the Congre... Where is the headquarters of the Congregation ... {'text': ['Rome'], 'answer_start': [119]}
15 5733a6424776f41900660f51 University_of_Notre_Dame The College of Engineering was established in ... How many BS level degrees are offered in the C... {'text': ['eight'], 'answer_start': [487]}
20 5733a70c4776f41900660f64 University_of_Notre_Dame All of Notre Dame's undergraduate students are... What entity provides help with the management ... {'text': ['Learning Resource Center'], 'answer...

Initialize the Embedding Model and Vector DB

We'll be using OpenAI's text-embedding-ada-002 model initialize via LangChain and the Pinecone vector DB. We start by initializing the embedding model, for this we need an OpenAI API key.

(Note that OpenAI is a paid service and so running the remainder of this notebook may incur some small cost)

from getpass import getpass
from langchain.embeddings.openai import OpenAIEmbeddings

OPENAI_API_KEY = getpass("OpenAI API Key: ")  # platform.openai.com
model_name = 'text-embedding-ada-002'

embed = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=OPENAI_API_KEY
)

Next we initialize the vector database. For this we need a free API key, then we create the index:

import pinecone

# find API key in console at app.pinecone.io
YOUR_API_KEY = getpass("Pinecone API Key: ")
# find ENV (cloud region) next to API key in console
YOUR_ENV = input("Pinecone environment: ")

index_name = 'langchain-retrieval-agent'
pinecone.init(
    api_key=YOUR_API_KEY,
    environment=YOUR_ENV
)

if index_name not in pinecone.list_indexes():
    # we create a new index
    pinecone.create_index(
        name=index_name,
        metric='dotproduct',
        dimension=1536  # 1536 dim of text-embedding-ada-002
    )

Then connect to the index:

index = pinecone.GRPCIndex(index_name)
index.describe_index_stats()
{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {},
 'total_vector_count': 0}

We should see that the new Pinecone index has a total_vector_count of 0, as we haven't added any vectors yet.

Indexing

We can perform the indexing task using the LangChain vector store object. But for now it is much faster to do it via the Pinecone python client directly. We will do this in batches of 100 or more.

from tqdm.auto import tqdm
from uuid import uuid4

batch_size = 100

texts = []
metadatas = []

for i in tqdm(range(0, len(data), batch_size)):
    # get end of batch
    i_end = min(len(data), i+batch_size)
    batch = data.iloc[i:i_end]
    # first get metadata fields for this record
    metadatas = [{
        'title': record['title'],
        'text': record['context']
    } for j, record in batch.iterrows()]
    # get the list of contexts / documents
    documents = batch['context']
    # create document embeddings
    embeds = embed.embed_documents(documents)
    # get IDs
    ids = batch['id']
    # add everything to pinecone
    index.upsert(vectors=zip(ids, embeds, metadatas))

We've indexed everything, now we can check the number of vectors in our index like so:

index.describe_index_stats()
{'dimension': 1536,
 'index_fullness': 0.0,
 'namespaces': {'': {'vector_count': 18891}},
 'total_vector_count': 18891}

Creating a Vector Store and Querying

Now that we've build our index we can switch back over to LangChain. We start by initializing a vector store using the same index we just built. We do that like so:

from langchain.vectorstores import Pinecone

text_field = "text"

# switch back to normal index for langchain
index = pinecone.Index(index_name)

vectorstore = Pinecone(
    index, embed.embed_query, text_field
)

As in previous examples, we can use the similarity_search method to do a pure semantic search (without the generation component).

query = "when was the college of engineering in the University of Notre Dame established?"

vectorstore.similarity_search(
    query,  # our search query
    k=3  # return 3 most relevant docs
)
[Document(page_content="In 1919 Father James Burns became president of Notre Dame, and in three years he produced an academic revolution that brought the school up to national standards by adopting the elective system and moving away from the university's traditional scholastic and classical emphasis. By contrast, the Jesuit colleges, bastions of academic conservatism, were reluctant to move to a system of electives. Their graduates were shut out of Harvard Law School for that reason. Notre Dame continued to grow over the years, adding more colleges, programs, and sports teams. By 1921, with the addition of the College of Commerce, Notre Dame had grown from a small college to a university with five colleges and a professional law school. The university continued to expand and add new residence halls and buildings with each subsequent president.", metadata={'title': 'University_of_Notre_Dame'}),
 Document(page_content='The College of Engineering was established in 1920, however, early courses in civil and mechanical engineering were a part of the College of Science since the 1870s. Today the college, housed in the Fitzpatrick, Cushing, and Stinson-Remick Halls of Engineering, includes five departments of study – aerospace and mechanical engineering, chemical and biomolecular engineering, civil engineering and geological sciences, computer science and engineering, and electrical engineering – with eight B.S. degrees offered. Additionally, the college offers five-year dual degree programs with the Colleges of Arts and Letters and of Business awarding additional B.A. and Master of Business Administration (MBA) degrees, respectively.', metadata={'title': 'University_of_Notre_Dame'}),
 Document(page_content='Since 2005, Notre Dame has been led by John I. Jenkins, C.S.C., the 17th president of the university. Jenkins took over the position from Malloy on July 1, 2005. In his inaugural address, Jenkins described his goals of making the university a leader in research that recognizes ethics and building the connection between faith and studies. During his tenure, Notre Dame has increased its endowment, enlarged its student body, and undergone many construction projects on campus, including Compton Family Ice Arena, a new architecture hall, additional residence halls, and the Campus Crossroads, a $400m enhancement and expansion of Notre Dame Stadium.', metadata={'title': 'University_of_Notre_Dame'})]

Looks like we're getting good results. Let's take a look at how we can begin integrating this into a conversational agent.

Initializing the Conversational Agent

Our conversational agent needs a Chat LLM, conversational memory, and a RetrievalQA chain to initialize. We create these using:

from langchain.chat_models import ChatOpenAI
from langchain.chains.conversation.memory import ConversationBufferWindowMemory
from langchain.chains import RetrievalQA

# chat completion llm
llm = ChatOpenAI(
    openai_api_key=OPENAI_API_KEY,
    model_name='gpt-3.5-turbo',
    temperature=0.0
)
# conversational memory
conversational_memory = ConversationBufferWindowMemory(
    memory_key='chat_history',
    k=5,
    return_messages=True
)
# retrieval qa chain
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

Using these we can generate an answer using the run method:

qa.run(query)
'The College of Engineering was established in 1920 at the University of Notre Dame.'

But this isn't yet ready for our conversational agent. For that we need to convert this retrieval chain into a tool. We do that like so:

from langchain.agents import Tool

tools = [
    Tool(
        name='Knowledge Base',
        func=qa.run,
        description=(
            'use this tool when answering general knowledge queries to get '
            'more information about the topic'
        )
    )
]

Now we can initialize the agent like so:

from langchain.agents import initialize_agent

agent = initialize_agent(
    agent='chat-conversational-react-description',
    tools=tools,
    llm=llm,
    verbose=True,
    max_iterations=3,
    early_stopping_method='generate',
    memory=conversational_memory
)

With that our retrieval augmented conversational agent is ready and we can begin using it.

Using the Conversational Agent

To make queries we simply call the agent directly.

agent(query)
> Entering new AgentExecutor chain...
{
    "action": "Knowledge Base",
    "action_input": "When was the College of Engineering at the University of Notre Dame established?"
}
Observation: The College of Engineering at the University of Notre Dame was established in 1920.
Thought:{
    "action": "Final Answer",
    "action_input": "The College of Engineering at the University of Notre Dame was established in 1920."
}

> Finished chain.





{'input': 'when was the college of engineering in the University of Notre Dame established?',
 'chat_history': [],
 'output': 'The College of Engineering at the University of Notre Dame was established in 1920.'}

Looks great, now what if we ask it a non-general knowledge question?

agent("what is 2 * 7?")
> Entering new AgentExecutor chain...
{
    "action": "Final Answer",
    "action_input": "The result of 2 * 7 is 14."
}

> Finished chain.





{'input': 'what is 2 * 7?',
 'chat_history': [HumanMessage(content='when was the college of engineering in the University of Notre Dame established?', additional_kwargs={}, example=False),
  AIMessage(content='The College of Engineering at the University of Notre Dame was established in 1920.', additional_kwargs={}, example=False)],
 'output': 'The result of 2 * 7 is 14.'}

Perfect, the agent is able to recognize that it doesn't need to refer to it's general knowledge tool for that question. Let's try some more questions.

agent("can you tell me some facts about the University of Notre Dame?")
> Entering new AgentExecutor chain...
{
    "action": "Knowledge Base",
    "action_input": "University of Notre Dame facts"
}
Observation: - The University of Notre Dame is a Catholic research university located in South Bend, Indiana, USA.
- It is consistently ranked among the top twenty universities in the United States and as a major global university.
- The undergraduate component of the university is organized into four colleges (Arts and Letters, Science, Engineering, Business) and the Architecture School.
- The university's graduate program has more than 50 master's, doctoral and professional degree programs offered by the five schools, with the addition of the Notre Dame Law School and a MD-PhD program offered in combination with IU medical School.
- Notre Dame's campus covers 1,250 acres in a suburban setting and it contains a number of recognizable landmarks, such as the Golden Dome, the "Word of Life" mural (commonly known as Touchdown Jesus), and the Basilica.
- The university counts approximately 120,000 alumni, considered among the strongest alumni networks among U.S. colleges.
- The university has many multi-disciplinary institutes devoted to research in varying fields, including the Medieval Institute, the Kellogg Institute for International Studies, the Kroc Institute for International Peace studies, and the Center for Social Concerns.
- Recent research includes work on family conflict and child development, genome mapping, the increasing trade deficit of the United States with China, studies in fluid mechanics, computational science and engineering, and marketing trends on the Internet.
- In 2014 the Notre Dame student body consisted of 12,179 students, with 8,448 undergraduates, 2,138 graduate and professional and 1,593 professional (Law, M.Div., Business, M.Ed.) students.
- The university's intramural sports program was named in 2004 by Sports Illustrated as the best program in the country, while in 2007 The Princeton Review named it as the top school where "Everyone Plays Intramural Sports."
Thought:{
    "action": "Final Answer",
    "action_input": "The University of Notre Dame is a Catholic research university located in South Bend, Indiana, USA. It is consistently ranked among the top twenty universities in the United States and as a major global university. The undergraduate component of the university is organized into four colleges (Arts and Letters, Science, Engineering, Business) and the Architecture School. The university's graduate program has more than 50 master's, doctoral and professional degree programs offered by the five schools, with the addition of the Notre Dame Law School and a MD-PhD program offered in combination with IU medical School. Notre Dame's campus covers 1,250 acres in a suburban setting and it contains a number of recognizable landmarks, such as the Golden Dome, the \"Word of Life\" mural (commonly known as Touchdown Jesus), and the Basilica. The university counts approximately 120,000 alumni, considered among the strongest alumni networks among U.S. colleges. The university has many multi-disciplinary institutes devoted to research in varying fields, including the Medieval Institute, the Kellogg Institute for International Studies, the Kroc Institute for International Peace studies, and the Center for Social Concerns. Recent research includes work on family conflict and child development, genome mapping, the increasing trade deficit of the United States with China, studies in fluid mechanics, computational science and engineering, and marketing trends on the Internet. In 2014 the Notre Dame student body consisted of 12,179 students, with 8,448 undergraduates, 2,138 graduate and professional and 1,593 professional (Law, M.Div., Business, M.Ed.) students. The university's intramural sports program was named in 2004 by Sports Illustrated as the best program in the country, while in 2007 The Princeton Review named it as the top school where \"Everyone Plays Intramural Sports.\""
}

> Finished chain.





{'input': 'can you tell me some facts about the University of Notre Dame?',
 'chat_history': [HumanMessage(content='when was the college of engineering in the University of Notre Dame established?', additional_kwargs={}, example=False),
  AIMessage(content='The College of Engineering at the University of Notre Dame was established in 1920.', additional_kwargs={}, example=False),
  HumanMessage(content='what is 2 * 7?', additional_kwargs={}, example=False),
  AIMessage(content='The result of 2 * 7 is 14.', additional_kwargs={}, example=False)],
 'output': 'The University of Notre Dame is a Catholic research university located in South Bend, Indiana, USA. It is consistently ranked among the top twenty universities in the United States and as a major global university. The undergraduate component of the university is organized into four colleges (Arts and Letters, Science, Engineering, Business) and the Architecture School. The university\'s graduate program has more than 50 master\'s, doctoral and professional degree programs offered by the five schools, with the addition of the Notre Dame Law School and a MD-PhD program offered in combination with IU medical School. Notre Dame\'s campus covers 1,250 acres in a suburban setting and it contains a number of recognizable landmarks, such as the Golden Dome, the "Word of Life" mural (commonly known as Touchdown Jesus), and the Basilica. The university counts approximately 120,000 alumni, considered among the strongest alumni networks among U.S. colleges. The university has many multi-disciplinary institutes devoted to research in varying fields, including the Medieval Institute, the Kellogg Institute for International Studies, the Kroc Institute for International Peace studies, and the Center for Social Concerns. Recent research includes work on family conflict and child development, genome mapping, the increasing trade deficit of the United States with China, studies in fluid mechanics, computational science and engineering, and marketing trends on the Internet. In 2014 the Notre Dame student body consisted of 12,179 students, with 8,448 undergraduates, 2,138 graduate and professional and 1,593 professional (Law, M.Div., Business, M.Ed.) students. The university\'s intramural sports program was named in 2004 by Sports Illustrated as the best program in the country, while in 2007 The Princeton Review named it as the top school where "Everyone Plays Intramural Sports."'}
agent("can you summarize these facts in two short sentences")
> Entering new AgentExecutor chain...
{
    "action": "Final Answer",
    "action_input": "The University of Notre Dame is a Catholic research university located in South Bend, Indiana, USA. It is consistently ranked among the top twenty universities in the United States and as a major global university."
}

> Finished chain.





{'input': 'can you summarize these facts in two short sentences',
 'chat_history': [HumanMessage(content='when was the college of engineering in the University of Notre Dame established?', additional_kwargs={}, example=False),
  AIMessage(content='The College of Engineering at the University of Notre Dame was established in 1920.', additional_kwargs={}, example=False),
  HumanMessage(content='what is 2 * 7?', additional_kwargs={}, example=False),
  AIMessage(content='The result of 2 * 7 is 14.', additional_kwargs={}, example=False),
  HumanMessage(content='can you tell me some facts about the University of Notre Dame?', additional_kwargs={}, example=False),
  AIMessage(content='The University of Notre Dame is a Catholic research university located in South Bend, Indiana, USA. It is consistently ranked among the top twenty universities in the United States and as a major global university. The undergraduate component of the university is organized into four colleges (Arts and Letters, Science, Engineering, Business) and the Architecture School. The university\'s graduate program has more than 50 master\'s, doctoral and professional degree programs offered by the five schools, with the addition of the Notre Dame Law School and a MD-PhD program offered in combination with IU medical School. Notre Dame\'s campus covers 1,250 acres in a suburban setting and it contains a number of recognizable landmarks, such as the Golden Dome, the "Word of Life" mural (commonly known as Touchdown Jesus), and the Basilica. The university counts approximately 120,000 alumni, considered among the strongest alumni networks among U.S. colleges. The university has many multi-disciplinary institutes devoted to research in varying fields, including the Medieval Institute, the Kellogg Institute for International Studies, the Kroc Institute for International Peace studies, and the Center for Social Concerns. Recent research includes work on family conflict and child development, genome mapping, the increasing trade deficit of the United States with China, studies in fluid mechanics, computational science and engineering, and marketing trends on the Internet. In 2014 the Notre Dame student body consisted of 12,179 students, with 8,448 undergraduates, 2,138 graduate and professional and 1,593 professional (Law, M.Div., Business, M.Ed.) students. The university\'s intramural sports program was named in 2004 by Sports Illustrated as the best program in the country, while in 2007 The Princeton Review named it as the top school where "Everyone Plays Intramural Sports."', additional_kwargs={}, example=False)],
 'output': 'The University of Notre Dame is a Catholic research university located in South Bend, Indiana, USA. It is consistently ranked among the top twenty universities in the United States and as a major global university.'}

Looks great! We're also able to ask questions that refer to previous interactions in the conversation and the agent is able to refer to the conversation history to as a source of information.

That's all for this example of building a retrieval augmented conversational agent with OpenAI and Pinecone (the OP stack) and LangChain.

Once finished, we delete the Pinecone index to save resources:

pinecone.delete_index(index_name)