Build a Chatbot with Retrieval Augmented Generation (RAG)

Do you want to build a chatbot that can answer questions based on knowledge you provide? One of the biggest challenges when working with Large Language Models (LLMs) is that their answers are often too generic, because the model has no knowledge about your business.

Retrieval Augmented Generation (RAG) solves this issue! The chatbot we’re going to build in this tutorial performs a vector search for every question the user asks, and uses the information retrieved from the vector database to answer that question.

Architecture

In this article I describe how you can build a simple chatbot that uses RAG. For this we will be using two Python scripts:

  • ingest_database.py: you only need to run this script once; it takes a PDF file (the knowledge base), cuts it into small chunks and ingests them into the database.
  • chatbot.py: this is the actual chatbot. You can run it as many times as you want, once you have executed ingest_database.py to fill the semantic database.

How does Retrieval Augmented Generation (RAG) work?

Let’s first take a look at the ingest_database.py script. This part of the process is called indexing. The picture below (source: LangChain website) describes the four stages of indexing:

  1. Load: we load a data source into our script. This can be a PDF file, an Excel file, or even a scraped website.
  2. Split: in the second phase, we split the document into smaller parts. It’s easy to imagine that, in order to answer a question, the chatbot only needs a few phrases from the document. LLMs are limited by their context window and are billed per token processed, so we definitely want to avoid feeding them the entire document; we only want to pass in the phrases that are relevant to the question. During this splitting (chunking) process we cut the document into parts of about 300 characters.
  3. Embed: for every chunk (this is what we call the parts of the document), a model calculates an embedding. This is a mathematical representation of the text that can be used to calculate how similar two phrases are; during retrieval we’ll use it to find phrases that are similar to the user’s question (see the small example below the picture).
  4. Store: the chunks and their embeddings are ingested into a semantic database. In this example we’ll use Chroma.

Source: https://python.langchain.com/docs/tutorials/rag/
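
To get a feeling for what an embedding is, here is a small standalone sketch (not part of the tutorial scripts) that embeds two similar phrases with the same OpenAI embeddings model we’ll use later and compares them with cosine similarity. The two phrases are just examples, and you need your OPENAI_API_KEY in a .env file (see the Preparation section).

# standalone sketch: embed two similar phrases and compare them
from dotenv import load_dotenv
from langchain_openai.embeddings import OpenAIEmbeddings

load_dotenv()  # needs OPENAI_API_KEY, see the .env file later in this article

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-large")

vec_a = embeddings_model.embed_query("How do I reset my password?")
vec_b = embeddings_model.embed_query("I forgot my password, what should I do?")

# cosine similarity: dot product divided by the product of the vector lengths
dot = sum(a * b for a, b in zip(vec_a, vec_b))
norm_a = sum(a * a for a in vec_a) ** 0.5
norm_b = sum(b * b for b in vec_b) ** 0.5
print(dot / (norm_a * norm_b))  # close to 1.0 means very similar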

We now have a database in which the document is stored in small chunks. The chatbot will use this database to find relevant phrases for answering the users’ questions. Let’s see how that works!

  1. Retrieval: the retrieval part is quite simple. Before we pass the user’s question to the LLM, we send it to the database and use the stored embeddings to find phrases that are similar to the question.
  2. Generation: the results from the previous step (the phrases most similar to the question) are then passed to the LLM, together with the user’s question and a prompt. We ask the LLM to answer the question using only the results from the semantic database (see the sketch below the picture).

Source: https://python.langchain.com/docs/tutorials/rag/
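
Before we dive into the full scripts, here is a minimal sketch of this retrieve-then-generate flow, assuming the Chroma database created by ingest_database.py (below) already exists. The example question is just a placeholder; chatbot.py implements the same flow with streaming and a Gradio interface.

# minimal sketch of retrieve-then-generate, assuming chroma_db is already filled
from dotenv import load_dotenv
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma

load_dotenv()  # needs OPENAI_API_KEY

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-large")
vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings_model,
    persist_directory="chroma_db",
)

question = "What does the document say about opening hours?"  # example question

# 1. retrieval: find the chunks most similar to the question
docs = vector_store.similarity_search(question, k=5)
knowledge = "\n\n".join(doc.page_content for doc in docs)

# 2. generation: answer the question using only the retrieved chunks
llm = ChatOpenAI(model="gpt-4o-mini")
answer = llm.invoke(f"Answer the question using only this knowledge:\n\n{knowledge}\n\nQuestion: {question}")
print(answer.content)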

Preparation

If you want to build this script together with me, make sure that you have the following:

1. Install the necessary libraries

To install the necessary libraries, run the following command in a terminal (for example the integrated terminal in VS Code):

pip install langchain_community langchain_text_splitters langchain_openai langchain_chroma gradio python-dotenv pypdf

2. Download and save the PDF file

Click here to download the PDF file and save it in the data directory.

3. Create the following files:

.env

OPENAI_API_KEY = "[YOUR API KEY HERE]"

ingest_database.py

from langchain_community.document_loaders import PyPDFDirectoryLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai.embeddings import OpenAIEmbeddings
from langchain_chroma import Chroma
from uuid import uuid4

# import the .env file
from dotenv import load_dotenv
load_dotenv()

# configuration
DATA_PATH = r"data"
CHROMA_PATH = r"chroma_db"

# initiate the embeddings model
embeddings_model = OpenAIEmbeddings(model="text-embedding-3-large")

# initiate the vector store
vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings_model,
    persist_directory=CHROMA_PATH,
)

# loading the PDF document
loader = PyPDFDirectoryLoader(DATA_PATH)

raw_documents = loader.load()

# splitting the document
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,
    chunk_overlap=100,
    length_function=len,
    is_separator_regex=False,
)

# creating the chunks
chunks = text_splitter.split_documents(raw_documents)

# creating unique ID's
uuids = [str(uuid4()) for _ in range(len(chunks))]

# adding chunks to vector store
vector_store.add_documents(documents=chunks, ids=uuids)
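
If you want to check that the ingestion worked, you can optionally append a quick similarity search to the end of ingest_database.py. The query below is just an example; it should print a few chunks from your PDF.

# optional sanity check: retrieve the chunks most similar to an example query
results = vector_store.similarity_search("What is this document about?", k=3)
for doc in results:
    print(doc.page_content[:200])
    print("---")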

chatbot.py

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_chroma import Chroma
import gradio as gr

# import the .env file
from dotenv import load_dotenv
load_dotenv()

# configuration
DATA_PATH = r"data"
CHROMA_PATH = r"chroma_db"

embeddings_model = OpenAIEmbeddings(model="text-embedding-3-large")

# initiate the model
llm = ChatOpenAI(temperature=0.5, model='gpt-4o-mini')

# connect to the chromadb
vector_store = Chroma(
    collection_name="example_collection",
    embedding_function=embeddings_model,
    persist_directory=CHROMA_PATH, 
)

# Set up the vectorstore to be the retriever
num_results = 5
retriever = vector_store.as_retriever(search_kwargs={'k': num_results})

# call this function for every message added to the chatbot
def stream_response(message, history):
    #print(f"Input: {message}. History: {history}\n")

    # retrieve the relevant chunks based on the question asked
    docs = retriever.invoke(message)

    # add all the chunks to 'knowledge'
    knowledge = ""

    for doc in docs:
        knowledge += doc.page_content+"\n\n"


    # make the call to the LLM (including prompt)
    if message is not None:

        partial_message = ""

        rag_prompt = f"""
        You are an assistant that answers questions based on the knowledge provided to you.
        While answering, you don't use your internal knowledge,
        but solely the information in the "The knowledge" section.
        You don't mention anything to the user about the provided knowledge.

        The question: {message}

        Conversation history: {history}

        The knowledge: {knowledge}

        """

        #print(rag_prompt)

        # stream the response to the Gradio App
        for response in llm.stream(rag_prompt):
            partial_message += response.content
            yield partial_message

# initiate the Gradio app
chatbot = gr.ChatInterface(stream_response, textbox=gr.Textbox(placeholder="Send to the LLM...",
    container=False,
    autoscroll=True,
    scale=7),
)

# launch the Gradio app
chatbot.launch()

4. Run the scripts

Now execute the script ingest_database.py once in order to populate the database. You can run chatbot.py whenever you want to chat with the PDF file.
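
Assuming python points at the environment where you installed the libraries, that comes down to:

python ingest_database.py
python chatbot.py

After chatbot.py starts, Gradio prints a local URL which you can open in your browser to chat with the PDF file.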
