Retrieval Augmented Generation With Llama 3, ChromaDB and Langchain

Raagulbharatwaj K
Published in Towards AI · 4 min read · May 1, 2024


When we first encountered Large Language Models (LLMs), we were genuinely wowed by their ability to churn out human-like conversations so effortlessly. In the world of generative AI, heavyweight players like Claude, GPT, and Gemini are well-known, but their vast sizes mean they also come with a hefty price tag for automating tasks. That’s where open-source LLMs like Mistral and Llama step in. Their more manageable size makes them perfect for many applications, particularly in areas like Retrieval-Augmented Generation (RAG), where the focus leans more towards the retrieval aspect than on generation. In this post, we will explore how to implement RAG using Llama 3 and LangChain.

Before we begin, let us first try to understand the prompt format of Llama 3. Llama 3 has a fairly complex prompt format compared to other models such as Mistral. Here is what a minimal prompt looks like:

<|begin_of_text|>
<|start_header_id|>
user
<|end_header_id|>
Hello it is nice to meet you!
<|eot_id|>
<|start_header_id|>
assistant
<|end_header_id|>

<|begin_of_text|> indicates the start of the prompt, each role name is wrapped between <|start_header_id|> and <|end_header_id|>, and <|eot_id|> denotes the end of each message. Currently, Llama 3 supports three roles, namely “system”, “user” and “assistant”. The above prompt contains just a single user message for the LLM that says “Hello it is nice to meet you!”. To perform RAG we need a slightly more complex prompt, as shown below:

<|begin_of_text|>
<|start_header_id|>
system
<|end_header_id|>
You are a helpful, respectful and honest assistant designed to answer
questions related to the user's document. If the user tries to ask
off-topic questions, do not engage in the conversation. If the given
context is not sufficient to answer the question, do not answer it.
<|eot_id|>
<|start_header_id|>
user
<|end_header_id|>
Answer the user question based on the context provided below
Context: {context}
Question: {question}
<|eot_id|>
<|start_header_id|>
assistant
<|end_header_id|>

The “system” message is used to give instructions to the model, and the “user” part of the prompt is usually augmented with the retrieved context alongside the user question. Alternatively, one can put everything from the “system” message into the “user” role, but it is generally not recommended to do so. Now that we have a clear idea of the prompt structure, let us build a simple RAG system using Llama 3 and LangChain.
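As a side note, you do not have to concatenate these special tokens by hand. The sketch below is optional and assumes the same Llama 3 Instruct tokenizer we load later in this post; Hugging Face's apply_chat_template fills in the special tokens from a plain list of role/content messages:

# Optional sketch: build the Llama 3 prompt from role/content messages
# instead of writing the special tokens manually.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant that answers questions about the user's document."},
    {"role": "user", "content": "Answer based on the context below.\nContext: {context}\nQuestion: {question}"},
]

# add_generation_prompt=True appends the trailing assistant header,
# so the model knows its reply should come next.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)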

Installing the dependencies.

!pip install langchain
!pip install chromadb
!pip install sentence-transformers
!pip install pypdf
!pip install -U bitsandbytes
!pip install -U git+https://github.com/huggingface/peft.git
!pip install -U git+https://github.com/huggingface/accelerate.git
!pip install -U einops
!pip install -U safetensors
!pip install -U xformers
!pip install -U ctransformers[cuda]
!pip install huggingface_hub
from transformers import AutoTokenizer,AutoModelForCausalLM,AutoConfig
from time import time
import transformers
import torch

To use gated models like Llama 3, we have to request access via Hugging Face. This helps prevent these models from being used to cause harm. Once access is granted, we can load the model simply by logging in to Hugging Face with our personal access token.

from huggingface_hub import notebook_login
notebook_login()
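If you are running this as a plain script rather than in a notebook, the regular login() helper from huggingface_hub works just as well. The environment variable name below is only a common convention, not something this post requires:

import os
from huggingface_hub import login

# Reads the access token from an environment variable instead of an interactive widget.
login(token=os.environ["HF_TOKEN"])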

Let us now load the model and tokenizer.

model_checkpoint = 'meta-llama/Meta-Llama-3-8B-Instruct'
model_config = AutoConfig.from_pretrained(model_checkpoint,
                                          trust_remote_code=True,
                                          max_new_tokens=1024)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint,
                                             trust_remote_code=True,
                                             config=model_config,
                                             device_map='auto')
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)
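The dependency list above installs bitsandbytes, so if the full-precision model does not fit on your GPU, loading the weights in 4-bit is one option. This is an optional alternative to the load above, not part of the original setup:

from transformers import BitsAndBytesConfig

# Optional: 4-bit quantized load to reduce GPU memory usage.
bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.float16)
model = AutoModelForCausalLM.from_pretrained(model_checkpoint,
                                             quantization_config=bnb_config,
                                             device_map='auto')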

Now let us set up a text generation pipeline locally and create a HuggingFacePipeline object using that.

from transformers import pipeline
from langchain.llms import HuggingFacePipeline

# Build the local text-generation pipeline and wrap it for LangChain.
# (Named text_generation_pipeline so it does not shadow the imported pipeline function.)
text_generation_pipeline = pipeline("text-generation",
                                    model=model,
                                    tokenizer=tokenizer,
                                    torch_dtype=torch.float16,
                                    max_length=3000,
                                    device_map="auto")
llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

Let us quickly check if everything is working by prompting the llm.

prompt = """<|begin_of_text|>
<|start_header_id|>
user
<|end_header_id|>
Hello it is nice to meet you!
<|eot_id|>
<|start_header_id|>
assistant
<|end_header_id|>
"""
out = llm.invoke(prompt)
print(out)

We will get raw output that looks something like the example below (here, from a different test prompt), so we have to parse it in order to extract just the assistant's reply.

"<|begin_of_text|><|start_header_id|>user<|end_header_id|>hold my beer!<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n*cracks open an imaginary beer* Ah, hold my beer, eh? What's the occasion?"

Let us write a function to do the same

def parse(string):
    return string.split("<|end_header_id|>")[-1]
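For example, applying it to the raw output keeps only the text after the final <|end_header_id|> marker:

# Prints only the assistant's reply, without the echoed prompt and special tokens.
print(parse(out))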

Now that we have the LLM set up perfectly, let us finish setting up the retriever and our database.

from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings.sentence_transformer import SentenceTransformerEmbeddings
from langchain_community.vectorstores import Chroma
loader = PyPDFLoader("<path to your PDF>")
documents = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
texts = text_splitter.split_documents(documents)
embedding_function = SentenceTransformerEmbeddings(model_name="BAAI/bge-small-en-v1.5")
vectorstore = Chroma(collection_name="sample_collection", embedding_function = embedding_function)
vectorstore.add_documents(texts)
retriever = vectorstore.as_retriever(search_kwargs={"k": 7})
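Before wiring the retriever into the LLM, it is worth a quick sanity check that relevant chunks actually come back. The query string here is just a placeholder; use something your PDF actually covers:

# Quick retriever sanity check.
docs = retriever.invoke("What is this document about?")
print(len(docs))
print(docs[0].page_content[:300])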

Currently, due to the messy prompt format Meta has used for Llama 3, it is very difficult to use LangChain Expression Language to create chains; instead, we have to connect the components together manually ourselves.

class Pipeline:
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever

    def retrieve(self, question):
        # Fetch the relevant chunks and join them into a single context string.
        docs = self.retriever.invoke(question)
        return "\n\n".join([d.page_content for d in docs])

    def augment(self, question, context):
        # Build the Llama 3 prompt with the retrieved context.
        return f"""
<|begin_of_text|>
<|start_header_id|>
system
<|end_header_id|>
You are a helpful, respectful and honest assistant designed to answer
questions related to the user's document. If the user tries to ask
off-topic questions, do not engage in the conversation. If the given
context is not sufficient to answer the question, do not answer it.
<|eot_id|>
<|start_header_id|>
user
<|end_header_id|>
Answer the user question based on the context provided below
Context: {context}
Question: {question}
<|eot_id|>
<|start_header_id|>
assistant
<|end_header_id|>"""

    def parse(self, string):
        return string.split("<|end_header_id|>")[-1]

    def generate(self, question):
        context = self.retrieve(question)
        prompt = self.augment(question, context)
        answer = self.llm.invoke(prompt)
        return self.parse(answer)

pipe = Pipeline(llm, retriever)

Now that our pipeline is ready, we can use it to ask questions about the PDF we loaded earlier.
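For a quick one-off test (the question below is just a placeholder), you can call generate directly; the interactive loop that follows simply wraps the same call:

# One-off question; replace it with something your PDF actually covers.
print(pipe.generate("What is the main topic of this document?"))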

def llama3_chat():
    print("Hello!!!! I am llama3 and I can help with your document. \nIf you want to stop you can enter STOP at any point!")
    print()
    print("-------------------------------------------------------------------------------------")
    pipe = Pipeline(llm, retriever)
    question = input()
    while question != "STOP":
        out = pipe.generate(question)
        print(out)
        print("\nIs there anything else you would like my help with?")
        print("-------------------------------------------------------------------------------------")
        question = input()

llama3_chat()

That’s it!

We have built a simple RAG pipeline using Llama 3 and LangChain. If you found this blog useful, please support me by clapping for it.
