Practical Considerations in RAG Application Design

Kelvin Lu
Published in Towards AI
14 min read · Oct 16, 2023



This is the second part of my RAG analysis.

The RAG (Retrieval Augmented Generation) architecture has proven efficient at overcoming the LLM input length limit and the knowledge cut-off problem. In today's LLM technical stack, RAG is one of the cornerstones for grounding applications in local knowledge, mitigating hallucinations, and making LLM applications auditable. There are plenty of examples of how to build a RAG application, and there are various types of vector databases as well.

Many people have noticed that, although RAG applications are easy to demo, they are difficult to put into production. In this article, let's look into some practical details of RAG application development.

The Perfect LLM Baseline

Assume we have a generative LLM with an unlimited input length: the length of the input string has no bearing on its accuracy, and in every other respect it behaves exactly like all other popular LLMs. Let's call this model the perfect LLM. We consider it perfect not because it performs exceptionally well, but because it has the desired unlimited input length, which is impossible today. Unlimited input length really is an attractive feature to have. In fact, a few ambitious projects are already working on incredibly long-input LLMs; one of them is researching the possibility of an LLM with an input length of 1 million tokens. However, even a 1-million-token limit may still not be enough in practice, because it is only equivalent to roughly 4–5 MB of text, which is still smaller than many document collections in real businesses.

Now the question is: if you had such a perfect LLM, would you still consider the RAG architecture? A perfect LLM with unlimited input length reduces the need for a complicated RAG pipeline, but the answer is probably still yes. RAG not only overcomes the LLM input length limit; it also reduces the cost of LLM invocation and improves processing speed. Generative LLMs have to process the content in sequence: the longer the input, the slower the generation.

When we design a RAG application, we can use this assumed perfect LLM as the baseline to scrutinise the design, so we get a clear view of the pros and cons of RAG and can find ways to improve our RAG applications.

RAG Performance Expectation

With the baseline model, the input of the generative AI application is fed directly into the LLM in a single shot, so the LLM has the opportunity to digest all the information in the input text. The accuracy of the final result depends only on how well the generative LLM performs.

The Perfect Model

For a vanilla RAG application, two more components impact the final performance: the semantic search and the RAG implementation itself. The RAG architecture uses an embedding model to generate vectors for both the ground-truth knowledge and the query, then uses a vector similarity function to retrieve the most relevant content. The embedding model's ability to extract meaning from text is therefore critical. Beyond the embedding model, there are plenty of implementation details in RAG development that also heavily impact the final outcome. That is, the accuracy of the RAG output is roughly the product of the generative LLM's accuracy, the semantic search's accuracy, and the RAG information preservation rate. I'll explain the concept of the RAG information preservation rate later.

The RAG Performance Chain

Because all three factors are less than 100%, the expected accuracy of a RAG application is lower than that of an application built on the same but perfect LLM. If the RAG pipeline is not designed properly, its performance drops significantly. That's the first concept to bear in mind when we start thinking about RAG application design; otherwise, the shortfall in performance will catch us by surprise.
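As a back-of-the-envelope illustration, here is the compounding effect with made-up numbers (they are not measurements from any real system):

# Illustrative numbers only -- not measured from any real system.
llm_accuracy = 0.90          # how well the generative LLM answers when given perfect context
retrieval_accuracy = 0.85    # how often the semantic search surfaces the right chunks
preservation_rate = 0.75     # how much of the needed information survives chunking

expected_rag_accuracy = llm_accuracy * retrieval_accuracy * preservation_rate
print(round(expected_rag_accuracy, 2))  # 0.57 -- well below the 0.9 of the perfect-LLM baseline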

Since the final performance is driven by these three factors, RAG application design must be centred on all three components to achieve a satisfying result.

Information Preservation

It's easy to accept that neither the LLM nor the semantic search can achieve 100% accuracy. Let me explain what the RAG information preservation rate is.

The text corpus we feed into the application can have very rich information. Let’s have a look at what is in the corpus and how the information is fed into the LLM:

The chart depicts the entity relationships in the text corpus. Entities are spread over the whole corpus, and the references between them run everywhere. After chunking, the entities are confined to their own silos, and the relationships that cross chunk boundaries are cut off. In the retrieval phase, only the top-k chunks have the opportunity to be sent to the LLM, which means only a portion of the entities and relationships can be forwarded. The LLM will struggle if it needs extensive relationship knowledge to respond to the query.

In addition to entity relationships, the chunking operation also affects several other types of information in the input:

1. Contextual information:

In most cases, text has multiple layers of contextual information. For instance, the book “The Elements of Statistical Learning” has 18 chapters, each focusing on a single topic, with subtopics and second-level subtopics inside each chapter. People are used to comprehending text within its context, and the chunking strategy disconnects the content from that context.

2. Positional information:

Texts carry different weights depending on their position in the document. Text at the beginning and end of a document is more important than text in the middle, and text at the beginning or end of a chapter is more important than text in the middle of a chapter.

3. Sequential information:

Natural text frequently connects topics with explicit and implicit linguistic links. For example, a story may start with “in the beginning”, continue with “then”, “therefore”, and “after that”, and end with “eventually” or “finally”. With the chunking strategy, this kind of connection is no longer complete: not only are pieces of the puzzle missing, but the sequencing order also gets shuffled.

4. Descriptive information:

This refers to information describing a single subject. With chunking, descriptive information is not guaranteed to stay together. Imagine you are in the middle of a phone call and the line suddenly cuts off. Depending on how important the call is and when the cut-off happens, the impact ranges from trivial to very frustrating.

Strengths and Weaknesses of RAG

If we call RAGs that only use chunking and vector similarity search "vanilla RAGs", we can see that they handle only a few types of queries well, because they lose some of the input information discussed above:
1. Good at narrow-scoped descriptive question answering. For example, which subject possesses certain features?

2. Not good at relationship reasoning, i.e., finding a path from entity A to entity B or identifying cliques of entities.

3. Not good at long-range summarization. For instance, "List all of Harry Potter's fights" or "How many fights does Harry Potter have?".

RAG applications perform poorly on these kinds of tasks because only a few chunks can be fed into the LLM, and those chunks are scattered across the corpus. It would be impossible for the LLM to collect all the information it needs to get started.

RAG applications are mostly topped with a generative LLM, which gives users the impression that the RAG application must have high-level reasoning ability similar to the perfect LLM application. However, because the LLM receives inadequate input compared to the perfect model, RAG applications don't have the same level of reasoning power. Awareness of this input limitation helps us understand what RAG can and cannot do: we should seek the most suitable use cases for RAG and avoid forcing it into the wrong place.

Towards a Better RAG Application

Having discussed the limitations of the RAG application, let’s see how we can improve its performance.

Treat Your LLM

Very often, when dealing with the input query, we simply accept whatever the user sends in. This is not ideal, not only because of security risks like prompt leaking and prompt injection, but also because the performance may be disappointing.

According to researchers, LLMs are sensitive to typos and wording differences in the prompt [1]. To make sure the LLM runs at its peak performance, consider correcting all typos and rephrasing the input into a form that is easier for the LLM to follow.
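A minimal sketch of that pre-processing step; the llm() helper here is a hypothetical stand-in for whatever chat model you call (OpenAI, Bard, a local model):

def llm(prompt: str) -> str:
    # Hypothetical wrapper around your chat model of choice; plug in your own client here.
    raise NotImplementedError

REWRITE_PROMPT = """Rewrite the user query below for a retrieval system.
Fix typos, expand abbreviations, and phrase it as a clear, self-contained question.
Return only the rewritten query.

User query: {query}"""

def clean_query(raw_query: str) -> str:
    # Let the LLM correct typos and rephrase before the query reaches the embedding model.
    return llm(REWRITE_PROMPT.format(query=raw_query)).strip()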

Keeping the Embedding Model on the Same Page

In most cases, the user sends a short query, like 'Tell me more about Tony Abbott'. The query is then converted into an embedding vector, which captures the essence of that specific query. Doing a semantic search with the raw query can be challenging because:

  1. User queries are short and phrased as questions, so they contain limited semantic features. Documents, in contrast, are long and written as statements of various forms, so document embeddings carry far richer information in their vectors.
  2. Because of the limited semantic features in the user query, the semantic search function tends to over-interpret trivial details in the query. The document embeddings, meanwhile, may carry a high level of noise, and chunking makes this worse because many relationships, contexts, and sequential linkages are lost.
  3. Embedding models and generative LLMs belong to different model families. They are trained differently and behave differently. Embedding models don't have the same level of reasoning ability as generative LLMs, and they don't respect linguistic detail to the same degree either. Querying with the raw user input can, in the worst case, degrade the semantic search into a keyword search.
  4. Because the embedding model and the generative LLM are two different models serving different roles in the pipeline, they are not on the same page. Each does its part according to its own understanding of what is required, but they don't talk to each other, so the retrieved information may not be what the LLM needs to produce the best result. The two models have no way to align with each other.

To avoid this problem, you probably want to use a generative LLM to augment user queries first. Consider the following example:

Original user query:
Tell me about Tony Abbott.

And the augmented queries that were rephrased based on the original query using Bard:
- What is Tony Abbott’s political background?

- What are Tony Abbott’s most notable achievements?

- What are Tony Abbott’s political views?

- What are Tony Abbott’s personal interests?

- What are some of the controversies that Tony Abbott has been involved in?

Can you see the improved information richness? The augmented queries provide more features and thus produce a better retrieval outcome. Moreover, by sending the augmented queries, the LLM gets an opportunity to tell the embedding model what it needs, and the embedding model can do a better job of providing high-quality chunks to the LLM. That's how the two models can work together.
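In code, the idea looks roughly like the sketch below. The llm(), embed(), and vector_search() helpers are hypothetical stand-ins for your chat model, embedding model, and vector store; only the control flow is the point here.

from typing import List

def llm(prompt: str) -> str:
    return ""  # hypothetical chat-model call; replace with your model of choice

def embed(text: str) -> List[float]:
    return []  # hypothetical embedding-model call

def vector_search(vector: List[float], top_k: int) -> List[str]:
    return []  # hypothetical vector-store lookup

AUGMENT_PROMPT = """Generate five short, diverse search queries that would help answer
the user question below. Return one query per line.

User question: {query}"""

def retrieve_with_augmentation(user_query: str, top_k: int = 3) -> List[str]:
    # 1. Ask the generative LLM to expand the terse user query into richer sub-queries.
    sub_queries = [line.strip("-• ").strip()
                   for line in llm(AUGMENT_PROMPT.format(query=user_query)).splitlines()
                   if line.strip()]
    # 2. Run a semantic search for the original query and every sub-query,
    #    then merge the retrieved chunks without duplicates.
    chunks: List[str] = []
    for query in [user_query, *sub_queries]:
        for chunk in vector_search(embed(query), top_k):
            if chunk not in chunks:
                chunks.append(chunk)
    return chunks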

Chunking Strategy Matters

Chunk size is one of the few hyper-parameters we can tune in a RAG application. To achieve a better result, it's generally recommended to use smaller chunk sizes. One such analysis comes from Microsoft [2]:

Chunk Size vs. Performance. From [2]

When splitting the text, we can also choose among different splitting strategies. The simplest is to cut at word boundaries; we can also cut at sentence or paragraph boundaries, and, to achieve an even better outcome, we can overlap adjacent chunks. The comparison of chunking strategies from the Microsoft analysis [2]:

Impact of Different Splitting Strategy. From [2]

Embedding models have limited semantic extraction power, and they are less effective at representing multi-topic, multi-turn corpora than simple ones. That is why RAG prefers shorter chunks. But what chunk size is best? In the Microsoft analysis, the smallest chunk size tested was 512 tokens, while some commercial RAG applications use chunks of only 100 tokens. Does the smallest chunk size always achieve the best outcome?

As discussed earlier, the chunking strategy breaks the text corpus into little pieces, which results in information loss, and the smaller the chunk size, the more information is lost. So there is an optimal chunk size: chunks that are too small are not ideal either. Finding the optimal chunk size is like hyper-parameter tuning; you must experiment with your data.

Accuracy vs. Chunk Size. By Author
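To make the splitting choices concrete, here is a minimal sentence-based splitter with overlap. It is a sketch only: chunk size is counted in characters for simplicity, the sentence splitting is naive, and production code would count tokens and use a proper sentence tokenizer.

from typing import List

def split_into_chunks(text: str, chunk_size: int = 1000, overlap_sentences: int = 2) -> List[str]:
    # Naive sentence splitting on full stops; good enough to illustrate the strategy.
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks: List[str] = []
    current: List[str] = []
    for sentence in sentences:
        if current and sum(len(s) for s in current) + len(sentence) > chunk_size:
            chunks.append(" ".join(current))
            # Carry the last few sentences into the next chunk so adjacent chunks share context.
            current = current[-overlap_sentences:]
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks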

Reducing Information Loss

The Microsoft analysis found that chunking with a significant amount of overlap improves accuracy. Why does it help, and can we find a better way to enhance RAG performance?

The rationale for overlap is that it links adjacent chunks together and provides better contextual information for each chunk. However, even the very aggressive 25% overlap only improved accuracy by 1.5 percentage points, from 42.4% to 43.9%. That means this is not the most efficient way to optimise RAG performance, and we can't keep improving it by overlapping more. Remember, too, that overlapped chunking is not even an option for very small chunks.

As the overlapped chunking strategy shows, preserving information helps the LLM produce better responses. So how can we preserve as much of the input information as possible?

The overlap strategy simply hopes that the last few sentences of the previous chunk provide extra context, but those sentences may not be very representative. We could instead use an LLM-generated summary of the previous chunk in place of the raw overlap.

And remember that the input text has multiple layers of contextual information. If that is the case for your application, you may want to prepend that layered context to each chunk as well, or, more efficiently, store it as metadata. Both ideas are sketched below.
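In the sketch, a hypothetical summarize() call stands in for a cheap LLM summarisation request, and the metadata dictionary stands in for whatever your vector store accepts:

from dataclasses import dataclass, field
from typing import Dict, List

def summarize(text: str) -> str:
    # Hypothetical LLM summarisation call; simple truncation stands in for a real summary here.
    return text[:100]

@dataclass
class EnrichedChunk:
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)

def enrich_chunks(chunks: List[str], chapter: str, section: str) -> List[EnrichedChunk]:
    enriched: List[EnrichedChunk] = []
    previous_summary = ""
    for chunk in chunks:
        # Prepend a summary of the previous chunk instead of raw sentence overlap,
        # and keep the layered context (chapter, section) as metadata.
        body = f"Previous context: {previous_summary}\n\n{chunk}" if previous_summary else chunk
        enriched.append(EnrichedChunk(text=body, metadata={"chapter": chapter, "section": section}))
        previous_summary = summarize(chunk)
    return enriched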

RAG with a knowledge graph is trending now. With the help of a knowledge graph, RAG can store the relationships in a graph database, so the connections between chunks are fully preserved. That is a solution well worth considering if relationship reasoning is critical to your project.
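As a toy illustration of why this matters, the sketch below stores a few made-up triplets with networkx and answers a relationship question that vanilla chunk retrieval struggles with. A real deployment would use a proper graph database, and the triplets would come from an extraction pipeline.

import networkx as nx

# Made-up entity-relationship triplets; in practice these come from triplet extraction.
triplets = [
    ("Harry Potter", "studies_at", "Hogwarts"),
    ("Hermione Granger", "studies_at", "Hogwarts"),
    ("Hogwarts", "located_in", "Scotland"),
]

graph = nx.DiGraph()
for subject, relation, obj in triplets:
    graph.add_edge(subject, obj, relation=relation)

# Relationship reasoning: find a path connecting two entities.
path = nx.shortest_path(graph.to_undirected(), "Harry Potter", "Hermione Granger")
print(path)  # ['Harry Potter', 'Hogwarts', 'Hermione Granger']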

However, RAG with a knowledge graph is not challenge-free. Establishing a knowledge graph from unstructured text is non-trivial. There are quite a few experiments on extracting entity-relationship triplets from textual input, but it's a different story when you need to productionise the solution: automatically extracted entities and relationships may contain a lot of noise and omit too much real information, so you have to inspect the quality of the output very carefully. Even after the knowledge graph is populated, the queries it supports are tightly coupled to the graph database design.

Not as fancy as RAG with a knowledge graph, the vector-search-enabled relational database is nevertheless an important tool in the toolbox. A database like pgvector allows you to store sophisticated information as columns while preserving semantic search, and it is much easier to integrate with other enterprise systems and more flexible than a knowledge graph.
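A rough sketch of what that looks like with pgvector, assuming PostgreSQL with the extension installed and the psycopg 3 driver; the table layout, connection string, and embed() helper are assumptions for illustration only:

import psycopg  # psycopg 3

def embed(text: str) -> list[float]:
    return [0.0] * 1536  # hypothetical embedding-model call; returns a dummy vector here

conn = psycopg.connect("dbname=rag_demo", autocommit=True)  # connection details are made up
conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.execute("""
    CREATE TABLE IF NOT EXISTS chunks (
        id        bigserial PRIMARY KEY,
        project   text,             -- ordinary relational columns sit next to the vector
        content   text,
        embedding vector(1536)
    )
""")

query_vector = embed("Which project is Adam Smith working on?")
rows = conn.execute(
    # <=> is pgvector's cosine-distance operator; relational filters work as usual.
    "SELECT content FROM chunks WHERE project IS NOT NULL "
    "ORDER BY embedding <=> %s::vector LIMIT 5",
    ("[" + ",".join(str(x) for x in query_vector) + "]",),
).fetchall()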

These are all valid options to consider. The only heads-up is that many vector-enabled graph databases, search engines, and relational databases are not as optimised as dedicated vector databases. Their speed when handling massively scaled vector indexes may not be ideal, especially when they have to update the index frequently. Please check out [3] for an introduction to the different types of vector stores.

LLM Empathy

Sometimes, we find the RAG doesn’t answer our questions very well. Instead of turning all the knobs to make it perform better, we should probably consider the following questions:

  • Does the LLM have all the information it needs?
  • Is the information organised in an LLM-friendly way?

Let's consider the following example:

We are building a RAG application on a SharePoint website. One of the webpages lists all the projects and their team members, including each person's profile. We need the RAG to answer project vs. team member questions accurately; however, the initial result was very disappointing.

The initial investigation showed that the SharePoint website does not organise the content in a structured way, so the affiliation between pieces of information cannot be easily understood. After removing all the HTML tags, the webpage content looks like the following:

project A
Client Contact: Steve
Team Members:
person A
person B
email of person A
email of person B
role of person A
role of person B
description of person A
description of person B

project B
...

If humans find it confusing to figure out who is who, so does the RAG. To organise the information better, we used Python code to aggregate the information based on the HTML attributes, wrote every project and its team members' names into a single text file, and put every person's information into its own file (a sketch of the aggregation script follows the example output):

file project_A.txt:

project name: project_A
Client Contact: Steve
Team Members:
Adam Smith
Jobs Musk

file person_A.txt:
name: Adam Smith
email: adam.smith@xxx.com
role: engineer
description: Hobbies/passion: rock climbing

...
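The aggregation itself was a small script conceptually similar to the sketch below; the file path and CSS class names are hypothetical stand-ins for the real SharePoint markup, which is messier:

from pathlib import Path
from bs4 import BeautifulSoup

html = Path("projects_page.html").read_text()   # exported SharePoint page (path is made up)
soup = BeautifulSoup(html, "html.parser")

# The selectors below are hypothetical stand-ins for whatever attributes mark
# a project card and a person card in the real markup.
for project in soup.select("div.project-card"):
    name = project.select_one(".project-name").get_text(strip=True)
    members = [m.get_text(strip=True) for m in project.select(".member-name")]
    Path(f"{name}.txt").write_text(
        f"project name: {name}\nTeam Members:\n" + "\n".join(members)
    )

for person in soup.select("div.person-card"):
    fields = {
        label: person.select_one(f".{label}").get_text(strip=True)
        for label in ("name", "email", "role", "description")
    }
    Path(f"{fields['name']}.txt").write_text(
        "\n".join(f"{key}: {value}" for key, value in fields.items())
    )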

The generated text files are tiny, which seems to contradict common RAG chunking practice. The reason is that the consolidated files avoid the splitting problem and remove the noise completely. With the newly generated files, the RAG has no problem answering questions like "Who is working on project X?" and "What is Adam Smith's hobby?".

However, the RAG struggled when we flipped the question around: "Which project is Adam Smith working on?" Adam Smith is listed among the project's team members, and we are not entirely sure why the embedding model couldn't pick that up. To help the LLM get the job done, we can make the information stand out. We added a line to the person's file that explicitly states the project engagement:

file person_A.txt:
name: Adam Smith
email: adam.smith@xxx.com
role: engineer
description: Hobbies/passion: rock climbing

project: project_A

And this additional line enables the RAG application to answer the above questions with 100% accuracy.

Closing Words

RAG, as an emerging technology, is evolving fast. I found it helped me a lot to investigate its building blocks piece by piece. By looking into the details, I gain a deeper insight into the pros and cons of the technology and can develop a sense of whether a new proposal will work. There are a few very popular frameworks that help people develop RAG applications faster. I found some of their implementation ideas inspiring; however, I don't recommend learning or building your RAG solely on those libraries just because they are easy to start with.

If you have followed this article this far, you will probably agree that RAG is a complicated architecture. The popular frameworks cover up all the details, which can make people think those details are unimportant. When they run into problems in their projects, they then find it difficult to find a way out because there are too many hidden implementation details.

My suggestion is to start learning RAG with a bare-bones implementation and consider additional features afterwards. That way, you will know the essentials and the impact of each moving part. It is not that difficult to start with such a minimal RAG and analyse how it works. Please check out my post, Disadvantages of RAG.
