Evaluation is Key to Retrieval Augmented Generation

Retrieval Augmented Generation (RAG) is a technique for improving the accuracy and reliability of responses from a large language model (LLM), and building a proof of concept (POC) with RAG is fast and simple. Our introductory article and additional research have already made the case for RAG.

The key to success, though, is building a reliable and sustainable RAG-based solution—and evaluation is the way.

Evaluation ensures your architectural choices are made correctly, are data-driven, and are created with set parameters for optimal RAG performance. You can expect to find long-term benefits as well, since you’ll have built a model with fitting prompts and accurate contexts, and confidence in future releases.

First things first: Let’s define our challenge

RAG solutions are a combination of retrieval and generation systems. The retrieval system matches queries to indexed data and extracts relevant information. Those results augment the question handed off to the generation system, which takes it as context and generates a response.

Popular RAG implementations use an embedding model like ada or davinci, vector databases like Pinecone, Chroma, FAISS, and Milvus, and similarity scores for retrieval. Retrieval can also be done with lexical models or a combination of the two, called hybrid retrieval. For generation, RAG implementations use LLMs like GPT-3.5, GPT-4, and Claude.

With so many variables, we must make some choices up front, namely a number of architectural decisions and parameters for both retrieval and generation.

Retrieval

Input data preprocessing: Anything other than plain text needs to be parsed and preprocessed before being indexed. For example, parsing a PDF requires heuristics on word and paragraph splitting, which many parsing libraries or engines can do, and decisions on how to handle tables and images. This step is the most critical and yet, often underexplored. Any change here can influence results and the remaining steps.
Chunking: Documents are usually too large to easily fit into the context of a query, since there might be multiple search results needed to fully answer it. So, we split them into chunks. First, we need to decide the right chunking strategy: chunk size, how to split them, and how to handle overlap.
Architecture: To make the right choice between semantic (embedding), lexical, or hybrid retrieval, we have to weigh considerations such as which embedding model (for example OpenAI's ada or davinci, or Llama) gives the best scores. Other factors include the minimum similarity score a chunk needs to count as relevant and “top k” (the maximum number of chunks returned); without them, our choices for retrieval are incomplete.
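To make the chunking and "top k" parameters concrete, here is a minimal sketch. It assumes character-based chunks (real systems often split on tokens, sentences, or paragraphs), and the function names chunk_text and select_top_k are our own illustrations, not part of any library.

```python
from typing import List, Tuple


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into character windows of chunk_size that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


def select_top_k(
    scored_chunks: List[Tuple[str, float]], k: int = 3, min_score: float = 0.0
) -> List[Tuple[str, float]]:
    """Keep chunks scoring at least min_score, best first, and return at most k."""
    kept = [pair for pair in scored_chunks if pair[1] >= min_score]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]
```

Chunk size, overlap, k, and the minimum score are exactly the kind of parameters an evaluation run should vary, rather than values to fix by intuition.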

Generation

Prompt choice: What prompt best combines system preferences, extracted relevant document chunks, and user query?
LLM choice: Which LLM generates the best responses? GPT-3.5, GPT-4, Claude, or Llama?
Temperature: What is the optimal temperature for best response generation? Adjust temperature to control the diversity and style of generated text.
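As a rough illustration of the prompt choice, the sketch below combines the three ingredients into one prompt string. The layout and the build_prompt name are assumptions made for this example; the best wording is precisely what evaluation should decide.

```python
from typing import List


def build_prompt(system_preferences: str, chunks: List[str], question: str) -> str:
    """Combine system preferences, retrieved chunks, and the user query."""
    # Number the chunks so the model (and a human reviewer) can refer to them.
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        f"{system_preferences}\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )
```

Each variation of this template (chunk ordering, instructions, formatting) is a separate experiment, and the best variant may differ per LLM.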

Every additional choice and parameter leads to a variety of experiments. For example, an optimal prompt for GPT-3.5 will not be the same as an optimal prompt for GPT-4.

Evaluating LLM systems: Define metrics

Users expect quality, meaningful responses when they make a query, and we need metrics to ensure we’re delivering. Such metrics may be accuracy, fluency, coherence, relevance, bias, toxicity, timing, and token counts.

Since RAG is a two-part system, however, it’s important to measure both generation and retrieval quality. If retrieval fails, no LLM can answer the question correctly.

Retrieval Metrics

Precision@k	Answers the question: how many of the top-k document chunks were relevant?
Recall@k	Answers the question: how many of the relevant document chunks were in the top-k results?
F1@k	Harmonic mean of Precision@k and Recall@k, a general quality metric. There is a tradeoff between precision and recall: often when one increases, the other decreases. F1 is a single score balancing both.
Mean Reciprocal Rank at k (MRR@k)	Answers the question: how close to the top is the first truly relevant result? Useful if a single relevant result is expected, but not very useful when relevant information spans several chunks.
Mean Average Precision at k (MAP@k)	Answers the question: do relevant results appear first? Useful if the generation model is sensitive to context ordering, which current LLMs are.
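For a single query, the metrics above can be sketched as follows. This assumes each chunk has a unique ID, that retrieved is ordered best first without duplicates, and that average precision is normalized by min(number of relevant chunks, k), which is one of several conventions in use. MRR@k and MAP@k are then the means of the per-query values over the whole evaluation set.

```python
from typing import Sequence, Set


def precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant chunks that appear in the top-k results."""
    if k <= 0 or not relevant:
        return 0.0
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / len(relevant)


def f1_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Harmonic mean of Precision@k and Recall@k."""
    p = precision_at_k(retrieved, relevant, k)
    r = recall_at_k(retrieved, relevant, k)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def reciprocal_rank_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """1 / rank of the first relevant chunk in the top k, or 0 if there is none."""
    if k <= 0:
        return 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Average of Precision@i over the ranks i <= k that hold a relevant chunk."""
    if k <= 0 or not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(retrieved[:k], start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Assumed normalization: the best achievable number of hits within k.
    return precision_sum / min(len(relevant), k)


def mean_over_queries(per_query_values: Sequence[float]) -> float:
    """Mean of a per-query metric, e.g. reciprocal ranks -> MRR@k."""
    return sum(per_query_values) / len(per_query_values) if per_query_values else 0.0
```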


Generation Metrics

Accuracy	Responses must be factually correct.
Fluency	Responses must be well structured and grammatical so they are easy to read and digest.
Coherence	The answer must fit into the conversation flow. This is applicable in chat settings where previous user-system interactions influence results.
Relevance	The response must align with the user query.
Bias	Results must be accurate for all demographic groups and avoid social stereotypes.
Toxicity	Responses cannot be rude, disrespectful, or unreasonable.

Operational Metrics

Operational metrics are a separate group worth tracking to understand expected costs and latency. These can include:

Token count per question	Statistics (mean, min, max, median) helpful for estimating costs, especially after adding complex LLM reasoning capabilities.
Response time per question	Statistics (mean, min, max, median, P10, P90) which ensure the system is still responsive enough for the user, especially after adding complex multi-step LLM reasoning capabilities.
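The statistics listed above can be computed with Python's standard library alone. This is a minimal sketch: the summarize name is ours, and the "inclusive" quantile method is one reasonable choice among several ways to define P10 and P90.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Mean, min, max, median, P10 and P90 of per-question measurements."""
    if len(values) < 2:
        raise ValueError("need at least two measurements")
    # quantiles(n=10) returns the nine cut points P10, P20, ..., P90.
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return {
        "mean": statistics.fmean(values),
        "min": min(values),
        "max": max(values),
        "median": statistics.median(values),
        "p10": deciles[0],
        "p90": deciles[8],
    }
```

The same function works for token counts and for response times, so both can be logged per experiment in the same shape.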

Gather data

To evaluate RAG, we need to consider real use cases and determine data that measures both retrieval and generation.

Start by gathering documents for the use case—context documents to which the user has access and other relevant documents needed to provide a complete answer based on the query.

Here’s an example. A legal firm knows the answer to its query is within its file of PDF contracts signed by clients.

An example data point might be:

   – Question: “Who signed the contract with company A?”
   – Answer: “The contract with company A was signed by CEO of company A John Doe and our CEO Jane Doe.”
   – Context documents: PDF collection of contracts (indices of documents in a data store)
   – Relevant documents: signed contract with company A (index of document in a data store)
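One way to store such a data point is a small record type. The sketch below is only an illustration: the EvalExample class and the document IDs are hypothetical, and it assumes documents are referenced by string indices in a data store.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class EvalExample:
    """One evaluation data point for both retrieval and generation."""

    question: str
    answer: str                          # human-written reference answer
    context_doc_ids: Tuple[str, ...]     # documents the user may search
    relevant_doc_ids: Tuple[str, ...]    # documents needed for the answer

    def __post_init__(self) -> None:
        # Relevant documents must be a subset of the searchable context.
        missing = set(self.relevant_doc_ids) - set(self.context_doc_ids)
        if missing:
            raise ValueError(f"relevant documents not in context: {sorted(missing)}")


# Hypothetical IDs standing in for indices in a data store.
example = EvalExample(
    question="Who signed the contract with company A?",
    answer=(
        "The contract with company A was signed by CEO of company A "
        "John Doe and our CEO Jane Doe."
    ),
    context_doc_ids=("contract-a", "contract-b", "contract-c"),
    relevant_doc_ids=("contract-a",),
)
```

The relevant document IDs feed the retrieval metrics, while the reference answer feeds the generation metrics.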

Important note: data and documents should be private (especially for factual queries), so that the LLM cannot answer questions from its own implicit knowledge instead of taking information from the retrieved context. Otherwise, generation metrics can look overly positive even if retrieval fails.

Now it’s time to evaluate

While computing retrieval metrics is straightforward, computing generation metrics is not. Assessing text for accuracy, fluency, coherence, relevance, bias, and toxicity requires a deep understanding of semantics, the context around the user query, and factual knowledge.

Language is ambiguous, and there can be multiple equally accurate responses to the same question, so there are no formulas or algorithms to compute those metrics automatically.

Metrics commonly used to evaluate dialogue systems, such as BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation), do not correlate strongly with human judgment. The most reliable way to evaluate is to get a team of annotators, preferably system users, and let them score system outputs. This is expensive and time-consuming; can we approximate human judgment?

One way to do so is LLM-as-a-judge: an LLM is asked to compare LLM-generated responses to human annotations (the gold standard) and score them on the generation metrics. Unlike traditional dialogue system metrics, these scores mostly agree with humans and are a reliable approximation of how people judge text accuracy, bias, and other criteria.

The approach is not perfect, as LLM judges have biases of their own, notably self-enhancement bias, which leads them to score their own generations higher, but in practice the effect is negligible. Time and operational cost savings from using LLM-as-a-judge heavily outweigh these disadvantages.
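A judge can be sketched as a prompt template plus a parser for the score. Everything here is an assumption made for illustration: the template wording, the 1 to 5 scale, and the ask_llm callable, which stands in for whatever client you use to call your judge model.

```python
import re
from typing import Callable, Optional

JUDGE_TEMPLATE = """You are grading a question-answering system.
Question: {question}
Reference answer: {reference}
System answer: {candidate}
Rate the system answer for {criterion} on a scale from 1 (worst) to 5 (best).
Reply with the line "Score: <number>" followed by a short justification."""


def judge_answer(
    ask_llm: Callable[[str], str],
    question: str,
    reference: str,
    candidate: str,
    criterion: str = "accuracy",
) -> Optional[int]:
    """Ask the judge model for a 1-5 score; return None if the reply is unparseable."""
    prompt = JUDGE_TEMPLATE.format(
        question=question,
        reference=reference,
        candidate=candidate,
        criterion=criterion,
    )
    reply = ask_llm(prompt)
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the parser."""
    return "Score: 4\nThe answer names both signatories but omits their roles."
```

Using a different model as judge than the one that generated the answers is one way to limit the self-scoring bias mentioned above, and unparseable replies (None) should be counted rather than silently dropped.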

Now, we know how to compute the metrics during evaluation, so we move on to running standard scientific experimentation. Ideally, you’ll do this with each change to your RAG.

1. Define the experiments (different combinations of RAG parameters)

2. Run the evaluation dataset through the experimental version of the system

3. Compute the evaluation metrics

4. Log the results in an experiment tracking system

Finally, choose the system variant that scores best according to your business needs.
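The experimentation loop above can be sketched as a grid search. This is a simplified illustration: run_experiments and best_run are hypothetical helpers, an in-memory list stands in for an experiment tracking system, and toy_evaluate returns made-up placeholder scores where a real pipeline would run the evaluation dataset and compute the metrics.

```python
import itertools
from typing import Any, Callable, Dict, List, Sequence

Config = Dict[str, Any]
Metrics = Dict[str, float]


def run_experiments(
    param_grid: Dict[str, Sequence[Any]],
    evaluate: Callable[[Config], Metrics],
) -> List[Dict[str, Any]]:
    """Evaluate every combination of parameters and record config plus metrics."""
    names = list(param_grid)
    results = []
    for values in itertools.product(*(param_grid[name] for name in names)):
        config = dict(zip(names, values))
        # A real setup would log this record to an experiment tracker.
        results.append({"config": config, "metrics": evaluate(config)})
    return results


def best_run(results: List[Dict[str, Any]], metric: str) -> Dict[str, Any]:
    """Pick the run with the highest value of the chosen metric."""
    return max(results, key=lambda run: run["metrics"][metric])


def toy_evaluate(config: Config) -> Metrics:
    """Placeholder scoring, NOT a real measurement: peaks at chunk_size=400, top_k=5."""
    penalty = abs(config["chunk_size"] - 400) / 100 + abs(config["top_k"] - 5)
    return {"f1_at_k": 1.0 / (1.0 + penalty)}


grid = {"chunk_size": [200, 400], "top_k": [3, 5]}
runs = run_experiments(grid, toy_evaluate)
winner = best_run(runs, "f1_at_k")
```

The metric passed to best_run is where business needs enter: a latency-sensitive product might rank runs by response time subject to a minimum quality score instead.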

Conclusion

Evaluation ensures your architectural choices are made correctly, are data-driven, and are created with set parameters for optimal RAG performance. New RAG architectures, ideas, and methods are constantly designed, and with evaluation you can rapidly test if they work for your application.

Evaluation is the way to ensure you’re getting a reliable, sustainable, and less costly RAG system. With the proliferation of new LLMs and architectural choices, having an evaluation pipeline established early helps make the best data-driven decisions long term.
