Assessing a RAG Pipeline


A RAG app has two main components: the retrieval component and the generation component. The former retrieves data from a source such as a website, text files, or a database. The generation component then combines the retrieved data with the user's query to produce a response with an LLM. Each of these components consists of smaller moving parts, which is why the RAG process is often described as a chain or a pipeline.
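To make the chain concrete, here's a minimal, illustrative sketch of the two stages wired together. The embed() and generate() helpers are hypothetical stand-ins for a real embedding model and LLM client; only the overall structure is the point here.

```python
from typing import Callable

def build_rag_pipeline(
    documents: list[str],
    embed: Callable[[str], list[float]],     # hypothetical embedding function
    generate: Callable[[str], str],          # hypothetical LLM call
) -> Callable[[str], str]:
    # Index step: embed every source document up front.
    doc_vectors = [(doc, embed(doc)) for doc in documents]

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
        return dot / norm if norm else 0.0

    def answer(query: str, k: int = 2) -> str:
        # Retrieval: rank stored documents by similarity to the query.
        query_vector = embed(query)
        ranked = sorted(doc_vectors, key=lambda dv: cosine(query_vector, dv[1]), reverse=True)
        context = "\n".join(doc for doc, _ in ranked[:k])
        # Generation: combine the retrieved context with the query in one prompt.
        return generate(f"Context:\n{context}\n\nQuestion: {query}")

    return answer
```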

Assessing the Retriever Component

Many parameters control a retriever's output. The retrieval phase begins with loading the source data. How quickly is data loaded? Is all the desired data loaded? How much irrelevant data is included in the source? For media sources, for instance, what counts as unnecessary data? Would you get the same or better results if, for example, your videos were compressed?
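Even a quick sanity check on the loading step can surface problems early. Here's a rough sketch that times the load and flags suspicious chunks; load_documents() is a hypothetical loader standing in for whatever your pipeline actually uses.

```python
import time

start = time.perf_counter()
documents = load_documents("data/")          # hypothetical loader for your source data
elapsed = time.perf_counter() - start

print(f"Loaded {len(documents)} chunks in {elapsed:.2f}s")

# Flag chunks that are likely irrelevant or broken.
empty = sum(1 for doc in documents if not doc.strip())
print(f"{empty} chunks are empty or whitespace-only")
```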

Assessing the Generator Component

The story is similar for the generator component: many parameters significantly affect its performance. One is the temperature, which controls the randomness, or creativity, of the LLM. It typically ranges from 0 to 1. At 0, the model is deterministic and sticks closely to the given context; at 1, it has the freedom to respond with whatever it deems suitable for your question.
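For example, here's how you might pin the temperature on a generation call using the OpenAI Python client; the model name and prompt are placeholders, and other providers expose an equivalent setting.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",      # placeholder model name
    temperature=0,            # deterministic: stay close to the retrieved context
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": "Context: ...\n\nQuestion: ..."},
    ],
)
print(response.choices[0].message.content)
```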

Evaluation Metrics

Due to the complex, integrated nature of RAG systems, evaluating them is a bit tricky. Because you're dealing with unstructured textual data, how do you devise a scoring scheme that reliably grades correct responses? Consider the following prompts and their responses:

Prompt:
"What is the capital of South Africa?"

Answer 1:
"South Africa has three capitals: Pretoria (executive), Bloemfontein (judicial), 
  and Cape Town (legislative)."

Answer 2:
"While Cape Town serves as the legislative capital of South Africa, Pretoria 
  is the seat of the executive branch, and Bloemfontein is the judicial capital."

Prompt:
"What was the cause of the American Civil War?"

Answer 1:
"The primary cause of the American Civil War was the issue of slavery, 
  specifically its expansion into new territories."

Answer 2:
"While states' rights and economic differences played roles, the main 
  cause of the American Civil War was the debate over slavery and its expansion."
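In each pair, both answers are correct despite very different wording, so exact string matching won't cut it. One common workaround is to compare meaning rather than wording, for example with embedding similarity. Below is a minimal sketch using the sentence-transformers library; the model name is just a small, commonly used default.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

answer_1 = ("South Africa has three capitals: Pretoria (executive), "
            "Bloemfontein (judicial), and Cape Town (legislative).")
answer_2 = ("While Cape Town serves as the legislative capital of South Africa, "
            "Pretoria is the seat of the executive branch, and Bloemfontein is "
            "the judicial capital.")

# Embed both answers and measure how close they are in meaning.
embeddings = model.encode([answer_1, answer_2])
score = util.cos_sim(embeddings[0], embeddings[1])
print(f"Semantic similarity: {score.item():.2f}")  # close to 1.0 for equivalent answers
```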

Exploring RAG Metrics

Over the years, several useful metrics have emerged, each targeting a different aspect of the RAG pipeline. For the retrieval component, common evaluation metrics are nDCG (Normalized Discounted Cumulative Gain), Recall, and Precision. nDCG measures ranking quality: how well the retrieved results are ordered by relevance, with higher scores when relevant results appear near the top. Recall measures how much of the relevant information in the dataset the retriever actually returns. Precision measures what proportion of the returned results are relevant. For the most complete picture, use these metrics together. Other available metrics include LLM Wins, Balance Between Precision and Recall (the F1 score), Mean Reciprocal Rank, and Mean Average Precision.
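To make these definitions concrete, here's a small, self-contained sketch that computes Precision, Recall, and nDCG for a hypothetical retrieval run; the document IDs and relevance grades are made up for illustration.

```python
import math

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are relevant."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top-k results."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / len(relevant)

def ndcg_at_k(retrieved: list[str], relevance: dict[str, int], k: int) -> float:
    """Ranking quality: rewards relevant items that appear near the top."""
    def dcg(scores):
        return sum(s / math.log2(rank + 2) for rank, s in enumerate(scores))
    gains = [relevance.get(doc, 0) for doc in retrieved[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# Hypothetical retrieval run: document IDs returned by the retriever, best first.
retrieved = ["doc3", "doc1", "doc7", "doc2"]
relevant = {"doc1", "doc2", "doc5"}
relevance = {"doc1": 3, "doc2": 2, "doc5": 1}  # graded relevance judgments

print(precision_at_k(retrieved, relevant, k=4))   # 0.5
print(recall_at_k(retrieved, relevant, k=4))      # ~0.67
print(ndcg_at_k(retrieved, relevance, k=4))       # ranking-quality score between 0 and 1
```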

Evaluating RAG Evaluation Tools

Just as there's no shortage of RAG evaluation metrics, there's an equally good number of evaluation tools. Some use custom metrics not mentioned earlier, as well as proprietary ones. Depending on your use case, one tool, or a combination of them, can help you measure and significantly improve your RAG pipeline's performance. Examples of RAG evaluation frameworks include Arize, Automated Retrieval Evaluation System (ARES), Benchmarking Information Retrieval (BEIR), DeepEval, Ragas, OpenAI Evals, Traceloop, TruLens, and Galileo.
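As one example, here's roughly what scoring a single RAG sample with Ragas looks like. The exact API and column names vary between Ragas versions, so treat this as a sketch of the classic datasets-based interface rather than a drop-in snippet.

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision

# One evaluation sample: the question, the pipeline's answer, the retrieved
# contexts, and a reference answer. Column names follow older Ragas releases
# and may differ in your installed version.
samples = {
    "question": ["What is the capital of South Africa?"],
    "answer": ["South Africa has three capitals: Pretoria, Bloemfontein, and Cape Town."],
    "contexts": [[
        "South Africa has three capital cities: Pretoria (executive), "
        "Bloemfontein (judicial), and Cape Town (legislative)."
    ]],
    "ground_truth": ["Pretoria, Bloemfontein, and Cape Town."],
}

results = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)  # per-metric scores between 0 and 1
```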
