Advanced RAG Techniques


A basic RAG system consists of indexing, retrieval, and generation. Several other steps can be integrated with these to build an advanced RAG system, such as storage, prompt construction, and translation. When SportsBuddy generates a response, it displays only the top result from a list of candidates. You can adjust the retriever's k search argument to return more documents; the vector store ranks them by similarity score automatically.
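With LangChain, for example, k is passed through search_kwargs when you create the retriever. Here's a minimal sketch, where `vectorstore` is assumed to be the store built during indexing and the query is a placeholder:

```python
# Ask the retriever for the top 4 matches instead of only the best one.
# `vectorstore` is assumed to be the vector store built during indexing.
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})

docs = retriever.invoke("Who won gold in the 100m sprint?")  # placeholder query
print(len(docs))  # up to 4 documents, ordered by similarity score
```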

The starter notebook already includes a few modifications to the basic RAG implementation. It doesn't use the RAG prompt from “rlm/rag-prompt”, because that prompt conditions the response, which wouldn't serve the purpose here. Because the response can be long, the notebook shows the response's length instead of the response itself; you can still print the response to read it if you like. Finally, chunk_overlap has been reduced because you're working with a relatively small dataset. With these changes, a quick test reveals that four documents are retrieved for the given query.
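In code, those changes might look roughly like this. `documents`, `chain`, and `query` stand in for objects the notebook already defines, and the exact sizes are illustrative, not the notebook's settings:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Reduced chunk_overlap: with a small dataset, heavy overlap mostly
# duplicates the same text across chunks. Sizes are illustrative.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# The answer can be long, so show its length instead of the text.
# With a chain that ends in StrOutputParser, the result is a string.
response = chain.invoke(query)
print(len(response))
```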

It could just as well be more or fewer, depending on the query. How, then, were you getting the kind of concise responses you saw earlier? It's mainly because of the prompt, but there's more to it. The RAG prompt forced the LLM to condense the output into the best possible answer that fit, which can be undesirable because it might leave out plenty of good information. The basic implementation also set the retriever's k argument to 1, delegating to the vector store the responsibility of picking the single best document. Although the vector store's search capabilities are good, they aren't optimized to always return the best result.

Assess the response again and you'll see that the top document, although the most relevant to the given query, also contains irrelevant information. So apart from some retrieved documents being irrelevant to the query, even the relevant ones can contain irrelevant text. There are a few ways to tackle this, and contextual compression is one of them.

Contextual Compression

Contextual compression is a technique that compresses the retrieved documents based on the query, filtering out irrelevant content. It's essentially smoothing out rough edges: even though the retrieved documents are the best match for the query, contextual compression introduces a post-processing phase that removes the noise, resulting in a better response.
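Here's a minimal sketch of what this can look like with LangChain's ContextualCompressionRetriever and an LLMChainExtractor. The `vectorstore` variable, the model name, and the query are placeholders for whatever your notebook already uses:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor
from langchain_openai import ChatOpenAI

# The extractor asks an LLM to keep only the passages of each
# retrieved document that are relevant to the query.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model name is illustrative
compressor = LLMChainExtractor.from_llm(llm)

compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)

# Each returned document now contains only query-relevant text.
docs = compression_retriever.invoke("Who won the most swimming medals?")
```

Because the compressor runs an LLM call per retrieved document, it adds latency and cost; the payoff is that the generation step sees far less noise.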

Introducing Re-ranking

Contextual compression is at the core of many re-ranking techniques. Vector databases assign each result a relevance score by default, and responses can be ranked by it. Re-ranking strategies do something similar, but they re-score the retrieved documents against the given query, typically with a model that's more accurate than plain vector similarity. That, combined with contextual compression strategies and other fine-tuning techniques, produces more accurate responses.
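One way to experiment with this in LangChain is a cross-encoder re-ranker, which plugs into the same compression pipeline as above; the re-ranker is just another document compressor. A sketch, again assuming `vectorstore` from earlier, with an illustrative model name and query:

```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

# A cross-encoder scores each (query, document) pair jointly, which
# is usually more accurate than the vector store's similarity score.
model = HuggingFaceCrossEncoder(model_name="BAAI/bge-reranker-base")
reranker = CrossEncoderReranker(model=model, top_n=3)

rerank_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=vectorstore.as_retriever(search_kwargs={"k": 10}),
)

# Fetch 10 candidates, then keep the 3 the cross-encoder ranks highest.
docs = rerank_retriever.invoke("Which country topped the medal table?")
```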
