'Kölle'—what's that? - Improving RAG via Multi-Query Retrieval

 

TL;DR

"'Kölle'—what's that? Our users are asking for 'the most beautiful city in the world,' and the chatbot has no idea what they mean? Multi-Query Retrieval can help. Multi-Query Retrieval is an advanced concept within Retrieval-Augmented Generation (RAG) architectures. It aims to improve the quality and diversity of the retrieved contextual information from a knowledge base before it's passed to a Large Language Model (LLM) to answer a query.


Introduction

The RAG (Retrieval-Augmented Generation) approach provides a solid foundation for delivering meaningful and relevant content for user queries. It's used in Knowledge Management and, for example, in creating situational, personalized assistants. Over time, however, shortcomings in standard RAG have been identified, and corresponding solutions proposed. These improvements target one of the three RAG areas: Retrieve, Augment, or Generate. In today's blog post, we will look at a specific technique that extends the standard retrieval mechanism of RAG.

🧠 The Basic Idea of RAG

Figure 1: RAG Architecture (Source: Prompt Engineering)

Based on user input or a query, the standard RAG process runs in three steps: Retrieve, Augment, and Generate (see Figure 1).

In the Retrieve or Retrieval phase, a retriever searches for content relevant to the user query (the provided prompt) from a set of existing data. To ensure the search is efficient, queries are typically converted into optimized data structures, such as vectors. These vectors are then compared with data (other vectors) in a vector index. This provides a result set. Since not all parts of the result set are necessarily equally important, the set might be further constrained, for example, by only considering the top 5 results.

The results obtained this way are then combined with the prompt for the Large Language Model and passed to it. This combination is called Augment, or enriching the original prompt. How the context and prompt are enriched is often application-specific. For example, the vector index might provide us with examples that we include in the prompt as few-shot examples.

The enriched prompt is then passed to the LLM, which generates the corresponding completion. This concludes the RAG process with the Generate phase.
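
To make the three phases concrete, here is a minimal, framework-free sketch in Python. The bag-of-words embedding and the stubbed `complete` function are stand-ins so the example runs on its own; in a real system they would be calls to an embedding model and an LLM, and the in-memory list would be a real vector index.

```python
import numpy as np

# Stand-ins so the sketch is self-contained; swap in a real embedding model and LLM.
VOCAB = ["weekend", "cologne", "köln", "kölle", "events", "museum", "concert", "activities", "do"]

def embed(text: str) -> np.ndarray:
    """Toy bag-of-words embedding; replace with a real embedding model."""
    tokens = text.lower().replace("?", "").split()
    return np.array([float(tokens.count(w)) for w in VOCAB])

def complete(prompt: str) -> str:
    """Stand-in for the LLM call that generates the final answer."""
    return f"[LLM answer based on a prompt of {len(prompt)} characters]"

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom else 0.0

def retrieve(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    """Retrieve: rank documents by cosine similarity to the query embedding."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:top_k]

def rag_answer(query: str, docs: list[str]) -> str:
    context = retrieve(query, docs)                                   # Retrieve
    prompt = ("Answer using only the context below.\n\nContext:\n"
              + "\n".join(f"- {c}" for c in context)
              + f"\n\nQuestion: {query}")                             # Augment
    return complete(prompt)                                           # Generate

docs = ["Cologne hosts concerts and museum nights on the weekend.",
        "The Köln cathedral offers guided tours.",
        "Kölle Alaaf: carnival events all over the city."]
print(rag_answer("What activities are there in Cologne on the weekend?", docs))
```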

If we look specifically at the Retrieve phase, we recognize that a single query can be ambiguous, too general, or too narrow. Comparing such a user prompt with the knowledge in our content database might therefore not find any matching results, leaving no context available to enrich the original prompt for our LLM.

This might not be a problem for general queries or questions for which the LLM has a sufficiently large knowledge base. But especially when we want to base answers for user questions exclusively on our own data (Grounding), high retrieval accuracy and semantic coverage are necessary.

Let's consider our example of the tourism assistant for Cologne again. Users can refer to Cologne in various ways, e.g., "the most beautiful city in the world," "Köln," "Kölle," "Coeln," or "the city with a K." If our knowledge base only had "Köln" stored as the term for the city "Cologne," then queries like, "What can I do this weekend in the most beautiful city in the world?" would be difficult for our AI system to answer. 

Now, this is where Multi-Query Retrieval comes into the picture.

🔍 What is Multi-Query Retrieval?

Multi-Query Retrieval is a technique to increase the recall and semantic coverage in the retrieval step of a RAG pipeline. It uses the LLM's ability for query reformulation to explore different semantic perspectives, thereby improving the context for answer generation.

Instead of just one single search query, several slightly different variants of the same query are generated and used.

Let's go back to our question, "What can I do this weekend in the most beautiful city in the world?". The following variants could be helpful in getting an answer from our AI tourism guide:

  • "What's happening on Saturday and Sunday in Kölle?"

  • "What is there to do on the weekend in the most beautiful city in the world?"

  • "What activities are there in Cologne on the weekend?"

These three questions are equivalent alternatives to the user's query, differing only in the terms for "weekend," "Cologne," and "do."

How can we generate these different variants? Two options are conceivable:

  1. Automatically generated by an LLM (e.g., with the prompt: "Generate several alternative formulations of this question").

  2. Systematic modification (e.g., using synonyms, reformulations, or focus shifts).

Now, a retrieval query is executed against the vector index with each individual variant. This returns a set of answers for each query. All received answers are combined into one result set. Duplicate, irrelevant, or incorrect answers must, of course, be removed from this combined set (e.g., through ranking, clustering, or embedding similarity).

The final combined set of results is then used as context for the LLM.
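
A minimal sketch of this loop, building on the helpers from the first code block: `generate_query_variants` stands in for an LLM call with a prompt like the one in option 1 above, and deduplication is done by simple exact matching; a real system would use ranking, clustering, or embedding similarity instead.

```python
def generate_query_variants(query: str, n: int = 3) -> list[str]:
    """Stand-in for an LLM call such as:
    'Generate n alternative formulations of this question: {query}'."""
    return [
        "What's happening on Saturday and Sunday in Kölle?",
        "What is there to do on the weekend in the most beautiful city in the world?",
        "What activities are there in Cologne on the weekend?",
    ][:n]

def multi_query_retrieve(query: str, docs: list[str], top_k: int = 3) -> list[str]:
    variants = [query] + generate_query_variants(query)
    combined: list[str] = []
    for variant in variants:                    # one retrieval run per variant
        for doc in retrieve(variant, docs, top_k):
            if doc not in combined:             # naive deduplication by exact match
                combined.append(doc)
    return combined

context = multi_query_retrieve(
    "What can I do this weekend in the most beautiful city in the world?", docs)
```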



The Multi-Query Process

Figure 2: Multi-Query Retrieval (Source: Langchain)

The image above illustrates the multi-query process schematically. Based on a user query, we go through the following four (instead of three) phases:

  1. Query Expansion
    The LLM or a specialized algorithm generates n alternative formulations of the input query. For example, for the user question "How does temperature affect the material strength of aluminum?", we can imagine the following variants:

    • Impact of heat on aluminum strength

    • Temperature dependence of aluminum tensile strength

    • Does the strength of aluminum change at high temperatures?

  2. Parallel Retrieval
    For each variant, the top-k most similar documents are retrieved from the knowledge base.

  3. Combination of Results
    The results are merged into a consolidated set of documents (e.g., by removing duplicates, reranking, or relevance assessment).

  4. Augmentation & Generation
    This expanded, more diverse context is then passed to the LLM to generate a more robust answer (a framework-level sketch of the full pipeline follows below).
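
Since Figure 2 comes from LangChain, the same four phases can also be expressed with LangChain's MultiQueryRetriever. This is a hedged sketch, not a recommendation: exact import paths depend on your LangChain version, the model name is only an example, and it requires the langchain, langchain-openai, and faiss-cpu packages plus an OpenAI API key.

```python
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain_community.vectorstores import FAISS
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

texts = [
    "Cologne hosts concerts and museum nights on the weekend.",
    "The Köln cathedral offers guided tours.",
    "Kölle Alaaf: carnival events all over the city.",
]

vectorstore = FAISS.from_texts(texts, OpenAIEmbeddings())        # index the knowledge base
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)             # model name is an example

retriever = MultiQueryRetriever.from_llm(                        # 1. query expansion via the LLM
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),  # 2. parallel retrieval, top-k per variant
    llm=llm,
)
docs = retriever.invoke(                                         # 3. results are combined and deduplicated
    "What can I do this weekend in the most beautiful city in the world?"
)
# 4. pass `docs` as context into your augmentation prompt and generate the answer
```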

After integrating the multi-query retrieval method, it must be verified that the approach actually improves the RAG system. Accordingly, retrieval performance should be evaluated with metrics such as Recall@k, MRR, and nDCG, complemented by human review of factual accuracy.
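
As an illustration, Recall@k and MRR can be computed from retrieval runs with a few lines of Python; the relevance labels per query are assumed to come from a hand-annotated evaluation set.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Share of the relevant documents that appear in the top-k retrieved results."""
    if not relevant:
        return 0.0
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant)

def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR: average of 1/rank of the first relevant document per query (0 if none is found)."""
    reciprocal_ranks = []
    for retrieved, relevant in runs:
        rr = 0.0
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                rr = 1.0 / rank
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(runs) if runs else 0.0
```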

Discussion of Pros and Cons

Next, we want to briefly discuss the advantages and disadvantages of the Multi-Query Retrieval approach.

🧩 Advantages

  • Higher Recall Rate: More relevant documents are found as different semantic perspectives are covered.

  • Robustness against unclear queries: Ambiguous or incomplete questions are less likely to lead to poor retrieval results.

  • Better Generalization: Especially useful for heterogeneous knowledge bases or domain-specific texts.

⚠️ Disadvantages

  • Higher computational cost: Multiple queries increase costs and latency.

  • Increased integration effort: Results must be merged meaningfully.

  • Potential for noise injection: Too many irrelevant documents can dilute the generation.

Design Parameters

Fundamentally, multi-query retrieval expands the search space for a given user query. This expansion must be controlled and subsequently narrowed down again to arrive at concrete results. Consequently, we must make the following design decisions:

Number of query variants

We must find a trade-off between result quality and computational complexity. The more query variants we create, the higher the recall, but computational cost and latency also grow roughly linearly (or sub-linearly) with the additional LLM API calls and retrieval requests. In our projects, we usually start with 5 variants; we have found this to be the best compromise between recall and latency. If necessary, we also make the number of queries dynamic depending on query complexity. However, this increases system complexity, as the complexity assessment is done by, for example, an additional LLM or a supervised AI model.


How many results per query (TOP_K_PER_QUERY)

Too small a k can miss relevant documents; too large a k introduces noise. It is recommended to choose k dynamically (typically 5–20), either based on query complexity or by iteratively increasing k when answers are poor.
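
One simple way to encode these two design parameters is a small config object plus a heuristic that widens k when the first pass yields too few distinct documents. The thresholds below are illustrative, not recommendations, and `retrieve_fn` stands for whatever multi-query retrieval call your pipeline uses.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class MultiQueryConfig:
    num_variants: int = 5        # our usual starting point (see above)
    top_k_per_query: int = 5     # k per variant, widened dynamically below
    max_k: int = 20

def retrieve_with_dynamic_k(retrieve_fn: Callable[[str, int], list[str]],
                            query: str, cfg: MultiQueryConfig,
                            min_results: int = 3) -> list[str]:
    """Iteratively widen k (up to max_k) if the first pass returns too few documents."""
    k = cfg.top_k_per_query
    results = retrieve_fn(query, k)
    while len(results) < min_results and k < cfg.max_k:
        k = min(k * 2, cfg.max_k)
        results = retrieve_fn(query, k)
    return results
```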


Query-Expansion Method

We already mentioned the two basic approaches for generating new queries above: hand-crafted heuristics or the use of an LLM. LLM reformulations, in particular, easily provide semantically strong variants (e.g., HyDE: generate hypothetical documents/answers and embed them). However, LLM-based expansion incurs token/API costs and adds latency. In contrast, heuristics must be designed, implemented, and quality-assured.
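
To illustrate the HyDE idea mentioned above: instead of embedding the question directly, the LLM first drafts a hypothetical answer, and documents are then retrieved by similarity to that draft. This short sketch reuses the `complete` and `retrieve` helpers from the first code block, so it is only a schematic outline.

```python
def hyde_retrieve(query: str, docs: list[str], top_k: int = 5) -> list[str]:
    """HyDE-style expansion: let the LLM draft a hypothetical answer, then
    retrieve documents similar to that draft rather than to the raw query."""
    hypothetical_answer = complete(
        f"Write a short, plausible answer to the following question: {query}"
    )
    return retrieve(hypothetical_answer, docs, top_k)
```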

Managing the expanded retrieval context

The basic approach of multi-query retrieval is to achieve better result quality by generating similar queries. These queries, in turn, produce their own answers, which we must handle appropriately in the Augmentation and Generation steps of RAG. This raises, among others, the following questions:

  • How do we deal with identical or very similar results?

  • How do we ensure that the important and correct answers are presented to the LLM?

  • How do we ensure that we make the best possible use of the available LLM context?

To address these, we can first consider general techniques: (i) suitable indexing of documents, e.g., chunking at the paragraph/node level instead of whole documents, (ii) integration of metadata (source, section, date) for tracking and filtering nodes, (iii) caching for query variants, embeddings, and successful answers, (iv) node and prompt compression, or (v) synthesis of content.
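
As a small example of point (iii), query variants and embeddings can be memoized so that repeated or similar requests do not trigger new embedding or LLM calls. The sketch below reuses `embed` and `generate_query_variants` from the earlier code blocks; a production system would likely use an external cache instead of an in-process one.

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def cached_embedding(text: str) -> tuple[float, ...]:
    """Cache embeddings per text (tuples are hashable, numpy arrays are not)."""
    return tuple(embed(text))

@lru_cache(maxsize=1_000)
def cached_variants(query: str) -> tuple[str, ...]:
    """Cache the generated query variants per original query."""
    return tuple(generate_query_variants(query))
```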

However, we must also specifically address the challenges of answer filtering and answer sorting:

  • Answer filtering through result fusion & deduplication
    Fusion strategies combine scores from different queries/indexes (Union vs. Weighted Fusion).

  • Answer sorting through Re-ranking
    Due to the increased number of queries and answers, the best answers must be filtered and presented to the LLM in a targeted manner. For this, it is recommended to re-rank the answers. A Cross-Encoder (or a smaller supervised ranker) can be used for the final reordering. Note: Cross-Encoders scale poorly for a very large number of candidates (logical pipeline: Retrieval → top N → Re-Rank). A sketch of fusion and re-ranking follows below.
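
A hedged sketch of both steps: Reciprocal Rank Fusion as one common weighted-fusion strategy, and a cross-encoder re-ranker from the sentence-transformers library. The model name is one publicly available example, and the package must be installed separately; neither is prescribed by the approach itself.

```python
from collections import defaultdict
from sentence_transformers import CrossEncoder  # pip install sentence-transformers

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Weighted fusion: a document scores 1/(k + rank) in every result list it appears in;
    duplicates are merged implicitly because scores accumulate per document."""
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] += 1.0 / (k + rank)
    return [doc for doc, _ in sorted(scores.items(), key=lambda x: x[1], reverse=True)]

def rerank(query: str, docs: list[str], top_n: int = 5) -> list[str]:
    """Re-rank a small candidate set with a cross-encoder and keep only the top n."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(scores, docs), key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in ranked[:top_n]]

# Pipeline: per-variant retrieval -> fusion/deduplication -> cross-encoder re-rank -> LLM
```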

Summary and Outlook

Multi-Query Retrieval increases recall and robustness by generating and querying multiple semantic perspectives on the input.

In our next article, we will deal with the concrete Python implementation of Multi-Query Retrieval. Standard frameworks like LlamaIndex offer powerful abstractions and modular components (Query Transformations, TransformRetriever, Fusion Packs) that greatly simplify the multi-query implementation for us. We will also take a closer look at these.

What suggestions, tips, and tricks have you found for your RAG problems? We look forward to your feedback. Until next time, your team at AI Cologne.
