
4 Retrieval Augmented Generation (RAG) #


4.1 Introduction to RAG #

Current LLMs are already powerful; for example, they can

  1. answer questions using knowledge learned from training data,
  2. summarize text, and
  3. generate code.

However, they have no information about events that occurred after their knowledge cutoff date, nor about anything that was not part of the training material (such as proprietary information).

Retrieval Augmented Generation (RAG) makes additional information available to the LLM in order to allow for better and more reliable answers (and other types of LLM output).

The types of information needed depend on the task at hand, e.g.:

  1. code generation: context of existing codebase (classes, functions, definitions, coding style …)
  2. company chatbots: internal documents (manuals, support guides, FAQs, …)
  3. special domains such as legal or medicine: case files, journals, private data, …
  4. explore scientific papers: publication/preprint PDFs
  5. personalized RAG: personal documents (e.g. Word + PDF), calendars, emails, contacts, …

The central part of any RAG system is a retriever that

  1. manages a knowledge base of trusted information and
  2. finds the most relevant information for each specific prompt and shares this information with the LLM.

The full pipeline in typical RAG use is shown in the figure:

Figure 4.1: General RAG pipeline (adapted from Coursera RAG course)

Instead of feeding the user prompt directly to the LLM, the prompt is first augmented and modified by the RAG system. Typically, this augmented prompt contains verbatim chunks of the source documents stored in the RAG vector database.
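A minimal sketch of this augmentation step, assuming the retriever has already returned a list of relevant text chunks (the template wording is illustrative, not a fixed standard):

```python
def build_augmented_prompt(user_prompt: str, chunks: list[str]) -> str:
    """Prepend retrieved chunks verbatim so the LLM can ground its answer."""
    context = "\n\n".join(f"[Source {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the sources below. "
        "Cite sources as [Source n].\n\n"
        f"{context}\n\nQuestion: {user_prompt}"
    )
```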

Advantages of RAG #

Injects missing knowledge: Adds info not in the training data (e.g. policies, updates)

Reduces hallucinations: Grounds answers with relevant context

Keeps models up to date: Reflects new info by updating the knowledge base

Enables source citation: Includes sources for verifiable answers

Focuses model on generation: Retriever finds facts, LLM writes responses

A key goal of RAG is to inject just the documents (or document chunks) that are relevant to the current prompt, since

● Longer prompts take more computation to run

● Prompt + documents have to fit into the context window of the LLM (between several thousand and a million tokens)

● Irrelevant documents might distract the LLM

For this purpose, the central task of a retriever is to find the highest-scoring documents within its knowledge base, according to some relevance metric, as illustrated by this example:

Figure 4.2: RAG retrieval example (source: Coursera RAG course)

4.2 RAG search algorithms #

RAG systems use two main search approaches:

  1. Semantic search (via embedding vectors): looks for documents with similar meaning to the prompt
  2. Keyword search: looks for documents containing the exact words found in the prompt

Both approaches can be combined with each other and with metadata filtering:

Figure 4.3: RAG search algorithms (source: Coursera RAG course)

Figure 4.4: RAG metadata filtering (source: Coursera RAG course)
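One common way to combine the ranked result lists of both searches is Reciprocal Rank Fusion (RRF); RRF is not named in the text above but is a standard fusion technique. A minimal sketch (the constant 60 is the value typically used in the literature):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document accumulates 1 / (k + rank) per list."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse a semantic and a keyword ranking:
# fused = reciprocal_rank_fusion([semantic_ids, keyword_ids])
```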

Keyword search: TF-IDF + BM25 #

Simple approach, used for decades: keyword search

Basis: Bag of Words (for search string + each document)

Target: find documents that often contain relevant words (“terms”) of the prompt:

  1. Primary scoring: term frequency (TF)
  2. Suppress (too) common terms using “inverse document frequency” (IDF)

\[ \text{Score}(t, d) = \mathrm{TF}(t, d) \times \log\frac{N}{\mathrm{DF}(t)} \]

where \(N\) is the total number of documents and \(\mathrm{DF}(t)\) is the number of documents containing the term \(t\).

Figure 4.5: RAG keyword search example using TF-IDF (source: Coursera RAG course)
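A minimal sketch of this TF-IDF score in plain Python (tokenization is reduced to lowercase whitespace splitting for brevity):

```python
import math

def tf_idf_score(term: str, doc: list[str], docs: list[list[str]]) -> float:
    """TF-IDF: term frequency in doc, damped by how many docs contain the term."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

docs = [d.lower().split() for d in ["the cat sat", "the dog barked", "cat and dog"]]
print(tf_idf_score("cat", docs[0], docs))  # 1 * log(3/2) ≈ 0.405
```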

Advanced algorithms such as BM25 also take the length of each document into account and contain tunable parameters for term frequency saturation and length normalization.
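For comparison, a sketch of the BM25 score for a single term, with the tunable parameters \(k_1\) (term-frequency saturation) and \(b\) (length normalization); the IDF smoothing shown is one common variant:

```python
import math

def bm25_score(term, doc, docs, k1=1.5, b=0.75):
    """BM25 for one term: saturating TF, normalized by relative document length."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    avgdl = sum(len(d) for d in docs) / len(docs)  # average document length
    idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
```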

Keyword search has problems:

  1. it cannot match synonyms or different word forms (singular vs. plural, etc.)
  2. it cannot distinguish different meanings of a word (“bank”)

Not surprisingly, vector embeddings generally lead to vastly better matches between prompts and documents.

How to create embedding models suitable for RAG applications?

Already seen: Word2vec methods (self-supervised learning)

Also possible:

  1. Contrastive learning (supervised training with positive and negative phrase pairs)
  2. Pretrained LLM (self-supervised learning) - often specialized embedding LLM (e.g. encoder-only)
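A minimal sketch of embedding prompts and documents with a pretrained embedding model via the sentence-transformers library (the model name is one popular choice, not prescribed by the text):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder
docs = ["Reset your password in account settings.", "Our office opens at 9 am."]
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode("How do I change my password?", normalize_embeddings=True)
scores = doc_vecs @ query_vec  # cosine similarity, since vectors are normalized
print(scores.argmax())  # index of the semantically closest document
```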

Retriever quality metrics #

Note: the retriever quality can only be determined reliably if the correct answers are known, i.e., each document (in the test set) is labeled as relevant or irrelevant to a given test prompt!

Precision: relevant retrieved / total retrieved (reduced by false positives)

Recall: relevant retrieved / total relevant (reduced by false negatives)

Figure 4.6: RAG metrics example (source: Coursera RAG course)

Retrieval metrics are influenced by how many documents the retriever returns (the top \(k\))

Often cited precision/recall@k: precision@5, recall@5, precision@15, recall@15

Mean Average Precision (MAP@K): the mean, over prompts, of the average precision at the positions of the relevant documents among the first K retrieved documents.

Reciprocal rank: the reciprocal of the rank of the first relevant document in the returned list (Mean Reciprocal Rank, MRR: the average over a set of prompts)
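A sketch of precision@k, recall@k, and reciprocal rank, assuming a ranked list of retrieved document IDs and a set of IDs labeled as relevant:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant hit; 0 if none is retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

# MRR: average reciprocal_rank over a set of test prompts.
```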

4.3 Information retrieval in production #

Let us now turn to the question of how a vector search can be performed in practice.

This is a highly nontrivial task for large document (chunk) databases, which may contain billions of documents.

Note that the obvious approach one might use for vector searches in two or three dimensions, partitioning the space into cells and searching only the cells close to the target vector, does not work in high dimensions (the "curse of dimensionality").

KNN: K Nearest Neighbors #

Let us first state the naive solution: the K nearest neighbor (KNN) algorithm.

Parameter: number \(k\) of result vectors

Given: prompt embedding vector \(y\), database with document vectors \(x_i,\ 1\le i \le N\)

Task: find in the set \(\{x_i\}\) the \(k\) vectors closest to \(y\)

  1. initialize the result list with \(x_1, \dots, x_k\); compute the distances \(s_i = |x_i - y|\)
  2. sort the result list by \(s_i\) (ascending)
  3. for \(k+1\le i \le N\):
    1. compute \(s_i\)
    2. if \(s_i < s_k\) (the largest distance in the result list): replace \(x_k\) by \(x_i\); re-sort the result list

Effort scales linearly with the number of vectors in the database: in general too expensive!
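A direct implementation of this brute-force search, here with NumPy and a full sort instead of the incremental result list (the result is the same; the array shapes are illustrative):

```python
import numpy as np

def knn(y: np.ndarray, X: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k database vectors closest to query y."""
    distances = np.linalg.norm(X - y, axis=1)  # |x_i - y| for all i
    return np.argsort(distances)[:k]

X = np.random.rand(10_000, 384)  # toy database of 10k embedding vectors
y = np.random.rand(384)          # query embedding
print(knn(y, X, k=5))
```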

Approximate Nearest Neighbors #

Basis: Navigable Small World (NSW) algorithm

Figure 4.7: Navigable Small World algorithm: construction (source: Coursera RAG course)

Figure 4.8: Navigable Small World algorithm: search (source: Coursera RAG course)

Figure 4.9: Navigable Small World algorithm: search (source: Coursera RAG course)

Improvement: Hierarchical Navigable Small World (HNSW) speeds up early parts of the search

Relies on a hierarchical proximity graph

Figure 4.10: Hierarchical Navigable Small World: construction (source: Coursera RAG course)

Figure 4.11: Hierarchical Navigable Small World: search (source: Coursera RAG course)

● ANN is significantly faster than KNN at scale

● Finds close documents, but cannot guarantee the best matches

● Depends on a proximity graph that is computationally expensive to build but can be pre-computed
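In practice one rarely implements HNSW by hand; libraries such as hnswlib provide it. A minimal sketch (the parameter values for M, ef_construction, and ef are illustrative starting points, not prescribed by the text):

```python
# pip install hnswlib numpy
import hnswlib
import numpy as np

dim, n = 384, 100_000
data = np.random.rand(n, dim).astype(np.float32)  # toy document embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)  # graph build params
index.add_items(data, np.arange(n))  # expensive, but pre-computed once
index.set_ef(50)  # search-time accuracy/speed trade-off

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels)  # approximate nearest neighbors, not guaranteed exact
```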

Vector Database #

Typical vector database operations in a RAG context:

● Database setup

● Load documents

● Create sparse vectors for keyword search

● Create dense vectors for semantic search

● Create HNSW index to power ANN algorithm

● Run searches!
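These operations map almost one-to-one onto the API of a typical vector database. A minimal sketch of the dense-vector path using chromadb (one of many options; Qdrant, Weaviate, or pgvector follow the same pattern):

```python
# pip install chromadb
import chromadb

client = chromadb.Client()  # in-memory instance; persistent clients also exist
collection = client.create_collection("support_docs")  # HNSW-backed store

collection.add(
    ids=["doc1", "doc2"],
    documents=["Reset your password in account settings.",
               "Contact billing for invoice questions."],
)  # documents are embedded with the collection's default embedding model

results = collection.query(query_texts=["How do I change my password?"], n_results=1)
print(results["documents"])
```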

4.4 Chunking #



© 2025 Nils Blümer