
4 Retrieval Augmented Generation (RAG) #


4.1 Introduction to RAG #

Current LLMs are already powerful; for example, they can

  1. answer questions using knowledge learned from training data,
  2. summarize text, and
  3. generate code.

However, they have no information about events that occurred after their knowledge cutoff date, nor about anything that was not part of the training material (such as proprietary information).

Retrieval Augmented Generation (RAG) makes additional information available to the LLM in order to allow for better and more reliable answers (and other types of LLM output).

The types of information needed depend on the task at hand, e.g.:

  1. code generation: context of existing codebase (classes, functions, definitions, coding style …)
  2. company chatbots: internal documents (manuals, support guides, FAQs, …)
  3. special domains such as legal or medicine: case files, journals, private data, …
  4. explore scientific papers: publication/preprint PDFs
  5. personalized RAG: personal documents (e.g. Word + PDF), calendars, emails, contacts, …

The central part of any RAG system is a retriever that

  1. manages a knowledge base of trusted information and
  2. finds the most relevant information for each specific prompt and shares this information with the LLM.

The full pipeline in typical RAG use is shown in the figure:

Figure 4.1: General RAG pipeline (adapted from Coursera RAG course)

Instead of feeding the user prompt directly to the LLM, the prompt is first augmented and modified by the RAG system. Typically, this augmented prompt contains verbatim chunks of the source documents stored in the RAG vector database.
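A minimal sketch of this augmentation step, assuming the retriever has already returned a list of relevant text chunks (the template wording is illustrative, not a fixed standard):

```python
def build_augmented_prompt(user_prompt: str, chunks: list[str]) -> str:
    """Prepend retrieved chunks verbatim so the LLM can ground its answer."""
    context = "\n\n".join(f"[Source {i + 1}]\n{c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the sources below. "
        "Cite sources as [Source n].\n\n"
        f"{context}\n\nQuestion: {user_prompt}"
    )
```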

Advantages of RAG #

Injects missing knowledge: Adds info not in the training data (e.g. policies, updates)

Reduces hallucinations: Grounds answers with relevant context

Keeps models up to date: Reflects new info by updating the knowledge base

Enables source citation: Includes sources for verifiable answers

Focuses model on generation: Retriever finds facts, LLM writes responses

A key goal of RAG is to inject just the documents (or document chunks) that are relevant to the current prompt, since

● Longer prompts take more computation to run

● Prompt + documents have to fit into the context window of the LLM (between several thousand and a million tokens)

● Irrelevant documents might distract the LLM

For this purpose, the central task of a retriever is to find the highest-scoring documents within its knowledge base, according to some relevance metric, as illustrated by this example:

Figure 4.2: RAG retrieval example (source: Coursera RAG course)

4.2 RAG search algorithms #

RAG systems use two main search approaches:

  1. Semantic search (via embedding vectors): looks for documents with similar meaning to the prompt
  2. Keyword search: looks for documents containing the exact words found in the prompt

Both approaches can be combined with each other and with metadata filtering:

Figure 4.3: RAG search algorithms (source: Coursera RAG course)

Figure 4.4: RAG metadata filtering (source: Coursera RAG course)
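One common way to combine the ranked result lists of both searches is Reciprocal Rank Fusion (RRF); RRF is not named in the text above but is a standard fusion technique. A minimal sketch (the constant 60 is the value typically used in the literature):

```python
def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each document accumulates 1 / (k + rank) per list."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. fuse a semantic and a keyword ranking:
# fused = reciprocal_rank_fusion([semantic_ids, keyword_ids])
```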

Keyword search: TF-IDF + BM25 #

Simple approach, used for decades: keyword search

Basis: Bag of Words (for search string + each document)

Target: find documents that often contain relevant words (“terms”) of the prompt:

  1. Primary scoring: term frequency (TF)
  2. Suppress (too) common terms using “inverse document frequency” (IDF)

\[ \text{Score}(t, d) = \mathrm{TF}(t, d) \times \log\frac{N}{\mathrm{DF}(t)} \]

where \(N\) is the total number of documents and \(\mathrm{DF}(t)\) is the number of documents containing the term \(t\).

Figure 4.5: RAG keyword search example using TF-IDF (source: Coursera RAG course)
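A minimal sketch of this TF-IDF score in plain Python (tokenization is reduced to lowercase whitespace splitting for brevity):

```python
import math

def tf_idf_score(term: str, doc: list[str], docs: list[list[str]]) -> float:
    """TF-IDF: term frequency in doc, damped by how many docs contain the term."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0
    return tf * math.log(len(docs) / df)

docs = [d.lower().split() for d in ["the cat sat", "the dog barked", "cat and dog"]]
print(tf_idf_score("cat", docs[0], docs))  # 1 * log(3/2) ≈ 0.405
```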

Advanced algorithms such as BM25 also take the length of each document into account and contain tunable parameters for term frequency saturation and length normalization.
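For comparison, a sketch of the BM25 score for a single term, with the tunable parameters \(k_1\) (term-frequency saturation) and \(b\) (length normalization); the IDF smoothing shown is one common variant:

```python
import math

def bm25_score(term, doc, docs, k1=1.5, b=0.75):
    """BM25 for one term: saturating TF, normalized by relative document length."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    avgdl = sum(len(d) for d in docs) / len(docs)  # average document length
    idf = math.log((len(docs) - df + 0.5) / (df + 0.5) + 1)
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
```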

Keyword search has problems:

  1. it cannot match synonyms or different word forms (singular vs. plural, etc.)
  2. it cannot distinguish different meanings of a word (“bank”)

Not surprisingly, vector embeddings generally lead to vastly better matches between prompts and documents.

How to create embedding models suitable for RAG applications?

Already seen: Word2vec methods (self-supervised learning)

Also possible:

  1. Contrastive learning (supervised training with positive and negative phrase pairs)
  2. Pretrained LLM (self-supervised learning) - often specialized embedding LLM (e.g. encoder-only)
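A minimal sketch of embedding prompts and documents with a pretrained embedding model via the sentence-transformers library (the model name is one popular choice, not prescribed by the text):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose embedder
docs = ["Reset your password in account settings.", "Our office opens at 9 am."]
doc_vecs = model.encode(docs, normalize_embeddings=True)
query_vec = model.encode("How do I change my password?", normalize_embeddings=True)
scores = doc_vecs @ query_vec  # cosine similarity, since vectors are normalized
print(scores.argmax())  # index of the semantically closest document
```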

Retriever quality metrics #

Note: the retriever quality can only be determined reliably if the correct answers are known, i.e., each document (in the test set) is labeled as relevant or irrelevant to a given test prompt!

Precision: relevant retrieved / total retrieved (reduced by false positives)

Recall: relevant retrieved / total relevant (reduced by false negatives)

Figure 4.6: RAG metrics example (source: Coursera RAG course)

Retrieval metrics are influenced by how many documents the retriever returns (the top \(k\))

Often cited precision/recall@k: precision@5, recall@5, precision@15, recall@15

Mean Average Precision (MAP@K): the mean, over prompts, of the average precision at the positions of the relevant documents among the first K retrieved documents.

Reciprocal rank: the reciprocal of the rank of the first relevant document in the returned list (Mean Reciprocal Rank, MRR: the average over a set of prompts)
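A sketch of precision@k, recall@k, and reciprocal rank, assuming a ranked list of retrieved document IDs and a set of IDs labeled as relevant:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(d in relevant for d in retrieved[:k]) / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top k."""
    return sum(d in relevant for d in retrieved[:k]) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant hit; 0 if none is retrieved."""
    for rank, d in enumerate(retrieved, start=1):
        if d in relevant:
            return 1.0 / rank
    return 0.0

# MRR: average reciprocal_rank over a set of test prompts.
```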

4.3 Information retrieval in production #

Let us now turn to the question of how a vector search can be performed in practice.

This is a highly nontrivial task for large document (chunk) databases, which may contain billions of documents.

Note that the obvious approach one might use for vector searches in two or three dimensions, partitioning the space into cells and searching only the cells close to the target vector, does not work in high dimensions (the "curse of dimensionality").

KNN: K Nearest Neighbors #

Let us first state the naive solution: the K nearest neighbor (KNN) algorithm.

Parameter: number \(k\) of result vectors

Given: prompt embedding vector \(y\), database with document vectors \(x_i,\ 1\le i \le N\)

Task: find in the set \(\{x_i\}\) the \(k\) vectors closest to \(y\)

  1. initialize the result list with \(x_1, \dots, x_k\); compute the distances \(s_i = |x_i - y|\)
  2. sort the result list by \(s_i\) (ascending)
  3. for \(k+1\le i \le N\):
    1. compute \(s_i\)
    2. if \(s_i < s_k\) (the largest distance in the result list): replace \(x_k\) by \(x_i\); re-sort the result list

Effort scales linearly with the number of vectors in the database: in general too expensive!
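A direct implementation of this brute-force search, here with NumPy and a full sort instead of the incremental result list (the result is the same; the array shapes are illustrative):

```python
import numpy as np

def knn(y: np.ndarray, X: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k database vectors closest to query y."""
    distances = np.linalg.norm(X - y, axis=1)  # |x_i - y| for all i
    return np.argsort(distances)[:k]

X = np.random.rand(10_000, 384)  # toy database of 10k embedding vectors
y = np.random.rand(384)          # query embedding
print(knn(y, X, k=5))
```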

Approximate Nearest Neighbors #

Basis: Navigable Small World (NSW) algorithm

Figure 4.7: Navigable Small World algorithm: construction (source: Coursera RAG course)

Figure 4.8: Navigable Small World algorithm: search (source: Coursera RAG course)

Figure 4.9: Navigable Small World algorithm: search (source: Coursera RAG course)

Improvement: Hierarchical Navigable Small World (HNSW) speeds up early parts of the search

Relies on a hierarchical proximity graph

Figure 4.10: Hierarchical Navigable Small World: construction (source: Coursera RAG course)

Figure 4.11: Hierarchical Navigable Small World: search (source: Coursera RAG course)

● ANN is significantly faster than KNN at scale

● Finds close documents, but cannot guarantee the best matches

● Depends on a proximity graph that is computationally expensive to build but can be pre-computed
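In practice one rarely implements HNSW by hand; libraries such as hnswlib provide it. A minimal sketch (the parameter values for M, ef_construction, and ef are illustrative starting points, not prescribed by the text):

```python
# pip install hnswlib numpy
import hnswlib
import numpy as np

dim, n = 384, 100_000
data = np.random.rand(n, dim).astype(np.float32)  # toy document embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)  # graph build params
index.add_items(data, np.arange(n))  # expensive, but pre-computed once
index.set_ef(50)  # search-time accuracy/speed trade-off

query = np.random.rand(dim).astype(np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels)  # approximate nearest neighbors, not guaranteed exact
```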

Vector Database #

Typical vector database operations in a RAG context:

● Database setup

● Load documents

● Create sparse vectors for keyword search

● Create dense vectors for semantic search

● Create HNSW index to power ANN algorithm

● Run searches!
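These operations map almost one-to-one onto the API of a typical vector database. A minimal sketch of the dense-vector path using chromadb (one of many options; Qdrant, Weaviate, or pgvector follow the same pattern):

```python
# pip install chromadb
import chromadb

client = chromadb.Client()  # in-memory instance; persistent clients also exist
collection = client.create_collection("support_docs")  # HNSW-backed store

collection.add(
    ids=["doc1", "doc2"],
    documents=["Reset your password in account settings.",
               "Contact billing for invoice questions."],
)  # documents are embedded with the collection's default embedding model

results = collection.query(query_texts=["How do I change my password?"], n_results=1)
print(results["documents"])
```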

4.4 Chunking #



© 2025 Nils Blümer