RAG
Retrieval-Augmented Generation (RAG) is a pattern where an LLM retrieves relevant content from an external data store at query time and conditions its response on that content. Instead of hoping the model remembers a fact from training, you fetch the fact, stuff it into the context window, and ask the model to answer using it.
Why it matters
RAG solves two problems: stale knowledge (training cutoffs are months to years old) and context limits (you can't fit a 10M-line codebase in any window). For agentic coding specifically, RAG lets you build a "talk to your codebase" experience where the agent fetches only the files or functions relevant to a query instead of reading everything.
Many AI coding tools use lightweight RAG under the hood. Claude Code, Codex CLI, and similar agents default to direct file reads but can pair with MCP servers that implement RAG over a repo, a wiki, or external docs. Running them in SpaceSpider doesn't change the retrieval story — it just hosts the CLI.
How it works
A typical RAG pipeline:
- Index — split source documents into chunks, compute an embedding for each, store in a vector database
- Query — embed the user's question, find the top-K nearest chunks
- Augment — construct a prompt that includes the retrieved chunks alongside the question
- Generate — call the LLM with that prompt
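The four steps above can be sketched end-to-end in a few dozen lines. This is a minimal illustration, not a production pipeline: a bag-of-words counter stands in for a real embedding model, a sorted list stands in for a vector database, and the final LLM call is left as a comment. All names here (`embed`, `index`, `retrieve`, `augment`) are illustrative, not from any library.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: lowercase bag-of-words counts (a real system
    # would call an embedding model here).
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def index(docs: list[str], chunk_size: int = 50) -> list[tuple[Counter, str]]:
    # Index: split each document into fixed-size word chunks, embed each,
    # and keep (embedding, chunk) pairs — the "vector database".
    store = []
    for doc in docs:
        words = doc.split()
        for i in range(0, len(words), chunk_size):
            chunk = " ".join(words[i:i + chunk_size])
            store.append((embed(chunk), chunk))
    return store

def retrieve(store, question: str, k: int = 2) -> list[str]:
    # Query: embed the question and take the top-k nearest chunks.
    q = embed(question)
    ranked = sorted(store, key=lambda item: cosine(item[0], q), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

def augment(question: str, chunks: list[str]) -> str:
    # Augment: put the retrieved chunks ahead of the question in one prompt.
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

store = index([
    "The deploy script lives in scripts/deploy.sh and reads DEPLOY_ENV.",
    "Unit tests run with pytest; integration tests need Docker.",
])
question = "Where is the deploy script?"
prompt = augment(question, retrieve(store, question))
# Generate: pass `prompt` to the LLM of your choice.
```

The shape is the same whatever you swap in for the toy parts: only `embed`, the store, and the final model call change.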
Variations include hybrid search (combining keyword BM25 with vector similarity), reranking (a second model reorders retrieved chunks), and multi-hop retrieval (the model requests more chunks based on intermediate reasoning).
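To make the hybrid-search idea concrete, here is a hedged sketch that blends a toy term-frequency score with a toy set-overlap score, standing in for BM25 and vector similarity respectively. The `alpha` mixing knob and every function name here are assumptions for illustration, not any library's API.

```python
def keyword_score(chunk: str, query: str) -> float:
    # Stand-in for BM25: fraction of chunk words that are query terms.
    terms = set(query.lower().split())
    words = chunk.lower().split()
    return sum(words.count(t) for t in terms) / (len(words) or 1)

def jaccard(chunk: str, query: str) -> float:
    # Stand-in for vector similarity: word-set overlap.
    a, b = set(chunk.lower().split()), set(query.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def hybrid_rank(chunks, query, vector_score, alpha=0.5):
    # alpha balances keyword vs. vector relevance (an assumed knob that
    # real systems tune, or replace with reciprocal-rank fusion).
    scored = [(alpha * keyword_score(c, query)
               + (1 - alpha) * vector_score(c, query), c)
              for c in chunks]
    return [c for _, c in sorted(scored, reverse=True)]

ranked = hybrid_rank(
    ["install with pip install foo", "the cat sat on the mat"],
    "pip install",
    vector_score=jaccard,
)
```

A reranker would fit after `hybrid_rank`: take its top results and let a second model reorder them before they reach the prompt.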
Production RAG pipelines are surprisingly hard to tune. Chunk size, overlap, retrieval count, embedding model, and rerank strategy all affect quality.
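Chunk size and overlap, for instance, interact like this. A toy word-level chunker; the numbers are illustrative, not recommendations, and real splitters usually work on tokens or syntax rather than whitespace:

```python
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    # Slide a window of `size` words, advancing by `size - overlap` each
    # step so consecutive chunks share `overlap` words of context.
    assert 0 <= overlap < size
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

# Small demo: size 4 with overlap 2 means each chunk repeats the last
# two words of the previous one; the tail chunk may be shorter.
pieces = chunk("a b c d e f", size=4, overlap=2)
```

Bigger chunks give the model more context per hit but dilute the embedding; more overlap reduces facts being split across a boundary at the cost of a larger index.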
How it's used
RAG use cases in developer tooling:
- "Ask my docs" — retrieve from a wiki, answer with citations
- Code search — semantic grep over a monorepo
- Agent tools that retrieve from company knowledge before editing
- Long-term memory — store past conversations as embeddings, retrieve relevant bits next time
For simpler cases, modern long-context models make naive RAG unnecessary — just pass the whole file. See /blog/rag-vs-long-context for the tradeoff.
Related terms
- Embedding — the vector form retrieval uses
- Vector database — where embeddings live
- Context window — what RAG populates
- MCP — a common way to expose RAG to coding agents
- LLM — what consumes the retrieved context
FAQ
Is RAG dead now that context windows are huge?
For some use cases, yes — you can just dump a 100k-token file and skip retrieval. But RAG still wins on cost, latency, and very large corpora (millions of tokens) that don't fit any window.
Do I need RAG for Claude Code?
Not for most work. Claude Code's built-in file tools do directed retrieval on demand. RAG matters when your target is outside the local filesystem — external docs, another system's data, long-term memory.