AI · Technology · RAG · LLM

Understanding the Layers of RAG Systems

17 October 2025

Retrieval Augmented Generation (RAG) brings private or fast-changing knowledge into the conversation with a language model. It is not a single component; it is a pipeline of layers that work together. Some parts can be reused across projects, while others are tied to the data and must be rebuilt. The overview below explains each layer, how reusable it is in practice, and what that means in day-to-day business work, with concrete examples to make the choices tangible.

Data layer

This is the raw material that everything else depends on, and it changes from one organisation to the next, which means low reusability beyond general scripts and playbooks. Sources include PDFs, policy manuals, knowledge bases, CRM notes, ticket logs, call transcripts, spreadsheets, and websites. Preparation involves cleaning, deduplication, normalising formats, and splitting text into coherent chunks that each carry a single idea. Think of a chunk as a clause in a contract, one FAQ answer, a single meeting decision, or a short policy note with its heading. Poor chunking reduces relevance and increases hallucinations. For a legal team, you would chunk contracts into clauses with metadata for jurisdiction and effective dates, so that queries about termination rights return the exact clause rather than the whole document.

Embedding layer

Embeddings turn text into vectors that represent meaning. The pipeline that creates them is reusable, but the vectors themselves are not, because they are bound to a specific model, language, and corpus. Model choice matters for domain and update cadence: an embedding model that captures legal phrasing will struggle with customer support slang or product codes. A bank, for example, might reuse the same embedding pipeline for both retail policy content and developer runbooks, yet it would regenerate embeddings for each corpus and each model upgrade.

Vector database layer

The vector store enables fast similarity search with metadata filters. The technology and indexing strategy are broadly reusable, but the index contents are not, since they are derived from specific embeddings. Good metadata design is the difference between trusted retrieval and noisy answers. Compliance teams, for instance, filter by jurisdiction, document type, and effective date so the assistant cites only current, in-force policies, even though the underlying database platform and index configuration can be reused across domains.

Retrieval logic layer

This layer governs how the system searches and ranks context, and the logic is largely reusable across datasets with sensible tuning for each domain. It may combine keyword filters with vector search, add a reranker, boost recent content, enforce source diversity, and set the number and size of chunks for a given task. A support assistant, for example, boosts content tagged with the current software version so customers receive instructions that match the release they are using, while a contracts assistant lowers the chunk size and tightens filters to avoid mixing clauses from different agreements.

Generation layer

The language model turns retrieved context into an answer. The serving stack can serve many domains, yet prompt templates are tailored to task, tone, and risk posture. Strong prompts insist on grounded reasoning and citation to cut hallucinations, and they shape output for the audience. A finance assistant answers in short paragraphs, cites sources, and includes a confidence note, while a developer assistant includes code snippets and links to runbooks, all on the same base model and infrastructure but with prompts adjusted per use case.

Evaluation and feedback layer

Quality improves only when it is measured, which makes this layer highly reusable as a framework that any corpus can plug into, while test sets and scoring rubrics are corpus-specific. Human review and automated checks track faithfulness to sources, answer relevance, completeness, latency, and cost, and logs point to missing content, weak chunks, or unclear prompts. Many teams run quarterly reviews that sample answers per domain, score usefulness and faithfulness, and feed precise fixes back to content owners and pipeline jobs.
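To make the data layer concrete, here is a minimal sketch of clause-level chunking with metadata. It assumes a plain-text contract whose clauses start with numbered headings such as "12.1 Termination"; the regular expression, the Chunk structure, the file name, and the metadata fields are illustrative choices, not a prescribed format.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str       # one clause, carrying a single idea
    metadata: dict  # used later for filtering and citation

def chunk_contract(raw_text: str, jurisdiction: str, effective_date: str) -> list[Chunk]:
    """Split a contract into clause-level chunks with retrieval metadata."""
    # Assumes each clause begins on a new line with a numbered heading like "12.1".
    pattern = re.compile(r"(?m)^(?=\d+(?:\.\d+)*\s)")
    clauses = [c.strip() for c in pattern.split(raw_text) if c.strip()]
    return [
        Chunk(
            text=clause,
            metadata={
                "clause_id": clause.split()[0],              # e.g. "12.1"
                "jurisdiction": jurisdiction,
                "effective_date": effective_date,
                "source": "master_services_agreement.pdf",   # hypothetical file name
            },
        )
        for clause in clauses
    ]

contract = (
    "12.1 Termination for convenience requires 30 days written notice.\n"
    "12.2 Termination for cause takes effect immediately."
)
for chunk in chunk_contract(contract, jurisdiction="UK", effective_date="2025-01-01"):
    print(chunk.metadata["clause_id"], "->", chunk.text)
```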
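The embedding and vector database layers can be sketched together. The snippet below uses the open-source sentence-transformers library as one possible embedding model and a brute-force NumPy similarity search with a metadata filter in place of a dedicated vector database; the model name, the example chunks, and the filter fields are assumptions made for illustration.

```python
import numpy as np
from sentence_transformers import SentenceTransformer  # one possible embedding library

# Illustrative corpus: each entry pairs a chunk of text with its metadata.
chunks = [
    {"text": "Termination for convenience requires 30 days written notice.",
     "meta": {"jurisdiction": "UK", "doc_type": "contract", "effective": "2025-01-01"}},
    {"text": "Refunds are processed within 14 days of a cancelled order.",
     "meta": {"jurisdiction": "UK", "doc_type": "policy", "effective": "2024-06-01"}},
]

model = SentenceTransformer("all-MiniLM-L6-v2")   # embeddings are bound to this model
vectors = model.encode([c["text"] for c in chunks], normalize_embeddings=True)

def search(query: str, top_k: int = 3, **filters) -> list[dict]:
    """Cosine-similarity search restricted to chunks whose metadata matches the filters."""
    query_vec = model.encode([query], normalize_embeddings=True)[0]
    scores = vectors @ query_vec                   # cosine similarity, vectors are normalised
    allowed = [i for i, c in enumerate(chunks)
               if all(c["meta"].get(k) == v for k, v in filters.items())]
    ranked = sorted(allowed, key=lambda i: scores[i], reverse=True)[:top_k]
    return [chunks[i] | {"score": float(scores[i])} for i in ranked]

print(search("What notice period applies to termination?", doc_type="contract"))
```

A real deployment would swap the NumPy search for a vector database, but the reusable part, the embed-then-filter-then-rank flow, stays the same across corpora.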
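Retrieval logic is mostly plain code wrapped around those search results. The sketch below blends keyword overlap with a semantic score and adds a recency boost; the weights, the one-year window, and the field names are tuning knobs invented for the example, not recommended values.

```python
from datetime import date

def hybrid_score(chunk: dict, query: str, semantic_score: float,
                 keyword_weight: float = 0.3, recency_boost: float = 0.1) -> float:
    """Blend keyword overlap with the semantic score and boost recent content."""
    query_terms = set(query.lower().split())
    chunk_terms = set(chunk["text"].lower().split())
    keyword_overlap = len(query_terms & chunk_terms) / max(len(query_terms), 1)

    score = (1 - keyword_weight) * semantic_score + keyword_weight * keyword_overlap

    # Boost content less than a year old, as a stand-in for "current version" tagging.
    effective = date.fromisoformat(chunk["meta"]["effective"])
    if (date.today() - effective).days < 365:
        score += recency_boost
    return score

def rerank(candidates: list[tuple[dict, float]], query: str, top_k: int = 5) -> list[dict]:
    """Re-order (chunk, semantic_score) pairs by blended score and keep the best few."""
    scored = sorted(candidates, key=lambda pair: hybrid_score(pair[0], query, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in scored[:top_k]]
```

A support assistant might boost chunks tagged with the customer's software version rather than recency, while a contracts assistant would tighten the filters and lower top_k, exactly the per-domain tuning described above.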
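On the generation side, much of the per-domain tailoring lives in the prompt template. The sketch below only assembles the prompt string: it numbers the retrieved chunks so the model can cite them and instructs it to refuse rather than guess when the sources fall short. The wording, the audience presets, and the metadata fields are illustrative and not tied to any particular model provider.

```python
def build_grounded_prompt(question: str, chunks: list[dict], audience: str = "finance") -> str:
    """Assemble a prompt that forces grounded, cited answers for a given audience."""
    # Assumes each chunk carries "source" and "effective" metadata for citation.
    sources = "\n".join(
        f"[{i + 1}] ({c['meta']['source']}, effective {c['meta']['effective']}) {c['text']}"
        for i, c in enumerate(chunks)
    )
    tone = {
        "finance": "Answer in short paragraphs and end with a one-line confidence note.",
        "developer": "Include code snippets and link to the relevant runbook where possible.",
    }.get(audience, "Answer clearly and concisely.")
    return (
        "Answer the question using ONLY the numbered sources below.\n"
        "Cite sources as [n] after each claim. If the sources do not contain the answer, "
        "say so instead of guessing.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}\n\n{tone}"
    )

example_chunks = [{"text": "Termination for convenience requires 30 days written notice.",
                   "meta": {"source": "msa.pdf", "effective": "2025-01-01"}}]
print(build_grounded_prompt("What notice period applies?", example_chunks))
```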
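Finally, a deliberately crude sketch of the evaluation loop, assuming a small hand-written test set and an answer_fn that stands in for the assistant under test. Real evaluation combines human review with stronger automated graders; this version only checks citation markers, grounding terms, and latency.

```python
import re
import time

# Hand-written test set: each case pairs a question with terms a faithful answer must contain.
test_cases = [
    {"question": "What notice period applies to termination for convenience?",
     "must_cite": ["30 days"]},
]

def evaluate(answer_fn) -> dict:
    """Run the test set through the assistant and record crude quality and latency signals."""
    results = []
    for case in test_cases:
        start = time.perf_counter()
        answer, retrieved_chunks = answer_fn(case["question"])
        latency = time.perf_counter() - start

        cited = set(re.findall(r"\[(\d+)\]", answer))  # citation markers like [1]
        citations_valid = cited <= {str(i + 1) for i in range(len(retrieved_chunks))}
        grounded = all(term.lower() in answer.lower() for term in case["must_cite"])

        results.append({"question": case["question"], "latency_s": round(latency, 3),
                        "citations_valid": citations_valid, "grounded": grounded})
    return {"cases": results,
            "grounded_rate": sum(r["grounded"] for r in results) / len(results)}

def toy_assistant(question: str):
    chunks = [{"text": "Termination for convenience requires 30 days written notice."}]
    return "Termination for convenience requires 30 days notice [1].", chunks

print(evaluate(toy_assistant))
```

The output of a loop like this is what feeds the quarterly reviews described above, pointing content owners at missing chunks and prompt owners at unclear instructions.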
Why data preparation is the leverage point

The most expensive mistakes start with messy input, which means investment here pays back across the stack. Scanned PDFs benefit from OCR with layout retention. Tables need structure extraction and cell-level references. Long reports work best with section-aware chunking and anchors. Transcripts improve with speaker labels, timestamps, and action-item flags. Images need captions or OCR. Rich metadata on ownership, dates, versions, jurisdictions, and system of record allows retrieval to filter precisely and generation to cite cleanly.

What you should expect your team to build

A reproducible pipeline from raw sources to evaluated answers is the deliverable, not just a clever prompt. That pipeline includes source connectors, cleaning and chunking jobs, embedding generation, vector indexing with thoughtful metadata, retrieval logic with tests, prompt templates with guardrails, serving infrastructure with observability, and an evaluation loop tied to business outcomes. It should be easy to rebuild indexes after a content update or model change, and safe to test retrieval or prompt adjustments without disrupting service.

A quick tour of RAG variations

There is a wide family of RAG approaches that build on the same foundations but adjust retrieval and reasoning for different needs. Standard RAG fetches relevant chunks and merges them into answers. Conversational RAG keeps dialogue history to stay coherent across turns. Corrective RAG revisits earlier outputs and runs a refined second pass to fix errors. Hybrid RAG blends keyword and semantic search for broader coverage. Speculative RAG predicts likely follow-up questions and preloads context to respond faster. You will also encounter memory-augmented, fusion, context-aware, agentic, reinforcement-trained, self-reflective, sparse, adaptive, citation-aware, retrieval-feedback, multimodal, multi-hop, reasoning-heavy, long-context, federated, hierarchical, context-ranking, prompt-augmented, few-shot, and chain-of-retrieval variants.

--

This summary of RAG variations is adapted, with credit, from Brij Kishore Pandey’s post on LinkedIn. Read it for a clearer overview of the different types and architectures.

This article was created by people. We have used artificial intelligence (AI) to help articulate our message and refine the text. AI was employed as a tool to assist with structuring, identifying grammatical and spelling errors, and improving readability. The final document has been carefully reviewed and approved by our team.

Interested in working together?

If you're considering AI, data, or cloud modernisation, we can help you clarify what is feasible, what is safe, and what will create measurable value.

Get in touch