RAG in Production: Building Retrieval-Augmented Generation for Enterprise
Large language models are remarkably capable, but they hallucinate, their knowledge is frozen at training time, and they cannot access your proprietary data. Retrieval-Augmented Generation (RAG) addresses all three problems by grounding LLM responses in retrieved documents. The concept is straightforward — feed relevant context into the prompt — but building a production-grade RAG system is an engineering challenge that catches many teams off guard.
This article distils hard-won lessons from enterprise RAG deployments, with specific attention to patterns that matter in the Dutch market: multilingual retrieval (Dutch/English), GDPR-compliant data pipelines, and integration with European cloud providers.
The RAG Architecture Stack
A production RAG pipeline consists of five stages:
1. Ingestion — documents are parsed, cleaned, and chunked
2. Embedding — chunks are converted to vector representations
3. Indexing — vectors are stored in a vector database
4. Retrieval — user queries are matched against stored vectors
5. Generation — retrieved context is passed to an LLM for answer synthesis
Each stage has its own set of engineering decisions.
Stage 1: Ingestion and Chunking
Chunking is where most RAG pipelines succeed or fail. The goal is to create self-contained, semantically meaningful units of text that can be retrieved independently.
Chunking Strategies
| Strategy | Best For | Typical Size |
|----------|----------|--------------|
| Fixed-size | Simple, uniform documents | 256-512 tokens |
| Recursive/semantic | Long-form documents with structure | 512-1024 tokens |
| Document-aware | PDFs, HTML with headers | Section-level |
| Sentence-window | Conversational, Q&A-style retrieval | 3-5 sentences |
| Parent-child | Legal, regulatory documents | Paragraph + section context |
The Dutch-language nuance: Dutch compound words (e.g., *arbeidsongeschiktheidsverzekering*) and longer average sentence length mean that token-based chunk sizes should be 10-15% larger than English defaults to avoid splitting semantic units mid-thought.
Practical Tips
- Preserve metadata: Store document title, section heading, page number, and source URL alongside each chunk. This enables citation and filtering at query time.
- Overlap chunks by 10-15%: Ensures context is not lost at boundaries.
- Handle tables separately: Extract tables as structured data and embed them with descriptive captions rather than raw cell text.
- Version your chunks: When source documents update, you need to re-chunk and re-embed — track which chunks correspond to which document versions.
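The tips above can be sketched in code. This is a minimal illustration, not a production chunker: it uses whitespace-separated words as a stand-in for tokens (a real pipeline would use the embedding model's tokenizer), and the `source` and `start_word` fields show where citation and versioning metadata attach.

```python
def chunk_text(text, source, chunk_size=300, overlap=45):
    """Split text into overlapping word-based chunks with metadata.

    Words stand in for tokens here; swap in the embedding model's
    tokenizer for production use. Overlap is ~15% of chunk size so
    context is not lost at chunk boundaries.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if not window:
            break
        chunks.append({
            "text": " ".join(window),
            "source": source,     # enables citation and filtering at query time
            "start_word": start,  # position marker, useful when re-chunking new versions
        })
        if start + chunk_size >= len(words):
            break  # the final window already covers the document tail
    return chunks
```

For Dutch corpora, raise `chunk_size` by the 10-15% margin discussed above.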
Stage 2: Embedding Models
The embedding model determines how well your retrieval captures semantic meaning.
Model Options (2026 Landscape)
| Model | Dimensions | Multilingual | Notes |
|-------|------------|--------------|-------|
| OpenAI text-embedding-3-large | 3072 | Yes | Strong Dutch performance, hosted API |
| Cohere embed-v4 | 1024 | Yes | Good for search, supports compression |
| E5-mistral-7b-instruct | 4096 | Yes | Open-source, self-hostable |
| multilingual-e5-large | 1024 | Yes | Excellent for Dutch/English mixed corpora |
| BGE-M3 | 1024 | Yes | Multi-granularity, supports sparse+dense |
For Dutch enterprises handling sensitive data (healthcare, finance, government), self-hosted models like E5-mistral or BGE-M3 are attractive because data never leaves your infrastructure — a key GDPR consideration.
Embedding Best Practices
- Normalise text before embedding: consistent casing, whitespace, and encoding
- Embed queries differently from documents: Many models support instruction-prefixed embeddings (e.g., "Retrieve relevant documents for: [query]") — use them
- Benchmark on your own data: Public benchmarks (MTEB, BEIR) are useful baselines, but your domain vocabulary matters more than generic performance
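A small sketch of the first two practices, assuming an E5-family model whose convention is `query: ` and `passage: ` prefixes — the exact prefix strings are model-specific, so check the model card before reusing them:

```python
import unicodedata

def normalise(text):
    """Consistent Unicode form, casing, and whitespace before embedding."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())

def prepare_for_embedding(text, kind):
    """Apply the asymmetric query/passage prefix used by E5-family models.

    Prefixes are an assumption based on the E5 convention; other models
    use different instruction formats or none at all.
    """
    prefix = {"query": "query: ", "passage": "passage: "}[kind]
    return prefix + normalise(text)
```

Embedding queries and documents through the same `normalise` path prevents casing or whitespace differences from degrading similarity scores.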
Stage 3: Vector Databases
Your choice of vector database affects latency, scalability, and operational complexity.
Options Compared
| Database | Type | Managed | EU Hosting | Hybrid Search |
|----------|------|---------|------------|---------------|
| Weaviate | Purpose-built | Yes | Yes (Amsterdam HQ) | Yes |
| Qdrant | Purpose-built | Yes | Yes (EU regions) | Yes |
| Pinecone | Purpose-built | Yes | Yes (EU region) | Yes |
| pgvector | PostgreSQL extension | Via managed PG | Yes | Via SQL |
| Milvus | Purpose-built | Yes (Zilliz) | Yes | Yes |
Dutch-market note: Weaviate was founded in Amsterdam and is popular among Dutch enterprises for its hybrid search capabilities and local support. If you already run PostgreSQL, pgvector offers a lower barrier to entry with trade-offs on scale.
Key Design Decisions
- Namespace isolation: Separate vector spaces per tenant, department, or data classification level
- Hybrid search: Combine dense vectors with sparse (BM25/keyword) search — this consistently outperforms either approach alone, especially for Dutch technical terminology
- Filtering: Use metadata filters to scope searches to specific document types, date ranges, or access levels
- Index tuning: HNSW parameters (ef_construction, M) trade accuracy for speed — benchmark with your real data
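One common way to combine dense and sparse results is reciprocal rank fusion (RRF), which several of the databases above use internally for hybrid search. RRF needs only the rank positions from each retriever, not comparable scores, which is what makes it practical for merging BM25 and vector results. A minimal sketch:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists (e.g., dense + BM25) with RRF.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in; k=60 is the commonly used default damping constant.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that rank well in both lists float to the top, which is exactly the behaviour you want for Dutch technical terminology where keyword match and semantic match each catch cases the other misses.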
Stage 4: Retrieval and Re-Ranking
Raw vector similarity is a starting point, not the finish line.
The Two-Stage Retrieval Pattern
1. Broad retrieval: Fetch top-50 candidates using vector search (fast, approximate)
2. Re-ranking: Score candidates using a cross-encoder model (slower, more accurate) and return top-5
Cross-encoder re-rankers like Cohere Rerank, Jina Reranker, or the open-source BGE-reranker-v2-m3 dramatically improve answer quality — often the single highest-ROI change you can make to a RAG pipeline.
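The two-stage pattern reduces to a few lines. In this sketch the cosine-similarity stage stands in for an approximate vector search, and `rerank_fn` is a placeholder for a real cross-encoder call (Cohere Rerank, BGE-reranker-v2-m3, etc.) — both are assumptions for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def two_stage_retrieve(query_vec, index, rerank_fn, broad_k=50, final_k=5):
    """Stage 1: cheap vector search over `index` (doc_id -> vector).
    Stage 2: expensive re-rank of the survivors via `rerank_fn(doc_id)`,
    a stand-in for a cross-encoder scoring the (query, document) pair."""
    candidates = sorted(index, key=lambda d: cosine(query_vec, index[d]),
                        reverse=True)[:broad_k]
    return sorted(candidates, key=rerank_fn, reverse=True)[:final_k]
```

The key property: the cross-encoder only ever sees `broad_k` documents, so its per-query cost stays bounded regardless of corpus size.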
Advanced Retrieval Techniques
- Query decomposition: Break complex questions into sub-queries, retrieve for each, and merge results
- HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, embed it, and use that for retrieval — works well for abstract queries
- Self-query: Let the LLM extract metadata filters from the user's natural-language query before retrieval
- Contextual retrieval: Prepend document-level context to each chunk before embedding (as described in [Anthropic's research](https://www.anthropic.com/news/contextual-retrieval)) — reduces retrieval failures by 49%
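For contextual retrieval specifically, the mechanics are simple even though the context itself comes from an LLM in Anthropic's method. The sketch below only shows where the context string attaches; `make_context` is a placeholder for that LLM call:

```python
def contextualise_chunks(doc_title, chunks, make_context):
    """Prepend a chunk-specific context string before embedding.

    In Anthropic's contextual retrieval, `make_context` is an LLM
    prompted with the full document and the chunk; here it is an
    injected function so the wiring can be shown without an API call.
    """
    return [f"{make_context(doc_title, chunk)}\n\n{chunk}" for chunk in chunks]
```

The embedded text then carries enough surrounding context ("this chunk is from the 2024 pension policy, section on early retirement…") to be retrievable on its own.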
Stage 5: Generation
With relevant context retrieved, the generation step is about crafting the right prompt and handling edge cases.
Key Considerations
- Set a context window budget: Reserve tokens for retrieved context, system prompt, and expected answer length. With modern models supporting 128K+ tokens, the temptation is to stuff everything in — resist this. More context does not always mean better answers.
- Handle "I don't know": The system should gracefully admit when it lacks sufficient information rather than hallucinating.
- Language detection: In Dutch corporate environments, users switch between Dutch and English mid-conversation. Detect language and respond accordingly.
- Citation: Include source references in generated answers so users can verify claims.
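The context-budget and citation points can be combined in the prompt-assembly step. This sketch uses a rough chars-per-4 token estimate as an assumption — use the target model's tokenizer in production — and expects `chunks` to arrive pre-ranked from the retriever:

```python
def build_prompt(system, question, chunks, budget=8000, reserve_answer=1000,
                 est_tokens=lambda s: len(s) // 4):
    """Pack retrieved chunks into a token budget, most relevant first.

    `est_tokens` is a crude chars/4 heuristic (an assumption for this
    sketch); `reserve_answer` keeps headroom for the model's reply.
    Chunks beyond the budget are dropped rather than truncated.
    """
    used = est_tokens(system) + est_tokens(question) + reserve_answer
    context = []
    for i, chunk in enumerate(chunks):  # assumed already ranked by relevance
        cost = est_tokens(chunk["text"])
        if used + cost > budget:
            break
        used += cost
        # Numbered entries with sources let the model cite verifiably.
        context.append(f"[{i + 1}] {chunk['text']} (source: {chunk['source']})")
    return f"{system}\n\nContext:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
```

Dropping whole low-ranked chunks, rather than stuffing the window, is the practical form of "more context does not always mean better answers".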
Evaluation: Measuring RAG Quality
You cannot improve what you do not measure. RAG evaluation requires assessing both retrieval and generation quality.
Retrieval Metrics
- Recall@k: What percentage of relevant documents appear in the top-k results?
- MRR (Mean Reciprocal Rank): How high does the first relevant result rank?
- Precision@k: What percentage of top-k results are actually relevant?
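These retrieval metrics are simple enough to implement directly against your golden dataset — a small sketch, assuming document ids are hashable and each query's relevant set is known:

```python
def recall_at_k(relevant, retrieved, k):
    """Fraction of the relevant set that appears in the top-k results."""
    return len(set(relevant) & set(retrieved[:k])) / len(relevant)

def mrr(relevant_sets, ranked_lists):
    """Mean reciprocal rank: average of 1/rank of the first relevant
    hit per query (0 for queries with no relevant hit retrieved)."""
    total = 0.0
    for rel, results in zip(relevant_sets, ranked_lists):
        for rank, doc in enumerate(results, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```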
Generation Metrics
- Faithfulness: Does the answer stick to the retrieved context? (Use LLM-as-judge or [RAGAS](https://docs.ragas.io/) framework)
- Relevance: Does the answer address the question?
- Completeness: Does the answer cover all aspects of the question?
Practical Evaluation
Build a golden dataset of 100-200 question-answer pairs with annotated source documents. Run automated evaluations on every pipeline change. RAGAS and DeepEval are open-source frameworks that automate this.
Common Pitfalls
1. Skipping re-ranking: Vector similarity alone is not enough — re-ranking consistently improves quality
2. Ignoring chunk quality: Garbage in, garbage out. Invest time in your chunking strategy
3. Not versioning embeddings: When you change your embedding model, all stored vectors become incompatible
4. Treating RAG as set-and-forget: Documents change, models improve, and user needs evolve — build for continuous iteration
5. Over-retrieving: Flooding the context window with marginally relevant chunks degrades answer quality
6. Neglecting access control: In enterprise settings, not every user should see every document — implement permission-aware retrieval
Cost Optimisation
RAG costs come from three sources: embedding computation, vector storage, and LLM inference.
- Cache embeddings: Never re-embed unchanged documents
- Compress vectors: Quantisation (e.g., Weaviate's PQ, Qdrant's scalar quantisation) can reduce storage by 4-8x with minimal quality loss
- Use tiered models: Fast, cheap models for initial retrieval; powerful models for final generation
- Batch embed: Process documents in batches during off-peak hours
Getting Started
For Dutch enterprises beginning their RAG journey:
1. Start with a narrow scope — one document collection, one use case
2. Use a managed vector database to avoid operational overhead
3. Implement re-ranking from day one
4. Build evaluation into your pipeline, not as an afterthought
5. Plan for multilingual from the start — retrofitting Dutch support is harder than building it in
Explore our automation and DevOps services for help building production RAG systems, or read our articles on data engineering trends and data management.
