RAG in Production: Building Retrieval-Augmented Generation for Enterprise
Large language models are remarkably capable, but they hallucinate, their knowledge is frozen at training time, and they cannot access your proprietary data. Retrieval-Augmented Generation (RAG) addresses all three problems by grounding LLM responses in retrieved documents. The concept is straightforward — feed relevant context into the prompt — but building a production-grade RAG system is an engineering challenge that catches many teams off guard.
This article distils hard-won lessons from enterprise RAG deployments, with specific attention to patterns that matter in the Dutch market: multilingual retrieval (Dutch/English), GDPR-compliant data pipelines, and integration with European cloud providers.
The RAG Architecture Stack
A production RAG pipeline consists of five stages:
1. Ingestion — documents are parsed, cleaned, and chunked
2. Embedding — chunks are converted to vector representations
3. Indexing — vectors are stored in a vector database
4. Retrieval — user queries are matched against stored vectors
5. Generation — retrieved context is passed to an LLM for answer synthesis
Each stage has its own set of engineering decisions.
Stage 1: Ingestion and Chunking
Chunking is where most RAG pipelines succeed or fail. The goal is to create self-contained, semantically meaningful units of text that can be retrieved independently.
Chunking Strategies
| Strategy | Best For | Typical Size |
|----------|----------|--------------|
| Fixed-size | Simple, uniform documents | 256-512 tokens |
| Recursive/semantic | Long-form documents with structure | 512-1024 tokens |
| Document-aware | PDFs, HTML with headers | Section-level |
| Sentence-window | Conversational, Q&A-style retrieval | 3-5 sentences |
| Parent-child | Legal, regulatory documents | Paragraph + section context |
The Dutch-language nuance: Dutch compound words (e.g., *arbeidsongeschiktheidsverzekering*) and longer average sentence length mean that token-based chunk sizes should be 10-15% larger than English defaults to avoid splitting semantic units mid-thought.
Practical Tips
- Preserve metadata: Store document title, section heading, page number, and source URL alongside each chunk. This enables citation and filtering at query time.
- Overlap chunks by 10-15%: Ensures context is not lost at boundaries.
- Handle tables separately: Extract tables as structured data and embed them with descriptive captions rather than raw cell text.
- Version your chunks: When source documents update, you need to re-chunk and re-embed — track which chunks correspond to which document versions.
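The tips above can be sketched in code. This is a minimal illustration, not a production chunker: it uses whitespace-separated words as a stand-in for tokens (a real pipeline would use the embedding model's tokenizer), and the `source` and `start_word` fields show where citation and versioning metadata attach.

```python
def chunk_text(text, source, chunk_size=300, overlap=45):
    """Split text into overlapping word-based chunks with metadata.

    Words stand in for tokens here; swap in the embedding model's
    tokenizer for production use. Overlap is ~15% of chunk size so
    context is not lost at chunk boundaries.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_size]
        if not window:
            break
        chunks.append({
            "text": " ".join(window),
            "source": source,     # enables citation and filtering at query time
            "start_word": start,  # position marker, useful when re-chunking new versions
        })
        if start + chunk_size >= len(words):
            break  # the final window already covers the document tail
    return chunks
```

For Dutch corpora, raise `chunk_size` by the 10-15% margin discussed above.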
Stage 2: Embedding Models
The embedding model determines how well your retrieval captures semantic meaning.
Model Options (2026 Landscape)
| Model | Dimensions | Multilingual | Notes |
|-------|------------|--------------|-------|
| OpenAI text-embedding-3-large | 3072 | Yes | Strong Dutch performance, hosted API |
| Cohere embed-v4 | 1024 | Yes | Good for search, supports compression |
| E5-mistral-7b-instruct | 4096 | Yes | Open-source, self-hostable |
| multilingual-e5-large | 1024 | Yes | Excellent for Dutch/English mixed corpora |
| BGE-M3 | 1024 | Yes | Multi-granularity, supports sparse+dense |
For Dutch enterprises handling sensitive data (healthcare, finance, government), self-hosted models like E5-mistral or BGE-M3 are attractive because data never leaves your infrastructure — a key GDPR consideration.
Embedding Best Practices
- Normalise text before embedding: consistent casing, whitespace, and encoding
- Embed queries differently from documents: Many models support instruction-prefixed embeddings (e.g., "Retrieve relevant documents for: [query]") — use them
- Benchmark on your own data: Public benchmarks (MTEB, BEIR) are useful baselines, but your domain vocabulary matters more than generic performance
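A small sketch of the first two practices, assuming an E5-family model whose convention is `query: ` and `passage: ` prefixes — the exact prefix strings are model-specific, so check the model card before reusing them:

```python
import unicodedata

def normalise(text):
    """Consistent Unicode form, casing, and whitespace before embedding."""
    text = unicodedata.normalize("NFC", text)
    return " ".join(text.lower().split())

def prepare_for_embedding(text, kind):
    """Apply the asymmetric query/passage prefix used by E5-family models.

    Prefixes are an assumption based on the E5 convention; other models
    use different instruction formats or none at all.
    """
    prefix = {"query": "query: ", "passage": "passage: "}[kind]
    return prefix + normalise(text)
```

Embedding queries and documents through the same `normalise` path prevents casing or whitespace differences from degrading similarity scores.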
Stage 3: Vector Databases
Your choice of vector database affects latency, scalability, and operational complexity.
Options Compared
| Database | Type | Managed | EU Hosting | Hybrid Search |
|----------|------|---------|------------|---------------|
| Weaviate | Purpose-built | Yes | Yes (Amsterdam HQ) | Yes |
| Qdrant | Purpose-built | Yes | Yes (EU regions) | Yes |
| Pinecone | Purpose-built | Yes | Yes (EU region) | Yes |
| pgvector | PostgreSQL extension | Via managed PG | Yes | Via SQL |
| Milvus | Purpose-built | Yes (Zilliz) | Yes | Yes |
Dutch-market note: Weaviate was founded in Amsterdam and is popular among Dutch enterprises for its hybrid search capabilities and local support. If you already run PostgreSQL, pgvector offers a lower barrier to entry with trade-offs on scale.
Key Design Decisions
- Namespace isolation: Separate vector spaces per tenant, department, or data classification level
- Hybrid search: Combine dense vectors with sparse (BM25/keyword) search — this consistently outperforms either approach alone, especially for Dutch technical terminology
- Filtering: Use metadata filters to scope searches to specific document types, date ranges, or access levels
- Index tuning: HNSW parameters (ef_construction, M) trade accuracy for speed — benchmark with your real data
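One common way to combine dense and sparse results is reciprocal rank fusion (RRF), which several of the databases above use internally for hybrid search. RRF needs only the rank positions from each retriever, not comparable scores, which is what makes it practical for merging BM25 and vector results. A minimal sketch:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked result lists (e.g., dense + BM25) with RRF.

    Each document scores sum(1 / (k + rank)) across the lists it
    appears in; k=60 is the commonly used default damping constant.
    """
    scores = {}
    for results in ranked_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that rank well in both lists float to the top, which is exactly the behaviour you want for Dutch technical terminology where keyword match and semantic match each catch cases the other misses.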
Stage 4: Retrieval and Re-Ranking
Raw vector similarity is a starting point, not the finish line.
The Two-Stage Retrieval Pattern
1. Broad retrieval: Fetch top-50 candidates using vector search (fast, approximate)
2. Re-ranking: Score candidates using a cross-encoder model (slower, more accurate) and return top-5
Cross-encoder re-rankers like Cohere Rerank, Jina Reranker, or the open-source BGE-reranker-v2-m3 dramatically improve answer quality — often the single highest-ROI change you can make to a RAG pipeline.
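The two-stage pattern reduces to a few lines. In this sketch the cosine-similarity stage stands in for an approximate vector search, and `rerank_fn` is a placeholder for a real cross-encoder call (Cohere Rerank, BGE-reranker-v2-m3, etc.) — both are assumptions for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def two_stage_retrieve(query_vec, index, rerank_fn, broad_k=50, final_k=5):
    """Stage 1: cheap vector search over `index` (doc_id -> vector).
    Stage 2: expensive re-rank of the survivors via `rerank_fn(doc_id)`,
    a stand-in for a cross-encoder scoring the (query, document) pair."""
    candidates = sorted(index, key=lambda d: cosine(query_vec, index[d]),
                        reverse=True)[:broad_k]
    return sorted(candidates, key=rerank_fn, reverse=True)[:final_k]
```

The key property: the cross-encoder only ever sees `broad_k` documents, so its per-query cost stays bounded regardless of corpus size.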
Advanced Retrieval Techniques
- Query decomposition: Break complex questions into sub-queries, retrieve for each, and merge results
- HyDE (Hypothetical Document Embeddings): Generate a hypothetical answer, embed it, and use that for retrieval — works well for abstract queries
- Self-query: Let the LLM extract metadata filters from the user's natural-language query before retrieval
- Contextual retrieval: Prepend document-level context to each chunk before embedding (as described in [Anthropic's research](https://www.anthropic.com/news/contextual-retrieval)) — reduces retrieval failures by 49%
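For contextual retrieval specifically, the mechanics are simple even though the context itself comes from an LLM in Anthropic's method. The sketch below only shows where the context string attaches; `make_context` is a placeholder for that LLM call:

```python
def contextualise_chunks(doc_title, chunks, make_context):
    """Prepend a chunk-specific context string before embedding.

    In Anthropic's contextual retrieval, `make_context` is an LLM
    prompted with the full document and the chunk; here it is an
    injected function so the wiring can be shown without an API call.
    """
    return [f"{make_context(doc_title, chunk)}\n\n{chunk}" for chunk in chunks]
```

The embedded text then carries enough surrounding context ("this chunk is from the 2024 pension policy, section on early retirement…") to be retrievable on its own.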
Stage 5: Generation
With relevant context retrieved, the generation step is about crafting the right prompt and handling edge cases.
Key Considerations
- Set a context window budget: Reserve tokens for retrieved context, system prompt, and expected answer length. With modern models supporting 128K+ tokens, the temptation is to stuff everything in — resist this. More context does not always mean better answers.
- Handle "I don't know": The system should gracefully admit when it lacks sufficient information rather than hallucinating.
- Language detection: In Dutch corporate environments, users switch between Dutch and English mid-conversation. Detect language and respond accordingly.
- Citation: Include source references in generated answers so users can verify claims.
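The context-budget and citation points can be combined in the prompt-assembly step. This sketch uses a rough chars-per-4 token estimate as an assumption — use the target model's tokenizer in production — and expects `chunks` to arrive pre-ranked from the retriever:

```python
def build_prompt(system, question, chunks, budget=8000, reserve_answer=1000,
                 est_tokens=lambda s: len(s) // 4):
    """Pack retrieved chunks into a token budget, most relevant first.

    `est_tokens` is a crude chars/4 heuristic (an assumption for this
    sketch); `reserve_answer` keeps headroom for the model's reply.
    Chunks beyond the budget are dropped rather than truncated.
    """
    used = est_tokens(system) + est_tokens(question) + reserve_answer
    context = []
    for i, chunk in enumerate(chunks):  # assumed already ranked by relevance
        cost = est_tokens(chunk["text"])
        if used + cost > budget:
            break
        used += cost
        # Numbered entries with sources let the model cite verifiably.
        context.append(f"[{i + 1}] {chunk['text']} (source: {chunk['source']})")
    return f"{system}\n\nContext:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
```

Dropping whole low-ranked chunks, rather than stuffing the window, is the practical form of "more context does not always mean better answers".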
Evaluation: Measuring RAG Quality
You cannot improve what you do not measure. RAG evaluation requires assessing both retrieval and generation quality.
Retrieval Metrics
- Recall@k: What percentage of relevant documents appear in the top-k results?
- MRR (Mean Reciprocal Rank): How high does the first relevant result rank?
- Precision@k: What percentage of top-k results are actually relevant?
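These retrieval metrics are simple enough to implement directly against your golden dataset — a small sketch, assuming document ids are hashable and each query's relevant set is known:

```python
def recall_at_k(relevant, retrieved, k):
    """Fraction of the relevant set that appears in the top-k results."""
    return len(set(relevant) & set(retrieved[:k])) / len(relevant)

def mrr(relevant_sets, ranked_lists):
    """Mean reciprocal rank: average of 1/rank of the first relevant
    hit per query (0 for queries with no relevant hit retrieved)."""
    total = 0.0
    for rel, results in zip(relevant_sets, ranked_lists):
        for rank, doc in enumerate(results, start=1):
            if doc in rel:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```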
Generation Metrics
- Faithfulness: Does the answer stick to the retrieved context? (Use LLM-as-judge or [RAGAS](https://docs.ragas.io/) framework)
- Relevance: Does the answer address the question?
- Completeness: Does the answer cover all aspects of the question?
Practical Evaluation
Build a golden dataset of 100-200 question-answer pairs with annotated source documents. Run automated evaluations on every pipeline change. RAGAS and DeepEval are open-source frameworks that automate this.
Common Pitfalls
1. Skipping re-ranking: Vector similarity alone is not enough — re-ranking consistently improves quality
2. Ignoring chunk quality: Garbage in, garbage out. Invest time in your chunking strategy
3. Not versioning embeddings: When you change your embedding model, all stored vectors become incompatible
4. Treating RAG as set-and-forget: Documents change, models improve, and user needs evolve — build for continuous iteration
5. Over-retrieving: Flooding the context window with marginally relevant chunks degrades answer quality
6. Neglecting access control: In enterprise settings, not every user should see every document — implement permission-aware retrieval
Cost Optimisation
RAG costs come from three sources: embedding computation, vector storage, and LLM inference.
- Cache embeddings: Never re-embed unchanged documents
- Compress vectors: Quantisation (e.g., Weaviate's PQ, Qdrant's scalar quantisation) can reduce storage by 4-8x with minimal quality loss
- Use tiered models: Fast, cheap models for initial retrieval; powerful models for final generation
- Batch embed: Process documents in batches during off-peak hours
Getting Started
For Dutch enterprises beginning their RAG journey:
1. Start with a narrow scope — one document collection, one use case
2. Use a managed vector database to avoid operational overhead
3. Implement re-ranking from day one
4. Build evaluation into your pipeline, not as an afterthought
5. Plan for multilingual from the start — retrofitting Dutch support is harder than building it in
Explore our automation and DevOps services for help building production RAG systems, or read our articles on data engineering trends and data management.
