Production-Ready AI: Building a RAG System with Credibility
In enterprise AI, a RAG system's credibility hinges on verifiable source citations. Building a trustworthy pipeline requires integrating strict source tracking and hallucination checks across five key stages—ingestion, retrieval, generation, verification, and presentation—to maintain a transparent audit trail, mitigate errors, and ensure compliance.


In the rush to deploy AI systems, one critical element often gets overlooked: credibility. When your RAG (Retrieval-Augmented Generation) system makes a claim, can users verify it? Can they trace the information back to its source? In production environments, especially in finance, legal, healthcare, or enterprise knowledge management, the ability to cite sources isn't just a nice-to-have feature; it's essential for building trust and meeting compliance requirements.
This post explores how we at Pliant have built a RAG system that doesn’t just retrieve relevant information, but maintains a transparent audit trail from query to answer, complete with verifiable citations.
Why Source Citation Matters
Before diving into implementation, let’s understand why citation capability is crucial for production RAG systems:
Trust and Verification: Users need to verify claims, especially in high-stakes domains. A B2B finance RAG system should link back to transaction histories or financial statements. A medical RAG system citing treatment protocols might link back to clinical guidelines. A legal research tool must reference specific case law.
Accountability: When your system makes an error, you need to trace whether the issue originated from retrieval, the source documents themselves, or the generation process. Citations enable this forensic analysis.
Compliance: Many industries require audit trails. Financial services, healthcare (HIPAA), and legal sectors often mandate that you can demonstrate how decisions were made and what information was used.
Hallucination Detection: By comparing generated text against cited sources, you can implement automated checks to detect when the model extrapolates beyond retrieved context.
Architecture Overview
A production-ready RAG system with citation capabilities requires several key components working in sequence:
Document Ingestion with Metadata Preservation: Divide source document into chunks and save them while preserving document metadata and chunk boundaries. Each chunk maintains its traceability through deterministic IDs and appended source metadata.
Retrieval with Source Tracking: Fetches relevant chunks with their source metadata intact using vector similarity search. Retrieved chunks are labeled with citation markers to enable downstream referencing.
Generation with In-line Citations: Produces answers while maintaining references to which chunks informed each part of the response. The LLM is explicitly prompted to cite sources using inline citation markers.
Citation Verification: Validates that each generated claim is actually supported by the cited sources. Serves as an additional guardrail to prevent hallucination.
Citation Presentation: Formats verified citations for end users in a readable and interactive format. This includes converting inline markers to human-readable references, generating source lists with metadata, and adapting citation styles to domain-specific conventions.
1. Document Processing with Metadata Preservation
The foundation of citability starts at ingestion. When processing documents, you need to preserve not just content but also precise source information.
```python
from dataclasses import dataclass
from typing import List, Dict, Any
import hashlib

@dataclass
class DocumentChunk:
    chunk_id: str
    content: str
    source_document: str
    page_number: int
    chunk_index: int
    document_metadata: Dict[str, Any]
    embedding: List[float]

class DocumentProcessor:
    def __init__(self, chunk_size: int = 512, overlap: int = 50):
        self.chunk_size = chunk_size
        self.overlap = overlap

    def process_document(self, document_path: str,
                         metadata: Dict[str, Any]) -> List[DocumentChunk]:
        """
        Process a document into chunks while preserving source information.
        Metadata might include: author, date, document_type, version, etc.
        """
        chunks = []
        content = self._extract_text(document_path)

        for idx, chunk_text in enumerate(self._chunk_text(content)):
            chunk = DocumentChunk(
                # Deterministic ID: the same document and chunk index always
                # produce the same chunk_id, so references stay stable.
                chunk_id=hashlib.sha256(
                    f"{document_path}:{idx}".encode()
                ).hexdigest(),
                content=chunk_text,
                source_document=document_path,
                page_number=self._get_page_for_chunk(document_path, idx),
                chunk_index=idx,
                document_metadata=metadata,
                embedding=self._generate_embedding(chunk_text)
            )
            chunks.append(chunk)

        return chunks
```

The key insight here is that each chunk maintains its traceability. The chunk_id is deterministic (derived from the document path and chunk index rather than a random UUID), allowing you to trace back through your system. The metadata dictionary can store whatever context is relevant for your domain: publication dates, authors, document types, internal identifiers, or regulatory classifications.
2. Retrieval with Source Tracking
When retrieving relevant chunks, you need to maintain the association between content and source throughout the pipeline.
```python
from typing import List, Tuple

class CitableRetriever:
    def __init__(self, vector_store, top_k: int = 5):
        self.vector_store = vector_store
        self.top_k = top_k

    def retrieve_with_sources(self, query: str) -> List[Tuple[DocumentChunk, float]]:
        """
        Retrieve chunks with their relevance scores.
        Returns a list of (chunk, score) tuples.
        """
        query_embedding = self._generate_embedding(query)

        # Retrieve from vector store
        results = self.vector_store.similarity_search(
            query_embedding,
            k=self.top_k
        )

        # Each result includes the full DocumentChunk with metadata
        return [(chunk, score) for chunk, score in results]
```

3. Generation with In-line Citations
Now comes the critical step: prompting the LLM to generate responses that include citations.
```python
class CitableGenerator:
    """
    Generates answers with inline citations from retrieved sources.
    """

    def __init__(self, llm_client):
        self.llm_client = llm_client

    def _format_context(self,
                        retrieved_chunks: List[Tuple['DocumentChunk', float]]) -> str:
        """
        Format retrieved chunks with citation markers for the LLM.
        """
        context_parts = []

        for idx, (chunk, score) in enumerate(retrieved_chunks, 1):
            context_parts.append(
                f"[Source {idx}]\n"
                f"Document: {chunk.source_document}\n"
                f"Content: {chunk.content}\n"
            )

        return "\n\n".join(context_parts)

    def generate_answer(self,
                        query: str,
                        retrieved_chunks: List[Tuple['DocumentChunk', float]]) -> Dict[str, Any]:
        """
        Generate an answer with inline citations.
        """
        context = self._format_context(retrieved_chunks)

        prompt = f"""You are a helpful assistant that answers questions based on provided sources.
You must cite your sources using the format [Source N] immediately after each claim.

Context:
{context}

Question: {query}

Instructions:
1. Answer the question based ONLY on the provided sources
2. After each factual claim, include [Source N] to cite which source supports it
3. If multiple sources support a claim, cite all relevant sources like [Source 1, Source 3]
4. If the sources don't contain enough information, say so explicitly
5. Do not make claims without citations

Answer:"""

        response = self.llm_client.generate(prompt)

        return {
            "answer": response,
            "retrieved_chunks": retrieved_chunks
        }
```

Having _format_context explicitly label each chunk with a citation marker in the prompt makes it easy for the LLM to reference specific sources in its response.
The prompt engineering here is deliberate as well. By explicitly instructing the model to cite sources and providing a clear citation format, you increase the likelihood of getting properly attributed responses.
4. Citation Verification
Once you have a generated response with citations, you need to verify that the claims are actually supported by the cited sources. This step is crucial for catching hallucinations and ensuring credibility. For the sake of a simple example, we compare each claim in the response to the cited source text via word overlap: if the overlap falls below a chosen threshold, the LLM likely 'improvised' too much.
```python
import re
from typing import List, Dict, Any

class CitationVerifier:
    # Tolerates "[Source1]", "[Source 1]" and "[Source 1, Source 3]"
    CITATION_PATTERN = r'\[Source\s*(\d+(?:,\s*(?:Source\s*)?\d+)*)\]'

    def verify_citations(self, answer: str,
                         retrieved_chunks: List[DocumentChunk]) -> Dict[str, Any]:
        """
        Verify that claims in the answer are supported by cited sources.
        """
        # Split answer into sentences and verify each with citations
        sentences = re.split(r'[.!?]+', answer)
        verification_results = []

        for sentence in sentences:
            cited_ids = self._extract_source_ids(sentence)
            if cited_ids:
                # Guard against the LLM citing a source number that was never provided
                cited_chunks = [retrieved_chunks[i - 1] for i in cited_ids
                                if 0 < i <= len(retrieved_chunks)]
                verification_results.append({
                    "sentence": sentence,
                    "cited_sources": cited_ids,
                    "supported": self._check_support(sentence, cited_chunks)
                })

        return {
            "verification": verification_results,
            "all_supported": all(v["supported"] for v in verification_results)
        }

    def _extract_source_ids(self, text: str) -> List[int]:
        """Extract source IDs from citation markers in text."""
        matches = re.findall(self.CITATION_PATTERN, text)
        source_ids = []
        for match in matches:
            for id_str in match.split(','):
                id_str = id_str.strip()
                if id_str.lower().startswith('source'):
                    id_str = id_str[len('source'):].strip()
                source_ids.append(int(id_str))
        return source_ids

    def _check_support(self, claim: str, chunks: List[DocumentChunk]) -> bool:
        """
        Check if a claim is supported by the provided chunks.
        Uses a simple word-overlap ratio (>60% match).
        This is a simplified example.
        Consider using NLI models or LLM-as-judge for more advanced verification.
        """
        # Remove citation markers and normalize
        clean_claim = re.sub(self.CITATION_PATTERN, '', claim).lower()
        claim_words = set(clean_claim.split())

        if not claim_words:
            return False

        # Check if any chunk has sufficient word overlap
        for chunk in chunks:
            chunk_words = set(chunk.content.lower().split())
            overlap = len(claim_words & chunk_words) / len(claim_words)
            if overlap > 0.6:
                return True

        return False
```

While the deliberately simplified implementation above checks for content overlap to demonstrate the idea, more sophisticated systems can use NLI models or an LLM-as-judge approach to verify that the source actually supports the claim made in the answer.
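As a minimal sketch of the LLM-as-judge variant: the `llm_client.generate(prompt) -> str` interface mirrors the one assumed elsewhere in this post, and the judge prompt and verdict keywords are illustrative assumptions, not a fixed API.

```python
from typing import List

def llm_judge_support(llm_client, claim: str, sources: List[str]) -> bool:
    """Ask an LLM whether the cited sources entail the claim.

    `llm_client` is any object with a `generate(prompt) -> str` method.
    """
    numbered = "\n\n".join(f"[Source {i}]\n{s}" for i, s in enumerate(sources, 1))
    prompt = (
        "Do the sources below fully support the claim? "
        "Answer with exactly SUPPORTED or NOT_SUPPORTED.\n\n"
        f"Claim: {claim}\n\nSources:\n{numbered}\n\nVerdict:"
    )
    verdict = llm_client.generate(prompt).strip().upper()
    return verdict.startswith("SUPPORTED")
```

This slots into `_check_support` as a drop-in replacement for the word-overlap heuristic; it trades an extra LLM call per claim for much better handling of paraphrase.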
5. Citation Presentation
After verification, you need to format citations in a way that's readable and useful for end users. This includes converting inline markers to appropriate formats and providing a structured source list.
```python
import re
from typing import List, Tuple

class CitationPresenter:
    def format_for_display(self,
                           answer: str,
                           retrieved_chunks: List[Tuple['DocumentChunk', float]]) -> str:
        """
        Format answer with citations.
        """
        # Tolerates "[Source1]", "[Source 1]" and "[Source 1, Source 3]"
        citation_pattern = r'\[Source\s*(\d+(?:,\s*(?:Source\s*)?\d+)*)\]'

        # Replace citation markers (e.g. [Source 1]) with superscripts (e.g. ¹) for UX.
        def replace_citation(match):
            source_ids = re.findall(r'\d+', match.group(1))
            return ''.join(self._to_superscript(sid) for sid in source_ids)

        formatted_answer = re.sub(citation_pattern, replace_citation, answer)

        # Append source list at the end of the response
        source_list = "\n\n---\n\n**Sources:**\n\n"
        for idx, (chunk, score) in enumerate(retrieved_chunks, 1):
            document_name = chunk.source_document.split('/')[-1]

            source_entry = f"{idx}. **{document_name}**"
            if chunk.page_number:
                source_entry += f" (Page {chunk.page_number})"
            if 'author' in chunk.document_metadata:
                source_entry += f" - {chunk.document_metadata['author']}"
            if 'date' in chunk.document_metadata:
                source_entry += f", {chunk.document_metadata['date']}"

            source_list += source_entry + "\n"

        return formatted_answer + source_list

    def _to_superscript(self, num: str) -> str:
        superscript_map = {
            '0': '⁰', '1': '¹', '2': '²', '3': '³', '4': '⁴',
            '5': '⁵', '6': '⁶', '7': '⁷', '8': '⁸', '9': '⁹'
        }
        return ''.join(superscript_map.get(d, d) for d in num)
```

The presentation layer is where domain-specific requirements come into play. Customer support systems might link to help articles or ticket IDs, academic applications need APA style, and internal tools might simply link to Notion URLs. By separating presentation from verification, you can easily adapt to different contexts without changing your core pipeline.
Putting It All Together
Here’s how these components work together in a complete RAG pipeline:
```python
class ProductionRAGSystem:
    def __init__(self, vector_store, llm_client):
        self.retriever = CitableRetriever(vector_store)
        self.generator = CitableGenerator(llm_client)
        self.verifier = CitationVerifier()
        self.presenter = CitationPresenter()

    def answer_question(self, query: str) -> Dict[str, Any]:
        """Demonstration of the complete RAG pipeline."""

        # Step 1: Retrieve relevant chunks
        retrieved_chunks = self.retriever.retrieve_with_sources(query)

        # Step 2: Generate answer with citations
        generation_result = self.generator.generate_answer(query, retrieved_chunks)

        # Step 3: Verify citations
        verification = self.verifier.verify_citations(
            generation_result["answer"],
            [chunk for chunk, _ in retrieved_chunks]  # pass chunks with source metadata
        )

        # Step 4: Format for display
        formatted_output = self.presenter.format_for_display(
            generation_result["answer"],
            retrieved_chunks  # pass chunks with source metadata
        )

        # Step 5: Return final response
        return {
            "query": query,
            "answer": formatted_output,
            "retrieved_chunks": retrieved_chunks,  # Available if needed
            "verification": verification
        }
```

Here is a sample frontend to demonstrate how the response can be displayed, including all the capabilities we've discussed in the article:
Granular citation marking within the response
Sources appendix with precise metadata (file name, URL, page number, author, year)
Citation verification indicator

For UX purposes, you might not want to display all the verification data in the presentation layer for every use case. Here is an example from our Pliant Assistant below, showcasing how claims are linked back to their sources on our Helpcenter via simple markers and direct links in the appendix.


Advanced Considerations
Monitoring Citation Quality: Track three key metrics:
Citation coverage: What percentage of factual claims include citations? If coverage is too low, it may be a sign that your knowledge base doesn't cover the topics your users are asking about.
Citation accuracy: Do cited sources actually support the claims? (Validate through your verification step.) If accuracy is low, the LLM isn't taking the sources into account; try lowering the temperature (ideally below 0.3) to discourage the LLM from 'improvising', and re-engineer your prompt to place stronger emphasis on citing sources.
Source diversity: Are answers over-relying on single sources? This can also be a sign that your knowledge base doesn't have enough documents on certain topics.
Implement logging that captures failed verifications to identify patterns in citation errors.
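Citation coverage can be computed with a simple sentence-level heuristic. A minimal sketch, assuming the `[Source N]` marker convention used throughout this post (the sentence splitting is deliberately naive):

```python
import re

# Matches "[Source 1]", "[Source1]" and "[Source 1, Source 3]"
CITATION_PATTERN = r'\[Source\s*\d+(?:,\s*(?:Source\s*)?\d+)*\]'

def citation_coverage(answer: str) -> float:
    """Fraction of sentences in the answer that carry at least one citation marker."""
    sentences = [s.strip() for s in re.split(r'[.!?]+', answer) if s.strip()]
    if not sentences:
        return 0.0
    cited = sum(1 for s in sentences if re.search(CITATION_PATTERN, s))
    return cited / len(sentences)
```

Logging this value per answer, alongside the verifier's `all_supported` flag, gives you a cheap dashboard signal for both coverage and accuracy.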
Citation Granularity: Page numbers work for short documents but become imprecise for longer ones. Add section headings, paragraph indices, or line numbers to your chunk metadata. For digital documents, include anchor links or character offsets so users can jump directly to the cited passage.
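Finer-grained location metadata might look like the following sketch; the field names and the `#page=…&start=…` anchor scheme are illustrative assumptions, not a standard, and depend on what your document viewer supports.

```python
from dataclasses import dataclass

@dataclass
class ChunkLocation:
    """Location metadata attached to a chunk for precise citation."""
    page_number: int
    section_heading: str
    char_start: int   # offset of the chunk within the source document
    char_end: int

    def anchor_link(self, base_url: str) -> str:
        """Build a deep link, assuming the viewer supports page/offset anchors."""
        return f"{base_url}#page={self.page_number}&start={self.char_start}"
```

With character offsets stored at ingestion time, the presentation layer can let users jump straight to the cited passage instead of an entire page.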
Handling Document Updates: Version your vector store to maintain citation validity when source documents change. Use document hashes to detect changes and update stale metadata for your chunks. For critical applications, consider maintaining historical versions so old citations remain valid.
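A change-detection fingerprint can be as simple as the sketch below; the `version:digest` format is an illustrative convention, not part of the pipeline above.

```python
import hashlib

def document_fingerprint(content: bytes, version: str = "v1") -> str:
    """Stable fingerprint for change detection; store it with each chunk's
    metadata so stale chunks can be found and re-ingested after an update."""
    digest = hashlib.sha256(content).hexdigest()
    return f"{version}:{digest[:16]}"

def needs_reingest(stored_fingerprint: str, current_content: bytes) -> bool:
    """Compare the stored fingerprint against the document's current bytes."""
    version = stored_fingerprint.split(":", 1)[0]
    return stored_fingerprint != document_fingerprint(current_content, version)
```

On each ingestion run, chunks whose fingerprint no longer matches the live document are re-chunked and re-embedded, while unchanged documents are skipped.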
Real-World Implementation Patterns
Different domains require different approaches to citation. In legal research, you might structure citations to match Bluebook format. In medical applications, you might link to PubMed IDs or DOIs. For enterprise knowledge bases you might need to store Jira ticket numbers, Notion/Confluence URLs or similar references.
Store citations internally as stable, structured identifiers (e.g., source_type, document_id, section_id) and defer formatting entirely to the presentation layer. Consider building a citation adapter layer that translates your internal chunk references into the appropriate format for your domain. This separation allows you to change citation formats without rebuilding your entire RAG pipeline.
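A citation adapter layer along those lines might be sketched as follows; the formatter functions, the Confluence URL shape, and the `FORMATTERS` registry are illustrative assumptions about one possible design.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CitationRef:
    """Stable internal reference, independent of any display format."""
    source_type: str   # e.g. "confluence", "pubmed"
    document_id: str
    section_id: str

def format_confluence(ref: CitationRef) -> str:
    # Hypothetical wiki URL shape for illustration only
    return f"https://example.atlassian.net/wiki/pages/{ref.document_id}#{ref.section_id}"

def format_pubmed(ref: CitationRef) -> str:
    return f"PMID: {ref.document_id}"

# Registry mapping source types to their domain-specific formatters
FORMATTERS = {"confluence": format_confluence, "pubmed": format_pubmed}

def render_citation(ref: CitationRef) -> str:
    return FORMATTERS[ref.source_type](ref)
```

Swapping citation styles then means registering a new formatter, with no changes to ingestion, retrieval, generation, or verification.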
Conclusion
Building a RAG system with proper citation capabilities requires thoughtful architecture from the ground up. You can't approach response credibility as an afterthought; it must be integrated into every stage: document processing, retrieval, generation, verification, and presentation.
A RAG system that cites its sources builds user trust, enables verification, meets compliance requirements, and provides the transparency needed for production deployment in sensitive domains. As AI systems become more integrated into critical workflows, the ability to explain and substantiate their outputs will increasingly separate production-ready systems from mere prototypes.
The implementation patterns shown here provide a foundation, but remember that citation requirements vary by domain and use case. Adapt these patterns to your specific needs, always keeping the core principle in mind: every claim should be traceable back to its source.





