
I wanted to experiment with RAG, so I built a tool that helps police officers analyze criminal cases and automatically generate investigation reports. Here's what I learned along the way.
What is RAG?
RAG stands for Retrieval-Augmented Generation. In simple terms, it's a technique for giving an AI model access to your own documents when generating a response.
Instead of relying solely on what the model was trained on, RAG lets you say: "Here are some relevant excerpts from my document, now answer based on these." The model becomes smarter about your specific domain without needing to be retrained.
It works in two stages:
- Retrieval: find the most relevant pieces of your document for a given query
- Generation: pass those pieces to the AI as context, then let it generate a response
The Problem
Every criminal case in Indonesia requires a document called a BAP (Berita Acara Pemeriksaan), an official police investigation report. It documents the incident, identifies which laws apply, and recommends charges.
Writing one manually means cross-referencing the Indonesian criminal code (KUHP/KUHAP), which is hundreds of pages long. An officer has to read through it, find the relevant articles, and write everything up in formal language. Every. Single. Case.
I thought: what if AI could do most of that?
A big thanks to my friend Adi Gilang who helped me research this problem and figure out that it was a good fit for RAG. Without him I probably wouldn't have landed on this use case.
The Stack
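Before diving in, here are the main tools that come up in this walkthrough:
- unpdf for extracting text from the PDF
- Inngest for background jobs
- Pinecone for vector search
- OpenAI for generating the analysis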
How It Works: The Big Picture
The app has two main flows: ingestion (loading the law document) and inference (analyzing a case). Let me walk through both.
Step 1: Upload the Law Document
An admin uploads the KUHP/KUHAP as a PDF. The app then does three things with it:
1. Extract the text
A library called unpdf reads the PDF and extracts all of its text into one big string.
2. Split it into chunks
That big string gets split into overlapping pieces called chunks.
Why overlap? Because legal sentences don't always start and end neatly at a 1000-character boundary. With overlap, a sentence that gets cut at the end of one chunk still appears intact at the start of the next, as long as it is shorter than the overlap.
Each chunk gets saved to the database.
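A minimal sketch of overlapping chunking, not the app's actual code. The 1000-character chunk size comes from the text above; the 200-character overlap and the function name chunkText are assumptions made for illustration.

```typescript
// Split text into fixed-size windows; each window repeats the tail of the previous one.
// chunkSize = 1000 follows the post; overlap = 200 is an assumed value.
function chunkText(text: string, chunkSize: number = 1000, overlap: number = 200): string[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new Error("Require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const step = chunkSize - overlap; // how far the window advances each time
  const result: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    result.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // this window already reached the end
  }
  return result;
}

// Usage: a 2500-character string with the defaults produces windows starting at 0, 800, and 1600.
const lawText = new Array(2501).join("x");
const lawChunks = chunkText(lawText);
console.log(lawChunks.length, lawChunks.map(c => c.length));
```

A larger overlap protects longer sentences but stores more duplicated text, so the right value depends on how long the articles in the document tend to be.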
3. Trigger a background job
After saving, the app fires an event to Inngest, a tool for running background jobs. This is important because the next step (embedding) is slow and we don't want the user staring at a loading spinner.
Step 2: Embed the Chunks (Background Job)
This is where RAG starts. As a reminder, RAG (Retrieval-Augmented Generation) is a technique for giving an AI model access to your own documents at query time.
The key idea: convert text into numbers (called embeddings) so you can mathematically compare meaning.
The Inngest job handles the embedding. Each chunk becomes a vector: a list of 1536 numbers that represents its meaning. Similar text produces similar vectors.
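To make "similar vectors" concrete, here is a small sketch of cosine similarity, a common way to compare embeddings. The post does not say which metric the index uses, so cosine is an assumption, and the 3-number vectors are toy stand-ins for real 1536-number embeddings.

```typescript
// Cosine similarity: close to 1 when two vectors point the same way, 0 when they are unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // a zero vector has no direction to compare
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy vectors (hypothetical values, not real embeddings).
const theftArticle = [0.9, 0.1, 0.0];
const phoneStolen = [0.8, 0.2, 0.1];
const trafficRule = [0.0, 0.1, 0.9];
console.log(cosineSimilarity(phoneStolen, theftArticle)); // high score
console.log(cosineSimilarity(phoneStolen, trafficRule));  // low score
```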
One important design decision: only one law document is active at a time. When a new document is uploaded, the old one and all its vectors get deleted. This keeps things simple and, as we'll see later, helps with caching.
Step 3: Officer Fills In a Case Form
When an officer wants to analyze a case, they fill out a form with details like:
- Incident type (theft, assault, etc.)
- Location and date
- What the victim did
- The outcome
- A brief description of the context
This gets saved as a case record in the database.
Step 4: Retrieve Relevant Law Articles
Now the RAG magic happens. When analysis is triggered, the app assembles a search query from the case details.
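One way the query assembly could look, as a sketch under assumptions: the CaseDetails interface, its field names, and buildSearchQuery are all hypothetical, modeled on the form fields listed in Step 3. Leaving location and date out of the query is also an assumption, on the theory that they rarely help match the text of a law article.

```typescript
// Hypothetical shape of a saved case record, based on the form fields in Step 3.
interface CaseDetails {
  incidentType: string;
  location: string;
  date: string;
  victimAction: string;
  outcome: string;
  context: string;
}

// Join the fields that describe what happened into one string to embed.
// Empty fields are skipped so the query stays clean.
function buildSearchQuery(details: CaseDetails): string {
  const parts = [details.incidentType, details.victimAction, details.outcome, details.context];
  return parts
    .map(p => p.trim())
    .filter(p => p.length > 0)
    .join(". ");
}

const exampleCase: CaseDetails = {
  incidentType: "theft",
  location: "Jakarta",
  date: "2024-01-01",
  victimAction: "left a phone on a cafe table",
  outcome: "phone was taken",
  context: "suspect seen on CCTV",
};
console.log(buildSearchQuery(exampleCase));
```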
This query gets embedded (converted to a vector), then Pinecone finds the 5 most similar law chunks.
Think of it like a very smart CTRL+F: instead of matching exact words, it matches meaning. A query about "stealing someone's phone" will surface chunks about theft articles even if the exact words don't match.
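Conceptually, the lookup is "score every chunk against the query, keep the best 5". The sketch below does that in memory as a stand-in for what a vector database does at scale. It is not Pinecone's API; the types, the function names, and the cosine metric are assumptions for illustration.

```typescript
interface StoredChunk { id: string; text: string; vector: number[]; }
interface ScoredChunk { id: string; text: string; score: number; }

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force nearest neighbours: score all chunks, sort descending, keep the top k.
function findTopK(queryVector: number[], stored: StoredChunk[], k: number = 5): ScoredChunk[] {
  return stored
    .map(c => ({ id: c.id, text: c.text, score: cosine(queryVector, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Toy 2-number vectors (hypothetical values, not real embeddings).
const storedChunks: StoredChunk[] = [
  { id: "theft", text: "Article on theft", vector: [1, 0] },
  { id: "traffic", text: "Article on traffic", vector: [0, 1] },
  { id: "robbery", text: "Article on robbery", vector: [0.9, 0.1] },
];
console.log(findTopK([1, 0], storedChunks, 2).map(m => m.id)); // [ 'theft', 'robbery' ]
```

A real vector index avoids scoring every chunk by using approximate search, but the result has the same shape: a short ranked list of chunks with scores.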
Step 5: Build the Prompt (The Cache Part)
Here's where things get interesting, and where I learned something valuable.
The app builds two separate prompts:
- The system prompt contains the law articles
- The user prompt contains the case details
I originally called this CAG (Cache-Augmented Generation). The idea was: if the system prompt is always formatted the same way, OpenAI can cache it and skip reprocessing it on repeated calls, saving both time and money.
Technically, this is true. OpenAI does automatically cache prompt prefixes once a prompt reaches 1,024 tokens. But here's what I got wrong: real CAG means you explicitly control the KV cache at the model-serving level, pre-loading the full document once and reusing that state for every query. What I built is really just RAG with deterministic prompt formatting that happens to be eligible for OpenAI's passive caching. And because the cache matches on an identical prefix, it only pays off when repeated calls start with the same law excerpts in the same order.
Still useful! But worth being honest about.
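Here is a sketch of what "deterministic prompt formatting" could mean in practice. The instruction wording, the LawChunk type, and the function names are hypothetical. Sorting the excerpts by id is one assumed way to make the same set of chunks produce a byte-identical system prompt regardless of retrieval order; it is not necessarily what the app does.

```typescript
interface LawChunk { id: string; text: string; }
interface ChatMessage { role: "system" | "user"; content: string; }

// Stable context: fixed instructions first, then excerpts in a fixed order.
function buildSystemPrompt(excerpts: LawChunk[]): string {
  const sorted = excerpts.slice().sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  const articles = sorted.map((c, i) => `[${i + 1}] ${c.text.trim()}`).join("\n\n");
  return "You are a legal analysis assistant. Base your answer only on these law excerpts:\n\n" + articles;
}

// Variable context: the case itself goes last, in the user message.
function buildUserPrompt(caseSummary: string): string {
  return "Analyze this case and identify the applicable articles:\n\n" + caseSummary.trim();
}

function buildMessages(excerpts: LawChunk[], caseSummary: string): ChatMessage[] {
  return [
    { role: "system", content: buildSystemPrompt(excerpts) },
    { role: "user", content: buildUserPrompt(caseSummary) },
  ];
}

const excerptA: LawChunk = { id: "chunk-001", text: "Article text about theft." };
const excerptB: LawChunk = { id: "chunk-002", text: "Article text about aggravated theft." };
// Same chunks in a different retrieval order still give the same system prompt.
console.log(buildSystemPrompt([excerptB, excerptA]) === buildSystemPrompt([excerptA, excerptB])); // true
```

Keeping the fixed instructions at the very start matters because prefix caching matches from the first token onward; anything that varies per request belongs as late in the prompt as possible.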
Step 6: Stream the Analysis
The assembled prompts get sent to OpenAI, and the response streams back to the browser in real time.
The officer sees the analysis appear word by word, just like ChatGPT. Once complete, it's saved to the database.
Step 7: Generate the BAP Document
After the analysis, the officer can generate the formal BAP report. This is a second AI call that takes both the case data and the analysis result, and formats everything into an official Indonesian police document, complete with headers, legal citations, and signature sections.
What I'd Do Differently
1. Verify the cache is actually hitting
OpenAI reports a cached_tokens count in the usage details of the API response. I never added logging to check whether the cache was actually being used. That should have been step one.
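The missing logging could be as small as the sketch below. It assumes the usage object has the Chat Completions shape, with cached_tokens nested under prompt_tokens_details; the UsageLike type and the cacheHitRatio helper are hypothetical names, and the numbers in the example are made up.

```typescript
// Assumed shape of the usage object on a Chat Completions response.
interface UsageLike {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Fraction of prompt tokens that were served from the cache (0 when nothing was cached).
function cacheHitRatio(usage: UsageLike): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return usage.prompt_tokens > 0 ? cached / usage.prompt_tokens : 0;
}

// Example with made-up numbers: half of a 2048-token prompt came from the cache.
const exampleUsage: UsageLike = { prompt_tokens: 2048, prompt_tokens_details: { cached_tokens: 1024 } };
console.log("cache hit ratio: " + (cacheHitRatio(exampleUsage) * 100).toFixed(1) + "%");
```

Since the analysis is streamed, it is worth checking that usage is actually present on the response; with streaming it typically has to be requested explicitly, and it arrives at the end of the stream.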
2. Be more careful with terminology
I labeled the architecture "CAG" before fully understanding what CAG means at the infrastructure level. For a personal project it's fine, but if you're writing about it, be precise.
3. Consider loading the full document
With RAG, I'm only giving the AI 5 chunks. There's a risk it misses a relevant article. If the law document is small enough, loading it fully into context (true CAG) would give more complete and reliable results.
Key Takeaways
If you're building something similar, here are the things that matter most:
- Chunk with overlap: prevents important sentences from being split
- Keep one active document: simplifies retrieval and maximizes cache hits
- Separate system and user prompts: puts stable context in the system prompt, where it can be cached
- Use background jobs for embedding: it's slow; don't block the user
- Stream the response: makes the app feel fast even when the model takes a few seconds
If you want to look at the full source code, it's on GitHub: icaf-app

