What I Learned Automating Criminal Case Reports with RAG

I wanted to experiment with RAG, so I built a tool that helps police officers analyze criminal cases and automatically generate investigation reports. Here's what I learned along the way.

What is RAG?

RAG stands for Retrieval-Augmented Generation. In simple terms, it's a technique for giving an AI model access to your own documents when generating a response.

Instead of relying solely on what the model was trained on, RAG lets you say: "Here are some relevant excerpts from my document, now answer based on these." The model becomes smarter about your specific domain without needing to be retrained.

It works in two stages:

Retrieval: find the most relevant pieces of your document for a given query
Generation: pass those pieces to the AI as context, then let it generate a response

The Problem

Every criminal case in Indonesia requires a document called a BAP (Berita Acara Pemeriksaan), an official police investigation report. It documents the incident, identifies which laws apply, and recommends charges.

Writing one manually means cross-referencing the Indonesian criminal code and criminal procedure code (KUHP/KUHAP), which together run to hundreds of pages. An officer has to read through them, find the relevant articles, and write everything up in formal language. Every. Single. Case.

I thought: what if AI could do most of that?

A big thanks to my friend Adi Gilang, who helped me research this problem and figure out that it was a good fit for RAG. Without him I probably wouldn't have landed on this use case.

The Stack

Before diving in, here's what I used:

Tool	Category	What I used it for
Next.js	Framework	Full-stack app (frontend + API routes)
PostgreSQL + Drizzle ORM	Database & ORM	Storing cases and document chunks
Pinecone	Vector database	Semantic search over law chunks
OpenAI	AI Model	Embeddings + text generation
Inngest	Background jobs	Async embedding and indexing pipeline
Vercel AI SDK	Streaming	Streaming AI responses to the browser

How It Works: The Big Picture

The app has two main flows: ingestion (loading the law document) and inference (analyzing a case). Let me walk through both.

Step 1: Upload the Law Document

An admin uploads the KUHP/KUHAP as a PDF. The app then does three things with it:

1. Extract the text

Using a library called unpdf, the PDF is read and all its text is extracted into one big string.

2. Split it into chunks

That big string gets split into overlapping pieces called chunks:

const CHUNK_SIZE = 1000;    // characters per chunk
const CHUNK_OVERLAP = 150;  // overlap between chunks

Why overlap? Because legal sentences don't always start and end neatly at the 1000-character boundary. With overlap, a sentence that gets cut off at the end of one chunk usually appears whole at the start of the next, so it is much less likely to be lost.
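Here's a minimal sketch of what such a splitter could look like. The post only shows the two constants, so chunkText is a hypothetical helper, not the code from the app, and it splits on raw character counts without looking at sentence or article boundaries.

```typescript
const CHUNK_SIZE = 1000; // characters per chunk
const CHUNK_OVERLAP = 150; // characters shared with the previous chunk

// Hypothetical helper: slides a fixed-size window over the text.
function chunkText(
  text: string,
  size: number = CHUNK_SIZE,
  overlap: number = CHUNK_OVERLAP,
): string[] {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("expected size > 0 and 0 <= overlap < size");
  }
  const step = size - overlap; // how far the window advances each time
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this window already reached the end
  }
  return chunks;
}
```

With the defaults, each new chunk starts 850 characters after the previous one, so the last 150 characters of one chunk are repeated as the first 150 of the next.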

Each chunk gets saved to the database.

3. Trigger a background job

After saving, the app fires an event to Inngest, a tool for running background jobs. This is important because the next step (embedding) is slow and we don't want the user staring at a loading spinner.
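The save-then-hand-off pattern can be sketched like this. Everything here is an assumption for illustration: the two interfaces, the event name, and ingestDocument are made up, and EventSender only mirrors the { name, data } event shape that Inngest's send() accepts, so a real client could be passed in its place.

```typescript
// Hypothetical storage interface standing in for the database layer.
interface ChunkStore {
  saveChunks(documentId: string, chunks: string[]): Promise<void>;
}

// Hypothetical interface with the same { name, data } event shape as Inngest's send().
interface EventSender {
  send(event: { name: string; data: Record<string, unknown> }): Promise<unknown>;
}

async function ingestDocument(
  documentId: string,
  chunks: string[],
  store: ChunkStore,
  events: EventSender,
): Promise<{ documentId: string; queued: boolean }> {
  // 1. Persist first, so the background job can read the chunks from the database.
  await store.saveChunks(documentId, chunks);
  // 2. Hand the slow part (embedding and indexing) to the job runner.
  //    The event name is a placeholder, not the one used in the app.
  await events.send({ name: "law-document/uploaded", data: { documentId } });
  // 3. Return right away; embedding continues in the background.
  return { documentId, queued: true };
}
```

The event carries only the document id rather than the chunks themselves, which keeps the payload small and lets the job load whatever is in the database at the time it runs.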

Step 2: Embed the Chunks (Background Job)

This is where the retrieval side of RAG starts.

The key idea: convert text into numbers (called embeddings) so you can mathematically compare meaning.

The Inngest job does this:

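// For every chunk in the database...
const embeddings = await embedTexts(chunks.map(c => c.content));

// Pair each chunk with its embedding (assumes chunks have an id
// and embedTexts returns one vector per input, in the same order)
const vectors = chunks.map((c, i) => ({ id: String(c.id), values: embeddings[i] }));

// Store the vectors in Pinecone
await index.upsert({ records: vectors });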

Each chunk becomes a vector: a list of 1536 numbers that represents its meaning. Similar text produces similar vectors.
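"Similar vectors" is usually measured with cosine similarity. The post doesn't say which metric the Pinecone index is configured with, so treat this as a sketch of the general idea rather than what the app does; cosineSimilarity is a hypothetical helper and the vectors in the usage note are toy values, not real embeddings.

```typescript
// Cosine similarity: 1 means same direction, 0 means unrelated, -1 means opposite.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  // An all-zero vector has no direction, so report no similarity.
  return denominator === 0 ? 0 : dot / denominator;
}
```

For example, cosineSimilarity([1, 0], [0, 1]) is 0 because the vectors point in unrelated directions, while two vectors that point the same way score 1 regardless of their length.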

One important design decision: only one law document is active at a time. When a new document is uploaded, the old one and all its vectors get deleted. This keeps things simple and, as we'll see later, helps with caching.

Step 3: Officer Fills In a Case Form

When an officer wants to analyze a case, they fill out a form with details like:

Incident type (theft, assault, etc.)
Location and date
What the victim did
The outcome
A brief description of the context

This gets saved as a case record in the database.

Step 4: Retrieve Relevant Law Articles

Now the RAG magic happens. When analysis is triggered, the app assembles a search query from the case details:

const query = `${caseRow.incidentType} ${caseRow.context} ${caseRow.victimAction}`;
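One thing to watch with a template literal: an empty optional field is interpolated as the literal text "null" or "undefined", which then ends up in the search query. Here's a small sketch of a more defensive builder. It assumes context and victimAction can be missing, which the post doesn't confirm, and both CaseQueryFields and buildSearchQuery are hypothetical names.

```typescript
// Only the three fields used in the query above are modelled here.
interface CaseQueryFields {
  incidentType: string;
  context?: string | null;
  victimAction?: string | null;
}

// Hypothetical helper: joins the non-empty fields into one search string.
function buildSearchQuery(c: CaseQueryFields): string {
  return [c.incidentType, c.context, c.victimAction]
    .map((part) => (part ?? "").replace(/\s+/g, " ").trim()) // collapse line breaks and extra spaces
    .filter((part) => part.length > 0) // drop missing or blank fields
    .join(" ");
}
```

Collapsing whitespace also means a multi-line description typed into the form produces the same query as the same text on one line.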

This query gets embedded (converted to a vector), then Pinecone finds the 5 most similar law chunks:

const chunks = await retrieveRelevantChunks(query, 5);

Think of it like a very smart CTRL+F: instead of matching exact words, it matches meaning. A query about "stealing someone's phone" will surface chunks about theft articles even if the exact words don't match.
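To make the "smart CTRL+F" idea concrete, here's a brute-force version of top-k retrieval over an in-memory list. This is only an illustration: a vector database like Pinecone uses an index instead of scoring every chunk, and topK, StoredChunk, and ScoredChunk are hypothetical names. It also assumes all vectors are unit length, in which case the dot product gives the same ranking as cosine similarity.

```typescript
interface StoredChunk {
  id: string;
  content: string;
  vector: number[]; // assumed to be normalized to length 1
}

interface ScoredChunk {
  id: string;
  content: string;
  score: number;
}

function dot(a: number[], b: number[]): number {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += (a[i] ?? 0) * (b[i] ?? 0);
  return sum;
}

// Score every chunk against the query vector and keep the k best.
function topK(queryVector: number[], chunks: StoredChunk[], k: number): ScoredChunk[] {
  return chunks
    .map((c) => ({ id: c.id, content: c.content, score: dot(queryVector, c.vector) }))
    .sort((x, y) => y.score - x.score) // highest score first
    .slice(0, k);
}
```

Nothing in the scoring looks at the words themselves, only at the vectors, which is why a query and a law article can match without sharing any vocabulary.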

Step 5: Build the Prompt (The Cache Part)

Here's where things get interesting, and where I learned something valuable.

The app builds two separate prompts:

The system prompt contains the law articles:

const system = buildLawContextPrefix(chunks, template);
// Result:
// "You are a legal AI assistant...
//  === LAW REFERENCE ===
//  [Excerpt 1] Pasal 362 KUHP...
//  [Excerpt 2] Pasal 363 KUHP...
//  === END REFERENCE ==="

The user prompt contains the case details:

const prompt = buildAnalysisPrompt(caseRow, template);
// Result:
// "Analyze this case:
//  Incident Type: Theft
//  Location: Jakarta..."

I originally called this CAG (Cache-Augmented Generation). The idea was: if the system prompt is always formatted the same way, OpenAI can cache it and skip reprocessing it on repeated calls, saving both time and money.

Technically, this is true. OpenAI automatically caches prompt prefixes once a prompt reaches 1,024 tokens, although only the part that exactly matches an earlier request counts as a hit. But here's what I got wrong: real CAG means you explicitly control the KV cache at the model serving level, pre-loading the full document once and reusing that state for every query. What I built is really just RAG with deterministic prompt formatting that happens to be eligible for OpenAI's passive caching.

Still useful! But worth being honest about.
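For what "deterministic prompt formatting" can mean in practice, here's a minimal sketch. It is not the app's buildLawContextPrefix: buildStablePrefix, LawChunk, and the instruction text are placeholders made up for this example. The idea is to put the fixed instructions first and order the excerpts by id, so the same set of chunks always produces a byte-identical string.

```typescript
interface LawChunk {
  id: string;
  content: string;
}

// Placeholder instruction text, not the wording used in the app.
const INSTRUCTIONS = "You are a legal AI assistant. Answer only from the law reference below.";

// Hypothetical helper: same chunks in, same string out, whatever order they arrive in.
function buildStablePrefix(chunks: LawChunk[]): string {
  // Sort a copy by id so the order returned by vector search does not matter.
  const ordered = [...chunks].sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  const excerpts = ordered.map((c, i) => `[Excerpt ${i + 1}] ${c.content.trim()}`);
  return [INSTRUCTIONS, "=== LAW REFERENCE ===", ...excerpts, "=== END REFERENCE ==="].join("\n");
}
```

The trade-off is that sorting by id throws away the relevance order from retrieval. Whether that matters for answer quality is something to test rather than assume.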

Step 6: Stream the Analysis

The assembled prompts get sent to OpenAI, and the response streams back to the browser in real time:

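const result = streamText({
  model: openai(AI_MODEL),
  system,   // <- law context (stable prefix, eligible for caching)
  prompt,   // <- case details (new each time)
  onFinish: async ({ text }) => {
    // Save the analysis to this case only. Without a where clause,
    // Drizzle would update every row in the table.
    // Assumes eq is imported from "drizzle-orm" and cases has an id column.
    await db
      .update(cases)
      .set({ analysisText: text })
      .where(eq(cases.id, caseRow.id));
  },
});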

return result.toTextStreamResponse();

The officer sees the analysis appear word by word, just like ChatGPT. Once complete, it's saved to the database.

Step 7: Generate the BAP Document

After the analysis, the officer can generate the formal BAP report. This is a second AI call that takes both the case data and the analysis result, and formats everything into an official Indonesian police document, complete with headers, legal citations, and signature sections.

What I'd Do Differently

1. Verify the cache is actually hitting

OpenAI returns a cached_tokens field in the API response (under usage.prompt_tokens_details in the Chat Completions API). I never added logging to check whether the cache was actually being used. That should have been step one.
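The logging I skipped could be as small as this. It's a sketch that reads the usage object from a raw Chat Completions response; how the Vercel AI SDK surfaces the same number depends on the SDK version, so that part is left out. The Usage interface lists only the fields used here, and both function names are hypothetical.

```typescript
// Subset of the usage object on a Chat Completions response.
interface Usage {
  prompt_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

// Fraction of the prompt that was served from the cache (0 to 1).
function cacheHitRatio(usage: Usage): number {
  const cached = usage.prompt_tokens_details?.cached_tokens ?? 0;
  return usage.prompt_tokens === 0 ? 0 : cached / usage.prompt_tokens;
}

// One-line summary suitable for a log statement.
function describeCacheUse(usage: Usage): string {
  const percent = Math.round(cacheHitRatio(usage) * 100);
  return `prompt tokens: ${usage.prompt_tokens}, cached: ${percent}%`;
}
```

Logging describeCacheUse(response.usage) on every analysis call would show quickly whether the ratio ever rises above zero.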

2. Be more careful with terminology

I labeled the architecture "CAG" before fully understanding what CAG means at the infrastructure level. For a personal project it's fine, but if you're writing about it, be precise.

3. Consider loading the full document

With RAG, I'm only giving the AI 5 chunks. There's a risk it misses a relevant article. If the law document is small enough, loading it fully into context (true CAG) would give more complete and reliable results.
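A rough way to check "small enough" is to estimate the token count from the character count. The sketch below assumes about 4 characters per token, which is a common rule of thumb for English and may be off for Indonesian legal text, so a real tokenizer should be used before relying on it. The function names and the numbers in the usage note are illustrative, not taken from the app or from any specific model.

```typescript
// Very rough estimate: characters divided by an assumed characters-per-token ratio.
function estimateTokens(text: string, charsPerToken: number = 4): number {
  return Math.ceil(text.length / charsPerToken);
}

// True if the document plus a reserve (instructions, case details, the answer)
// fits within the given context window.
function fitsInContext(
  documentText: string,
  contextWindowTokens: number,
  reservedTokens: number,
  charsPerToken: number = 4,
): boolean {
  return estimateTokens(documentText, charsPerToken) + reservedTokens <= contextWindowTokens;
}
```

For example, a 400,000-character document is estimated at 100,000 tokens, so fitsInContext(text, 128000, 8000) returns true, while a 500,000-character document with the same budget does not fit.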

Key Takeaways

If you're building something similar, here are the things that matter most:

Chunk with overlap: makes it less likely that an important sentence gets split and lost
Keep one active document: simplifies retrieval and helps with cache hits
Separate system and user prompts: puts stable context in the system prompt, where it can be cached
Use background jobs for embedding: it's slow, so don't block the user
Stream the response: makes the app feel fast even when the model takes a few seconds

If you want to look at the full source code, it's on GitHub: icaf-app