I built two AI retrieval systems in the span of a few months. One uses vector embeddings over a pgvector database. The other uses YOLO, BM25, and a ChromaDB instance. Both shipped. Neither was straightforward. Here's what I actually learned.
The problem Studium was solving
It started during my second semester at BINUS. I had three subjects with dense lecture PDFs, and my study habit was to re-read slides until something stuck — which is a terrible strategy. I wanted to be able to ask my own notes questions. Not a chatbot with generic knowledge, but something that would surface the exact paragraph from week 7's slides when I asked about gradient descent.
That's a retrieval problem. The solution everyone reaches for is RAG — Retrieval-Augmented Generation: chunk the documents, embed the chunks, store them in a vector database, retrieve the most relevant ones at query time, pass them to an LLM as context. Simple in theory. I spent three weeks getting it to actually work.
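The retrieval half of that pipeline is small enough to sketch in plain Python. This is a minimal, illustrative version — it assumes the chunks have already been embedded (the actual embedding calls go to an inference API), and it uses brute-force cosine similarity rather than an index:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, chunk_vecs, k=3):
    """Indices of the k chunks most similar to the query vector.
    A vector database does exactly this, just with an index instead
    of a linear scan."""
    ranked = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return ranked[:k]
```

The retrieved chunks then get concatenated into the LLM prompt as context. Everything difficult about RAG lives in the details around this core: how you chunk, which index you use, and whether the embeddings are consistent.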
Chunking is where you live or die
The first version of Studium chunked documents at fixed 500-token boundaries. The retrieval was terrible. A question about "what is backpropagation" would pull in a chunk that started mid-sentence from a paragraph about something else, because the split happened to land there.
I switched to semantic chunking — splitting on paragraph boundaries and section headings rather than token counts. Immediately better. Then I added a sliding overlap window (roughly 10% of chunk size) so context from the end of one chunk bled into the start of the next. That fixed most of the mid-sentence retrieval failures.
The rule I landed on: chunk by meaning, not by size. Token budgets matter for the LLM context window, but they should be a ceiling, not the primary splitting strategy.
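A sketch of that strategy, under simplifying assumptions — paragraphs are split on blank lines, and "tokens" are approximated as whitespace-separated words rather than real tokenizer output:

```python
def chunk_paragraphs(text, max_tokens=500, overlap_ratio=0.1):
    """Pack whole paragraphs into chunks up to max_tokens, carrying a
    ~10% tail of each chunk into the start of the next so sentences
    near a boundary stay retrievable from both sides."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        words = para.split()
        if current and len(current) + len(words) > max_tokens:
            chunks.append(" ".join(current))
            overlap = int(max_tokens * overlap_ratio)
            current = current[-overlap:] if overlap else []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Note that max_tokens acts only as the ceiling: paragraph boundaries decide where the splits actually land.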
Three weeks with pgvector
I chose Supabase with the pgvector extension because I was already using Supabase for auth and the database. One less service to manage. Embeddings came from DigitalOcean's inference API, the same infrastructure I was using for the LLM calls, so latency was predictable.
Where I got stuck: pgvector's default index (IVFFlat) requires you to specify the number of lists at creation time, and performance degrades badly if your dataset grows past what the index was tuned for. I didn't realise this until retrieval times started creeping up when a user uploaded a lot of documents. The fix was switching to the HNSW index, which handles growing datasets far better, but it meant migrating the index on a live table — not fun at 1am.
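For reference, the two index types differ right at the DDL level. The table and column names below are placeholders, not Studium's actual schema; the parameter values are pgvector's documented defaults, not tuned numbers:

```python
# IVFFlat: the number of lists is baked in at creation time. A value
# tuned for a small corpus degrades once the table grows well past it.
IVFFLAT_DDL = """
CREATE INDEX chunks_embedding_ivfflat
    ON chunks USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
"""

# HNSW: graph-based, no list count to outgrow, so it copes far better
# with a table that keeps growing after the index is built.
HNSW_DDL = """
CREATE INDEX chunks_embedding_hnsw
    ON chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);
"""
```

The migration pain comes from the fact that building a new index on a large live table locks or slows things down, which is why it happened at 1am.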
Second gotcha: embedding dimensions need to match exactly at query time. I had one environment using a 1536-dimension model and another using 768. The mismatch silently returned garbage results rather than throwing an error. Adding a dimension assertion in the embedding helper caught this class of bug permanently.
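The assertion itself is a few lines. This is a hypothetical version of the helper — the client object and its `embed` method stand in for whatever SDK the environment actually uses:

```python
EXPECTED_DIM = 1536  # must match the model configured in every environment

def embed(text, client):
    """Embed text via a (hypothetical) client, failing loudly on a
    dimension mismatch instead of letting garbage vectors flow into
    the database and silently ruin retrieval."""
    vector = client.embed(text)
    if len(vector) != EXPECTED_DIM:
        raise ValueError(
            f"embedding dimension {len(vector)} != expected {EXPECTED_DIM}; "
            "check which model this environment is configured with"
        )
    return vector
```

The point is that vector similarity between mismatched dimensions isn't an error condition in the database — it just never matches anything well — so the only place to catch it is at the application boundary.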
Studium shipped. Then came Balatro Coach.
Balatro is a roguelike poker game. The idea for Balatro Coach was simple: upload a screenshot of your current hand, get tactical advice — which cards to play, which jokers synergise, whether you can survive the blind. The hard part was that the "documents" weren't PDFs. They were game states extracted from images.
I needed computer vision first. I used YOLO11n in ONNX format for card detection — fast enough to run on inference hardware without a GPU, and ONNX meant I could deploy it without a full PyTorch runtime. RapidOCR handled the text on joker cards, which YOLO wasn't reliable enough to read directly.
Once I had structured game state data (detected cards, joker names, current score targets), I had a retrieval problem again — but a different shape. I needed to look up joker interactions and scoring rules from a knowledge base, not user documents. For this, pure vector similarity was actually worse than keyword matching. The query "what does Joker X do with a flush" benefits from exact term matching on the joker name, not semantic similarity.
I ended up using a hybrid: BM25 for keyword recall over the rules corpus, ChromaDB for semantic lookup when the query was more conceptual. A simple reciprocal rank fusion merged both result sets. The combined approach outperformed either alone on the evaluation set I put together.
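Reciprocal rank fusion is genuinely simple, which is part of why it works well as a merge step. A sketch, assuming each retriever returns an ordered list of document ids (the k=60 constant is the conventional value from the original RRF paper):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of doc ids. Each doc scores the sum of
    1 / (k + rank) over every list it appears in, so documents ranked
    highly by either retriever float to the top without needing the
    BM25 and vector scores to be on comparable scales."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

That scale-independence is the design win: BM25 scores and cosine similarities live in different ranges, and RRF sidesteps the normalisation problem entirely by only looking at ranks.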
What carried over
A few things I learned from Studium that I applied directly to Balatro Coach:
- Don't trust retrieval at the edges. Always log what gets retrieved for a sample of real queries. The failure modes are almost never what you expected in development.
- The index is part of the system design, not an afterthought. Choosing pgvector HNSW vs IVFFlat, or ChromaDB vs a flat file, has real performance consequences at scale.
- Hybrid retrieval beats pure vector search for structured domains. BM25 + vectors is almost always better than vectors alone when your corpus has specific terminology.
- Keep the embedding pipeline boring. One model, one dimension, one normalisation strategy. Every variation is a new failure mode.
What I'd do differently
For Studium: I'd instrument retrieval quality from day one. I added logging late and spent time debugging issues that were invisible without it. A simple table tracking query → retrieved chunks → user rating would have surfaced the chunking problems weeks earlier.
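That table doesn't need to be fancy. A minimal sketch of the idea using sqlite3 — the schema and function names here are illustrative, not what Studium actually ships:

```python
import sqlite3

def make_log(path=":memory:"):
    """Open (or create) a tiny retrieval-quality log:
    one row per query, with the retrieved chunk ids and an
    optional user rating."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS retrieval_log ("
        "query TEXT, chunk_ids TEXT, rating INTEGER)"
    )
    return conn

def log_retrieval(conn, query, chunk_ids, rating=None):
    conn.execute(
        "INSERT INTO retrieval_log VALUES (?, ?, ?)",
        (query, ",".join(map(str, chunk_ids)), rating),
    )
    conn.commit()
```

Even without ratings, just being able to grep what was retrieved for real queries surfaces bad chunking almost immediately.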
For Balatro Coach: I'd containerise earlier. The gap between "works on my machine" and "works in Docker Compose" cost me a full day when the ONNX runtime had different behaviour between my local Python environment and the container. Lesson noted.
Both projects are on GitHub if you want to look at the actual implementation. The messy commits are part of the story.