
Your RAG System Is Not Broken. Your Data Is.

The hallucination problem everyone is solving from the wrong end.

You tuned the chunk size, swapped the embedding model, rewrote the retrieval prompt three times. The system still hallucinated a federal ruling that doesn’t exist. You blamed the model. The model was not the problem.

Every article says the same thing

Better embeddings. Smarter chunking. Hybrid retrieval. Rerankers. GraphRAG. Agentic loops. The field is obsessed with the middle of the pipeline: retrieval strategies, vector databases, search algorithms.

And honestly, this work is real. The research is serious. The benchmarks matter. But there is a step before all of it that almost nobody talks about.

The benchmark nobody wanted

Thomas Houssin published something uncomfortable this week. He tested 7 RAG strategies across 17,000 chunks and arrived at one finding that cut through everything:

The search engine mattered least. The ingestion pipeline mattered most.

His conclusion: start with BM25, classic keyword search. Invest in your agent and your ingestion pipeline. Not your vector database. Not your embedding model. Your ingestion pipeline, the part most teams build in an afternoon and never touch again.

I read that and felt it. Because I’ve seen what teams actually feed these systems.

What’s actually going in

The documents are raw. PDFs with headers repeated on every page. Web pages with navigation menus baked into the body. Regulatory documents with 40 footnotes pointing to sources the system never follows. Citation links that go nowhere.

You run a retrieval query against that. The retriever finds the most relevant chunk. The most relevant chunk is still noise.
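Much of that noise is mechanical, and stripping it is not research-grade work. Here is a minimal sketch, my own illustration rather than anything from the benchmark, that drops lines repeated across most pages of an extracted PDF (the usual running headers and footers) before chunking:

```python
from collections import Counter

def strip_repeated_lines(pages: list[str], threshold: float = 0.6) -> list[str]:
    """Remove lines that repeat on most pages of a PDF-extracted document.

    `pages` is a list of plain-text page strings from whatever extractor you use.
    Any line appearing on more than `threshold` of the pages is treated as a
    running header/footer and dropped everywhere.
    """
    counts = Counter()
    for page in pages:
        # Count each distinct line once per page so long pages don't skew the counts.
        for line in {l.strip() for l in page.splitlines() if l.strip()}:
            counts[line] += 1

    cutoff = threshold * len(pages)
    boilerplate = {line for line, n in counts.items() if n > cutoff}

    return [
        "\n".join(l for l in page.splitlines()
                  if l.strip() and l.strip() not in boilerplate)
        for page in pages
    ]
```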

BM25 wins in benchmarks not because it is smarter than vector search. It wins because precision beats semantic similarity when your work requires exact answers: the specific regulation number, the exact case citation, the one sentence in a 200-page document that actually matters.

Feed clean, structured text to BM25 and it finds the exact sentence you need. Feed a messy PDF to the best embedding model in the world and you get a confident hallucination.
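For a sense of how little machinery "classic keyword search" needs, here is a minimal sketch using the open-source rank_bm25 package; the chunks and query below are placeholders, and a real pipeline would tokenize more carefully than a whitespace split:

```python
from rank_bm25 import BM25Okapi  # pip install rank-bm25

# Placeholder chunks; in practice these come out of your ingestion step.
chunks = [
    "Regulation (EU) 2016/679, Article 17: right to erasure ('right to be forgotten') ...",
    "The court held in the cited 2019 ruling that the limitations period had run ...",
    "Section 4.2 of the policy excludes flood damage unless a rider is attached ...",
]

tokenized = [c.lower().split() for c in chunks]
bm25 = BM25Okapi(tokenized)

query = "article 17 right to erasure".lower().split()
print(bm25.get_top_n(query, chunks, n=1))  # exact-term matches rise to the top
```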

The retriever is not the variable. The data is.

The compliance problem nobody is talking about

Tonic.ai published something equally important this week. They built PII detection directly into RAG ingestion: not at retrieval, not at the output, but at the moment the document enters the system. The order matters more than most people realize.

Once PII is embedded in a vector store, you cannot cleanly remove it. A GDPR right-to-erasure request becomes an emergency. HIPAA requirements, PCI rules, all of them require sensitive data to be handled before it gets into your infrastructure. Redacting the source document afterward doesn’t fix the embeddings. The data is already in the vector space.
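The pattern, sketched loosely here rather than as Tonic's actual integration, is simply to run redaction before anything is chunked or embedded. The regexes below are crude stand-ins for a real PII detector, because the point is the ordering, not the detector:

```python
import re

# Illustrative patterns only; a production pipeline would use a dedicated
# PII/NER detector. What matters is the ordering: redact, then chunk, then embed.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def ingest(pages: list[str], chunker, embedder, vector_store) -> None:
    """Redact first, then chunk, then embed, so PII never reaches the vector store.
    `chunker`, `embedder`, and `vector_store` stand in for whatever you actually use."""
    for page in pages:
        clean = redact(page)
        for chunk in chunker(clean):
            vector_store.add(chunk, embedder(chunk))
```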

For lawyers, compliance officers, and insurance professionals, this is not a footnote. It’s a liability sitting inside their pipeline right now.

What this actually costs

A compliance officer spends four hours chasing three citations her AI stated with total confidence.

  • 2 of 3 citations do not exist
  • 1 of 3 buried in a footnote the AI never saw
  • $20+ in API costs, gone
  • 47 tabs open, nothing resolved
  • 1 brief that is wrong

That afternoon is not a model failure. It is an ingestion failure. The citation tree was never followed. The documents were never structured. The noise was never stripped. The AI saw garbage and returned confident garbage.

Multiply that afternoon across every professional whose work lives inside documents (lawyers, tax attorneys, real estate professionals, investigative journalists, insurance underwriters) and you understand why this matters.

The question worth asking first

Everyone asks: which retrieval strategy should I use? Nobody asks first: what am I running retrieval against?

There is a missing layer in most AI workflows. It sits between the web (raw, messy, unverified) and every AI tool anyone uses. Its job is not glamorous. No conference tracks. No benchmark leaderboards. But its job is everything (a rough sketch follows the list below):

  • Capture the content, strip ads, navigation, boilerplate noise
  • Follow the citation tree, not just the surface, all the way down
  • Map which claims have evidence and which ones don’t
  • Export clean structured data the AI can actually read
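As a rough sketch of the shape such a layer could take, with names and structure assumed for illustration rather than drawn from any of the cited work:

```python
from dataclasses import dataclass, field

@dataclass
class CleanDocument:
    url: str
    text: str                                   # ads, navigation, boilerplate stripped
    cited_texts: list[str] = field(default_factory=list)        # citations actually followed
    unsupported_claims: list[str] = field(default_factory=list)  # filled by a later pass

def ingest(url: str, fetch, clean, extract_citation_urls, depth: int = 1) -> CleanDocument:
    """One pass of the missing layer: capture, clean, follow citations one level down,
    and return something structured that any retriever can run against.
    `fetch`, `clean`, and `extract_citation_urls` are placeholders for your own tooling."""
    raw = fetch(url)
    doc = CleanDocument(url=url, text=clean(raw))
    if depth > 0:
        for cited_url in extract_citation_urls(raw):
            # Follow the citation tree instead of trusting only the surface document.
            doc.cited_texts.append(
                ingest(cited_url, fetch, clean, extract_citation_urls, depth - 1).text
            )
    return doc
```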

Fix that layer and the retrieval debate collapses. BM25 or vector search? Either works when the input is clean. Neither works when it isn’t.

Before you go

The RAG conversation in 2026 is sophisticated. GraphRAG. Agentic RAG. Contextual retrieval. Hybrid search. All of it is real. All of it matters. But every one of those architectures inherits whatever was ingested before it ran.

  • 17K chunks tested across 7 RAG strategies
  • #1 factor: ingestion quality, not the search engine
  • 17-33% hallucination rate in legal RAG tools (Stanford, 2025)

Garbage in. Confident garbage out. Fix the input first. The model gets better without touching a single line of retrieval code. That is a data engineering principle as old as data engineering itself. The field is just now remembering it.

What does your ingestion step actually look like? Not in theory, in practice. Raw PDFs? Unstructured web pages? Citations you never verified? Answer that honestly and you’ll know exactly where your hallucinations come from.

Sources

  • Thomas Houssin, “Why Your RAG System Doesn’t Need Embeddings” (HackerNoon, March 23, 2026)
  • Tonic.ai, “Tonic Textual + Haystack Integration: PII-Safe RAG Pipelines” (March 2026)
  • Stanford Law, “Hallucinations in Legal AI Tools” (Journal of Empirical Legal Studies, 2025)
RAG • AI Hallucinations • Data Engineering • LLM • Machine Learning