How do you stop your AI from hallucinating?
- Chris Morris
- Sep 10
- 7 min read
Data preparation, retrieval hygiene, verification loops, guardrails & domain constraints, deterministic configurations, tracing—and where humans stay in the loop

If you’ve shipped anything using AI, you’ve seen it: a confident answer with no basis in your data, common sense or reality. At TackTech, we’ve spent the past months turning that problem into a process. Below are the practices we use to keep answers truthful, reasoning sound, and outcomes useful.
1) Start with agent-readable knowledge (the unglamorous foundation)
This may sound obvious, but it’s often the most overlooked element of AI product development: give your agents real, structured data to work with. Too often we see solutions pointed straight at standard APIs, and people are surprised when the answers make no sense. It’s like asking a GPS for turn-by-turn directions without a map—and then complaining when it confidently invents roads.
We invest heavily in gathering real data: using APIs to map overlaps between datasets, extracting that data into our own infrastructure, and directing the agent to use that curated data to answer questions.
“What if I don’t have APIs?” Most organisations live in CSVs, PDFs, Word docs, Excel sheets with multiple crosstabs, screenshots of tables, and ad-hoc slide decks. Point an agent straight at that pile and you’re asking it to parse, normalise, and reason at the same time—where hallucinations are born.
Our rule: pre-process before you query. We create specialist ingestion agents whose only job is to extract specific data from specific formats and convert it into a canonical, agent-readable package. Only then do answering agents use that material to reason—based on the information they actually have.
Why standardise first (instead of “parsing on the fly”):
· Accuracy: Combining parsing and reasoning multiplies failure modes (unit mismatches, header misreads, merged cells, hidden rows).
· Consistency: Canonical schemas give repeatable answers; ad-hoc parsing changes with every file nuance.
· Traceability: Standard forms carry versions, sources, and checksums so you can reconstruct exactly what the agent saw.
· Speed & cost: Ingest once, answer many times. Avoid re-OCR/re-parse on every question.
So, yes, it’s extra work—but essential. Use two types of agents to create the context your AI answers with:
1) Ingestion agents (format specialists)
· Take anything (CSVs/Excels with crosstabs, PDFs/Word/Slides, APIs).
· Make it tidy: unpivot/clean tables, extract text (OCR if needed), align IDs across systems.
· Standardise outputs: strict JSON/SQL schemas + short text summaries + metadata (source, owner, effective_date, permissions, version, checksum).
· Validate: schema checks, row counts, totals, key integrity—tests per extractor.
2) Answering agents (reason within context)
· Work only on the standardised package.
· Cite evidence (table rows/chunk IDs) for every claim.
· If context is thin: abstain, request more data, or escalate.
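To make this concrete, here’s a minimal sketch of what a canonical package handed from an ingestion agent to an answering agent might look like. It’s Python, and the field names and checks are illustrative, not our production schema:

```python
# A minimal canonical package: standardised rows plus the metadata that makes
# answers traceable. Field names are illustrative.
from dataclasses import dataclass, field
from datetime import date
import hashlib, json

@dataclass
class CanonicalTable:
    source: str               # original file or API endpoint
    owner: str                # who is accountable for this data
    effective_date: date      # when the data is valid from
    permissions: list[str]    # who may read it
    version: str              # extract/schema version
    columns: list[str]        # canonical column names
    rows: list[dict] = field(default_factory=list)

    @property
    def checksum(self) -> str:
        # Hash the content so you can reconstruct exactly what the agent saw.
        payload = json.dumps(self.rows, sort_keys=True, default=str).encode()
        return hashlib.sha256(payload).hexdigest()

def validate(table: CanonicalTable, expected_rows: int, key: str) -> list[str]:
    """Per-extractor tests: schema checks, row counts, key integrity."""
    errors = []
    if len(table.rows) != expected_rows:
        errors.append(f"row count {len(table.rows)} != expected {expected_rows}")
    for i, row in enumerate(table.rows):
        if set(row) != set(table.columns):
            errors.append(f"row {i}: columns do not match the canonical schema")
    keys = [row.get(key) for row in table.rows]
    if len(keys) != len(set(keys)):
        errors.append(f"duplicate values in key column '{key}'")
    return errors
```

Answering agents then reason over validated packages only, and cite row or chunk IDs from them.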
2) Context (the magic word in prompting)
Hallucinations spike when an agent operates beyond its provided context. We engineer abstention and citation as defaults.
Implementation checklist
System message: “Only answer using the supplied context. If the answer isn’t in context, say you don’t have enough information and offer next steps to find it.”
Tooling contract: Require evidence_ids with each claim; refuse responses with no evidence.
Fallbacks before free-form: If context is insufficient, prefer “request more info,” “escalate to human,” or “create a research task”—not “guess.”
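As a rough illustration, that contract can be as simple as a strict system message plus a post-check that refuses unevidenced claims. The response schema and function below are ours, purely for illustration:

```python
# Illustrative abstention-and-citation contract (not a library API).
SYSTEM_MESSAGE = (
    "Only answer using the supplied context. "
    "If the answer isn't in the context, say you don't have enough information "
    "and offer next steps to find it. "
    "Attach evidence_ids (table rows or chunk IDs) to every claim."
)

def enforce_contract(response: dict) -> dict:
    """Refuse any answer whose claims carry no evidence; fall back instead of guessing."""
    claims = response.get("claims", [])
    unsupported = [c for c in claims if not c.get("evidence_ids")]
    if not claims or unsupported:
        return {
            "answer": None,
            "action": "request_more_info",  # or "escalate_to_human" / "create_research_task"
            "reason": "One or more claims have no supporting evidence in the supplied context.",
        }
    return response
```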
3) Retrieval hygiene (teach your agent what to trust)
If you’ve done steps 1 and 2, your agent is now looking at ordered, relevant data—and only that data. The next challenge: which parts should it read? Not all data is equal. Some sources are more authoritative or more current. A human would take this into consideration when deciding on the best answer, so make sure your AI does too.
By default, an agent treats all data as equal. Your job is to point it at the right sources and keep the reading list short. Favour precision over recall so answers are grounded, not just plausible.
What we do in practice:
Search the right shelf. Keep separate indices for different data types, plus a summary of what each contains. Think of the context your agent is using not as one big novel, but as a filing cabinet: your agent needs to find the files best suited to answering the question as quickly as possible. If someone is asking about a segmentation, ensure it knows which data is specifically about segmentations so it only looks at that when generating a response.
Use multiple lenses + filters. Make sure it understands the broader context of the question so it searches your files correctly. Combine semantic search and keyword search, so even if your description doesn’t include ‘segmentation’, it’s still able to find the right file to direct its attention towards.
Keep the reading list short. AI doesn’t like noise. Start with a broad pool of information, re-rank it by relevance, then keep just the top 5–8 passages; more than ~8 adds noise.
Drop near-duplicates. Promote diverse snippets; suppress repeats so the model sees different evidence, not the same paragraph five times. Why? Because LLMs will prioritise repeated topics or themes, presuming they’re more important.
Prefer what’s current. When documents are versioned, bias to the latest valid one. Yesterday’s segmentations shouldn’t outrank today’s.
If it’s shaky, don’t guess. Confidence is king. When the agent generates an answer, ask it to attach a confidence level. If it’s not confident, have it abstain or escalate instead of improvising. This is incredibly important in multi-agent workflows (more on that in a later article).
Be consistent. Like it or not, most AI solutions involve chatbots, but unless your agent remembers the context of previous questions and the answers it has given, those chats become a clunky, repetitive, tedious question bombardment. Cache chosen passages briefly during a multi-step task so every step uses the same context, and expire the cache quickly when content updates.
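Pulled together, the search/dedupe/re-rank flow above looks roughly like this. It’s a sketch only: semantic_search, keyword_search and rerank stand in for whatever retrieval stack you use, and doc_id/version are assumed chunk metadata:

```python
# Sketch of retrieval hygiene: right shelf, two lenses, latest version only,
# then a short, re-ranked reading list.
from typing import Callable

def retrieve_context(
    question: str,
    semantic_search: Callable[[str, int], list[dict]],  # your vector search
    keyword_search: Callable[[str, int], list[dict]],   # your keyword/BM25 search
    rerank: Callable[[str, list[dict]], list[dict]],    # your relevance re-ranker
    k: int = 8,
) -> list[dict]:
    # Two lenses: a query that never says "segmentation" can still land
    # on the segmentation files via keyword or semantic overlap.
    candidates = semantic_search(question, 50) + keyword_search(question, 50)

    # Prefer what's current and drop near-duplicates: keep only the latest
    # valid version of each document so the model sees diverse evidence.
    latest: dict[str, dict] = {}
    for c in candidates:
        key = c["doc_id"]
        if key not in latest or c["version"] > latest[key]["version"]:
            latest[key] = c

    # Keep the reading list short: re-rank and keep the top 5-8 passages.
    return rerank(question, list(latest.values()))[:k]
```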
4) Verification loops (trust, then verify)
Agents should behave like (good) journalists: if you make a claim, show the evidence. We add checks before output reaches a user or downstream system.
Processes we implement:
Self-check prompt: After drafting, the model must prove each claim with quoted snippets. If it can’t, it revises or abstains.
Critic pass (rules or smaller model): Validates that all claims are evidenced, numbers/logic are consistent, and restricted actions aren’t attempted.
Selective re-retrieve (grab the missing piece): When you ask an agent to reason, it can miss specific facts. Sometimes the fact is there, sometimes it isn’t, but it’s wise to ask your agent to do one targeted search for the items it is missing after it has given its first answer. How we do this:
1. Identify the missing fact (e.g., “profitability for Segment A”).
2. Re-query only for that item.
3. If found, add to context and revise once.
4. If still missing, don’t guess—abstain or escalate to a human.
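Here’s a rough sketch of that loop. draft_answer, find_missing_facts and search stand in for your own drafting, critic and retrieval steps:

```python
# One targeted re-retrieve, then revise once or escalate - never guess.
def answer_with_verification(question, context, draft_answer, find_missing_facts, search):
    draft = draft_answer(question, context)

    # 1. Identify the missing fact(s), e.g. "profitability for Segment A".
    missing = find_missing_facts(draft, context)
    if not missing:
        return draft

    # 2. Re-query only for those items.
    extra = [hit for fact in missing for hit in search(fact)]

    # 3. If found, add to context and revise once.
    if extra:
        return draft_answer(question, context + extra)

    # 4. If still missing, don't guess - abstain or escalate to a human.
    return {"answer": None, "action": "escalate_to_human", "missing": missing}
```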
5) Temperature (reduce randomness, boost repeatability)
Temperature is your friend when it comes to ensuring accuracy. This often-underutilised setting controls randomness in word choice. It’s not a “truth” knob—if retrieval is wrong, low temperature will just be consistently wrong.
What temperature does:
Low (≈ 0–0.2): steadier, copy-from-context behaviour → use for factual/execution tasks.
Medium (≈ 0.3–0.7): some variety → good for planning/brainstorming.
High (≈ 0.8–1.0+): creative but risky for facts.
How we use it
Executor tasks (facts, RAG, code, ops): 0–0.2.
Planner tasks (ideas/options/outlines): 0.5–0.8, then hand to a low-temp executor.
Evaluations/tests: Pin temperature (and model version) so runs are comparable.
This is why step 1 matters: clean JSON/SQL schemas let you keep temperature low for factual work and only raise it for ideation—never for final answers.
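One simple way to wire this in is a role-based decoding config. The roles and values below are illustrative, and the model name is a placeholder:

```python
# Illustrative role-based temperature settings (values mirror the ranges above).
TEMPERATURE_BY_ROLE = {
    "executor":  0.0,  # facts, RAG, code, ops: copy-from-context behaviour
    "planner":   0.6,  # ideas, options, outlines: some variety is useful
    "evaluator": 0.0,  # evaluations/tests: pinned so runs stay comparable
}

# Pin the model version alongside temperature for evaluations,
# otherwise runs are not comparable from week to week.
EVAL_CONFIG = {
    "model": "<pinned-model-version>",  # placeholder, not a real model name
    "temperature": TEMPERATURE_BY_ROLE["evaluator"],
    "seed": 42,  # if your provider supports seeding
}
```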
6) Tracing (measure hallucination, don’t just feel it)
We treat hallucination as a measurable defect. The mistakes aren’t random; there’s an underlying cause. With the right tracing you can follow the breadcrumbs and fix the right layer.
If you’ve followed the steps above, you’ll have a pipeline like:
retrieve → rerank → plan → draft → verify → act.
Log each step, not just the start and end. Capture: inputs/outputs, top-k doc/chunk IDs and their scores, which verification checks passed/failed, and any tools called with parameters. Do this and you’ll know whether errors come from bad retrieval (wrong shelf), reasoning, or stale data—and fix accordingly.
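A minimal sketch of what that per-step logging might look like (the logger and example values are illustrative; swap the print for your own log sink):

```python
# Log every pipeline step - retrieve, rerank, plan, draft, verify, act -
# not just the start and end.
import json, time, uuid

def log_step(trace_id, step, inputs, outputs,
             doc_ids=None, scores=None, checks=None, tool_calls=None):
    record = {
        "trace_id": trace_id,            # ties all steps of one question together
        "step": step,
        "timestamp": time.time(),
        "inputs": inputs,
        "outputs": outputs,
        "doc_ids": doc_ids or [],        # top-k doc/chunk IDs this step saw
        "scores": scores or [],          # their retrieval/rerank scores
        "checks": checks or {},          # which verification checks passed/failed
        "tool_calls": tool_calls or [],  # tools called, with parameters
    }
    print(json.dumps(record, default=str))  # swap for your log sink of choice

# One trace per question:
trace_id = str(uuid.uuid4())
log_step(trace_id, "retrieve",
         {"question": "Which segment is most profitable?"}, {"k": 8},
         doc_ids=["chunk_12", "chunk_31"], scores=[0.82, 0.74])
```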
7) Humans in the loop (on purpose, not as an afterthought)
Our agents are designed to speed up retrieval and synthesise recommendations—not to deliver crystal-clear verdicts. The human is, and always will be, the final decision-maker.
The golden rule: 'You need to tell your agent to be explicit, not certain.'
Every response should state:
What I have: sources used + citations
What I don’t have: missing data or permissions
What I wish I had: specific fields/files that would improve the answer
Contradictions found: where sources disagree
Confidence: a simple score or “low/med/high” with reasons
Recommend, don’t decree. Provide 1–3 options with pros/cons and the next best action (e.g., “this segment appears strongest because…; alternatively, consider X and Y based on …”).
Resist yes-manship. Prompts instruct the agent to abstain when evidence is thin and to surface ambiguity instead of smoothing it over.
When to route to a human:
· Low retrieval/verification confidence
· High-impact actions (financial, legal, reputational)
· Conflicting or out-of-date sources
· Missing mandatory fields/approvals
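Those rules can be collapsed into a single routing check. The thresholds and field names below are illustrative:

```python
# Illustrative routing predicate: send the answer to a human when any rule fires.
def should_route_to_human(answer: dict) -> bool:
    return (
        answer.get("confidence", 0.0) < 0.6                                # low retrieval/verification confidence
        or answer.get("impact") in {"financial", "legal", "reputational"}  # high-impact actions
        or answer.get("has_conflicting_sources", False)                    # sources disagree
        or answer.get("has_stale_sources", False)                          # out-of-date sources
        or bool(answer.get("missing_mandatory_fields"))                    # missing fields/approvals
    )
```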
There is a standard response mechanism we use with all our agents so humans really know the pros and cons:
Summary
What I have / don’t have / wish I had
Contradictions & assumptions
Options with pros/cons & next steps
Confidence + citations
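As a rough template (the field names are ours; adapt to taste):

```python
# Illustrative shape of the standard response every agent returns.
STANDARD_RESPONSE = {
    "summary": "...",
    "what_i_have": {"sources": [], "citations": []},
    "what_i_dont_have": [],             # missing data or permissions
    "what_i_wish_i_had": [],            # specific fields/files that would improve the answer
    "contradictions_and_assumptions": [],
    "options": [                        # 1-3 options with pros/cons and the next best action
        {"option": "...", "pros": [], "cons": [], "next_step": "..."},
    ],
    "confidence": "low | medium | high",
}
```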
Summary
Those are our hallucination secrets. We have spent months building an infrastructure that does all of this, and we know it works.
What changed for us after doing this
Hallucination rate dropped because “no evidence → no answer” is the default, not an exception.
Consistency improved as retrieval and decoding became deterministic and version-pinned.
Onboarding new agents got faster because they inherit the same contracts (schemas, tools, policies).
Teams trust the system since every answer carries its receipts.
If there is one takeaway today, it’s this: Stop treating hallucination as a prompt problem. Treat it as a data, retrieval, and process problem—then wire your agents so they literally cannot answer beyond their context.
If you’re tackling this and want to compare notes, we’re happy to share more of the nitty-gritty from our TackTech trials (and errors).


