---
title: "Why Does Your AI Keep Making the Same Mistake? We Audited Ours to Find Out."
description: "Instructions decay because they depend on remembering at the wrong moment. Mechanisms remove the remembering. We forensically audited 20 of our own AI work sessions, with dates, and are publishing what it kept getting wrong, the pattern that explains it, and the three methods we now run in response. All three are copy-pastable below."
publishedAt: 2026-07-04
updatedAt: 2026-07-04
author: Ena Pragma
url: https://enapragma.co/blog/why-ai-repeats-mistakes
tags: ["ai-operations", "methodology"]
---

If your AI agent keeps repeating a mistake you have already corrected, the instruction is not the problem. The delivery mechanism is. A written rule only works if the system remembers to consult it at the exact moment the wrong reflex fires, and that is precisely when nothing is thinking about rules. A mechanism, a gate it cannot skip, a tool that must run, a check built into the path, removes the remembering entirely. Rules get violated. Pipelines do not.

That is a clean theory. Here is the uncomfortable part: we proved it on ourselves, with dates, and we are publishing the logs-level detail because the pattern is the same one playing out inside every company running AI agents right now.

We run an AI teammate on real work daily: client deliverables, infrastructure, research, publishing. In early July we ran a forensic audit of its 20 most recent work sessions, five dense days of logs, against the full library of correction notes we had written for it. This post is what we found, the single most useful finding, and the three methods that came out of it. Each of the three main sections ends with its method as a copy-pastable resource, because a claim you cannot run is just content.

<Stat value="20 sessions" label="Forensically audited against 197 banked correction notes, with structured extraction and quoted evidence for every claim" />

## We audited our own AI. Doesn't that break our own rule?

Fair question, because [we published the rule ourselves](/blog/claude-science-what-its-design-teaches): the producer of work must never be the one who validates it. Self-review leaks. The producer normalizes its own errors exactly the way you read past your own typos. If we had let our agent grade its own sessions and shipped the conclusions, this post would be worthless.

So the audit was designed as self-study without self-trust, three layers deep:

1. **Fixed-schema extraction, no editorializing.** Four parallel extraction passes pulled structured facts from the session logs on a fixed schema (what was asked, what shipped, corrections received, verifications run or skipped, do-overs, and more), with a quote required for every claim. Extractors were forbidden to recommend or conclude anything.
2. **Cross-reference with timestamps.** Every "it repeated a banked lesson" claim was checked against version-control dates: a lesson must have been recorded *before* the repeat for the claim to count. Memory of "we already knew that" is not evidence; commit dates are.
3. **Blind adversarial review.** The draft findings then went to two independent reviewers that had not seen any of the producing analysis, with one instruction: refute. A finding claiming a *pattern* needed evidence in two or more distinct sessions; a single incident had to be labeled as exactly that, and the reviewers' verdicts shipped with the report.

The refuters earned their keep. They corrected the audit five times before we saw it: one "repeated after being warned" claim turned out to be backwards (the warning was written *because of* those incidents, not before them), one count was undercounted by half, one dramatic-sounding number (three machine crashes in four days) deflated under scrutiny to two, one quote could not be traced to any written record and was cut, and one finding actually got *stronger* when a reviewer found an older violated rule the draft had missed.

Read that list again. Five errors, in an audit about error-catching, written by the system being audited. That is not an embarrassment; that is the method working. A self-audit that comes back clean is telling you about the auditor, not the audit.

<Stat value="5 corrections" label="Forced by two blind adversarial reviewers before the audit's findings were accepted, including one finding that was reversed and one that got stronger" />

Here is the method, condensed to run on your own AI's history. It assumes nothing about which model or platform you use, only that your agent's work leaves records.

<SessionAuditSkill />

## The finding: procedures stick, admonitions repeat

The audit's sharpest pattern was not any single mistake. It was the difference between the lessons that held and the lessons that did not, and the line between them is structural.

The receipts, with dates:

- A rule recorded on **June 9** ("search everywhere before declaring something missing") and a second rule from **June 27** ("the recurring failure mode is not ignorance, it is not consulting what is already banked") were **both violated by the same incident on July 3**, when the agent declared a stored credential missing after one bad search, despite its own notes recording exactly where that credential lived. The rule about consulting your notes first was itself sitting in the notes.
- A rule recorded on **June 19** (verify which account a browser session belongs to before acting in it) was **violated again on June 29**, and only held without prompting on July 3, on its third exposure.
- Meanwhile, a verification harness built on **July 1**, a *mechanism* rather than a rule, held on its **first** unprompted application the next day, and every application after.

The asymmetry is the whole story. An instruction depends on retrieval at the moment of action: the system must remember to remember, precisely when the confident wrong reflex is firing. A mechanism removes the retrieval step. The gate is part of the path, so forgetting is structurally impossible.

<Stat value="24 days" label="A written rule sat in the agent's own notes (recorded June 9) before being violated anyway on July 3. The mechanized replacement has not been skipped once." />

Is it an iron law? No, and our own reviewers made us say so: one written rule did eventually hold on its third exposure, and one early mechanism shipped with a coverage gap (which is why the checklist below insists you birth-test them). It is a strong tendency, and it held everywhere we looked. If you manage people, you already know it. "Please remember to update the ticket" fails; a deploy pipeline that refuses to ship without a ticket number succeeds. What surprised us is how *literally* it transfers to AI agents, which are supposed to be good at reading their own notes. They are good at reading. They are not reliably good at *deciding to read at the right moment*, and no amount of stern wording in a memory file fixes that, because the failure is upstream of the reading.

So we stopped re-writing rules in stronger words. The audit's recommendations were converted into mechanisms, seven of them, each shipped within hours of the audit closing: a lookup tool that must run before the agent may claim anything is "missing," a budget monitor that catches memory bloat before it bites, a design gate that stops render-and-hope loops after two rejections, and the gate described in the next section, among others. Five were validated against real targets immediately; the other two carry an explicit pending-first-live-use tag, because calling a mechanism "done" before it has caught something real is its own kind of optimism. The conversion procedure is the resource:

<MechanizeChecklist />

## The gate that caught what three other gates missed

One story from the audit shows the whole system, including its limits.

On July 1 we published a blog post that had passed three verification gates: a style-and-claims scan, a self-check on load-bearing claims, and an independent tools-denied review that traces claims to sources ([the cold-review method we published here](/blog/claude-science-what-its-design-teaches)). All three passed it. The post shipped saying "the whole method in four words: blind the questioner."

Count the words.

A human reader caught it. Three gates did not, and the miss was structural, not bad luck: every one of those gates verifies the text *against sources*. "Four words" is not a claim about any source. It is the text disagreeing with itself, and no source-tracing gate, however rigorous, is even looking there. Checks that all face the same direction share one blind spot, which means "we ran three checks" is not three times the coverage when all three look outward.

So the audit's recommendation was a fourth gate class: an internal-consistency review that checks a document against *itself*. Stated counts against actual counts. Arithmetic against the numbers given. Repeated facts against each other. References against the things they reference. The document's claims about itself against the document.

We built it within hours of the audit closing, and validated it the way we validate everything now: point it at real finished work and see if it catches anything true. It ran against two documents that had *already passed* our full source-tracing gate stack, including the audit report itself.

<Stat value="4 real defects" label="Caught by the internal-consistency gate on its first run, in two documents that had already passed every source-tracing gate we run" />

It found four: a reference in the audit report to a section that did not exist (a leftover from an earlier draft's numbering), a date range described as "four days" that spanned five, and two self-descriptions that claimed more than the documents delivered. Small? The "four words" error was small. Small internal contradictions are how a careful reader, or an AI answer engine deciding whether to cite you, learns to distrust a document. And note the recursion: the audit that recommended this gate was itself corrected by it. Nothing we run is exempt, which is the point.

The gate is below. It is deliberately narrow: it does not check facts against the world (your source gates do that), it checks the text against the text. Run it alongside your other checks on anything numbered, counted, or multi-part, and keep the reviewer independent of whoever wrote the document.

<ConsistencyReviewSkill />

## What this means if you run AI on real work

Three transferable conclusions, in order of importance:

1. **When an AI repeats a corrected mistake, stop re-instructing and mechanize.** The correction you wrote is not weak because it is badly worded; it is weak because it is a correction. Convert it into a gate, a required tool, or a pipeline step, and the repeat class dies. Our score since adopting this: no mechanized lesson has been skipped yet, while the written-only lessons each took at least one more violation before holding, and one pair never held at all until mechanized.
2. **Audit the history, not the vibes, and never let the producer grade itself.** The audit method above cost us one evening. Every material claim it produces should survive an adversarial reviewer that never saw the reasoning. Ours was corrected five times; yours will be corrected too, and that is the feature.
3. **Count your gate classes, not your gates.** Three checks of the same class share one blind spot. Style, sources, self-claims, and internal consistency are different classes. The cheapest coverage upgrade we made all month was adding the fourth class, and it paid for itself the first day.

None of this requires our stack, our vendor, or our help. It requires work records, version dates, and the discipline to let something that is not you tell you that you are wrong. That last one is the actual differentiator, for AI systems and for the companies running them.

*This post's numbers come from our internal audit records, session logs, and version-control history, dated inline. The post was run through the four checks it describes before publishing, and yes, they drew blood: the consistency review found a real self-contradiction in an earlier draft of this very post, and the independent review caught this post misquoting a rule, in the paragraph about misquotes. Both fixed. If you find an error anyway, we want to know, and if you want this discipline running on your own AI operations, [that conversation starts here](/book).*