Why Does Your AI Keep Making the Same Mistake? We Audited Ours to Find Out.
Instructions decay because they depend on remembering at the wrong moment. Mechanisms remove the remembering. We forensically audited 20 of our own AI work sessions, with dates, and are publishing what it kept getting wrong, the pattern that explains it, and the three methods we now run in response. All three are copy-pastable below.
If your AI agent keeps repeating a mistake you have already corrected, the instruction is not the problem. The delivery mechanism is. A written rule only works if the system remembers to consult it at the exact moment the wrong reflex fires, and that is precisely when nothing is thinking about rules. A mechanism, a gate it cannot skip, a tool that must run, a check built into the path, removes the remembering entirely. Rules get violated. Pipelines do not.
That is a clean theory. Here is the uncomfortable part: we proved it on ourselves, with dates, and we are publishing the logs-level detail because the pattern is the same one playing out inside every company running AI agents right now.
We run an AI teammate on real work daily: client deliverables, infrastructure, research, publishing. In early July we ran a forensic audit of its 20 most recent work sessions, five dense days of logs, against the full library of correction notes we had written for it. This post is what we found, the single most useful finding, and the three methods that came out of it. Each of the three main sections ends with its method as a copy-pastable resource, because a claim you cannot run is just content.
#We audited our own AI. Doesn't that break our own rule?
Fair question, because we published the rule ourselves: the producer of work must never be the one who validates it. Self-review leaks. The producer normalizes its own errors exactly the way you read past your own typos. If we had let our agent grade its own sessions and shipped the conclusions, this post would be worthless.
So the audit was designed as self-study without self-trust, three layers deep:
- Fixed-schema extraction, no editorializing. Four parallel extraction passes pulled structured facts from the session logs on a fixed schema (what was asked, what shipped, corrections received, verifications run or skipped, do-overs, and more), with a quote required for every claim. Extractors were forbidden to recommend or conclude anything.
- Cross-reference with timestamps. Every "it repeated a banked lesson" claim was checked against version-control dates: a lesson must have been recorded before the repeat for the claim to count. Memory of "we already knew that" is not evidence; commit dates are.
- Blind adversarial review. The draft findings then went to two independent reviewers that had not seen any of the producing analysis, with one instruction: refute. A finding claiming a pattern needed evidence in two or more distinct sessions; a single incident had to be labeled as exactly that, and the reviewers' verdicts shipped with the report.
The refuters earned their keep. They corrected the audit five times before we saw it: one "repeated after being warned" claim turned out to be backwards (the warning was written because of those incidents, not before them), one count was undercounted by half, one dramatic-sounding number (three machine crashes in four days) deflated under scrutiny to two, one quote could not be traced to any written record and was cut, and one finding actually got stronger when a reviewer found an older violated rule the draft had missed.
Read that list again. Five errors, in an audit about error-catching, written by the system being audited. That is not an embarrassment; that is the method working. A self-audit that comes back clean is telling you about the auditor, not the audit.
Here is the method, condensed to run on your own AI's history. It assumes nothing about which model or platform you use, only that your agent's work leaves records.
Session Audit: a forensic self-audit your AI cannot grade itself on An Ena Pragma method. WHAT IT NEEDS Your agent's work records (session logs, transcripts, tickets, PR history) and the dates your correction notes / rules were written (version control is ideal). THE PIPELINE 1. PIN THE INPUT. The N most recent substantive work sessions (we use 20), listed to a file before anything is read. No cherry-picking after the fact. 2. EXTRACT, DON'T CONCLUDE. Run extraction passes over the sessions with a fixed schema and a quote required for every entry. Forbid recommendations. Schema: - ASKED: what the operator requested - SHIPPED: what was actually delivered - CORRECTIONS: every correction the operator gave, and what triggered it - VERIFICATIONS: checks run (what they caught) vs checks skipped - REWORK: do-overs and wrong first attempts, with root cause - TOOL_FRICTION: failures and workarounds - MANUAL_SEQUENCES: repeated multi-step routines (automation candidates) - WINS: what went right and the visible mechanism why 3. FIND PATTERNS WITH A BAR. A pattern needs evidence in 2+ distinct sessions. Single occurrences go to a watchlist, not the findings. 4. CROSS-REFERENCE WITH DATES. For every "it repeated a known lesson" claim, check the lesson's recorded date PRECEDES the repeat. Memory is not evidence; timestamps are. 5. BLIND ADVERSARIAL REVIEW. Give the draft findings, WITHOUT your reasoning, to fresh reviewers instructed to refute: does the cited evidence exist? does it mean what is claimed? are the counts right? Verdict per finding: confirmed / weakened (state what to cut) / refuted. Cut what does not survive. 6. SHIP THE DELTA. Publish findings WITH their verdicts, and keep the pre-review draft. The corrections the reviewers forced are your proof the audit is honest. THE RULE THAT MAKES IT LEGITIMATE The producer never certifies its own work. The extraction is mechanical, the dates are from version control, and the judgment calls belong to reviewers that never saw the producing reasoning. A self-audit that comes back clean is telling you about the auditor, not the audit. WHAT TO DO WITH THE FINDINGS Anything that repeated after being written down goes to the mechanize checklist. Do not respond to a repeated failure with a stronger memo.
#The finding: procedures stick, admonitions repeat
The audit's sharpest pattern was not any single mistake. It was the difference between the lessons that held and the lessons that did not, and the line between them is structural.
The receipts, with dates:
- A rule recorded on June 9 ("search everywhere before declaring something missing") and a second rule from June 27 ("the recurring failure mode is not ignorance, it is not consulting what is already banked") were both violated by the same incident on July 3, when the agent declared a stored credential missing after one bad search, despite its own notes recording exactly where that credential lived. The rule about consulting your notes first was itself sitting in the notes.
- A rule recorded on June 19 (verify which account a browser session belongs to before acting in it) was violated again on June 29, and only held without prompting on July 3, on its third exposure.
- Meanwhile, a verification harness built on July 1, a mechanism rather than a rule, held on its first unprompted application the next day, and every application after.
The asymmetry is the whole story. An instruction depends on retrieval at the moment of action: the system must remember to remember, precisely when the confident wrong reflex is firing. A mechanism removes the retrieval step. The gate is part of the path, so forgetting is structurally impossible.
Is it an iron law? No, and our own reviewers made us say so: one written rule did eventually hold on its third exposure, and one early mechanism shipped with a coverage gap (which is why the checklist below insists you birth-test them). It is a strong tendency, and it held everywhere we looked. If you manage people, you already know it. "Please remember to update the ticket" fails; a deploy pipeline that refuses to ship without a ticket number succeeds. What surprised us is how literally it transfers to AI agents, which are supposed to be good at reading their own notes. They are good at reading. They are not reliably good at deciding to read at the right moment, and no amount of stern wording in a memory file fixes that, because the failure is upstream of the reading.
So we stopped re-writing rules in stronger words. The audit's recommendations were converted into mechanisms, seven of them, each shipped within hours of the audit closing: a lookup tool that must run before the agent may claim anything is "missing," a budget monitor that catches memory bloat before it bites, a design gate that stops render-and-hope loops after two rejections, and the gate described in the next section, among others. Five were validated against real targets immediately; the other two carry an explicit pending-first-live-use tag, because calling a mechanism "done" before it has caught something real is its own kind of optimism. The conversion procedure is the resource:
Mechanize the Lesson: what to do when your AI repeats a corrected mistake
An Ena Pragma method.
THE LAW (observed, dated, in our own logs)
Lessons shipped as procedures/mechanisms tend to hold on first use. Lessons
recorded as written rules tend to be violated at least once more before they
stick, if they stick. The failure is structural: a rule requires the system to
remember to consult it at exactly the moment the wrong reflex is firing.
THE CHECKLIST (run it the second time you correct the same thing)
1. NAME THE REFLEX, not the rule. What does the system actually DO in the wrong
moment? ("greps once and declares the thing missing", "renders a third
attempt instead of asking", "trusts its own summary of state")
2. FIND THE INTERCEPT POINT. The last moment before the damage where a check
could sit IN THE PATH: before a claim ships, before a file writes, before a
third attempt renders, before money or messages move.
3. CHOOSE THE WEAKEST SUFFICIENT MECHANISM, in this order:
a. a required tool (must run before the claim is allowed: cheap, loud)
b. a gate that refuses (a check that blocks the path until it passes)
c. a pipeline step (built into the workflow, cannot be skipped)
d. only if none fit: a hard trigger rule ("at 2 rejections, X fires")
4. MAKE IT REPORT, NOT REPAIR. The mechanism surfaces the problem; a person or
the agent fixes it deliberately. Silent auto-repair hides drift.
5. BIRTH-TEST IT ON THE REAL CASE. Point it at the exact incident that motivated
it. If it would not have caught that incident, it is decoration.
6. RETIRE THE MEMO. Fold the written rule into the mechanism's documentation and
stop counting on it. One pointer remains: "the tool enforces this now."
SMELL TEST
If your fix to a repeated failure contains the words "always remember to", you
have written admonition number two, not a fix.#The gate that caught what three other gates missed
One story from the audit shows the whole system, including its limits.
On July 1 we published a blog post that had passed three verification gates: a style-and-claims scan, a self-check on load-bearing claims, and an independent tools-denied review that traces claims to sources (the cold-review method we published here). All three passed it. The post shipped saying "the whole method in four words: blind the questioner."
Count the words.
A human reader caught it. Three gates did not, and the miss was structural, not bad luck: every one of those gates verifies the text against sources. "Four words" is not a claim about any source. It is the text disagreeing with itself, and no source-tracing gate, however rigorous, is even looking there. Checks that all face the same direction share one blind spot, which means "we ran three checks" is not three times the coverage when all three look outward.
So the audit's recommendation was a fourth gate class: an internal-consistency review that checks a document against itself. Stated counts against actual counts. Arithmetic against the numbers given. Repeated facts against each other. References against the things they reference. The document's claims about itself against the document.
We built it within hours of the audit closing, and validated it the way we validate everything now: point it at real finished work and see if it catches anything true. It ran against two documents that had already passed our full source-tracing gate stack, including the audit report itself.
It found four: a reference in the audit report to a section that did not exist (a leftover from an earlier draft's numbering), a date range described as "four days" that spanned five, and two self-descriptions that claimed more than the documents delivered. Small? The "four words" error was small. Small internal contradictions are how a careful reader, or an AI answer engine deciding whether to cite you, learns to distrust a document. And note the recursion: the audit that recommended this gate was itself corrected by it. Nothing we run is exempt, which is the point.
The gate is below. It is deliberately narrow: it does not check facts against the world (your source gates do that), it checks the text against the text. Run it alongside your other checks on anything numbered, counted, or multi-part, and keep the reviewer independent of whoever wrote the document.
Consistency Review: the check your source-checking cannot do
An Ena Pragma method.
THE GAP IT COVERS
Style gates, fact-checkers, and source-tracing reviews all verify a text against
things OUTSIDE it. None of them notice a text disagreeing with ITSELF ("the
whole method in four words" followed by a three-word method). However many
outward-facing checks you run, they all share this one inward blind spot;
internal consistency is the check most teams are missing entirely.
RUN IT: paste this into a fresh assistant session that did not write the
document. Deny it web/search tools; it needs only the text.
"""
You are an internal-consistency reviewer. Check this document against ITSELF
only. Do not check any claim against the outside world; flag only places where
the text disagrees with the text. For each checklist item: first enumerate every
candidate in the document, then verify each one.
1. STATED COUNTS vs actual counts: every "N words/steps/reasons/items/sections"
claim: count the referent.
2. ARITHMETIC: every sum, difference, percentage, or ratio stated or derivable:
recompute it from the numbers given.
3. REPEATED FACTS: any fact stated more than once (dates, figures, names,
versions): all instances must agree.
4. REFERENCE INTEGRITY: section/footnote/figure references resolve; numbered
lists are gapless; internal links land.
5. CLAIM-VS-STRUCTURE: the document's claims about itself are true of itself
("the table below has six rows" -> count the rows).
Report each finding as FAIL (the text contradicts itself) or WARN (ambiguous),
with the location and the two disagreeing values. Never report clean without
having enumerated the candidates. Verdict: CLEAN or FIX-BEFORE-SHIP.
"""
WHEN TO RUN IT
On anything numbered, counted, or multi-part, alongside (never instead of) your
source-tracing checks. The reviewer must be independent of the writer; producers
normalize their own arithmetic exactly like they normalize their own typos.
FIRST-RUN RESULT ON OUR OWN WORK
Four real defects in two documents that had already passed every source-tracing
gate we run, including a dangling section reference in the audit report that
recommended this very check.#What this means if you run AI on real work
Three transferable conclusions, in order of importance:
- When an AI repeats a corrected mistake, stop re-instructing and mechanize. The correction you wrote is not weak because it is badly worded; it is weak because it is a correction. Convert it into a gate, a required tool, or a pipeline step, and the repeat class dies. Our score since adopting this: no mechanized lesson has been skipped yet, while the written-only lessons each took at least one more violation before holding, and one pair never held at all until mechanized.
- Audit the history, not the vibes, and never let the producer grade itself. The audit method above cost us one evening. Every material claim it produces should survive an adversarial reviewer that never saw the reasoning. Ours was corrected five times; yours will be corrected too, and that is the feature.
- Count your gate classes, not your gates. Three checks of the same class share one blind spot. Style, sources, self-claims, and internal consistency are different classes. The cheapest coverage upgrade we made all month was adding the fourth class, and it paid for itself the first day.
None of this requires our stack, our vendor, or our help. It requires work records, version dates, and the discipline to let something that is not you tell you that you are wrong. That last one is the actual differentiator, for AI systems and for the companies running them.
This post's numbers come from our internal audit records, session logs, and version-control history, dated inline. The post was run through the four checks it describes before publishing, and yes, they drew blood: the consistency review found a real self-contradiction in an earlier draft of this very post, and the independent review caught this post misquoting a rule, in the paragraph about misquotes. Both fixed. If you find an error anyway, we want to know, and if you want this discipline running on your own AI operations, that conversation starts here.