July 4, 2026 · 7 min read · ai-operations · methodology

Why Does Your AI Know Every Fact but Fail the Combined Question?

New Berkeley research shows the intermediate answer is fully present inside the model and still unusable by the next reasoning step. Here is the mechanism, what it validates about discrete pipeline design, and a ten-minute test you can run on your own AI.

Ask your AI who placed order 4417, and it answers correctly. Ask who Meridian's account manager is, and it answers correctly. Ask who the account manager is for the customer that placed order 4417, and it confidently names the wrong person.

The failure looks like missing knowledge. It is not. A Berkeley paper posted this week measured what actually happens inside a transformer at the moment it composes two facts, and found the intermediate answer sitting there, fully formed and decodable, in a shape the next reasoning step cannot use. The model knows. It just cannot hand the answer to itself.

That mechanism has a practical consequence for anyone running AI on business operations, and it is the opposite of "wait for smarter models": the way to get reliable multi-step answers is to stop asking the model to carry intermediate results internally at all.

#What is the two-hop gap?

A two-hop question is any question whose answer requires looking something up with the result of another lookup. Which account manager owns the customer on this order. Which contracts are affected by the regulation that changed in July. Whether the vendor on this invoice is the same one flagged in last quarter's audit.

Language models have a documented, named problem here. In 2022, researchers measured what they called the compositionality gap: the fraction of questions where the model answers every sub-question correctly but still fails the composition. Their headline case: the model knows fact A, knows fact B, and cannot produce A composed with B. The gap did not close as models scaled; single-fact recall improved faster than composition did.
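To make the definition concrete, here is a minimal sketch of how the gap could be computed from your own results. It assumes the denominator is the set of questions where both sub-questions were answered correctly; the `TwoHopResult` record and the sample data are hypothetical, not taken from the study.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TwoHopResult:
    """Outcome of one two-hop question (hypothetical record shape)."""
    hop_a_correct: bool     # first sub-question answered correctly
    hop_b_correct: bool     # second sub-question answered correctly
    composed_correct: bool  # combined question answered correctly


def compositionality_gap(results: list[TwoHopResult]) -> float | None:
    """Share of questions with both sub-facts right and the composition wrong.

    Only questions where both sub-questions were correct count toward the
    denominator. Returns None when no question qualifies.
    """
    eligible = [r for r in results if r.hop_a_correct and r.hop_b_correct]
    if not eligible:
        return None
    failed = sum(1 for r in eligible if not r.composed_correct)
    return failed / len(eligible)


# Made-up sample: three eligible questions, two of which fail to compose.
results = [
    TwoHopResult(True, True, True),
    TwoHopResult(True, True, False),
    TwoHopResult(True, True, False),
    TwoHopResult(True, False, False),  # excluded: a sub-fact is missing
]
gap = compositionality_gap(results)
print(gap)
```

The excluded row is the point of the metric: a question the model cannot answer at the sub-fact level is a knowledge problem, so it is kept out of the composition score.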

Follow-up work located the failure mechanically. A 2024 study of trained-from-scratch transformers found that two-hop composition circuits form in distinct layer bands, and that composition generalizes poorly outside the training distribution even when every individual fact is stored. The knowledge is in the weights. The plumbing between the facts is what breaks.

#What did the Berkeley team actually find?

The new paper, DiscoLoop (Fu, Guo, Wang, Zhu, Lee, Jiao, Russell, Mei; UC Berkeley with Princeton, posted July 1, 2026), studies an architecture built specifically for internal multi-step reasoning: a looped transformer that reapplies the same layers repeatedly, giving the model an explicit second pass to do the second hop. Even there, composition breaks, and the paper isolates why.

After the first pass, the model has the bridge answer. Reading the hidden state with the standard logit-lens technique, the correct intermediate entity is decodable with probability 1.000. But the vector holding that answer is geometrically misaligned with what the second pass expects to consume: its cosine similarity to the clean embedding of the same entity is about 0.33 on familiar data, and lower on unfamiliar data. The second hop receives a noisy smear instead of a clean answer.
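A toy illustration of how both measurements can be true at once. This is not the paper's code or data: the four-dimensional embeddings, the names other than Dana, and the hidden vector are invented, and the readout is simplified to a dot product against the embedding table (an assumption standing in for a logit-lens readout).

```python
import math

# Toy 4-dimensional entity embeddings (hypothetical values, not from the paper).
EMBEDDINGS = {
    "Dana": [1.0, 0.0, 0.0, 0.0],
    "Lee": [0.0, 1.0, 0.0, 0.0],
    "Kim": [0.0, 0.0, 1.0, 0.0],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def decode(hidden):
    """Return the entity whose embedding scores highest against the hidden state."""
    return max(EMBEDDINGS, key=lambda name: dot(hidden, EMBEDDINGS[name]))


# A hidden state that points weakly at "Dana" and strongly along an axis
# that no entity embedding uses.
hidden = [1.0, 0.0, 0.0, 2.8]

decoded = decode(hidden)                         # top-1 readout
alignment = cosine(hidden, EMBEDDINGS[decoded])  # similarity to the clean vector
print(decoded, round(alignment, 2))
```

The readout only asks which entity scores highest, so it ignores the large off-axis component. Cosine similarity does not, which is why the same vector can decode correctly and still sit far from the clean embedding.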

8.3%

two-hop accuracy on unfamiliar fact combinations for a looped transformer that stores every individual fact perfectly, before intervention (DiscoLoop, Table 1)

The elegant part is the causal experiment. The authors patch one vector at one position between the two passes: they mix in the clean embedding of the answer the model itself already decoded. No retraining, nothing else touched. At a mixing weight of 0.1, accuracy on unfamiliar combinations jumps from 8.3 percent to 25.9 percent. At roughly 0.5, both familiar and unfamiliar accuracy approach 100 percent.

~100%

accuracy on the same held-out combinations after a training-free patch that mixes a clean copy of the answer the model already computed back into the second pass (DiscoLoop, Section 3.2)

Their proposed architecture bakes that patch in: each loop passes forward both the continuous hidden state and a decoded, re-embedded copy of what that state says. Pull the noisy vector toward a clean discrete answer, then keep reasoning.
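The decode, re-embed, and mix move can be sketched in a few lines. The convex-combination form below is an assumption about what "mixing weight" means, and the vectors are toy values; the paper's exact formula and any accuracy numbers are not reproduced here.

```python
import math

# Toy entity embeddings (hypothetical values, not from the paper).
EMBEDDINGS = {
    "Dana": [1.0, 0.0, 0.0, 0.0],
    "Lee": [0.0, 1.0, 0.0, 0.0],
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cosine(u, v):
    return dot(u, v) / math.sqrt(dot(u, u) * dot(v, v))


def decode(hidden):
    """Pick the entity whose embedding scores highest against the hidden state."""
    return max(EMBEDDINGS, key=lambda name: dot(hidden, EMBEDDINGS[name]))


def discretize_and_mix(hidden, alpha):
    """Decode the hidden state, re-embed the result, and blend it back in.

    Assumed form: (1 - alpha) * hidden + alpha * clean_embedding.
    """
    clean = EMBEDDINGS[decode(hidden)]
    return [(1.0 - alpha) * h + alpha * c for h, c in zip(hidden, clean)]


noisy = [1.0, 0.0, 0.0, 2.8]  # decodes to "Dana" but is poorly aligned with it
alignment = {
    alpha: cosine(discretize_and_mix(noisy, alpha), EMBEDDINGS["Dana"])
    for alpha in (0.0, 0.1, 0.5, 1.0)
}
for alpha, value in alignment.items():
    print(f"alpha={alpha:.1f}  cosine to clean embedding={value:.2f}")
```

In this toy setup the alignment rises steadily with the mixing weight. It shows only the geometry of the intervention, not the accuracy gains the paper reports.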

The honest bounds: the authors themselves flag that these controlled results are on small models and synthetic fact graphs, and that on a real 440M-parameter pretraining run the benchmark gain is modest (about one point). And as of this writing, nothing is peer-reviewed or independently replicated, and no code is released. What survives those caveats is the mechanism, because it was demonstrated causally, not correlationally: the intermediate answer was present, unusable, and surgically fixable by making it discrete.

#Why this validates discrete pipelines

Here is the part that matters if you run AI on real work. "Decode the intermediate answer, then start the next step from the decoded answer" is not a research novelty. It is a description of how disciplined AI operations already work.

When a workflow writes each step's result to a field, a ticket, or a message before the next step begins, it is doing externally exactly what DiscoLoop does inside the model: replacing a latent carry with a clean, discrete artifact. We have argued the operational case for this before: loops need durable state outside the model, because context evaporates and processes restart. The DiscoLoop result adds a deeper reason. Even when nothing crashes and nothing is forgotten, the latent carry is the unreliable link. Externalizing the hop is not a compensation for weak models. It is the representationally correct design, now with causal evidence from inside the weights.
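As a rough sketch of what "start the next step from the artifact" looks like in code: the two dictionaries below stand in for single-fact lookups (in practice each would be one model or retrieval call), and `run_hop` and the record format are hypothetical names chosen for illustration.

```python
import json
from typing import Callable

# Stand-ins for single-fact lookups; the data mirrors the example in this post.
ORDER_TO_CUSTOMER = {"4417": "Meridian Manufacturing"}
CUSTOMER_TO_MANAGER = {"Meridian Manufacturing": "Dana"}


def run_hop(record: list, step: str, hop_input: str,
            answer: Callable[[str], str]) -> str:
    """Run one hop, write its result to the record, and return what was written."""
    entry = {"step": step, "input": hop_input, "output": answer(hop_input)}
    record.append(json.dumps(entry))         # the artifact exists before the next hop
    return json.loads(record[-1])["output"]  # the next hop reads the artifact


record = []  # stands in for a ticket, a table, or a message log
customer = run_hop(record, "order_to_customer", "4417",
                   ORDER_TO_CUSTOMER.__getitem__)
manager = run_hop(record, "customer_to_manager", customer,
                  CUSTOMER_TO_MANAGER.__getitem__)

print(manager)
for line in record:
    print(line)
```

The second hop never sees anything the first hop did not write down. That constraint is the design: whatever reaches step two is also what an auditor can read later.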

And one mechanism buys three properties at once:

- Reliability. The next step starts from a verified artifact, not from whatever geometry the previous step left behind.
- Auditability. A hop that exists as an artifact is a hop that appears in the record. An audit trail can only replay what was externalized; reasoning that stays latent is unreviewable by construction.
- Governability. A discrete hop is a place where a human checkpoint can actually attach. You cannot put an approval gate in the middle of a forward pass.
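The governability point is easiest to see as code. In this sketch a checkpoint is a function that inspects the intermediate artifact before the next hop may run; `gated_hop`, `HopRejected`, and the allow-list reviewer are hypothetical, and a real gate might be a human approval queue instead.

```python
from typing import Callable


class HopRejected(Exception):
    """Raised when a reviewer declines an intermediate answer."""


def gated_hop(artifact: dict, approve: Callable[[dict], bool]) -> dict:
    """Pass the artifact on only if the checkpoint approves it."""
    if not approve(artifact):
        raise HopRejected(f"step {artifact['step']!r} was not approved")
    return {**artifact, "approved": True}


# Illustrative reviewer: accept only customers that are already on file.
KNOWN_CUSTOMERS = {"Meridian Manufacturing"}


def reviewer(artifact: dict) -> bool:
    return artifact["output"] in KNOWN_CUSTOMERS


artifact = {"step": "order_to_customer", "input": "4417",
            "output": "Meridian Manufacturing"}
approved = gated_hop(artifact, reviewer)
print(approved)
```

The gate has something to inspect only because the hop produced an artifact. A one-shot combined answer offers no equivalent attachment point.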

Research and production practice converged on the same design from opposite directions. Berkeley found in the weights what operations teams learned from incidents.

#How do you test your own AI for the two-hop gap?

You do not need a lab. You need ten minutes and facts from your own business. The test below measures the gap the way the compositionality-gap researchers did: same facts, direct versus stepwise.

two-hop-gap-test.md

The Two-Hop Gap Test: does your AI compose facts, or just contain them?
An Ena Pragma method. Time: ~10 minutes. Works on any assistant, copilot, or
workflow that answers questions from your business data.

WHAT IT MEASURES
The gap between what your AI knows and what it can combine. Research calls it
the compositionality gap (arxiv.org/abs/2210.03350): the share of questions
where the model gets every sub-fact right and the combined answer wrong.

THE TEST
1. Pick two facts it verifiably has, where fact B is looked up BY the answer
   of fact A. Example: A = "who placed order 4417?" (Meridian Manufacturing),
   B = "who is Meridian's account manager?" (Dana).
2. Confirm each fact ALONE, in separate fresh conversations. Both must be
   right. If either is wrong, stop: you have a retrieval problem, not a
   composition problem, and this test does not apply yet.
3. Ask the combined question cold, in a fresh conversation:
   "Who is the account manager for the customer that placed order 4417?
   Answer with just the name."
   The "just the name" constraint matters: it pushes the composition to
   happen internally, without externalized intermediate steps.
4. Ask again stepwise, in a fresh conversation: first ask fact A, then use
   its answer to ask fact B as a follow-up.
5. Repeat steps 2 through 4 on 3 to 5 fact pairs from YOUR data (a correct
   stepwise run doubles as the sub-fact check). Score two columns:
   direct-correct and stepwise-correct.

READING THE RESULT
- Stepwise high, direct low: the two-hop gap, live in your stack. The
  knowledge is fine; the internal composition is the weak link. Any workflow
  that trusts one-shot combined answers is carrying silent risk.
- Both high: your pairs may be combinations the model has seen. Retry with
  pairs unique to your business (internal IDs, recent records).
- Both low: fix retrieval/access first; composition is not your bottleneck.

THE DESIGN RULE THAT FOLLOWS
In automated workflows, never let step N+1 depend on reasoning that stayed
inside step N. Decode every intermediate answer into an artifact (a field, a
message, a ticket update) and start the next step from the artifact. The same
move that fixes reliability is what makes the workflow auditable and gives a
human checkpoint somewhere to attach.

Re-run quarterly and after any model upgrade. The gap is a property of the
model-plus-your-data pair, not a constant.

Two things make this test worth running on your systems rather than trusting benchmark folklore. First, the gap is distribution-sensitive: public benchmarks use famous fact pairs that models have seen composed; your CRM has combinations no model ever trained on, which is exactly where composition degrades most. Second, the result is directly actionable: every workflow where the direct answer is trusted today is a place to insert an explicit hop tomorrow.
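If you prefer to keep the scorecard in code rather than on paper, a minimal sketch follows. The `PairResult` shape, the pair labels, and the `high` and `low` cutoffs are all assumptions; the test itself only says "high" and "low", and with three to five pairs any threshold is coarse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    label: str              # which fact pair was tested
    direct_correct: bool    # combined question, asked cold
    stepwise_correct: bool  # fact A first, then fact B as a follow-up


def score(results, high=0.8, low=0.5):
    """Return (direct_rate, stepwise_rate, reading) for a list of PairResult.

    The `high` and `low` cutoffs are illustrative assumptions.
    """
    if not results:
        raise ValueError("score at least one fact pair")
    direct = sum(r.direct_correct for r in results) / len(results)
    stepwise = sum(r.stepwise_correct for r in results) / len(results)
    if stepwise >= high and direct <= low:
        reading = "two-hop gap: externalize the hop"
    elif stepwise >= high and direct >= high:
        reading = "both high: retry with pairs unique to your business"
    elif stepwise <= low and direct <= low:
        reading = "both low: fix retrieval or access first"
    else:
        reading = "inconclusive: add more pairs"
    return direct, stepwise, reading


# Made-up results for four fact pairs.
results = [
    PairResult("order -> account manager", False, True),
    PairResult("invoice -> audited vendor", False, True),
    PairResult("contract -> changed regulation", True, True),
    PairResult("ticket -> owning team", False, True),
]
direct_rate, stepwise_rate, reading = score(results)
print(direct_rate, stepwise_rate, reading)
```

Keeping the per-pair rows, not just the two rates, makes the quarterly re-run comparable: you can see which specific combinations changed after a model upgrade.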

#What should you change in how you build?

Four rules fall straight out of the mechanism:

- Never let step N+1 depend on reasoning that stayed inside step N. Decode every intermediate result into an artifact: a field, a row, a message, a ticket update. Start the next step from the artifact.
- Budget hops explicitly. When you map a workflow, count its hops. Every hop is a place composition can fail silently, and a place a checkpoint can live. One-hop steps chained discretely beat one heroic multi-hop prompt.
- Ask for the working, structurally. "Answer with just the name" is a reliability anti-pattern in automated pipelines; it forces the composition to happen internally where it is weakest and invisible. Let intermediate answers exist, then act on them.
- Re-test when models change. The gap is a property of the model-plus-data pair, not a constant. A ten-minute quarterly re-run of the test above tells you whether an upgrade actually changed composition or just fluency.

#Where this came from, and why we checked

We found this paper through a YouTube explainer (Discover AI's video on DiscoLoop), whose description claimed transformers fail because representations "drift away from the geometry that later computations expect." Before citing anything, we verified the description against the paper itself. Two of its three claims held up: the paper does run causal intervention experiments, and its alignment principle does dramatically improve multi-hop accuracy in the controlled settings. The third was embellished: the paper never describes drift over time; it measures a static per-loop mismatch, and its diagnosis for standard transformers is a storage problem, not a geometry problem. The video is a fine pointer and a bad citation.

We publish that distinction deliberately. Secondary coverage of AI research is increasingly AI-generated, and at the time of writing this paper had zero human-authored analysis anywhere we could find. If you cite research to guide operational decisions, the primary source is the only source; everything else is a lead.

Every number in this post traces to the linked primary sources: the DiscoLoop paper and its full text, the compositionality gap study, and the grokked-transformers analysis. This post passed our published gate stack, including an independent review against those sources, before it went live. If your operations depend on AI answering joined-up questions and you want the hops made explicit, auditable, and governable, that conversation starts here.