The Two-Hop Gap Test

A ten-minute test of whether your AI can combine facts it already knows, run on your own business data. Direct versus stepwise, scored in two columns.

two-hop-gap-test.md

The Two-Hop Gap Test: does your AI compose facts, or just contain them?
An Ena Pragma method. Time: ~10 minutes. Works on any assistant, copilot, or
workflow that answers questions from your business data.

WHAT IT MEASURES
The gap between what your AI knows and what it can combine. Research calls it
the compositionality gap (arxiv.org/abs/2210.03350): the share of questions
where the model gets every sub-fact right and the combined answer wrong.

THE TEST
1. Pick two facts it verifiably has, where fact B is looked up BY the answer
   of fact A. Example: A = "who placed order 4417?" (Meridian Manufacturing),
   B = "who is Meridian's account manager?" (Dana).
2. Confirm each fact ALONE, in separate fresh conversations. Both must be
   right. If either is wrong, stop: you have a retrieval problem, not a
   composition problem, and this test does not apply yet.
3. Ask the combined question cold, in a fresh conversation:
   "Who is the account manager for the customer that placed order 4417?
   Answer with just the name."
   The "just the name" constraint matters: it pushes the composition to
   happen internally, without externalized intermediate steps.
4. Ask again stepwise, in a fresh conversation: first ask fact A, then use
   its answer to ask fact B as a follow-up.
5. Repeat steps 2 through 4 on 3 to 5 fact pairs from YOUR data (a correct
   stepwise run doubles as the sub-fact check). Score two columns:
   direct-correct and stepwise-correct.

READING THE RESULT
- Stepwise high, direct low: the two-hop gap, live in your stack. The
  knowledge is fine; the internal composition is the weak link. Any workflow
  that trusts one-shot combined answers is carrying silent risk.
- Both high: your pairs may be combinations the model has seen. Retry with
  pairs unique to your business (internal IDs, recent records).
- Both low: fix retrieval/access first; composition is not your bottleneck.

THE DESIGN RULE THAT FOLLOWS
In automated workflows, never let step N+1 depend on reasoning that stayed
inside step N. Decode every intermediate answer into an artifact (a field, a
message, a ticket update) and start the next step from the artifact. The same
move that fixes reliability is what makes the workflow auditable and gives a
human checkpoint somewhere to attach.

Re-run quarterly and after any model upgrade. The gap is a property of the
model-plus-your-data pair, not a constant.

This method is published in full in the post Why Does Your AI Know Every Fact but Fail the Combined Question?, which covers the evidence behind it and when to reach for it.