Debugging With AI: Three Hypotheses in Three Panes

A debugging workflow that runs three parallel AI agents on the same bug, each exploring a different hypothesis, with a shared shell for log inspection.

April 18, 2026 · 6 min read

The problem

Debugging is search. You have a bug, you form a hypothesis, you test it, it turns out wrong, you form another one. Classical single-threaded debugging walks one hypothesis at a time: you finish investigating hypothesis A before you start on B. If A turns out to be a dead end, the hour you spent on it is gone, and you start over on B. If you have three plausible hypotheses and want to work on all of them at once, you need three terminals and three agents, not one.

The other pain point: when you finally find the bug, you lose the investigation trail. You fixed it, you shipped, you moved on. Two weeks later the bug returns in a slightly different form, and you don't remember which of the three hypotheses was right or what you learned along the way. Running the investigation in a visible grid with explicit panes makes the trail legible and — if you screenshot or export — reviewable.

The grid setup

A 2x2 grid. Top-left, top-right, and bottom-left each run Claude Code (or whichever agent you default to). Bottom-right is a shell pane for grep, log tailing, and running the actual failing test or reproducer. You give each of the three agent panes the exact same bug report, but ask each to pursue a different hypothesis.

This is an opinion piece, not a how-to. The setup is simple; the discipline is hard. The hard part is resisting the urge to tell the agents the answer. If you already knew the answer, you wouldn't be debugging. The three-pane layout is there to force you to explore, not confirm.

Step by step

  1. Open a new space at the repo root — say, ~/code/payment-worker. Pick the 2x2 preset.
  2. Write the bug report once, in a scratch file, and paste it into each agent pane identically. A usable bug report is specific: "On production, process_refund/2 returns {:error, :stale_token} roughly 1 in 200 times. Logs attached at /tmp/stale_token.log. It does not reproduce locally. Started Tuesday after the oban upgrade."
  3. In pane 1, after pasting the report, add: "Hypothesis A: the worker is using a cached auth token that expired mid-job. Investigate the token refresh logic in lib/auth/ and propose a repro."
  4. In pane 2: "Hypothesis B: there is a race between two workers picking up the same refund job. Investigate queue uniqueness in lib/workers/refund.ex and config/oban.exs."
  5. In pane 3: "Hypothesis C: the upstream payment API is returning transient 401s that we're misclassifying. Investigate the HTTP client wrapper in lib/payments/client.ex."
  6. While the three agents investigate, use pane 4 (shell) to grep the attached log: rg 'stale_token' /tmp/stale_token.log | head -n 50. Share anything interesting back to whichever agent's hypothesis it confirms or kills.
  7. Agents will typically return one of three things: (a) "I found the bug, here's the fix," (b) "this hypothesis doesn't hold because X," (c) vague hand-waving. Treat (a) with suspicion, (b) as progress, (c) as a signal to give more context.
  8. When one agent produces a concrete repro — "I can trigger this by calling MyMod.refresh/0 twice within a 100ms window while a worker is running" — stop the other two panes. You have the answer. Now focus.
  9. In pane 4 (shell), run the repro. Confirm it actually reproduces. Only then ask the agent to write the fix.
  10. Write a one-paragraph post-mortem into a file at the repo root before closing the space. Future you will thank current you when the bug recurs.
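
The log triage in step 6 can be sketched as a few shell one-liners. The log path and line format below are invented for illustration (the script fabricates a tiny sample log first, so the commands run as-is); swap grep for rg if that's what you have in pane 4.

```shell
# Log triage sketch for step 6. The log path and line format are
# invented for illustration; a tiny sample log is fabricated first so
# the commands below run as-is.
LOG=/tmp/stale_token.log
cat > "$LOG" <<'EOF'
2026-04-14T09:01:12Z worker=7 job=refund-113 ok
2026-04-14T09:01:13Z worker=3 job=refund-114 error=stale_token
2026-04-14T09:01:14Z worker=3 job=refund-115 ok
2026-04-14T09:02:41Z worker=9 job=refund-139 error=stale_token
EOF

# How often does it fire?
grep -c 'stale_token' "$LOG"

# Which workers hit it? A single repeat offender points at hypothesis B.
grep 'stale_token' "$LOG" | grep -o 'worker=[0-9]*' | sort | uniq -c

# When? Timestamps clustered at fixed intervals hint at token expiry (A).
grep 'stale_token' "$LOG" | cut -d' ' -f1 | head -n 50
```

Whatever these turn up, paste it into the pane whose hypothesis it supports or kills; that is the whole job of the shell pane.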

What this unlocks

Breadth without extra wall-clock cost. Three hypotheses investigated in forty minutes instead of one hypothesis per forty minutes. For hard bugs, this is the difference between fixing it today and fixing it tomorrow.

Fewer dead-end spirals. When agent 1 is stuck, agent 2 might have already killed its hypothesis and freed up your attention. You're not emotionally invested in any one hypothesis because two others are always being pursued in parallel.

A debug log by default. Three panes, three transcripts, one shell history. When the dust settles you have a rough record of what was tried, what failed, and what worked. Copy the transcripts into the post-mortem.
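
The post-mortem from step 10 can be as small as a heredoc from the shell pane before you close the space. The filename and fields here are one convention, not a rule:

```shell
# Post-mortem stub (step 10), written from the shell pane before the
# space is closed. Filename and fields are one convention, not a rule.
cat > POSTMORTEM-stale-token.md <<'EOF'
# stale_token post-mortem

- Bug: process_refund/2 returned {:error, :stale_token} ~1 in 200 on prod.
- Hypothesis A (cached token expired mid-job): <held / died, and why>
- Hypothesis B (duplicate job race): <held / died, and why>
- Hypothesis C (transient 401 misclassified): <held / died, and why>
- Repro: <exact command>
- Fix: <commit sha>
EOF
```

Filling the three hypothesis lines is where the transcripts pay off: copy the one sentence from each pane that settled its hypothesis.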

Cheaper context escalation. If the agent in one pane needs a stack trace or log file, you drop it in with a quick paste. The other two panes keep working. No context-switch penalty.

Variations

Single agent, three ideas. If you only have one AI subscription, use one pane and explicitly label the hypotheses in a single prompt: "Investigate hypothesis A, B, and C in order. Stop when one holds." Slower than the three-pane version, faster than no plan at all.

Repro farm. Four panes, each running a variation of the failing test with different inputs or load conditions. Useful for race conditions that only manifest under specific concurrency. No agent in any pane — just tests. When one fires, you've found your repro.
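
A repro-farm pane can be driven by a dumb loop: sweep load levels, hammer the test, stop at the first failure. In this sketch, run_repro is a stand-in for your real failing test (something like mix test test/workers/refund_test.exs), and the "race fires at 8 workers" behaviour is simulated so the loop runs on its own:

```shell
# Repro-farm sketch: sweep concurrency levels and hammer a flaky check
# until it fires. run_repro is a stand-in for the real failing test
# (e.g. `mix test test/workers/refund_test.exs`); here it simulates a
# race that only manifests at 8 concurrent workers.
run_repro() {
  [ "$1" -ne 8 ]  # "fails" (returns non-zero) only when concurrency is 8
}

found=""
for concurrency in 2 4 8 16; do
  for attempt in 1 2 3 4 5; do
    if ! run_repro "$concurrency"; then
      found="concurrency=$concurrency attempt=$attempt"
      break 2
    fi
  done
done
echo "repro: ${found:-none this batch}"
```

In the four-pane version, each pane runs one such loop with a different sweep (inputs, seeds, pool sizes), and the first pane to print a hit wins.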

Pair-debug with a human. 2x2 grid in a screen-share. You drive Claude in pane 1, your colleague drives Claude in pane 2, each pursuing a different hypothesis. Pane 3 is shared git/shell. Pane 4 is the error log streaming live. Works well for pair-programmers already used to screen-sharing.

Caveats

This approach is expensive for easy bugs. If the bug is a typo or a missing null check, you don't need three hypotheses — you need one agent and a five-minute fix. Reserve the 2x2 debug grid for bugs that have already defeated one attempt.

Agents lie confidently. An AI pane will sometimes declare "I found the root cause" when it found a symptom. The shell pane and a real repro are not optional. No fix gets merged without a reproducing test.

Hypothesis generation is still human work. The agents explore hypotheses you give them; they do not reliably generate good hypotheses on their own. If you can't name three plausible causes, you need to spend more time reading the logs before opening the grid.

FAQ

What if all three hypotheses are wrong? Good signal. It means your initial read of the bug was off. Close the agents, go back to the logs, form three new hypotheses, and reopen. Iterating on hypothesis generation is faster in this framework than in a single terminal.

Should the three panes all use the same model? Mostly yes — for debugging, you want to isolate the hypothesis as the variable, not the model. That said, occasionally running one pane with a different model surfaces different angles. Treat it as a diagnostic tool, not a default.

Can I share the grid screenshot in a PR as evidence? Yes. A screenshot of three panes each ruling out a hypothesis is a compact form of "here is how I diagnosed this." Some teams find it more persuasive than a written post-mortem.

