
Claude Code Mastery · Part 9 of 12

Testing and Debugging

Letting Claude Code own the entire test loop. Including the parts that make engineers nervous: regressions, flakies, integration tests, and the stack-trace whisperer.

Testing is where Claude Code earns its keep. Debugging is where it gets weirdly impressive.

But it is also where a sloppy workflow goes sideways fast. Let an agent "fix" a test the wrong way and you ship a green CI with a broken feature. Let it "debug" a flaky test and it might just delete the test.

Here are the patterns that work — and the rule that keeps them honest.

The one rule

The agent that writes a test never edits production code, and the agent that makes tests pass never edits the tests. Operationalise this with two sub-agents that cannot be the same agent:

  • test-writer — only adds or modifies tests.
  • test-fixer — only modifies production code to make tests pass.

If test-fixer ever wants to touch a test, it must escalate to a human. That is the contract.
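For reference, a hedged sketch of what the test-writer definition might look like. Claude Code reads sub-agents from markdown files with a short YAML frontmatter block; the file path follows the `.claude/agents/` convention used later in this article, and the wording itself is illustrative, not a canonical template:

```markdown
---
name: test-writer
description: Adds or modifies tests only. Never touches production code.
---
You write and edit test files exclusively. You never modify production
code. If a test cannot be made meaningful without changing production
code, stop and report what is missing instead of changing it yourself.
All new tests must start RED against the current implementation.
```

The mirror-image test-fixer.md carries the opposite constraint, plus the escalation rule above.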

Pattern 1 — Test-first delegation

For new features, this is the cleanest workflow:

> /agents test-writer
> Goal: Write vitest cases for lib/cache.ts covering:
>   - LRU eviction at maxSize=3
>   - TTL expiry under fake timers
>   - get(missing) returns undefined
> Constraints: vitest, fake timers via vi.useFakeTimers().
> DoD: Tests written, all currently RED.

The output is failing tests. That is correct.

> /agents test-fixer
> Make the new tests pass without modifying tests/cache.test.ts.

test-fixer implements the LRU + TTL in lib/cache.ts. Tests go green. Diff is small, focused, reviewable.

This is much cleaner than "write the cache and the tests in one go" because the test author is blind to the implementation. It catches API mistakes early.
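A hedged sketch of what test-fixer might land in lib/cache.ts for that brief (the class and field names are mine, not from the article). A plain Map doubles as the recency list because it preserves insertion order:

```typescript
// Illustrative LRU + TTL cache satisfying the test-writer brief above.
class LRUCache<K, V> {
  private entries = new Map<K, { value: V; expiresAt: number }>();

  constructor(private maxSize: number, private ttlMs: number) {}

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined; // get(missing) -> undefined
    if (Date.now() >= entry.expiresAt) {       // TTL expiry
      this.entries.delete(key);
      return undefined;
    }
    // Re-insert to mark as most recently used (Map keeps insertion order).
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    if (this.entries.size > this.maxSize) {
      // Evict least recently used: the first key in insertion order.
      const lru = this.entries.keys().next().value as K;
      this.entries.delete(lru);
    }
  }
}
```

Re-inserting on get is what turns a plain Map into an LRU; forgetting that step is exactly the bug the test-writer's eviction case catches.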

Pattern 2 — Stack-trace whisperer

When something explodes in production logs, paste the stack trace. Claude Code is unreasonably good at this.

> Read this stack trace and tell me which of the three most likely causes is correct,
> with file:line evidence:
>
> [paste stack]
>
> Then propose a 5-line fix that addresses ONLY that cause.

Two notes:

  • "Three most likely causes" forces it to enumerate, not over-commit.
  • "5-line fix" caps the blast radius.

I find ~70% of "what is happening here?" debugging sessions end after this single round.

Pattern 3 — Flaky test triage

Flaky tests are the bane of CI. The agent is great at categorising them:

> Run the test suite 10 times.
> List every test that failed at least once and passed at least once,
> with the exact error message from each failing run.
> Do NOT modify any code.

You get back a triage table. From there, you ask:

> For test X (flaky), classify the likely cause:
>   - Race condition
>   - Time-dependent (real clock vs fake clock)
>   - Network / external dependency
>   - Resource leak from a previous test
> Cite the file:line evidence.

The agent's classification is rarely 100% right but it is almost always "right neighborhood." That is enough to point your investigation in the right direction.
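Of those causes, time-dependence is the one a small refactor usually fixes outright: stop reading the real clock directly and inject it. A minimal sketch (the function and names are hypothetical, not from the article):

```typescript
// Hypothetical flaky pattern: code that calls Date.now() directly is
// time-dependent. Injecting a clock makes the behaviour deterministic.
type Clock = () => number;

function isTokenFresh(
  issuedAtMs: number,
  ttlMs: number,
  now: Clock = Date.now // production default: the real clock
): boolean {
  return now() - issuedAtMs < ttlMs;
}

// A controllable "fake timer" for tests: no framework needed for the sketch.
let fakeNowMs = 1_000_000;
const fakeClock: Clock = () => fakeNowMs;
```

The same idea is what vi.useFakeTimers() gives you inside vitest; the injectable-clock version just makes the dependency explicit in the code itself.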

Pattern 4 — Bisecting a regression

You have git bisect and the agent has git log. Combine them:

> The endpoint /api/x returned 200 last week and now returns 500.
> Find the offending commit by:
>   1. git log --oneline since last good deploy.
>   2. For each commit touching /api/x or its imports, summarise the change in one line.
>   3. Identify the most likely culprit and explain why.
>   4. Do NOT run any code.

This is faster than a binary git bisect because the agent reads diffs as it goes. Best for codebases small enough that the candidate set is < 50 commits.

Pattern 5 — Coverage as a coaching loop

This one is underrated:

> Run pnpm test --coverage on lib/auth/.
> Report uncovered lines.
> For each uncovered branch, propose a single-line description of a test
> that would cover it. Do NOT write the tests.

You get a punch list. Then test-writer knocks them out one by one.

The reason this is better than "achieve 90% coverage automatically" is that you see the punch list and can deprioritise the silly branches (impossible-to-hit error paths, exhaustive-switch defaults).
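To make the punch list concrete, here is a hypothetical lib/auth/ fragment with the two branches a coverage run would flag; the code and the proposed one-liners are illustrative, not from a real run:

```typescript
// Hypothetical session parser with two branches coverage would report.
function parseSession(raw: string): { userId: string } | null {
  try {
    const data = JSON.parse(raw);
    if (typeof data.userId !== "string") return null; // uncovered: bad shape
    return { userId: data.userId };
  } catch {
    return null; // uncovered: malformed JSON
  }
}

// Punch-list lines the agent might hand to test-writer:
// - "parseSession returns null when userId is missing or not a string"
// - "parseSession returns null on malformed JSON input"
```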

Things to never delegate in testing

  • Deciding what to test. That is product knowledge.
  • Deciding when "good enough" is good enough. Coverage thresholds are a values call.
  • Marking a test as "skip" or "todo". Always human. Always.

The agent suggests; the human decides what counts as ready.

Debugging anti-patterns I see weekly

  • "Just make the test pass." Catastrophic. Always. The reviewer will catch it; the agent should not be asked.
  • "Add // @ts-ignore to silence the build." Sub-agent should refuse. Configure it that way in .claude/agents/test-fixer.md:
# rules
- Never add @ts-ignore, @ts-expect-error, or eslint-disable.
- If you cannot fix a type error, escalate.
  • "Update the snapshot." Sometimes legitimate, but the agent should always show the diff and explain why the snapshot changed before updating it.

The result, after a quarter of using this

On the codebase I've been running this on for ~3 months:

  • Mean time to triage a flaky CI: ~12 min → ~3 min.
  • Mean time to debug a stack trace from prod: ~20 min → ~5 min.
  • Test coverage delta on new features: noisy, but average +8 percentage points.

None of those numbers are the agent being magic. They are the result of standardising the testing workflow with constraints you could never reliably enforce on human engineers.


Next article: Team Workflows — how engineering teams are integrating Claude Code today, from the "one tool license" anti-pattern to the shared .claude/ git pattern that scales.
