
Claude Code Mastery · Part 9 of 12

Testing and Debugging

Letting Claude Code own the entire test loop. Including the parts that make engineers nervous: regressions, flakies, integration tests, and the stack-trace whisperer.

Testing is where Claude Code earns its keep. Debugging is where it gets weirdly impressive.

But it is also where a sloppy workflow goes sideways fast. Let an agent "fix" a test the wrong way and you ship a green CI with a broken feature. Let it "debug" a flaky test and it might just delete the test.

Here are the patterns that work — and the rule that keeps them honest.

The one rule

The agent that writes a test never edits production code, and the agent that makes tests pass never edits the tests. Operationalise this with two sub-agents that cannot be the same agent:

  • test-writer — only adds or modifies tests.
  • test-fixer — only modifies production code to make tests pass.

If test-fixer ever wants to touch a test, it must escalate to a human. That is the contract.
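For reference, a hedged sketch of what the test-writer definition might look like. Claude Code reads sub-agents from markdown files with a short YAML frontmatter block; the file path follows the `.claude/agents/` convention used later in this article, and the wording itself is illustrative, not a canonical template:

```markdown
---
name: test-writer
description: Adds or modifies tests only. Never touches production code.
---
You write and edit test files exclusively. You never modify production
code. If a test cannot be made meaningful without changing production
code, stop and report what is missing instead of changing it yourself.
All new tests must start RED against the current implementation.
```

The mirror-image test-fixer.md carries the opposite constraint, plus the escalation rule above.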

Pattern 1 — Test-first delegation

For new features, this is the cleanest workflow:

> /agents test-writer
> Goal: Write vitest cases for lib/cache.ts covering:
>   - LRU eviction at maxSize=3
>   - TTL expiry under fake timers
>   - get(missing) returns undefined
> Constraints: vitest, fake timers via vi.useFakeTimers().
> DoD: Tests written, all currently RED.

The output is failing tests. That is correct.

> /agents test-fixer
> Make the new tests pass without modifying tests/cache.test.ts.

test-fixer implements the LRU + TTL in lib/cache.ts. Tests go green. Diff is small, focused, reviewable.

This is much cleaner than "write the cache and the tests in one go" because the test author is blind to the implementation. It catches API mistakes early.
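A hedged sketch of what test-fixer might land in lib/cache.ts for that brief (the class and field names are mine, not from the article). A plain Map doubles as the recency list because it preserves insertion order:

```typescript
// Illustrative LRU + TTL cache satisfying the test-writer brief above.
class LRUCache<K, V> {
  private entries = new Map<K, { value: V; expiresAt: number }>();

  constructor(private maxSize: number, private ttlMs: number) {}

  get(key: K): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined; // get(missing) -> undefined
    if (Date.now() >= entry.expiresAt) {       // TTL expiry
      this.entries.delete(key);
      return undefined;
    }
    // Re-insert to mark as most recently used (Map keeps insertion order).
    this.entries.delete(key);
    this.entries.set(key, entry);
    return entry.value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, { value, expiresAt: Date.now() + this.ttlMs });
    if (this.entries.size > this.maxSize) {
      // Evict least recently used: the first key in insertion order.
      const lru = this.entries.keys().next().value as K;
      this.entries.delete(lru);
    }
  }
}
```

Re-inserting on get is what turns a plain Map into an LRU; forgetting that step is exactly the bug the test-writer's eviction case catches.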

Pattern 2 — Stack-trace whisperer

When something explodes in production logs, paste the stack trace. Claude Code is unreasonably good at this.

> Read this stack trace and tell me which of the three most likely causes is correct,
> with file:line evidence:
>
> [paste stack]
>
> Then propose a 5-line fix that addresses ONLY that cause.

Two notes:

  • "Three most likely causes" forces it to enumerate, not over-commit.
  • "5-line fix" caps the blast radius.

I find ~70% of "what is happening here?" debugging sessions end after this single round.

Pattern 3 — Flaky test triage

Flaky tests are the bane of CI. The agent is great at categorising them:

> Run the test suite 10 times.
> List every test that failed at least once and passed at least once,
> with the exact error message from each failing run.
> Do NOT modify any code.

You get back a triage table. From there, you ask:

> For test X (flaky), classify the likely cause:
>   - Race condition
>   - Time-dependent (real clock vs fake clock)
>   - Network / external dependency
>   - Resource leak from a previous test
> Cite the file:line evidence.

The agent's classification is rarely 100% right but it is almost always "right neighborhood." That is enough to point your investigation in the right direction.
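Of those causes, time-dependence is the one a small refactor usually fixes outright: stop reading the real clock directly and inject it. A minimal sketch (the function and names are hypothetical, not from the article):

```typescript
// Hypothetical flaky pattern: code that calls Date.now() directly is
// time-dependent. Injecting a clock makes the behaviour deterministic.
type Clock = () => number;

function isTokenFresh(
  issuedAtMs: number,
  ttlMs: number,
  now: Clock = Date.now // production default: the real clock
): boolean {
  return now() - issuedAtMs < ttlMs;
}

// A controllable "fake timer" for tests: no framework needed for the sketch.
let fakeNowMs = 1_000_000;
const fakeClock: Clock = () => fakeNowMs;
```

The same idea is what vi.useFakeTimers() gives you inside vitest; the injectable-clock version just makes the dependency explicit in the code itself.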

Pattern 4 — Bisecting a regression

You have git bisect and the agent has git log. Combine them:

> The endpoint /api/x returned 200 last week and now returns 500.
> Find the offending commit by:
>   1. git log --oneline since last good deploy.
>   2. For each commit touching /api/x or its imports, summarise the change in one line.
>   3. Identify the most likely culprit and explain why.
>   4. Do NOT run any code.

This is faster than a binary git bisect because the agent reads diffs as it goes. Best for codebases small enough that the candidate set is < 50 commits.

Pattern 5 — Coverage as a coaching loop

This one is underrated:

> Run pnpm test --coverage on lib/auth/.
> Report uncovered lines.
> For each uncovered branch, propose a single-line description of a test
> that would cover it. Do NOT write the tests.

You get a punch list. Then test-writer knocks them out one by one.

The reason this is better than "achieve 90% coverage automatically" is that you see the punch list and can deprioritise the silly branches (impossible-to-hit error paths, exhaustive-switch defaults).
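To make the punch list concrete, here is a hypothetical lib/auth/ fragment with the two branches a coverage run would flag; the code and the proposed one-liners are illustrative, not from a real run:

```typescript
// Hypothetical session parser with two branches coverage would report.
function parseSession(raw: string): { userId: string } | null {
  try {
    const data = JSON.parse(raw);
    if (typeof data.userId !== "string") return null; // uncovered: bad shape
    return { userId: data.userId };
  } catch {
    return null; // uncovered: malformed JSON
  }
}

// Punch-list lines the agent might hand to test-writer:
// - "parseSession returns null when userId is missing or not a string"
// - "parseSession returns null on malformed JSON input"
```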

Things to never delegate in testing

  • Deciding what to test. That is product knowledge.
  • Deciding when "good enough" is good enough. Coverage thresholds are a values call.
  • Marking a test as "skip" or "todo". Always human. Always.

The agent suggests; the human decides what counts as ready.

Debugging anti-patterns I see weekly

  • "Just make the test pass." Catastrophic. Always. The reviewer will catch it; the agent should not be asked.
  • "Add // @ts-ignore to silence the build." Sub-agent should refuse. Configure it that way in .claude/agents/test-fixer.md:
# rules
- Never add @ts-ignore, @ts-expect-error, or eslint-disable.
- If you cannot fix a type error, escalate.
  • "Update the snapshot." Sometimes legitimate, but the agent should always show the diff and explain why the snapshot changed before updating it.

The result, after a quarter of using this

On the codebase I've been running this on for ~3 months:

  • Mean time to triage a flaky CI: ~12 min → ~3 min.
  • Mean time to debug a stack trace from prod: ~20 min → ~5 min.
  • Test coverage delta on new features: noisy, but average +8 percentage points.

None of those numbers are the agent being magic. They are the result of standardising the testing workflow with constraints you could never reliably enforce on human engineers.


Next article: Team Workflows — how engineering teams are integrating Claude Code today, from the "one tool license" anti-pattern to the shared .claude/ git pattern that scales.
