159 lines
7.1 KiB
Markdown
159 lines
7.1 KiB
Markdown
---
|
|
name: investigate
|
|
description: "Systematic debugging with root cause investigation. Four phases: investigate, analyze, hypothesize, implement. Iron Law: no fixes without root cause. Use when asked to "debug this", "fix this bug", "w"
|
|
---
|
|
|
|
---
|
|
|
|
## Phase 1: Root Cause Investigation
|
|
|
|
Gather context before forming any hypothesis.
|
|
|
|
1. **Collect symptoms:** Read the error messages, stack traces, and reproduction steps. If the user hasn't provided enough context, ask ONE question at a time via question.
|
|
|
|
2. **Read the code:** Trace the code path from the symptom back to potential causes. Use Grep to find all references, Read to understand the logic.
|
|
|
|
3. **Check recent changes:**
|
|
```bash
|
|
git log --oneline -20 -- <affected-files>
|
|
```
|
|
Was this working before? What changed? A regression means the root cause is in the diff.
|
|
|
|
4. **Reproduce:** Can you trigger the bug deterministically? If not, gather more evidence before proceeding.
|
|
|
|
Output: **"Root cause hypothesis: ..."** — a specific, testable claim about what is wrong and why.
|
|
|
|
---
|
|
|
|
## Scope Lock
|
|
|
|
After forming your root cause hypothesis, lock edits to the affected module to prevent scope creep.
|
|
|
|
```bash
|
|
[ -x "${GSTACK_OPENCODE_DIR}/../freeze/bin/check-freeze.sh" ] && echo "FREEZE_AVAILABLE" || echo "FREEZE_UNAVAILABLE"
|
|
```
|
|
|
|
**If FREEZE_AVAILABLE:** Identify the narrowest directory containing the affected files. Write it to the freeze state file:
|
|
|
|
```bash
|
|
STATE_DIR="${CLAUDE_PLUGIN_DATA:-$HOME/.gstack}"
|
|
mkdir -p "$STATE_DIR"
|
|
echo "<detected-directory>/" > "$STATE_DIR/freeze-dir.txt"
|
|
echo "Debug scope locked to: <detected-directory>/"
|
|
```
|
|
|
|
Substitute `<detected-directory>` with the actual directory path (e.g., `src/auth/`). Tell the user: "Edits restricted to `<dir>/` for this debug session. This prevents changes to unrelated code. Run `/unfreeze` to remove the restriction."
|
|
|
|
If the bug spans the entire repo or the scope is genuinely unclear, skip the lock and note why.
|
|
|
|
**If FREEZE_UNAVAILABLE:** Skip scope lock. Edits are unrestricted.
|
|
|
|
---
|
|
|
|
## Phase 2: Pattern Analysis
|
|
|
|
Check if this bug matches a known pattern:
|
|
|
|
| Pattern | Signature | Where to look |
|
|
|---------|-----------|---------------|
|
|
| Race condition | Intermittent, timing-dependent | Concurrent access to shared state |
|
|
| Nil/null propagation | NoMethodError, TypeError | Missing guards on optional values |
|
|
| State corruption | Inconsistent data, partial updates | Transactions, callbacks, hooks |
|
|
| Integration failure | Timeout, unexpected response | External API calls, service boundaries |
|
|
| Configuration drift | Works locally, fails in staging/prod | Env vars, feature flags, DB state |
|
|
| Stale cache | Shows old data, fixes on cache clear | Redis, CDN, browser cache, Turbo |
|
|
|
|
Also check:
|
|
- `TODOS.md` for related known issues
|
|
- `git log` for prior fixes in the same area — **recurring bugs in the same files are an architectural smell**, not a coincidence
|
|
|
|
**External pattern search:** If the bug doesn't match a known pattern above, WebSearch for:
|
|
- "{framework} {generic error type}" — **sanitize first:** strip hostnames, IPs, file paths, SQL, customer data. Search the error category, not the raw message.
|
|
- "{library} {component} known issues"
|
|
|
|
If WebSearch is unavailable, skip this search and proceed with hypothesis testing. If a documented solution or known dependency bug surfaces, present it as a candidate hypothesis in Phase 3.
|
|
|
|
---
|
|
|
|
## Phase 3: Hypothesis Testing
|
|
|
|
Before writing ANY fix, verify your hypothesis.
|
|
|
|
1. **Confirm the hypothesis:** Add a temporary log statement, assertion, or debug output at the suspected root cause. Run the reproduction. Does the evidence match?
|
|
|
|
2. **If the hypothesis is wrong:** Before forming the next hypothesis, consider searching for the error. **Sanitize first** — strip hostnames, IPs, file paths, SQL fragments, customer identifiers, and any internal/proprietary data from the error message. Search only the generic error type and framework context: "{component} {sanitized error type} {framework version}". If the error message is too specific to sanitize safely, skip the search. If WebSearch is unavailable, skip and proceed. Then return to Phase 1. Gather more evidence. Do not guess.
|
|
|
|
3. **3-strike rule:** If 3 hypotheses fail, **STOP**. Use question:
|
|
```
|
|
3 hypotheses tested, none match. This may be an architectural issue
|
|
rather than a simple bug.
|
|
|
|
A) Continue investigating — I have a new hypothesis: [describe]
|
|
B) Escalate for human review — this needs someone who knows the system
|
|
C) Add logging and wait — instrument the area and catch it next time
|
|
```
|
|
|
|
**Red flags** — if you see any of these, slow down:
|
|
- "Quick fix for now" — there is no "for now." Fix it right or escalate.
|
|
- Proposing a fix before tracing data flow — you're guessing.
|
|
- Each fix reveals a new problem elsewhere — wrong layer, not wrong code.
|
|
|
|
---
|
|
|
|
## Phase 4: Implementation
|
|
|
|
Once root cause is confirmed:
|
|
|
|
1. **Fix the root cause, not the symptom.** The smallest change that eliminates the actual problem.
|
|
|
|
2. **Minimal diff:** Fewest files touched, fewest lines changed. Resist the urge to refactor adjacent code.
|
|
|
|
3. **Write a regression test** that:
|
|
- **Fails** without the fix (proves the test is meaningful)
|
|
- **Passes** with the fix (proves the fix works)
|
|
|
|
4. **Run the full test suite.** Paste the output. No regressions allowed.
|
|
|
|
5. **If the fix touches >5 files:** Use question to flag the blast radius:
|
|
```
|
|
This fix touches N files. That's a large blast radius for a bug fix.
|
|
A) Proceed — the root cause genuinely spans these files
|
|
B) Split — fix the critical path now, defer the rest
|
|
C) Rethink — maybe there's a more targeted approach
|
|
```
|
|
|
|
---
|
|
|
|
## Phase 5: Verification & Report
|
|
|
|
**Fresh verification:** Reproduce the original bug scenario and confirm it's fixed. This is not optional.
|
|
|
|
Run the test suite and paste the output.
|
|
|
|
Output a structured debug report:
|
|
```
|
|
DEBUG REPORT
|
|
════════════════════════════════════════
|
|
Symptom: [what the user observed]
|
|
Root cause: [what was actually wrong]
|
|
Fix: [what was changed, with file:line references]
|
|
Evidence: [test output, reproduction attempt showing fix works]
|
|
Regression test: [file:line of the new test]
|
|
Related: [TODOS.md items, prior bugs in same area, architectural notes]
|
|
Status: DONE | DONE_WITH_CONCERNS | BLOCKED
|
|
════════════════════════════════════════
|
|
```
|
|
|
|
---
|
|
|
|
## Important Rules
|
|
|
|
- **3+ failed fix attempts → STOP and question the architecture.** Wrong architecture, not failed hypothesis.
|
|
- **Never apply a fix you cannot verify.** If you can't reproduce and confirm, don't ship it.
|
|
- **Never say "this should fix it."** Verify and prove it. Run the tests.
|
|
- **If fix touches >5 files → question** about blast radius before proceeding.
|
|
- **Completion status:**
|
|
- DONE — root cause found, fix applied, regression test written, all tests pass
|
|
- DONE_WITH_CONCERNS — fixed but cannot fully verify (e.g., intermittent bug, requires staging)
|
|
- BLOCKED — root cause unclear after investigation, escalated
|