# Testing, Debugging, and Updating Agents
Tune delegation behavior, debug agent underperformance, distinguish designer failures from runtime failures, and update existing agents without drift.
Agent engineering doesn't stop at writing the first prompt. This page covers how to tune delegation, debug underperformance, and update agents while preserving their original intent.
## Tuning delegation behavior
### If delegation is too frequent
- Narrow the description triggers
- Add explicit exclusions ("Do not use for X")
- Add or strengthen a near-miss `<example>` block with clear commentary explaining why it should not delegate (see the sketch below)
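
For example, a narrowed description with an explicit exclusion and a near-miss example might look like the sketch below. The agent name, triggers, and frontmatter fields are placeholders; match the description and `<example>` conventions your agent files already use.

```markdown
---
# hypothetical agent; name, triggers, and tools are placeholders
name: security-reviewer
description: >-
  Use for reviewing authentication, authorization, and input-validation changes.
  Do not use for general code style or formatting feedback.
  <example>
  Context: The user asks for a mechanical rename inside auth middleware.
  user: "Rename tkn to token in middleware/auth.ts"
  assistant: "This is a pure rename with no security impact, so I'll make the
  change directly rather than delegating to security-reviewer."
  <commentary>
  Near-miss: the file touches auth code, but the change has no security surface,
  so the agent should not fire.
  </commentary>
  </example>
tools: Read, Grep, Glob
---
```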
### If delegation never happens
- Add concrete keywords and file patterns to the description
- Add or strengthen `<example>` blocks (aim for 2-4 total; see the sketch below)
- Use "use proactively..." phrasing if you want aggressive delegation
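
A sketch of a description tuned for more aggressive delegation, assuming the same frontmatter and `<example>` format as above (the agent name, keywords, and file patterns are placeholders):

```markdown
---
# hypothetical agent; keywords and paths are placeholders
name: migration-checker
description: >-
  Use proactively whenever database migrations change. Triggers: files under
  migrations/ or db/schema/, keywords "migration", "ALTER TABLE", "schema change".
  <example>
  Context: The user edited migrations/0042_add_orders_table.sql.
  user: "I added the orders table migration, can you sanity-check it?"
  assistant: "A migration file changed, so I'll use the migration-checker agent
  to review it."
  <commentary>
  Delegate even though the user did not name the agent: a file pattern trigger matched.
  </commentary>
  </example>
tools: Read, Grep, Glob
---
```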
### If output is too verbose
- Strengthen the output contract with explicit verbosity limits
- Add a "verbosity limit" rule (e.g., "max 1-2 screens unless asked")
### If it misses key checks
- Add a required checklist item ("Always run X", "Always verify Y")
- Make the check explicit in the workflow section
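
One way to make a missed check explicit is a short required-checks block in the workflow section. The specific checks below are placeholders; substitute the ones the agent keeps skipping:

```markdown
## Required checks
Before returning results, you must:
- Run the test suite and report the exact command and its outcome.
- Verify that no public API signatures changed.
- Confirm the project still builds.
```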
### If it overreaches or changes too much
- Tighten scope and non-goals
- Reduce tool access
- Consider using plan mode or reviewer pattern instead
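
Tightening scope and reducing tool access can be done together. A minimal sketch, assuming a read-only reviewer (agent name, tool list, and section names are placeholders):

```markdown
---
# hypothetical read-only agent; it reports findings but cannot modify files
name: doc-reviewer
description: Use for reviewing documentation changes under docs/.
tools: Read, Grep, Glob  # no Edit or Bash
---

## Scope
Review documentation for accuracy, stale references, and broken links.

## Non-goals
- Do not edit files; report findings only.
- Do not review application code.
```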
## Diagnosing underperformance
When an agent fails, first identify whether it's a designer problem or a runtime problem:
| Category | What it means | Examples | Fix |
|---|---|---|---|
| Designer failure | The prompt has issues | Ambiguous instructions; missing escalation rules; no output contract | Edit the agent prompt |
| Runtime failure | The prompt is sound but the model misapplies it | Model ignores clear instruction; hallucinates despite guardrails | Strengthen emphasis, add examples, or accept limitation |
Rule of thumb: If you can imagine a careful human reader misinterpreting the prompt the same way the model did, it's a designer failure.
### Diagnostic questions
- If you gave this prompt to a different model, would it interpret it the same way?
- Is each instruction specific enough that it can only be read one way?
- Does the agent have enough context to judge edge cases?
- Is the output contract specific enough to know when output is "good"?
If any answer is "no," it's likely a designer failure. Fix the prompt before blaming the model.
### Common designer failure modes
#### 1. Ambiguous instructions
Instructions that could be read two ways — the model picks one interpretation; the user expected the other.
| Ambiguous | Clear |
|---|---|
| "Be thorough" | "Check for security issues, missing error handling, and breaking API changes" |
| "Clean up the code" | "Remove unused imports and dead code paths" |
| "Review for issues" | "Review for security vulnerabilities and correctness bugs" |
Fix: Add clarifying examples, "do X, not Y" constraints, or explicit scope.
#### 2. Missing context assumptions
The prompt assumes context that won't be available at runtime.
| Problem | Fix |
|---|---|
| "Continue from where we left off" | Subagents start fresh — include all needed context in the handoff |
| "Apply the pattern we discussed" | Specify the pattern explicitly |
| "Fix the issue" | Describe the issue in detail |
#### 3. Vague directive strength
Using weak language for requirements or strong language for preferences.
| Problem | Better |
|---|---|
| "Try to run tests" (when tests are required) | "You must run tests before returning" |
| "Consider security" (when security is non-negotiable) | "You must check for security vulnerabilities" |
| "Must follow style guide" (for cosmetic preferences) | "You should follow the style guide when possible" |
#### 4. Missing failure mode awareness
The prompt tells the agent what to do, but not what to watch out for.
Fix: Include 3-5 contextually relevant failure modes from the failure mode catalog.
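
A sketch of such a section; the failure modes listed are placeholders and should be replaced with ones relevant to the agent's actual task:

```markdown
## Failure modes to watch for
- Reporting an issue in code that is never reached; check call sites first.
- Inflating severity: style preferences are not vulnerabilities.
- Stopping at the first finding instead of completing the full review.
- Returning "no issues" without listing what was actually checked.
```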
#### 5. Underspecified output contract
The prompt doesn't define what "good output" looks like.
| Vague | Specific |
|---|---|
| "Return your findings" | "Return findings as a prioritized list with severity, file path, description, and recommendation" |
| "Summarize the issues" | "Return a TL;DR (2-5 bullets) followed by findings sorted by severity" |
#### 6. Missing escalation rules
No guidance on when to ask for help vs proceed with assumptions.
Fix: Add explicit rules: when to ask, when to proceed with labeled assumptions, when to return partial results.
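
A sketch of explicit escalation rules; the thresholds are placeholders to be tuned per agent:

```markdown
## Escalation rules
- Ask first before: deleting files, changing public APIs, or touching auth code.
- Proceed when the decision is low-stakes and reversible, and label it as
  "Assumption: ..." in your output.
- Return partial results when a required file or credential is missing; report
  what you completed and what blocked you.
```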
#### 7. Overloaded scope
Too many instructions competing for attention.
Fix: Prioritize ruthlessly. Move reference material to files loaded on demand. Use "must" for critical items and "should" for nice-to-haves.
#### 8. False dichotomies
Presenting two options when more exist.
| False dichotomy | Better |
|---|---|
| "Either fix the bug or document it" | "Fix, document, or escalate based on severity and confidence" |
| "Ask or assume" | "Ask for high-stakes decisions; proceed with labeled assumptions for low-stakes ones" |
### The designer self-check
Before delivering an agent prompt, verify:
- No instruction could be read two ways in a different context
- No assumed context the agent won't have
- Directive strength is clear throughout ("must"/"should"/"consider" match actual requirements)
- 3-5 relevant failure modes are addressed
- Output contract is specific enough to validate against
- Escalation rules are clear (when to ask, when to proceed, when to return partial results)
- Scope is manageable (agent can hold all critical instructions in effective attention)
## Updating existing agents
When updating an existing agent, the goal is to improve clarity and structure without changing the agent's meaning unless explicitly requested.
### What you can change safely
These changes don't affect meaning, routing, or capability:
- Reformatting (headings, lists, checklists) without changing requirements
- Reducing redundancy and merging duplicated text
- Reordering sections for scannability
- Tightening language where intent is already clear
- Fixing typos, grammar, or formatting inconsistencies
### What requires explicit approval
These changes can affect when the agent fires or how it judges:
- Expanding or narrowing description triggers
- Adding, removing, or modifying `<example>` blocks
- Changing MUST/SHOULD wording or severity levels
- Modifying personality statements or escalation thresholds
- Adding or removing failure modes
### What requires impact analysis
These changes can break downstream consumers:
- Renaming the agent (breaks Task tool references)
- Changing return packet format (breaks orchestrator aggregation)
- Changing severity levels or dedup keys
- Changing from subagent to orchestrator pattern (or vice versa)
### The update workflow
1. Inventory the current agent — read the full file, note frontmatter fields, body sections, and dependencies
2. Capture an intent snapshot — document purpose, routing posture, behavioral calibration, capability surface, and output contract
3. Identify update opportunities — find redundancy, unclear structure, ambiguous instructions, or missing output contracts
4. Classify each change — safe (implement by default), routing/calibration risk (require approval), capability/contract change (require explicit approval), downstream-breaking (require impact analysis)
5. Apply safe changes — implement formatting and clarity improvements
6. Drift check — compare before/after intent snapshots and flag any unintended changes
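
The intent snapshot from step 2 can be a few lines kept next to the diff rather than inside the agent file. One possible shape, with hypothetical values:

```markdown
## Intent snapshot: security-reviewer (before update)
- Purpose: review auth and input-validation changes for vulnerabilities.
- Routing: fires on auth/crypto keywords and files under src/auth/; excluded
  for style-only changes.
- Calibration: MUST run the security checklist; SHOULD suggest fixes; never edits files.
- Capability surface: Read, Grep, Glob (read-only).
- Output contract: severity-sorted findings with file, line, and recommendation.
```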
### Common anti-patterns when updating
| Anti-pattern | Description |
|---|---|
| Oversampling | Adding rules, failure modes, or edge cases "because it seems good" — only add what the author clearly implied |
| Tone normalization | Rewriting personality into your own style, which can shift strictness or escalation thresholds |
| Calibration creep | Subtly shifting MUST → SHOULD or SHOULD → CONSIDER without realizing the behavioral impact |
| Silent routing changes | Expanding description triggers so the agent fires in new contexts without consent |
| Tool accumulation | Adding tools "for convenience" without considering the risk surface |
| Schema drift | Changing return packet structure when an orchestrator depends on it |
| Diff noise | Massive reflow that makes review hard and increases risk of accidental semantic changes |
### Maintenance best practices
- Prefer small, targeted edits based on observed failures
- Keep a changelog in git history rather than inside the agent file
- Test delegation behavior after changes to description or `<example>` blocks
- When multiple failures occur, make targeted fixes rather than shotgun edits across the prompt