| Grader | Type | Quality |
|---|---|---|
| Ticket classifier accuracy | Exact Match | Reliable |
| Routing decision evaluator | LLM Grader | Reliable |
| PII redaction checker | Regex + Rule | Reliable |
| Escalation rationale scorer | LLM Grader | Questionable |
| Response tone evaluator | LLM Grader | Questionable |

| Anti-pattern | Status | Evidence |
|---|---|---|
| Golden dataset too small | Partial | 22 cases cover core paths but edge categories (billing disputes, multi-issue tickets) have fewer than 3 examples each. |
| Grader judges its own output | Present | The escalation rationale grader uses the same model family (claude-3) as the agent itself, which creates a grader bias risk for subjective quality judgements. |
| No adversarial fixtures | Present | All 22 golden cases are well-formed, cooperative inputs. No tests for jailbreak attempts, instruction injection in ticket body, or adversarial category confusion. |
| Coverage metric disguises dimension gaps | Present | Reporting "22 passing evals" hides that Trajectory Quality, Efficiency, and Collaboration have zero coverage. A single number obscures three absent dimensions. |
| Eval suite not run in CI | Absent | Eval suite is wired into the deployment pipeline and runs on every PR that modifies agent prompt, routing logic, or tool schema. |
| Static golden dataset | Absent | Dataset refresh process documented; new production tickets are sampled weekly and reviewed for golden-case candidates. |

The Support Triage Agent eval suite demonstrates solid core coverage for ticket classification and routing but has three production-blocking gaps: no trajectory quality tests, no adversarial safety fixtures, and no handoff quality evaluation. Before promoting the agent to production, resolve the P1 gaps and re-run the coverage assessment. The grader bias issue (the same model family grades escalation rationale) should be addressed in parallel. Once the P1 gaps are closed, coverage will reach at least 5/6 dimensions and the suite will be production-ready.
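
One way to keep a single pass count from hiding absent dimensions is to report coverage per dimension, including dimensions with zero cases. The sketch below is illustrative only. The `EvalCase` type, the helper names, the three dimension names marked as hypothetical, and the 12/6/4 split of the 22 cases are assumptions, not taken from the actual suite. Only Trajectory Quality, Efficiency, and Collaboration are named in this report.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One golden case, tagged with the eval dimension it exercises (hypothetical type)."""
    case_id: str
    dimension: str
    passed: bool


def coverage_by_dimension(cases, dimensions):
    """Return {dimension: (passed, total)} for every expected dimension.

    Dimensions with no cases are kept in the result as (0, 0) so that
    gaps stay visible instead of silently dropping out of the report.
    """
    totals = Counter(c.dimension for c in cases)
    passes = Counter(c.dimension for c in cases if c.passed)
    return {d: (passes[d], totals[d]) for d in dimensions}


def uncovered(report):
    """List the dimensions that have zero eval cases."""
    return [d for d, (_, total) in report.items() if total == 0]


# Six dimensions. The first three names are placeholders (assumption);
# the last three are the ones this report lists as having zero coverage.
DIMENSIONS = [
    "Outcome Correctness",  # hypothetical name
    "Safety",               # hypothetical name
    "Robustness",           # hypothetical name
    "Trajectory Quality",
    "Efficiency",
    "Collaboration",
]

# Illustrative split of 22 passing cases (assumption, not real suite data).
split = {"Outcome Correctness": 12, "Safety": 6, "Robustness": 4}
cases = [
    EvalCase(f"{dim}-{i:02d}", dim, True)
    for dim, count in split.items()
    for i in range(count)
]

report = coverage_by_dimension(cases, DIMENSIONS)
for dim, (passed, total) in report.items():
    print(f"{dim}: {passed}/{total} passing")
print("Uncovered dimensions:", uncovered(report))
```

With a report shaped like this, a CI gate could fail whenever `uncovered(report)` is non-empty, rather than relying on the total pass count alone.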