agentic-ai-engineering · Eval Scorecard

Support Triage Agent

Generated 2026-05-16 · Skill: agent-eval-design · Command: /agentic-ai-engineering:agentic-evals
Coverage
Partial
Dimensions tested
3 / 6
Task Success and Safety dimensions have reasonable coverage; Trajectory Quality, Efficiency, and Collaboration are absent. Robustness has partial coverage limited to happy-path tool failures. Three high-priority gaps block production readiness.

Six-Dimension Coverage

Tested
Task Success
22 golden cases cover ticket classification, routing decisions, and escalation triggers. LLM grader + exact-match on ticket ID.
Absent
Trajectory Quality
No tests verify the agent takes the minimum-step path or avoids unnecessary tool calls. Trajectory not logged in current test harness.
Partial
Robustness
Tool failure injection covers ticket-lookup and CRM tool. Malformed inputs and ambiguous ticket text not tested.
Partial
Safety
PII redaction validated on 8 fixture tickets. Prompt injection resistance not tested. No adversarial customer-message fixtures.
Absent
Efficiency
No latency budget tests. Token cost per ticket not measured. No budget-exceeded behavior tested.
Absent
Collaboration
Agent-to-human handoff quality not evaluated. No tests for handoff message clarity or context completeness when escalating to Tier-2.

Grader Quality

Grader Type Quality
Ticket classifier accuracy Exact Match Reliable
Routing decision evaluator LLM Grader Reliable
PII redaction checker Regex + Rule Reliable
Escalation rationale scorer LLM Grader Questionable
Response tone evaluator LLM Grader Questionable

Anti-patterns Detected

Anti-pattern Present Evidence
Golden dataset too small Partial 22 cases cover core paths but edge categories (billing disputes, multi-issue tickets) have fewer than 3 examples each.
Grader judges its own output Present Escalation rationale grader uses the same model family (claude-3) as the agent itself — grader bias risk for subjective quality judgements.
No adversarial fixtures Present All 22 golden cases are well-formed, cooperative inputs. No tests for jailbreak attempts, instruction injection in ticket body, or adversarial category confusion.
Coverage metric disguises dimension gaps Present Reporting "22 passing evals" hides that Trajectory Quality, Efficiency, and Collaboration have zero coverage — a single number obscures 3 absent dimensions.
Eval suite not run in CI Absent Eval suite is wired into the deployment pipeline and runs on every PR that modifies agent prompt, routing logic, or tool schema.
Static golden dataset Absent Dataset refresh process documented; new production tickets are sampled weekly and reviewed for golden-case candidates.

Coverage Gaps

P1
No trajectory quality tests
The agent's step count and tool call sequence are not evaluated. An agent that reaches the correct routing decision via an inefficient 12-step path will pass all current evals.
Add trajectory logging to the test harness; write 5 golden trajectories for core ticket types; score step count and unnecessary tool calls.
P1
Adversarial safety fixtures absent
No tests cover prompt injection in ticket body, adversarial category confusion, or attempts to extract internal routing rules via ticket text.
Write 8–10 adversarial fixtures; add a prompt-injection detection grader; target failure-to-detect rate below 5%.
P1
Handoff quality not evaluated
When the triage agent escalates to Tier-2, the handoff message content, context completeness, and recommended priority are not graded. Tier-2 agents receive unvalidated context.
Define handoff message schema; write LLM grader rubric for context completeness; add 6 escalation golden cases.
P2
No efficiency / latency budget tests
Token cost per ticket and p95 latency are unmonitored in evals. Budget-exceeded edge cases (long ticket threads) are not tested.
Instrument token counter in test harness; set budget thresholds (2000 tokens, 8s p95); add 3 long-thread stress fixtures.
P2
Escalation rationale grader uses same model family
Claude-3 grading Claude-3 outputs introduces grader bias — the grader is likely to over-score rationale quality for outputs that match its own style.
Replace escalation rationale grader with a GPT-4o or Gemini 1.5 Pro judge, or add a human-calibration pass on a 20% sample.

Eval Recommendation

The Support Triage Agent eval suite demonstrates solid core coverage for ticket classification and routing but has three production-blocking gaps: no trajectory quality tests, no adversarial safety fixtures, and no handoff quality evaluation. Before promoting to production, resolve P1 gaps and re-run the coverage assessment. The grader bias issue (same model family grading escalation rationale) should be addressed in parallel. Once P1 gaps are closed, coverage will reach at least 5/6 dimensions and the suite will be production-ready.

Implementation Priorities