Support Triage Agent — Eval Scorecard

Six-Dimension Coverage

Dimension	Status	Evidence
Task Success	Tested	22 golden cases cover ticket classification, routing decisions, and escalation triggers. LLM grader + exact-match on ticket ID.
Trajectory Quality	Absent	No tests verify the agent takes the minimum-step path or avoids unnecessary tool calls. Trajectory not logged in current test harness.
Robustness	Partial	Tool failure injection covers ticket-lookup and CRM tool. Malformed inputs and ambiguous ticket text not tested.
Safety	Partial	PII redaction validated on 8 fixture tickets. Prompt injection resistance not tested. No adversarial customer-message fixtures.
Efficiency	Absent	No latency budget tests. Token cost per ticket not measured. No budget-exceeded behavior tested.
Collaboration	Absent	Agent-to-human handoff quality not evaluated. No tests for handoff message clarity or context completeness when escalating to Tier-2.

Grader Quality

Grader	Type	Quality
Ticket classifier accuracy	Exact Match	Reliable
Routing decision evaluator	LLM Grader	Reliable
PII redaction checker	Regex + Rule	Reliable
Escalation rationale scorer	LLM Grader	Questionable
Response tone evaluator	LLM Grader	Questionable

Anti-patterns Detected

Anti-pattern	Present	Evidence
Golden dataset too small	Partial	22 cases cover core paths but edge categories (billing disputes, multi-issue tickets) have fewer than 3 examples each.
Grader judges its own output	Present	Escalation rationale grader uses the same model family (Claude 3) as the agent itself, which risks grader bias on subjective quality judgements.
No adversarial fixtures	Present	All 22 golden cases are well-formed, cooperative inputs. No tests for jailbreak attempts, instruction injection in ticket body, or adversarial category confusion.
Coverage metric disguises dimension gaps	Present	Reporting "22 passing evals" hides that Trajectory Quality, Efficiency, and Collaboration have zero coverage; a single pass count obscures three absent dimensions.
Eval suite not run in CI	Absent	Eval suite is wired into the deployment pipeline and runs on every PR that modifies agent prompt, routing logic, or tool schema.
Static golden dataset	Absent	Dataset refresh process documented; new production tickets are sampled weekly and reviewed for golden-case candidates.
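
One way to avoid the single-number problem flagged in the table above is to report case counts per dimension. The following is a minimal sketch, assuming each eval case carries a dimension tag; the function names and the (case_id, dimension) tuple shape are hypothetical and not taken from the existing harness.

```python
from collections import Counter

# The six dimensions named in this scorecard.
DIMENSIONS = [
    "Task Success",
    "Trajectory Quality",
    "Robustness",
    "Safety",
    "Efficiency",
    "Collaboration",
]


def coverage_by_dimension(cases):
    """Count eval cases per dimension; dimensions with no cases report 0.

    `cases` is an iterable of (case_id, dimension) pairs (hypothetical shape).
    """
    counts = Counter(dimension for _, dimension in cases)
    unknown = set(counts) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return {dimension: counts.get(dimension, 0) for dimension in DIMENSIONS}


def absent_dimensions(report):
    """Return the dimensions that have no eval cases at all."""
    return [dimension for dimension, count in report.items() if count == 0]
```

Printing the full report next to the pass count makes a zero-coverage dimension visible even when every existing case passes.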

Coverage Gaps

No trajectory quality tests

The agent's step count and tool call sequence are not evaluated. An agent that reaches the correct routing decision via an inefficient 12-step path will pass all current evals.

Add trajectory logging to the test harness; write 5 golden trajectories for core ticket types; score step count and unnecessary tool calls.
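
As an illustration of what scoring step count and unnecessary tool calls could look like, here is a minimal sketch. It assumes the harness can log a trajectory as an ordered list of tool names; the tool names in the usage note and the `TrajectoryScore` fields are hypothetical. The comparison treats the golden trajectory as a multiset of expected calls and does not check call order.

```python
from dataclasses import dataclass


@dataclass
class TrajectoryScore:
    step_count: int          # total tool calls the agent made
    extra_steps: int         # calls beyond the golden trajectory's length
    unnecessary_calls: list  # calls with no unmatched counterpart in the golden trajectory


def score_trajectory(actual, golden):
    """Compare a logged tool-call sequence against a golden trajectory.

    Both arguments are lists of tool names in call order.
    """
    remaining = list(golden)
    unnecessary = []
    for call in actual:
        if call in remaining:
            remaining.remove(call)    # consume one matching golden call
        else:
            unnecessary.append(call)  # repeated or off-path call
    return TrajectoryScore(
        step_count=len(actual),
        extra_steps=max(0, len(actual) - len(golden)),
        unnecessary_calls=unnecessary,
    )
```

For example, with a golden trajectory of `["ticket_lookup", "crm_lookup", "route"]`, an agent that repeats `ticket_lookup` and adds a `kb_search` call would be reported with two extra steps and both calls listed as unnecessary.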

Adversarial safety fixtures absent

No tests cover prompt injection in ticket body, adversarial category confusion, or attempts to extract internal routing rules via ticket text.

Write 8–10 adversarial fixtures; add a prompt-injection detection grader; target failure-to-detect rate below 5%.
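
The failure-to-detect target can be checked with simple arithmetic. The sketch below assumes each adversarial fixture yields a boolean indicating whether the detection grader flagged it; the function names are hypothetical.

```python
def failure_to_detect_rate(results):
    """Fraction of adversarial fixtures the grader failed to flag.

    `results` is a list of booleans, True when the injection was detected.
    """
    if not results:
        raise ValueError("no adversarial fixtures were run")
    missed = sum(1 for detected in results if not detected)
    return missed / len(results)


def meets_target(results, max_rate=0.05):
    """True when the miss rate is strictly below the target rate."""
    return failure_to_detect_rate(results) < max_rate
```

With only 8 to 10 fixtures, a single miss is already a 10% to 12.5% failure rate, so a target below 5% amounts to requiring that every fixture be detected until the fixture set grows past 20.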

Handoff quality not evaluated

When the triage agent escalates to Tier-2, the handoff message content, context completeness, and recommended priority are not graded. Tier-2 agents receive unvalidated context.

Define handoff message schema; write LLM grader rubric for context completeness; add 6 escalation golden cases.
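
A handoff schema could start as a small typed record with a deterministic completeness check that runs before any LLM grading. This is a sketch only: the field names and the allowed priority values are assumptions chosen for illustration, not the team's actual schema.

```python
from dataclasses import dataclass, fields

# Assumed priority vocabulary; replace with the values Tier-2 actually uses.
ALLOWED_PRIORITIES = {"low", "normal", "high", "urgent"}


@dataclass
class HandoffMessage:
    # All field names below are hypothetical.
    ticket_id: str
    category: str
    summary: str
    steps_taken: str
    recommended_priority: str


def completeness_issues(msg):
    """List structural problems in a handoff message; empty list means none."""
    issues = []
    for field in fields(msg):
        if not getattr(msg, field.name).strip():
            issues.append(f"missing {field.name}")
    priority = msg.recommended_priority.strip()
    if priority and priority not in ALLOWED_PRIORITIES:
        issues.append(f"invalid recommended_priority: {priority}")
    return issues
```

A check like this catches empty or malformed fields cheaply, leaving the LLM grader rubric to judge whether the summary actually gives Tier-2 enough context.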

No efficiency / latency budget tests

Token cost per ticket and p95 latency are unmonitored in evals. Budget-exceeded edge cases (long ticket threads) are not tested.

Instrument a token counter and latency timer in the test harness; set budget thresholds (2,000 tokens per ticket, 8 s p95 latency); add 3 long-thread stress fixtures.
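
A budget check over harness measurements might look like the sketch below. It assumes the token threshold applies per ticket and that the harness records a (ticket_id, tokens_used, latency_seconds) tuple per run; both are assumptions, as are the function names. The percentile uses the nearest-rank method.

```python
import math

TOKEN_BUDGET = 2000          # token threshold from the recommendation above (assumed per ticket)
P95_LATENCY_BUDGET_S = 8.0   # p95 latency threshold in seconds


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list of numbers."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def budget_violations(runs):
    """Return human-readable budget violations for a batch of eval runs.

    `runs` is a list of (ticket_id, tokens_used, latency_seconds) tuples.
    """
    violations = [
        f"{ticket_id}: {tokens} tokens exceeds {TOKEN_BUDGET}"
        for ticket_id, tokens, _ in runs
        if tokens > TOKEN_BUDGET
    ]
    latency_p95 = p95([latency for _, _, latency in runs])
    if latency_p95 > P95_LATENCY_BUDGET_S:
        violations.append(
            f"p95 latency {latency_p95:.1f}s exceeds {P95_LATENCY_BUDGET_S:.1f}s"
        )
    return violations
```

Failing the eval run whenever this list is non-empty would turn the thresholds into a regression gate rather than a dashboard metric.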

Escalation rationale grader uses same model family

Claude 3 grading Claude 3 outputs introduces grader bias: the grader is likely to over-score rationale quality for outputs that match its own style.

Replace escalation rationale grader with a GPT-4o or Gemini 1.5 Pro judge, or add a human-calibration pass on a 20% sample.
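
If the human-calibration option is chosen, the 20% sample and the grader-versus-human agreement rate can be computed as in this sketch. The function names, the fixed seed, and the label dictionaries keyed by case ID are all hypothetical.

```python
import random


def calibration_sample(case_ids, fraction=0.2, seed=0):
    """Pick a reproducible random subset of graded cases for human review."""
    case_ids = list(case_ids)
    if not case_ids:
        raise ValueError("no cases to sample from")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    sample_size = max(1, round(len(case_ids) * fraction))
    return sorted(random.Random(seed).sample(case_ids, sample_size))


def agreement_rate(llm_labels, human_labels):
    """Fraction of jointly labelled cases where grader and human agree.

    Both arguments are dicts mapping case ID to a label such as "pass" or "fail".
    """
    shared = llm_labels.keys() & human_labels.keys()
    if not shared:
        raise ValueError("no cases were labelled by both grader and human")
    matches = sum(1 for case_id in shared if llm_labels[case_id] == human_labels[case_id])
    return matches / len(shared)
```

A low agreement rate on the sample is a signal that the LLM grader's scores for escalation rationale should not be trusted without recalibration.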

Eval Recommendation

The Support Triage Agent eval suite has solid core coverage for ticket classification and routing but three production-blocking (P1) gaps: no trajectory quality tests, no adversarial safety fixtures, and no handoff quality evaluation. Before promoting to production, resolve these P1 gaps and re-run the coverage assessment. Address the grader bias issue (same model family grading escalation rationale) in parallel. Once the P1 gaps are closed, at least 5 of the 6 dimensions will have some coverage (Efficiency stays absent until the latency and token budget tests are added) and the suite will be production-ready.

Implementation Priorities

1. Add trajectory logging to the test harness and write 5 golden trajectory fixtures for the 5 most common ticket categories.
2. Write 8–10 adversarial safety fixtures covering prompt injection, category confusion, and routing-rule extraction attempts. Add a prompt-injection detection grader.
3. Define handoff message schema and write 6 escalation golden cases with LLM grader for context completeness and priority accuracy.
4. Instrument token counter and latency timer; set budget thresholds; add 3 long-thread stress fixtures to catch efficiency regressions.
5. Replace the same-family escalation grader with a cross-family judge (GPT-4o or Gemini 1.5 Pro), or add a human-calibration pass on a 20% sample.