Vavi Labs · Agentic AI Reference Implementation Library · Logistics Vertical

Agentic AI for Shipment Exception Management

A production-grade reference implementation: 6-agent architecture, state machine lifecycle, 7-layer guardrails, eval-first engineering, and a runnable LangGraph stack.
Brand: Vavi Labs
Domain: Logistics / Supply Chain
Stack: Python · LangGraph · FastAPI

Contents
  1. Executive Summary
  2. The Problem: Shipping Exceptions at Scale
  3. The Solution: Agentic Control Tower
  4. Data Layer: Bronze → Silver → Gold
  5. Trust & Safety: Guardrails and HITL
  6. Evaluation: Eval-First Engineering
  7. Implementation Path
  8. Business Case
  9. Code Implementation
  10. Deployment on AWS
  A. Appendix: Glossary & State Machine Reference

Executive Summary

Every logistics operation produces a daily stream of shipping exceptions. How those exceptions are handled determines SLA performance, expedite cost, and customer satisfaction. The current model is structurally broken — and agents are the right fix.

45 min: average manual exception resolution time today
<10 min: target resolution time with agentic control tower
<30%: human intervention rate target (down from ~100%)

This whitepaper presents a complete reference implementation for an agentic exception management system built on LangGraph, FastAPI, and LiteLLM. It covers the business problem, the 6-agent architecture, the data layer, the trust and safety framework, the evaluation strategy, and a concrete implementation roadmap.

Shipping exception management scores at the high end across every dimension that predicts agent success: high frequency, clear success criteria, multi-step reasoning requirement, tool access, and reversible actions.

The system is designed for operators at Supply Chain Autonomy Maturity Level 2 (Assisted) seeking to reach Level 3–4 (Automated / Agentic). It is not a black-box product — it is a reference implementation you can audit, fork, and adapt. Every design decision is documented in Architecture Decision Records. Every agent behavior is covered by an eval scenario before a line of agent code is written.

The Problem: Shipping Exceptions at Scale

Shipping exception management is a decision problem hiding inside a communication problem. It fails not because people are incompetent or systems are absent — it fails because the operating model is structurally wrong for the volume and variability it faces.

[Figure: The problem with shipping exceptions, key statistics. Caption: The compounding cost of manual exception management]

Four Structural Failure Modes

The current workflow breaks along four structural fault lines, each of which compounds the others:

Failure mode | Root cause | Operational impact
Data fragmentation | Evidence is split across TMS, ERP, carrier portals, EDI, email, and document systems with no real-time linkage | Dispatcher spends 80% of case time reconstructing context from memory and tab-switching
Event quality variance | Carrier signals arrive at different latencies via EDI batches, webhooks, OCR, and manual portal updates, with duplicates and gaps | Decisions are made on systematically stale and incomplete information
Organizational latency | Resolution requires coordination across dispatchers, carrier managers, freight forwarders, customs brokers, and customer service, each in a different system | Action cycles measured in hours rather than minutes; SLA windows consumed by coordination
No closed-loop feedback | Manual resolution actions (carrier calls, email chains, spreadsheet updates) leave no machine-readable record of what was decided or why | No learning from past cases; every exception starts from zero

Exception Taxonomy

Shipping exceptions are not a single category. Each type has different triggers, different resolution paths, and different cost profiles. A system that treats them identically over-engineers simple cases and under-serves complex ones.

Category | Common triggers | Systemic impact
Pre-Shipment | Missing documentation, missed booking windows, inventory stockouts | Cargo fails to load; manufacturing line delays
In-Transit | Carrier schedule changes, vessel rollovers, port congestion, weather | ETA drift; SLA breaches; downstream supply chain halts
Customs & Compliance | HS code errors, regulatory holds, document errors, hazmat classifications | Five-figure fines; indefinite cargo impoundment; storage fees $100–$1,000+/day
Final-Mile | Damaged labels, incorrect addresses, missed delivery appointments | Customer dissatisfaction; elevated WISMO volume; return logistics cost

Maturity Model: Where Are You Today?

Level | Description | Exception handling today
Level 1 (Reactive) | Dashboards flag issues after the fact; resolution entirely manual | Discovered via customer complaints or missed SLAs
Level 2 (Assisted) | Rule-based alerts fire; humans do all diagnosis and execution | Dispatcher investigates manually; resolves by phone and email
Level 3 (Automated) | Predefined playbooks execute for structured low-complexity exceptions | Simple cases auto-resolved; complex cases escalate
Level 4 (Agentic) | Specialist agents detect, diagnose, and execute within governed bounds | Full agentic control tower: the target state this system delivers

The Solution: Agentic Control Tower

The agentic future workflow inverts the current model: agents gather context continuously, propose decisions within policy bounds, and execute approved actions automatically — while humans focus exclusively on cases that genuinely require judgment.

Before: a dispatcher spends 45 minutes reconstructing context from 5 systems and resolves the case manually. After: the agent resolves or proposes in under 10 minutes. The dispatcher spends 30 seconds on an Approve click.

Six Specialized Agents

Signal Normalization Agent
Rule-based + Pydantic validation
Ingests carrier webhooks, TMS polling events, and EDI messages. Normalizes all events to a canonical Gold-layer schema. Creates the ExceptionCase record and transitions lifecycle to detected.
Risk-and-Impact Agent
Scoring model + LLM for novel cases
Classifies exception severity. Assigns autonomy tier (Low / Medium / High). Scores SLA breach risk and financial materiality. Transitions case to triaged.
Root-Cause Investigation Agent
ReAct loop · max 3 tool-call cycles
Queries carrier APIs, ERP inventory, customs document stores, and route alternatives. Builds investigation context. Computes data_completeness_score. Transitions to investigating.
Policy-and-Strategy Agent
Plan-and-Execute · ranked candidates
Evaluates investigation context. Produces ranked resolution candidates (action, cost, SLA impact, confidence). Selects top recommendation. Transitions to action_proposed.
Execution Agent
Deterministic · idempotent · retry 2×
Executes approved action against carrier API, TMS, or notification service. Handles retry with exponential backoff. Records result with idempotency key. Transitions to auto_resolved or failed.
Learning-and-Eval Agent
Async · post-closure · pattern analysis
Labels case outcomes. Tracks accuracy metrics per action type and exception category. Identifies false positives and negatives. Feeds patterns back into routing rules and eval harness.
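
The Execution Agent's "retry 2×" with exponential backoff can be sketched as below. This is a minimal illustration, not the stack's implementation: RetryableError, execute_with_retry, and the one-second base delay are assumptions introduced here, since the source gives no delay values or exception names.

```python
import time
from typing import Callable, Optional, Tuple


class RetryableError(Exception):
    """Hypothetical marker for transient failures (e.g. timeouts)."""


def execute_with_retry(
    action: Callable[[], str],
    max_retries: int = 2,        # "retry 2x" from the agent description
    base_delay_s: float = 1.0,   # assumed value; not given in the source
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[str, Optional[str]]:
    """Run `action`, retrying transient failures with exponential backoff.

    Returns (next_state, result); next_state is "auto_resolved" or "failed".
    """
    for attempt in range(max_retries + 1):
        try:
            return "auto_resolved", action()
        except RetryableError:
            if attempt == max_retries:
                break  # retries exhausted
            sleep(base_delay_s * (2 ** attempt))  # 1s, then 2s
        except Exception:
            break  # non-retryable error: fail without further attempts
    return "failed", None
```

Injecting `sleep` keeps the function testable without real waiting; in a real worker the action callable would carry the idempotency key described in the "never" rules.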

Exception Lifecycle State Machine

The exception lifecycle is an explicit state machine. State names are immutable across every layer — ADRs, canonical docs, stack code, and eval scenarios use identical names. No aliases.

  detected ──▶ triaged ──▶ investigating ──▶ action_proposed ──▶ awaiting_human
                                                    │                  │
                                                    │                  │ operator approves
                                                    ▼                  ▼
                                              auto_resolved ◀──────────┘
                                                    │         (or failed)
                                                    ▼
                                                 closed

State | Duration target | Who advances it
detected | < 1 min | Signal Normalization Agent (automatic)
triaged | < 2 min | Risk-and-Impact Agent (automatic)
investigating | < 5 min | Root-Cause Investigation Agent (automatic)
action_proposed | < 1 min | Policy-and-Strategy Agent (automatic)
awaiting_human | ≤ 30 min for P1 | Operator (human)
auto_resolved | < 2 min | Execution Agent (automatic or operator-approved)
failed / closed | Terminal | System or operator

Autonomy Tiers

Tier | Action examples | HITL pattern
Low (automatic) | Open case, send status notification, request missing document, recheck ETA | Passive Monitoring: operator watches dashboard; no approval required
Medium (one-click) | Propose alternate carrier, schedule change, customer compensation within threshold | Approve/Reject Gate: agent pauses; operator approves in ≤30 seconds
High (human-owned) | Cross-border compliance assertion, financial commitment >$5,000, contractual change | Review-and-Edit: operator reviews and modifies the proposed action before execution
Confidence failure | Any action where confidence < 0.60 regardless of tier | Exception Routing: agent flags uncertainty; operator owns the decision entirely
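
The tier table above reduces to a small routing function. The sketch below is one way to express it, assuming the 0.60 confidence floor overrides the tier; the names HitlPattern and select_hitl_pattern are illustrative and do not come from the reference stack.

```python
from enum import Enum


class HitlPattern(str, Enum):
    PASSIVE_MONITORING = "passive_monitoring"
    APPROVE_REJECT_GATE = "approve_reject_gate"
    REVIEW_AND_EDIT = "review_and_edit"
    EXCEPTION_ROUTING = "exception_routing"


CONFIDENCE_FLOOR = 0.60  # from the "Confidence failure" row

_PATTERN_BY_TIER = {
    "LOW": HitlPattern.PASSIVE_MONITORING,
    "MEDIUM": HitlPattern.APPROVE_REJECT_GATE,
    "HIGH": HitlPattern.REVIEW_AND_EDIT,
}


def select_hitl_pattern(autonomy_tier: str, confidence: float) -> HitlPattern:
    """Map (tier, confidence) to a HITL pattern; low confidence wins over tier."""
    if confidence < CONFIDENCE_FLOOR:
        return HitlPattern.EXCEPTION_ROUTING
    return _PATTERN_BY_TIER[autonomy_tier]
```

Checking confidence first means a LOW-tier action with confidence 0.55 still lands with the operator, which matches the "regardless of tier" wording.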

Data Layer: Bronze → Silver → Gold

The medallion architecture ensures that agents reason on structured, validated, enriched context — never on raw carrier payloads. Each layer has a defined latency budget, schema contract, and quality gate.

Bronze
Raw Ingest — Append-Only, Immutable
Accepts carrier webhooks, EDI 214 AT7 segments, TMS polling payloads, and OCR document extractions. Stored as-is within 30 seconds of receipt. Zero transformation — this layer is an audit log, not a working store. LLMs never see Bronze data directly.
Silver
Normalized Events — Pydantic-Validated, SLA ≤ 2 min
Transforms Bronze payloads into NormalizedEvent objects. All 13 required fields populated. Deduplication by event_fingerprint. Latency alignment applied. Data quality gate: if data_completeness_score < 0.60, circuit breaker trips → case enters awaiting_human immediately.
Gold
Agent Context — Enriched, Field-Allowlisted, SLA ≤ 5 min
Joins normalized carrier events with ERP inventory snapshot, carrier performance history, customer SLA contract terms, and route alternatives. Field allowlist enforced: agents receive only the 47 fields in the Gold schema. Raw payloads are excluded (ADR 004 — LLM injection risk). This is the only layer agents read from.
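
The Gold-layer field allowlist can be pictured as a projection that drops everything not explicitly permitted. The sketch below is illustrative only: GOLD_ALLOWLIST shows a four-field stand-in for the 47-field schema, and to_agent_context is a hypothetical helper name.

```python
# Illustrative subset; the real Gold schema is described as having 47 fields.
GOLD_ALLOWLIST = frozenset(
    {"tracking_number", "carrier_code", "exception_type", "sla_deadline_utc"}
)


def to_agent_context(record: dict) -> dict:
    """Keep only allowlisted fields so raw payloads never reach an agent."""
    return {key: value for key, value in record.items() if key in GOLD_ALLOWLIST}


enriched = {
    "tracking_number": "TEST-TRK-001",
    "carrier_code": "FEDEX",
    "raw_payload": "<free text from carrier webhook>",  # must not reach the LLM
}
context = to_agent_context(enriched)
```

An allowlist fails closed: a new upstream field stays invisible to agents until someone adds it deliberately, which is the property the injection-risk decision relies on.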

Data Quality Circuit Breaker

Score range | Action
≥ 0.80 | Proceed to investigation and action proposal
0.60 – 0.79 | Flag in investigation context; proceed with reduced confidence ceiling (max confidence capped at 0.70)
< 0.60 | HALT: transition case to awaiting_human immediately; list missing fields in escalation context
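
A minimal sketch of the score and the three bands follows. It assumes the score is a weighted ratio of populated required fields, as the glossary describes; the per-field weights, function names, and return shape are assumptions made for illustration.

```python
from typing import Optional, Tuple


def data_completeness_score(fields: dict, weights: dict) -> float:
    """Weighted share of required fields that are populated (not None)."""
    total = sum(weights.values())
    if total == 0:
        return 0.0
    populated = sum(w for name, w in weights.items() if fields.get(name) is not None)
    return populated / total


def missing_fields(fields: dict, weights: dict) -> list:
    """Required fields to list in the escalation context."""
    return sorted(name for name in weights if fields.get(name) is None)


def circuit_breaker(score: float) -> Tuple[str, Optional[float]]:
    """Return (action, confidence_ceiling) for a completeness score."""
    if score < 0.60:
        return "awaiting_human", None  # halt: no agent proposal at all
    if score < 0.80:
        return "proceed", 0.70         # flagged, confidence capped
    return "proceed", 1.0              # no cap applied
```

For example, with weights {"eta": 2, "hs_code": 1, "consignee": 1} and only hs_code and consignee populated, the score is 2/4 = 0.5 and the case halts with "eta" listed as missing.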

Trust & Safety: Guardrails and HITL

A production agentic system requires defense-in-depth. Seven guardrail layers operate at different points in the agent trajectory. Four "never" rules are deterministic constraints that no confidence score or operator approval can override.

7-Layer Guardrail Stack

1. Input Validation
Pydantic schema validation on all inbound carrier webhook payloads. Reject malformed events before they enter Bronze layer.
2. LLM Reasoning Check
Structured output grader verifies that every field in proposed_action traces back to a tool result in investigation_context. Ungrounded fields are flagged as hallucination candidates.
3. Tool Input Check
All adapter call parameters validated against tool-specific schemas before execution. Prevents fabricated API parameters from reaching carrier systems.
4. Tool Output Check
All fetch_* adapter results scanned for instruction injection patterns before entering ExceptionCaseState. External content is data, not instructions.
5. Final Response Check
Action proposal reviewed against authorization policy (autonomy tier, financial threshold, compliance category) before case transitions to action_proposed.
6. Rules-Based Protections
Deterministic enforcement layer: financial commitment thresholds, customs always-escalate rule, rerouting limits, HITL mandatory for HIGH-tier actions. No LLM reasoning involved.
7. Guardian Agents
Separate LLM-based review layer for high-stakes cases: validates investigation context consistency, checks for instruction drift, scores trajectory coherence. Runs asynchronously on HIGH-tier cases.

Four "Never" Rules

These are hard constraints that operate independently of agent confidence, operator approval, or business urgency:

Never mutate a system of record without an idempotency key and a completed policy check. Every write action (execute_carrier_rebook, send_customer_notification, update_tms_status) carries an idempotency key derived from case_id + action_type + attempt_number. Replay safety is non-negotiable.
Never allow the agent to invent customs facts, HS codes, tariff classifications, or document completeness determinations. These fields must trace to a tool result from an authoritative source. If the source tool returns null or fails, the agent transitions to awaiting_human — it does not synthesize a value.
Never hide a low-confidence decision behind polished prose. When source_confidence < 0.30 or data_completeness_score < 0.60, the agent must explicitly state which sources are missing or stale and transition to Exception Routing. A well-worded proposal that conceals an incomplete investigation is a worse outcome than an explicit escalation.
Never treat document extraction as complete until key fields reconcile against business rules and source artifacts. A commercial invoice extraction is complete only when invoice_total, hs_code, country_of_origin, and consignee are all present and pass format validation. Partial extraction that passes silently is a silent failure path.
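
The first rule's idempotency key, and the replay safety it buys, can be sketched as follows. The key format comes from the text; ActionLedger is a hypothetical in-memory stand-in for whatever durable store the stack actually uses.

```python
from typing import Any, Callable, Dict


def idempotency_key(case_id: str, action_type: str, attempt_number: int) -> str:
    """Build the key as case_id + action_type + attempt_number."""
    return f"{case_id}:{action_type}:{attempt_number}"


class ActionLedger:
    """In-memory sketch: remembers results so a replayed key has no new side effect."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}

    def run_once(self, key: str, action: Callable[[], Any]) -> Any:
        if key in self._results:
            return self._results[key]  # replay: return the recorded result
        result = action()
        self._results[key] = result
        return result


ledger = ActionLedger()
key = idempotency_key("CASE-42", "execute_carrier_rebook", 1)
sent = []
first = ledger.run_once(key, lambda: sent.append("rebook") or "confirmed")
again = ledger.run_once(key, lambda: sent.append("rebook") or "confirmed")
```

After both calls, `sent` holds a single entry: the second call with the same key returns the stored result instead of contacting the carrier again.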

Evaluation: Eval-First Engineering

At 95% per-step accuracy, a 10-step agent trajectory succeeds only 60% of the time. The eval harness is scaffolded before any agent code. This is not process discipline — it is the only way to detect compounding errors during development, not after deployment.

Steps in trajectory | Per-step accuracy | End-to-end success rate
5 steps | 95% | 77%
10 steps | 95% | 60%
20 steps | 95% | 36%
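
The figures in the table follow from multiplying per-step accuracy across the trajectory, under the simplifying assumption that steps succeed or fail independently. A short check:

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


rates = {steps: end_to_end_success(0.95, steps) for steps in (5, 10, 20)}
for steps, rate in rates.items():
    print(f"{steps} steps: {rate:.0%}")
# prints 77%, 60%, and 36% for 5, 10, and 20 steps
```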

Four-Pillar Eval Framework

Outcome Correctness
Did the agent reach the correct terminal state? Are all required system effects present? Graded deterministically from expected_environment_state in the scenario YAML.
Trajectory Quality
Did the agent take a reasonable path to the outcome? LLM-as-judge grader checks investigation completeness, tool call sequence, and reasoning coherence.
Safety & Guardrails
Did the agent respect all seven guardrail layers? Did it correctly route HIGH-tier actions to HITL? Did it reject prompt injection attempts? Safety score is a hard gate — any safety failure blocks deployment.
Performance & Cost
MTTR target met? Token budget within bounds? Tool call count within the 9-call budget? These are secondary gates: failures here trigger optimization, not a deployment block.

pass@k vs pass^k

Metric | Definition | When to use
pass@k | At least 1 of k runs succeeds (capability ceiling) | Capability benchmarking. "Can the agent ever solve this?"
pass^k | All k runs must succeed (regression gate) | Production graduation. "Does the agent reliably solve this every time?"

Carrier rebook, SLA breach escalation, prompt injection resistance, and the customs always-escalate rule all require pass^5 (5-of-5) before production graduation. A system that passes 4-of-5 on a safety-critical scenario is not ready.
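
Both metrics, as defined in the table, are simple aggregations over k recorded runs of one scenario. The function names below are illustrative; this is the plain "any run" and "all runs" reading used in this document, not a statistical estimator.

```python
from typing import Sequence


def pass_at_k(runs: Sequence[bool]) -> bool:
    """Capability ceiling: at least one of the k runs succeeded."""
    return any(runs)


def pass_hat_k(runs: Sequence[bool]) -> bool:
    """Regression gate (pass^k): every one of the k runs succeeded."""
    return len(runs) > 0 and all(runs)


four_of_five = [True, True, False, True, True]
can_solve = pass_at_k(four_of_five)          # True: capability exists
ready_for_prod = pass_hat_k(four_of_five)    # False: fails the pass^5 gate
```

The 4-of-5 example shows why the two diverge: the agent can solve the scenario, yet it is not reliable enough to graduate.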

Implementation Path

Build in this order. Stop at each gate. Do not add the next component until the current one's signal criterion passes.

1. Domain entities + state machine (no LLM, no HTTP)
   Gate: pytest test_state_machine.py passes all 13 valid + 6 invalid transitions
2. Eval harness + 2 golden scenarios (before agent code)
   Gate: EvalHarness.run_scenario("GOLD-001") returns EvalResult without crashing
3. Mocked carrier adapters (real interfaces, fixture data)
   Gate: adapter.fetch_tracking_events("TEST-TRK-001") returns NormalizedEvent objects with all 13 fields
4. LangGraph skeleton (stub nodes, graph wiring, HITL interrupt)
   Gate: graph runs detected→closed with stubs; awaiting_human interrupt fires and resumes
5. FastAPI ingest → worker → real agents (one at a time)
   Gate per agent: its eval scenario passes before moving to the next agent

What "Local-First" Means in Practice

The entire stack runs locally without cloud dependencies. Carrier adapters use fixture JSON from stack/evals/fixtures/. The LangGraph checkpoint store uses SQLite locally (Postgres in production). LiteLLM routes to a local model (Ollama) or a cloud API depending on the LITELLM_MODEL environment variable. The eval harness runs without any live carrier API credentials.

Stack Dependencies

Python 3.11+
FastAPI + uvicorn
LangGraph 0.2+
LiteLLM (model routing layer)
Pydantic v2
Langfuse (LLM observability)
OpenTelemetry SDK
Redis (worker queue — local: fakeredis)
SQLite (LangGraph checkpoint store — local)
pytest + pytest-asyncio

Business Case

The ROI from exception automation compounds: each automated investigation cycle reduces cost per case and accumulates labeled training data that improves future case routing. Payback: 6–18 months at 50+ cases/day.

6–18 mo: typical payback timeline at 50–200 cases/day
50%: reduction in manual lookup and reconciliation work per case
3–5%: reduction in expedite spend as share of total logistics spend

KPI Before vs. After

KPI | Today (manual) | Agentic target
Mean Time to Resolve (MTTR) | 4–6 hours | <45 minutes
Human intervention rate | ~100% | <30%
Manual lookup reduction | Baseline | 50% per case
On-time delivery (OTD) rate | 85–88% (industry avg) | +3–5 pp improvement
Exception labor cost | $12–18 / shipment | 30–50% reduction
Expedite spend | 3–5% of logistics spend | 3–5% reduction

ICP: Are You Ready?

You are the right fit if: You are a VP of Supply Chain, Head of Logistics Ops, or Engineering Lead at a 3PL, large manufacturer, or high-volume retailer. You handle 50–500 exception cases per day across multiple carriers. You have a TMS or ERP with an API you can query. You are at Maturity Level 2 (Assisted) and want to reach Level 3–4.

Not quite ready if: You don't have structured exception data (only emails and phone calls). Your carrier data arrives exclusively via manual entry with no EDI, API, or webhook. You handle fewer than 20 exceptions per day — the ROI isn't there yet. You have no engineering capacity to wire an adapter or run a LangGraph graph locally.

The right first step: pick one exception type. Wire one carrier. Get to Level 3 on that narrow path first. Don't architect a 6-agent system until a 1-agent proof of value is working.

Code Implementation

The reference stack is a Python monorepo organized around a clean architecture: the domain layer (entities and state machine) has zero external dependencies, and all external systems are accessed through a typed adapter interface.

Project Setup

Dependencies are managed via pyproject.toml. The key runtime dependencies and their roles:

Package | Version | Role
langgraph | ≥ 0.2.0 | Agent orchestration, state machine execution, HITL interrupt/resume
fastapi | ≥ 0.111.0 | Carrier webhook receiver, operator REST API
pydantic | ≥ 2.7.0 | Schema validation at every layer boundary (Bronze→Silver→Gold, AgentContext allowlist)
litellm | ≥ 1.40.0 | LLM routing: Sonnet for investigation/strategy, Haiku for normalization/risk/learning
langfuse | ≥ 2.30.0 | LLM trace observability, token cost tracking, LLM-as-judge eval grading
opentelemetry-sdk | ≥ 1.25.0 | Span instrumentation for the 17 typed agent events
boto3 | ≥ 1.34.0 | SQS consumer (worker), S3 Bronze store, Bedrock model invocation

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python -m domain.db_init          # initialize SQLite checkpoint store (local)
pytest tests/unit/ tests/state_machine/ -v   # no LLM or cloud credentials needed
python -m evals.run_scenario --scenario-id GOLD-001

Module Structure

stack/
├── api/              ← FastAPI: POST /webhooks/carrier/{code}, GET/POST /cases/{id}
├── domain/           ← ExceptionCase entity, state machine, AgentContext allowlist
├── orchestration/    ← StateGraph wiring, 6 agent nodes, interrupt_before=["execute"]
├── adapters/         ← AbstractCarrierAdapter + mocked FedEx/UPS/Maersk/TMS/ERP/EDI
├── worker/           ← SQS polling loop, visibility timeout extension, DLQ handler
├── evals/            ← EvalHarness ABC, 8 YAML scenarios, LLM-as-judge graders
├── observability/    ← 17 typed event emitters, MELT metrics, OTel span decorators
└── tests/            ← Unit / integration / state machine / E2E (4-layer pyramid)

Core Interface Contracts

AbstractCarrierAdapter

from abc import ABC, abstractmethod

# NormalizedEvent, CarrierActionRequest, and CarrierActionResult are
# domain-layer models defined elsewhere in the stack.


class AbstractCarrierAdapter(ABC):
    @abstractmethod
    def fetch_tracking_events(
        self, tracking_number: str
    ) -> list[NormalizedEvent]:
        """Fetch normalized events. Must be idempotent."""
        ...

    @abstractmethod
    def execute_action(
        self, action: CarrierActionRequest
    ) -> CarrierActionResult:
        """Execute approved action. Idempotency key required."""
        ...

ExceptionCase entity

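from datetime import datetime
from typing import Literal

from pydantic import BaseModel, Field

# CaseState and ExceptionType are domain-layer types defined elsewhere in the stack.


class ExceptionCase(BaseModel):
    case_id: str
    state: CaseState = "detected"      # every case starts in the detected state
    tracking_number: str
    carrier_code: str
    exception_type: ExceptionType      # CUSTOMS_HOLD | WEATHER_DELAY | ...
    autonomy_tier: Literal["LOW", "MEDIUM", "HIGH"] = "HIGH"   # human-owned until triaged
    risk_score: float = Field(0.5, ge=0.0, le=1.0)
    data_completeness_score: float = Field(0.0, ge=0.0, le=1.0)
    sla_deadline_utc: datetime | None = None
    financial_materiality_usd: float = 0.0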

LangGraph graph wiring

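from langgraph.graph import END, START, StateGraph

# The agent modules (normalization, risk_assessment, ...), ExceptionCaseState,
# and should_route_to_human come from the orchestration package.


def build_graph(checkpointer):
    """Wire the six agent nodes and compile with the HITL interrupt."""
    graph = StateGraph(ExceptionCaseState)
    # Register 6 agent nodes
    graph.add_node("normalize",   normalization.run)
    graph.add_node("assess_risk", risk_assessment.run)
    graph.add_node("investigate", investigation.run)
    graph.add_node("strategize",  strategy.run)
    graph.add_node("execute",     execution.run)
    graph.add_node("learn",       learning.run)

    # Linear path up to the strategy step (assumed from the lifecycle order)
    graph.add_edge(START, "normalize")
    graph.add_edge("normalize", "assess_risk")
    graph.add_edge("assess_risk", "investigate")
    graph.add_edge("investigate", "strategize")

    # Conditional edge: HITL gate at action_proposed
    graph.add_conditional_edges(
        "strategize",
        should_route_to_human,
        {"awaiting_human": END, "execute": "execute"}
    )
    graph.add_edge("execute", "learn")
    graph.add_edge("learn", END)

    return graph.compile(
        checkpointer=checkpointer,
        interrupt_before=["execute"]   # ← HITL serialization point
    )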

The interrupt_before=["execute"] argument is the HITL mechanism. When the graph reaches this node for a case requiring operator review, LangGraph serializes the full graph state to the checkpoint store (Postgres in production, SQLite locally). The case waits in awaiting_human until the operator approves or rejects. The graph then resumes from the checkpoint, so no state is lost across process restarts or worker scaling events.

Testing Pyramid

Layer | Location | LLM needed | CI gate
Unit | tests/unit/ | No | Every commit
State machine | tests/state_machine/ | No | Every commit
Integration | tests/integration/ | Mocked | Every commit
E2E | tests/e2e/ | Yes | Staging gate

State machine tests cover all 13 valid transitions and all 6 invalid transitions (each must raise InvalidStateTransitionError). These run without any credentials in under 2 seconds — they are the fastest regression signal in the system.

Deployment on AWS

The production topology maps onto five AWS services. The API service is lightweight — validation and queue dispatch only. The Worker service is heavy — it runs LangGraph including multi-step LLM calls. They scale independently.

Internet
    │
    ▼
ALB (HTTPS :443)
    │
    ▼
ECS Fargate — API Service (FastAPI webhook receiver)
  cpu: 512  memory: 1 GB  min: 2 tasks  scale: requests/target
    │  SQS SendMessage
    ▼
SQS — shipping-exception-queue
  VisibilityTimeout: 900s (customs hold budget)
    │  SQS ReceiveMessage (1 msg per task)
    ▼
ECS Fargate — Worker Service (LangGraph runner)
  cpu: 1024  memory: 4 GB  min: 1 task  scale: queue depth
    ├──▶  RDS Postgres    (LangGraph checkpoint store — awaiting_human state)
    ├──▶  AWS Bedrock     (Sonnet: investigation + strategy / Haiku: others)
    ├──▶  S3 Bronze       (raw carrier payloads, EDI messages, OCR extractions)
    └──▶  Langfuse        (LLM traces, token cost, eval grading)

Service Configuration and Tuning Knobs

Service | Key configuration | Most impactful tuning knob
ECS API | cpu: 512, memory: 1024 MB, min 2 tasks, max 10, scale on RequestCount | Raise memory if Pydantic validation OOMs on large EDI payloads
ECS Worker | cpu: 1024, memory: 4096 MB, min 1 task, max 20, scale on SQS depth | Memory is the first knob: raise to 8 GB if the investigation agent loads large document extracts
SQS | VisibilityTimeout: 900s, MessageRetentionPeriod: 4 days, DLQ maxReceiveCount: 3 | VisibilityTimeout should exceed the longest graph execution time × 1.5. Customs hold investigations take up to 12 min (720s), which calls for 1,080s; the configured 900s gives only 1.25× headroom, so raise it or rely on the worker's visibility timeout extension
RDS Postgres | db.t4g.medium (50/day), db.r7g.large (500/day), Multi-AZ required | Multi-AZ is non-negotiable: the checkpoint store is the recovery point for all awaiting_human states
Bedrock | Sonnet for Investigation + Strategy agents; Haiku for Normalization, Risk, Learning | Route Risk-and-Impact to Haiku (saves ~$0.05/case); request RPM quota increase before load testing

Model Routing: Sonnet vs. Haiku

Agent | Model | Rationale
Signal Normalization | Haiku | Deterministic EDI/webhook extraction; no complex reasoning
Risk-and-Impact | Haiku | Rule-based scoring; LLM only for novel exception types
Root-Cause Investigation | Sonnet | Multi-step ReAct loop; carrier API + ERP + customs queries
Policy-and-Strategy | Sonnet | Ranked candidate generation; output quality is load-bearing
Execution | No LLM | Fully deterministic carrier API calls
Learning-and-Eval | Haiku | Outcome labeling; structured output, lower complexity
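
The routing table can be held as plain configuration. In the sketch below the agent keys and the model_tier_for helper are hypothetical names, and the values are tier labels rather than real provider model identifiers, which would be supplied by the deployment's LiteLLM or Bedrock settings.

```python
from typing import Optional

# Tier labels only; None means the agent makes no LLM calls.
MODEL_TIER_BY_AGENT = {
    "signal_normalization": "haiku",
    "risk_and_impact": "haiku",
    "root_cause_investigation": "sonnet",
    "policy_and_strategy": "sonnet",
    "execution": None,
    "learning_and_eval": "haiku",
}


def model_tier_for(agent: str) -> Optional[str]:
    """Look up the model tier for an agent; unknown agents raise KeyError."""
    return MODEL_TIER_BY_AGENT[agent]
```

Raising on an unknown agent, rather than falling back to a default tier, keeps a misnamed agent from silently running on the more expensive model.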

Monthly Cost Model

Resource | Dev/staging | Prod (100/day) | Prod (500/day)
ECS Fargate (API + Worker) | $23 | $75 | $260
RDS Postgres (Multi-AZ) | n/a | $80 | $200
AWS Bedrock (Sonnet + Haiku) | $20 | $175 | $870
S3 Bronze storage | $1 | $5 | $40
SQS + NAT Gateway | $6 | $17 | $58
Langfuse cloud | Free | $50 | $150
Total | ~$50 | ~$400 | ~$1,580

Bedrock cost dominates. The single biggest lever: route normalization, risk-scoring, and learning to Haiku instead of Sonnet. This alone reduces LLM cost by ~40% with no change to investigation or strategy quality.

Terraform Module Structure

infra/
├── modules/
│   ├── ecs-api/     ← Task def, ALB, target group, auto-scaling
│   ├── ecs-worker/  ← Task def, SQS-triggered step scaling
│   ├── sqs/         ← Queue, DLQ, CloudWatch alarm
│   ├── rds/         ← Postgres instance, subnet group, Multi-AZ
│   └── iam/         ← API task role, Worker task role (least privilege)
└── envs/
    ├── dev/          ← SQLite locally; only RDS and SQS differ
    └── prod/         ← Full VPC, Multi-AZ RDS, autoscaling enabled

Glossary & State Machine Reference

Glossary

ExceptionCase
The core domain entity. Holds case_id, state, exception_type, autonomy_tier, and audit_events. Its state is never assigned directly; it changes only via StateMachine.transition().
data_completeness_score
Float 0.0–1.0. Weighted ratio of required Gold-layer fields populated to total required fields for the given exception type. Circuit breaker threshold: 0.60.
autonomy_tier
LOW / MEDIUM / HIGH. Assigned by Risk-and-Impact Agent. Determines HITL pattern for the case. Influences confidence thresholds.
idempotency_key
Format: {case_id}:{action_type}:{attempt_number}. Ensures repeated execution attempts for the same action do not produce duplicate side effects.
NormalizedEvent
Silver-layer entity produced by Signal Normalization Agent. 13 required fields. Schema validated by Pydantic. Deduplicated by event_fingerprint.
pass^k
Regression eval metric. All k runs of a scenario must pass. Used for safety-critical scenarios (customs, rebook, prompt injection). Blocks production graduation if any run fails.
HITL
Human-in-the-loop. Four patterns: Passive Monitoring (no approval), Approve/Reject Gate (≤30s), Review-and-Edit (operator modifies), Exception Routing (confidence failure escalation).
EvalHarness
Abstract base class for the eval system. Defines run_scenario(scenario_id) → EvalResult and list_scenarios() → list[ScenarioSpec]. Scaffolded before any agent code.
EDDOps
Error-Driven Development for Operations. When a production failure occurs, the 72hr SLA requires a minimized test case (failing scenario) committed to the eval harness before the fix is deployed.
MELT
Metrics, Events, Logs, Traces. The four observability signal types used by the stack. 17 typed events required per case for a complete audit trail.

State Transition Reference

All 13 valid state transitions. Any transition not in this list raises InvalidStateTransitionError.

From state | To state | Trigger
detected | triaged | Risk-and-Impact Agent completes scoring
triaged | investigating | Root-Cause Investigation Agent begins ReAct loop
investigating | action_proposed | Policy-and-Strategy Agent produces ranked candidates
action_proposed | awaiting_human | Autonomy tier is MEDIUM/HIGH, or confidence < threshold
action_proposed | auto_resolved | Autonomy tier is LOW and confidence ≥ threshold
awaiting_human | auto_resolved | Operator approves the proposed action
awaiting_human | investigating | Operator rejects; routes back for re-investigation
auto_resolved | closed | Downstream validation confirms shipment on track
failed | closed | Operator acknowledges failure; case closed manually
investigating | awaiting_human | data_completeness_score < 0.60 (circuit breaker)
investigating | failed | Step count exceeds 20, or same tool called 3× consecutively
auto_resolved | failed | Execution Agent: all retries exhausted (non-retryable error)
failed | investigating | Operator explicitly reopens case for re-investigation
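
The transition table translates directly into a lookup set. The sketch below is a minimal stand-in for the stack's state machine, with no LLM or HTTP dependency; the function name transition and the module-level set are assumptions, while the state names and InvalidStateTransitionError come from the text.

```python
VALID_TRANSITIONS = frozenset({
    ("detected", "triaged"),
    ("triaged", "investigating"),
    ("investigating", "action_proposed"),
    ("action_proposed", "awaiting_human"),
    ("action_proposed", "auto_resolved"),
    ("awaiting_human", "auto_resolved"),
    ("awaiting_human", "investigating"),
    ("auto_resolved", "closed"),
    ("failed", "closed"),
    ("investigating", "awaiting_human"),
    ("investigating", "failed"),
    ("auto_resolved", "failed"),
    ("failed", "investigating"),
})


class InvalidStateTransitionError(Exception):
    """Raised for any (from, to) pair not in the reference table."""


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the transition is not allowed."""
    if (current, target) not in VALID_TRANSITIONS:
        raise InvalidStateTransitionError(f"{current} -> {target}")
    return target
```

Because the set is the single source of truth, a test can assert that it holds exactly 13 pairs and that any other pair, such as closed → detected, raises.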

Get the Full Reference Implementation

Vavi Labs publishes workflow-specific AI systems with the architecture, trust model, runnable stack, and implementation detail needed for serious review.

View on GitHub → Vavi Labs · github.com/deepak-karkala/agentic-ai-reference-implementation-library