Every logistics operation produces a daily stream of shipping exceptions. How those exceptions are handled determines SLA performance, expedite cost, and customer satisfaction. The current model is structurally broken — and agents are the right fix.
This whitepaper presents a complete reference implementation for an agentic exception management system built on LangGraph, FastAPI, and LiteLLM. It covers the business problem, the 6-agent architecture, the data layer, the trust and safety framework, the evaluation strategy, and a concrete implementation roadmap.
The system is designed for operators at Supply Chain Autonomy Maturity Level 2 (Assisted) seeking to reach Level 3–4 (Automated / Agentic). It is not a black-box product — it is a reference implementation you can audit, fork, and adapt. Every design decision is documented in Architecture Decision Records. Every agent behavior is covered by an eval scenario before a line of agent code is written.
Shipping exception management is a decision problem hiding inside a communication problem. It fails not because people are incompetent or systems are absent — it fails because the operating model is structurally wrong for the volume and variability it faces.
The current workflow breaks along four structural fault lines, each of which compounds the others:
| Failure mode | Root cause | Operational impact |
|---|---|---|
| Data fragmentation | Evidence is split across TMS, ERP, carrier portals, EDI, email, and document systems with no real-time linkage | Dispatcher spends 80% of case time reconstructing context from memory and tab-switching |
| Event quality variance | Carrier signals arrive at different latencies via EDI batches, webhooks, OCR, and manual portal updates — with duplicates and gaps | Decisions are made on systematically stale and incomplete information |
| Organizational latency | Resolution requires coordination across dispatchers, carrier managers, freight forwarders, customs brokers, and customer service — each in a different system | Action cycles measured in hours rather than minutes; SLA windows consumed by coordination |
| No closed-loop feedback | Manual resolution actions (carrier calls, email chains, spreadsheet updates) leave no machine-readable record of what was decided or why | No learning from past cases; every exception starts from zero |
Shipping exceptions are not a single category. Each type has different triggers, different resolution paths, and different cost profiles. A system that treats them identically over-engineers simple cases and under-serves complex ones.
| Category | Common triggers | Systemic impact |
|---|---|---|
| Pre-Shipment | Missing documentation, missed booking windows, inventory stockouts | Cargo fails to load; manufacturing line delays |
| In-Transit | Carrier schedule changes, vessel rollovers, port congestion, weather | ETA drift; SLA breaches; downstream supply chain halts |
| Customs & Compliance | HS code errors, regulatory holds, document errors, hazmat classifications | Five-figure fines; indefinite cargo impoundment; storage fees $100–$1,000+/day |
| Final-Mile | Damaged labels, incorrect addresses, missed delivery appointments | Customer dissatisfaction; elevated WISMO volume; return logistics cost |
| Level | Description | Exception handling today |
|---|---|---|
| Level 1 — Reactive | Dashboards flag issues after the fact; resolution entirely manual | Discovered via customer complaints or missed SLAs |
| Level 2 — Assisted | Rule-based alerts fire; humans do all diagnosis and execution | Dispatcher investigates manually; resolves by phone and email |
| Level 3 — Automated | Predefined playbooks execute for structured low-complexity exceptions | Simple cases auto-resolved; complex cases escalate |
| Level 4 — Agentic | Specialist agents detect, diagnose, and execute within governed bounds | Full agentic control tower — the target state this system delivers |
The agentic future workflow inverts the current model: agents gather context continuously, propose decisions within policy bounds, and execute approved actions automatically — while humans focus exclusively on cases that genuinely require judgment.
The exception lifecycle is an explicit state machine. State names are immutable across every layer: ADRs, canonical docs, stack code, and eval scenarios use identical names. No aliases.
    detected ──▶ triaged ──▶ investigating ──▶ action_proposed ──▶ awaiting_human
                                                      │                  │
                                                      │                  │ operator approves
                                                      ▼                  ▼
                                                auto_resolved ◀──────────┘
                                                      │ (or failed)
                                                      ▼
                                                   closed
| State | Duration target | Who advances it |
|---|---|---|
| detected | < 1 min | Signal Normalization Agent (automatic) |
| triaged | < 2 min | Risk-and-Impact Agent (automatic) |
| investigating | < 5 min | Root-Cause Investigation Agent (automatic) |
| action_proposed | < 1 min | Policy-and-Strategy Agent (automatic) |
| awaiting_human | ≤ 30 min for P1 | Operator (human) |
| auto_resolved | < 2 min | Execution Agent (automatic or operator-approved) |
| failed / closed | Terminal | System or operator |
| Tier | Action examples | HITL pattern |
|---|---|---|
| Low — automatic | Open case, send status notification, request missing document, recheck ETA | Passive Monitoring — operator watches dashboard; no approval required |
| Medium — one-click | Propose alternate carrier, schedule change, customer compensation within threshold | Approve/Reject Gate — agent pauses; operator approves in ≤30 seconds |
| High — human-owned | Cross-border compliance assertion, financial commitment >$5,000, contractual change | Review-and-Edit — operator reviews and modifies the proposed action before execution |
| Confidence failure | Any action where confidence < 0.60 regardless of tier | Exception Routing — agent flags uncertainty; operator owns the decision entirely |
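The tier table above can be read as a small routing function. The following is a minimal sketch, assuming the tier labels `LOW`, `MEDIUM`, and `HIGH` used elsewhere in this document; the enum and the helper name `select_hitl_pattern` are hypothetical and not taken from the reference stack.

```python
from enum import Enum

CONFIDENCE_FLOOR = 0.60  # from the table: below this, the operator owns the decision


class HitlPattern(str, Enum):
    PASSIVE_MONITORING = "passive_monitoring"
    APPROVE_REJECT_GATE = "approve_reject_gate"
    REVIEW_AND_EDIT = "review_and_edit"
    EXCEPTION_ROUTING = "exception_routing"


# Tier -> pattern mapping, copied from the table rows.
_TIER_TO_PATTERN = {
    "LOW": HitlPattern.PASSIVE_MONITORING,
    "MEDIUM": HitlPattern.APPROVE_REJECT_GATE,
    "HIGH": HitlPattern.REVIEW_AND_EDIT,
}


def select_hitl_pattern(autonomy_tier: str, confidence: float) -> HitlPattern:
    """Hypothetical helper: choose the HITL pattern for one proposed action."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    # A confidence failure overrides the tier, whatever the tier is.
    if confidence < CONFIDENCE_FLOOR:
        return HitlPattern.EXCEPTION_ROUTING
    return _TIER_TO_PATTERN[autonomy_tier]


print(select_hitl_pattern("LOW", 0.92).value)     # passive_monitoring
print(select_hitl_pattern("MEDIUM", 0.41).value)  # exception_routing
```

Checking confidence before the tier keeps the "regardless of tier" rule in one place, so a low-risk action with a weak confidence score still lands with the operator.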
The medallion architecture ensures that agents reason on structured, validated, enriched context — never on raw carrier payloads. Each layer has a defined latency budget, schema contract, and quality gate.
NormalizedEvent objects. All 13 required fields populated. Deduplication by event_fingerprint. Latency alignment applied. Data quality gate: if data_completeness_score < 0.60, circuit breaker trips → case enters awaiting_human immediately.
| Score range | Action |
|---|---|
| ≥ 0.80 | Proceed to investigation and action proposal |
| 0.60 – 0.79 | Flag in investigation context; proceed with reduced confidence ceiling (max confidence capped at 0.70) |
| < 0.60 | HALT — transition case to awaiting_human immediately; list missing fields in escalation context |
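The three score bands above amount to a threshold check. The sketch below is illustrative only: `GateDecision` and `apply_quality_gate` are hypothetical names, and scores that fall between the printed bands (for example 0.795) are assumed to belong to the reduced-confidence band.

```python
from dataclasses import dataclass

HALT_BELOW = 0.60                  # circuit breaker threshold from the table
FULL_CONFIDENCE_FROM = 0.80        # at or above this, no confidence cap applies
REDUCED_CONFIDENCE_CEILING = 0.70  # cap applied in the middle band


@dataclass(frozen=True)
class GateDecision:
    halt: bool                     # True -> caller moves the case to awaiting_human
    confidence_ceiling: float      # upper bound on downstream agent confidence
    missing_fields: tuple[str, ...] = ()


def apply_quality_gate(score: float, missing_fields: list[str]) -> GateDecision:
    """Hypothetical gate: map a data_completeness_score to a routing decision."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score out of range: {score}")
    if score < HALT_BELOW:
        # Halt and carry the missing fields into the escalation context.
        return GateDecision(True, 0.0, tuple(missing_fields))
    if score < FULL_CONFIDENCE_FROM:
        # Proceed, but flag the gaps and cap confidence.
        return GateDecision(False, REDUCED_CONFIDENCE_CEILING, tuple(missing_fields))
    return GateDecision(False, 1.0)


decision = apply_quality_gate(0.55, ["hs_code", "consignee"])
print(decision.halt, decision.missing_fields)
```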
A production agentic system requires defense-in-depth. Seven guardrail layers operate at different points in the agent trajectory. Four "never" rules are deterministic constraints that no confidence score or operator approval can override.
proposed_action traces back to a tool result in investigation_context. Ungrounded fields are flagged as hallucination candidates.
fetch_* adapter results scanned for instruction injection patterns before entering ExceptionCaseState. External content is data, not instructions.
action_proposed.
These are hard constraints that operate independently of agent confidence, operator approval, or business urgency:
execute_carrier_rebook, send_customer_notification, update_tms_status) carries an idempotency key derived from case_id + action_type + attempt_number. Replay safety is non-negotiable.
awaiting_human — it does not synthesize a value.
source_confidence < 0.30 or data_completeness_score < 0.60, the agent must explicitly state which sources are missing or stale and transition to Exception Routing. A well-worded proposal that conceals an incomplete investigation is a worse outcome than an explicit escalation.
invoice_total, hs_code, country_of_origin, and consignee are all present and pass format validation. Partial extraction that passes silently is a silent failure path.
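The idempotency rule above derives a key from case_id, action_type, and attempt_number. One way this could look is sketched below, assuming a SHA-256 digest over a JSON encoding of the three parts; the hash choice, the `idempotency_key` helper, and the in-memory `ReplaySafeExecutor` are all illustrative assumptions, not the stack's implementation.

```python
import hashlib
import json
from typing import Any, Callable


def idempotency_key(case_id: str, action_type: str, attempt_number: int) -> str:
    """Hypothetical helper: deterministic key for one execution attempt."""
    if attempt_number < 1:
        raise ValueError("attempt_number starts at 1")
    # JSON encoding keeps the three parts unambiguous even if one contains a separator.
    raw = json.dumps([case_id, action_type, attempt_number])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class ReplaySafeExecutor:
    """In-memory stand-in for a durable store of completed actions."""

    def __init__(self) -> None:
        self._results: dict[str, Any] = {}

    def execute(self, case_id: str, action_type: str, attempt_number: int,
                action: Callable[[], Any]) -> Any:
        key = idempotency_key(case_id, action_type, attempt_number)
        if key in self._results:
            # Replay: return the recorded result without calling the carrier again.
            return self._results[key]
        result = action()
        self._results[key] = result
        return result


calls = []
executor = ReplaySafeExecutor()
for _ in range(2):  # the second iteration simulates a redelivered queue message
    executor.execute("CASE-1", "execute_carrier_rebook", 1,
                     lambda: calls.append("rebook") or "ok")
print(len(calls))  # 1
```

A production version would need the result store to be durable and shared across workers; the dictionary here only demonstrates the replay check.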
At 95% per-step accuracy, a 10-step agent trajectory succeeds only 60% of the time. The eval harness is scaffolded before any agent code. This is not process discipline — it is the only way to detect compounding errors during development, not after deployment.
| Steps in trajectory | Per-step accuracy | End-to-end success rate |
|---|---|---|
| 5 steps | 95% | 77% |
| 10 steps | 95% | 60% |
| 20 steps | 95% | 36% |
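The figures in the table follow from multiplying per-step accuracy across the trajectory. A short sketch reproduces them, assuming each step succeeds independently with the same probability; the function name is illustrative.

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0 or steps < 0:
        raise ValueError("expected an accuracy in [0, 1] and a non-negative step count")
    return per_step_accuracy ** steps


for steps in (5, 10, 20):
    print(f"{steps:>2} steps: {end_to_end_success(0.95, steps):.0%}")
# prints:
#  5 steps: 77%
# 10 steps: 60%
# 20 steps: 36%
```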
| Metric | Definition | When to use |
|---|---|---|
| pass@k | At least 1 of k runs succeeds — capability ceiling | Capability benchmarking. "Can the agent ever solve this?" |
| pass^k | All k runs must succeed — regression gate | Production graduation. "Does the agent reliably solve this every time?" |
Carrier rebook, SLA breach escalation, prompt injection resistance, and the customs always-escalate rule all require pass^5 (5-of-5) before production graduation. A system that passes 4-of-5 on a safety-critical scenario is not ready.
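The difference between the two metrics is easy to see on a single set of runs. The sketch below uses hypothetical helper names and treats each run as a boolean pass/fail result; the probability helpers additionally assume independent runs with the same success rate.

```python
def pass_at_k(results: list[bool]) -> bool:
    """Capability ceiling: at least one of the k runs succeeded."""
    return any(results)


def pass_hat_k(results: list[bool]) -> bool:
    """Regression gate (pass^k): every one of the k runs succeeded."""
    return len(results) > 0 and all(results)


def expected_pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent runs succeeds."""
    return 1.0 - (1.0 - p) ** k


def expected_pass_hat_k(p: float, k: int) -> float:
    """Chance that all k independent runs succeed."""
    return p ** k


runs = [True, True, False, True, True]  # a 4-of-5 result on one scenario
print(pass_at_k(runs))   # True: the agent can solve it
print(pass_hat_k(runs))  # False: not ready for a pass^5 gate

# With a 90% per-run success rate, the two metrics diverge sharply at k = 5.
print(round(expected_pass_at_k(0.9, 5), 5))   # 0.99999
print(round(expected_pass_hat_k(0.9, 5), 5))  # 0.59049
```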
Build in this order. Stop at each gate. Do not add the next component until the current one's signal criterion passes.
The entire stack runs locally without cloud dependencies. Carrier adapters use fixture JSON from stack/evals/fixtures/. The LangGraph checkpoint store uses SQLite locally (Postgres in production). LiteLLM routes to a local model (Ollama) or a cloud API depending on the LITELLM_MODEL environment variable. The eval harness runs without any live carrier API credentials.
Python 3.11+; FastAPI + uvicorn; LangGraph 0.2+; LiteLLM (model routing layer); Pydantic v2; Langfuse (LLM observability); OpenTelemetry SDK; Redis (worker queue, local: fakeredis); SQLite (LangGraph checkpoint store, local); pytest + pytest-asyncio
The ROI from exception automation compounds: each automated investigation cycle reduces cost per case and accumulates labeled training data that improves future case routing. Payback: 6–18 months at 50+ cases/day.
| KPI | Today (manual) | Agentic target |
|---|---|---|
| Mean Time to Resolve (MTTR) | 4–6 hours | <45 minutes |
| Human intervention rate | ~100% | <30% |
| Manual lookup reduction | Baseline | 50% per case |
| On-time delivery (OTD) rate | 85–88% (industry avg) | +3–5 pp improvement |
| Exception labor cost | $12–18 / shipment | 30–50% reduction |
| Expedite spend | 3–5% of logistics spend | 3–5% reduction |
You are the right fit if: You are a VP of Supply Chain, Head of Logistics Ops, or Engineering Lead at a 3PL, large manufacturer, or high-volume retailer. You handle 50–500 exception cases per day across multiple carriers. You have a TMS or ERP with an API you can query. You are at Maturity Level 2 (Assisted) and want to reach Level 3–4.
Not quite ready if: You don't have structured exception data (only emails and phone calls). Your carrier data arrives exclusively via manual entry with no EDI, API, or webhook. You handle fewer than 20 exceptions per day — the ROI isn't there yet. You have no engineering capacity to wire an adapter or run a LangGraph graph locally.
The reference stack is a Python monorepo organized around a clean architecture: the domain layer (entities and state machine) has zero external dependencies, and all external systems are accessed through a typed adapter interface.
Dependencies are managed via pyproject.toml. The key runtime dependencies and their roles:
| Package | Version | Role |
|---|---|---|
| langgraph | ≥ 0.2.0 | Agent orchestration, state machine execution, HITL interrupt/resume |
| fastapi | ≥ 0.111.0 | Carrier webhook receiver, operator REST API |
| pydantic | ≥ 2.7.0 | Schema validation at every layer boundary (Bronze→Silver→Gold, AgentContext allowlist) |
| litellm | ≥ 1.40.0 | LLM routing: Sonnet for investigation/strategy, Haiku for normalization/risk/learning |
| langfuse | ≥ 2.30.0 | LLM trace observability, token cost tracking, LLM-as-judge eval grading |
| opentelemetry-sdk | ≥ 1.25.0 | Span instrumentation for the 17 typed agent events |
| boto3 | ≥ 1.34.0 | SQS consumer (worker), S3 Bronze store, Bedrock model invocation |
    python -m venv .venv && source .venv/bin/activate
    pip install -e ".[dev]"
    python -m domain.db_init  # initialize SQLite checkpoint store (local)
    pytest tests/unit/ tests/state_machine/ -v  # no LLM or cloud credentials needed
    python -m evals.run_scenario --scenario-id GOLD-001
stack/
├── api/ ← FastAPI: POST /webhooks/carrier/{code}, GET/POST /cases/{id}
├── domain/ ← ExceptionCase entity, state machine, AgentContext allowlist
├── orchestration/ ← StateGraph wiring, 6 agent nodes, interrupt_before=["execute"]
├── adapters/ ← AbstractCarrierAdapter + mocked FedEx/UPS/Maersk/TMS/ERP/EDI
├── worker/ ← SQS polling loop, visibility timeout extension, DLQ handler
├── evals/ ← EvalHarness ABC, 8 YAML scenarios, LLM-as-judge graders
├── observability/ ← 17 typed event emitters, MELT metrics, OTel span decorators
└── tests/ ← Unit / integration / state machine / E2E (4-layer pyramid)
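    from __future__ import annotations

    from abc import ABC, abstractmethod

    # NormalizedEvent, CarrierActionRequest, and CarrierActionResult are the
    # stack's domain models; they are defined elsewhere and only named here.


    class AbstractCarrierAdapter(ABC):
        """Typed interface through which every external carrier system is accessed."""

        @abstractmethod
        def fetch_tracking_events(
            self, tracking_number: str
        ) -> list[NormalizedEvent]:
            """Fetch normalized events. Must be idempotent."""
            ...

        @abstractmethod
        def execute_action(
            self, action: CarrierActionRequest
        ) -> CarrierActionResult:
            """Execute approved action. Idempotency key required."""
            ...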
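    from datetime import datetime
    from typing import Literal

    from pydantic import BaseModel, Field

    # CaseState and ExceptionType are domain types defined elsewhere in the stack.


    class ExceptionCase(BaseModel):
        case_id: str
        state: CaseState = "detected"
        tracking_number: str
        carrier_code: str
        exception_type: ExceptionType  # CUSTOMS_HOLD | WEATHER_DELAY | ...
        autonomy_tier: Literal["LOW", "MEDIUM", "HIGH"] = "HIGH"  # defaults to the most restrictive tier
        risk_score: float = Field(0.5, ge=0.0, le=1.0)
        data_completeness_score: float = Field(0.0, ge=0.0, le=1.0)
        sla_deadline_utc: datetime | None = None
        financial_materiality_usd: float = 0.0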
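    from langgraph.graph import END, StateGraph

    # ExceptionCaseState, the agent modules, and should_route_to_human come from the stack.
    # The wrapper name build_graph is illustrative; the excerpt originally began inside a function.
    # The entry point and the linear edges between nodes are not shown in this excerpt.


    def build_graph(checkpointer):
        graph = StateGraph(ExceptionCaseState)

        # Register 6 agent nodes
        graph.add_node("normalize", normalization.run)
        graph.add_node("assess_risk", risk_assessment.run)
        graph.add_node("investigate", investigation.run)
        graph.add_node("strategize", strategy.run)
        graph.add_node("execute", execution.run)
        graph.add_node("learn", learning.run)

        # Conditional edge: HITL gate at action_proposed
        graph.add_conditional_edges(
            "strategize",
            should_route_to_human,
            {"awaiting_human": END, "execute": "execute"},
        )

        return graph.compile(
            checkpointer=checkpointer,
            interrupt_before=["execute"],  # ← HITL serialization point
        )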
The interrupt_before=["execute"] argument is the HITL mechanism. When the graph reaches this node for a case requiring operator review, LangGraph serializes the full graph state to the checkpoint store (Postgres in production, SQLite locally). The case waits in awaiting_human until the operator approves or rejects. The graph then resumes from the checkpoint, so no state is lost across process restarts or worker scaling events.
| Layer | Location | LLM needed | CI gate |
|---|---|---|---|
| Unit | tests/unit/ | No | Every commit |
| State machine | tests/state_machine/ | No | Every commit |
| Integration | tests/integration/ | Mocked | Every commit |
| E2E | tests/e2e/ | Yes | Staging gate |
State machine tests cover all 13 valid transitions and all 6 invalid transitions (each must raise InvalidStateTransitionError). These run without any credentials in under 2 seconds — they are the fastest regression signal in the system.
The production topology maps onto five AWS services. The API service is lightweight — validation and queue dispatch only. The Worker service is heavy — it runs LangGraph including multi-step LLM calls. They scale independently.
Internet
│
▼
ALB (HTTPS :443)
│
▼
ECS Fargate — API Service (FastAPI webhook receiver)
cpu: 512 memory: 1 GB min: 2 tasks scale: requests/target
│ SQS SendMessage
▼
SQS — shipping-exception-queue
VisibilityTimeout: 900s (customs hold budget)
│ SQS ReceiveMessage (1 msg per task)
▼
ECS Fargate — Worker Service (LangGraph runner)
cpu: 1024 memory: 4 GB min: 1 task scale: queue depth
├──▶ RDS Postgres (LangGraph checkpoint store — awaiting_human state)
├──▶ AWS Bedrock (Sonnet: investigation + strategy / Haiku: others)
├──▶ S3 Bronze (raw carrier payloads, EDI messages, OCR extractions)
└──▶ Langfuse (LLM traces, token cost, eval grading)
| Service | Key configuration | Most impactful tuning knob |
|---|---|---|
| ECS API | cpu: 512, memory: 1024 MB, min 2 tasks, max 10, scale on RequestCount | Raise memory if Pydantic validation OOMs on large EDI payloads |
| ECS Worker | cpu: 1024, memory: 4096 MB, min 1 task, max 20, scale on SQS depth | memory is the first knob — raise to 8 GB if investigation agent loads large document extracts |
| SQS | VisibilityTimeout: 900s, MessageRetentionPeriod: 4 days, DLQ maxReceiveCount: 3 | VisibilityTimeout must exceed the longest graph execution time. Customs hold investigations take up to 12 min (720s), so 900s gives a 1.25× margin; the stated × 1.5 rule would require 1080s |
| RDS Postgres | db.t4g.medium (50/day), db.r7g.large (500/day), Multi-AZ required | Multi-AZ is non-negotiable — checkpoint store is the recovery point for all awaiting_human states |
| Bedrock | Sonnet for Investigation + Strategy agents; Haiku for Normalization, Risk, Learning | Route Risk-and-Impact to Haiku (saves ~$0.05/case); request RPM quota increase before load testing |
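The SQS sizing rule in the table (longest graph execution time × 1.5) is simple arithmetic and worth checking against the configured value. A minimal sketch follows, with hypothetical helper names; the 12-hour ceiling is the SQS maximum for a visibility timeout.

```python
import math

SQS_MAX_VISIBILITY_TIMEOUT_S = 12 * 60 * 60  # SQS upper bound: 12 hours


def required_visibility_timeout(longest_run_s: int, safety_factor: float = 1.5) -> int:
    """Smallest whole-second timeout that covers longest_run_s × safety_factor."""
    if longest_run_s <= 0 or safety_factor < 1.0:
        raise ValueError("expected a positive run time and a factor of at least 1.0")
    timeout = math.ceil(longest_run_s * safety_factor)
    if timeout > SQS_MAX_VISIBILITY_TIMEOUT_S:
        raise ValueError("exceeds the SQS maximum; extend visibility from the worker instead")
    return timeout


def effective_margin(visibility_timeout_s: int, longest_run_s: int) -> float:
    """How many times the longest run fits inside the configured timeout."""
    return visibility_timeout_s / longest_run_s


longest_customs_run_s = 12 * 60  # 12 min worst case from the table
print(required_visibility_timeout(longest_customs_run_s))  # 1080
print(effective_margin(900, longest_customs_run_s))        # 1.25
```

Under these numbers a 900-second timeout covers the 12-minute worst case with a 1.25× margin, while a strict 1.5× factor works out to 1080 seconds. The worker's visibility timeout extension, listed in the repository layout, is the usual way to cover runs that outlast the configured value.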
| Agent | Model | Rationale |
|---|---|---|
| Signal Normalization | Haiku | Deterministic EDI/webhook extraction — no complex reasoning |
| Risk-and-Impact | Haiku | Rule-based scoring; LLM only for novel exception types |
| Root-Cause Investigation | Sonnet | Multi-step ReAct loop; carrier API + ERP + customs queries |
| Policy-and-Strategy | Sonnet | Ranked candidate generation — output quality is load-bearing |
| Execution | No LLM | Fully deterministic carrier API calls |
| Learning-and-Eval | Haiku | Outcome labeling — structured output, lower complexity |
| Resource | Dev/staging | Prod (100/day) | Prod (500/day) |
|---|---|---|---|
| ECS Fargate (API + Worker) | $23 | $75 | $260 |
| RDS Postgres (Multi-AZ) | — | $80 | $200 |
| AWS Bedrock (Sonnet + Haiku) | $20 | $175 | $870 |
| S3 Bronze storage | $1 | $5 | $40 |
| SQS + NAT Gateway | $6 | $17 | $58 |
| Langfuse cloud | Free | $50 | $150 |
| Total | ~$50 | ~$400 | ~$1,580 |
infra/
├── modules/
│ ├── ecs-api/ ← Task def, ALB, target group, auto-scaling
│ ├── ecs-worker/ ← Task def, SQS-triggered step scaling
│ ├── sqs/ ← Queue, DLQ, CloudWatch alarm
│ ├── rds/ ← Postgres instance, subnet group, Multi-AZ
│ └── iam/ ← API task role, Worker task role (least privilege)
└── envs/
├── dev/ ← SQLite locally; only RDS and SQS differ
└── prod/ ← Full VPC, Multi-AZ RDS, autoscaling enabled
All 13 valid state transitions. Any transition not in this list raises InvalidStateTransitionError.
| From state | To state | Trigger |
|---|---|---|
| detected | triaged | Risk-and-Impact Agent completes scoring |
| triaged | investigating | Root-Cause Investigation Agent begins ReAct loop |
| investigating | action_proposed | Policy-and-Strategy Agent produces ranked candidates |
| action_proposed | awaiting_human | Autonomy tier is MEDIUM/HIGH, or confidence < threshold |
| action_proposed | auto_resolved | Autonomy tier is LOW and confidence ≥ threshold |
| awaiting_human | auto_resolved | Operator approves the proposed action |
| awaiting_human | investigating | Operator rejects; routes back for re-investigation |
| auto_resolved | closed | Downstream validation confirms shipment on track |
| failed | closed | Operator acknowledges failure; case closed manually |
| investigating | awaiting_human | data_completeness_score < 0.60 (circuit breaker) |
| investigating | failed | Step count exceeds 20, or same tool called 3× consecutively |
| auto_resolved | failed | Execution Agent: all retries exhausted (non-retryable error) |
| failed | investigating | Operator explicitly reopens case for re-investigation |
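The transition table can be enforced with an allowlist lookup. The sketch below encodes the 13 pairs listed above; the exception name matches the one used in this document, while the `transition` function and the set-of-tuples layout are assumptions about how the domain layer might do it.

```python
# (from_state, to_state) pairs copied from the transition table.
VALID_TRANSITIONS: frozenset[tuple[str, str]] = frozenset({
    ("detected", "triaged"),
    ("triaged", "investigating"),
    ("investigating", "action_proposed"),
    ("action_proposed", "awaiting_human"),
    ("action_proposed", "auto_resolved"),
    ("awaiting_human", "auto_resolved"),
    ("awaiting_human", "investigating"),
    ("auto_resolved", "closed"),
    ("failed", "closed"),
    ("investigating", "awaiting_human"),
    ("investigating", "failed"),
    ("auto_resolved", "failed"),
    ("failed", "investigating"),
})


class InvalidStateTransitionError(Exception):
    """Raised for any transition that is not in the allowlist."""


def transition(current: str, target: str) -> str:
    """Hypothetical helper: return the new state or raise on an invalid move."""
    if (current, target) not in VALID_TRANSITIONS:
        raise InvalidStateTransitionError(f"{current} -> {target} is not allowed")
    return target


state = "detected"
for nxt in ("triaged", "investigating", "action_proposed", "auto_resolved", "closed"):
    state = transition(state, nxt)
print(state)  # closed
```

Because the check is a pure lookup with no I/O, tests over every valid and invalid pair need no credentials and run quickly.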
Vavi Labs publishes workflow-specific AI systems with the architecture, trust model, runnable stack, and implementation detail needed for serious review.
View on GitHub → Vavi Labs · github.com/deepak-karkala/agentic-ai-reference-implementation-library