Agentic AI for Logistics: Shipment Exception Management

Section 1

Executive Summary

Every logistics operation produces a daily stream of shipping exceptions. How those exceptions are handled determines SLA performance, expedite cost, and customer satisfaction. The current model is structurally broken — and agents are the right fix.

45 min

Avg. manual exception resolution time today

<10 min

Target resolution time with agentic control tower

<30%

Human intervention rate target (down from ~100%)

This whitepaper presents a complete reference implementation for an agentic exception management system built on LangGraph, FastAPI, and LiteLLM. It covers the business problem, the 6-agent architecture, the data layer, the trust and safety framework, the evaluation strategy, and a concrete implementation roadmap.

Shipping exception management scores at the high end across every dimension that predicts agent success: high frequency, clear success criteria, multi-step reasoning requirement, tool access, and reversible actions.

The system is designed for operators at Supply Chain Autonomy Maturity Level 2 (Assisted) seeking to reach Level 3–4 (Automated / Agentic). It is not a black-box product — it is a reference implementation you can audit, fork, and adapt. Every design decision is documented in Architecture Decision Records. Every agent behavior is covered by an eval scenario before a line of agent code is written.

Section 2

The Problem: Shipping Exceptions at Scale

Shipping exception management is a decision problem hiding inside a communication problem. It fails not because people are incompetent or systems are absent — it fails because the operating model is structurally wrong for the volume and variability it faces.

The problem with shipping exceptions — key statistics

The compounding cost of manual exception management

Four Structural Failure Modes

The current workflow breaks along four structural fault lines, each of which compounds the others:

Failure mode	Root cause	Operational impact
Data fragmentation	Evidence is split across TMS, ERP, carrier portals, EDI, email, and document systems with no real-time linkage	Dispatcher spends 80% of case time reconstructing context from memory and tab-switching
Event quality variance	Carrier signals arrive at different latencies via EDI batches, webhooks, OCR, and manual portal updates — with duplicates and gaps	Decisions are made on systematically stale and incomplete information
Organizational latency	Resolution requires coordination across dispatchers, carrier managers, freight forwarders, customs brokers, and customer service — each in a different system	Action cycles measured in hours rather than minutes; SLA windows consumed by coordination
No closed-loop feedback	Manual resolution actions (carrier calls, email chains, spreadsheet updates) leave no machine-readable record of what was decided or why	No learning from past cases; every exception starts from zero

Exception Taxonomy

Shipping exceptions are not a single category. Each type has different triggers, different resolution paths, and different cost profiles. A system that treats them identically over-engineers simple cases and under-serves complex ones.

Category	Common triggers	Systemic impact
Pre-Shipment	Missing documentation, missed booking windows, inventory stockouts	Cargo fails to load; manufacturing line delays
In-Transit	Carrier schedule changes, vessel rollovers, port congestion, weather	ETA drift; SLA breaches; downstream supply chain halts
Customs & Compliance	HS code errors, regulatory holds, document errors, hazmat classifications	Five-figure fines; indefinite cargo impoundment; storage fees $100–$1,000+/day
Final-Mile	Damaged labels, incorrect addresses, missed delivery appointments	Customer dissatisfaction; elevated WISMO volume; return logistics cost

Maturity Model: Where Are You Today?

Level	Description	Exception handling today
Level 1 — Reactive	Dashboards flag issues after the fact; resolution entirely manual	Discovered via customer complaints or missed SLAs
Level 2 — Assisted	Rule-based alerts fire; humans do all diagnosis and execution	Dispatcher investigates manually; resolves by phone and email
Level 3 — Automated	Predefined playbooks execute for structured low-complexity exceptions	Simple cases auto-resolved; complex cases escalate
Level 4 — Agentic	Specialist agents detect, diagnose, and execute within governed bounds	Full agentic control tower — the target state this system delivers

Section 3

The Solution: Agentic Control Tower

The agentic future workflow inverts the current model: agents gather context continuously, propose decisions within policy bounds, and execute approved actions automatically — while humans focus exclusively on cases that genuinely require judgment.

Before: a dispatcher spends 45 minutes reconstructing context from 5 systems and resolves the case manually. After: the agent resolves or proposes in under 10 minutes. The dispatcher spends 30 seconds on an Approve click.

Six Specialized Agents

Signal Normalization Agent

Rule-based + Pydantic validation

Ingests carrier webhooks, TMS polling events, and EDI messages. Normalizes all events to a canonical Gold-layer schema. Creates the ExceptionCase record and transitions lifecycle to detected.

Risk-and-Impact Agent

Scoring model + LLM for novel cases

Classifies exception severity. Assigns autonomy tier (Low / Medium / High). Scores SLA breach risk and financial materiality. Transitions case to triaged.

Root-Cause Investigation Agent

ReAct loop · max 3 tool-call cycles

Queries carrier APIs, ERP inventory, customs document stores, and route alternatives. Builds investigation context. Computes data_completeness_score. Transitions to investigating.

Policy-and-Strategy Agent

Plan-and-Execute · ranked candidates

Evaluates investigation context. Produces ranked resolution candidates (action, cost, SLA impact, confidence). Selects top recommendation. Transitions to action_proposed.

Execution Agent

Deterministic · idempotent · retry 2×

Executes approved action against carrier API, TMS, or notification service. Handles retry with exponential backoff. Records result with idempotency key. Transitions to auto_resolved or failed.

Learning-and-Eval Agent

Async · post-closure · pattern analysis

Labels case outcomes. Tracks accuracy metrics per action type and exception category. Identifies false positives and negatives. Feeds patterns back into routing rules and eval harness.

Exception Lifecycle State Machine

The exception lifecycle is an explicit state machine. State names are immutable across every layer — ADRs, canonical docs, stack code, and eval scenarios use identical names. No aliases.

  detected ──▶ triaged ──▶ investigating ──▶ action_proposed ──▶ awaiting_human
                                                    │                  │
                                                    │                  │ operator approves
                                                    ▼                  ▼
                                              auto_resolved ◀──────────┘
                                                    │         (or failed)
                                                    ▼
                                                 closed

State	Duration target	Who advances it
`detected`	< 1 min	Signal Normalization Agent (automatic)
`triaged`	< 2 min	Risk-and-Impact Agent (automatic)
`investigating`	< 5 min	Root-Cause Investigation Agent (automatic)
`action_proposed`	< 1 min	Policy-and-Strategy Agent (automatic)
`awaiting_human`	≤ 30 min for P1	Operator (human)
`auto_resolved`	< 2 min	Execution Agent (automatic or operator-approved)
`failed` / `closed`	`closed` is terminal; `failed` holds until an operator closes or reopens the case	System or operator

Autonomy Tiers

Tier	Action examples	HITL pattern
Low — automatic	Open case, send status notification, request missing document, recheck ETA	Passive Monitoring — operator watches dashboard; no approval required
Medium — one-click	Propose alternate carrier, schedule change, customer compensation within threshold	Approve/Reject Gate — agent pauses; operator approves in ≤30 seconds
High — human-owned	Cross-border compliance assertion, financial commitment >$5,000, contractual change	Review-and-Edit — operator reviews and modifies the proposed action before execution
Confidence failure	Any action where confidence < 0.60 regardless of tier	Exception Routing — agent flags uncertainty; operator owns the decision entirely
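The tier table above can be read as a small routing function. The sketch below is illustrative only: the names `HitlPattern`, `select_hitl_pattern`, and `requires_operator` are assumptions and do not come from the reference stack. Only the four pattern names and the 0.60 confidence floor are taken from the table.

```python
from enum import Enum


class HitlPattern(str, Enum):
    PASSIVE_MONITORING = "passive_monitoring"
    APPROVE_REJECT_GATE = "approve_reject_gate"
    REVIEW_AND_EDIT = "review_and_edit"
    EXCEPTION_ROUTING = "exception_routing"


# From the table: below this confidence the operator owns the decision.
CONFIDENCE_FLOOR = 0.60

_PATTERN_BY_TIER = {
    "LOW": HitlPattern.PASSIVE_MONITORING,
    "MEDIUM": HitlPattern.APPROVE_REJECT_GATE,
    "HIGH": HitlPattern.REVIEW_AND_EDIT,
}


def select_hitl_pattern(autonomy_tier: str, confidence: float) -> HitlPattern:
    """Map an autonomy tier and confidence score to one of the four HITL patterns."""
    # The confidence check runs first because it applies regardless of tier.
    if confidence < CONFIDENCE_FLOOR:
        return HitlPattern.EXCEPTION_ROUTING
    if autonomy_tier not in _PATTERN_BY_TIER:
        raise ValueError(f"unknown autonomy tier: {autonomy_tier!r}")
    return _PATTERN_BY_TIER[autonomy_tier]


def requires_operator(pattern: HitlPattern) -> bool:
    """Only Passive Monitoring lets an action run without operator involvement."""
    return pattern is not HitlPattern.PASSIVE_MONITORING
```

Checking confidence before tier keeps the "regardless of tier" rule from being bypassed by a LOW-tier classification.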

Section 4

Data Layer: Bronze → Silver → Gold

The medallion architecture ensures that agents reason on structured, validated, enriched context — never on raw carrier payloads. Each layer has a defined latency budget, schema contract, and quality gate.

Bronze

Raw Ingest — Append-Only, Immutable

Accepts carrier webhooks, EDI 214 AT7 segments, TMS polling payloads, and OCR document extractions. Stored as-is within 30 seconds of receipt. Zero transformation — this layer is an audit log, not a working store. LLMs never see Bronze data directly.

Silver

Normalized Events — Pydantic-Validated, SLA ≤ 2 min

Transforms Bronze payloads into NormalizedEvent objects. All 13 required fields populated. Deduplication by event_fingerprint. Latency alignment applied. Data quality gate: if data_completeness_score < 0.60, circuit breaker trips → case enters awaiting_human immediately.

Gold

Agent Context — Enriched, Field-Allowlisted, SLA ≤ 5 min

Joins normalized carrier events with ERP inventory snapshot, carrier performance history, customer SLA contract terms, and route alternatives. Field allowlist enforced: agents receive only the 47 fields in the Gold schema. Raw payloads are excluded (ADR 004 — LLM injection risk). This is the only layer agents read from.
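A field allowlist of the kind the Gold layer enforces can be sketched as a plain filter. The six field names below are hypothetical stand-ins for the 47-field Gold schema, and `build_agent_context` is an illustrative name rather than the stack's actual function.

```python
from typing import Any, Mapping

# Hypothetical subset for illustration; the real Gold schema lists 47 fields.
GOLD_ALLOWLIST = frozenset({
    "case_id",
    "tracking_number",
    "carrier_code",
    "exception_type",
    "sla_deadline_utc",
    "inventory_on_hand",
})


def build_agent_context(gold_record: Mapping[str, Any]) -> dict[str, Any]:
    """Return only allowlisted fields; anything else never reaches an agent."""
    return {k: v for k, v in gold_record.items() if k in GOLD_ALLOWLIST}


def dropped_fields(gold_record: Mapping[str, Any]) -> list[str]:
    """Names that were filtered out, useful for an audit log entry."""
    return sorted(k for k in gold_record if k not in GOLD_ALLOWLIST)
```

An allowlist fails closed: a field added upstream stays invisible to agents until someone adds it to the schema deliberately, which is the property that keeps raw payloads out of prompts.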

Data Quality Circuit Breaker

Score range	Action
≥ 0.80	Proceed to investigation and action proposal
0.60 – 0.79	Flag in investigation context; proceed with reduced confidence ceiling (max confidence capped at 0.70)
< 0.60	HALT — transition case to `awaiting_human` immediately; list missing fields in escalation context
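The three bands above can be expressed as one gate function. This is a minimal sketch under stated assumptions: the field weights are hypothetical, `GateDecision` and `evaluate_gate` are illustrative names, and the score is computed as the weighted share of populated required fields, following the glossary definition.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

HALT_BELOW = 0.60          # below this the breaker trips
FULL_CONFIDENCE_AT = 0.80  # at or above this, no confidence cap applies
REDUCED_CEILING = 0.70     # cap used in the 0.60 to 0.79 band


@dataclass(frozen=True)
class GateDecision:
    score: float
    proceed: bool
    next_state: Optional[str]      # set only when the breaker trips
    confidence_ceiling: float
    missing_fields: tuple


def _is_populated(value: Any) -> bool:
    return value is not None and value != ""


def completeness_score(record: Mapping[str, Any], weights: Mapping[str, float]) -> float:
    """Weighted share of required fields that are populated."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    populated = sum(w for name, w in weights.items() if _is_populated(record.get(name)))
    return populated / total


def evaluate_gate(record: Mapping[str, Any], weights: Mapping[str, float]) -> GateDecision:
    """Apply the circuit-breaker bands to one record."""
    score = completeness_score(record, weights)
    missing = tuple(n for n in weights if not _is_populated(record.get(n)))
    if score < HALT_BELOW:
        # HALT: escalate immediately and carry the missing fields along.
        return GateDecision(score, False, "awaiting_human", 0.0, missing)
    ceiling = 1.0 if score >= FULL_CONFIDENCE_AT else REDUCED_CEILING
    return GateDecision(score, True, None, ceiling, missing)
```

Returning the missing field names with the decision gives the escalation context what the HALT row asks for without a second pass over the record.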

Section 5

Trust & Safety: Guardrails and HITL

A production agentic system requires defense-in-depth. Seven guardrail layers operate at different points in the agent trajectory. Four "never" rules are deterministic constraints that no confidence score or operator approval can override.

7-Layer Guardrail Stack

1

Input Validation

Pydantic schema validation on all inbound carrier webhook payloads. Reject malformed events before they enter Bronze layer.

2

LLM Reasoning Check

Structured output grader verifies that every field in proposed_action traces back to a tool result in investigation_context. Ungrounded fields are flagged as hallucination candidates.

3

Tool Input Check

All adapter call parameters validated against tool-specific schemas before execution. Prevents fabricated API parameters from reaching carrier systems.

4

Tool Output Check

All fetch_* adapter results scanned for instruction injection patterns before entering ExceptionCaseState. External content is data, not instructions.

5

Final Response Check

Action proposal reviewed against authorization policy (autonomy tier, financial threshold, compliance category) before case transitions to action_proposed.

6

Rules-Based Protections

Deterministic enforcement layer: financial commitment thresholds, customs always-escalate rule, rerouting limits, HITL mandatory for HIGH-tier actions. No LLM reasoning involved.

7

Guardian Agents

Separate LLM-based review layer for high-stakes cases: validates investigation context consistency, checks for instruction drift, scores trajectory coherence. Runs asynchronously on HIGH-tier cases.
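As one illustration of the stack above, the grounding idea behind layer 2 can be approximated with a deterministic check. The sketch assumes that `investigation_context` keeps tool outputs under a `tool_results` key and that grounding means an exact value match; both are simplifying assumptions, and the function names are hypothetical. A production grader would likely need fuzzier matching.

```python
from typing import Any


def _leaf_values(obj: Any) -> set:
    """Collect every scalar in a nested dict/list structure, as strings."""
    if obj is None:
        return set()
    if isinstance(obj, dict):
        return set().union(*(_leaf_values(v) for v in obj.values()))
    if isinstance(obj, (list, tuple)):
        return set().union(*(_leaf_values(v) for v in obj))
    return {str(obj)}


def ungrounded_fields(proposed_action: dict, investigation_context: dict) -> list:
    """Return proposed_action field names whose value appears in no tool result."""
    evidence = _leaf_values(investigation_context.get("tool_results", []))
    return sorted(
        name for name, value in proposed_action.items()
        if str(value) not in evidence
    )
```

An empty result means every proposed value was seen in some tool output. Any name in the list is a hallucination candidate to flag, not proof of one.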

Four "Never" Rules

These are hard constraints that operate independently of agent confidence, operator approval, or business urgency:

✗ Never mutate a system of record without an idempotency key and a completed policy check. Every write action (execute_carrier_rebook, send_customer_notification, update_tms_status) carries an idempotency key derived from case_id + action_type + attempt_number. Replay safety is non-negotiable.

✗ Never allow the agent to invent customs facts, HS codes, tariff classifications, or document completeness determinations. These fields must trace to a tool result from an authoritative source. If the source tool returns null or fails, the agent transitions to awaiting_human — it does not synthesize a value.

✗ Never hide a low-confidence decision behind polished prose. When source_confidence < 0.30 or data_completeness_score < 0.60, the agent must explicitly state which sources are missing or stale and transition to Exception Routing. A well-worded proposal that conceals an incomplete investigation is a worse outcome than an explicit escalation.

✗ Never treat document extraction as complete until key fields reconcile against business rules and source artifacts. A commercial invoice extraction is complete only when invoice_total, hs_code, country_of_origin, and consignee are all present and pass format validation. Partial extraction that passes silently is a silent failure path.
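The first rule above, combined with the Execution Agent's two retries and exponential backoff, might look like the following sketch. `IdempotentExecutor` and `RetryableError` are hypothetical names, and the in-memory dict stands in for a durable store. The sketch also assumes that attempt_number counts operator-level attempts, so transport retries inside one call share a key.

```python
import time
from typing import Callable, Dict


class RetryableError(Exception):
    """Transient failure (hypothetical), for example a carrier API timeout."""


def idempotency_key(case_id: str, action_type: str, attempt_number: int) -> str:
    """Build the key in the {case_id}:{action_type}:{attempt_number} format."""
    return f"{case_id}:{action_type}:{attempt_number}"


class IdempotentExecutor:
    def __init__(self, max_retries: int = 2, base_delay_s: float = 1.0,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self._results: Dict[str, str] = {}  # in-memory stand-in for a durable store
        self._max_retries = max_retries
        self._base_delay_s = base_delay_s
        self._sleep = sleep

    def execute(self, key: str, policy_check_passed: bool,
                action: Callable[[], str]) -> str:
        if not policy_check_passed:
            raise PermissionError("policy check not completed; refusing to write")
        if key in self._results:
            # Replay of a completed action: return the recorded result, no new side effect.
            return self._results[key]
        for retry in range(self._max_retries + 1):
            try:
                result = action()
            except RetryableError:
                if retry == self._max_retries:
                    raise  # retries exhausted; caller moves the case to failed
                self._sleep(self._base_delay_s * 2 ** retry)  # 1s, 2s, ...
            else:
                self._results[key] = result
                return result
        raise AssertionError("unreachable")
```

Injecting `sleep` keeps the backoff testable without real delays. A production version would also have to record the key atomically with the side effect, which this sketch does not attempt.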

Section 6

Evaluation: Eval-First Engineering

At 95% per-step accuracy, a 10-step agent trajectory succeeds only 60% of the time. The eval harness is scaffolded before any agent code. This is not process discipline — it is the only way to detect compounding errors during development, not after deployment.

Steps in trajectory	Per-step accuracy	End-to-end success rate
5 steps	95%	77%
10 steps	95%	60%
20 steps	95%	36%
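The figures in the table follow from multiplying per-step accuracy across the trajectory. The short sketch below reproduces them, assuming step errors are independent (a simplification, since real agent errors are often correlated). The function name is illustrative.

```python
def end_to_end_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_accuracy <= 1.0:
        raise ValueError("per_step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_accuracy ** steps


for steps in (5, 10, 20):
    print(f"{steps} steps at 95%: {end_to_end_success(0.95, steps):.0%}")
# 5 steps at 95%: 77%
# 10 steps at 95%: 60%
# 20 steps at 95%: 36%
```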

Four-Pillar Eval Framework

✓

Outcome Correctness

Did the agent reach the correct terminal state? Are all required system effects present? Graded deterministically from expected_environment_state in the scenario YAML.

◎

Trajectory Quality

Did the agent take a reasonable path to the outcome? LLM-as-judge grader checks investigation completeness, tool call sequence, and reasoning coherence.

⚡

Safety & Guardrails

Did the agent respect all seven guardrail layers? Did it correctly route HIGH-tier actions to HITL? Did it reject prompt injection attempts? Safety score is a hard gate — any safety failure blocks deployment.

⏱

Performance & Cost

MTTR target met? Token budget within bounds? Tool call count within the 9-call budget? These are secondary gates: failures here trigger optimization work, not a deployment block.

pass@k vs pass^k

Metric	Definition	When to use
pass@k	At least 1 of k runs succeeds — capability ceiling	Capability benchmarking. "Can the agent ever solve this?"
pass^k	All k runs must succeed — regression gate	Production graduation. "Does the agent reliably solve this every time?"

Carrier rebook, SLA breach escalation, prompt injection resistance, and the customs always-escalate rule all require pass^5 (5-of-5) before production graduation. A system that passes 4-of-5 on a safety-critical scenario is not ready.
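Using the definitions in the table above, both metrics reduce to `any` and `all` over the per-run results. The sketch below is illustrative: the function names and the `graduates` helper are assumptions, not the eval harness API.

```python
from typing import Dict, List


def pass_at_k(run_results: List[bool]) -> bool:
    """pass@k: at least one of the k runs succeeded (capability ceiling)."""
    if not run_results:
        raise ValueError("need at least one run")
    return any(run_results)


def pass_hat_k(run_results: List[bool]) -> bool:
    """pass^k: every one of the k runs succeeded (regression gate)."""
    if not run_results:
        raise ValueError("need at least one run")  # all([]) would pass vacuously
    return all(run_results)


def graduates(scenario_runs: Dict[str, List[bool]], k: int = 5) -> bool:
    """Production gate: every scenario has exactly k runs and all of them pass."""
    return all(
        len(runs) == k and pass_hat_k(runs)
        for runs in scenario_runs.values()
    )
```

A scenario that passes 4 of 5 runs satisfies pass@5 but fails pass^5, which is the case the paragraph above rules out for safety-critical scenarios.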

Section 7

Implementation Path

Build in this order. Stop at each gate. Do not add the next component until the current one's signal criterion passes.

1

Domain entities + state machine — no LLM, no HTTP

Gate: pytest test_state_machine.py passes all 13 valid + 6 invalid transitions

2

Eval harness + 2 golden scenarios — before agent code

Gate: EvalHarness.run_scenario("GOLD-001") returns EvalResult without crashing

3

Mocked carrier adapters — real interfaces, fixture data

Gate: adapter.fetch_tracking_events("TEST-TRK-001") returns NormalizedEvent objects with all 13 fields

4

LangGraph skeleton — stub nodes, graph wiring, HITL interrupt

Gate: graph runs detected→closed with stubs; awaiting_human interrupt fires and resumes

5

FastAPI ingest → worker → real agents (one at a time)

Gate per agent: its eval scenario passes before moving to the next agent

What "Local-First" Means in Practice

The entire stack runs locally without cloud dependencies. Carrier adapters use fixture JSON from stack/evals/fixtures/. The LangGraph checkpoint store uses SQLite locally (Postgres in production). LiteLLM routes to a local model (Ollama) or a cloud API depending on the LITELLM_MODEL environment variable. The eval harness runs without any live carrier API credentials.

Stack Dependencies

Python 3.11+
FastAPI + uvicorn
LangGraph 0.2+
LiteLLM (model routing layer)
Pydantic v2
Langfuse (LLM observability)
OpenTelemetry SDK
Redis (worker queue — local: fakeredis)
SQLite (LangGraph checkpoint store — local)
pytest + pytest-asyncio

Section 8

Business Case

The ROI from exception automation compounds: each automated investigation cycle reduces cost per case and accumulates labeled training data that improves future case routing. Payback: 6–18 months at 50+ cases/day.

6–18 mo

Typical payback timeline at 50–200 cases/day

50%

Reduction in manual lookup and reconciliation work per case

3–5%

Reduction in expedite spend as share of total logistics spend

KPI Before vs. After

KPI	Today (manual)	Agentic target
Mean Time to Resolve (MTTR)	4–6 hours	<45 minutes
Human intervention rate	~100%	<30%
Manual lookup reduction	Baseline	50% per case
On-time delivery (OTD) rate	85–88% (industry avg)	+3–5 pp improvement
Exception labor cost	$12–18 / shipment	30–50% reduction
Expedite spend	3–5% of logistics spend	3–5% reduction

ICP: Are You Ready?

You are the right fit if: You are a VP of Supply Chain, Head of Logistics Ops, or Engineering Lead at a 3PL, large manufacturer, or high-volume retailer. You handle 50–500 exception cases per day across multiple carriers. You have a TMS or ERP with an API you can query. You are at Maturity Level 2 (Assisted) and want to reach Level 3–4.

Not quite ready if: You don't have structured exception data (only emails and phone calls). Your carrier data arrives exclusively via manual entry with no EDI, API, or webhook. You handle fewer than 20 exceptions per day — the ROI isn't there yet. You have no engineering capacity to wire an adapter or run a LangGraph graph locally.

The right first step: pick one exception type. Wire one carrier. Get to Level 3 on that narrow path first. Don't architect a 6-agent system until a 1-agent proof of value is working.

Section 9

Code Implementation

The reference stack is a Python monorepo organized around a clean architecture: the domain layer (entities and state machine) has zero external dependencies, and all external systems are accessed through a typed adapter interface.

Project Setup

Dependencies are managed via pyproject.toml. The key runtime dependencies and their roles:

Package	Version	Role
`langgraph`	≥ 0.2.0	Agent orchestration, state machine execution, HITL interrupt/resume
`fastapi`	≥ 0.111.0	Carrier webhook receiver, operator REST API
`pydantic`	≥ 2.7.0	Schema validation at every layer boundary (Bronze→Silver→Gold, AgentContext allowlist)
`litellm`	≥ 1.40.0	LLM routing: Sonnet for investigation/strategy, Haiku for normalization/risk/learning
`langfuse`	≥ 2.30.0	LLM trace observability, token cost tracking, LLM-as-judge eval grading
`opentelemetry-sdk`	≥ 1.25.0	Span instrumentation for the 17 typed agent events
`boto3`	≥ 1.34.0	SQS consumer (worker), S3 Bronze store, Bedrock model invocation

python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
python -m domain.db_init          # initialize SQLite checkpoint store (local)
pytest tests/unit/ tests/state_machine/ -v   # no LLM or cloud credentials needed
python -m evals.run_scenario --scenario-id GOLD-001

Module Structure

stack/
├── api/              ← FastAPI: POST /webhooks/carrier/{code}, GET/POST /cases/{id}
├── domain/           ← ExceptionCase entity, state machine, AgentContext allowlist
├── orchestration/    ← StateGraph wiring, 6 agent nodes, interrupt_before=["execute"]
├── adapters/         ← AbstractCarrierAdapter + mocked FedEx/UPS/Maersk/TMS/ERP/EDI
├── worker/           ← SQS polling loop, visibility timeout extension, DLQ handler
├── evals/            ← EvalHarness ABC, 8 YAML scenarios, LLM-as-judge graders
├── observability/    ← 17 typed event emitters, MELT metrics, OTel span decorators
└── tests/            ← Unit / integration / state machine / E2E (4-layer pyramid)

Core Interface Contracts

AbstractCarrierAdapter

from abc import ABC, abstractmethod

# NormalizedEvent, CarrierActionRequest, and CarrierActionResult are
# domain models defined elsewhere in the stack.


class AbstractCarrierAdapter(ABC):
    @abstractmethod
    def fetch_tracking_events(
        self, tracking_number: str
    ) -> list[NormalizedEvent]:
        """Fetch normalized events. Must be idempotent."""
        ...

    @abstractmethod
    def execute_action(
        self, action: CarrierActionRequest
    ) -> CarrierActionResult:
        """Execute approved action. Idempotency key required."""
        ...

ExceptionCase entity

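from datetime import datetime
from typing import Literal

from pydantic import BaseModel, Field

# CaseState and ExceptionType are domain types defined elsewhere in the stack.


class ExceptionCase(BaseModel):
    case_id: str
    state: CaseState = "detected"
    tracking_number: str
    carrier_code: str
    exception_type: ExceptionType      # CUSTOMS_HOLD | WEATHER_DELAY | ...
    # Defaults to the most restrictive tier until triage assigns one
    autonomy_tier: Literal["LOW", "MEDIUM", "HIGH"] = "HIGH"
    risk_score: float = Field(0.5, ge=0.0, le=1.0)
    data_completeness_score: float = Field(0.0, ge=0.0, le=1.0)
    sla_deadline_utc: datetime | None = None
    financial_materiality_usd: float = 0.0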

LangGraph graph wiring

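from langgraph.graph import END, StateGraph

# ExceptionCaseState, the six agent modules, and should_route_to_human
# are defined elsewhere in the stack.


def build_graph(checkpointer):
    """Wire the six agent nodes and compile the graph with a checkpointer."""
    graph = StateGraph(ExceptionCaseState)
    # Register 6 agent nodes
    graph.add_node("normalize",   normalization.run)
    graph.add_node("assess_risk", risk_assessment.run)
    graph.add_node("investigate", investigation.run)
    graph.add_node("strategize",  strategy.run)
    graph.add_node("execute",     execution.run)
    graph.add_node("learn",       learning.run)

    # Linear path up to strategy (wiring assumed from the lifecycle order)
    graph.set_entry_point("normalize")
    graph.add_edge("normalize", "assess_risk")
    graph.add_edge("assess_risk", "investigate")
    graph.add_edge("investigate", "strategize")

    # Conditional edge: HITL gate at action_proposed
    graph.add_conditional_edges(
        "strategize",
        should_route_to_human,
        {"awaiting_human": END, "execute": "execute"}
    )

    # After execution, label the outcome and finish (assumed wiring)
    graph.add_edge("execute", "learn")
    graph.add_edge("learn", END)

    return graph.compile(
        checkpointer=checkpointer,
        interrupt_before=["execute"]   # ← HITL serialization point
    )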

The interrupt_before=["execute"] argument is the HITL mechanism. When the graph reaches this node for a case requiring operator review, LangGraph serializes the full graph state to the checkpoint store (Postgres in production, SQLite locally). The case waits in awaiting_human until the operator approves or rejects. The graph then resumes from the checkpoint, so no state is lost across process restarts or worker scaling events.

Testing Pyramid

Layer	Location	LLM needed	CI gate
Unit	`tests/unit/`	No	Every commit
State machine	`tests/state_machine/`	No	Every commit
Integration	`tests/integration/`	Mocked	Every commit
E2E	`tests/e2e/`	Yes	Staging gate

State machine tests cover all 13 valid transitions and all 6 invalid transitions (each must raise InvalidStateTransitionError). These run without any credentials in under 2 seconds — they are the fastest regression signal in the system.

Section 10

Deployment on AWS

The production topology maps onto five AWS services. The API service is lightweight — validation and queue dispatch only. The Worker service is heavy — it runs LangGraph including multi-step LLM calls. They scale independently.

Internet
    │
    ▼
ALB (HTTPS :443)
    │
    ▼
ECS Fargate — API Service (FastAPI webhook receiver)
  cpu: 512  memory: 1 GB  min: 2 tasks  scale: requests/target
    │  SQS SendMessage
    ▼
SQS — shipping-exception-queue
  VisibilityTimeout: 900s (customs hold budget)
    │  SQS ReceiveMessage (1 msg per task)
    ▼
ECS Fargate — Worker Service (LangGraph runner)
  cpu: 1024  memory: 4 GB  min: 1 task  scale: queue depth
    ├──▶  RDS Postgres    (LangGraph checkpoint store — awaiting_human state)
    ├──▶  AWS Bedrock     (Sonnet: investigation + strategy / Haiku: others)
    ├──▶  S3 Bronze       (raw carrier payloads, EDI messages, OCR extractions)
    └──▶  Langfuse        (LLM traces, token cost, eval grading)

Service Configuration and Tuning Knobs

Service	Key configuration	Most impactful tuning knob
ECS API	cpu: 512, memory: 1024 MB, min 2 tasks, max 10, scale on RequestCount	Raise `memory` if Pydantic validation OOMs on large EDI payloads
ECS Worker	cpu: 1024, memory: 4096 MB, min 1 task, max 20, scale on SQS depth	`memory` is the first knob — raise to 8 GB if investigation agent loads large document extracts
SQS	VisibilityTimeout: 900s, MessageRetentionPeriod: 4 days, DLQ maxReceiveCount: 3	`VisibilityTimeout` must exceed the longest graph execution time with margin. Customs hold investigations take up to 12 min (720s), so 900s leaves 25% headroom; a 1.5× margin would require 1,080s
RDS Postgres	db.t4g.medium (50/day), db.r7g.large (500/day), Multi-AZ required	Multi-AZ is non-negotiable — checkpoint store is the recovery point for all `awaiting_human` states
Bedrock	Sonnet for Investigation + Strategy agents; Haiku for Normalization, Risk, Learning	Route Risk-and-Impact to Haiku (saves ~$0.05/case); request RPM quota increase before load testing

Model Routing: Sonnet vs. Haiku

Agent	Model	Rationale
Signal Normalization	Haiku	Deterministic EDI/webhook extraction — no complex reasoning
Risk-and-Impact	Haiku	Rule-based scoring; LLM only for novel exception types
Root-Cause Investigation	Sonnet	Multi-step ReAct loop; carrier API + ERP + customs queries
Policy-and-Strategy	Sonnet	Ranked candidate generation — output quality is load-bearing
Execution	No LLM	Fully deterministic carrier API calls
Learning-and-Eval	Haiku	Outcome labeling — structured output, lower complexity

Monthly Cost Model

Resource	Dev/staging	Prod (100/day)	Prod (500/day)
ECS Fargate (API + Worker)	$23	$75	$260
RDS Postgres (Multi-AZ)	—	$80	$200
AWS Bedrock (Sonnet + Haiku)	$20	$175	$870
S3 Bronze storage	$1	$5	$40
SQS + NAT Gateway	$6	$17	$58
Langfuse cloud	Free	$50	$150
Total	~$50	~$400	~$1,580

Bedrock cost dominates. The single biggest lever: route normalization, risk-scoring, and learning to Haiku instead of Sonnet. This alone reduces LLM cost by ~40% with no change to investigation or strategy quality.

Terraform Module Structure

infra/
├── modules/
│   ├── ecs-api/     ← Task def, ALB, target group, auto-scaling
│   ├── ecs-worker/  ← Task def, SQS-triggered step scaling
│   ├── sqs/         ← Queue, DLQ, CloudWatch alarm
│   ├── rds/         ← Postgres instance, subnet group, Multi-AZ
│   └── iam/         ← API task role, Worker task role (least privilege)
└── envs/
    ├── dev/          ← SQLite locally; only RDS and SQS differ
    └── prod/         ← Full VPC, Multi-AZ RDS, autoscaling enabled

Appendix

Glossary & State Machine Reference

Glossary

ExceptionCase

The core domain entity. Holds case_id, state, exception_type, autonomy_tier, and audit_events. Its state is never assigned directly; it changes only via StateMachine.transition().

data_completeness_score

Float 0.0–1.0. Weighted ratio of required Gold-layer fields populated to total required fields for the given exception type. Circuit breaker threshold: 0.60.

autonomy_tier

LOW / MEDIUM / HIGH. Assigned by Risk-and-Impact Agent. Determines HITL pattern for the case. Influences confidence thresholds.

idempotency_key

Format: {case_id}:{action_type}:{attempt_number}. Ensures repeated execution attempts for the same action do not produce duplicate side effects.

NormalizedEvent

Silver-layer entity produced by Signal Normalization Agent. 13 required fields. Schema validated by Pydantic. Deduplicated by event_fingerprint.

pass^k

Regression eval metric. All k runs of a scenario must pass. Used for safety-critical scenarios (customs, rebook, prompt injection). Blocks production graduation if any run fails.

HITL

Human-in-the-loop. Four patterns: Passive Monitoring (no approval), Approve/Reject Gate (≤30s), Review-and-Edit (operator modifies), Exception Routing (confidence failure escalation).

EvalHarness

Abstract base class for the eval system. Defines run_scenario(scenario_id) → EvalResult and list_scenarios() → list[ScenarioSpec]. Scaffolded before any agent code.

EDDOps

Error-Driven Development for Operations. When a production failure occurs, the 72hr SLA requires a minimized test case (failing scenario) committed to the eval harness before the fix is deployed.

MELT

Metrics, Events, Logs, Traces. The four observability signal types used by the stack. 17 typed events required per case for a complete audit trail.

State Transition Reference

All 13 valid state transitions. Any transition not in this list raises InvalidStateTransitionError.

From state	To state	Trigger
`detected`	`triaged`	Risk-and-Impact Agent completes scoring
`triaged`	`investigating`	Root-Cause Investigation Agent begins ReAct loop
`investigating`	`action_proposed`	Policy-and-Strategy Agent produces ranked candidates
`action_proposed`	`awaiting_human`	Autonomy tier is MEDIUM/HIGH, or confidence < threshold
`action_proposed`	`auto_resolved`	Autonomy tier is LOW and confidence ≥ threshold
`awaiting_human`	`auto_resolved`	Operator approves the proposed action
`awaiting_human`	`investigating`	Operator rejects; routes back for re-investigation
`auto_resolved`	`closed`	Downstream validation confirms shipment on track
`failed`	`closed`	Operator acknowledges failure; case closed manually
`investigating`	`awaiting_human`	data_completeness_score < 0.60 (circuit breaker)
`investigating`	`failed`	Step count exceeds 20, or same tool called 3× consecutively
`auto_resolved`	`failed`	Execution Agent: all retries exhausted, or a non-retryable error
`failed`	`investigating`	Operator explicitly reopens case for re-investigation
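The transition table above maps directly onto a lookup structure. The sketch below encodes the 13 rows and raises InvalidStateTransitionError for anything else. The `VALID_TRANSITIONS` name and the free-function form of `transition` are illustrative assumptions; the reference stack exposes this through StateMachine.transition().

```python
class InvalidStateTransitionError(Exception):
    """Raised for any transition not listed in the reference table."""


# One entry per "from" state; 13 (from, to) pairs in total.
VALID_TRANSITIONS = {
    "detected": {"triaged"},
    "triaged": {"investigating"},
    "investigating": {"action_proposed", "awaiting_human", "failed"},
    "action_proposed": {"awaiting_human", "auto_resolved"},
    "awaiting_human": {"auto_resolved", "investigating"},
    "auto_resolved": {"closed", "failed"},
    "failed": {"closed", "investigating"},
    "closed": set(),  # terminal
}


def transition(current: str, target: str) -> str:
    """Return the new state, or raise if the move is not in the table."""
    if current not in VALID_TRANSITIONS:
        raise InvalidStateTransitionError(f"unknown state: {current!r}")
    if target not in VALID_TRANSITIONS[current]:
        raise InvalidStateTransitionError(f"{current} -> {target} is not allowed")
    return target
```

Keeping the table as data lets the state machine tests iterate over every listed pair and over a chosen set of invalid pairs without touching agent code.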
