A proof-of-concept multi-agent system for assessing supply chain risks using LangGraph, specialized AI agents, and real-time news analysis.
- Problem
- Solution
- Results
- How It Works
- Quick Start
- Installation
- Architecture Deep Dives
- Documentation
- Project Structure
Supply chain managers face critical challenges from real-time supply chain disruptions:
- Information Overload: 100+ news articles daily about shortages, port delays, factory incidents affecting supply chains
- Time Pressure: Supply chain events require rapid response within minutes, not hours
- Expertise Gap: Single analyst can't assess logistics, manufacturing, compliance, AND cybersecurity impacts simultaneously
- Missed Risks: Manual analysis is time-consuming and incomplete
Real-world scenario:
A semiconductor shortage hits major automotive suppliers at 8 AM. By 2 PM, your procurement team needs to know:
- Which suppliers and shipments are affected?
- What risks are cascading through the supply chain?
- Are there alternative suppliers available?
- What compliance or cybersecurity issues may arise?
Traditional approach: 1 analyst, 4+ hours, incomplete analysis.
Our POC approach: AI multi-agent system, 15-30 seconds, structured risk assessment.
This system automatically analyzes supply chain risks by orchestrating 4 specialized AI agents:
```
USER QUERY: "semiconductor shortage impact on automotive"
        ↓
SYSTEM PROCESSES:
        ↓
Fetches real-time news articles from NewsAPI (1-20 articles)
        ↓
Categorizes risks by type (facility incidents, supply issues, etc.)
        ↓
Routes to appropriate specialists (Logistics, Manufacturing, Compliance, Cybersecurity)
        ↓
Each specialist analyzes from their domain expertise
        ↓
Synthesizes findings into actionable recommendations
        ↓
Stores insights in memory for future queries
        ↓
OUTPUT: Executive summary + prioritized recommendations in 15-30 seconds
```
1. Multi-Agent Architecture
- 4 specialized AI agents (domain experts)
- Parallel execution capability
- Hierarchical supervisor pattern for intelligent routing
2. Advanced AI Patterns
- Tool Use: Real-time news retrieval and risk categorization
- Reflection: Self-critique and improvement
- Chain-of-Thought: 6-step reasoning transparency
- Memory: Short-term (recent 10 queries) + long-term persistence
- Reinforcement Learning (RL): Q-learning pattern selector (experimental)
3. Production-Ready Design
- Context engineering (manages token limits efficiently)
- Retry logic with guardrails
- Human-in-the-loop debugging mode
- Comprehensive metrics tracking
4. Modular & Swappable
- Switch LLM providers via config (Groq/OpenRouter)
- Add new specialists by extending base class
- Config-driven risk categories
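Extending the system with a new domain expert could look roughly like this. This is a hypothetical sketch: the `BaseAgent` stand-in and the `GeopoliticsSpecialist` example are illustrative, and the real interface in `src/agents/base_agent.py` may differ.

```python
# Hypothetical sketch of adding a new specialist by subclassing a base agent.
# The actual BaseAgent interface in src/agents/base_agent.py may differ.
from abc import ABC, abstractmethod


class BaseAgent(ABC):
    """Minimal stand-in for the project's abstract base class."""

    def __init__(self, llm_engine):
        self.llm_engine = llm_engine  # injected, not created here

    @abstractmethod
    def analyze(self, query: str, articles: list[dict]) -> dict:
        """Return a structured risk report for this domain."""


class GeopoliticsSpecialist(BaseAgent):
    """Example new domain expert: trade policy and sanctions risk."""

    PROMPT = "You are a Geopolitics Supply Chain Specialist. Query: {query}"

    def analyze(self, query: str, articles: list[dict]) -> dict:
        prompt = self.PROMPT.format(query=query)
        # The real system would call self.llm_engine here; this sketch
        # returns a fixed-shape report matching the project's JSON schema.
        return {
            "risk_level": "medium",
            "findings": [f"{len(articles)} articles reviewed for '{query}'"],
            "recommendations": ["Monitor export-control announcements"],
            "confidence_score": 0.5,
        }
```

Because the specialist only implements `analyze()` and receives the LLM engine from outside, registering it with the supervisor is a config change rather than a workflow rewrite.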
Command:
```bash
python main.py --query "nvidia ai chips supply chain disruption"
```

Output (15 seconds):
================================================================================
SUPPLY CHAIN RISK ANALYSIS - RESULTS
================================================================================
[!!!] OVERALL RISK LEVEL: HIGH
--------------------------------------------------------------------------------
WHAT THIS MEANS:
The NVIDIA AI chip supply chain is at significant risk of disruption due
to increased demand for Microsoft's Maia 200 AI accelerator, potential
shipping delays, freight cost implications, and geopolitical tensions.
KEY RISKS IDENTIFIED:
1. Potential shipping delays and freight cost implications due to
increased demand for Microsoft's Maia 200 AI accelerator, which
could lead to a 15% increase in freight costs and a 20% delay in
delivery timelines
2. Possible transportation disruptions at ports in Taiwan and China,
affecting delivery timelines for NVIDIA AI chips
3. Geopolitical tensions and trade policies pose significant risk to
the supply chain
RECOMMENDED ACTIONS:
1. Priority: HIGH | When: immediate
Action: Diversify logistics and transportation partnerships to
mitigate potential route availability constraints
Who: NVIDIA Logistics Team
2. Priority: MEDIUM | When: short-term
Action: Implement proactive freight audit and payment processes
Who: NVIDIA Finance Team
ANALYSIS DETAILS:
News Articles Analyzed: 2
Expert Specialists Consulted: logistics
Reflection Pattern: Used (self-critique applied)
Chain-of-Thought: Used (6 reasoning steps)
Analysis Confidence: 85%
| Metric | Target | Measured | Status |
|---|---|---|---|
| System Reliability | 95%+ | 100% (5/5 successful) | ✅ Exceeds target |
| Response Time | <30s | 15-22s avg | ✅ Meets target |
| Recommendations Generated | 100% | 100% (5/5) | ✅ Meets target |
| Chain-of-Thought Reasoning | N/A | 6 steps (consistent) | ✅ Implemented |
| Caching | N/A | Functional (hit/miss tracking) | ✅ Implemented |
| Rate Limiting | N/A | Functional (TPM tracking) | ✅ Implemented |
| Capability | Status | Next Steps |
|---|---|---|
| Article Relevance | ~34% categorization rate observed | Needs manual labeling & tuning for 90%+ target |
| Risk Categorization | Keyword-based; functional | Needs ground truth dataset for accuracy validation |
| Multi-Specialist Activation | 1/4 specialists activated in tests | Review supervisor routing logic for proper activation |
| Token Efficiency | Theoretical 92% reduction (60K→4.8K) | Measure systematically across 100+ queries |
| Format Compliance | 95-100% observed in testing | Validation framework built; extract stats from larger sample |
Note: System successfully completes all queries with high reliability. Metrics collection infrastructure is production-ready. Formal evaluation with labeled test dataset (n=100+) is the next milestone for production deployment.
Query 1: "semiconductor shortage impact on automotive industry"
- Articles Retrieved: 1
- Articles Categorized: 0
- Specialists Used: logistics
- Response Time: ~22 seconds
- ✅ Analysis completed successfully
Query 2: "nvidia ai chips supply chain disruption"
- Articles Retrieved: 2
- Articles Categorized: 1 (medium risk)
- Specialists Used: logistics
- Response Time: ~15 seconds
- ✅ Analysis completed successfully
Query 3: "chip shortage" (with --dev flag)
- Articles Retrieved: 20
- Articles Categorized: 7 (medium risk)
- Specialists Used: logistics (human-in-the-loop interrupted)
- Human-in-the-Loop: ✅ Successfully paused workflow
- ✅ System responded correctly to user input
Query 4: "chinese semiconductor industry export restrictions"
- Articles Retrieved: 2
- Articles Categorized: 1 (medium risk)
- Specialists Used: logistics
- Response Time: ~14 seconds
- ✅ Analysis completed successfully
Query 5: "chip shortage" (with DEBUG logging)
- Articles Retrieved: 20
- Articles Categorized: 7 (medium risk)
- Specialists Used: logistics
- Rate Limiting: ✅ Triggered correctly ("TPM limit approaching, waiting 54.4s")
- Token Tracking: ✅ Visible in logs (11,384 → 9,496 tokens remaining)
- ✅ Analysis completed successfully
```
┌──────────────────────────────────────────────────────────────────┐
│ USER QUERY                                                       │
│ "chip shortage impact on automotive industry"                    │
└─────────────┬────────────────────────────────────────────────────┘
              │
      ┌───────▼────────┐
      │ (1) NEWS       │  Fetches 1-20 articles from NewsAPI
      │ RETRIEVAL      │  Context Engineering: Limit to 20 max
      │ [Tool, no LLM] │  File: src/tools/news_retrieval.py
      └───────┬────────┘
              │
      ┌───────▼────────┐
      │ (2) RISK       │  Categorizes by keywords (supply_issues, etc.)
      │ CATEGORIZATION │  Context Engineering: Keyword matching
      │ [Tool, no LLM] │  File: src/tools/risk_categorizer.py
      └───────┬────────┘  Config: config/risk_categories.py
              │
      ┌───────▼────────┐
      │ (3) SUPERVISOR │  Decides: 1, 2, or 4 specialists needed?
      │ ROUTING        │  Rule: 3+ HIGH = all 4, 1-2 HIGH = subset
      │ [Logic, no LLM]│  File: src/graph/nodes.py:supervisor_node()
      └───────┬────────┘
              │
      ┌───────▼──────────────────────────────────────────────┐
      │ (4) SPECIALIST AGENTS (Parallel Execution)           │
      │                                                      │
      │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────┐    │
      │  │LOGISTICS │ │MANUFACTUR│ │COMPLIANCE│ │CYBER  │    │
      │  │SPECIALIST│ │ING SPEC. │ │SPEC.     │ │SPEC.  │    │
      │  │[LLM Call]│ │[LLM Call]│ │[LLM Call]│ │[LLM]  │    │
      │  └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬───┘    │
      │       │            │            │           │        │
      │  Context Engineering: Each gets top 3 articles       │
      │  File: src/context/manager.py                        │
      │  Prompts: config/prompts.py (domain-specific)        │
      └───────┬──────────────────────────────────────────────┘
              │
      ┌───────▼────────┐
      │ (5) SYNTHESIS  │  Combines all specialist reports
      │                │  Optional: Chain-of-Thought (6 steps)
      │ [LLM Call]     │  Optional: Reflection (self-critique)
      │                │  File: src/graph/nodes.py:synthesis_node()
      └───────┬────────┘  Prompts: config/prompts.py
              │
      ┌───────▼────────┐
      │ (6) MEMORY     │  Stores analysis for future reference
      │ STORAGE        │  Short-term: Recent 10 queries
      │ [No LLM]       │  Long-term: Persistent JSON storage
      │                │  File: src/memory/memory_system.py
      └───────┬────────┘  Storage: data/memory_storage/
              │
      ┌───────▼────────┐
      │ (7) OUTPUT     │  Formatted terminal output
      │ FORMATTING     │  Optional: Save JSON file
      │                │  File: main.py:print_analysis_summary()
      └────────────────┘
```
Hierarchical Supervisor Pattern:
```
                 ┌──────────────┐
                 │  SUPERVISOR  │
                 │  (Decides)   │
                 └──────┬───────┘
                        │
     ┌──────────────────┼──────────────────┐
     │                  │                  │
HIGH SEVERITY      MEDIUM-HIGH        LOW SEVERITY
 (3+ risks)        (1-2 risks)         (0 risks)
     │                  │                  │
┌────▼─────┐      ┌────▼─────┐      ┌────▼─────┐
│ ALL 4    │      │ SUBSET   │      │ MINIMAL  │
│Logistics │      │Logistics │      │Logistics │
│Manufactur│      │Manufactur│      │only      │
│Compliance│      │          │      │          │
│Cyber     │      │          │      │          │
└──────────┘      └──────────┘      └──────────┘
```
Why This Matters:
- Token Efficiency: Don't call all 4 specialists for minor issues (saves 75% tokens)
- Cost Optimization: Intelligent routing reduces unnecessary LLM calls
- Speed: Parallel execution when multiple specialists are needed
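The routing rule above ("3+ HIGH risks = all 4, 1-2 HIGH = subset, 0 = minimal") reduces to a small pure function. This is an illustrative sketch, not the project's actual `supervisor_node()`; in particular, which specialists make up the "subset" is assumed here.

```python
# Illustrative sketch of the rule-based supervisor routing described above.
# The real logic lives in src/graph/nodes.py:supervisor_node() and may differ
# (e.g. the subset may be chosen from the detected risk categories).
ALL_SPECIALISTS = ["logistics", "manufacturing", "compliance", "cybersecurity"]


def route_specialists(risk_summary: dict) -> list[str]:
    """Pick specialists based on the count of HIGH-severity risks."""
    high_count = risk_summary.get("by_severity", {}).get("high", 0)
    if high_count >= 3:
        return ALL_SPECIALISTS                 # severe: consult all 4
    if high_count >= 1:
        return ["logistics", "manufacturing"]  # subset for 1-2 HIGH risks
    return ["logistics"]                       # minimal path for low severity
```

Keeping this rule-based (no LLM call) makes routing free, deterministic, and trivially testable.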
Problem: LLMs have limited context windows and only effectively use 20-30% of stated capacity.
Our Solution - 3 Context Engineering Points:
Point #1: Limit News Articles
```python
# src/tools/news_retrieval.py:42
max_articles = 20  # Hard limit
# Why: NewsAPI can return 100+, LLM can't process effectively
# Result: 20 articles instead of 100+ (80% reduction)
```

Point #2: Top 3 Articles Per Specialist
```python
# src/context/manager.py:40
MAX_ARTICLES_PER_CONTEXT = 3
# How: Rank articles by keyword overlap with query
# Result: Each specialist gets ONLY the 3 most relevant articles
# Theoretical savings: 20 articles (15k tokens) → 3 articles (1.2k tokens) = 92% reduction
```

Point #3: Token Counting & Truncation
```python
# src/context/manager.py:169
def count_tokens(text: str) -> int:
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Max per specialist: 4,000 tokens
# Truncates smartly (preserves complete sentences)
```

Theoretical Impact:
```
WITHOUT context engineering:
  4 specialists × 15,000 tokens = 60,000 tokens per query
  Cost: High, slow, often fails

WITH context engineering:
  4 specialists × 1,200 tokens = 4,800 tokens per query
  Theoretical savings: 92% fewer tokens, 92% lower cost
```
Note: In practice, only one specialist was activated in development tests, so the full 4-specialist savings are theoretical.
Token tracking exists in the rate limiter but still requires systematic measurement.
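The "top 3 by keyword overlap" selection and sentence-preserving truncation described above can be sketched as follows. Assumptions not confirmed by the source: whitespace tokenization, title-only matching, and period-based sentence splitting; the real ContextManager may do all of these differently.

```python
# Sketch of the two context-engineering steps described above:
# (1) rank articles by keyword overlap with the query, keep the top 3;
# (2) truncate text at sentence boundaries to fit a character budget.
# The real src/context/manager.py may implement both differently.

def top_articles(query: str, articles: list[dict], k: int = 3) -> list[dict]:
    """Rank articles by word overlap between query and title; keep top k."""
    query_words = set(query.lower().split())

    def overlap(article: dict) -> int:
        words = set(article.get("title", "").lower().split())
        return len(words & query_words)

    return sorted(articles, key=overlap, reverse=True)[:k]


def truncate_sentences(text: str, max_chars: int) -> str:
    """Keep whole sentences until the character budget is exceeded."""
    kept, total = [], 0
    for sentence in text.split(". "):
        if total + len(sentence) > max_chars:
            break
        kept.append(sentence)
        total += len(sentence)
    return ". ".join(kept)
```

The same shape works with a token counter (e.g. tiktoken) in place of `len()` when a per-specialist token budget is enforced.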
Purpose: Learn from past analyses to improve future recommendations.
Implementation:
Short-Term Memory:
- What: Recent 10 queries and analyses
- Why: Quick access to recent context for follow-up questions
- How: In-memory dictionary with LRU eviction
- File: src/memory/memory_system.py:25-60
Long-Term Memory:
- What: Persistent storage of ALL analyses
- Why: Historical patterns, recurring risks, trend analysis
- How: JSON files indexed by query keywords
- Storage: data/memory_storage/long_term_memory.json
Example Memory Entry:
```json
{
  "query": "chip shortage",
  "timestamp": "2026-02-04T20:58:42",
  "overall_risk_level": "high",
  "key_findings": ["Shipping delays anticipated", "Port congestion expected"],
  "recommendations": ["Diversify logistics partnerships", "Develop contingency plans"],
  "metadata": {
    "specialists_used": ["logistics"],
    "articles_analyzed": 2,
    "response_time_seconds": 15
  }
}
```

View memory:

```bash
python main.py --memory-stats
```

All prompts stored in: config/prompts.py
Prompt Structure (4 parts):
```python
SPECIALIST_PROMPT = """
──────────────────────────────────────────
 PART 1: ROLE DEFINITION
──────────────────────────────────────────
You are a Logistics Supply Chain Specialist.

──────────────────────────────────────────
 PART 2: TASK + CONTEXT
──────────────────────────────────────────
Analyze the following supply chain risks focusing on LOGISTICS ONLY.

Query: {query}                    ← Filled at runtime
Relevant Articles: {articles}     ← Top 3 from ContextManager
Other Insights: {other_insights}  ← Other specialist findings

Focus Areas:
- Transportation disruptions
- Shipping delays
- Port congestion

──────────────────────────────────────────
 PART 3: FORMAT CONSTRAINT (JSON)
──────────────────────────────────────────
Output MUST be valid JSON in this EXACT format:
{{
  "risk_level": "high|medium|low",
  "findings": ["specific finding 1", "specific finding 2"],
  "recommendations": ["actionable rec 1", "actionable rec 2"],
  "confidence_score": 0.0-1.0
}}

──────────────────────────────────────────
 PART 4: GUARDRAILS
──────────────────────────────────────────
CRITICAL:
- Findings MUST include specific details (companies, locations, dates)
- Recommendations MUST be actionable (not generic advice)

JSON Output:"""
```

Why This Format Works:
- Role definition: LLM knows who to act as
- Context: Gets exactly what it needs (top 3 articles)
- JSON constraint: Forces structured output (not free text)
- Guardrails: Quality requirements (specificity, actionability)
Observed Result: 95-100% format compliance in development testing
| # | Purpose | Prompt Used | File Location | When It Runs |
|---|---|---|---|---|
| 1 | Logistics analysis | LOGISTICS_SPECIALIST_PROMPT | config/prompts.py:28 | If logistics risk detected |
| 2 | Manufacturing analysis | MANUFACTURING_SPECIALIST_PROMPT | config/prompts.py:70 | If manufacturing risk detected |
| 3 | Compliance analysis | COMPLIANCE_SPECIALIST_PROMPT | config/prompts.py:113 | If compliance risk detected |
| 4 | Cybersecurity analysis | CYBERSECURITY_SPECIALIST_PROMPT | config/prompts.py:156 | If cyber risk detected |
| 5 | Synthesis | SYNTHESIS_PROMPT | config/prompts.py:199 | Always (combines reports) |
| 6 | Chain-of-Thought (optional) | COT_PROMPT | config/prompts.py:278 | If enabled |
| 7 | Reflection (optional) | REFLECTION_PROMPT | config/prompts.py:244 | Enabled by default |
Typical query in development: 2-3 LLM calls (1 specialist + synthesis + reflection).
Maximum possible: 9 LLM calls (all specialists + synthesis + CoT + reflection).
What is State?
- Shared dictionary that all nodes read from and write to
- Think of it as a "whiteboard" in a conference room
- Each node (expert) reads the whiteboard, does work, writes back findings
State Structure:
```python
AgentState = {
    # User input
    'query': "chip shortage impact",

    # News data (from News Retrieval node)
    'news_articles': [article1, article2, ...],

    # Risk categorization (from Risk Categorization node)
    'categorized_risks': {'supply_issues': [...], 'facility_incident': [...]},
    'risk_summary': {'by_severity': {'high': 3, 'medium': 2, 'low': 0}},

    # Routing decisions (from Supervisor node)
    'specialists_to_invoke': ['logistics', 'manufacturing', 'compliance', 'cybersecurity'],

    # Specialist outputs (from Specialist nodes)
    'specialist_reports': {
        'logistics': {'risk_level': 'high', 'findings': [...], 'recommendations': [...]},
        'manufacturing': {...},
        'compliance': {...},
        'cybersecurity': {...}
    },

    # Advanced patterns (optional)
    'cot_reasoning': {...},        # Chain-of-Thought steps
    'reflection_critique': {...},  # Reflection improvements

    # Final output (from Synthesis node)
    'final_analysis': {
        'overall_risk_level': 'high',
        'executive_summary': '...',
        'critical_findings': [...],
        'prioritized_recommendations': [...]
    },

    # Metadata
    'metadata': {
        'articles_count': 2,
        'routing_decision': 'minimal',
        'specialists_used': ['logistics'],
        'response_time_seconds': 15
    },
    'errors': []
}
```

State File: src/graph/state.py
1. Basic Analysis

```bash
python main.py --query "chip shortage"
```

2. Test Complex Query

```bash
python main.py --query "ransomware attack factory shutdown port congestion GDPR violation"
```

3. Human-in-the-Loop (Debug Mode)

```bash
python main.py --query "semiconductor shortage" --dev
```

Pauses before each specialist runs, shows current state, option to skip.

4. Save Full Results

```bash
python main.py --query "factory fire Taiwan" --output results.json
```

5. View Metrics Dashboard

```bash
python main.py --metrics
```

6. Debug Logging (See Everything)

```bash
python main.py --query "port congestion" --log-level DEBUG
```

7. Disable Optional Patterns (Save Tokens)

```bash
python main.py --query "chip shortage" --disable-reflection --disable-cot
```

8. RL-Based Pattern Selection (Experimental)

```bash
python main.py --query "chip shortage" --rl-patterns
```

Uses Q-learning to automatically decide when to enable Reflection/CoT patterns based on query complexity.
- Python 3.10+
- pip package manager
- API keys (Groq, NewsAPI, OpenRouter)
```bash
# 1. Clone repository
git clone <repository-url>
cd Supply_Chain_POC

# 2. Create virtual environment
python -m venv .venv

# 3. Activate virtual environment
# Windows:
.venv\Scripts\activate
# Linux/Mac:
source .venv/bin/activate

# 4. Install dependencies
pip install -r requirements.txt

# 5. Configure API keys
cp .env.example .env
# Edit .env and add your API keys
```

Create a .env file in the project root:

```bash
# Required API Keys
NEWSAPI_KEY=your_newsapi_key_here
GROQ_API_KEY=your_groq_api_key_here
OPENROUTER_API_KEY=your_openrouter_api_key_here
```

Get API keys:
- NewsAPI: https://newsapi.org (free tier: 100 requests/day)
- Groq: https://console.groq.com (free tier available)
- OpenRouter: https://openrouter.ai (optional fallback)
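Before a full run, it can help to confirm the required keys are actually present. A small sanity-check sketch; it assumes the project loads `.env` via python-dotenv, which is common but not confirmed by this README:

```python
# Quick sanity check that required API keys are present in the environment.
# Assumption: the project reads keys from os.environ after loading .env
# (e.g. via python-dotenv). Adjust if configuration is loaded differently.
import os

REQUIRED = ["NEWSAPI_KEY", "GROQ_API_KEY"]


def missing_keys(env: dict) -> list[str]:
    """Return the names of required keys that are absent or empty."""
    return [k for k in REQUIRED if not env.get(k)]


if __name__ == "__main__":
    # from dotenv import load_dotenv; load_dotenv()  # if using python-dotenv
    print(missing_keys(dict(os.environ)))
```

An empty list means both required keys are set; anything else names what is missing.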
```bash
# Test basic command
python main.py --query "test"
# Expected: Analysis completes in ~20-30 seconds
```

| Document | Purpose | When to Read |
|---|---|---|
| START_HERE.md | Quick start guide with all features | Start here - First time setup and testing |
| ARCHITECTURE_FLOW.md | Complete architecture flow with diagrams | Understanding how it works end-to-end |
| SIMPLE_PIPELINE.md | One-page quick reference | Quick lookup of files, prompts, and flow |
| QUICK_COMMANDS.md | Copy-paste command reference | Running tests and viewing results |
| SYSTEM_VERIFICATION.md | Detailed system documentation | Deep dive into every architectural decision |
Configuration:
- config/prompts.py - All 9 LLM prompts
- config/risk_categories.py - Risk keyword definitions
- config/models.yaml - LLM provider settings
- .env - API keys (create from .env.example)
Core Implementation:
- src/graph/workflow.py - LangGraph workflow orchestration
- src/graph/nodes.py - All node functions
- src/agents/specialist_agents.py - 4 specialist agents
- src/context/manager.py - Context engineering
- src/memory/memory_system.py - Memory system
- main.py - CLI interface
```
Supply_Chain_POC/
├── config/
│   ├── config.yaml                # Main configuration
│   ├── models.yaml                # LLM provider settings
│   ├── prompts.py                 # All 9 LLM prompts (CENTRALIZED)
│   └── risk_categories.py         # Risk keyword definitions
├── src/
│   ├── agents/
│   │   ├── base_agent.py          # Abstract base class
│   │   ├── specialist_agents.py   # 4 domain specialists
│   │   ├── reflection_agent.py    # Reflection pattern
│   │   └── cot_agent.py           # Chain-of-thought
│   ├── context/
│   │   └── manager.py             # Context engineering (token management)
│   ├── evaluation/
│   │   ├── input_validator.py     # Query validation
│   │   ├── output_validator.py    # Format validation
│   │   └── llm_judge.py           # Quality assessment
│   ├── graph/
│   │   ├── workflow.py            # LangGraph workflow orchestration
│   │   ├── nodes.py               # All node functions (7 nodes)
│   │   ├── routing.py             # Supervisor decision logic
│   │   └── state.py               # State management (shared dictionary)
│   ├── memory/
│   │   └── memory_system.py       # Short-term + long-term memory
│   ├── rl/
│   │   ├── __init__.py            # RL module exports
│   │   └── pattern_selector.py    # Q-learning pattern selection
│   ├── metrics/
│   │   └── tracker.py             # Business & technical metrics
│   ├── tools/
│   │   ├── news_retrieval.py      # NewsAPI integration
│   │   └── risk_categorizer.py    # Rule-based categorization
│   ├── utils/
│   │   └── logging_config.py      # Logging setup
│   └── llm_engine.py              # Modular LLM abstraction (Groq/OpenRouter)
├── data/
│   ├── memory_storage/
│   │   ├── long_term_memory.json  # Persistent analysis storage
│   │   └── short_term_memory.json # Recent queries cache
│   ├── metrics/
│   │   └── metrics_*.json         # Historical metrics
│   └── rl/
│       └── q_table.json           # Q-learning state-action values
├── tests/
│   ├── test_llm_engine.py
│   ├── test_evaluation.py
│   ├── test_agents/
│   └── test_tools/
├── docs/                          # Comprehensive documentation
├── .env.example                   # API key template
├── main.py                        # CLI interface (entry point)
├── requirements.txt               # Python dependencies
└── README.md                      # This file
```
Every LLM call enforces strict JSON format:

```python
prompt = f"""
Task: {task}

Output MUST be valid JSON in this exact format:
{{
  "risk_level": "high|medium|low",
  "findings": ["finding1", "finding2"],
  "recommendations": ["rec1", "rec2"]
}}
"""

# Retry logic with guardrails
for attempt in range(3):
    output = llm.invoke(prompt)
    if validate_format(output):
        return output
    prompt = make_stricter(prompt, attempt)
return safe_default_output()
```

Observed Result: 95-100% validation compliance in development testing
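A minimal version of the `validate_format` check referenced in the retry loop might look like this. This is a hypothetical sketch; the project's actual `src/evaluation/output_validator.py` may enforce more (e.g. confidence-score ranges):

```python
# Hypothetical sketch of the format validation behind the retry loop.
# The real output validator may check additional fields and constraints.
import json

REQUIRED_KEYS = {"risk_level", "findings", "recommendations"}


def validate_format(output: str) -> bool:
    """True iff the LLM output is a JSON object with the required schema."""
    try:
        data = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False  # not parseable JSON at all
    if not isinstance(data, dict) or not REQUIRED_KEYS <= data.keys():
        return False  # wrong shape or missing keys
    return (data["risk_level"] in {"high", "medium", "low"}
            and isinstance(data["findings"], list)
            and isinstance(data["recommendations"], list))
```

Returning a boolean keeps the retry loop simple: any failure path just triggers a stricter re-prompt.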
Each component is independently testable:
```python
# Agents receive the LLM engine (they don't create it)
specialist = LogisticsSpecialist(llm_engine=engine)

# Tools are pure functions
categories = risk_categorizer.categorize(text)

# Validators are stateless
result = input_validator.validate(query)
```

Benefits:
- Easy to test (mock LLM engine)
- Easy to extend (add new specialist)
- Easy to swap (change LLM provider)
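Because the engine is injected, unit tests can swap in a stub and never touch a real API. The classes below are illustrative stand-ins, not the project's actual `LogisticsSpecialist` or engine interface:

```python
# Illustrative unit test of the dependency-injection pattern above, using a
# stub engine. The agent and engine here are stand-ins; the real specialist
# classes and llm_engine interface may differ.
class StubEngine:
    """Fake LLM engine returning a canned, schema-shaped report."""

    def invoke(self, prompt: str) -> dict:
        return {"risk_level": "low", "findings": [], "recommendations": []}


class Specialist:
    def __init__(self, llm_engine):
        self.llm_engine = llm_engine  # injected dependency

    def analyze(self, query: str) -> dict:
        return self.llm_engine.invoke(f"Analyze: {query}")


def test_specialist_with_stub_engine():
    report = Specialist(llm_engine=StubEngine()).analyze("chip shortage")
    assert report["risk_level"] == "low"
```

The same pattern lets tests assert on prompts, simulate malformed outputs, or measure retry behavior without network calls.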
- Business logic β Agents (src/agents/)
- LLM interaction β Engine (src/llm_engine.py)
- Workflow orchestration β LangGraph (src/graph/)
- Context engineering β ContextManager (src/context/manager.py)
- Metrics tracking β Separate from execution (src/metrics/)
```bash
python main.py --query "chip shortage" --dev
```

What you see:

```
About to execute: logistics_specialist
Current state: 20 articles, 0 specialist reports
Proceed? (y/n/skip):
```

Your options:
- `y` = Run this specialist
- `skip` = Skip this specialist (save tokens)
- `n` = Abort the entire workflow
```bash
python main.py --query "chip shortage" --log-level DEBUG
```

Shows:
- Every LLM call with token counts
- Routing decisions and why
- Validation pass/fail for each output
- Cache hits/misses
- Context engineering details
- Rate limit tracking
```bash
python main.py --metrics
```

Output:

```
BUSINESS METRICS:
  Total analyses: 5
  Average response time: 18s
  Risk distribution:
    HIGH: 3 (60%)
    MEDIUM: 2 (40%)

TECHNICAL METRICS:
  Total analyses: 5
  Average format compliance: 1.0 (100%)
  Cache hit rate: Tracked
```
Edit config/models.yaml:
```yaml
models:
  specialists:
    provider: groq              # Options: 'groq' or 'openrouter'
    model: llama-3.3-70b-versatile
    temperature: 0.7
    max_tokens: 4000
```

Edit config/risk_categories.py:
```python
RISK_CATEGORIES = {
    'facility_incident': {
        'keywords': ['fire', 'explosion', 'shutdown', 'accident'],
        'severity': 'high'
    },
    # Add new categories here
    'your_custom_category': {
        'keywords': ['keyword1', 'keyword2'],
        'severity': 'medium'
    }
}
```

```bash
# Disable reflection (save tokens)
python main.py --query "chip shortage" --disable-reflection

# Disable chain-of-thought (save tokens)
python main.py --query "chip shortage" --disable-cot

# Disable both (minimal tokens)
python main.py --query "chip shortage" --disable-reflection --disable-cot
```

What it does: Uses Q-learning to automatically decide when to enable expensive AI patterns (Reflection and Chain-of-Thought) based on query complexity and past performance.
How to use:
```bash
python main.py --query "chip shortage" --rl-patterns
```

How it works:
- Extracts state features from query (query length, risk severity, article count)
- Uses epsilon-greedy strategy: 20% exploration (random), 80% exploitation (best known)
- Executes selected patterns
- Calculates reward: 60% analysis quality + 40% token efficiency
- Updates Q-table for future learning
Q-table storage:
- Stored at: data/rl/q_table.json
- Persists across runs for continuous learning
Limitations:
- Requires 50-200 queries to learn effective policies
- First 10-20 queries will be mostly random (exploration phase)
- State discretization loses some precision
- Performance depends on query distribution
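The epsilon-greedy selection and Q-update described above reduce to a few lines. A sketch under stated assumptions — tabular Q-values keyed by `(state, action)`, a learning rate of 0.1, and single-step episodes with no discounted future term; the real `src/rl/pattern_selector.py` may differ:

```python
# Sketch of the epsilon-greedy Q-learning loop described above.
# Assumptions: tabular Q-table keyed by (state, action), learning rate 0.1,
# single-step episodes (no discounted future term).
import random

ACTIONS = ["none", "reflection", "cot", "reflection+cot"]


def select_action(q_table: dict, state: str, epsilon: float = 0.2) -> str:
    """Epsilon-greedy: explore randomly, otherwise exploit best known action."""
    if random.random() < epsilon:                       # 20% exploration
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))


def update_q(q_table: dict, state: str, action: str,
             reward: float, alpha: float = 0.1) -> None:
    """Single-step Q-update: move the estimate toward the observed reward."""
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward - old)


def reward_for(quality: float, token_efficiency: float) -> float:
    """Reward = 60% analysis quality + 40% token efficiency, both in [0, 1]."""
    return 0.6 * quality + 0.4 * token_efficiency
```

Persisting `q_table` to JSON between runs (as `data/rl/q_table.json` does) is what allows learning to continue across sessions.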
Deep Dive: See docs/rl_pattern_selection.md for algorithm details.
Understanding the WHY, limitations, and trade-offs of each architectural decision:
Doc: docs/multi_agent_architecture.md
Key Questions Answered:
- Why 4 specialized agents instead of 1 general LLM?
- Why is the supervisor rule-based (not LLM)?
- Why don't specialists communicate with each other?
- What's hard-coded and what's learned?
Key Questions Answered:
- Why use prompts instead of fine-tuning?
- When does Chain-of-Thought help?
- Why does Reflection double token cost?
- When do these patterns fail?
Doc: docs/rl_pattern_selection.md
Key Questions Answered:
- Why Q-learning instead of prompts?
- What's the cold-start problem?
- Why discretize continuous features?
- How is reward calculated?
Doc: docs/guardrails_validation.md
Key Questions Answered:
- What's validated and what's NOT?
- Why doesn't LLM judge catch all errors?
- Why retry instead of fail immediately?
- What about fact-checking?
What's NOT Validated:
- Factual accuracy (no knowledge base)
- Logical consistency
- Completeness (missing risks)
- Hallucinated citations
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src tests/

# Run specific module
pytest tests/test_llm_engine.py -v

# Run specific test
pytest tests/test_evaluation.py::TestInputValidator -v
```

Test coverage:
- LLM Engine: Provider initialization, JSON parsing, retry logic
- Agents: Specialist outputs, reflection, chain-of-thought
- Tools: Risk categorization accuracy
- Evaluation: Input/output validation
API Key Errors:
ValueError: GROQ_API_KEY not found
→ Create .env file with GROQ_API_KEY=your_key_here

Import Errors:
ModuleNotFoundError: No module named 'src'
→ Ensure virtual environment is activated: .venv\Scripts\activate

LLM Output Errors:
Failed to parse JSON after 3 attempts
→ Check logs in logs/supply_chain_risk.log (guardrails retry automatically)

NewsAPI Errors:
NewsAPIException: API key invalid
→ Verify key at https://newsapi.org/account (free tier: 100 requests/day)

Rate Limit Errors:
429 Too Many Requests
→ Wait 1 hour or use --disable-reflection --disable-cot to reduce calls
Token Budget (typical query based on development testing):
| Component | Estimated Input Tokens | Estimated Output Tokens | Total |
|---|---|---|---|
| Specialist (1) | 1,200 | 400 | 1,600 |
| Synthesis | 2,500 | 600 | 3,100 |
| Reflection | +500 | +300 | +800 |
| Chain-of-Thought | +800 | +400 | +1,200 |
| Base Total | 3,700 | 1,000 | 4,700 |
With Reflection: +20% tokens
With Chain-of-Thought: +25% tokens
Cost estimate (Groq free tier):
- Free tier: 14,400 requests/day
- Development testing: ~4,700-6,000 tokens/query (1 specialist + synthesis + patterns)
- Can run hundreds of queries per day within free tier limits
Note: Token measurements from rate limiter logs show actual consumption. Systematic measurement across large query sample recommended for production deployment.
- ✅ Risk Categorization Error: Fixed KeyError when 0 articles returned
- ✅ HITL Display Bug: Now shows actual node names instead of `__interrupt__`
- ✅ Chain-of-Thought Error: Fixed article source field handling
- ✅ Output Formatting: Dramatically improved clarity and readability
  - Clear visual risk indicators: `[!!!]` HIGH, `[!!]` MEDIUM, `[!]` LOW
  - Explanatory section headers
  - Text wrapping for better readability
  - Shows which AI patterns were used
  - Displays confidence scores
MIT License - See LICENSE file for details
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-specialist`)
- Add tests for new functionality
- Ensure all tests pass (`pytest`)
- Submit a pull request
For issues or questions, please open a GitHub issue.
1. Problem Solved:
- Automated multi-domain supply chain risk analysis
- Reduced analysis time from hours to seconds
- Structured, actionable recommendations
2. Technical Implementation:
- LangGraph orchestration with hierarchical supervisor pattern
- 4 specialized AI agents with domain-specific prompts
- Context engineering for token optimization
- Validation framework with retry mechanisms
3. Production-Ready Features:
- Modular architecture (swap LLM providers via config)
- Comprehensive error handling and retry logic
- Human-in-the-loop debugging capability
- Metrics tracking (business + technical)
- Memory system (short-term + long-term)
- Rate limiting and caching
4. Advanced AI Patterns Implemented:
- ✅ Tool Use (news retrieval, risk categorization)
- ✅ Multi-Agent (4 specialists with parallel execution capability)
- ✅ Reflection (self-critique and improvement)
- ✅ Chain-of-Thought (6-step structured reasoning)
- ✅ Memory (persistent learning and context)
5. Development Status:
- Functional POC with 100% query completion rate (5/5 test queries)
- Validation frameworks built and operational
- Response times: 15-22 seconds average
- Ready for scaled evaluation with larger test datasets
6. Next Steps for Production:
- Create labeled test dataset (100+ queries with ground truth)
- Run systematic evaluation to measure precision, recall, accuracy
- Tune supervisor routing logic for multi-specialist activation
- Measure token efficiency across diverse query distribution
- Implement continuous monitoring and feedback loops
Built with: Python 3.10+ | LangGraph | Groq (Llama 3.3 70B) | NewsAPI