krishna11-dot/Climate-Risk-Assesment-POC

Supply Chain Risk Assessment - Multi-Agent POC

A proof-of-concept multi-agent system for assessing supply chain risks using LangGraph, specialized AI agents, and real-time news analysis.

Python 3.10+ | LangGraph | Groq


PROBLEM - Why Did We Build This?

The Challenge - Supply Chain Risk Analysis

Supply chain managers face critical challenges from real-time supply chain disruptions:

  • Information Overload: 100+ news articles daily about shortages, port delays, factory incidents affecting supply chains
  • Time Pressure: Supply chain events require rapid response within minutes, not hours
  • Expertise Gap: Single analyst can't assess logistics, manufacturing, compliance, AND cybersecurity impacts simultaneously
  • Missed Risks: Manual analysis is time-consuming and incomplete

Real-world scenario:

A semiconductor shortage hits major automotive suppliers at 8 AM. By 2 PM, your procurement team needs to know:

  • Which suppliers and shipments are affected?
  • What risks are cascading through the supply chain?
  • Are there alternative suppliers available?
  • What compliance or cybersecurity issues may arise?

Traditional approach: 1 analyst, 4+ hours, incomplete analysis.
Our POC approach: AI multi-agent system, 15-30 seconds, structured risk assessment.


SOLUTION - What Does It Do?

Core Capabilities

This system automatically analyzes supply chain risks by orchestrating 4 specialized AI agents:

USER QUERY: "semiconductor shortage impact on automotive"
    ↓
SYSTEM PROCESSES:
✅ Fetches real-time news articles from NewsAPI (1-20 articles)
✅ Categorizes risks by type (facility incidents, supply issues, etc.)
✅ Routes to appropriate specialists (Logistics, Manufacturing, Compliance, Cybersecurity)
✅ Each specialist analyzes from its domain expertise
✅ Synthesizes findings into actionable recommendations
✅ Stores insights in memory for future queries
    ↓
OUTPUT: Executive summary + prioritized recommendations in 15-30 seconds

Key Features

1. Multi-Agent Architecture

  • 4 specialized AI agents (domain experts)
  • Parallel execution capability
  • Hierarchical supervisor pattern for intelligent routing

2. Advanced AI Patterns

  • Tool Use: Real-time news retrieval and risk categorization
  • Reflection: Self-critique and improvement
  • Chain-of-Thought: 6-step reasoning transparency
  • Memory: Short-term (recent 10 queries) + long-term persistence
  • Reinforcement Learning (RL): Q-learning pattern selector (experimental)

3. Production-Ready Design

  • Context engineering (manages token limits efficiently)
  • Retry logic with guardrails
  • Human-in-the-loop debugging mode
  • Comprehensive metrics tracking

4. Modular & Swappable

  • Switch LLM providers via config (Groq/OpenRouter)
  • Add new specialists by extending base class
  • Config-driven risk categories

✅ RESULTS - Did It Work?

Real Test Run Example

Command:

python main.py --query "nvidia ai chips supply chain disruption"

Output (15 seconds):

================================================================================
  SUPPLY CHAIN RISK ANALYSIS - RESULTS
================================================================================

[!!!] OVERALL RISK LEVEL: HIGH
--------------------------------------------------------------------------------

WHAT THIS MEANS:
  The NVIDIA AI chip supply chain is at significant risk of disruption due
  to increased demand for Microsoft's Maia 200 AI accelerator, potential
  shipping delays, freight cost implications, and geopolitical tensions.

KEY RISKS IDENTIFIED:
  1. Potential shipping delays and freight cost implications due to
     increased demand for Microsoft's Maia 200 AI accelerator, which
     could lead to a 15% increase in freight costs and a 20% delay in
     delivery timelines
  2. Possible transportation disruptions at ports in Taiwan and China,
     affecting delivery timelines for NVIDIA AI chips
  3. Geopolitical tensions and trade policies pose significant risk to
     the supply chain

RECOMMENDED ACTIONS:
  1. Priority: HIGH | When: immediate
     Action: Diversify logistics and transportation partnerships to
             mitigate potential route availability constraints
     Who: NVIDIA Logistics Team
  2. Priority: MEDIUM | When: short-term
     Action: Implement proactive freight audit and payment processes
     Who: NVIDIA Finance Team

ANALYSIS DETAILS:
  News Articles Analyzed: 2
  Expert Specialists Consulted: logistics
  Reflection Pattern: Used (self-critique applied)
  Chain-of-Thought: Used (6 reasoning steps)
  Analysis Confidence: 85%

System Performance (Development Testing)

✅ Validated Metrics (n=5 test queries)

| Metric | Target | Measured | Status |
|--------|--------|----------|--------|
| System Reliability | 95%+ | 100% (5/5 successful) | ✅ Exceeds target |
| Response Time | <30s | 15-22s avg | ✅ Meets target |
| Recommendations Generated | 100% | 100% (5/5) | ✅ Meets target |
| Chain-of-Thought Reasoning | N/A | 6 steps (consistent) | ✅ Implemented |
| Caching | N/A | Functional (hit/miss tracking) | ✅ Implemented |
| Rate Limiting | N/A | Functional (TPM tracking) | ✅ Implemented |

⚠️ Framework Capabilities (Require Larger-Scale Validation)

| Capability | Status | Next Steps |
|------------|--------|------------|
| Article Relevance | ~34% categorization rate observed | Needs manual labeling & tuning for 90%+ target |
| Risk Categorization | Keyword-based; functional | Needs ground-truth dataset for accuracy validation |
| Multi-Specialist Activation | 1/4 specialists activated in tests | Review supervisor routing logic for proper activation |
| Token Efficiency | Theoretical 92% reduction (60K → 4.8K) | Measure systematically across 100+ queries |
| Format Compliance | 95-100% observed in testing | Validation framework built; extract stats from larger sample |

Note: System successfully completes all queries with high reliability. Metrics collection infrastructure is production-ready. Formal evaluation with labeled test dataset (n=100+) is the next milestone for production deployment.

Real Test Queries Executed

Query 1: "semiconductor shortage impact on automotive industry"

  • Articles Retrieved: 1
  • Articles Categorized: 0
  • Specialists Used: logistics
  • Response Time: ~22 seconds
  • ✅ Analysis completed successfully

Query 2: "nvidia ai chips supply chain disruption"

  • Articles Retrieved: 2
  • Articles Categorized: 1 (medium risk)
  • Specialists Used: logistics
  • Response Time: ~15 seconds
  • ✅ Analysis completed successfully

Query 3: "chip shortage" (with --dev flag)

  • Articles Retrieved: 20
  • Articles Categorized: 7 (medium risk)
  • Specialists Used: logistics (human-in-the-loop interrupted)
  • Human-in-the-Loop: ✅ Successfully paused workflow
  • ✅ System responded correctly to user input

Query 4: "chinese semiconductor industry export restrictions"

  • Articles Retrieved: 2
  • Articles Categorized: 1 (medium risk)
  • Specialists Used: logistics
  • Response Time: ~14 seconds
  • ✅ Analysis completed successfully

Query 5: "chip shortage" (with DEBUG logging)

  • Articles Retrieved: 20
  • Articles Categorized: 7 (medium risk)
  • Specialists Used: logistics
  • Rate Limiting: ✅ Triggered correctly ("TPM limit approaching, waiting 54.4s")
  • Token Tracking: ✅ Visible in logs (11,384 → 9,496 tokens remaining)
  • ✅ Analysis completed successfully

πŸ—οΈ HOW IT WORKS - System Architecture

High-Level Flow

┌─────────────────────────────────────────────────────────────────┐
│  USER QUERY                                                     │
│  "chip shortage impact on automotive industry"                  │
└────────────┬────────────────────────────────────────────────────┘
             │
     ┌───────▼────────┐
     │ (1) NEWS       │  Fetches 1-20 articles from NewsAPI
     │ RETRIEVAL      │  Context Engineering: Limit to 20 max
     │ [Tool, no LLM] │  File: src/tools/news_retrieval.py
     └───────┬────────┘
             │
     ┌───────▼────────┐
     │ (2) RISK       │  Categorizes by keywords (supply_issues, etc.)
     │ CATEGORIZATION │  Context Engineering: Keyword matching
     │ [Tool, no LLM] │  File: src/tools/risk_categorizer.py
     └───────┬────────┘  Config: config/risk_categories.py
             │
     ┌───────▼────────┐
     │ (3) SUPERVISOR │  Decides: 1, 2, or 4 specialists needed?
     │ ROUTING        │  Rule: 3+ HIGH = all 4, 1-2 HIGH = subset
     │ [Logic, no LLM]│  File: src/graph/nodes.py:supervisor_node()
     └───────┬────────┘
             │
     ┌───────▼────────────────────────────────────────────┐
     │ (4) SPECIALIST AGENTS (Parallel Execution)         │
     │                                                    │
     │  ┌──────────┐  ┌──────────┐  ┌──────────┐  ┌─────┐ │
     │  │LOGISTICS │  │MANUFACTUR│  │COMPLIANCE│  │CYBER│ │
     │  │SPECIALIST│  │ING SPEC. │  │SPEC.     │  │SPEC.│ │
     │  │[LLM Call]│  │[LLM Call]│  │[LLM Call]│  │[LLM]│ │
     │  └────┬─────┘  └────┬─────┘  └────┬─────┘  └──┬──┘ │
     │       │            │             │           │     │
     │  Context Engineering: Each gets top 3 articles     │
     │  File: src/context/manager.py                      │
     │  Prompts: config/prompts.py (domain-specific)      │
     └───────┬────────────────────────────────────────────┘
             │
     ┌───────▼────────┐
     │ (5) SYNTHESIS  │  Combines all specialist reports
     │                │  Optional: Chain-of-Thought (6 steps)
     │ [LLM Call]     │  Optional: Reflection (self-critique)
     │                │  File: src/graph/nodes.py:synthesis_node()
     └───────┬────────┘  Prompts: config/prompts.py
             │
     ┌───────▼────────┐
     │ (6) MEMORY     │  Stores analysis for future reference
     │ STORAGE        │  Short-term: Recent 10 queries
     │ [No LLM]       │  Long-term: Persistent JSON storage
     │                │  File: src/memory/memory_system.py
     └───────┬────────┘  Storage: data/memory_storage/
             │
     ┌───────▼────────┐
     │ (7) OUTPUT     │  Formatted terminal output
     │ FORMATTING     │  Optional: Save JSON file
     │                │  File: main.py:print_analysis_summary()
     └────────────────┘

Detailed Architecture

Hierarchical Supervisor Pattern:

                    ┌──────────────┐
                    │  SUPERVISOR  │
                    │   (Decides)  │
                    └───────┬──────┘
                            │
         ┌──────────────────┼──────────────────┐
         │                  │                  │
    HIGH SEVERITY      MEDIUM-HIGH         LOW SEVERITY
    (3+ risks)         (1-2 risks)         (0 risks)
         │                  │                  │
    ┌────▼─────┐       ┌────▼─────┐      ┌────▼─────┐
    │ ALL 4    │       │ SUBSET   │      │ MINIMAL  │
    │Logistics │       │Logistics │      │Logistics │
    │Manufactur│       │Manufactur│      │only      │
    │Compliance│       │          │      │          │
    │Cyber     │       │          │      │          │
    └──────────┘       └──────────┘      └──────────┘

Why This Matters:

  • Token Efficiency: Don't call all 4 specialists for minor issues (saves 75% tokens)
  • Cost Optimization: Intelligent routing reduces unnecessary LLM calls
  • Speed: Parallel execution when multiple specialists are needed
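The routing rule above can be sketched in a few lines of plain Python. This is an illustrative toy, not the project's actual code (which lives in src/graph/nodes.py:supervisor_node()); in particular, the exact subset chosen for medium-high events is an assumption.

```python
# Illustrative rule-based routing; mirrors "3+ HIGH = all 4, 1-2 HIGH = subset".
ALL_SPECIALISTS = ["logistics", "manufacturing", "compliance", "cybersecurity"]

def route_specialists(risk_summary: dict) -> list[str]:
    """Choose specialists from the count of HIGH-severity categorized risks."""
    high = risk_summary.get("by_severity", {}).get("high", 0)
    if high >= 3:
        return ALL_SPECIALISTS                 # severe multi-domain event: consult everyone
    if high >= 1:
        return ["logistics", "manufacturing"]  # medium-high: a subset (assumed here)
    return ["logistics"]                       # low severity: minimal path
```

Because the rule is plain logic rather than an LLM call, the routing decision itself costs zero tokens.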

Context Engineering (Token Management)

Problem: LLMs have limited context windows, and in practice they use long contexts far less effectively than the stated capacity suggests.

Our Solution - 3 Context Engineering Points:

Point #1: Limit News Articles

# src/tools/news_retrieval.py:42
max_articles = 20  # Hard limit

# Why: NewsAPI can return 100+, LLM can't process effectively
# Result: 20 articles instead of 100+ (80% reduction)

Point #2: Top 3 Articles Per Specialist

# src/context/manager.py:40
MAX_ARTICLES_PER_CONTEXT = 3

# How: Rank articles by keyword overlap with query
# Result: Each specialist gets ONLY 3 most relevant articles
# Theoretical Savings: 20 articles (15k tokens) → 3 articles (1.2k tokens) = 92% reduction
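Keyword-overlap ranking can be sketched like this (a hypothetical helper; the real logic lives in src/context/manager.py, and the article fields are assumptions):

```python
# Toy ranking: score each article by word overlap with the query, keep the top k.
def top_articles(query: str, articles: list[dict], k: int = 3) -> list[dict]:
    q_words = set(query.lower().split())

    def overlap(article: dict) -> int:
        text = (article.get("title", "") + " " + article.get("description", "")).lower()
        return len(q_words & set(text.split()))

    return sorted(articles, key=overlap, reverse=True)[:k]
```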

Point #3: Token Counting & Truncation

# src/context/manager.py:169
import tiktoken  # pip install tiktoken

def count_tokens(text: str) -> int:
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Max per specialist: 4000 tokens
# Truncates smartly (preserves complete sentences)
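The "truncates smartly" step might look like the following simplified, character-based sketch. The project's version counts tiktoken tokens rather than characters; this variant is an assumption for illustration:

```python
# Cut text to a budget, backing up to the last complete sentence when possible.
def truncate_at_sentence(text: str, max_chars: int) -> str:
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    last_end = max(cut.rfind(". "), cut.rfind("! "), cut.rfind("? "))
    return cut[:last_end + 1] if last_end > 0 else cut
```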

Theoretical Impact:

WITHOUT context engineering:
  4 specialists × 15,000 tokens = 60,000 tokens per query
  Cost: High, slow, often fails

WITH context engineering:
  4 specialists × 1,200 tokens = 4,800 tokens per query
  Theoretical Savings: 92% fewer tokens, 92% lower cost

Note: In practice, observed 1 specialist activation in development tests.
Token tracking exists in rate limiter but requires systematic measurement.

Memory System

Purpose: Learn from past analyses to improve future recommendations.

Implementation:

Short-Term Memory:

  • What: Recent 10 queries and analyses
  • Why: Quick access to recent context for follow-up questions
  • How: In-memory dictionary with LRU eviction
  • File: src/memory/memory_system.py:25-60
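The "recent 10 queries with LRU eviction" behavior can be sketched as follows (a minimal stand-in; the real class in src/memory/memory_system.py will differ in API and fields):

```python
from collections import OrderedDict

# Minimal short-term memory: keeps the N most recent query -> analysis pairs.
class ShortTermMemory:
    def __init__(self, capacity: int = 10):
        self.capacity = capacity
        self._entries = OrderedDict()

    def store(self, query: str, analysis: dict) -> None:
        self._entries[query] = analysis
        self._entries.move_to_end(query)        # mark as most recently used
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)   # evict the least recent entry

    def recall(self, query: str):
        return self._entries.get(query)
```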

Long-Term Memory:

Example Memory Entry:

{
  "query": "chip shortage",
  "timestamp": "2026-02-04T20:58:42",
  "overall_risk_level": "high",
  "key_findings": ["Shipping delays anticipated", "Port congestion expected"],
  "recommendations": ["Diversify logistics partnerships", "Develop contingency plans"],
  "metadata": {
    "specialists_used": ["logistics"],
    "articles_analyzed": 2,
    "response_time_seconds": 15
  }
}

View memory:

python main.py --memory-stats

Prompt Engineering

All prompts stored in: config/prompts.py

Prompt Structure (4 parts):

SPECIALIST_PROMPT = """
┌────────────────────────────────────────┐
│ PART 1: ROLE DEFINITION                │
└────────────────────────────────────────┘
You are a Logistics Supply Chain Specialist.

┌────────────────────────────────────────┐
│ PART 2: TASK + CONTEXT                 │
└────────────────────────────────────────┘
Analyze the following supply chain risks focusing on LOGISTICS ONLY.

Query: {query}                    ← Filled at runtime
Relevant Articles: {articles}     ← Top 3 from ContextManager
Other Insights: {other_insights}  ← Other specialist findings

Focus Areas:
- Transportation disruptions
- Shipping delays
- Port congestion

┌────────────────────────────────────────┐
│ PART 3: FORMAT CONSTRAINT (JSON)       │
└────────────────────────────────────────┘
Output MUST be valid JSON in this EXACT format:
{{
    "risk_level": "high|medium|low",
    "findings": ["specific finding 1", "specific finding 2"],
    "recommendations": ["actionable rec 1", "actionable rec 2"],
    "confidence_score": 0.0-1.0
}}

┌────────────────────────────────────────┐
│ PART 4: GUARDRAILS                     │
└────────────────────────────────────────┘
CRITICAL:
- Findings MUST include specific details (companies, locations, dates)
- Recommendations MUST be actionable (not generic advice)

JSON Output:"""

Why This Format Works:

  • Role definition: LLM knows who to act as
  • Context: Gets exactly what it needs (top 3 articles)
  • JSON constraint: Forces structured output (not free text)
  • Guardrails: Quality requirements (specificity, actionability)

Observed Result: 95-100% format compliance in development testing

LLM Calls in the System

| # | Purpose | Prompt Used | File Location | When It Runs |
|---|---------|-------------|---------------|--------------|
| 1 | Logistics analysis | LOGISTICS_SPECIALIST_PROMPT | config/prompts.py:28 | If logistics risk detected |
| 2 | Manufacturing analysis | MANUFACTURING_SPECIALIST_PROMPT | config/prompts.py:70 | If manufacturing risk detected |
| 3 | Compliance analysis | COMPLIANCE_SPECIALIST_PROMPT | config/prompts.py:113 | If compliance risk detected |
| 4 | Cybersecurity analysis | CYBERSECURITY_SPECIALIST_PROMPT | config/prompts.py:156 | If cyber risk detected |
| 5 | Synthesis | SYNTHESIS_PROMPT | config/prompts.py:199 | Always (combines reports) |
| 6 | Chain-of-Thought (optional) | COT_PROMPT | config/prompts.py:278 | If enabled |
| 7 | Reflection (optional) | REFLECTION_PROMPT | config/prompts.py:244 | Enabled by default |

Typical query in development: 2-3 LLM calls (1 specialist + synthesis + reflection).
Maximum possible: 7 LLM calls (4 specialists + synthesis + CoT + reflection).

State Management

What is State?

  • Shared dictionary that all nodes read from and write to
  • Think of it as a "whiteboard" in a conference room
  • Each node (expert) reads the whiteboard, does work, writes back findings

State Structure:

AgentState = {
    # User input
    'query': "chip shortage impact",

    # News data (from News Retrieval node)
    'news_articles': [article1, article2, ...],

    # Risk categorization (from Risk Categorization node)
    'categorized_risks': {'supply_issues': [...], 'facility_incident': [...]},
    'risk_summary': {'by_severity': {'high': 3, 'medium': 2, 'low': 0}},

    # Routing decisions (from Supervisor node)
    'specialists_to_invoke': ['logistics', 'manufacturing', 'compliance', 'cybersecurity'],

    # Specialist outputs (from Specialist nodes)
    'specialist_reports': {
        'logistics': {'risk_level': 'high', 'findings': [...], 'recommendations': [...]},
        'manufacturing': {...},
        'compliance': {...},
        'cybersecurity': {...}
    },

    # Advanced patterns (optional)
    'cot_reasoning': {...},          # Chain-of-Thought steps
    'reflection_critique': {...},    # Reflection improvements

    # Final output (from Synthesis node)
    'final_analysis': {
        'overall_risk_level': 'high',
        'executive_summary': '...',
        'critical_findings': [...],
        'prioritized_recommendations': [...]
    },

    # Metadata
    'metadata': {
        'articles_count': 2,
        'routing_decision': 'minimal',
        'specialists_used': ['logistics'],
        'response_time_seconds': 15
    },
    'errors': []
}

State File: src/graph/state.py


⚡ Quick Start

Simple Commands

1. Basic Analysis

python main.py --query "chip shortage"

2. Test Complex Query

python main.py --query "ransomware attack factory shutdown port congestion GDPR violation"

3. Human-in-the-Loop (Debug Mode)

python main.py --query "semiconductor shortage" --dev

Pauses before each specialist runs, shows current state, option to skip.

4. Save Full Results

python main.py --query "factory fire Taiwan" --output results.json

5. View Metrics Dashboard

python main.py --metrics

6. Debug Logging (See Everything)

python main.py --query "port congestion" --log-level DEBUG

7. Disable Optional Patterns (Save Tokens)

python main.py --query "chip shortage" --disable-reflection --disable-cot

8. RL-Based Pattern Selection (Experimental)

python main.py --query "chip shortage" --rl-patterns

Uses Q-learning to automatically decide when to enable Reflection/CoT patterns based on query complexity.


🔧 Installation

Prerequisites

  • Python 3.10+
  • pip package manager
  • API keys (Groq, NewsAPI, OpenRouter)

Setup

# 1. Clone repository
git clone <repository-url>
cd Supply_Chain_POC

# 2. Create virtual environment
python -m venv .venv

# 3. Activate virtual environment
# Windows:
.venv\Scripts\activate
# Linux/Mac:
source .venv/bin/activate

# 4. Install dependencies
pip install -r requirements.txt

# 5. Configure API keys
cp .env.example .env
# Edit .env and add your API keys

API Keys Configuration

Create .env file in project root:

# Required API Keys
NEWSAPI_KEY=your_newsapi_key_here
GROQ_API_KEY=your_groq_api_key_here
OPENROUTER_API_KEY=your_openrouter_api_key_here

Get API keys:

  • NewsAPI: https://newsapi.org
  • Groq: https://console.groq.com
  • OpenRouter: https://openrouter.ai

Verify Installation

# Test basic command
python main.py --query "test"

# Expected: Analysis completes in ~20-30 seconds

📚 Documentation

Complete Guides

| Document | Purpose | When to Read |
|----------|---------|--------------|
| START_HERE.md | Quick start guide with all features | Start here - first-time setup and testing |
| ARCHITECTURE_FLOW.md | Complete architecture flow with diagrams | Understanding how it works end-to-end |
| SIMPLE_PIPELINE.md | One-page quick reference | Quick lookup of files, prompts, and flow |
| QUICK_COMMANDS.md | Copy-paste command reference | Running tests and viewing results |
| SYSTEM_VERIFICATION.md | Detailed system documentation | Deep dive into every architectural decision |


πŸ“ Project Structure

Supply_Chain_POC/
├── config/
│   ├── config.yaml              # Main configuration
│   ├── models.yaml              # LLM provider settings
│   ├── prompts.py               # All 9 LLM prompts (CENTRALIZED)
│   └── risk_categories.py       # Risk keyword definitions
├── src/
│   ├── agents/
│   │   ├── base_agent.py              # Abstract base class
│   │   ├── specialist_agents.py       # 4 domain specialists
│   │   ├── reflection_agent.py        # Reflection pattern
│   │   └── cot_agent.py               # Chain-of-thought
│   ├── context/
│   │   └── manager.py                 # Context engineering (token management)
│   ├── evaluation/
│   │   ├── input_validator.py         # Query validation
│   │   ├── output_validator.py        # Format validation
│   │   └── llm_judge.py               # Quality assessment
│   ├── graph/
│   │   ├── workflow.py                # LangGraph workflow orchestration
│   │   ├── nodes.py                   # All node functions (7 nodes)
│   │   ├── routing.py                 # Supervisor decision logic
│   │   └── state.py                   # State management (shared dictionary)
│   ├── memory/
│   │   └── memory_system.py           # Short-term + long-term memory
│   ├── rl/
│   │   ├── __init__.py                # RL module exports
│   │   └── pattern_selector.py        # Q-learning pattern selection
│   ├── metrics/
│   │   └── tracker.py                 # Business & technical metrics
│   ├── tools/
│   │   ├── news_retrieval.py          # NewsAPI integration
│   │   └── risk_categorizer.py        # Rule-based categorization
│   ├── utils/
│   │   └── logging_config.py          # Logging setup
│   └── llm_engine.py                  # Modular LLM abstraction (Groq/OpenRouter)
├── data/
│   ├── memory_storage/
│   │   ├── long_term_memory.json      # Persistent analysis storage
│   │   └── short_term_memory.json     # Recent queries cache
│   ├── metrics/
│   │   └── metrics_*.json             # Historical metrics
│   └── rl/
│       └── q_table.json               # Q-learning state-action values
├── tests/
│   ├── test_llm_engine.py
│   ├── test_evaluation.py
│   ├── test_agents/
│   └── test_tools/
├── docs/                              # Comprehensive documentation
├── .env.example                       # API key template
├── main.py                            # CLI interface (entry point)
├── requirements.txt                   # Python dependencies
└── README.md                          # This file

🎯 Design Principles

1. LLM Output Constraints

Every LLM call enforces strict JSON format:

prompt = f"""
Task: {task}

Output MUST be valid JSON in this exact format:
{{
    "risk_level": "high|medium|low",
    "findings": ["finding1", "finding2"],
    "recommendations": ["rec1", "rec2"]
}}
"""

# Retry logic with guardrails
for attempt in range(3):
    output = llm.invoke(prompt)
    if validate_format(output):
        return output
    prompt = make_stricter(prompt, attempt)
return safe_default_output()

Observed Result: 95-100% validation compliance in development testing
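A plausible validate_format() for the retry loop above could be as simple as this (illustrative; the project's actual checks live in src/evaluation/output_validator.py):

```python
import json

REQUIRED_KEYS = {"risk_level", "findings", "recommendations"}

# Accept only parseable JSON objects with the required keys and a sane risk level.
def validate_format(output: str) -> bool:
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict) or not REQUIRED_KEYS.issubset(data):
        return False
    return data["risk_level"] in {"high", "medium", "low"}
```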

2. Modular Architecture

Each component is independently testable:

# Agents receive LLM engine (don't create it)
specialist = LogisticsSpecialist(llm_engine=engine)

# Tools are pure functions
categories = risk_categorizer.categorize(text)

# Validators are stateless
result = input_validator.validate(query)

Benefits:

  • Easy to test (mock LLM engine)
  • Easy to extend (add new specialist)
  • Easy to swap (change LLM provider)
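For example, adding a fifth specialist is a matter of subclassing. The class and method names below are assumptions based on the structure described above (src/agents/base_agent.py), not the project's exact API:

```python
# Hypothetical base class and a new domain specialist added by subclassing.
class BaseSpecialist:
    def __init__(self, llm_engine):
        self.llm_engine = llm_engine   # injected, not created (testable with a mock)

    def build_prompt(self, query: str, articles: list) -> str:
        raise NotImplementedError

class GeopoliticsSpecialist(BaseSpecialist):
    """New domain expert: trade policy, sanctions, and export-control risk."""

    def build_prompt(self, query: str, articles: list) -> str:
        return (f"You are a Geopolitics Supply Chain Specialist. "
                f"Analyze: {query} using {len(articles)} articles.")
```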

3. Separation of Concerns


πŸ” Debugging & Observability

Debug Mode (Human-in-the-Loop)

python main.py --query "chip shortage" --dev

What you see:

About to execute: logistics_specialist
Current state: 20 articles, 0 specialist reports
Proceed? (y/n/skip):

Your options:

  • y = Run this specialist
  • skip = Skip this specialist (save tokens)
  • n = Abort entire workflow

Debug Logging

python main.py --query "chip shortage" --log-level DEBUG

Shows:

  • Every LLM call with token counts
  • Routing decisions and why
  • Validation pass/fail for each output
  • Cache hits/misses
  • Context engineering details
  • Rate limit tracking

Metrics Dashboard

python main.py --metrics

Output:

BUSINESS METRICS:
  Total analyses: 5
  Average response time: 18s
  Risk distribution:
    HIGH: 3 (60%)
    MEDIUM: 2 (40%)

TECHNICAL METRICS:
  Total analyses: 5
  Average format compliance: 1.0 (100%)
  Cache hit rate: Tracked

πŸ› οΈ Configuration

Swap LLM Providers

Edit config/models.yaml:

models:
  specialists:
    provider: groq  # Options: 'groq' or 'openrouter'
    model: llama-3.3-70b-versatile
    temperature: 0.7
    max_tokens: 4000

Adjust Risk Categories

Edit config/risk_categories.py:

RISK_CATEGORIES = {
    'facility_incident': {
        'keywords': ['fire', 'explosion', 'shutdown', 'accident'],
        'severity': 'high'
    },
    # Add new categories here
    'your_custom_category': {
        'keywords': ['keyword1', 'keyword2'],
        'severity': 'medium'
    }
}
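Given a config like the one above, keyword categorization reduces to substring matching. The sketch below bundles a trimmed copy of the config so it is self-contained; the real implementation is src/tools/risk_categorizer.py:

```python
# Trimmed copy of the config for illustration.
RISK_CATEGORIES = {
    "facility_incident": {"keywords": ["fire", "explosion", "shutdown"], "severity": "high"},
    "supply_issues": {"keywords": ["shortage", "delay", "congestion"], "severity": "medium"},
}

# Return every category whose keywords appear in the article text.
def categorize(text: str) -> list[str]:
    lowered = text.lower()
    return [name for name, cfg in RISK_CATEGORIES.items()
            if any(kw in lowered for kw in cfg["keywords"])]
```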

Enable/Disable Patterns

# Disable reflection (save tokens)
python main.py --query "chip shortage" --disable-reflection

# Disable chain-of-thought (save tokens)
python main.py --query "chip shortage" --disable-cot

# Disable both (minimal tokens)
python main.py --query "chip shortage" --disable-reflection --disable-cot

Reinforcement Learning Pattern Selection (Experimental)

What it does: Uses Q-learning to automatically decide when to enable expensive AI patterns (Reflection and Chain-of-Thought) based on query complexity and past performance.

How to use:

python main.py --query "chip shortage" --rl-patterns

How it works:

  1. Extracts state features from query (query length, risk severity, article count)
  2. Uses epsilon-greedy strategy: 20% exploration (random), 80% exploitation (best known)
  3. Executes selected patterns
  4. Calculates reward: 60% analysis quality + 40% token efficiency
  5. Updates Q-table for future learning
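Steps 1-5 can be sketched as a tabular epsilon-greedy loop (illustrative only; the actual selector in src/rl/pattern_selector.py may use different states, actions, and hyperparameters):

```python
import random

ACTIONS = ["none", "reflection", "cot", "both"]  # which optional patterns to enable

# Step 2: epsilon-greedy -- explore 20% of the time, otherwise exploit the best known action.
def select_action(q_table: dict, state: tuple, epsilon: float = 0.2) -> str:
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    q = q_table.get(state, {})
    return max(ACTIONS, key=lambda a: q.get(a, 0.0))

# Steps 4-5: blend the reward and nudge the Q-value toward it.
def update_q(q_table: dict, state: tuple, action: str,
             quality: float, token_efficiency: float, alpha: float = 0.1) -> None:
    reward = 0.6 * quality + 0.4 * token_efficiency   # step 4's weighting
    q = q_table.setdefault(state, {})
    q[action] = q.get(action, 0.0) + alpha * (reward - q.get(action, 0.0))
```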

Q-table storage: data/rl/q_table.json

Limitations:

  • Requires 50-200 queries to learn effective policies
  • First 10-20 queries will be mostly random (exploration phase)
  • State discretization loses some precision
  • Performance depends on query distribution

📖 Deep Dive: See docs/rl_pattern_selection.md for algorithm details.


📚 Architecture Deep Dives

Understanding the WHY, limitations, and trade-offs of each architectural decision:

🤖 Multi-Agent Architecture

Doc: docs/multi_agent_architecture.md

Key Questions Answered:

  • Why 4 specialized agents instead of 1 general LLM?
  • Why is the supervisor rule-based (not LLM)?
  • Why don't specialists communicate with each other?
  • What's hard-coded and what's learned?

🧠 Agentic Patterns (Chain-of-Thought & Reflection)

Doc: docs/agentic_patterns.md

Key Questions Answered:

  • Why use prompts instead of fine-tuning?
  • When does Chain-of-Thought help?
  • Why does Reflection double token cost?
  • When do these patterns fail?

🎯 Reinforcement Learning Pattern Selection

Doc: docs/rl_pattern_selection.md

Key Questions Answered:

  • Why Q-learning instead of prompts?
  • What's the cold-start problem?
  • Why discretize continuous features?
  • How is reward calculated?

πŸ›‘οΈ Guardrails and Validation

Doc: docs/guardrails_validation.md

Key Questions Answered:

  • What's validated and what's NOT?
  • Why doesn't LLM judge catch all errors?
  • Why retry instead of fail immediately?
  • What about fact-checking?

What's NOT Validated:

  • Factual accuracy (no knowledge base)
  • Logical consistency
  • Completeness (missing risks)
  • Hallucinated citations

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src tests/

# Run specific module
pytest tests/test_llm_engine.py -v

# Run specific test
pytest tests/test_evaluation.py::TestInputValidator -v

Test coverage:

  • LLM Engine: Provider initialization, JSON parsing, retry logic
  • Agents: Specialist outputs, reflection, chain-of-thought
  • Tools: Risk categorization accuracy
  • Evaluation: Input/output validation

❓ Troubleshooting

API Key Errors:

ValueError: GROQ_API_KEY not found

→ Create .env file with GROQ_API_KEY=your_key_here

Import Errors:

ModuleNotFoundError: No module named 'src'

→ Ensure virtual environment is activated: .venv\Scripts\activate

LLM Output Errors:

Failed to parse JSON after 3 attempts

→ Check logs in logs/supply_chain_risk.log (guardrails retry automatically)

NewsAPI Errors:

NewsAPIException: API key invalid

→ Verify key at https://newsapi.org/account (free tier: 100 requests/day)

Rate Limit Errors:

429 Too Many Requests

→ Wait 1 hour or use --disable-reflection --disable-cot to reduce calls


📊 Performance & Cost

Token Budget (typical query based on development testing):

| Component | Estimated Input Tokens | Estimated Output Tokens | Total |
|-----------|------------------------|-------------------------|-------|
| Specialist (1) | 1,200 | 400 | 1,600 |
| Synthesis | 2,500 | 600 | 3,100 |
| Reflection | +500 | +300 | +800 |
| Chain-of-Thought | +800 | +400 | +1,200 |
| Base Total | 3,700 | 1,000 | 4,700 |

With Reflection: +20% tokens.
With Chain-of-Thought: +25% tokens.

Cost estimate (Groq free tier):

  • Free tier: 14,400 requests/day
  • Development testing: ~4,700-6,000 tokens/query (1 specialist + synthesis + patterns)
  • Can run hundreds of queries per day within free tier limits

Note: Token measurements from rate limiter logs show actual consumption. Systematic measurement across large query sample recommended for production deployment.


🆕 Recent Updates (v1.1 - Dec 2025)

Critical Bugs Fixed

  1. ✅ Risk Categorization Error: Fixed KeyError when 0 articles returned
  2. ✅ HITL Display Bug: Now shows actual node names instead of __interrupt__
  3. ✅ Chain-of-Thought Error: Fixed article source field handling
  4. ✅ Output Formatting: Dramatically improved clarity and readability

Output Format Improvements

  • Clear visual risk indicators: [!!!] HIGH, [!!] MEDIUM, [!] LOW
  • Explanatory section headers
  • Text wrapping for better readability
  • Shows which AI patterns were used
  • Displays confidence scores

📄 License

MIT License - See LICENSE file for details


🤝 Contributing

  1. Fork the repository
  2. Create feature branch (git checkout -b feature/new-specialist)
  3. Add tests for new functionality
  4. Ensure all tests pass (pytest)
  5. Submit pull request

💬 Support

For issues or questions, please open a GitHub issue.


🎓 Project Summary

What This System Demonstrates

1. Problem Solved:

  • Automated multi-domain supply chain risk analysis
  • Reduced analysis time from hours to seconds
  • Structured, actionable recommendations

2. Technical Implementation:

  • LangGraph orchestration with hierarchical supervisor pattern
  • 4 specialized AI agents with domain-specific prompts
  • Context engineering for token optimization
  • Validation framework with retry mechanisms

3. Production-Ready Features:

  • Modular architecture (swap LLM providers via config)
  • Comprehensive error handling and retry logic
  • Human-in-the-loop debugging capability
  • Metrics tracking (business + technical)
  • Memory system (short-term + long-term)
  • Rate limiting and caching

4. Advanced AI Patterns Implemented:

  • ✅ Tool Use (news retrieval, risk categorization)
  • ✅ Multi-Agent (4 specialists with parallel execution capability)
  • ✅ Reflection (self-critique and improvement)
  • ✅ Chain-of-Thought (6-step structured reasoning)
  • ✅ Memory (persistent learning and context)

5. Development Status:

  • Functional POC with 100% query completion rate (5/5 test queries)
  • Validation frameworks built and operational
  • Response times: 15-22 seconds average
  • Ready for scaled evaluation with larger test datasets

6. Next Steps for Production:

  • Create labeled test dataset (100+ queries with ground truth)
  • Run systematic evaluation to measure precision, recall, accuracy
  • Tune supervisor routing logic for multi-specialist activation
  • Measure token efficiency across diverse query distribution
  • Implement continuous monitoring and feedback loops

Built with: Python 3.10+ | LangGraph | Groq (Llama 3.3 70B) | NewsAPI

About

This open-source project was developed as part of Vectorcube Inc. It is a multi-agent AI system for climate-related supply chain risk assessment that uses LangGraph, specialized AI agents, and Q-learning pattern selection to analyze real-time news and provide actionable risk insights.
