A proof-of-concept multi-agent system for assessing supply chain risks using LangGraph, specialized AI agents, and real-time news analysis.
- Problem
- Solution
- Results
- How It Works
- Quick Start
- Installation
- Architecture Deep Dives
- Documentation
- Project Structure
Supply chain managers face critical challenges from real-time supply chain disruptions:
- Information Overload: 100+ news articles daily about shortages, port delays, factory incidents affecting supply chains
- Time Pressure: Supply chain events require rapid response within minutes, not hours
- Expertise Gap: Single analyst can't assess logistics, manufacturing, compliance, AND cybersecurity impacts simultaneously
- Missed Risks: Manual analysis is time-consuming and incomplete
Real-world scenario:
A semiconductor shortage hits major automotive suppliers at 8 AM. By 2 PM, your procurement team needs to know:
- Which suppliers and shipments are affected?
- What risks are cascading through the supply chain?
- Are there alternative suppliers available?
- What compliance or cybersecurity issues may arise?
Traditional approach: 1 analyst, 4+ hours, incomplete analysis.
Our POC approach: AI multi-agent system, 15-30 seconds, structured risk assessment.
This system automatically analyzes supply chain risks by orchestrating 4 specialized AI agents:
```
USER QUERY: "semiconductor shortage impact on automotive"
        ↓
SYSTEM PROCESSES:
        ↓
Fetches real-time news articles from NewsAPI (1-20 articles)
        ↓
Categorizes risks by type (facility incidents, supply issues, etc.)
        ↓
Routes to appropriate specialists (Logistics, Manufacturing, Compliance, Cybersecurity)
        ↓
Each specialist analyzes from their domain expertise
        ↓
Synthesizes findings into actionable recommendations
        ↓
Stores insights in memory for future queries
        ↓
OUTPUT: Executive summary + prioritized recommendations in 15-30 seconds
```
1. Multi-Agent Architecture
- 4 specialized AI agents (domain experts)
- Parallel execution capability
- Hierarchical supervisor pattern for intelligent routing
2. Advanced AI Patterns
- Tool Use: Real-time news retrieval and risk categorization
- Reflection: Self-critique and improvement
- Chain-of-Thought: 6-step reasoning transparency
- Memory: Short-term (recent 10 queries) + long-term persistence
- Reinforcement Learning (RL): Q-learning pattern selector (experimental)
3. Production-Ready Design
- Context engineering (manages token limits efficiently)
- Retry logic with guardrails
- Human-in-the-loop debugging mode
- Comprehensive metrics tracking
4. Modular & Swappable
- Switch LLM providers via config (Groq/OpenRouter)
- Add new specialists by extending base class
- Config-driven risk categories
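Extending the system with a new domain expert could look roughly like this. This is a hypothetical sketch: the `BaseAgent` stand-in and the `GeopoliticsSpecialist` example are illustrative, and the real interface in `src/agents/base_agent.py` may differ.

```python
# Hypothetical sketch of adding a new specialist by subclassing a base agent.
# The actual BaseAgent interface in src/agents/base_agent.py may differ.
from abc import ABC, abstractmethod


class BaseAgent(ABC):
    """Minimal stand-in for the project's abstract base class."""

    def __init__(self, llm_engine):
        self.llm_engine = llm_engine  # injected, not created here

    @abstractmethod
    def analyze(self, query: str, articles: list[dict]) -> dict:
        """Return a structured risk report for this domain."""


class GeopoliticsSpecialist(BaseAgent):
    """Example new domain expert: trade policy and sanctions risk."""

    PROMPT = "You are a Geopolitics Supply Chain Specialist. Query: {query}"

    def analyze(self, query: str, articles: list[dict]) -> dict:
        prompt = self.PROMPT.format(query=query)
        # The real system would call self.llm_engine here; this sketch
        # returns a fixed-shape report matching the project's JSON schema.
        return {
            "risk_level": "medium",
            "findings": [f"{len(articles)} articles reviewed for '{query}'"],
            "recommendations": ["Monitor export-control announcements"],
            "confidence_score": 0.5,
        }
```

Because the specialist only implements `analyze()` and receives the LLM engine from outside, registering it with the supervisor is a config change rather than a workflow rewrite.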
Command:
```bash
python main.py --query "nvidia ai chips supply chain disruption"
```

Output (15 seconds):
================================================================================
SUPPLY CHAIN RISK ANALYSIS - RESULTS
================================================================================
[!!!] OVERALL RISK LEVEL: HIGH
--------------------------------------------------------------------------------
WHAT THIS MEANS:
The NVIDIA AI chip supply chain is at significant risk of disruption due
to increased demand for Microsoft's Maia 200 AI accelerator, potential
shipping delays, freight cost implications, and geopolitical tensions.
KEY RISKS IDENTIFIED:
1. Potential shipping delays and freight cost implications due to
increased demand for Microsoft's Maia 200 AI accelerator, which
could lead to a 15% increase in freight costs and a 20% delay in
delivery timelines
2. Possible transportation disruptions at ports in Taiwan and China,
affecting delivery timelines for NVIDIA AI chips
3. Geopolitical tensions and trade policies pose significant risk to
the supply chain
RECOMMENDED ACTIONS:
1. Priority: HIGH | When: immediate
Action: Diversify logistics and transportation partnerships to
mitigate potential route availability constraints
Who: NVIDIA Logistics Team
2. Priority: MEDIUM | When: short-term
Action: Implement proactive freight audit and payment processes
Who: NVIDIA Finance Team
ANALYSIS DETAILS:
News Articles Analyzed: 2
Expert Specialists Consulted: logistics
Reflection Pattern: Used (self-critique applied)
Chain-of-Thought: Used (6 reasoning steps)
Analysis Confidence: 85%
| Metric | Target | Measured | Status |
|---|---|---|---|
| System Reliability | 95%+ | 100% (5/5 successful) | ✅ Exceeds target |
| Response Time | <30s | 15-22s avg | ✅ Meets target |
| Recommendations Generated | 100% | 100% (5/5) | ✅ Meets target |
| Chain-of-Thought Reasoning | N/A | 6 steps (consistent) | ✅ Implemented |
| Caching | N/A | Functional (hit/miss tracking) | ✅ Implemented |
| Rate Limiting | N/A | Functional (TPM tracking) | ✅ Implemented |
| Capability | Status | Next Steps |
|---|---|---|
| Article Relevance | ~34% categorization rate observed | Needs manual labeling & tuning for 90%+ target |
| Risk Categorization | Keyword-based; functional | Needs ground truth dataset for accuracy validation |
| Multi-Specialist Activation | 1/4 specialists activated in tests | Review supervisor routing logic for proper activation |
| Token Efficiency | Theoretical 92% reduction (60K→4.8K) | Measure systematically across 100+ queries |
| Format Compliance | 95-100% observed in testing | Validation framework built; extract stats from larger sample |
Note: System successfully completes all queries with high reliability. Metrics collection infrastructure is production-ready. Formal evaluation with labeled test dataset (n=100+) is the next milestone for production deployment.
Query 1: "semiconductor shortage impact on automotive industry"
- Articles Retrieved: 1
- Articles Categorized: 0
- Specialists Used: logistics
- Response Time: ~22 seconds
- ✅ Analysis completed successfully
Query 2: "nvidia ai chips supply chain disruption"
- Articles Retrieved: 2
- Articles Categorized: 1 (medium risk)
- Specialists Used: logistics
- Response Time: ~15 seconds
- ✅ Analysis completed successfully
Query 3: "chip shortage" (with --dev flag)
- Articles Retrieved: 20
- Articles Categorized: 7 (medium risk)
- Specialists Used: logistics (human-in-the-loop interrupted)
- Human-in-the-Loop: ✅ Successfully paused workflow
- ✅ System responded correctly to user input
Query 4: "chinese semiconductor industry export restrictions"
- Articles Retrieved: 2
- Articles Categorized: 1 (medium risk)
- Specialists Used: logistics
- Response Time: ~14 seconds
- ✅ Analysis completed successfully
Query 5: "chip shortage" (with DEBUG logging)
- Articles Retrieved: 20
- Articles Categorized: 7 (medium risk)
- Specialists Used: logistics
- Rate Limiting: ✅ Triggered correctly ("TPM limit approaching, waiting 54.4s")
- Token Tracking: ✅ Visible in logs (11,384 → 9,496 tokens remaining)
- ✅ Analysis completed successfully
```
┌──────────────────────────────────────────────────────────────────┐
│ USER QUERY                                                       │
│ "chip shortage impact on automotive industry"                    │
└─────────────┬────────────────────────────────────────────────────┘
              │
      ┌───────▼────────┐
      │ (1) NEWS       │  Fetches 1-20 articles from NewsAPI
      │ RETRIEVAL      │  Context Engineering: Limit to 20 max
      │ [Tool, no LLM] │  File: src/tools/news_retrieval.py
      └───────┬────────┘
              │
      ┌───────▼────────┐
      │ (2) RISK       │  Categorizes by keywords (supply_issues, etc.)
      │ CATEGORIZATION │  Context Engineering: Keyword matching
      │ [Tool, no LLM] │  File: src/tools/risk_categorizer.py
      └───────┬────────┘  Config: config/risk_categories.py
              │
      ┌───────▼────────┐
      │ (3) SUPERVISOR │  Decides: 1, 2, or 4 specialists needed?
      │ ROUTING        │  Rule: 3+ HIGH = all 4, 1-2 HIGH = subset
      │ [Logic, no LLM]│  File: src/graph/nodes.py:supervisor_node()
      └───────┬────────┘
              │
      ┌───────▼──────────────────────────────────────────────┐
      │ (4) SPECIALIST AGENTS (Parallel Execution)           │
      │                                                      │
      │  ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────┐    │
      │  │LOGISTICS │ │MANUFACTUR│ │COMPLIANCE│ │CYBER  │    │
      │  │SPECIALIST│ │ING SPEC. │ │SPEC.     │ │SPEC.  │    │
      │  │[LLM Call]│ │[LLM Call]│ │[LLM Call]│ │[LLM]  │    │
      │  └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬───┘    │
      │       │            │            │           │        │
      │  Context Engineering: Each gets top 3 articles       │
      │  File: src/context/manager.py                        │
      │  Prompts: config/prompts.py (domain-specific)        │
      └───────┬──────────────────────────────────────────────┘
              │
      ┌───────▼────────┐
      │ (5) SYNTHESIS  │  Combines all specialist reports
      │                │  Optional: Chain-of-Thought (6 steps)
      │ [LLM Call]     │  Optional: Reflection (self-critique)
      │                │  File: src/graph/nodes.py:synthesis_node()
      └───────┬────────┘  Prompts: config/prompts.py
              │
      ┌───────▼────────┐
      │ (6) MEMORY     │  Stores analysis for future reference
      │ STORAGE        │  Short-term: Recent 10 queries
      │ [No LLM]       │  Long-term: Persistent JSON storage
      │                │  File: src/memory/memory_system.py
      └───────┬────────┘  Storage: data/memory_storage/
              │
      ┌───────▼────────┐
      │ (7) OUTPUT     │  Formatted terminal output
      │ FORMATTING     │  Optional: Save JSON file
      │                │  File: main.py:print_analysis_summary()
      └────────────────┘
```
Hierarchical Supervisor Pattern:
```
                 ┌──────────────┐
                 │  SUPERVISOR  │
                 │  (Decides)   │
                 └──────┬───────┘
                        │
     ┌──────────────────┼──────────────────┐
     │                  │                  │
HIGH SEVERITY      MEDIUM-HIGH        LOW SEVERITY
 (3+ risks)        (1-2 risks)         (0 risks)
     │                  │                  │
┌────▼─────┐      ┌────▼─────┐      ┌────▼─────┐
│ ALL 4    │      │ SUBSET   │      │ MINIMAL  │
│Logistics │      │Logistics │      │Logistics │
│Manufactur│      │Manufactur│      │only      │
│Compliance│      │          │      │          │
│Cyber     │      │          │      │          │
└──────────┘      └──────────┘      └──────────┘
```
Why This Matters:
- Token Efficiency: Don't call all 4 specialists for minor issues (saves 75% tokens)
- Cost Optimization: Intelligent routing reduces unnecessary LLM calls
- Speed: Parallel execution when multiple specialists are needed
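The routing rule above ("3+ HIGH risks = all 4, 1-2 HIGH = subset, 0 = minimal") reduces to a small pure function. This is an illustrative sketch, not the project's actual `supervisor_node()`; in particular, which specialists make up the "subset" is assumed here.

```python
# Illustrative sketch of the rule-based supervisor routing described above.
# The real logic lives in src/graph/nodes.py:supervisor_node() and may differ
# (e.g. the subset may be chosen from the detected risk categories).
ALL_SPECIALISTS = ["logistics", "manufacturing", "compliance", "cybersecurity"]


def route_specialists(risk_summary: dict) -> list[str]:
    """Pick specialists based on the count of HIGH-severity risks."""
    high_count = risk_summary.get("by_severity", {}).get("high", 0)
    if high_count >= 3:
        return ALL_SPECIALISTS                 # severe: consult all 4
    if high_count >= 1:
        return ["logistics", "manufacturing"]  # subset for 1-2 HIGH risks
    return ["logistics"]                       # minimal path for low severity
```

Keeping this rule-based (no LLM call) makes routing free, deterministic, and trivially testable.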
Problem: LLMs have limited context windows and only effectively use 20-30% of stated capacity.
Our Solution - 3 Context Engineering Points:
Point #1: Limit News Articles
```python
# src/tools/news_retrieval.py:42
max_articles = 20  # Hard limit
# Why: NewsAPI can return 100+, LLM can't process effectively
# Result: 20 articles instead of 100+ (80% reduction)
```

Point #2: Top 3 Articles Per Specialist
```python
# src/context/manager.py:40
MAX_ARTICLES_PER_CONTEXT = 3
# How: Rank articles by keyword overlap with query
# Result: Each specialist gets ONLY the 3 most relevant articles
# Theoretical savings: 20 articles (15k tokens) → 3 articles (1.2k tokens) = 92% reduction
```

Point #3: Token Counting & Truncation
```python
# src/context/manager.py:169
def count_tokens(text: str) -> int:
    encoding = tiktoken.get_encoding("cl100k_base")
    return len(encoding.encode(text))

# Max per specialist: 4,000 tokens
# Truncates smartly (preserves complete sentences)
```

Theoretical Impact:
```
WITHOUT context engineering:
  4 specialists × 15,000 tokens = 60,000 tokens per query
  Cost: High, slow, often fails

WITH context engineering:
  4 specialists × 1,200 tokens = 4,800 tokens per query
  Theoretical savings: 92% fewer tokens, 92% lower cost
```
Note: In practice, only one specialist was activated in development tests, so the full 4-specialist savings are theoretical.
Token tracking exists in the rate limiter but still requires systematic measurement.
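The "top 3 by keyword overlap" selection and sentence-preserving truncation described above can be sketched as follows. Assumptions not confirmed by the source: whitespace tokenization, title-only matching, and period-based sentence splitting; the real ContextManager may do all of these differently.

```python
# Sketch of the two context-engineering steps described above:
# (1) rank articles by keyword overlap with the query, keep the top 3;
# (2) truncate text at sentence boundaries to fit a character budget.
# The real src/context/manager.py may implement both differently.

def top_articles(query: str, articles: list[dict], k: int = 3) -> list[dict]:
    """Rank articles by word overlap between query and title; keep top k."""
    query_words = set(query.lower().split())

    def overlap(article: dict) -> int:
        words = set(article.get("title", "").lower().split())
        return len(words & query_words)

    return sorted(articles, key=overlap, reverse=True)[:k]


def truncate_sentences(text: str, max_chars: int) -> str:
    """Keep whole sentences until the character budget is exceeded."""
    kept, total = [], 0
    for sentence in text.split(". "):
        if total + len(sentence) > max_chars:
            break
        kept.append(sentence)
        total += len(sentence)
    return ". ".join(kept)
```

The same shape works with a token counter (e.g. tiktoken) in place of `len()` when a per-specialist token budget is enforced.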
Purpose: Learn from past analyses to improve future recommendations.
Implementation:
Short-Term Memory:
- What: Recent 10 queries and analyses
- Why: Quick access to recent context for follow-up questions
- How: In-memory dictionary with LRU eviction
- File: src/memory/memory_system.py:25-60
Long-Term Memory:
- What: Persistent storage of ALL analyses
- Why: Historical patterns, recurring risks, trend analysis
- How: JSON files indexed by query keywords
- Storage: data/memory_storage/long_term_memory.json
Example Memory Entry:
```json
{
  "query": "chip shortage",
  "timestamp": "2026-02-04T20:58:42",
  "overall_risk_level": "high",
  "key_findings": ["Shipping delays anticipated", "Port congestion expected"],
  "recommendations": ["Diversify logistics partnerships", "Develop contingency plans"],
  "metadata": {
    "specialists_used": ["logistics"],
    "articles_analyzed": 2,
    "response_time_seconds": 15
  }
}
```

View memory:

```bash
python main.py --memory-stats
```

All prompts stored in: config/prompts.py
Prompt Structure (4 parts):
```python
SPECIALIST_PROMPT = """
──────────────────────────────────────────
 PART 1: ROLE DEFINITION
──────────────────────────────────────────
You are a Logistics Supply Chain Specialist.

──────────────────────────────────────────
 PART 2: TASK + CONTEXT
──────────────────────────────────────────
Analyze the following supply chain risks focusing on LOGISTICS ONLY.

Query: {query}                    ← Filled at runtime
Relevant Articles: {articles}     ← Top 3 from ContextManager
Other Insights: {other_insights}  ← Other specialist findings

Focus Areas:
- Transportation disruptions
- Shipping delays
- Port congestion

──────────────────────────────────────────
 PART 3: FORMAT CONSTRAINT (JSON)
──────────────────────────────────────────
Output MUST be valid JSON in this EXACT format:
{{
  "risk_level": "high|medium|low",
  "findings": ["specific finding 1", "specific finding 2"],
  "recommendations": ["actionable rec 1", "actionable rec 2"],
  "confidence_score": 0.0-1.0
}}

──────────────────────────────────────────
 PART 4: GUARDRAILS
──────────────────────────────────────────
CRITICAL:
- Findings MUST include specific details (companies, locations, dates)
- Recommendations MUST be actionable (not generic advice)

JSON Output:"""
```

Why This Format Works:
- Role definition: LLM knows who to act as
- Context: Gets exactly what it needs (top 3 articles)
- JSON constraint: Forces structured output (not free text)
- Guardrails: Quality requirements (specificity, actionability)
Observed Result: 95-100% format compliance in development testing
| # | Purpose | Prompt Used | File Location | When It Runs |
|---|---|---|---|---|
| 1 | Logistics analysis | LOGISTICS_SPECIALIST_PROMPT | config/prompts.py:28 | If logistics risk detected |
| 2 | Manufacturing analysis | MANUFACTURING_SPECIALIST_PROMPT | config/prompts.py:70 | If manufacturing risk detected |
| 3 | Compliance analysis | COMPLIANCE_SPECIALIST_PROMPT | config/prompts.py:113 | If compliance risk detected |
| 4 | Cybersecurity analysis | CYBERSECURITY_SPECIALIST_PROMPT | config/prompts.py:156 | If cyber risk detected |
| 5 | Synthesis | SYNTHESIS_PROMPT | config/prompts.py:199 | Always (combines reports) |
| 6 | Chain-of-Thought (optional) | COT_PROMPT | config/prompts.py:278 | If enabled |
| 7 | Reflection (optional) | REFLECTION_PROMPT | config/prompts.py:244 | Enabled by default |
Typical query in development: 2-3 LLM calls (1 specialist + synthesis + reflection).
Maximum possible: 9 LLM calls (all specialists + synthesis + CoT + reflection).
What is State?
- Shared dictionary that all nodes read from and write to
- Think of it as a "whiteboard" in a conference room
- Each node (expert) reads the whiteboard, does work, writes back findings
State Structure:
```python
AgentState = {
    # User input
    'query': "chip shortage impact",

    # News data (from News Retrieval node)
    'news_articles': [article1, article2, ...],

    # Risk categorization (from Risk Categorization node)
    'categorized_risks': {'supply_issues': [...], 'facility_incident': [...]},
    'risk_summary': {'by_severity': {'high': 3, 'medium': 2, 'low': 0}},

    # Routing decisions (from Supervisor node)
    'specialists_to_invoke': ['logistics', 'manufacturing', 'compliance', 'cybersecurity'],

    # Specialist outputs (from Specialist nodes)
    'specialist_reports': {
        'logistics': {'risk_level': 'high', 'findings': [...], 'recommendations': [...]},
        'manufacturing': {...},
        'compliance': {...},
        'cybersecurity': {...}
    },

    # Advanced patterns (optional)
    'cot_reasoning': {...},        # Chain-of-Thought steps
    'reflection_critique': {...},  # Reflection improvements

    # Final output (from Synthesis node)
    'final_analysis': {
        'overall_risk_level': 'high',
        'executive_summary': '...',
        'critical_findings': [...],
        'prioritized_recommendations': [...]
    },

    # Metadata
    'metadata': {
        'articles_count': 2,
        'routing_decision': 'minimal',
        'specialists_used': ['logistics'],
        'response_time_seconds': 15
    },
    'errors': []
}
```

State File: src/graph/state.py
1. Basic Analysis

```bash
python main.py --query "chip shortage"
```

2. Test Complex Query

```bash
python main.py --query "ransomware attack factory shutdown port congestion GDPR violation"
```

3. Human-in-the-Loop (Debug Mode)

```bash
python main.py --query "semiconductor shortage" --dev
```

Pauses before each specialist runs, shows current state, option to skip.

4. Save Full Results

```bash
python main.py --query "factory fire Taiwan" --output results.json
```

5. View Metrics Dashboard

```bash
python main.py --metrics
```

6. Debug Logging (See Everything)

```bash
python main.py --query "port congestion" --log-level DEBUG
```

7. Disable Optional Patterns (Save Tokens)

```bash
python main.py --query "chip shortage" --disable-reflection --disable-cot
```

8. RL-Based Pattern Selection (Experimental)

```bash
python main.py --query "chip shortage" --rl-patterns
```

Uses Q-learning to automatically decide when to enable Reflection/CoT patterns based on query complexity.
- Python 3.10+
- pip package manager
- API keys (Groq, NewsAPI, OpenRouter)
```bash
# 1. Clone repository
git clone <repository-url>
cd Supply_Chain_POC

# 2. Create virtual environment
python -m venv .venv

# 3. Activate virtual environment
# Windows:
.venv\Scripts\activate
# Linux/Mac:
source .venv/bin/activate

# 4. Install dependencies
pip install -r requirements.txt

# 5. Configure API keys
cp .env.example .env
# Edit .env and add your API keys
```

Create a .env file in the project root:

```bash
# Required API Keys
NEWSAPI_KEY=your_newsapi_key_here
GROQ_API_KEY=your_groq_api_key_here
OPENROUTER_API_KEY=your_openrouter_api_key_here
```

Get API keys:
- NewsAPI: https://newsapi.org (free tier: 100 requests/day)
- Groq: https://console.groq.com (free tier available)
- OpenRouter: https://openrouter.ai (optional fallback)
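Before a full run, it can help to confirm the required keys are actually present. A small sanity-check sketch; it assumes the project loads `.env` via python-dotenv, which is common but not confirmed by this README:

```python
# Quick sanity check that required API keys are present in the environment.
# Assumption: the project reads keys from os.environ after loading .env
# (e.g. via python-dotenv). Adjust if configuration is loaded differently.
import os

REQUIRED = ["NEWSAPI_KEY", "GROQ_API_KEY"]


def missing_keys(env: dict) -> list[str]:
    """Return the names of required keys that are absent or empty."""
    return [k for k in REQUIRED if not env.get(k)]


if __name__ == "__main__":
    # from dotenv import load_dotenv; load_dotenv()  # if using python-dotenv
    print(missing_keys(dict(os.environ)))
```

An empty list means both required keys are set; anything else names what is missing.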
```bash
# Test basic command
python main.py --query "test"
# Expected: Analysis completes in ~20-30 seconds
```

| Document | Purpose | When to Read |
|---|---|---|
| START_HERE.md | Quick start guide with all features | Start here - First time setup and testing |
| ARCHITECTURE_FLOW.md | Complete architecture flow with diagrams | Understanding how it works end-to-end |
| SIMPLE_PIPELINE.md | One-page quick reference | Quick lookup of files, prompts, and flow |
| QUICK_COMMANDS.md | Copy-paste command reference | Running tests and viewing results |
| SYSTEM_VERIFICATION.md | Detailed system documentation | Deep dive into every architectural decision |
Configuration:
- config/prompts.py - All 9 LLM prompts
- config/risk_categories.py - Risk keyword definitions
- config/models.yaml - LLM provider settings
- .env - API keys (create from .env.example)
Core Implementation:
- src/graph/workflow.py - LangGraph workflow orchestration
- src/graph/nodes.py - All node functions
- src/agents/specialist_agents.py - 4 specialist agents
- src/context/manager.py - Context engineering
- src/memory/memory_system.py - Memory system
- main.py - CLI interface
```
Supply_Chain_POC/
├── config/
│   ├── config.yaml                # Main configuration
│   ├── models.yaml                # LLM provider settings
│   ├── prompts.py                 # All 9 LLM prompts (CENTRALIZED)
│   └── risk_categories.py         # Risk keyword definitions
├── src/
│   ├── agents/
│   │   ├── base_agent.py          # Abstract base class
│   │   ├── specialist_agents.py   # 4 domain specialists
│   │   ├── reflection_agent.py    # Reflection pattern
│   │   └── cot_agent.py           # Chain-of-thought
│   ├── context/
│   │   └── manager.py             # Context engineering (token management)
│   ├── evaluation/
│   │   ├── input_validator.py     # Query validation
│   │   ├── output_validator.py    # Format validation
│   │   └── llm_judge.py           # Quality assessment
│   ├── graph/
│   │   ├── workflow.py            # LangGraph workflow orchestration
│   │   ├── nodes.py               # All node functions (7 nodes)
│   │   ├── routing.py             # Supervisor decision logic
│   │   └── state.py               # State management (shared dictionary)
│   ├── memory/
│   │   └── memory_system.py       # Short-term + long-term memory
│   ├── rl/
│   │   ├── __init__.py            # RL module exports
│   │   └── pattern_selector.py    # Q-learning pattern selection
│   ├── metrics/
│   │   └── tracker.py             # Business & technical metrics
│   ├── tools/
│   │   ├── news_retrieval.py      # NewsAPI integration
│   │   └── risk_categorizer.py    # Rule-based categorization
│   ├── utils/
│   │   └── logging_config.py      # Logging setup
│   └── llm_engine.py              # Modular LLM abstraction (Groq/OpenRouter)
├── data/
│   ├── memory_storage/
│   │   ├── long_term_memory.json  # Persistent analysis storage
│   │   └── short_term_memory.json # Recent queries cache
│   ├── metrics/
│   │   └── metrics_*.json         # Historical metrics
│   └── rl/
│       └── q_table.json           # Q-learning state-action values
├── tests/
│   ├── test_llm_engine.py
│   ├── test_evaluation.py
│   ├── test_agents/
│   └── test_tools/
├── docs/                          # Comprehensive documentation
├── .env.example                   # API key template
├── main.py                        # CLI interface (entry point)
├── requirements.txt               # Python dependencies
└── README.md                      # This file
```
Every LLM call enforces strict JSON format:

```python
prompt = f"""
Task: {task}

Output MUST be valid JSON in this exact format:
{{
  "risk_level": "high|medium|low",
  "findings": ["finding1", "finding2"],
  "recommendations": ["rec1", "rec2"]
}}
"""

# Retry logic with guardrails
for attempt in range(3):
    output = llm.invoke(prompt)
    if validate_format(output):
        return output
    prompt = make_stricter(prompt, attempt)
return safe_default_output()
```

Observed Result: 95-100% validation compliance in development testing
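A minimal version of the `validate_format` check referenced in the retry loop might look like this. This is a hypothetical sketch; the project's actual `src/evaluation/output_validator.py` may enforce more (e.g. confidence-score ranges):

```python
# Hypothetical sketch of the format validation behind the retry loop.
# The real output validator may check additional fields and constraints.
import json

REQUIRED_KEYS = {"risk_level", "findings", "recommendations"}


def validate_format(output: str) -> bool:
    """True iff the LLM output is a JSON object with the required schema."""
    try:
        data = json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return False  # not parseable JSON at all
    if not isinstance(data, dict) or not REQUIRED_KEYS <= data.keys():
        return False  # wrong shape or missing keys
    return (data["risk_level"] in {"high", "medium", "low"}
            and isinstance(data["findings"], list)
            and isinstance(data["recommendations"], list))
```

Returning a boolean keeps the retry loop simple: any failure path just triggers a stricter re-prompt.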
Each component is independently testable:
```python
# Agents receive the LLM engine (they don't create it)
specialist = LogisticsSpecialist(llm_engine=engine)

# Tools are pure functions
categories = risk_categorizer.categorize(text)

# Validators are stateless
result = input_validator.validate(query)
```

Benefits:
- Easy to test (mock LLM engine)
- Easy to extend (add new specialist)
- Easy to swap (change LLM provider)
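Because the engine is injected, unit tests can swap in a stub and never touch a real API. The classes below are illustrative stand-ins, not the project's actual `LogisticsSpecialist` or engine interface:

```python
# Illustrative unit test of the dependency-injection pattern above, using a
# stub engine. The agent and engine here are stand-ins; the real specialist
# classes and llm_engine interface may differ.
class StubEngine:
    """Fake LLM engine returning a canned, schema-shaped report."""

    def invoke(self, prompt: str) -> dict:
        return {"risk_level": "low", "findings": [], "recommendations": []}


class Specialist:
    def __init__(self, llm_engine):
        self.llm_engine = llm_engine  # injected dependency

    def analyze(self, query: str) -> dict:
        return self.llm_engine.invoke(f"Analyze: {query}")


def test_specialist_with_stub_engine():
    report = Specialist(llm_engine=StubEngine()).analyze("chip shortage")
    assert report["risk_level"] == "low"
```

The same pattern lets tests assert on prompts, simulate malformed outputs, or measure retry behavior without network calls.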
- Business logic β Agents (src/agents/)
- LLM interaction β Engine (src/llm_engine.py)
- Workflow orchestration β LangGraph (src/graph/)
- Context engineering β ContextManager (src/context/manager.py)
- Metrics tracking β Separate from execution (src/metrics/)
```bash
python main.py --query "chip shortage" --dev
```

What you see:

```
About to execute: logistics_specialist
Current state: 20 articles, 0 specialist reports
Proceed? (y/n/skip):
```

Your options:
- `y` = Run this specialist
- `skip` = Skip this specialist (save tokens)
- `n` = Abort the entire workflow
```bash
python main.py --query "chip shortage" --log-level DEBUG
```

Shows:
- Every LLM call with token counts
- Routing decisions and why
- Validation pass/fail for each output
- Cache hits/misses
- Context engineering details
- Rate limit tracking
```bash
python main.py --metrics
```

Output:

```
BUSINESS METRICS:
  Total analyses: 5
  Average response time: 18s
  Risk distribution:
    HIGH: 3 (60%)
    MEDIUM: 2 (40%)

TECHNICAL METRICS:
  Total analyses: 5
  Average format compliance: 1.0 (100%)
  Cache hit rate: Tracked
```
Edit config/models.yaml:
```yaml
models:
  specialists:
    provider: groq              # Options: 'groq' or 'openrouter'
    model: llama-3.3-70b-versatile
    temperature: 0.7
    max_tokens: 4000
```

Edit config/risk_categories.py:
```python
RISK_CATEGORIES = {
    'facility_incident': {
        'keywords': ['fire', 'explosion', 'shutdown', 'accident'],
        'severity': 'high'
    },
    # Add new categories here
    'your_custom_category': {
        'keywords': ['keyword1', 'keyword2'],
        'severity': 'medium'
    }
}
```

```bash
# Disable reflection (save tokens)
python main.py --query "chip shortage" --disable-reflection

# Disable chain-of-thought (save tokens)
python main.py --query "chip shortage" --disable-cot

# Disable both (minimal tokens)
python main.py --query "chip shortage" --disable-reflection --disable-cot
```

What it does: Uses Q-learning to automatically decide when to enable expensive AI patterns (Reflection and Chain-of-Thought) based on query complexity and past performance.
How to use:
```bash
python main.py --query "chip shortage" --rl-patterns
```

How it works:
- Extracts state features from query (query length, risk severity, article count)
- Uses epsilon-greedy strategy: 20% exploration (random), 80% exploitation (best known)
- Executes selected patterns
- Calculates reward: 60% analysis quality + 40% token efficiency
- Updates Q-table for future learning
Q-table storage:
- Stored at: data/rl/q_table.json
- Persists across runs for continuous learning
Limitations:
- Requires 50-200 queries to learn effective policies
- First 10-20 queries will be mostly random (exploration phase)
- State discretization loses some precision
- Performance depends on query distribution
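The epsilon-greedy selection and Q-update described above reduce to a few lines. A sketch under stated assumptions — tabular Q-values keyed by `(state, action)`, a learning rate of 0.1, and single-step episodes with no discounted future term; the real `src/rl/pattern_selector.py` may differ:

```python
# Sketch of the epsilon-greedy Q-learning loop described above.
# Assumptions: tabular Q-table keyed by (state, action), learning rate 0.1,
# single-step episodes (no discounted future term).
import random

ACTIONS = ["none", "reflection", "cot", "reflection+cot"]


def select_action(q_table: dict, state: str, epsilon: float = 0.2) -> str:
    """Epsilon-greedy: explore randomly, otherwise exploit best known action."""
    if random.random() < epsilon:                       # 20% exploration
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q_table.get((state, a), 0.0))


def update_q(q_table: dict, state: str, action: str,
             reward: float, alpha: float = 0.1) -> None:
    """Single-step Q-update: move the estimate toward the observed reward."""
    old = q_table.get((state, action), 0.0)
    q_table[(state, action)] = old + alpha * (reward - old)


def reward_for(quality: float, token_efficiency: float) -> float:
    """Reward = 60% analysis quality + 40% token efficiency, both in [0, 1]."""
    return 0.6 * quality + 0.4 * token_efficiency
```

Persisting `q_table` to JSON between runs (as `data/rl/q_table.json` does) is what allows learning to continue across sessions.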
Deep Dive: See docs/rl_pattern_selection.md for algorithm details.
Understanding the WHY, limitations, and trade-offs of each architectural decision:
Doc: docs/multi_agent_architecture.md
Key Questions Answered:
- Why 4 specialized agents instead of 1 general LLM?
- Why is the supervisor rule-based (not LLM)?
- Why don't specialists communicate with each other?
- What's hard-coded and what's learned?
Key Questions Answered:
- Why use prompts instead of fine-tuning?
- When does Chain-of-Thought help?
- Why does Reflection double token cost?
- When do these patterns fail?
Doc: docs/rl_pattern_selection.md
Key Questions Answered:
- Why Q-learning instead of prompts?
- What's the cold-start problem?
- Why discretize continuous features?
- How is reward calculated?
Doc: docs/guardrails_validation.md
Key Questions Answered:
- What's validated and what's NOT?
- Why doesn't LLM judge catch all errors?
- Why retry instead of fail immediately?
- What about fact-checking?
What's NOT Validated:
- Factual accuracy (no knowledge base)
- Logical consistency
- Completeness (missing risks)
- Hallucinated citations
```bash
# Run all tests
pytest

# Run with coverage
pytest --cov=src tests/

# Run specific module
pytest tests/test_llm_engine.py -v

# Run specific test
pytest tests/test_evaluation.py::TestInputValidator -v
```

Test coverage:
- LLM Engine: Provider initialization, JSON parsing, retry logic
- Agents: Specialist outputs, reflection, chain-of-thought
- Tools: Risk categorization accuracy
- Evaluation: Input/output validation
API Key Errors:
ValueError: GROQ_API_KEY not found
→ Create .env file with GROQ_API_KEY=your_key_here

Import Errors:
ModuleNotFoundError: No module named 'src'
→ Ensure virtual environment is activated: .venv\Scripts\activate

LLM Output Errors:
Failed to parse JSON after 3 attempts
→ Check logs in logs/supply_chain_risk.log (guardrails retry automatically)

NewsAPI Errors:
NewsAPIException: API key invalid
→ Verify key at https://newsapi.org/account (free tier: 100 requests/day)

Rate Limit Errors:
429 Too Many Requests
→ Wait 1 hour or use --disable-reflection --disable-cot to reduce calls
Token Budget (typical query based on development testing):
| Component | Estimated Input Tokens | Estimated Output Tokens | Total |
|---|---|---|---|
| Specialist (1) | 1,200 | 400 | 1,600 |
| Synthesis | 2,500 | 600 | 3,100 |
| Reflection | +500 | +300 | +800 |
| Chain-of-Thought | +800 | +400 | +1,200 |
| Base Total | 3,700 | 1,000 | 4,700 |
With Reflection: +20% tokens
With Chain-of-Thought: +25% tokens
Cost estimate (Groq free tier):
- Free tier: 14,400 requests/day
- Development testing: ~4,700-6,000 tokens/query (1 specialist + synthesis + patterns)
- Can run hundreds of queries per day within free tier limits
Note: Token measurements from rate limiter logs show actual consumption. Systematic measurement across large query sample recommended for production deployment.
- ✅ Risk Categorization Error: Fixed KeyError when 0 articles returned
- ✅ HITL Display Bug: Now shows actual node names instead of `__interrupt__`
- ✅ Chain-of-Thought Error: Fixed article source field handling
- ✅ Output Formatting: Dramatically improved clarity and readability
  - Clear visual risk indicators: `[!!!]` HIGH, `[!!]` MEDIUM, `[!]` LOW
  - Explanatory section headers
  - Text wrapping for better readability
  - Shows which AI patterns were used
  - Displays confidence scores
MIT License - See LICENSE file for details
- Fork the repository
- Create a feature branch (`git checkout -b feature/new-specialist`)
- Add tests for new functionality
- Ensure all tests pass (`pytest`)
- Submit a pull request
For issues or questions, please open a GitHub issue.
1. Problem Solved:
- Automated multi-domain supply chain risk analysis
- Reduced analysis time from hours to seconds
- Structured, actionable recommendations
2. Technical Implementation:
- LangGraph orchestration with hierarchical supervisor pattern
- 4 specialized AI agents with domain-specific prompts
- Context engineering for token optimization
- Validation framework with retry mechanisms
3. Production-Ready Features:
- Modular architecture (swap LLM providers via config)
- Comprehensive error handling and retry logic
- Human-in-the-loop debugging capability
- Metrics tracking (business + technical)
- Memory system (short-term + long-term)
- Rate limiting and caching
4. Advanced AI Patterns Implemented:
- ✅ Tool Use (news retrieval, risk categorization)
- ✅ Multi-Agent (4 specialists with parallel execution capability)
- ✅ Reflection (self-critique and improvement)
- ✅ Chain-of-Thought (6-step structured reasoning)
- ✅ Memory (persistent learning and context)
5. Development Status:
- Functional POC with 100% query completion rate (5/5 test queries)
- Validation frameworks built and operational
- Response times: 15-22 seconds average
- Ready for scaled evaluation with larger test datasets
6. Next Steps for Production:
- Create labeled test dataset (100+ queries with ground truth)
- Run systematic evaluation to measure precision, recall, accuracy
- Tune supervisor routing logic for multi-specialist activation
- Measure token efficiency across diverse query distribution
- Implement continuous monitoring and feedback loops
Built with: Python 3.10+ | LangGraph | Groq (Llama 3.3 70B) | NewsAPI