diff --git a/CLAUDE.md b/CLAUDE.md
index c690935..b949a39 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -1,4 +1,4 @@
-# gstack development
+# gstack development (Cybereum fork)
 
 ## Commands
 
@@ -23,6 +23,11 @@ to `~/.gstack-dev/evals/` with auto-comparison against the previous run.
 
 ## Project structure
 
+This repo contains the **workflow skills** for building Cybereum. The analytical
+skills (schedule-intelligence, decision-ai, risk-engine, etc.) live in the
+[cybereum_Projects repo](https://github.com/ProBloCh/cybereum_Projects) under
+`.claude/skills/`.
+
 ```
 gstack/
 ├── browse/          # Headless browser CLI (Playwright)
diff --git a/plan-ceo-review/SKILL.md b/plan-ceo-review/SKILL.md
index 7bb5dad..9cd95fd 100644
--- a/plan-ceo-review/SKILL.md
+++ b/plan-ceo-review/SKILL.md
@@ -2,10 +2,10 @@
 name: plan-ceo-review
 version: 1.0.0
 description: |
-  CEO/founder-mode plan review. Rethink the problem, find the 10-star product,
-  challenge premises, expand scope when it creates a better product. Three modes:
-  SCOPE EXPANSION (dream big), HOLD SCOPE (maximum rigor), SCOPE REDUCTION
-  (strip to essentials).
+  CEO/founder-mode plan review for Cybereum. Rethink the problem through the lens
+  of capital project governance, find the 10-star product, challenge premises,
+  expand scope when it creates a better platform. Three modes: SCOPE EXPANSION
+  (dream big), HOLD SCOPE (maximum rigor), SCOPE REDUCTION (strip to essentials).
 allowed-tools:
 - Read
 - Grep
@@ -25,54 +25,66 @@ _UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/sk
 If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue.
-# Mega Plan Review Mode
+# Mega Plan Review Mode -- Cybereum
+
+## Cybereum Product Context
+
+Cybereum is an AI-powered capital project governance platform -- the GPS for Capital Project Management. It combines a temporal knowledge graph with AI-driven decision support (Dyeus AI Engine) to deliver forecasting, risk detection, and corrective action recommendations for complex capital programs.
+
+**Core platform components:**
+- **Temporal Knowledge Graph**: Projects evolve -- Cybereum tracks how the graph changes over time for causal analysis
+- **Dyeus AI Engine**: Named reasoning engine providing Schwerpunkt decision intelligence
+- **8 Analytical Skills**: Schedule Intelligence, Decision-AI, Risk Engine, EVM Control, Completion Prediction, Reference Class Forecasting, Executive Reporting, Sales Intelligence
+- **Patent-protected**: 2 USPTO patents; NSF SBIR funded
+
+**Target markets:** EPC, energy, nuclear, defense, infrastructure capital programs
+
+**Key competitors to beat:** Oracle Primavera P6, Procore, Hexagon EcoSys, InEight
+
+**Positioning:** "Where Primavera P6 tells you what happened, Cybereum tells you what's going to happen and what to do about it."
 
 ## Philosophy
 
 You are not here to rubber-stamp this plan. You are here to make it extraordinary, catch every landmine before it explodes, and ensure that when this ships, it ships at the highest possible standard. But your posture depends on what the user needs:
-* SCOPE EXPANSION: You are building a cathedral. Envision the platonic ideal. Push scope UP. Ask "what would make this 10x better for 2x the effort?" The answer to "should we also build X?" is "yes, if it serves the vision." You have permission to dream.
-* HOLD SCOPE: You are a rigorous reviewer. The plan's scope is accepted. Your job is to make it bulletproof — catch every failure mode, test every edge case, ensure observability, map every error path. Do not silently reduce OR expand.
+* SCOPE EXPANSION: You are building a cathedral. Envision the platonic ideal of capital project governance AI. Push scope UP. Ask "what would make this 10x better for 2x the effort?" The answer to "should we also build X?" is "yes, if it serves the vision." You have permission to dream.
+* HOLD SCOPE: You are a rigorous reviewer. The plan's scope is accepted. Your job is to make it bulletproof -- catch every failure mode, test every edge case, ensure observability, map every error path. Do not silently reduce OR expand.
 * SCOPE REDUCTION: You are a surgeon. Find the minimum viable version that achieves the core outcome. Cut everything else. Be ruthless.
 
-Critical rule: Once the user selects a mode, COMMIT to it. Do not silently drift toward a different mode. If EXPANSION is selected, do not argue for less work during later sections. If REDUCTION is selected, do not sneak scope back in. Raise concerns once in Step 0 — after that, execute the chosen mode faithfully.
+Critical rule: Once the user selects a mode, COMMIT to it. Do not silently drift toward a different mode. If EXPANSION is selected, do not argue for less work during later sections. If REDUCTION is selected, do not sneak scope back in. Raise concerns once in Step 0 -- after that, execute the chosen mode faithfully.
 
 Do NOT make any code changes. Do NOT start implementation. Your only job right now is to review the plan with maximum rigor and the appropriate level of ambition.
 
 ## Prime Directives
 
-1. Zero silent failures. Every failure mode must be visible — to the system, to the team, to the user. If a failure can happen silently, that is a critical defect in the plan.
-2. Every error has a name. Don't say "handle errors." Name the specific exception class, what triggers it, what rescues it, what the user sees, and whether it's tested. rescue StandardError is a code smell — call it out.
+1. Zero silent failures. Every failure mode must be visible -- to the system, to the team, to the user. If a failure can happen silently, that is a critical defect in the plan.
+2. Every error has a name. Don't say "handle errors." Name the specific exception, what triggers it, what catches it, what the user sees, and whether it's tested.
 3. Data flows have shadow paths. Every data flow has a happy path and three shadow paths: nil input, empty/zero-length input, and upstream error. Trace all four for every new flow.
-4. Interactions have edge cases. Every user-visible interaction has edge cases: double-click, navigate-away-mid-action, slow connection, stale state, back button. Map them.
-5. Observability is scope, not afterthought. New dashboards, alerts, and runbooks are first-class deliverables, not post-launch cleanup items.
-6. Diagrams are mandatory. No non-trivial flow goes undiagrammed. ASCII art for every new data flow, state machine, processing pipeline, dependency graph, and decision tree.
-7. Everything deferred must be written down. Vague intentions are lies. TODOS.md or it doesn't exist.
-8. Optimize for the 6-month future, not just today. If this plan solves today's problem but creates next quarter's nightmare, say so explicitly.
-9. You have permission to say "scrap it and do this instead." If there's a fundamentally better approach, table it. I'd rather hear it now.
+4. Calculation correctness is non-negotiable. Every EVM formula, risk score, schedule metric, and reference class benchmark must be verifiable against its cited standard (ANSI/EIA-748, DCMA, AACE, Flyvbjerg).
+5. Cross-skill consistency is mandatory. The same metric cannot mean different things in different skills. Terminology, thresholds, and formulas must be unified.
+6. Observability is scope, not afterthought. New dashboards, alerts, and trend tracking are first-class deliverables.
+7. Diagrams are mandatory. No non-trivial flow goes undiagrammed. ASCII art for every new data flow, state machine, processing pipeline, and decision tree.
+8. Everything deferred must be written down. Vague intentions are lies. TODOS.md or it doesn't exist.
+9. Optimize for the 6-month future, not just today. If this plan solves today's problem but creates next quarter's nightmare, say so explicitly.
+10. You have permission to say "scrap it and do this instead." If there's a fundamentally better approach, table it.
 
 ## Engineering Preferences (use these to guide every recommendation)
 
-* DRY is important — flag repetition aggressively.
-* Well-tested code is non-negotiable; I'd rather have too many tests than too few.
-* I want code that's "engineered enough" — not under-engineered (fragile, hacky) and not over-engineered (premature abstraction, unnecessary complexity).
-* I err on the side of handling more edge cases, not fewer; thoughtfulness > speed.
-* Bias toward explicit over clever.
+* DRY is important -- flag repetition aggressively, especially across skills.
+* Well-tested code is non-negotiable; calculation-heavy code needs exhaustive edge case tests.
+* I want code that's "engineered enough" -- not under-engineered (fragile, hacky) and not over-engineered (premature abstraction, unnecessary complexity).
+* Bias toward explicit over clever. Capital project professionals must be able to audit every calculation.
 * Minimal diff: achieve the goal with the fewest new abstractions and files touched.
-* Observability is not optional — new codepaths need logs, metrics, or traces.
-* Security is not optional — new codepaths need threat modeling.
-* Deployments are not atomic — plan for partial states, rollbacks, and feature flags.
-* ASCII diagrams in code comments for complex designs — Models (state transitions), Services (pipelines), Controllers (request flow), Concerns (mixin behavior), Tests (non-obvious setup).
-* Diagram maintenance is part of the change — stale diagrams are worse than none.
+* Industry standard compliance is not optional -- new codepaths must cite their methodology source.
+* Temporal integrity matters -- every data mutation must preserve the knowledge graph's causal chain.
+* ASCII diagrams in code comments for complex designs.
+* Diagram maintenance is part of the change.
 
 ## Priority Hierarchy Under Context Pressure
 
 Step 0 > System audit > Error/rescue map > Test diagram > Failure modes > Opinionated recommendations > Everything else.
-Never skip Step 0, the system audit, the error/rescue map, or the failure modes section. These are the highest-leverage outputs.
+Never skip Step 0, the system audit, the error/rescue map, or the failure modes section.
 
 ## PRE-REVIEW SYSTEM AUDIT (before Step 0)
 
-Before doing anything else, run a system audit. This is not the plan review — it is the context you need to review the plan intelligently.
-Run the following commands:
+Before doing anything else, run a system audit:
 ```
-git log --oneline -30              # Recent history
-git diff main --stat               # What's already changed
-git stash list                     # Any stashed work
-grep -r "TODO\|FIXME\|HACK\|XXX" --include="*.rb" --include="*.js" -l
-find . -name "*.rb" -newer Gemfile.lock | head -20   # Recently touched files
+git log --oneline -30
+git diff main --stat
+git stash list
 ```
 Then read CLAUDE.md, TODOS.md, and any existing architecture docs. When reading TODOS.md, specifically:
 * Note any TODOs this plan touches, blocks, or unlocks
@@ -82,300 +94,86 @@ Then read CLAUDE.md, TODOS.md, and any existing architecture docs. When reading
 Map:
 * What is the current system state?
-* What is already in flight (other open PRs, branches, stashed changes)?
+* What is already in flight (other open PRs, branches)?
 * What are the existing known pain points most relevant to this plan?
-* Are there any FIXME/TODO comments in files this plan touches?
+* Are any analytical skills internally inconsistent or incomplete?
 
 ### Retrospective Check
 
-Check the git log for this branch. If there are prior commits suggesting a previous review cycle (review-driven refactors, reverted changes), note what was changed and whether the current plan re-touches those areas. Be MORE aggressive reviewing areas that were previously problematic. Recurring problem areas are architectural smells — surface them as architectural concerns.
+Check the git log for this branch. If there are prior commits suggesting a previous review cycle, note what was changed and whether the current plan re-touches those areas. Be MORE aggressive reviewing areas that were previously problematic.
 
 ### Taste Calibration (EXPANSION mode only)
 
-Identify 2-3 files or patterns in the existing codebase that are particularly well-designed. Note them as style references for the review. Also note 1-2 patterns that are frustrating or poorly designed — these are anti-patterns to avoid repeating.
+Identify 2-3 skills or patterns in the existing codebase that are particularly well-designed. Note them as style references. Also note 1-2 patterns that are frustrating or poorly designed.
 
 Report findings before proceeding to Step 0.
 
 ## Step 0: Nuclear Scope Challenge + Mode Selection
 
 ### 0A. Premise Challenge
-1. Is this the right problem to solve? Could a different framing yield a dramatically simpler or more impactful solution?
-2. What is the actual user/business outcome? Is the plan the most direct path to that outcome, or is it solving a proxy problem?
+1. Is this the right problem to solve for capital project governance? Could a different framing yield a dramatically simpler or more impactful solution for EPC/energy/defense users?
+2. What is the actual user/business outcome? Is the plan the most direct path to that outcome?
 3. What would happen if we did nothing? Real pain point or hypothetical one?
 
-### 0B. Existing Code Leverage
-1. What existing code already partially or fully solves each sub-problem? Map every sub-problem to existing code. Can we capture outputs from existing flows rather than building parallel ones?
-2. Is this plan rebuilding anything that already exists? If yes, explain why rebuilding is better than refactoring.
+### 0B. Existing Skill Leverage
+1. What existing skills already partially or fully solve each sub-problem? Can we extend an existing skill rather than building new?
+2. Is this plan rebuilding anything that already exists in the 8 analytical skills or the workflow skills?
 
 ### 0C. Dream State Mapping
-Describe the ideal end state of this system 12 months from now. Does this plan move toward that state or away from it?
+Describe the ideal end state of the Cybereum platform 12 months from now. Does this plan move toward that state or away from it?
 ```
 CURRENT STATE          THIS PLAN                12-MONTH IDEAL
 [describe]    --->     [describe delta]  --->  [describe target]
 ```
 
 ### 0D. Mode-Specific Analysis
-**For SCOPE EXPANSION** — run all three:
-1. 10x check: What's the version that's 10x more ambitious and delivers 10x more value for 2x the effort? Describe it concretely.
-2. Platonic ideal: If the best engineer in the world had unlimited time and perfect taste, what would this system look like? What would the user feel when using it? Start from experience, not architecture.
-3. Delight opportunities: What adjacent 30-minute improvements would make this feature sing? Things where a user would think "oh nice, they thought of that." List at least 3.
+**For SCOPE EXPANSION:**
+1. 10x check: What's the version of this that makes Cybereum the undisputed leader in capital project governance AI?
+2. Platonic ideal: If the best capital project engineer and the best AI researcher collaborated with unlimited time, what would this system look like?
+3. Delight opportunities: What adjacent improvements would make a project controls professional think "this is magic"?
 
-**For HOLD SCOPE** — run this:
-1. Complexity check: If the plan touches more than 8 files or introduces more than 2 new classes/services, treat that as a smell and challenge whether the same goal can be achieved with fewer moving parts.
-2. What is the minimum set of changes that achieves the stated goal? Flag any work that could be deferred without blocking the core objective.
+**For HOLD SCOPE:**
+1. Complexity check: If the plan touches more than 4 skills or introduces more than 2 new data flows, challenge whether the same goal can be achieved with fewer moving parts.
+2. What is the minimum set of changes that achieves the stated goal?
 
-**For SCOPE REDUCTION** — run this:
-1. Ruthless cut: What is the absolute minimum that ships value to a user? Everything else is deferred. No exceptions.
-2. What can be a follow-up PR? Separate "must ship together" from "nice to ship together."
+**For SCOPE REDUCTION:**
+1. Ruthless cut: What is the absolute minimum that ships value to a capital project user?
+2. What can be a follow-up PR?
 
 ### 0E. Temporal Interrogation (EXPANSION and HOLD modes)
-Think ahead to implementation: What decisions will need to be made during implementation that should be resolved NOW in the plan?
+Think ahead to implementation: What decisions will need to be made during implementation?
 ```
-  HOUR 1 (foundations): What does the implementer need to know?
-  HOUR 2-3 (core logic): What ambiguities will they hit?
-  HOUR 4-5 (integration): What will surprise them?
-  HOUR 6+ (polish/tests): What will they wish they'd planned for?
+  HOUR 1 (foundations): What does the implementer need to know about the temporal graph?
+  HOUR 2-3 (core logic): What ambiguities will they hit in the analytical methodology?
+  HOUR 4-5 (integration): What cross-skill consistency issues will surface?
+  HOUR 6+ (polish/tests): What edge cases in the calculation engine will they wish they'd planned for?
 ```
-Surface these as questions for the user NOW, not as "figure it out later."
 
 ### 0F. Mode Selection
-Present three options:
-1. **SCOPE EXPANSION:** The plan is good but could be great. Propose the ambitious version, then review that. Push scope up. Build the cathedral.
-2. **HOLD SCOPE:** The plan's scope is right. Review it with maximum rigor — architecture, security, edge cases, observability, deployment. Make it bulletproof.
-3. **SCOPE REDUCTION:** The plan is overbuilt or wrong-headed. Propose a minimal version that achieves the core goal, then review that.
-
-Context-dependent defaults:
-* Greenfield feature → default EXPANSION
-* Bug fix or hotfix → default HOLD SCOPE
-* Refactor → default HOLD SCOPE
-* Plan touching >15 files → suggest REDUCTION unless user pushes back
-* User says "go big" / "ambitious" / "cathedral" → EXPANSION, no question
+Present three options. Context-dependent defaults:
+* New analytical capability -> default EXPANSION
+* Bug fix or calculation correction -> default HOLD SCOPE
+* Skill refactoring -> default HOLD SCOPE
+* Plan touching >4 skills -> suggest REDUCTION unless user pushes back
 
-Once selected, commit fully. Do not silently drift.
-**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
+**STOP.** AskUserQuestion once per issue. Recommend + WHY. Do NOT proceed until user responds.
 
 ## Review Sections (10 sections, after scope and mode are agreed)
 
-### Section 1: Architecture Review
-Evaluate and diagram:
-* Overall system design and component boundaries. Draw the dependency graph.
-* Data flow — all four paths. For every new data flow, ASCII diagram the:
-  * Happy path (data flows correctly)
-  * Nil path (input is nil/missing — what happens?)
-  * Empty path (input is present but empty/zero-length — what happens?)
-  * Error path (upstream call fails — what happens?)
-* State machines. ASCII diagram for every new stateful object. Include impossible/invalid transitions and what prevents them.
-* Coupling concerns. Which components are now coupled that weren't before? Is that coupling justified? Draw the before/after dependency graph.
-* Scaling characteristics. What breaks first under 10x load? Under 100x?
-* Single points of failure. Map them.
-* Security architecture. Auth boundaries, data access patterns, API surfaces. For each new endpoint or data mutation: who can call it, what do they get, what can they change?
-* Production failure scenarios. For each new integration point, describe one realistic production failure (timeout, cascade, data corruption, auth failure) and whether the plan accounts for it.
-* Rollback posture. If this ships and immediately breaks, what's the rollback procedure? Git revert? Feature flag? DB migration rollback? How long?
-
-**EXPANSION mode additions:**
-* What would make this architecture beautiful? Not just correct — elegant. Is there a design that would make a new engineer joining in 6 months say "oh, that's clever and obvious at the same time"?
-* What infrastructure would make this feature a platform that other features can build on?
-
-Required ASCII diagram: full system architecture showing new components and their relationships to existing ones.
-**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
-
-### Section 2: Error & Rescue Map
-This is the section that catches silent failures. It is not optional.
-For every new method, service, or codepath that can fail, fill in this table:
-```
-  METHOD/CODEPATH          | WHAT CAN GO WRONG           | EXCEPTION CLASS
-  -------------------------|-----------------------------|-----------------
-  ExampleService#call      | API timeout                 | Faraday::TimeoutError
-                           | API returns 429             | RateLimitError
-                           | API returns malformed JSON  | JSON::ParserError
-                           | DB connection pool exhausted| ActiveRecord::ConnectionTimeoutError
-                           | Record not found            | ActiveRecord::RecordNotFound
-  -------------------------|-----------------------------|-----------------
-
-  EXCEPTION CLASS              | RESCUED?  | RESCUE ACTION           | USER SEES
-  -----------------------------|-----------|-------------------------|------------------
-  Faraday::TimeoutError        | Y         | Retry 2x, then raise    | "Service temporarily unavailable"
-  RateLimitError               | Y         | Backoff + retry         | Nothing (transparent)
-  JSON::ParserError            | N ← GAP   | —                       | 500 error ← BAD
-  ConnectionTimeoutError       | N ← GAP   | —                       | 500 error ← BAD
-  ActiveRecord::RecordNotFound | Y         | Return nil, log warning | "Not found" message
-```
-Rules for this section:
-* `rescue StandardError` is ALWAYS a smell. Name the specific exceptions.
-* `rescue => e` with only `Rails.logger.error(e.message)` is insufficient. Log the full context: what was being attempted, with what arguments, for what user/request.
-* Every rescued error must either: retry with backoff, degrade gracefully with a user-visible message, or re-raise with added context. "Swallow and continue" is almost never acceptable.
-* For each GAP (unrescued error that should be rescued): specify the rescue action and what the user should see.
-* For LLM/AI service calls specifically: what happens when the response is malformed? When it's empty? When it hallucinates invalid JSON? When the model returns a refusal? Each of these is a distinct failure mode.
-**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
-
-### Section 3: Security & Threat Model
-Security is not a sub-bullet of architecture. It gets its own section.
-Evaluate:
-* Attack surface expansion. What new attack vectors does this plan introduce? New endpoints, new params, new file paths, new background jobs?
-* Input validation. For every new user input: is it validated, sanitized, and rejected loudly on failure? What happens with: nil, empty string, string when integer expected, string exceeding max length, unicode edge cases, HTML/script injection attempts?
-* Authorization. For every new data access: is it scoped to the right user/role? Is there a direct object reference vulnerability? Can user A access user B's data by manipulating IDs?
-* Secrets and credentials. New secrets? In env vars, not hardcoded? Rotatable?
-* Dependency risk. New gems/npm packages? Security track record?
-* Data classification. PII, payment data, credentials? Handling consistent with existing patterns?
-* Injection vectors. SQL, command, template, LLM prompt injection — check all.
-* Audit logging. For sensitive operations: is there an audit trail?
-
-For each finding: threat, likelihood (High/Med/Low), impact (High/Med/Low), and whether the plan mitigates it.
-**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
-
-### Section 4: Data Flow & Interaction Edge Cases
-This section traces data through the system and interactions through the UI with adversarial thoroughness.
-
-**Data Flow Tracing:** For every new data flow, produce an ASCII diagram showing:
-```
-  INPUT ──▶ VALIDATION ──▶ TRANSFORM ──▶ PERSIST ──▶ OUTPUT
-    │            │             │            │           │
-    ▼            ▼             ▼            ▼           ▼
-  [nil?]     [invalid?]    [exception?] [conflict?]  [stale?]
-  [empty?]   [too long?]   [timeout?]   [dup key?]   [partial?]
-  [wrong     [wrong type?] [OOM?]       [locked?]    [encoding?]
-   type?]
-```
-For each node: what happens on each shadow path? Is it tested?
-
-**Interaction Edge Cases:** For every new user-visible interaction, evaluate:
-```
-  INTERACTION          | EDGE CASE              | HANDLED? | HOW?
-  ---------------------|------------------------|----------|--------
-  Form submission      | Double-click submit    | ?        |
-                       | Submit with stale CSRF | ?        |
-                       | Submit during deploy   | ?        |
-  Async operation      | User navigates away    | ?        |
-                       | Operation times out    | ?        |
-                       | Retry while in-flight  | ?        |
-  List/table view      | Zero results           | ?        |
-                       | 10,000 results         | ?        |
-                       | Results change mid-page| ?        |
-  Background job       | Job fails after 3 of   | ?        |
-                       | 10 items processed     |          |
-                       | Job runs twice (dup)   | ?        |
-                       | Queue backs up 2 hours | ?        |
-```
-Flag any unhandled edge case as a gap. For each gap, specify the fix.
-**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
-
-### Section 5: Code Quality Review
-Evaluate:
-* Code organization and module structure. Does new code fit existing patterns? If it deviates, is there a reason?
-* DRY violations. Be aggressive. If the same logic exists elsewhere, flag it and reference the file and line.
-* Naming quality. Are new classes, methods, and variables named for what they do, not how they do it?
-* Error handling patterns. (Cross-reference with Section 2 — this section reviews the patterns; Section 2 maps the specifics.)
-* Missing edge cases. List explicitly: "What happens when X is nil?" "When the API returns 429?" etc.
-* Over-engineering check. Any new abstraction solving a problem that doesn't exist yet?
-* Under-engineering check. Anything fragile, assuming happy path only, or missing obvious defensive checks?
-* Cyclomatic complexity. Flag any new method that branches more than 5 times. Propose a refactor.
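The error-handling bar set by the rescue map and the code quality review can be made concrete with a small sketch. This is an illustrative pattern only -- the service, exception, and message names are hypothetical, not part of the plan under review. It shows the three acceptable outcomes for a rescued error: bounded retry with backoff, graceful degradation with a user-visible message, and never a bare `rescue StandardError`:

```ruby
require "json"

# Hypothetical named exceptions -- one per distinct failure mode,
# never a blanket rescue of StandardError.
class RateLimitError < StandardError; end
class ScheduleFetchError < StandardError; end

class ScheduleFetcher
  MAX_RETRIES = 2

  def call(raw_response)
    attempts = 0
    begin
      parse(raw_response)
    rescue RateLimitError
      attempts += 1
      # Bounded retry with exponential backoff, then re-raise with context.
      raise ScheduleFetchError, "rate limited after #{attempts} attempts" if attempts > MAX_RETRIES
      sleep(2**attempts)
      retry
    rescue JSON::ParserError => e
      # Degrade gracefully: log what was attempted and with what input,
      # and return a user-visible message instead of a bare 500.
      warn "schedule parse failed: #{e.message} (payload bytes: #{raw_response.to_s.bytesize})"
      { error: "Schedule data temporarily unavailable" }
    end
  end

  private

  def parse(raw)
    raise RateLimitError if raw == "429" # stand-in for an HTTP 429 response
    JSON.parse(raw)
  end
end
```

Each rescue names a specific exception, and each failure mode has a defined user-visible outcome -- exactly what the rescue-map table demands.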
-**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
-
-### Section 6: Test Review
-Make a complete diagram of every new thing this plan introduces:
-```
-  NEW UX FLOWS:
-  [list each new user-visible interaction]
-
-  NEW DATA FLOWS:
-  [list each new path data takes through the system]
+Follow the same 10-section structure as the standard mega plan review:
 
-  NEW CODEPATHS:
-  [list each new branch, condition, or execution path]
+1. **Architecture Review** -- with emphasis on temporal knowledge graph integrity, cross-skill data flows, and calculation engine correctness
+2. **Error & Rescue Map** -- with emphasis on malformed schedule data, missing EVM inputs, and AI hallucination in recommendations
+3. **Security & Threat Model** -- with emphasis on sensitive project financial data, client confidentiality, and LLM output trust boundaries
+4. **Data Flow & Interaction Edge Cases** -- with emphasis on schedule parsing edge cases, EVM boundary conditions, and Monte Carlo parameter ranges
+5. **Code Quality Review** -- with emphasis on DRY across skills, formula correctness, and industry standard compliance
+6. **Test Review** -- with emphasis on calculation verification tests, cross-skill consistency tests, and snapshot round-trip tests
+7. **Performance Review** -- with emphasis on Monte Carlo simulation performance, large schedule file parsing, and portfolio-level analysis scalability
+8. **Observability & Debuggability Review** -- with emphasis on trend tracking (`.cybereum/` snapshots), alert thresholds, and forecast accuracy monitoring
+9. **Deployment & Rollout Review** -- with emphasis on skill deployment to `~/.claude/skills/gstack/`, backward compatibility of snapshot schemas
+10. **Long-Term Trajectory Review** -- with emphasis on platform extensibility, new sector support, and competitive positioning against P6/EcoSys/InEight
 
-  NEW BACKGROUND JOBS / ASYNC WORK:
-  [list each]
+For each section: **STOP.** AskUserQuestion once per issue. Recommend + WHY. Do NOT proceed until user responds.
 
-  NEW INTEGRATIONS / EXTERNAL CALLS:
-  [list each]
-
-  NEW ERROR/RESCUE PATHS:
-  [list each — cross-reference Section 2]
-```
-For each item in the diagram:
-* What type of test covers it? (Unit / Integration / System / E2E)
-* Does a test for it exist in the plan? If not, write the test spec header.
-* What is the happy path test?
-* What is the failure path test? (Be specific — which failure?)
-* What is the edge case test? (nil, empty, boundary values, concurrent access)
-
-Test ambition check (all modes): For each new feature, answer:
-* What's the test that would make you confident shipping at 2am on a Friday?
-* What's the test a hostile QA engineer would write to break this?
-* What's the chaos test?
-
-Test pyramid check: Many unit, fewer integration, few E2E? Or inverted?
-Flakiness risk: Flag any test depending on time, randomness, external services, or ordering.
-Load/stress test requirements: For any new codepath called frequently or processing significant data.
-
-For LLM/prompt changes: Check CLAUDE.md for the "Prompt/LLM changes" file patterns. If this plan touches ANY of those patterns, state which eval suites must be run, which cases should be added, and what baselines to compare against.
-**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
-
-### Section 7: Performance Review
-Evaluate:
-* N+1 queries. For every new ActiveRecord association traversal: is there an includes/preload?
-* Memory usage. For every new data structure: what's the maximum size in production?
-* Database indexes. For every new query: is there an index?
-* Caching opportunities. For every expensive computation or external call: should it be cached?
-* Background job sizing. For every new job: worst-case payload, runtime, retry behavior?
-* Slow paths. Top 3 slowest new codepaths and estimated p99 latency.
-* Connection pool pressure. New DB connections, Redis connections, HTTP connections?
-**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
-
-### Section 8: Observability & Debuggability Review
-New systems break. This section ensures you can see why.
-Evaluate:
-* Logging. For every new codepath: structured log lines at entry, exit, and each significant branch?
-* Metrics. For every new feature: what metric tells you it's working? What tells you it's broken?
-* Tracing. For new cross-service or cross-job flows: trace IDs propagated?
-* Alerting. What new alerts should exist?
-* Dashboards. What new dashboard panels do you want on day 1?
-* Debuggability. If a bug is reported 3 weeks post-ship, can you reconstruct what happened from logs alone?
-* Admin tooling. New operational tasks that need admin UI or rake tasks?
-* Runbooks. For each new failure mode: what's the operational response?
-
-**EXPANSION mode addition:**
-* What observability would make this feature a joy to operate?
-**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
-
-### Section 9: Deployment & Rollout Review
-Evaluate:
-* Migration safety. For every new DB migration: backward-compatible? Zero-downtime? Table locks?
-* Feature flags. Should any part be behind a feature flag?
-* Rollout order. Correct sequence: migrate first, deploy second?
-* Rollback plan. Explicit step-by-step.
-* Deploy-time risk window. Old code and new code running simultaneously — what breaks?
-* Environment parity. Tested in staging?
-* Post-deploy verification checklist. First 5 minutes? First hour?
-* Smoke tests. What automated checks should run immediately post-deploy?
-
-**EXPANSION mode addition:**
-* What deploy infrastructure would make shipping this feature routine?
-**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
-
-### Section 10: Long-Term Trajectory Review
-Evaluate:
-* Technical debt introduced. Code debt, operational debt, testing debt, documentation debt.
-* Path dependency. Does this make future changes harder?
-* Knowledge concentration. Documentation sufficient for a new engineer?
-* Reversibility. Rate 1-5: 1 = one-way door, 5 = easily reversible.
-* Ecosystem fit. Aligns with Rails/JS ecosystem direction?
-* The 1-year question. Read this plan as a new engineer in 12 months — obvious?
-
-**EXPANSION mode additions:**
-* What comes after this ships? Phase 2? Phase 3? Does the architecture support that trajectory?
-* Platform potential. Does this create capabilities other features can leverage?
-**STOP.** AskUserQuestion once per issue. Do NOT batch. Recommend + WHY. If no issues or fix is obvious, state what you'll do and move on — don't waste a question. Do NOT proceed until user responds.
-
-## CRITICAL RULE — How to ask questions
-Every AskUserQuestion MUST: (1) present 2-3 concrete lettered options, (2) state which option you recommend FIRST, (3) explain in 1-2 sentences WHY that option over the others, mapping to engineering preferences. No batching multiple issues into one question. No yes/no questions. Open-ended questions are allowed ONLY when you have genuine ambiguity about developer intent, architecture direction, 12-month goals, or what the end user wants — and you must explain what specifically is ambiguous.
- -## For Each Issue You Find -* **One issue = one AskUserQuestion call.** Never combine multiple issues into one question. -* Describe the problem concretely, with file and line references. -* Present 2-3 options, including "do nothing" where reasonable. -* For each option: effort, risk, and maintenance burden in one line. -* **Lead with your recommendation.** State it as a directive: "Do B. Here's why:" — not "Option B might be worth considering." Be opinionated. I'm paying for your judgment, not a menu. -* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference. -* **AskUserQuestion format:** Start with "We recommend [LETTER]: [one-line reason]" then list all options as `A) ... B) ... C) ...`. Label with issue NUMBER + option LETTER (e.g., "3A", "3B"). -* **Escape hatch:** If a section has no issues, say so and move on. If an issue has an obvious fix with no real alternatives, state what you'll do and move on — don't waste a question on it. Only use AskUserQuestion when there is a genuine decision with meaningful tradeoffs. +## CRITICAL RULE -- How to ask questions +Every AskUserQuestion MUST: (1) present 2-3 concrete lettered options, (2) state which option you recommend FIRST, (3) explain in 1-2 sentences WHY. No batching multiple issues into one question. ## Required Outputs diff --git a/plan-eng-review/SKILL.md b/plan-eng-review/SKILL.md index d8a052a..6830310 100644 --- a/plan-eng-review/SKILL.md +++ b/plan-eng-review/SKILL.md @@ -2,13 +2,14 @@ name: plan-eng-review version: 1.0.0 description: | - Eng manager-mode plan review. Lock in the execution plan — architecture, - data flow, diagrams, edge cases, test coverage, performance. Walks through - issues interactively with opinionated recommendations. + Eng manager-mode plan review for Cybereum. 
Lock in the execution plan -- + architecture, data flow, cross-skill consistency, calculation correctness, + edge cases, test coverage, performance. Walks through issues interactively. allowed-tools: - Read - Grep - Glob - AskUserQuestion - Bash --- @@ -24,25 +25,52 @@ _UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/sk If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. -# Plan Review Mode +# Plan Review Mode -- Cybereum Review this plan thoroughly before making any code changes. For every issue or recommendation, explain the concrete tradeoffs, give me an opinionated recommendation, and ask for my input before assuming a direction. +## Cybereum Architecture Context + +**Platform:** AI-powered capital project governance with temporal knowledge graph + Dyeus AI Engine + +**8 Analytical Skills (domain logic):** +- Schedule Intelligence (P6/XER parsing, DCMA 14-Point, critical path) +- Decision-AI (Schwerpunkt analysis, corrective actions, critic reasoning) +- Risk Engine (risk register, P x I scoring, mitigation strategies) +- EVM Control (CPI/SPI/EAC/TCPI analytics, ANSI/EIA-748 compliance) +- Completion Prediction (Monte Carlo, P50/P80 forecasting, S-curves) +- Reference Class Forecasting (Flyvbjerg RCF, optimism bias correction) +- Executive Reporting (board/PMO/lender reports, audience calibration) +- Sales Intelligence (prospect research, competitive positioning) + +**6 Workflow Skills (development process):** +- ship, review, qa, retro, plan-ceo-review, plan-eng-review + +**Key data flows:** +- Schedule files (XER/XML/CSV) -> Schedule Intelligence -> Completion Prediction +- EVM inputs (BAC/BCWP/ACWP) -> EVM Control -> Executive Reporting +- Risk register -> Risk
Engine -> Decision-AI (Schwerpunkt) +- All skills -> Executive Reporting (cross-skill integration) +- All skills -> `.cybereum/` snapshot persistence for trend tracking + +**Tech stack:** TypeScript/Bun, Playwright (browse CLI), Claude Code skills + ## Priority hierarchy If you are running low on context or the user asks you to compress: Step 0 > Test diagram > Opinionated recommendations > Everything else. Never skip Step 0 or the test diagram. ## My engineering preferences (use these to guide your recommendations): -* DRY is important—flag repetition aggressively. -* Well-tested code is non-negotiable; I'd rather have too many tests than too few. -* I want code that's "engineered enough" — not under-engineered (fragile, hacky) and not over-engineered (premature abstraction, unnecessary complexity). +* DRY is important -- flag repetition aggressively, especially formulas or thresholds duplicated across skills. +* Calculation correctness is non-negotiable. Every EVM, risk, schedule, and RCF formula must be verifiable against its cited standard. +* Well-tested code is non-negotiable; calculation-heavy code needs boundary value tests. +* I want code that's "engineered enough" -- not under-engineered (fragile, hacky) and not over-engineered (premature abstraction, unnecessary complexity). * I err on the side of handling more edge cases, not fewer; thoughtfulness > speed. -* Bias toward explicit over clever. +* Bias toward explicit over clever. Capital project professionals must be able to audit every calculation. * Minimal diff: achieve the goal with the fewest new abstractions and files touched. +* Cross-skill consistency: the same concept must be defined identically everywhere. ## Documentation and diagrams: -* I value ASCII art diagrams highly — for data flow, state machines, dependency graphs, processing pipelines, and decision trees. Use them liberally in plans and design docs. 
-* For particularly complex designs or behaviors, embed ASCII diagrams directly in code comments in the appropriate places: Models (data relationships, state transitions), Controllers (request flow), Concerns (mixin behavior), Services (processing pipelines), and Tests (what's being set up and why) when the test structure is non-obvious. -* **Diagram maintenance is part of the change.** When modifying code that has ASCII diagrams in comments nearby, review whether those diagrams are still accurate. Update them as part of the same commit. Stale diagrams are worse than no diagrams — they actively mislead. Flag any stale diagrams you encounter during review even if they're outside the immediate scope of the change. +* I value ASCII art diagrams highly -- for data flow, state machines, dependency graphs, processing pipelines, and decision trees. +* Diagram maintenance is part of the change. ## BEFORE YOU START: @@ -54,122 +82,75 @@ Before reviewing anything, answer these questions: 4. **TODOS cross-reference:** Read `TODOS.md` if it exists. Are any deferred items blocking this plan? Can any deferred items be bundled into this PR without expanding scope? Does this plan create new work that should be captured as a TODO? Then ask if I want one of three options: -1. **SCOPE REDUCTION:** The plan is overbuilt. Propose a minimal version that achieves the core goal, then review that. -2. **BIG CHANGE:** Work through interactively, one section at a time (Architecture → Code Quality → Tests → Performance) with at most 8 top issues per section. -3. **SMALL CHANGE:** Compressed review — Step 0 + one combined pass covering all 4 sections. For each section, pick the single most important issue (think hard — this forces you to prioritize). Present as a single numbered list with lettered options + mandatory test diagram + completion summary. One AskUserQuestion round at the end. For each issue in the batch, state your recommendation and explain WHY, with lettered options. +1. 
**SCOPE REDUCTION:** The plan is overbuilt. Propose a minimal version. +2. **BIG CHANGE:** Work through interactively, one section at a time. +3. **SMALL CHANGE:** Compressed review -- Step 0 + one combined pass. -**Critical: If I do not select SCOPE REDUCTION, respect that decision fully.** Your job becomes making the plan I chose succeed, not continuing to lobby for a smaller plan. Raise scope concerns once in Step 0 — after that, commit to my chosen scope and optimize within it. Do not silently reduce scope, skip planned components, or re-argue for less work during later review sections. +**Critical: If I do not select SCOPE REDUCTION, respect that decision fully.** ## Review Sections (after scope is agreed) ### 1. Architecture review Evaluate: -* Overall system design and component boundaries. -* Dependency graph and coupling concerns. -* Data flow patterns and potential bottlenecks. -* Scaling characteristics and single points of failure. -* Security architecture (auth, data access, API boundaries). -* Whether key flows deserve ASCII diagrams in the plan or in code comments. -* For each new codepath or integration point, describe one realistic production failure scenario and whether the plan accounts for it. +* Overall system design and component boundaries +* Cross-skill data flow and dependency graph +* Temporal knowledge graph integrity -- do mutations preserve causal chains? +* Calculation engine correctness -- do formulas match cited standards? +* Schedule parsing robustness -- XER encoding, XML entity safety, CSV edge cases +* Snapshot schema compatibility -- will existing `.cybereum/` data still load? +* Security architecture (client data confidentiality, LLM trust boundaries) +* For each new codepath: describe one realistic production failure scenario -**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. 
Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +**STOP.** AskUserQuestion for each issue. ### 2. Code quality review Evaluate: -* Code organization and module structure. -* DRY violations—be aggressive here. -* Error handling patterns and missing edge cases (call these out explicitly). -* Technical debt hotspots. -* Areas that are over-engineered or under-engineered relative to my preferences. -* Existing ASCII diagrams in touched files — are they still accurate after this change? +* Code organization across skills -- does new code fit existing patterns? +* DRY violations across skills -- same formula, threshold, or methodology defined in multiple places +* Cross-skill terminology consistency -- same concept, same name everywhere +* Error handling patterns and missing edge cases +* Industry standard compliance -- are cited standards (DCMA, AACE, Flyvbjerg) correctly applied? +* Existing ASCII diagrams in touched files -- still accurate? -**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +**STOP.** AskUserQuestion for each issue. ### 3. Test review -Make a diagram of all new UX, new data flow, new codepaths, and new branching if statements or outcomes. For each, note what is new about the features discussed in this branch and plan. Then, for each new item in the diagram, make sure there is a JS or Rails test. +Make a diagram of all new calculations, data flows, codepaths, and branching. For each: +* What type of test covers it? +* What is the happy path test? (Known inputs -> expected outputs) +* What is the boundary value test? (CPI=0, BAC=0, zero-duration activity, 100% float) +* What is the malformed input test? 
(Corrupt XER, missing fields, wrong encoding) +* What is the cross-skill consistency test? (Same metric, two skills, same result) -For LLM/prompt changes: check the "Prompt/LLM changes" file patterns listed in CLAUDE.md. If this plan touches ANY of those patterns, state which eval suites must be run, which cases should be added, and what baselines to compare against. Then use AskUserQuestion to confirm the eval scope with the user. +Test ambition check: +* What's the test that would make you confident shipping at 2am on a Friday? +* What's the test a hostile QA engineer would write to break the calculation engine? -**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. +**STOP.** AskUserQuestion for each issue. ### 4. Performance review Evaluate: -* N+1 queries and database access patterns. -* Memory-usage concerns. -* Caching opportunities. -* Slow or high-complexity code paths. - -**STOP.** For each issue found in this section, call AskUserQuestion individually. One issue per call. Present options, state your recommendation, explain WHY. Do NOT batch multiple issues into one AskUserQuestion. Only proceed to the next section after ALL issues in this section are resolved. - -## CRITICAL RULE — How to ask questions -Every AskUserQuestion MUST: (1) present 2-3 concrete lettered options, (2) state which option you recommend FIRST, (3) explain in 1-2 sentences WHY that option over the others, mapping to engineering preferences. No batching multiple issues into one question. No yes/no questions. Open-ended questions are allowed ONLY when you have genuine ambiguity about developer intent, architecture direction, 12-month goals, or what the end user wants — and you must explain what specifically is ambiguous. 
**Exception:** SMALL CHANGE mode intentionally batches one issue per section into a single AskUserQuestion at the end — but each issue in that batch still requires its own recommendation + WHY + lettered options. - -## For each issue you find -For every specific issue (bug, smell, design concern, or risk): -* **One issue = one AskUserQuestion call.** Never combine multiple issues into one question. -* Describe the problem concretely, with file and line references. -* Present 2–3 options, including "do nothing" where that's reasonable. -* For each option, specify in one line: effort, risk, and maintenance burden. -* **Lead with your recommendation.** State it as a directive: "Do B. Here's why:" — not "Option B might be worth considering." Be opinionated. I'm paying for your judgment, not a menu. -* **Map the reasoning to my engineering preferences above.** One sentence connecting your recommendation to a specific preference (DRY, explicit > clever, minimal diff, etc.). -* **AskUserQuestion format:** Start with "We recommend [LETTER]: [one-line reason]" then list all options as `A) ... B) ... C) ...`. Label with issue NUMBER + option LETTER (e.g., "3A", "3B"). -* **Escape hatch:** If a section has no issues, say so and move on. If an issue has an obvious fix with no real alternatives, state what you'll do and move on — don't waste a question on it. Only use AskUserQuestion when there is a genuine decision with meaningful tradeoffs. - -## Required outputs - -### "NOT in scope" section -Every plan review MUST produce a "NOT in scope" section listing work that was considered and explicitly deferred, with a one-line rationale for each item. +* Monte Carlo simulation performance (10,000 iterations -- acceptable latency?) 
+* Large schedule file parsing (10,000+ activity XER files) +* Portfolio-level analysis scalability (10+ projects simultaneously) +* Snapshot file I/O -- `.cybereum/` directory with many JSON files +* Risk register generation with 15+ external risks per category -### "What already exists" section -List existing code/flows that already partially solve sub-problems in this plan, and whether the plan reuses them or unnecessarily rebuilds them. +**STOP.** AskUserQuestion for each issue. ### TODOS.md updates After all review sections are complete, present each potential TODO as its own individual AskUserQuestion. Never batch TODOs — one per question. Never silently skip this step. Follow the format in `.claude/skills/review/TODOS-format.md`. -For each TODO, describe: -* **What:** One-line description of the work. -* **Why:** The concrete problem it solves or value it unlocks. -* **Pros:** What you gain by doing this work. -* **Cons:** Cost, complexity, or risks of doing it. -* **Context:** Enough detail that someone picking this up in 3 months understands the motivation, the current state, and where to start. -* **Depends on / blocked by:** Any prerequisites or ordering constraints. - -Then present options: **A)** Add to TODOS.md **B)** Skip — not valuable enough **C)** Build it now in this PR instead of deferring. - -Do NOT just append vague bullet points. A TODO without context is worse than no TODO — it creates false confidence that the idea was captured while actually losing the reasoning. - -### Diagrams -The plan itself should use ASCII diagrams for any non-trivial data flow, state machine, or processing pipeline. Additionally, identify which files in the implementation should get inline ASCII diagram comments — particularly Models with complex state transitions, Services with multi-step pipelines, and Concerns with non-obvious mixin behavior. 
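The Monte Carlo latency question in the performance list above can be sized with a toy benchmark. A minimal TypeScript sketch, assuming independent triangular activity-duration distributions (an assumption for illustration; the production engine's distributions are not specified here) -- 10,000 iterations over a few hundred activities runs in well under a second on typical hardware:

```typescript
// Hypothetical Monte Carlo completion forecast -- a sketch, not Cybereum's engine.
// Samples total remaining duration and reports P20/P50/P80 percentiles.

type Triangular = { min: number; mode: number; max: number };

// Inverse-CDF sampling for a triangular distribution.
function sampleTriangular({ min, mode, max }: Triangular): number {
  const u = Math.random();
  const c = (mode - min) / (max - min);
  return u < c
    ? min + Math.sqrt(u * (max - min) * (mode - min))
    : max - Math.sqrt((1 - u) * (max - min) * (max - mode));
}

function forecast(activities: Triangular[], iterations = 10_000) {
  const totals = Array.from({ length: iterations }, () =>
    activities.reduce((sum, a) => sum + sampleTriangular(a), 0),
  ).sort((a, b) => a - b);
  const pct = (p: number) => totals[Math.floor(p * (iterations - 1))];
  return { p20: pct(0.2), p50: pct(0.5), p80: pct(0.8) };
}

// Example: 200 remaining activities, each 8-12 days, most likely 9.
const acts = Array.from({ length: 200 }, () => ({ min: 8, mode: 9, max: 12 }));
const f = forecast(acts);
// Percentiles must be ordered: P20 <= P50 <= P80 (inverted intervals are a bug).
```

Timing this sketch (for example with `console.time`) gives a floor for what "acceptable latency" means before adding correlation modeling or schedule logic.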
- -### Failure modes -For each new codepath identified in the test review diagram, list one realistic way it could fail in production (timeout, nil reference, race condition, stale data, etc.) and whether: -1. A test covers that failure -2. Error handling exists for it -3. The user would see a clear error or a silent failure - -If any failure mode has no test AND no error handling AND would be silent, flag it as a **critical gap**. - -### Completion summary -At the end of the review, fill in and display this summary so the user can see all findings at a glance: -- Step 0: Scope Challenge (user chose: ___) -- Architecture Review: ___ issues found -- Code Quality Review: ___ issues found -- Test Review: diagram produced, ___ gaps identified -- Performance Review: ___ issues found -- NOT in scope: written -- What already exists: written -- TODOS.md updates: ___ items proposed to user -- Failure modes: ___ critical gaps flagged - -## Retrospective learning -Check the git log for this branch. If there are prior commits suggesting a previous review cycle (e.g., review-driven refactors, reverted changes), note what was changed and whether the current plan touches the same areas. Be more aggressive reviewing areas that were previously problematic. +## Required outputs +* "NOT in scope" section +* "What already exists" section (map to existing skills) +* TODOS.md updates (one per AskUserQuestion) +* Diagrams (data flow, cross-skill integration, calculation pipeline) +* Failure modes (for each new calculation, one realistic way it could produce wrong results) +* Completion summary ## Formatting rules * NUMBER issues (1, 2, 3...) and give LETTERS for options (A, B, C...). -* When using AskUserQuestion, label each option with issue NUMBER and option LETTER so I don't get confused. * Recommended option is always listed first. -* Keep each option to one sentence max. I should be able to pick in under 5 seconds. +* Keep each option to one sentence max. 
* After each review section, pause and ask for feedback before moving on. - -## Unresolved decisions -If the user does not respond to an AskUserQuestion or interrupts to move on, note which decisions were left unresolved. At the end of the review, list these as "Unresolved decisions that may bite you later" — never silently default to an option. diff --git a/qa/SKILL.md b/qa/SKILL.md index dd4b888..a079bd7 100644 --- a/qa/SKILL.md +++ b/qa/SKILL.md @@ -25,9 +25,9 @@ _UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/sk If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. -# /qa: Systematic QA Testing +# /qa: Systematic QA Testing for Cybereum -You are a QA engineer. Test web applications like a real user — click everything, fill every form, check every state. Produce a structured report with evidence. +You are a QA engineer for the Cybereum capital project governance platform. Test every analytical skill systematically -- validate calculations, check cross-skill consistency, verify data flows, and ensure output quality. Produce a structured report with evidence. ## Setup @@ -67,8 +67,8 @@ If `NEEDS_SETUP`: **Create output directories:** ```bash -REPORT_DIR=".gstack/qa-reports" -mkdir -p "$REPORT_DIR/screenshots" +REPORT_DIR=".cybereum/qa-reports" +mkdir -p "$REPORT_DIR" ``` --- @@ -123,243 +123,219 @@ This is the **primary mode** for developers verifying their work. When the user Systematic exploration. Visit every reachable page. Document 5-10 well-evidenced issues. Produce health score. Takes 5-15 minutes depending on app size. ### Quick (`--quick`) -30-second smoke test. Visit homepage + top 5 navigation targets. Check: page loads? Console errors? Broken links? 
Produce health score. No detailed issue documentation. +Smoke test: verify core EVM formulas, risk scoring math, schedule health scoring, and completion prediction multipliers. 2-minute check. ### Regression (`--regression `) -Run full mode, then load `baseline.json` from a previous run. Diff: which issues are fixed? Which are new? What's the score delta? Append regression section to report. +Run full mode, then load `baseline.json` from a previous run. Diff: which issues are fixed? Which are new? What's the score delta? --- ## Workflow -### Phase 1: Initialize +### Phase 1: Skill Inventory -1. Find browse binary (see Setup above) -2. Create output directories -3. Copy report template from `qa/templates/qa-report-template.md` to output dir -4. Start timer for duration tracking - -### Phase 2: Authenticate (if needed) - -**If the user specified auth credentials:** +Read all 8 Cybereum skill SKILL.md files and verify they exist: ```bash -$B goto -$B snapshot -i # find the login form -$B fill @e3 "user@example.com" -$B fill @e4 "[REDACTED]" # NEVER include real passwords in report -$B click @e5 # submit -$B snapshot -D # verify login succeeded +for skill in cybereum-schedule-intelligence cybereum-decision-ai cybereum-risk-engine cybereum-evm-control cybereum-completion-prediction cybereum-reference-class cybereum-executive-reporting cybereum-sales-intelligence; do + if [ -f "$skill/SKILL.md" ]; then + echo "OK: $skill" + else + echo "MISSING: $skill" + fi +done ``` -**If the user provided a cookie file:** +### Phase 2: Calculation Verification -```bash -$B cookie-import cookies.json -$B goto +Test mathematical correctness of each analytical skill: + +#### EVM Control Calculations +Verify with known inputs: +``` +Given: BAC=$100M, BCWS=$55M, BCWP=$50M, ACWP=$60M +Expected: + CV = $50M - $60M = -$10M + SV = $50M - $55M = -$5M + CPI = $50M / $60M = 0.833 + SPI = $50M / $55M = 0.909 + EAC (Method 1) = $100M / 0.833 = $120.0M + EAC (Method 3) = $60M + ($100M - $50M) / 
(0.833 * 0.909) = $126.0M + VAC = $100M - $120.0M = -$20.0M + TCPI = ($100M - $50M) / ($100M - $60M) = 1.25 ``` -**If 2FA/OTP is required:** Ask the user for the code and wait. +Check: Do the formulas in the SKILL.md produce these results? -**If CAPTCHA blocks you:** Tell the user: "Please complete the CAPTCHA in the browser, then tell me to continue." +#### Risk Scoring +Verify P x I matrix: +``` +Given: Probability=4 (High), Impact=5 (Catastrophic) +Expected: Score = 20 (Priority risk, requires active mitigation) +``` -### Phase 3: Orient +Check: Is score >= 12 correctly identified as requiring mitigation? -Get a map of the application: +#### Schedule Health Scoring +Verify DCMA 14-Point thresholds: +``` +Check 1 (Logic): <5% open ends = pass. Skill says "Critical if >10%" +Check 7 (Negative float): 0% = pass. Skill says "Critical -- always flag" +``` -```bash -$B goto -$B snapshot -i -a -o "$REPORT_DIR/screenshots/initial.png" -$B links # map navigation structure -$B console --errors # any errors on landing? +Check: Are all 14 thresholds correctly stated? + +#### Completion Prediction Multipliers +Verify parametric table consistency: +``` +For remaining < 3 months, Low uncertainty: + P20 multiplier (0.97) < P50 multiplier (1.05) < P80 multiplier (1.12) ``` -**Detect framework** (note in report metadata): -- `__next` in HTML or `_next/data` requests → Next.js -- `csrf-token` meta tag → Rails -- `wp-content` in URLs → WordPress -- Client-side routing with no page reloads → SPA +Check: Do all rows maintain P20 < P50 < P80? (If not, the confidence intervals are inverted.) -**For SPAs:** The `links` command may return few results because navigation is client-side. Use `snapshot -i` to find nav elements (buttons, menu items) instead. 
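The calculation checks above (EVM metrics, P x I scoring) can be verified mechanically. A minimal TypeScript sketch, runnable with Bun or Node -- function and type names are illustrative, not Cybereum APIs:

```typescript
// Verification harness for the worked example above -- names are illustrative
// sketches, not Cybereum APIs. Inputs: BAC=$100M, BCWS=$55M, BCWP=$50M, ACWP=$60M.

interface EvmInputs { bac: number; bcws: number; bcwp: number; acwp: number }

function evmMetrics({ bac, bcws, bcwp, acwp }: EvmInputs) {
  const cpi = bcwp / acwp;                    // cost performance index
  const spi = bcwp / bcws;                    // schedule performance index
  return {
    cv: bcwp - acwp,                          // cost variance
    sv: bcwp - bcws,                          // schedule variance
    cpi,
    spi,
    eac1: bac / cpi,                          // Method 1: cost efficiency continues
    eac3: acwp + (bac - bcwp) / (cpi * spi),  // Method 3: cost + schedule pressure
    tcpi: (bac - bcwp) / (bac - acwp),        // to-complete performance index
  };
}

// P x I risk scoring: score >= 12 requires active mitigation.
const riskScore = (p: number, i: number) => p * i;

const m = evmMetrics({ bac: 100, bcws: 55, bcwp: 50, acwp: 60 });
console.log(m.cpi.toFixed(3));  // "0.833"
console.log(m.eac1.toFixed(1)); // "120.0"
console.log(m.eac3.toFixed(1)); // "126.0"
console.log(riskScore(4, 5));   // 20
```

Running a harness like this against each skill's documented formulas catches inverted or drifted definitions before they ship.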
+#### Reference Class Benchmarks +Verify internal consistency: +``` +For each project type: + Mean overrun <= P80 overrun (P80 is more conservative than mean) + Median <= Mean (right-skewed distributions have mean > median) +``` -### Phase 4: Explore +Check: Do all benchmark rows satisfy these constraints? -Visit pages systematically. At each page: +### Phase 3: Cross-Skill Consistency -```bash -$B goto -$B snapshot -i -a -o "$REPORT_DIR/screenshots/page-name.png" -$B console --errors -``` +Verify that shared concepts are defined consistently: -Then follow the **per-page exploration checklist** (see `qa/references/issue-taxonomy.md`): +1. **Terminology check**: Grep all SKILL.md files for key terms and verify consistent usage: + - "P50" / "P80" -- same meaning everywhere? + - "CPI" / "SPI" -- same formulas everywhere? + - "Critical" risk threshold -- same score threshold everywhere? + - Health score ranges -- same tier boundaries? -1. **Visual scan** — Look at the annotated screenshot for layout issues -2. **Interactive elements** — Click buttons, links, controls. Do they work? -3. **Forms** — Fill and submit. Test empty, invalid, edge cases -4. **Navigation** — Check all paths in and out -5. **States** — Empty state, loading, error, overflow -6. **Console** — Any new JS errors after interactions? -7. **Responsiveness** — Check mobile viewport if relevant: - ```bash - $B viewport 375x812 - $B screenshot "$REPORT_DIR/screenshots/page-mobile.png" - $B viewport 1280x720 - ``` +2. **JSON snapshot schema check**: Verify that skills that share data use compatible schemas: + - EVM snapshots referenced by Executive Reporting + - Risk snapshots referenced by Decision-AI + - Schedule snapshots referenced by Completion Prediction -**Depth judgment:** Spend more time on core features (homepage, dashboard, checkout, search) and less on secondary pages (about, terms, privacy). +3. 
**Reference file consistency**: Check that reference file paths mentioned in SKILL.md files point to plausible locations. -**Quick mode:** Only visit homepage + top 5 navigation targets from the Orient phase. Skip the per-page checklist — just check: loads? Console errors? Broken links visible? +### Phase 4: Output Format Compliance -### Phase 5: Document +For each skill, verify its output templates are complete: -Document each issue **immediately when found** — don't batch them. +1. **Schedule Intelligence**: Has Executive Summary, Health Scorecard, Critical Path Summary, Top 10 Risk Activities, Recommended Actions? +2. **Decision-AI**: Has Schwerpunkt identification, Corrective Actions table, Critic analysis, Decision Brief? +3. **Risk Engine**: Has Executive Risk Summary, Risk Register Table, Heatmap, Mitigation Action Plan? +4. **EVM Control**: Has Performance Dashboard with all metrics, Variance Attribution, Trend Analysis? +5. **Completion Prediction**: Has P20/P50/P80 forecast, Scenario Comparison, S-Curve narrative, Confidence Statement? +6. **Reference Class**: Has RCAE calculation, Inside-View Adjustments, Optimism Bias Report, Contingency Assessment? +7. **Executive Reporting**: Has all report type structures, audience calibration rules, quality checklist? +8. **Sales Intelligence**: Has Prospect Research protocol, Outreach templates, Pitch deck structure, Competitive table? -**Two evidence tiers:** +### Phase 5: Document Findings -**Interactive bugs** (broken flows, dead buttons, form failures): -1. Take a screenshot before the action -2. Perform the action -3. Take a screenshot showing the result -4. Use `snapshot -D` to show what changed -5. Write repro steps referencing screenshots +Document each issue immediately when found. 
-```bash -$B screenshot "$REPORT_DIR/screenshots/issue-001-step-1.png" -$B click @e5 -$B screenshot "$REPORT_DIR/screenshots/issue-001-result.png" -$B snapshot -D -``` +**Issue severity:** -**Static bugs** (typos, layout issues, missing images): -1. Take a single annotated screenshot showing the problem -2. Describe what's wrong +| Severity | Definition | Examples | +|----------|------------|----------| +| **critical** | Wrong calculation, incorrect formula, data integrity violation | CPI formula inverted, P80 < P50, risk score != PxI | +| **high** | Missing required output section, broken cross-skill reference | Decision-AI missing Critic step, EVM dashboard missing TCPI | +| **medium** | Inconsistent terminology, threshold drift between skills | "Critical" means score>=12 in one skill, score>=16 in another | +| **low** | Minor formatting, typo in methodology description | Inconsistent header levels, missing reference file | -```bash -$B snapshot -i -a -o "$REPORT_DIR/screenshots/issue-002.png" -``` +### Phase 6: Health Score -**Write each issue to the report immediately** using the template format from `qa/templates/qa-report-template.md`. - -### Phase 6: Wrap Up - -1. **Compute health score** using the rubric below -2. **Write "Top 3 Things to Fix"** — the 3 highest-severity issues -3. **Write console health summary** — aggregate all console errors seen across pages -4. **Update severity counts** in the summary table -5. **Fill in report metadata** — date, duration, pages visited, screenshot count, framework -6. **Save baseline** — write `baseline.json` with: - ```json - { - "date": "YYYY-MM-DD", - "url": "", - "healthScore": N, - "issues": [{ "id": "ISSUE-001", "title": "...", "severity": "...", "category": "..." }], - "categoryScores": { "console": N, "links": N, ... 
} - } - ``` +Compute health score using weighted categories: + +| Category | Weight | Scoring | +|----------|--------|---------| +| Calculation Correctness | 30% | Start 100, -25 per critical, -15 per high | +| Cross-Skill Consistency | 20% | Start 100, -15 per inconsistency | +| Output Completeness | 20% | Start 100, -10 per missing section | +| Methodology Adherence | 15% | Start 100, -15 per deviation from cited standard | +| Data Flow Integrity | 15% | Start 100, -20 per broken reference | -**Regression mode:** After writing the report, load the baseline file. Compare: -- Health score delta -- Issues fixed (in baseline but not current) -- New issues (in current but not baseline) -- Append the regression section to the report +Final score = weighted average. Tiers: +- 85-100: Healthy +- 70-84: Moderate issues +- 50-69: Significant issues +- <50: Critical -- needs immediate attention --- -## Health Score Rubric - -Compute each category score (0-100), then take the weighted average. - -### Console (weight: 15%) -- 0 errors → 100 -- 1-3 errors → 70 -- 4-10 errors → 40 -- 10+ errors → 10 - -### Links (weight: 10%) -- 0 broken → 100 -- Each broken link → -15 (minimum 0) - -### Per-Category Scoring (Visual, Functional, UX, Content, Performance, Accessibility) -Each category starts at 100. Deduct per finding: -- Critical issue → -25 -- High issue → -15 -- Medium issue → -8 -- Low issue → -3 -Minimum 0 per category. 
- -### Weights -| Category | Weight | -|----------|--------| -| Console | 15% | -| Links | 10% | -| Visual | 10% | -| Functional | 20% | -| UX | 15% | -| Performance | 10% | -| Content | 5% | -| Accessibility | 15% | - -### Final Score -`score = Σ (category_score × weight)` +## Output Structure ---- +``` +.cybereum/qa-reports/ +├── qa-report-{YYYY-MM-DD}.md # Structured report +└── baseline.json # For regression mode +``` -## Framework-Specific Guidance +### Report Template -### Next.js -- Check console for hydration errors (`Hydration failed`, `Text content did not match`) -- Monitor `_next/data` requests in network — 404s indicate broken data fetching -- Test client-side navigation (click links, don't just `goto`) — catches routing issues -- Check for CLS (Cumulative Layout Shift) on pages with dynamic content +```markdown +# Cybereum QA Report -### Rails -- Check for N+1 query warnings in console (if development mode) -- Verify CSRF token presence in forms -- Test Turbo/Stimulus integration — do page transitions work smoothly? -- Check for flash messages appearing and dismissing correctly +| Field | Value | +|-------|-------| +| **Date** | {DATE} | +| **Scope** | {SCOPE or "Full platform"} | +| **Mode** | {full / quick / regression} | +| **Skills tested** | {COUNT}/8 | -### WordPress -- Check for plugin conflicts (JS errors from different plugins) -- Verify admin bar visibility for logged-in users -- Test REST API endpoints (`/wp-json/`) -- Check for mixed content warnings (common with WP) +## Health Score: {SCORE}/100 -### General SPA (React, Vue, Angular) -- Use `snapshot -i` for navigation — `links` command misses client-side routes -- Check for stale state (navigate away and back — does data refresh?) -- Test browser back/forward — does the app handle history correctly? 
-- Check for memory leaks (monitor console after extended use) +| Category | Score | +|----------|-------| +| Calculation Correctness | {0-100} | +| Cross-Skill Consistency | {0-100} | +| Output Completeness | {0-100} | +| Methodology Adherence | {0-100} | +| Data Flow Integrity | {0-100} | ---- +## Top 3 Things to Fix -## Important Rules +1. **{ISSUE-NNN}: {title}** -- {one-line description} +2. **{ISSUE-NNN}: {title}** -- {one-line description} +3. **{ISSUE-NNN}: {title}** -- {one-line description} -1. **Repro is everything.** Every issue needs at least one screenshot. No exceptions. -2. **Verify before documenting.** Retry the issue once to confirm it's reproducible, not a fluke. -3. **Never include credentials.** Write `[REDACTED]` for passwords in repro steps. -4. **Write incrementally.** Append each issue to the report as you find it. Don't batch. -5. **Never read source code.** Test as a user, not a developer. -6. **Check console after every interaction.** JS errors that don't surface visually are still bugs. -7. **Test like a user.** Use realistic data. Walk through complete workflows end-to-end. -8. **Depth over breadth.** 5-10 well-documented issues with evidence > 20 vague descriptions. -9. **Never delete output files.** Screenshots and reports accumulate — that's intentional. -10. **Use `snapshot -C` for tricky UIs.** Finds clickable divs that the accessibility tree misses. 
+## Summary ---- +| Severity | Count | +|----------|-------| +| Critical | 0 | +| High | 0 | +| Medium | 0 | +| Low | 0 | +| **Total** | **0** | -## Output Structure +## Issues +### ISSUE-001: {Short title} + +| Field | Value | +|-------|-------| +| **Severity** | critical / high / medium / low | +| **Skill** | {which skill} | +| **Category** | calculation / consistency / completeness / methodology / data-flow | + +**Description:** {What is wrong, expected vs actual.} + +**Evidence:** {Formula, threshold, or output that demonstrates the issue.} + +**Fix:** {Specific correction needed.} ``` -.gstack/qa-reports/ -├── qa-report-{domain}-{YYYY-MM-DD}.md # Structured report -├── screenshots/ -│ ├── initial.png # Landing page annotated screenshot -│ ├── issue-001-step-1.png # Per-issue evidence -│ ├── issue-001-result.png -│ └── ... -└── baseline.json # For regression mode -``` -Report filenames use the domain and date: `qa-report-myapp-com-2026-03-12.md` +--- + +## Important Rules + +1. **Verify calculations with actual numbers.** Don't just read formulas -- plug in values and check. +2. **Cross-reference across skills.** The same metric must mean the same thing everywhere. +3. **Check cited standards.** If a skill says "per DCMA 14-Point" -- verify the threshold matches DCMA. +4. **Document immediately.** Append each issue to the report as you find it. Don't batch. +5. **Save baseline.** Always save a baseline.json for future regression comparison. 
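Rule 1 ("plug in values and check") can itself be scripted. A minimal sketch of the core ANSI/EIA-748 metrics with made-up sample values; the function name and the zero-denominator guard are choices for this sketch, not a prescribed Cybereum API:

```python
def evm_metrics(bcws: float, bcwp: float, acwp: float, bac: float) -> dict:
    """Core earned-value metrics per the standard ANSI/EIA-748 definitions."""
    if acwp == 0 or bcws == 0 or bac == acwp:
        # Guard the denominators (e.g. CPI is undefined when ACWP = 0)
        raise ValueError("denominator would be zero")
    cpi = bcwp / acwp                   # cost performance index
    spi = bcwp / bcws                   # schedule performance index
    eac = bac / cpi                     # estimate at completion (CPI method)
    tcpi = (bac - bcwp) / (bac - acwp)  # to-complete performance index
    return {"CPI": cpi, "SPI": spi, "EAC": eac, "TCPI": tcpi,
            "CV": bcwp - acwp, "SV": bcwp - bcws, "VAC": bac - eac}

# Plug in known values: $800K earned against $1,000K spent on a $2,000K budget
m = evm_metrics(bcws=900, bcwp=800, acwp=1000, bac=2000)
assert m["CPI"] == 0.8           # over cost; an inverted formula would yield 1.25
assert round(m["EAC"]) == 2500   # projected overrun: BAC / CPI
assert m["TCPI"] == 1.2          # remaining work must run at 120% efficiency
```

A wrong sign or an inverted ratio fails immediately against hand-computed numbers, which is exactly the kind of error reading the formula alone tends to miss.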
diff --git a/qa/references/issue-taxonomy.md b/qa/references/issue-taxonomy.md index 05c5741..03806fe 100644 --- a/qa/references/issue-taxonomy.md +++ b/qa/references/issue-taxonomy.md @@ -1,85 +1,68 @@ -# QA Issue Taxonomy +# Cybereum QA Issue Taxonomy ## Severity Levels | Severity | Definition | Examples | |----------|------------|----------| -| **critical** | Blocks a core workflow, causes data loss, or crashes the app | Form submit causes error page, checkout flow broken, data deleted without confirmation | -| **high** | Major feature broken or unusable, no workaround | Search returns wrong results, file upload silently fails, auth redirect loop | -| **medium** | Feature works but with noticeable problems, workaround exists | Slow page load (>5s), form validation missing but submit still works, layout broken on mobile only | -| **low** | Minor cosmetic or polish issue | Typo in footer, 1px alignment issue, hover state inconsistent | +| **critical** | Wrong calculation output, data integrity violation, or skill produces incorrect recommendations | EVM CPI formula inverted, P80 < P50, risk score != PxI, Schwerpunkt without Critic | +| **high** | Major skill section missing or broken, cross-skill inconsistency that affects outputs | EVM dashboard missing TCPI, schedule health score uses wrong thresholds | +| **medium** | Skill works but with inconsistencies or gaps, methodology deviation | Terminology drift between skills, threshold defined differently in two places | +| **low** | Minor formatting, documentation gap, or cosmetic issue | Inconsistent header levels, missing reference file path | ## Categories -### 1. Visual/UI -- Layout breaks (overlapping elements, clipped text, horizontal scrollbar) -- Broken or missing images -- Incorrect z-index (elements appearing behind others) -- Font/color inconsistencies -- Animation glitches (jank, incomplete transitions) -- Alignment issues (off-grid, uneven spacing) -- Dark mode / theme issues +### 1. 
Calculation Correctness +- EVM formula errors (CPI, SPI, EAC, TCPI, VAC, CV, SV) +- Risk scoring errors (P x I matrix, contingency calculations) +- Schedule metric errors (float calculations, CPLI, BEI) +- Completion prediction errors (multiplier tables, Monte Carlo parameters) +- Reference class errors (overrun percentages, RCAE calculations) +- Division by zero in metric calculations +- Unit mismatches (working days vs calendar days, $K vs $M) -### 2. Functional -- Broken links (404, wrong destination) -- Dead buttons (click does nothing) -- Form validation (missing, wrong, bypassed) -- Incorrect redirects -- State not persisting (data lost on refresh, back button) -- Race conditions (double-submit, stale data) -- Search returning wrong or no results +### 2. Cross-Skill Consistency +- Same metric defined differently across skills +- Threshold values that drift (e.g., "critical" score cutoff) +- JSON snapshot schemas that don't align +- Terminology inconsistency (P80 vs 80th percentile) +- Reference file paths that don't match actual locations -### 3. UX -- Confusing navigation (no breadcrumbs, dead ends) -- Missing loading indicators (user doesn't know something is happening) -- Slow interactions (>500ms with no feedback) -- Unclear error messages ("Something went wrong" with no detail) -- No confirmation before destructive actions -- Inconsistent interaction patterns across pages -- Dead ends (no way back, no next action) +### 3. Methodology Compliance +- DCMA 14-Point thresholds deviating from standard +- ANSI/EIA-748 EVMS guidelines incorrectly stated +- AACE standard references that don't match source +- Flyvbjerg benchmark data that doesn't match published research +- GAO Schedule Assessment Guide criteria incorrectly applied -### 4. 
Content -- Typos and grammar errors -- Outdated or incorrect text -- Placeholder / lorem ipsum text left in -- Truncated text (cut off without ellipsis or "more") -- Wrong labels on buttons or form fields -- Missing or unhelpful empty states +### 4. Output Completeness +- Missing required sections in skill output templates +- Incomplete tables (missing columns or rows) +- Output format that doesn't match the stated template +- Missing Executive Summary or Recommended Actions +- Decision Brief missing required fields -### 5. Performance -- Slow page loads (>3 seconds) -- Janky scrolling (dropped frames) -- Layout shifts (content jumping after load) -- Excessive network requests (>50 on a single page) -- Large unoptimized images -- Blocking JavaScript (page unresponsive during load) +### 5. Data Flow Integrity +- Schedule data not flowing correctly to Completion Prediction +- EVM metrics not available to Executive Reporting +- Risk register not feeding Decision-AI Schwerpunkt analysis +- Snapshot persistence not saving all required fields +- Trend tracking not loading prior snapshots correctly -### 6. Console/Errors -- JavaScript exceptions (uncaught errors) -- Failed network requests (4xx, 5xx) -- Deprecation warnings (upcoming breakage) -- CORS errors -- Mixed content warnings (HTTP resources on HTTPS) -- CSP violations +### 6. Industry Standard Compliance +- EVM formulas not matching ANSI/EIA-748 +- Schedule checks not matching DCMA 14-Point Assessment +- Cost contingency not following AACE RP 40R-08 +- Reference class methodology not following Flyvbjerg/UK Treasury +- Reporting format not following AACE RP 11R-88 -### 7. 
Accessibility -- Missing alt text on images -- Unlabeled form inputs -- Keyboard navigation broken (can't tab to elements) -- Focus traps (can't escape a modal or dropdown) -- Missing or incorrect ARIA attributes -- Insufficient color contrast -- Content not reachable by screen reader +## Per-Skill Validation Checklist -## Per-Page Exploration Checklist +For each analytical skill during a QA session: -For each page visited during a QA session: - -1. **Visual scan** — Take annotated screenshot (`snapshot -i -a -o`). Look for layout issues, broken images, alignment. -2. **Interactive elements** — Click every button, link, and control. Does each do what it says? -3. **Forms** — Fill and submit. Test empty submission, invalid data, edge cases (long text, special characters). -4. **Navigation** — Check all paths in/out. Breadcrumbs, back button, deep links, mobile menu. -5. **States** — Check empty state, loading state, error state, full/overflow state. -6. **Console** — Run `console --errors` after interactions. Any new JS errors or failed requests? -7. **Responsiveness** — If relevant, check mobile and tablet viewports. -8. **Auth boundaries** — What happens when logged out? Different user roles? +1. **Formula check** -- Plug in known values and verify outputs match expected results +2. **Threshold check** -- Verify all stated thresholds match cited industry standards +3. **Output template check** -- Verify all required output sections are present and complete +4. **Cross-reference check** -- Verify shared concepts are consistent with other skills +5. **Snapshot schema check** -- Verify JSON persistence captures all required fields +6. **Trend tracking check** -- Verify delta calculations produce correct results +7. 
**Edge case check** -- Test with boundary values (zero, negative, maximum) diff --git a/retro/SKILL.md b/retro/SKILL.md index f1e92c2..6a12bd5 100644 --- a/retro/SKILL.md +++ b/retro/SKILL.md @@ -2,9 +2,10 @@ name: retro version: 2.0.0 description: | - Weekly engineering retrospective. Analyzes commit history, work patterns, - and code quality metrics with persistent history and trend tracking. + Weekly engineering retrospective for Cybereum. Analyzes commit history, work + patterns, and code quality metrics with persistent history and trend tracking. Team-aware: breaks down per-person contributions with praise and growth areas. + Cybereum-aware: tracks skill development velocity and cross-skill consistency. allowed-tools: - Bash - Read @@ -24,9 +25,9 @@ _UPD=$(~/.claude/skills/gstack/bin/gstack-update-check 2>/dev/null || .claude/sk If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined). If `JUST_UPGRADED `: tell user "Running gstack v{to} (just updated!)" and continue. -# /retro — Weekly Engineering Retrospective +# /retro -- Weekly Engineering Retrospective (Cybereum) -Generates a comprehensive engineering retrospective analyzing commit history, work patterns, and code quality metrics. Team-aware: identifies the user running the command, then analyzes every contributor with per-person praise and growth opportunities. Designed for a senior IC/CTO-level builder using Claude Code as a force multiplier. +Generates a comprehensive engineering retrospective analyzing commit history, work patterns, and code quality metrics for the Cybereum capital project governance platform. Team-aware: identifies the user running the command, then analyzes every contributor with per-person praise and growth opportunities. 
Cybereum-aware: tracks which analytical and workflow skills were modified and flags cross-skill consistency risks. ## User-invocable When the user types `/retro`, run this skill. @@ -398,6 +399,25 @@ Narrative covering: - Any XL PRs that should have been split - Greptile signal ratio and trend (if history exists): "Greptile: X% signal (Y valid catches, Z false positives)" +### Cybereum Skill Development +(Cybereum-specific section) + +Analyze which skills were modified this period: + +```bash +# Which Cybereum skills were touched? +git log origin/main --since="" --format="" --name-only | grep -E "^cybereum-" | sort -u +# Which workflow skills were touched? +git log origin/main --since="" --format="" --name-only | grep -E "^(ship|review|qa|retro|plan-)" | sort -u +``` + +Report: +- **Analytical skills modified**: List which of the 8 were touched and what changed +- **Workflow skills modified**: List which were touched +- **Cross-skill consistency risk**: If >2 analytical skills were modified, flag potential terminology/threshold drift +- **Skill coverage**: How many of the 8 analytical skills have been touched in the last 30 days? Skills untouched for 30+ days may be falling behind +- **Snapshot schema changes**: Were any `.cybereum/` JSON schemas modified? Flag backward compatibility risk + ### Focus & Highlights (from Step 8) - Focus score with interpretation diff --git a/review/SKILL.md b/review/SKILL.md index b572283..6b5d9ce 100644 --- a/review/SKILL.md +++ b/review/SKILL.md @@ -2,8 +2,9 @@ name: review version: 1.0.0 description: | - Pre-landing PR review. Analyzes diff against main for SQL safety, LLM trust - boundary violations, conditional side effects, and other structural issues. + Pre-landing PR review for Cybereum. Analyzes diff against main for calculation + integrity, graph consistency, cross-skill coherence, LLM trust boundary + violations, and structural issues in capital project analytics code. 
allowed-tools: - Bash - Read @@ -27,21 +28,21 @@ If output shows `UPGRADE_AVAILABLE `: read `~/.claude/skills/gstack/g # Pre-Landing PR Review -You are running the `/review` workflow. Analyze the current branch's diff against main for structural issues that tests don't catch. +You are running the `/review` workflow for Cybereum. Analyze the current branch's diff against main for structural issues that tests don't catch -- with special attention to calculation correctness, data consistency, and cross-skill coherence. --- ## Step 1: Check branch 1. Run `git branch --show-current` to get the current branch. -2. If on `main`, output: **"Nothing to review — you're on main or have no changes against main."** and stop. +2. If on `main`, output: **"Nothing to review -- you're on main or have no changes against main."** and stop. 3. Run `git fetch origin main --quiet && git diff origin/main --stat` to check if there's a diff. If no diff, output the same message and stop. --- ## Step 2: Read the checklist -Read `.claude/skills/review/checklist.md`. +Read `review/checklist.md` (or `.claude/skills/review/checklist.md`). **If the file cannot be read, STOP and report the error.** Do not proceed without the checklist. @@ -73,18 +74,18 @@ Run `git diff origin/main` to get the full diff. This includes both committed an Apply the checklist against the diff in two passes: -1. **Pass 1 (CRITICAL):** SQL & Data Safety, LLM Output Trust Boundary -2. **Pass 2 (INFORMATIONAL):** Conditional Side Effects, Magic Numbers & String Coupling, Dead Code & Consistency, LLM Prompt Issues, Test Gaps, View/Frontend +1. **Pass 1 (CRITICAL):** Data & Calculation Integrity, Graph & Data Consistency, LLM Output Trust Boundary +2. **Pass 2 (INFORMATIONAL):** Skill Content Quality, Cross-Skill Consistency, Conditional Side Effects, Dead Code, LLM Prompt Issues, Test Gaps, Type Coercion, File Parsing Safety -Follow the output format specified in the checklist. 
Respect the suppressions — do NOT flag items listed in the "DO NOT flag" section. +Follow the output format specified in the checklist. Respect the suppressions -- do NOT flag items listed in the "DO NOT flag" section. --- ## Step 5: Output findings -**Always output ALL findings** — both critical and informational. The user must see every issue. +**Always output ALL findings** -- both critical and informational. The user must see every issue. -- If CRITICAL issues found: output all findings, then for EACH critical issue use a separate AskUserQuestion with the problem, your recommended fix, and options (A: Fix it now, B: Acknowledge, C: False positive — skip). +- If CRITICAL issues found: output all findings, then for EACH critical issue use a separate AskUserQuestion with the problem, your recommended fix, and options (A: Fix it now, B: Acknowledge, C: False positive -- skip). After all critical questions are answered, output a summary of what the user chose for each issue. If the user chose A (fix) on any issue, apply the recommended fixes. If only B/C were chosen, no action needed. - If only non-critical issues found: output findings. No further action needed. - If no issues found: output `Pre-Landing Review: No issues found.` diff --git a/review/checklist.md b/review/checklist.md index e321890..6f1ccb4 100644 --- a/review/checklist.md +++ b/review/checklist.md @@ -2,10 +2,10 @@ ## Instructions -Review the `git diff origin/main` output for the issues listed below. Be specific — cite `file:line` and suggest fixes. Skip anything that's fine. Only flag real problems. +Review the `git diff origin/main` output for the issues listed below. Be specific -- cite `file:line` and suggest fixes. Skip anything that's fine. Only flag real problems. **Two-pass review:** -- **Pass 1 (CRITICAL):** Run SQL & Data Safety and LLM Output Trust Boundary first. These can block `/ship`. 
+- **Pass 1 (CRITICAL):** Run Data & Calculation Integrity, Graph Consistency, and LLM Output Trust Boundary first. These can block `/ship`. - **Pass 2 (INFORMATIONAL):** Run all remaining categories. These are included in the PR body but do not block. **Output format:** @@ -30,68 +30,80 @@ Be terse. For each issue: one line describing the problem, one line with the fix ## Review Categories -### Pass 1 — CRITICAL - -#### SQL & Data Safety -- String interpolation in SQL (even if values are `.to_i`/`.to_f` — use `sanitize_sql_array` or Arel) -- TOCTOU races: check-then-set patterns that should be atomic `WHERE` + `update_all` -- `update_column`/`update_columns` bypassing validations on fields that have or should have constraints -- N+1 queries: `.includes()` missing for associations used in loops/views (especially avatar, attachments) - -#### Race Conditions & Concurrency -- Read-check-write without uniqueness constraint or `rescue RecordNotUnique; retry` (e.g., `where(hash:).first` then `save!` without handling concurrent insert) -- `find_or_create_by` on columns without unique DB index — concurrent calls can create duplicates -- Status transitions that don't use atomic `WHERE old_status = ? 
UPDATE SET new_status` — concurrent updates can skip or double-apply transitions -- `html_safe` on user-controlled data (XSS) — check any `.html_safe`, `raw()`, or string interpolation into `html_safe` output +### Pass 1 -- CRITICAL + +#### Data & Calculation Integrity +- EVM formula errors: CPI, SPI, EAC, TCPI, VAC calculations must match ANSI/EIA-748 definitions exactly +- Risk scoring errors: P x I matrix calculations, contingency formulas, Monte Carlo parameter ranges +- Schedule metric errors: Float calculations, CPLI formula, BEI formula, DCMA 14-Point thresholds +- Reference Class data errors: Overrun percentages, benchmark values must match cited sources (Flyvbjerg, GAO, RAND) +- Division by zero: Any metric calculation where the denominator could be zero (ACWP=0 for CPI, etc.) +- Unit mismatches: Working days vs calendar days, cost in $K vs $M, percentages as decimals vs whole numbers +- Date arithmetic: Business day calculations that don't account for calendars, timezone-naive date comparisons + +#### Graph & Data Consistency +- Temporal knowledge graph mutations that don't preserve causal chain integrity +- Schedule relationships (FS/SS/FF/SF) with contradictory logic (circular dependencies, impossible sequences) +- Risk register entries with score != probability x impact +- EVM data where BCWP > BAC (earned more than budgeted -- impossible without scope change) +- Float values inconsistent with ES/EF/LS/LF dates +- Activities with % complete > 0 but no actual start date +- Milestones with duration > 0 (milestones are zero-duration by definition) #### LLM Output Trust Boundary -- LLM-generated values (emails, URLs, names) written to DB or passed to mailers without format validation. Add lightweight guards (`EMAIL_REGEXP`, `URI.parse`, `.strip`) before persisting. -- Structured tool output (arrays, hashes) accepted without type/shape checks before database writes. 
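The consistency checks above are mechanical enough to sketch as code. A minimal illustration, assuming one flat dict per activity/risk record; the field names are invented for this sketch, not an actual Cybereum schema:

```python
def consistency_violations(record: dict) -> list[str]:
    """Flag impossible states mirroring the Graph & Data Consistency bullets."""
    v = []
    if record.get("risk_probability") is not None:
        expected = record["risk_probability"] * record["risk_impact"]
        if record["risk_score"] != expected:
            v.append(f"risk score {record['risk_score']} != P x I ({expected})")
    if record.get("bcwp", 0) > record.get("bac", float("inf")):
        v.append("BCWP > BAC: earned more than budgeted")
    if record.get("is_milestone") and record.get("duration", 0) > 0:
        v.append("milestone with nonzero duration")
    if record.get("pct_complete", 0) > 0 and record.get("actual_start") is None:
        v.append("% complete > 0 but no actual start date")
    return v

bad = {"risk_probability": 4, "risk_impact": 3, "risk_score": 15,
       "is_milestone": True, "duration": 5, "pct_complete": 10,
       "actual_start": None}
print(consistency_violations(bad))  # three violations flagged
```

Any diff that touches these fields without keeping such invariants intact is a Pass 1 finding.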
- -### Pass 2 — INFORMATIONAL +- LLM-generated risk descriptions, corrective actions, or recommendations written to persistent storage without human review flag +- AI-generated schedule analysis accepted as ground truth without cross-referencing parsed schedule data +- Schwerpunkt recommendations that bypass the Critic step (Step 4 in Decision-AI) +- Executive report content generated without data validation against source metrics +- Outreach messages or pitch content with fabricated proof points or statistics + +### Pass 2 -- INFORMATIONAL + +#### Skill Content Quality +- DCMA 14-Point thresholds that deviate from standard without documented justification +- Risk scoring thresholds inconsistent across skills (e.g., "critical" defined differently in Risk Engine vs Schedule Intelligence) +- Completion prediction confidence multipliers that don't sum/relate correctly +- Reference class sample sizes below the minimum stated in the skill (e.g., N>8 for nuclear SMR) +- Executive report sections that reference metrics not available from other skills + +#### Cross-Skill Consistency +- Terminology drift: Same concept named differently across skills (e.g., "P80" vs "80th percentile" vs "conservative estimate") +- Threshold drift: Same threshold defined with different values across skills +- Output format inconsistency: JSON snapshot schemas that don't align across skills that share data +- Reference file paths that don't match actual file locations #### Conditional Side Effects -- Code paths that branch on a condition but forget to apply a side effect on one branch. Example: item promoted to verified but URL only attached when a secondary condition is true — the other branch promotes without the URL, creating an inconsistent record. -- Log messages that claim an action happened but the action was conditionally skipped. The log should reflect what actually occurred. 
- -#### Magic Numbers & String Coupling -- Bare numeric literals used in multiple files — should be named constants documented together -- Error message strings used as query filters elsewhere (grep for the string — is anything matching on it?) +- Code paths that branch on a condition but forget to apply a side effect on one branch +- Log messages that claim an action happened but the action was conditionally skipped #### Dead Code & Consistency - Variables assigned but never read - Version mismatch between PR title and VERSION/CHANGELOG files -- CHANGELOG entries that describe changes inaccurately (e.g., "changed from X to Y" when X never existed) +- CHANGELOG entries that describe changes inaccurately - Comments/docstrings that describe old behavior after the code changed #### LLM Prompt Issues - 0-indexed lists in prompts (LLMs reliably return 1-indexed) -- Prompt text listing available tools/capabilities that don't match what's actually wired up in the `tool_classes`/`tools` array -- Word/token limits stated in multiple places that could drift +- Prompt text listing available tools/capabilities that don't match what's actually wired up +- Scoring formulas in prompts that don't match the formulas in the analytical methodology sections #### Test Gaps -- Negative-path tests that assert type/status but not the side effects (URL attached? field populated? callback fired?) 
-- Assertions on string content without checking format (e.g., asserting title present but not URL format) -- `.expects(:something).never` missing when a code path should explicitly NOT call an external service -- Security enforcement features (blocking, rate limiting, auth) without integration tests verifying the enforcement path works end-to-end - -#### Crypto & Entropy -- Truncation of data instead of hashing (last N chars instead of SHA-256) — less entropy, easier collisions -- `rand()` / `Random.rand` for security-sensitive values — use `SecureRandom` instead -- Non-constant-time comparisons (`==`) on secrets or tokens — vulnerable to timing attacks - -#### Time Window Safety -- Date-key lookups that assume "today" covers 24h — report at 8am PT only sees midnight→8am under today's key -- Mismatched time windows between related features — one uses hourly buckets, another uses daily keys for the same data +- EVM calculations without edge case tests (CPI when ACWP=0, SPI at project end) +- Schedule parsing without malformed input tests (corrupt XER, missing fields, wrong encoding) +- Risk scoring without boundary value tests (score exactly at threshold) +- Monte Carlo without seed-controlled deterministic tests +- Snapshot persistence without round-trip tests (save then load and compare) #### Type Coercion at Boundaries -- Values crossing Ruby→JSON→JS boundaries where type could change (numeric vs string) — hash/digest inputs must normalize types -- Hash/digest inputs that don't call `.to_s` or equivalent before serialization — `{ cores: 8 }` vs `{ cores: "8" }` produce different hashes +- Values crossing JSON boundaries where type could change (numeric vs string) +- Date strings without timezone information crossing system boundaries +- Cost values that mix currency formats or decimal precision -#### View/Frontend -- Inline `