Conversation
12b324c to
8135e62
Compare
Implements a two-tier nightly GitHub Actions workflow that verifies git-ai hooks fire correctly with real agent CLI binaries (Claude Code, Codex, Gemini CLI, Droid, OpenCode) on both stable and latest releases. Tier 1 (no API keys): Installs each agent CLI, runs `git-ai install`, verifies hook config files contain the correct checkpoint commands, then exercises the full attribution pipeline with synthetic checkpoint data via the agent-v1 preset. Tier 2 (live, requires API keys): Runs each agent with a deterministic prompt in a test repo and verifies authorship notes and blame output. New files: - .github/workflows/nightly-agent-integration.yml - scripts/nightly/verify-hook-wiring.sh - scripts/nightly/test-synthetic-checkpoint.sh - scripts/nightly/test-live-agent.sh - scripts/nightly/verify-attribution.sh Hook config paths verified against src/mdm/agents/*.rs: - claude: ~/.claude/settings.json - codex: ~/.codex/config.toml - gemini: ~/.gemini/settings.json - droid: ~/.factory/settings.json - opencode: ~/.config/opencode/plugin/git-ai.ts Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Neither file belongs in the repo: .mcp.json is local tooling config and the plan document was a design scratch pad, not a deliverable. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. scripts/nightly/test-synthetic-checkpoint.sh: Fix transcript message schema in the synthetic checkpoint JSON payload. The Rust Message enum uses `#[serde(tag = "type", rename_all = "snake_case")]`, so messages require `"type"` and `"text"` fields — not `"role"` and `"content"`. The old schema caused deserialization to fail for every Tier 1 run. 2. .github/workflows/nightly-agent-integration.yml: Fix notify-on-failure condition. With `if: failure()`, GitHub Actions skips the job entirely when tier2-live-integration is skipped (e.g. when running tier1-only), silently swallowing Tier 1 failures. Replace with an explicit always() guard that checks each dependency's result individually. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add a pull_request `labeled` event trigger so the full nightly suite runs whenever someone applies the 'Integration' label to any PR — in addition to the existing nightly schedule and workflow_dispatch paths. The gate condition on the resolve-versions job ensures the downstream matrix jobs only run for the correct trigger, not for every label event. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The label is 'integration', not 'Integration'. GitHub label names are case-sensitive in Actions expressions. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the placeholder hello.txt smoke test with real end-to-end tests that verify git-ai's entire attribution pipeline: test-live-agent.sh: - Seeds the test repo with a real Python module (utils/math_utils.py) containing add, subtract, and is_prime functions - Runs the real agent CLI with a substantive prompt: add a fibonacci function using an iterative approach and commit it - Falls back to a manual commit if the agent wrote code but didn't commit (post-commit hook still fires and writes the authorship note as long as working log data was captured during the agent run) - Idempotent across retry attempts verify-attribution.sh: - Checks fibonacci function was actually added to the Python file - Verifies ≥3 commits exist (initial + seed + agent) - Fetches and parses the authorship note from refs/notes/ai - Asserts schema_version = "authorship/3.0.0" - Asserts at least one prompt session was recorded (hard fail) - Fuzzy-matches agent_id.tool against the agent name - Checks transcript messages were captured - Verifies utils/math_utils.py appears in the attestation section - Runs git-ai blame and checks AI attribution on fibonacci lines - Saves all artefacts (raw note, parsed metadata, blame output) to RESULTS_DIR for upload Workflow: increase Tier 2 job timeout from 25→45 min and retry timeout from 12→20 min to accommodate seeding + real agent API calls. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The install-scripts-local workflow does more than validate install scripts — it verifies full end-to-end hook wiring between git-ai and Claude Code. Rename the workflow and job names to reflect that. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Replace the fake claude binary stub with real npm-installed agent CLIs and add a matrix covering all four supported agents. This makes the End-to-End tests meaningful: install.sh now runs git-ai install-hooks against actual agent binaries, which auto-detect the installed tool and write real hook configuration to each agent's config directory. Verification uses the existing verify-hook-wiring.sh script (Unix) and equivalent inline PowerShell checks (Windows) to confirm hooks were written to the correct agent-specific location. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Two bugs in the E2E test setup: 1. opencode npm package: the package is "opencode-ai" not "opencode". The bare "opencode" name returns a 404 from the npm registry. Fixed in both the E2E install workflow and the nightly agent integration workflow. 2. codex hook verification: grep pattern "checkpoint codex" expects a JSON-style command string, but Codex config uses a TOML array where elements are comma-separated: notify = ["<bin>", "checkpoint", "codex", ...]. Changed to grep for just "checkpoint" which appears in the array and is sufficient to confirm the hook is configured. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The same TOML array format issue that was fixed in verify-hook-wiring.sh for Unix also affects the Windows inline PowerShell check. Codex stores its hook as a TOML array (notify = ["<bin>", "checkpoint", "codex", ...]) so Select-String for "checkpoint codex" never matches. Changed to match just "checkpoint". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…n verify-attribution.sh The `[ $? -eq 0 ] || fail "..."` guard was dead code under `set -euo pipefail`: if the python3 heredoc exits with code 1, `set -e` terminates the script immediately before the guard is reached, producing a silent exit with no diagnostic logged to $LOG. Replace with `if ! python3 ... <<'PYEOF' ... then fail "..." fi`, which is exempt from `set -e` and ensures the descriptive failure message is written to $LOG before exiting. Resolves Devin review comment BUG_pr-review-job-8b70596b_0002. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The Tier 1 and Tier 2 nightly jobs were calling `git-ai install` to set up agent hooks, but never creating the `git` → `git-ai` symlink in the release directory. When test scripts called `git commit`, the system git ran instead of the git-ai proxy, so the post-commit hook never fired and no authorship note was written to refs/notes/ai. Add `ln -sf .../git-ai .../git` in both the Tier 1 and Tier 2 "Install git-ai hooks in test repo" steps so that all `git` invocations inside test scripts (which prepend the release dir to PATH) route through git-ai and trigger the expected hook behaviour. Resolves Devin review comment BUG_pr-review-job-bf54cac596f44273b5f8565f81a63daf_0001. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous Lint (ubuntu-latest) check failed on `go-task/setup-task@v1` (not on any code change) — the same action passed on the identical commit via e2e-tests. No code changes; forcing a clean CI run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1. verify-attribution.sh: guard empty-string fuzzy match `"" in "claude"` is True in Python, so a missing agent_id.tool would always report PASS. Added `if tool and (...)` to require a non-empty tool string before the fuzzy match runs. Resolves Devin BUG_pr-review-job-032b242ab75044ebac035a42020d7fe3_0001. 2. test-live-agent.sh: add `sudo` to ripgrep fallback install `apt-get install` on GitHub Actions ubuntu-latest requires root. Without `sudo` the install failed silently (2>/dev/null || true), leaving `rg` absent and potentially causing the Gemini CLI to hang. Resolves Devin BUG_pr-review-job-6b947f0c5f1e475bb3ffbeba9e6056de_0001. 3. nightly-agent-integration.yml: deduplicate stable/latest matrix entries `npm view <pkg> version` and `npm view <pkg> dist-tags.latest` return the same value, so stable and latest channels always tested the same version, doubling CI cost for zero extra coverage. Now queries `dist-tags.next` for the latest channel (pre-release/canary), falling back to stable_ver if no `next` tag exists, and skips the latest entry entirely when it would duplicate stable. Resolves Devin BUG_pr-review-job-6b947f0c5f1e475bb3ffbeba9e6056de_0002. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The previous fix queried dist-tags.next for latest_ver but still used @latest in the npm install command, which resolves to the stable release — identical to the stable channel and defeating the entire purpose of the latest matrix entry. Change the npm_pkg construction for the latest channel to use @next so the pre-release/canary version is actually installed when it exists. Resolves Devin BUG_pr-review-job-070479ba6d7041699555d4dfa9779fa3_0001. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
npm view <pkg> dist-tags.next exits with code 0 and returns an empty
string (or "undefined") when the tag does not exist in npm 10+, rather
than raising a non-zero exit. This meant CalledProcessError was never
raised, latest_ver was set to "" or "undefined", the dedup check
("" != stable_ver) didn't fire, and a matrix entry was emitted with
npm_pkg="<pkg>@next" — causing npm install to fail with ETARGET.
Add an explicit check after .strip(): if the result is empty or equals
the string "undefined", fall back to stable_ver, triggering the same
deduplication skip as the CalledProcessError path.
Resolves Devin BUG_pr-review-job-874dec7614a64a5e952cf18579ebc182_0001.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- install-scripts-local.yml: replace hardcoded `grep checkpoint claude` with a bash case statement matching the Windows switch, so each agent matrix entry verifies its own hook config file (claude→settings.json, codex→config.toml, gemini→settings.json, opencode→plugin file) - nightly-agent-integration.yml: pass workflow_dispatch `agents` input as AGENTS_FILTER env var and filter the Python matrix builder so that specifying e.g. `agents: "claude"` actually limits the matrix instead of unconditionally running all four npm agents Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
8135e62 to
cbdfa6e
Compare
The droid entry was appended unconditionally after the filtered npm-agent loop, so specifying `agents: "claude"` via workflow_dispatch would still include droid in the matrix. Wrap the append in the same filter check so droid is only included when the filter is absent, set to "all", or explicitly contains "droid". Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The bare `.vscode` entry would silently hide any new files added under .vscode/ from `git status`, requiring `git add -f` to track them, and misleads contributors into thinking the whole directory should be untracked. Replace it with `.vscode/*` + `!.vscode/settings.json` so that the tracked project settings file remains visible while any other editor-local files are still ignored. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Every matrix cell in the E2E install workflow now runs three additional phases after verifying agent hook configuration: 1. Simulate an AI commit — create a test git repo, wire the git→git-ai proxy symlink and post-commit hook (via `git-ai install`), then feed synthetic checkpoint data through `git-ai checkpoint agent-v1` and commit, exactly as the nightly Tier 1 tests do. 2. Verify attribution tracking — new script `scripts/nightly/verify-synthetic-attribution.sh` checks: - Authorship note exists on HEAD (post-commit hook fired) - Note contains parseable JSON with schema_version = authorship/3.0.0 - At least one prompt session was recorded (prompt stored) - At least one transcript message was captured - `git-ai stats HEAD --json` shows ai_additions > 0 - Test file appears in the note's attestation section - `git-ai blame` shows AI attribution markers 3. Upload results artifact for every matrix cell (always). Windows job mirrors the Unix flow using PowerShell: copies git-ai.exe as git.exe (proxy without requiring developer mode for symlinks), builds the checkpoint JSON via ConvertTo-Json, and performs the same 8 attribution checks inline. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| $lines | Set-Content -Path $log | ||
| Write-Log "=== Synthetic attribution verification COMPLETE: $agent ===" |
There was a problem hiding this comment.
🟡 Windows verification log file written before final Write-Log call, losing the COMPLETE message
In the Windows "Verify attribution pipeline" step, $lines | Set-Content -Path $log at line 418 writes the log file to disk, but then Write-Log is called at line 419 which appends to $lines (via $lines.Add($msg)) after the file was already written. The "COMPLETE" message is printed to stdout via Write-Host but is missing from the log file that gets uploaded as an artifact. The pattern used for all other failure paths writes $lines | Set-Content then throws, but the happy-path final write was placed before the last log line.
Mismatched write sequence at lines 418-419
Line 418: $lines | Set-Content -Path $log (file written)
Line 419: Write-Log "=== Synthetic attribution verification COMPLETE: $agent ===" (adds to $lines AND Write-Host, but file already on disk)
| $lines | Set-Content -Path $log | |
| Write-Log "=== Synthetic attribution verification COMPLETE: $agent ===" | |
| Write-Log "=== Synthetic attribution verification COMPLETE: $agent ===" | |
| $lines | Set-Content -Path $log |
Was this helpful? React with 👍 or 👎 to provide feedback.
The agent-v1 checkpoint format stores an empty messages[] because conversation transcripts are only captured by live agent hooks, not synthetic checkpoints. This is expected behaviour — downgrade the check from a hard failure to a warning, consistent with how verify-attribution.sh handles the same condition for live agent runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Same fix as the bash script — synthetic checkpoints don't store conversation messages, so this should be a warning not a failure. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Summary
.github/workflows/nightly-agent-integration.yml— a two-tier nightly workflow that installs real agent CLI binaries and verifies git-ai hook wiring and attribution end-to-endscripts/nightly/with four helper scripts implementing the test logicNIGHTLY_INTEGRATION_PLAN.mddocumenting the full design rationale and open questionsTest Architecture
Tier 1 — Hook Wiring (no API keys, free)
Builds
git-aifrom source, installs each agent CLI (Claude Code, Codex, Gemini, Droid, OpenCode) at bothstableandlatestversions via a dynamic matrix, then:git-ai installand verifies the correct checkpoint commands appear in each agent's config fileagent-v1preset)Tier 2 — Live Integration (requires API key secrets)
Runs each agent with a minimal deterministic prompt ("create hello.txt, commit it"), then verifies the file was created, a commit landed, and authorship notes are present in
refs/notes/ai. Pre-release failures are non-blocking (continue-on-error: true).Hook config paths (verified against
src/mdm/agents/*.rs)~/.claude/settings.json~/.codex/config.toml~/.gemini/settings.json~/.factory/settings.json~/.config/opencode/plugin/git-ai.tsSecrets required (Tier 2 only)
ANTHROPIC_API_KEY,OPENAI_API_KEY,GEMINI_API_KEY,FACTORY_API_KEY,SLACK_BOT_TOKEN,SLACK_CHANNEL_IDTier 1 runs without any secrets.
Cost estimate
~$0.05–0.25/night (weekdays only). See
NIGHTLY_INTEGRATION_PLAN.md§6 for cost management strategies.Test plan
workflow_dispatchwithtier: tier1to validate hook-wiring jobs (no API keys needed)ANTHROPIC_API_KEYsecret and triggertier: bothto validate Claude Code Tier 2 end-to-endNIGHTLY_INTEGRATION_PLAN.md§13 before enabling the nightly schedule🤖 Generated with Claude Code