Add nightly agent CLI integration tests #602

Open
jwiegley wants to merge 22 commits into main from johnw/nightly-integration

Conversation

jwiegley (Collaborator) commented Feb 26, 2026

Summary

  • Adds .github/workflows/nightly-agent-integration.yml — a two-tier nightly workflow that installs real agent CLI binaries and verifies git-ai hook wiring and attribution end-to-end
  • Adds scripts/nightly/ with four helper scripts implementing the test logic
  • Adds NIGHTLY_INTEGRATION_PLAN.md documenting the full design rationale and open questions

Test Architecture

Tier 1 — Hook Wiring (no API keys, free)

Builds git-ai from source, installs each agent CLI (Claude Code, Codex, Gemini, Droid, OpenCode) at both stable and latest versions via a dynamic matrix, then:

  1. Runs git-ai install and verifies the correct checkpoint commands appear in each agent's config file
  2. Exercises the full attribution pipeline with synthetic checkpoint data (via the agent-v1 preset)

Tier 2 — Live Integration (requires API key secrets)

Runs each agent with a minimal deterministic prompt ("create hello.txt, commit it"), then verifies the file was created, a commit landed, and authorship notes are present in refs/notes/ai. Pre-release failures are non-blocking (continue-on-error: true).

Hook config paths (verified against src/mdm/agents/*.rs)

Agent        Config file
Claude Code  ~/.claude/settings.json
Codex        ~/.codex/config.toml
Gemini CLI   ~/.gemini/settings.json
Droid        ~/.factory/settings.json
OpenCode     ~/.config/opencode/plugin/git-ai.ts

Secrets required (Tier 2 only)

ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY, FACTORY_API_KEY, SLACK_BOT_TOKEN, SLACK_CHANNEL_ID

Tier 1 runs without any secrets.

Cost estimate

~$0.05–0.25/night (weekdays only). See NIGHTLY_INTEGRATION_PLAN.md §6 for cost management strategies.

Test plan

  • Verify workflow YAML parses correctly in Actions UI
  • Trigger workflow_dispatch with tier: tier1 to validate hook-wiring jobs (no API keys needed)
  • Add ANTHROPIC_API_KEY secret and trigger tier: both to validate Claude Code Tier 2 end-to-end
  • Review open questions in NIGHTLY_INTEGRATION_PLAN.md §13 before enabling the nightly schedule

🤖 Generated with Claude Code


git-ai-cloud-dev bot commented Feb 26, 2026

Stats powered by Git AI

🧠 you    █░░░░░░░░░░░░░░░░░░░  7%
🤖 ai     ░███████████████████  93%
More stats
  • 1.0 lines generated for every 1 accepted
  • 11 minutes waiting for AI
  • Top model: claude::claude-opus-4-6 (932 accepted lines, 932 generated lines)

AI code tracked with git-ai

git-ai-cloud bot commented Feb 26, 2026

Stats powered by Git AI

🧠 you    ████░░░░░░░░░░░░░░░░  22%
🤖 ai     ░░░░████████████████  78%
More stats
  • 0.9 lines generated for every 1 accepted
  • 4 minutes waiting for AI
  • Top model: claude::claude-sonnet-4-6 (263 accepted lines, 238 generated lines)

AI code tracked with git-ai

devin-ai-integration[bot]

This comment was marked as resolved.

@git-ai-bot-svarlamov-dev

Stats powered by Git AI

🧠 you    █░░░░░░░░░░░░░░░░░░░  7%
🤖 ai     ░███████████████████  93%
More stats
  • 1.0 lines generated for every 1 accepted
  • 11 minutes waiting for AI
  • Top model: claude::claude-opus-4-6 (932 accepted lines, 932 generated lines)

AI code tracked with git-ai

jwiegley force-pushed the johnw/nightly-integration branch from 12b324c to 8135e62 (March 10, 2026 01:37)

jwiegley and others added 14 commits March 9, 2026 21:41
Implements a two-tier nightly GitHub Actions workflow that verifies
git-ai hooks fire correctly with real agent CLI binaries (Claude Code,
Codex, Gemini CLI, Droid, OpenCode) on both stable and latest releases.

Tier 1 (no API keys): Installs each agent CLI, runs `git-ai install`,
verifies hook config files contain the correct checkpoint commands, then
exercises the full attribution pipeline with synthetic checkpoint data
via the agent-v1 preset.

Tier 2 (live, requires API keys): Runs each agent with a deterministic
prompt in a test repo and verifies authorship notes and blame output.

New files:
- .github/workflows/nightly-agent-integration.yml
- scripts/nightly/verify-hook-wiring.sh
- scripts/nightly/test-synthetic-checkpoint.sh
- scripts/nightly/test-live-agent.sh
- scripts/nightly/verify-attribution.sh

Hook config paths verified against src/mdm/agents/*.rs:
- claude: ~/.claude/settings.json
- codex: ~/.codex/config.toml
- gemini: ~/.gemini/settings.json
- droid: ~/.factory/settings.json
- opencode: ~/.config/opencode/plugin/git-ai.ts

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Neither file belongs in the repo: .mcp.json is local tooling config
and the plan document was a design scratch pad, not a deliverable.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1. scripts/nightly/test-synthetic-checkpoint.sh: Fix transcript message
   schema in the synthetic checkpoint JSON payload. The Rust Message enum
   uses `#[serde(tag = "type", rename_all = "snake_case")]`, so messages
   require `"type"` and `"text"` fields — not `"role"` and `"content"`.
   The old schema caused deserialization to fail for every Tier 1 run.

2. .github/workflows/nightly-agent-integration.yml: Fix notify-on-failure
   condition. With `if: failure()`, GitHub Actions skips the job entirely
   when tier2-live-integration is skipped (e.g. when running tier1-only),
   silently swallowing Tier 1 failures. Replace with an explicit always()
   guard that checks each dependency's result individually.
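The schema fix in item 1 can be sketched as follows. With an internally tagged enum, each message is a JSON object whose "type" field names the variant in snake_case. Only the "type" and "text" field names come from the commit message; the variant names ("user", "assistant"), the `make_message` helper, and the surrounding `{"messages": [...]}` wrapper are assumptions made for illustration.

```python
import json


def make_message(kind, text):
    """Build one transcript message in the internally tagged shape.

    `kind` becomes the "type" tag. The values used below are hypothetical
    variant names, not confirmed against the Rust Message enum.
    """
    return {"type": kind, "text": text}


def uses_old_schema(message):
    """Detect the role/content shape that the commit says failed to deserialize."""
    return "role" in message or "content" in message


messages = [
    make_message("user", "Add a greeting to hello.txt"),
    make_message("assistant", "Done."),
]

# Hypothetical wrapper object; the real checkpoint payload has more fields.
payload = json.dumps({"messages": messages}, indent=2)
print(payload)
```

A check like `uses_old_schema` could run before the payload is piped to the checkpoint command, so a schema regression fails fast with a readable message.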

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add a pull_request `labeled` event trigger so the full nightly suite
runs whenever someone applies the 'Integration' label to any PR — in
addition to the existing nightly schedule and workflow_dispatch paths.

The gate condition on the resolve-versions job ensures the downstream
matrix jobs only run for the correct trigger, not for every label event.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The label is 'integration', not 'Integration'. GitHub label names
are case-sensitive in Actions expressions.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the placeholder hello.txt smoke test with real end-to-end
tests that verify git-ai's entire attribution pipeline:

test-live-agent.sh:
- Seeds the test repo with a real Python module (utils/math_utils.py)
  containing add, subtract, and is_prime functions
- Runs the real agent CLI with a substantive prompt: add a fibonacci
  function using an iterative approach and commit it
- Falls back to a manual commit if the agent wrote code but didn't
  commit (post-commit hook still fires and writes the authorship note
  as long as working log data was captured during the agent run)
- Idempotent across retry attempts

verify-attribution.sh:
- Checks fibonacci function was actually added to the Python file
- Verifies ≥3 commits exist (initial + seed + agent)
- Fetches and parses the authorship note from refs/notes/ai
- Asserts schema_version = "authorship/3.0.0"
- Asserts at least one prompt session was recorded (hard fail)
- Fuzzy-matches agent_id.tool against the agent name
- Checks transcript messages were captured
- Verifies utils/math_utils.py appears in the attestation section
- Runs git-ai blame and checks AI attribution on fibonacci lines
- Saves all artefacts (raw note, parsed metadata, blame output) to
  RESULTS_DIR for upload
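Two of the checks listed above (the schema version and the presence of at least one prompt session) can be sketched in Python. This is not the script's actual code: it assumes the note's metadata has already been extracted as a JSON string, and the `prompts` key and `check_note_metadata` name are hypothetical. Only the `schema_version` value comes from the commit message.

```python
import json

EXPECTED_SCHEMA = "authorship/3.0.0"


def check_note_metadata(raw_json):
    """Return a list of failure messages; an empty list means both checks passed."""
    try:
        meta = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return ["metadata is not valid JSON: {}".format(exc)]
    if not isinstance(meta, dict):
        return ["metadata is not a JSON object"]

    failures = []
    if meta.get("schema_version") != EXPECTED_SCHEMA:
        failures.append(
            "unexpected schema_version: {!r}".format(meta.get("schema_version"))
        )
    # "prompts" is an assumed key name for the recorded prompt sessions.
    if not meta.get("prompts"):
        failures.append("no prompt session recorded")
    return failures


sample = json.dumps(
    {"schema_version": "authorship/3.0.0", "prompts": {"session-1": {}}}
)
print(check_note_metadata(sample))
```

Collecting failures in a list rather than exiting on the first one lets a single run report every broken assertion in the uploaded log.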

Workflow: increase Tier 2 job timeout from 25→45 min and retry
timeout from 12→20 min to accommodate seeding + real agent API calls.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The install-scripts-local workflow does more than validate install scripts
— it verifies full end-to-end hook wiring between git-ai and Claude Code.
Rename the workflow and job names to reflect that.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the fake claude binary stub with real npm-installed agent CLIs
and add a matrix covering all four supported agents. This makes the
End-to-End tests meaningful: install.sh now runs git-ai install-hooks
against actual agent binaries, which auto-detect the installed tool and
write real hook configuration to each agent's config directory.

Verification uses the existing verify-hook-wiring.sh script (Unix) and
equivalent inline PowerShell checks (Windows) to confirm hooks were
written to the correct agent-specific location.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Two bugs in the E2E test setup:

1. opencode npm package: the package is "opencode-ai" not "opencode".
   The bare "opencode" name returns a 404 from the npm registry. Fixed
   in both the E2E install workflow and the nightly agent integration
   workflow.

2. codex hook verification: grep pattern "checkpoint codex" expects a
   JSON-style command string, but Codex config uses a TOML array where
   elements are comma-separated: notify = ["<bin>", "checkpoint",
   "codex", ...]. Changed to grep for just "checkpoint" which appears
   in the array and is sufficient to confirm the hook is configured.
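The mismatch in item 2 is easy to reproduce with a plain substring test, which behaves the same as grep for these literal patterns. The config line below is a simplified stand-in in the shape the commit message describes; the binary path is a placeholder and `hook_configured` is a hypothetical helper.

```python
# Simplified stand-in for the Codex config line; the path is a placeholder.
codex_line = 'notify = ["/path/to/git-ai", "checkpoint", "codex"]'


def hook_configured(config_text, pattern):
    """Substring search, equivalent to grep for a literal pattern."""
    return pattern in config_text


# The array elements are separated by '", "', so the space-joined phrase never occurs.
print(hook_configured(codex_line, "checkpoint codex"))  # False
# The single word does occur as an array element.
print(hook_configured(codex_line, "checkpoint"))  # True
```

Matching on "checkpoint" alone is looser than the original pattern, which is the trade-off the commit accepts to confirm that the hook is configured.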

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The same TOML array format issue that was fixed in verify-hook-wiring.sh
for Unix also affects the Windows inline PowerShell check. Codex stores
its hook as a TOML array (notify = ["<bin>", "checkpoint", "codex", ...])
so Select-String for "checkpoint codex" never matches. Changed to match
just "checkpoint".

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…n verify-attribution.sh

The `[ $? -eq 0 ] || fail "..."` guard was dead code under `set -euo
pipefail`: if the python3 heredoc exits with code 1, `set -e` terminates
the script immediately before the guard is reached, producing a silent
exit with no diagnostic logged to $LOG.

Replace with `if ! python3 ... <<'PYEOF' ... then fail "..." fi`, which
is exempt from `set -e` and ensures the descriptive failure message is
written to $LOG before exiting.

Resolves Devin review comment BUG_pr-review-job-8b70596b_0002.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The Tier 1 and Tier 2 nightly jobs were calling `git-ai install` to set
up agent hooks, but never creating the `git` → `git-ai` symlink in the
release directory. When test scripts called `git commit`, the system git
ran instead of the git-ai proxy, so the post-commit hook never fired and
no authorship note was written to refs/notes/ai.

Add `ln -sf .../git-ai .../git` in both the Tier 1 and Tier 2 "Install
git-ai hooks in test repo" steps so that all `git` invocations inside
test scripts (which prepend the release dir to PATH) route through
git-ai and trigger the expected hook behaviour.

Resolves Devin review comment BUG_pr-review-job-bf54cac596f44273b5f8565f81a63daf_0001.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

The previous Lint (ubuntu-latest) check failed on `go-task/setup-task@v1`
(not on any code change) — the same action passed on the identical commit
via e2e-tests. No code changes; forcing a clean CI run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

1. verify-attribution.sh: guard empty-string fuzzy match
   `"" in "claude"` is True in Python, so a missing agent_id.tool would
   always report PASS. Added `if tool and (...)` to require a non-empty
   tool string before the fuzzy match runs.
   Resolves Devin BUG_pr-review-job-032b242ab75044ebac035a42020d7fe3_0001.

2. test-live-agent.sh: add `sudo` to ripgrep fallback install
   `apt-get install` on GitHub Actions ubuntu-latest requires root.
   Without `sudo` the install failed silently (2>/dev/null || true),
   leaving `rg` absent and potentially causing the Gemini CLI to hang.
   Resolves Devin BUG_pr-review-job-6b947f0c5f1e475bb3ffbeba9e6056de_0001.

3. nightly-agent-integration.yml: deduplicate stable/latest matrix entries
   `npm view <pkg> version` and `npm view <pkg> dist-tags.latest` return
   the same value, so stable and latest channels always tested the same
   version, doubling CI cost for zero extra coverage. Now queries
   `dist-tags.next` for the latest channel (pre-release/canary), falling
   back to stable_ver if no `next` tag exists, and skips the latest entry
   entirely when it would duplicate stable.
   Resolves Devin BUG_pr-review-job-6b947f0c5f1e475bb3ffbeba9e6056de_0002.
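The empty-string pitfall from item 1 can be shown in a few lines. The commit only states that an `if tool and (...)` guard was added; the two-way substring comparison inside `tool_matches` below is an assumption about what the elided condition does, and the function name is hypothetical.

```python
def tool_matches(tool, agent):
    """Fuzzy match that requires a non-empty tool string.

    The two-way substring comparison is an assumed stand-in for the
    script's real condition.
    """
    tool = (tool or "").strip().lower()
    agent = agent.strip().lower()
    return bool(tool) and (tool in agent or agent in tool)


# The empty string is a substring of every string, so an unguarded
# check would pass even when agent_id.tool is missing.
print("" in "claude")  # True

print(tool_matches("", "claude"))  # False
print(tool_matches(None, "claude"))  # False
print(tool_matches("claude-code", "claude"))  # True
```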

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jwiegley and others added 3 commits March 9, 2026 21:41
The previous fix queried dist-tags.next for latest_ver but still used
@latest in the npm install command, which resolves to the stable release
— identical to the stable channel and defeating the entire purpose of
the latest matrix entry.

Change the npm_pkg construction for the latest channel to use @next so
the pre-release/canary version is actually installed when it exists.

Resolves Devin BUG_pr-review-job-070479ba6d7041699555d4dfa9779fa3_0001.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

npm view <pkg> dist-tags.next exits with code 0 and returns an empty
string (or "undefined") when the tag does not exist in npm 10+, rather
than raising a non-zero exit. This meant CalledProcessError was never
raised, latest_ver was set to "" or "undefined", the dedup check
("" != stable_ver) didn't fire, and a matrix entry was emitted with
npm_pkg="<pkg>@next" — causing npm install to fail with ETARGET.

Add an explicit check after .strip(): if the result is empty or equals
the string "undefined", fall back to stable_ver, triggering the same
deduplication skip as the CalledProcessError path.
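A minimal sketch of the fallback and deduplication described above, written as pure functions so it runs without npm. The function names, the dictionary keys other than `npm_pkg`, the pinned form of the stable package spec, and the version numbers are all hypothetical; the workflow's real matrix builder may be structured differently.

```python
def resolve_latest(next_tag_output, stable_ver):
    """Normalise the output of `npm view <pkg> dist-tags.next`.

    An empty result or the literal string "undefined" means the tag does
    not exist, so fall back to the stable version.
    """
    candidate = next_tag_output.strip()
    if candidate in ("", "undefined"):
        return stable_ver
    return candidate


def matrix_entries(agent, pkg, stable_ver, next_tag_output):
    """Emit a stable entry, plus a latest entry only when it differs."""
    entries = [
        # Pinning the stable version here is an assumption for illustration.
        {"agent": agent, "channel": "stable", "npm_pkg": "{}@{}".format(pkg, stable_ver)}
    ]
    latest_ver = resolve_latest(next_tag_output, stable_ver)
    if latest_ver != stable_ver:
        entries.append(
            {"agent": agent, "channel": "latest", "npm_pkg": "{}@next".format(pkg)}
        )
    return entries


# No `next` tag: only the stable entry is emitted.
print(matrix_entries("opencode", "opencode-ai", "1.2.3", "undefined\n"))
# A pre-release exists: both entries are emitted.
print(matrix_entries("opencode", "opencode-ai", "1.2.3", "1.3.0-beta.1\n"))
```

Keeping the normalisation in one function means the CalledProcessError path and the empty-output path can share the same fallback value.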

Resolves Devin BUG_pr-review-job-874dec7614a64a5e952cf18579ebc182_0001.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- install-scripts-local.yml: replace hardcoded `grep checkpoint claude`
  with a bash case statement matching the Windows switch, so each agent
  matrix entry verifies its own hook config file (claude→settings.json,
  codex→config.toml, gemini→settings.json, opencode→plugin file)

- nightly-agent-integration.yml: pass workflow_dispatch `agents` input
  as AGENTS_FILTER env var and filter the Python matrix builder so that
  specifying e.g. `agents: "claude"` actually limits the matrix instead
  of unconditionally running all four npm agents

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jwiegley force-pushed the johnw/nightly-integration branch from 8135e62 to cbdfa6e (March 10, 2026 04:45)

The droid entry was appended unconditionally after the filtered npm-agent
loop, so specifying `agents: "claude"` via workflow_dispatch would still
include droid in the matrix. Wrap the append in the same filter check so
droid is only included when the filter is absent, set to "all", or
explicitly contains "droid".
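The filter rule described above can be sketched as a single predicate applied to every agent, droid included. The comma-separated format of the filter value and the `agent_selected` name are assumptions; the commit only says the filter may be absent, set to "all", or contain the agent name.

```python
def agent_selected(agent, agents_filter):
    """Decide whether an agent belongs in the matrix.

    Assumes the filter is a comma-separated list of agent names.
    """
    wanted = {
        name.strip().lower()
        for name in (agents_filter or "").split(",")
        if name.strip()
    }
    # An absent or empty filter, or the keyword "all", selects every agent.
    if not wanted or "all" in wanted:
        return True
    return agent.lower() in wanted


all_agents = ["claude", "codex", "gemini", "opencode", "droid"]
print([a for a in all_agents if agent_selected(a, "claude")])
print([a for a in all_agents if agent_selected(a, "claude, droid")])
print([a for a in all_agents if agent_selected(a, None)])
```

Routing droid through the same predicate as the npm agents avoids the unconditional append that the commit fixes.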

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

jwiegley and others added 2 commits March 10, 2026 09:24
The bare `.vscode` entry would silently hide any new files added under
.vscode/ from `git status`, requiring `git add -f` to track them, and
misleads contributors into thinking the whole directory should be
untracked. Replace it with `.vscode/*` + `!.vscode/settings.json` so
that the tracked project settings file remains visible while any other
editor-local files are still ignored.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Every matrix cell in the E2E install workflow now runs three additional
phases after verifying agent hook configuration:

1. Simulate an AI commit — create a test git repo, wire the git→git-ai
   proxy symlink and post-commit hook (via `git-ai install`), then feed
   synthetic checkpoint data through `git-ai checkpoint agent-v1` and
   commit, exactly as the nightly Tier 1 tests do.

2. Verify attribution tracking — new script
   `scripts/nightly/verify-synthetic-attribution.sh` checks:
   - Authorship note exists on HEAD (post-commit hook fired)
   - Note contains parseable JSON with schema_version = authorship/3.0.0
   - At least one prompt session was recorded (prompt stored)
   - At least one transcript message was captured
   - `git-ai stats HEAD --json` shows ai_additions > 0
   - Test file appears in the note's attestation section
   - `git-ai blame` shows AI attribution markers

3. Upload results artifact for every matrix cell (always).

Windows job mirrors the Unix flow using PowerShell: copies git-ai.exe
as git.exe (proxy without requiring developer mode for symlinks), builds
the checkpoint JSON via ConvertTo-Json, and performs the same 8
attribution checks inline.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

devin-ai-integration bot (Contributor) left a comment

Devin Review found 1 new potential issue.

View 35 additional findings in Devin Review.

Comment on lines +418 to +419
$lines | Set-Content -Path $log
Write-Log "=== Synthetic attribution verification COMPLETE: $agent ==="

🟡 Windows verification log file written before final Write-Log call, losing the COMPLETE message

In the Windows "Verify attribution pipeline" step, $lines | Set-Content -Path $log at line 418 writes the log file to disk, but then Write-Log is called at line 419 which appends to $lines (via $lines.Add($msg)) after the file was already written. The "COMPLETE" message is printed to stdout via Write-Host but is missing from the log file that gets uploaded as an artifact. The pattern used for all other failure paths writes $lines | Set-Content then throws, but the happy-path final write was placed before the last log line.

Mismatched write sequence at lines 418-419

Line 418: $lines | Set-Content -Path $log (file written)
Line 419: Write-Log "=== Synthetic attribution verification COMPLETE: $agent ===" (adds to $lines AND Write-Host, but file already on disk)

Suggested change
- $lines | Set-Content -Path $log
- Write-Log "=== Synthetic attribution verification COMPLETE: $agent ==="
+ Write-Log "=== Synthetic attribution verification COMPLETE: $agent ==="
+ $lines | Set-Content -Path $log

jwiegley and others added 2 commits March 10, 2026 13:16
The agent-v1 checkpoint format stores an empty messages[] because
conversation transcripts are only captured by live agent hooks, not
synthetic checkpoints. This is expected behaviour — downgrade the
check from a hard failure to a warning, consistent with how
verify-attribution.sh handles the same condition for live agent runs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Same fix as the bash script: synthetic checkpoints don't store
conversation messages, so this should be a warning not a failure.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>