docs: CI performance and warm Docker CI research#613
docs: CI performance and warm Docker CI research#613
Conversation
Research into current Depot CI performance (latency breakdown, runner tier comparison, path-gated optimizations) and a proposed warm Docker CI setup on Hetzner (sidecar containers, lockfile-hash caching, Playwright parallelism, Cloud vs Dedicated comparison). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
📦 TestPyPI package publishedpip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.12.dev22636557718or with uv: uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.12.12.dev22636557718MCP server for Claude Codeclaude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.12.12.dev22636557718" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 18a7fbd4de
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| pytest -vv tests/unit/ & | ||
| (cd packages/buckaroo-js-core && pnpm test) & | ||
| wait |
There was a problem hiding this comment.
Propagate background test failures in trigger script
This script backgrounds both test commands and then calls wait with no job IDs; in Bash, help wait states that when no IDs are given it waits for all children and returns status 0, so a failing pytest or pnpm test can still produce an overall success status. In the CI-trigger context here, that creates false-green runs and can let broken commits appear healthy.
Useful? React with 👍 / 👎.
|
|
||
| # 1. Activate rescue system (~5s API call) | ||
| curl -s -u "$AUTH" "$API/boot/$SERVER_NUM/rescue" \ | ||
| -d "os=linux&authorized_key[]=$SSH_FINGERPRINT" |
There was a problem hiding this comment.
Define SSH key variable before invoking Robot rescue API
The rebuild script uses authorized_key[]=$SSH_FINGERPRINT but never initializes SSH_FINGERPRINT, so running the snippet as written sends an empty key and the later SSH wait loops cannot authenticate to the rescue system. This makes the documented wipe/reprovision flow fail unless callers add hidden external setup.
Useful? React with 👍 / 👎.
- Pin uv/node/pnpm versions (don't track releases, bump when needed) - Bump Node 20 → 22 LTS - Add HETZNER_SERVER_ID/IP to .env.example - Add development verification section (how Claude tests each script locally) - Add monitoring & alerting section (health endpoint, systemd watchdog, disk hygiene, dead man's switch) - Expand testing & ongoing verification (Depot as canary, deprecation criteria) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds ci/hetzner/ with everything needed to run CI on a persistent CCX33: - Dockerfile: Ubuntu 24.04, uv 0.6.6, Python 3.11-3.14, Node 22 LTS, pnpm 9.10.0, all deps pre-installed, Playwright chromium - docker-compose.yml: warm sidecar container (sleep infinity), bind-mounts repo + logs, named volume for Playwright browsers - webhook.py: Flask on :9000, HMAC-SHA256, per-branch cancellation via pkill, /health + /logs/<sha> endpoints, systemd watchdog - run-ci.sh: 5-phase orchestrator (parallel lint+test-js+test-py-3.13 → build-wheel → sequential py 3.11/3.12/3.14 → parallel mcp+smoke → sequential playwright) with lockfile-aware dep skipping - lib/status.sh: GitHub commit status API helpers - lib/lockcheck.sh: SHA256 lockfile comparison, rebuilds deps only on change - cloud-init.yml: one-shot CCX33 provisioning - .env.example: template for required secrets Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add lib/status.sh (GitHub commit status API) and lib/lockcheck.sh (lockfile hash comparison for warm dep skipping). Unblock them from the lib/ gitignore rule which was intended for Python venv dirs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Remove owner:ci:ci from write_files (ci user doesn't exist yet at that stage) - Fix echo runcmd entry with colon causing YAML dict parse error - status.sh: skip GitHub API calls gracefully when GITHUB_TOKEN unset Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…it branch fix - Add build-essential + libffi-dev + libssl-dev so cffi can compile - cloud-init: clone --branch main (not default), add safe.directory Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e unused import - Dockerfile: git config --system safe.directory /repo so git checkout works inside the container (bind-mount owned by ci on host, root in container) - test_playwright_jupyter.sh: add --allow-root so JupyterLab starts as root - webhook.py: remove unused import signal Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… SHA Dockerfile COPYs ci/hetzner/run-ci.sh and lib/ into /opt/ci-runner/. run-ci.sh sources lib from CI_RUNNER_DIR (/opt/ci-runner/) instead of /repo/ci/hetzner/lib/, so they survive `git checkout <sha>` even when the SHA has no ci/hetzner/ directory (e.g. commits on main branch). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
job_lint_python was running uv sync --dev --no-install-project on the 3.13 venv, which strips --all-extras packages (e.g. pl-series-hash) because optional extras require the project to be installed. This ran in parallel with job_test_python_3.13, causing a race condition that randomly removed pl-series-hash from the venv before tests ran. ruff is already installed in the venv from the image build — no sync needed. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JupyterLab refuses to start as root without --allow-root. Rather than patching every test script, bake c.ServerApp.allow_root = True into /root/.jupyter/jupyter_lab_config.py in the image. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- mp_timeout tests: forkserver subprocess spawn takes >1s in Docker (timeout) - test_server_killed_on_parent_death: SIGKILL propagation differs in containers - Python 3.14.0a5: segfaults on pytest startup (CPython pre-release bug) All three disabled with a note to revisit once timing/stability is known. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documents all 9 bugs fixed during bringup, known Docker-incompatible tests (disabled), and final timing: 8m59s wall time, all jobs passing. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Each version has its own venv at /opt/venvs/3.11-3.14 — no shared state, safe to run concurrently. Saves ~70-80s wall time on CCX33. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Run 7 (warm, sequential Phase 3): 8m23s Run 8 (warm, parallel Phase 3): 7m21s — saves 1m07s Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All 5 jobs bind to distinct ports (6006/8701/2718/8765/8889) — no port conflicts. Redirect PLAYWRIGHT_HTML_OUTPUT_DIR per job to avoid playwright-report/ write collisions. Expected saving: ~3m. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Every run should collect mpstat data so we can correlate flakes with CPU contention. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Start pw-jupyter alongside other wheel-dependent jobs instead of waiting for pw-server/marimo/wasm-marimo to finish. With early warmup (exp 28) + window.jupyterapp kernel check (exp 21), pw-jupyter should be reliable under CPU contention. Also adds mpstat CPU sampling to every CI run. Expected: 2m25s → ~1m44s if pw-jupyter passes under contention. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mpstat not installed in container, vmstat is. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pw-jupyter passes 7/7 under CPU contention (40-75%) with window.jupyterapp + early warmup. Heavyweight PW gate was unnecessary — total CI drops from 2m25s to 1m43s (-42s). Also fixes CPU monitoring to use vmstat (mpstat not in container). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
With window.jupyterapp kernel check + early warmup + no heavyweight gate, CPU during pw-jupyter is only 6-20%. Increase from P=4 (3 batches: 4+4+1) to P=9 (1 batch: all 9 at once). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Revert PARALLEL=9 → 4 (P=9 too many processes, Exp 31 confirmed) - Move pw-wasm-marimo from Wave 0 to wheel-dependent (needs widget.js) - Only test-python-3.13 in Wave 0 for fast signal - Delay 3.11/3.12/3.14 by 5s after wheel-dependent jobs start to reduce CPU contention during PW job startup Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Exp 31: PARALLEL=9 still too slow (4m+), confirmed P=4 optimal. Exp 32: lean Wave 0 + defer pytest = 1m51s median, +8s vs Exp 30 (1m43s). Exp 30 remains best config. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- pw-jupyter starts first (critical path) at PARALLEL=6 - Other PW jobs staggered every 5s: marimo → wasm-marimo → server → pytest - Single JUPYTER_PARALLEL variable controls concurrency - Fine-grain CPU monitoring via /proc/stat at 100ms intervals Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…chdog - Extract warmup_one_kernel to top-level so it's available between batches - After shutdown_kernels_on_port, re-warm next batch's servers via WebSocket nudge (fixes batch 2 hang — kernels stuck in "starting" without nudge) - Add timeout 120 on pw-jupyter to prevent infinite hangs - Add 210s CI watchdog (kill -TERM 0) to cap total CI time - Add Exp 34 (early pnpm install) to future experiments Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert PARALLEL back to 6 and BASE_PORT to 8889. Add pre-run cleanup, between-batch re-warmup, and 120s/210s timeouts as permanent improvements. P=9 failed all 4 attempts (0s/1s/2s stagger, port 8900) due to CPU starvation: 9 servers + 9 kernels + 9 Chromiums = ~27 processes on 16 vCPU. P=6 batched (6+3) passes 9/9 notebooks in 66s. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
Context
Research/brainstorming docs, no code changes. Captures findings for future reference when implementing a faster CI setup.
🤖 Generated with Claude Code