Open
Conversation
Made-with: Cursor
- Add geak_v3_triton agent: full geak-preprocess + geak-orchestrate pipeline with patch application from worktree evaluation - Add 8 Triton eval kernels (L1/L2/L3) with harnesses and configs including compile_command for AKA evaluator compatibility - Add run_geak_triton.sh for dual-stream parallel execution - Add config files for Triton and HIP benchmark runs - Switch geak_v3 HIP agent from 'mini' to 'geak' entrypoint - Fix GPU baseline measurement: set HIP_VISIBLE_DEVICES during compilation and performance measurement - Register geak_v3_triton in module_registration.py Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- geak_v3_triton now calls `geak --kernel-url --harness` (same entrypoint as HIP/geak_v3) instead of separate preprocessor + orchestrator calls - Both Triton and HIP agents use the unified geak CLI - Update README with instructions for both HIP and Triton runs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- geak_v3 (HIP): now calls `geak --kernel-url <kernel> --eval "<commands>"` instead of `geak -t <task_prompt.md>` (Path B) - geak_v3_triton: uses `--eval <harness>` instead of `--harness` - Both agents use the same unified geak CLI with --eval auto-detection (file path → harness mode, shell commands → command mode) - Updated README with instructions for both HIP and Triton runs Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The geak CLI writes results directly to logs_dir/ (not logs_dir/preprocess/). The launcher was looking in logs_dir/preprocess/ which doesn't exist with the new unified CLI, causing _apply_best_patch to never run and all AKA speedups to show 0.0x despite real GEAK optimizations. Fix: check for final_report.json in logs_dir first, fall back to preprocess subdir for backward compatibility. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… score launch_agent.py: read full_benchmark.verified_speedup from round evaluation JSONs instead of benchmark_speedup. The select_agent score can be inflated (e.g. 2.53x) while the actual FULL_BENCHMARK verification shows regression (0.96x). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…arch Two fixes for _apply_best_patch(): 1. Worktree search: Use rglob() to find kernel.py recursively under slot_* dirs (files are nested at tasks/triton2triton/geak_eval/.../kernel.py) 2. Patch strip: Try -p1 through -p8 since GEAK patches have nested paths like a/tasks/triton2triton/geak_eval/L2/topk/kernel.py (-p6 needed) Previously all patches failed with "can't find file to patch" because -p1 only stripped the git a/ prefix, leaving the full tasks/... path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The .to('cuda') call on a requires_grad tensor creates a non-leaf tensor,
so .grad is never populated during backward(). Fixed by creating directly
on device='cuda'.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When AKA's performance_command parser fails to extract test cases from harness output, fall back to reading GEAK's final_report.json which contains already-verified baseline_ms, candidate_ms, and verified_speedup. This ensures speedup_ratio in task_result.yaml reflects GEAK's actual verified results instead of always being 0.0. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reports benchmark_speedup, best_task, best_round, and full round_history from GEAK's final_report.json so AKA captures both the task-local benchmark speedup and the verified FULL_BENCHMARK speedup per round. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pid_grid was a plain function called from @jit kernel — needs @triton.jit. EVEN_M_N heuristic was defined but never used in kernel body, and caused KeyError on newer Triton versions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consistent with GEAK's results.py — use whichever measurement is higher to avoid undercounting on noisy tiny-kernel benchmarks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consistent with GEAK — FULL_BENCHMARK verified_speedup is the independently reproducible ground truth. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 8 new kernel tasks from AIG-Eval (sdubagun/fix-kernel-harness-parity):
- L1: fused_append_shared_experts, mla_decode
- L2: rope
- L3: gemm, gemm_a16w16_atomic, fused_qk_rope_cache_mla,
fused_mxfp4_quant_moe_sort, fused_moe_mxfp4
New tasks use aiter_commit field in config.yaml for reproducible
benchmarks. When aiter_commit is present, AKA evaluator automatically
runs harness commands inside Docker via docker exec, with the correct
aiter version checked out.
Framework changes:
- evaluator_utils.py: add docker_container param to run_command(),
add checkout_aiter() for pinned aiter versions
- evaluator.py: thread docker_container through evaluate functions
- performance.py: thread docker_container + GEAK_RESULT_LATENCY_MS parsing
- main.py: detect aiter_commit, checkout aiter, pass docker_container
Backward compatible: existing 8 kernels run on host unchanged.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add ws_mem* to .gitignore and remove tracked workspace files. Update kernel task config prompts with architecture-specific guidance. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add do_task.sh, traj.json, baseline_metrics.json, and profile.json to prevent git diff pollution in GEAK worktrees. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HARNESS_SHAPES now equals ALL_SHAPES in 4 harness files (topk, moe_routing, fused_qkv_rope, gemm_a16wfp4). Workers optimize on the same shape set used for verification, eliminating subset-mismatch speedup drift. Iteration counts reduced to keep benchmark runtime comparable. Also includes launch_agent fixes: .gitignore before git init, best-verified-round patch selection, and geak_summary.json. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ensures --benchmark and --full-benchmark run on the identical shape set in every harness file. This eliminates shape-subset mismatch between task-local and verified speedups for all 15 remaining kernels. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add batch 1 and batch 2 slot configs for GEAK triton kernel optimization runs. Update README with per-slot kernel lists, launch commands, and monitoring instructions. Whitelist GEAK triton mem configs in .gitignore. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per INSTRUCTIONS.md, all harnesses must accept --iterations N to override benchmark iteration count. Ten harnesses only read GEAK_BENCHMARK_ITERATIONS from the environment and rejected the CLI flag, causing preprocessing baseline capture to fail and all overall_speedup values to be null during optimization. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three changes to align the Triton agent with the HIP agent pattern and make AKA parse GEAK results directly: 1. agents/geak_v3_triton/launch_agent.py: Replace two-step geak-preprocess + geak-orchestrate Python module invocation with a single `geak --kernel-url --test-command` CLI call. GEAK main auto-promotes the test command to harness mode. Keep workspace git init, robust multi-strip patch application, worktree kernel copy with correctness verification, and geak_env injection. 2. src/evaluator.py: Add _read_geak_results() with cascading read: final_report.json verified_speedup -> benchmark_speedup -> round_N_evaluation.json -> best_results.json (homogeneous) -> geak_summary.json. Check GEAK JSON first in evaluate_kernel() before running any commands. Backward-compatible for non-GEAK agents. 3. main.py: Skip AKA baseline measurement for triton_geak tasks since GEAK provides verified baseline/candidate/speedup. 4. agent_config.yaml: Add GEAK_MAX_ROUNDS=3 to geak_env. Made-with: Cursor
When GEAK's full_benchmark verification fails (large config sets timing out), the evaluator falls back to benchmark_speedup. Previously it used the final round's value, which can regress. Now scans all round evaluations and picks the highest benchmark_speedup. Example: mla_decode had round 2 = 1.09x but round 3 = 1.0x. The old code returned 1.0x; the fix returns 1.09x. Made-with: Cursor
When GEAK achieves optimization via config changes (not kernel code patches), it reports results in total_speedup as a string like "2.03x" and/or best_speedup as a float, but round_evaluation.benchmark_speedup may be None. The cascade now parses these fields as a fallback. Example: gemm kernel got 2.03x via config tuning but AKA reported 0.0x because it only checked numeric benchmark_speedup fields. Made-with: Cursor
When --iterations is inside argparse's mutually_exclusive_group with
--benchmark/--full-benchmark/--correctness/--profile, passing
`--benchmark --iterations 30` fails. GEAK's preprocessor does exactly
this during baseline capture, causing benchmark_baseline.txt to never
be written, which in turn prevents FULL_BENCHMARK verification during
round evaluation.
Fix: change `group.add_argument("--iterations", ...)` to
`parser.add_argument("--iterations", ...)` in 6 harnesses, matching
the pattern already used by the 17 working harnesses.
Affected: mla_decode, mla_prefill_reduce, rope, fused_mxfp4_quant_moe_sort,
gemm, nsa_forward.
Made-with: Cursor
…periments) Made-with: Cursor
- README: rewrite with correct unified geak CLI pipeline, defaults table, all 18 kernels with levels and @triton.jit status - config_geak_triton_slot1.yaml: 9 kernels (L1: 5, L2: 2, L3: 2) - config_geak_triton_slot2.yaml: 9 kernels (L1: 2, L2: 2, L3: 5) Defaults: heterogeneous mode (auto-detected for Triton), 5 rounds, working memory ON, model ensemble gpt-5.2 + claude-opus-4.6. Made-with: Cursor
Remove the GEAK JSON-only early-exit in evaluate_kernel() and the baseline skip for triton_geak tasks. For fair cross-agent comparison, ALL agents (Cursor, Claude Code, SWE Agent, GEAK HIP, GEAK Triton) must go through the same evaluation path: 1. AKA measures baseline (before agent) 2. Agent optimizes kernel 3. AKA re-evaluates: compile -> correctness -> performance 4. AKA computes speedup from its own measurements 5. GEAK JSON fallback only when AKA's performance parsing yields 0 This matches the HIP agent (geak_v3) evaluation pattern exactly. The GEAK JSON cascade (_read_geak_results) remains as a fallback at Step 3b when AKA's own performance measurement fails. Made-with: Cursor
Three configs had `open("kernel.py")` with unescaped double quotes
inside a double-quoted YAML string, causing `NameError: name 'kernel'
is not defined` when AKA runs the compile check. One config had
escaped double quotes which worked but was inconsistent.
All 4 now use single quotes: `open('kernel.py')` matching the 19
other kernel configs.
Affected: refk_identity, refk_mla_decode, refk_moe, refk_fp8_blockwise_mm.
Made-with: Cursor
_profile_indices used _n_all which was never defined, causing NameError at import time. This crashed AKA's baseline measurement (performance_command fails), resulting in base=0.0 and speedup=0.0. Add _n_all = len(ALL_SHAPES) or len(ALL_CONFIGS) before the _profile_indices line in moe_routing_sigmoid_top1, topk, fused_qkv_rope. Made-with: Cursor
1. Revert geak_v3 to match geak_benchmark branch exactly. The geak_v3 agent should remain unchanged across branches for fair cross-agent comparison. All HIP + existing triton (vllm, rocmbench) tasks work with the original geak_v3. 2. Make geak_v3_triton work with ALL triton tasks, not just harness-based ones. If harness_path exists in config and the file exists in workspace, use --test-command with the harness. Otherwise fall back to building a compile && correctness && performance command chain from the task config. This enables geak_v3_triton to optimize vllm/rocmbench triton kernels that use task_runner.py instead of a harness. 3. Change task_type from triton_geak to triton2triton in all 23 geak_eval kernel configs. This ensures other prompt-based agents (cursor, claude, single_llm_call) work with these tasks through the standard evaluation flow. Tested: - geak_eval harness task (gemm_a16w16_atomic): harness path resolves correctly, correctness passes, AKA baseline=0.2275ms, speedup=1.003x - vllm command task (triton_apply_grammar_bitmask): command chain builds correctly, compile passes, AKA baseline=0.0309ms, speedup=1.013x - geak_v3 files: identical to geak_benchmark branch (diff = 0) Made-with: Cursor
Simplify agent_config.yaml to sensible defaults that work out of the
box without any env var overrides:
- GEAK_MAX_ROUNDS=3 (heterogeneous multi-round optimization)
- Single model claude-opus-4.6 (no ensemble, fair comparison)
- timeout 3600s (matches HIP geak_v3)
- Remove unused orchestrate/preprocess/configs sections
The reviewer just needs:
agent: { template: geak_v3_triton }
tasks: [ triton2triton/... ]
No GEAK-specific flags needed.
Made-with: Cursor
Made-with: Cursor
Correctness was only testing 25 sampled configs while benchmark tested all configs. This allowed agents to produce kernels that pass correctness on the sample but crash on untested shapes during benchmark (e.g. mla_decode GPU memory fault on full 320 configs). Now correctness uses the same full config set as benchmark/full-benchmark, ensuring any shape-dependent bug is caught before performance measurement. Made-with: Cursor
The last-resort worktree scan was applying the first modified kernel it found (sorted by slot number), regardless of which strategy produced it. This caused regressions when a low-speedup slot's kernel was picked over a high-speedup slot's kernel. Example: gemm had slot_0 (1.0x strategy, regressed to 0.41x) and slot_3 (1.87x strategy). The old code applied slot_0 first. Fix: read best_results.json per strategy to get speedup, map to worktree slots via task logs, and try kernels in descending speedup order. The highest-speedup kernel that passes correctness gets applied. Made-with: Cursor
AKA set HIP_VISIBLE_DEVICES for baseline measurement but unset it before post-agent evaluation. This caused evaluate_kernel to run on all visible GPUs without pinning, leading to GPU contention with leftover GEAK worker processes and severely degraded performance measurements. Example: gemm_a16w16_atomic manual test shows 0.069ms (3.31x speedup) but AKA measured 0.969ms (0.24x) due to multi-GPU contention during the unpinned evaluation. Fix: set HIP_VISIBLE_DEVICES=baseline_gpu before evaluate_kernel() so both baseline and optimized measurements use the same single GPU. Made-with: Cursor
_apply_best_patch preferred verified_speedup over benchmark_speedup when selecting the best round. Due to GEAK's CWD bug (PR #118), verified_speedup is always ~1.0x regardless of actual improvement. This caused round selection to pick by noise (~1.007 vs ~1.003) instead of by actual optimization quality (1.07x vs 1.79x). Fix: use max(verified, benchmark) so the task-local benchmark_speedup (which correctly reflects the optimization) takes priority when verified_speedup is inaccurate. Made-with: Cursor
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds triton tasks to Arena.