Geak triton common benchmark #28

Open
iraj465 wants to merge 79 commits into main from geak-triton-common-benchmark
Conversation


@iraj465 iraj465 commented Apr 6, 2026

Adds Triton tasks to Arena.

yueliu14 and others added 30 commits March 8, 2026 21:52
- Add geak_v3_triton agent: full geak-preprocess + geak-orchestrate
  pipeline with patch application from worktree evaluation
- Add 8 Triton eval kernels (L1/L2/L3) with harnesses and configs
  including compile_command for AKA evaluator compatibility
- Add run_geak_triton.sh for dual-stream parallel execution
- Add config files for Triton and HIP benchmark runs
- Switch geak_v3 HIP agent from 'mini' to 'geak' entrypoint
- Fix GPU baseline measurement: set HIP_VISIBLE_DEVICES during
  compilation and performance measurement
- Register geak_v3_triton in module_registration.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- geak_v3_triton now calls `geak --kernel-url --harness` (same entrypoint
  as HIP/geak_v3) instead of separate preprocessor + orchestrator calls
- Both Triton and HIP agents use the unified geak CLI
- Update README with instructions for both HIP and Triton runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- geak_v3 (HIP): now calls `geak --kernel-url <kernel> --eval "<commands>"`
  instead of `geak -t <task_prompt.md>` (Path B)
- geak_v3_triton: uses `--eval <harness>` instead of `--harness`
- Both agents use the same unified geak CLI with --eval auto-detection
  (file path → harness mode, shell commands → command mode)
- Updated README with instructions for both HIP and Triton runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
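The `--eval` auto-detection described above can be sketched in a few lines (the function name here is hypothetical; the real CLI's logic may be more involved):

```python
import os

def detect_eval_mode(eval_arg: str) -> str:
    """Hypothetical sketch of --eval auto-detection: an existing file
    path selects harness mode, anything else is treated as a shell
    command chain."""
    if os.path.isfile(eval_arg):
        return "harness"
    return "command"
```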
The geak CLI writes results directly to logs_dir/ (not logs_dir/preprocess/).
The launcher was looking in logs_dir/preprocess/ which doesn't exist with the
new unified CLI, causing _apply_best_patch to never run and all AKA speedups
to show 0.0x despite real GEAK optimizations.

Fix: check for final_report.json in logs_dir first, fall back to preprocess
subdir for backward compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
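A minimal sketch of the lookup order described above (file and directory names are taken from this message; the function name is illustrative):

```python
from pathlib import Path
from typing import Optional

def find_final_report(logs_dir: str) -> Optional[Path]:
    """Check logs_dir first, since the unified geak CLI writes
    final_report.json there; fall back to the legacy preprocess/
    subdirectory for backward compatibility."""
    for candidate in (Path(logs_dir) / "final_report.json",
                      Path(logs_dir) / "preprocess" / "final_report.json"):
        if candidate.is_file():
            return candidate
    return None
```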
… score

launch_agent.py: read full_benchmark.verified_speedup from round
evaluation JSONs instead of benchmark_speedup. The select_agent
score can be inflated (e.g. 2.53x) while the actual FULL_BENCHMARK
verification shows regression (0.96x).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…arch

Two fixes for _apply_best_patch():
1. Worktree search: Use rglob() to find kernel.py recursively under
   slot_* dirs (files are nested at tasks/triton2triton/geak_eval/.../kernel.py)
2. Patch strip: Try -p1 through -p8 since GEAK patches have nested paths
   like a/tasks/triton2triton/geak_eval/L2/topk/kernel.py (-p6 needed)

Previously all patches failed with "can't find file to patch" because
-p1 only stripped the git a/ prefix, leaving the full tasks/... path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
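The multi-strip retry can be illustrated by probing which strip level makes a path from the diff resolve inside the worktree (a simplified stand-in for running `patch -pN`; names are hypothetical):

```python
from pathlib import Path
from typing import Optional

def choose_strip_level(patch_path: str, workdir: str,
                       max_strip: int = 8) -> Optional[int]:
    """Strip leading components from a path named in the diff until
    the result exists under the worktree, mirroring trying -p1
    through -p8. Returns the first working strip level, else None."""
    parts = patch_path.split("/")
    for strip in range(1, min(max_strip, len(parts)) + 1):
        if Path(workdir, *parts[strip:]).is_file():
            return strip
    return None
```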
The .to('cuda') call on a requires_grad tensor creates a non-leaf tensor,
so .grad is never populated during backward(). Fixed by creating directly
on device='cuda'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When AKA's performance_command parser fails to extract test cases from
harness output, fall back to reading GEAK's final_report.json which
contains already-verified baseline_ms, candidate_ms, and verified_speedup.

This ensures speedup_ratio in task_result.yaml reflects GEAK's actual
verified results instead of always being 0.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reports benchmark_speedup, best_task, best_round, and full round_history
from GEAK's final_report.json so AKA captures both the task-local
benchmark speedup and the verified FULL_BENCHMARK speedup per round.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pid_grid was a plain Python function called from a @triton.jit kernel; it
needs the @triton.jit decorator itself. The EVEN_M_N heuristic was defined
but never used in the kernel body, and caused a KeyError on newer Triton
versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consistent with GEAK's results.py — use whichever measurement is
higher to avoid undercounting on noisy tiny-kernel benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consistent with GEAK — FULL_BENCHMARK verified_speedup is the
independently reproducible ground truth.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 8 new kernel tasks from AIG-Eval (sdubagun/fix-kernel-harness-parity):
- L1: fused_append_shared_experts, mla_decode
- L2: rope
- L3: gemm, gemm_a16w16_atomic, fused_qk_rope_cache_mla,
      fused_mxfp4_quant_moe_sort, fused_moe_mxfp4

New tasks use aiter_commit field in config.yaml for reproducible
benchmarks. When aiter_commit is present, AKA evaluator automatically
runs harness commands inside Docker via docker exec, with the correct
aiter version checked out.

Framework changes:
- evaluator_utils.py: add docker_container param to run_command(),
  add checkout_aiter() for pinned aiter versions
- evaluator.py: thread docker_container through evaluate functions
- performance.py: thread docker_container + GEAK_RESULT_LATENCY_MS parsing
- main.py: detect aiter_commit, checkout aiter, pass docker_container

Backward compatible: existing 8 kernels run on host unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
iraj465 and others added 20 commits March 28, 2026 13:43
Add ws_mem* to .gitignore and remove tracked workspace files.
Update kernel task config prompts with architecture-specific guidance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add do_task.sh, traj.json, baseline_metrics.json, and profile.json
to prevent git diff pollution in GEAK worktrees.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HARNESS_SHAPES now equals ALL_SHAPES in 4 harness files (topk,
moe_routing, fused_qkv_rope, gemm_a16wfp4). Workers optimize on the
same shape set used for verification, eliminating subset-mismatch
speedup drift. Iteration counts reduced to keep benchmark runtime
comparable. Also includes launch_agent fixes: .gitignore before git
init, best-verified-round patch selection, and geak_summary.json.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ensures --benchmark and --full-benchmark run on the identical shape
set in every harness file. This eliminates shape-subset mismatch
between task-local and verified speedups for all 15 remaining kernels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add batch 1 and batch 2 slot configs for GEAK triton kernel
optimization runs. Update README with per-slot kernel lists,
launch commands, and monitoring instructions. Whitelist GEAK
triton mem configs in .gitignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per INSTRUCTIONS.md, all harnesses must accept --iterations N to
override benchmark iteration count. Ten harnesses only read
GEAK_BENCHMARK_ITERATIONS from the environment and rejected the CLI
flag, causing preprocessing baseline capture to fail and all
overall_speedup values to be null during optimization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Four changes to align the Triton agent with the HIP agent pattern
and make AKA parse GEAK results directly:

1. agents/geak_v3_triton/launch_agent.py: Replace two-step
   geak-preprocess + geak-orchestrate Python module invocation with
   a single `geak --kernel-url --test-command` CLI call. GEAK main
   auto-promotes the test command to harness mode. Keep workspace
   git init, robust multi-strip patch application, worktree kernel
   copy with correctness verification, and geak_env injection.

2. src/evaluator.py: Add _read_geak_results() with cascading read:
   final_report.json verified_speedup -> benchmark_speedup ->
   round_N_evaluation.json -> best_results.json (homogeneous) ->
   geak_summary.json. Check GEAK JSON first in evaluate_kernel()
   before running any commands. Backward-compatible for non-GEAK
   agents.

3. main.py: Skip AKA baseline measurement for triton_geak tasks
   since GEAK provides verified baseline/candidate/speedup.

4. agent_config.yaml: Add GEAK_MAX_ROUNDS=3 to geak_env.

Made-with: Cursor
When GEAK's full_benchmark verification fails (large config sets timing
out), the evaluator falls back to benchmark_speedup. Previously it used
the final round's value, which can regress. Now scans all round
evaluations and picks the highest benchmark_speedup.

Example: mla_decode had round 2 = 1.09x but round 3 = 1.0x. The old
code returned 1.0x; the fix returns 1.09x.

Made-with: Cursor
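A sketch of the scan, assuming round evaluations are JSON files named round_N_evaluation.json as mentioned in earlier commits (the function name is illustrative):

```python
import json
from pathlib import Path

def best_benchmark_speedup(logs_dir: str) -> float:
    """Scan every round evaluation and return the highest
    benchmark_speedup, rather than the final round's value, which
    can regress (e.g. round 2 = 1.09x, round 3 = 1.0x)."""
    best = 0.0
    for report in Path(logs_dir).glob("round_*_evaluation.json"):
        speedup = json.loads(report.read_text()).get("benchmark_speedup")
        if isinstance(speedup, (int, float)):
            best = max(best, float(speedup))
    return best
```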
When GEAK achieves optimization via config changes (not kernel code
patches), it reports results in total_speedup as a string like "2.03x"
and/or best_speedup as a float, but round_evaluation.benchmark_speedup
may be None. The cascade now parses these fields as a fallback.

Example: gemm kernel got 2.03x via config tuning but AKA reported 0.0x
because it only checked numeric benchmark_speedup fields.

Made-with: Cursor
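The fallback parsing can be sketched as follows (illustrative; the real cascade handles more fields):

```python
import re
from typing import Optional

def parse_speedup(value) -> Optional[float]:
    """Accept a numeric best_speedup or a total_speedup string like
    "2.03x" and normalize to float; return None when unparseable."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    if isinstance(value, str):
        m = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*[xX]?\s*", value)
        if m:
            return float(m.group(1))
    return None
```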
When --iterations is inside argparse's mutually_exclusive_group with
--benchmark/--full-benchmark/--correctness/--profile, passing
`--benchmark --iterations 30` fails. GEAK's preprocessor does exactly
this during baseline capture, causing benchmark_baseline.txt to never
be written, which in turn prevents FULL_BENCHMARK verification during
round evaluation.

Fix: change `group.add_argument("--iterations", ...)` to
`parser.add_argument("--iterations", ...)` in 6 harnesses, matching
the pattern already used by the 17 working harnesses.

Affected: mla_decode, mla_prefill_reduce, rope, fused_mxfp4_quant_moe_sort,
gemm, nsa_forward.

Made-with: Cursor
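A minimal reproduction of the argparse behavior, with hypothetical flag wiring modeled on this description:

```python
import argparse

def build_parser(iterations_in_group: bool) -> argparse.ArgumentParser:
    """When --iterations sits inside the mutually exclusive mode
    group, `--benchmark --iterations 30` is rejected; attached to
    the parser itself, the combination parses fine."""
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--benchmark", action="store_true")
    group.add_argument("--full-benchmark", action="store_true")
    group.add_argument("--correctness", action="store_true")
    target = group if iterations_in_group else parser
    target.add_argument("--iterations", type=int, default=100)
    return parser
```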
- README: rewrite with correct unified geak CLI pipeline, defaults
  table, all 18 kernels with levels and @triton.jit status
- config_geak_triton_slot1.yaml: 9 kernels (L1: 5, L2: 2, L3: 2)
- config_geak_triton_slot2.yaml: 9 kernels (L1: 2, L2: 2, L3: 5)

Defaults: heterogeneous mode (auto-detected for Triton), 5 rounds,
working memory ON, model ensemble gpt-5.2 + claude-opus-4.6.

Made-with: Cursor
Remove the GEAK JSON-only early-exit in evaluate_kernel() and the
baseline skip for triton_geak tasks. For fair cross-agent comparison,
ALL agents (Cursor, Claude Code, SWE Agent, GEAK HIP, GEAK Triton)
must go through the same evaluation path:

  1. AKA measures baseline (before agent)
  2. Agent optimizes kernel
  3. AKA re-evaluates: compile -> correctness -> performance
  4. AKA computes speedup from its own measurements
  5. GEAK JSON fallback only when AKA's performance parsing yields 0

This matches the HIP agent (geak_v3) evaluation pattern exactly.
The GEAK JSON cascade (_read_geak_results) remains as a fallback
at Step 3b when AKA's own performance measurement fails.

Made-with: Cursor
@irvineoy irvineoy self-requested a review April 7, 2026 19:18
iraj465 added 9 commits April 7, 2026 15:09
Three configs had `open("kernel.py")` with unescaped double quotes
inside a double-quoted YAML string, causing `NameError: name 'kernel'
is not defined` when AKA runs the compile check. One config had
escaped double quotes which worked but was inconsistent.

All 4 now use single quotes: `open('kernel.py')` matching the 19
other kernel configs.

Affected: refk_identity, refk_mla_decode, refk_moe, refk_fp8_blockwise_mm.
Made-with: Cursor
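The quoting fix can be illustrated with a hypothetical config.yaml entry (the actual compile check in these configs may differ):

```yaml
# Illustrative only; the real command may differ. The inner Python string
# uses single quotes, so it nests cleanly inside the double-quoted YAML
# scalar and reaches the shell intact:
compile_command: "python -c \"compile(open('kernel.py').read(), 'kernel.py', 'exec')\""
```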
_profile_indices used _n_all which was never defined, causing
NameError at import time. This crashed AKA's baseline measurement
(performance_command fails), resulting in base=0.0 and speedup=0.0.

Add _n_all = len(ALL_SHAPES) or len(ALL_CONFIGS) before the
_profile_indices line in moe_routing_sigmoid_top1, topk, fused_qkv_rope.

Made-with: Cursor
1. Revert geak_v3 to match geak_benchmark branch exactly. The geak_v3
   agent should remain unchanged across branches for fair cross-agent
   comparison. All HIP + existing triton (vllm, rocmbench) tasks work
   with the original geak_v3.

2. Make geak_v3_triton work with ALL triton tasks, not just harness-based
   ones. If harness_path exists in config and the file exists in workspace,
   use --test-command with the harness. Otherwise fall back to building a
   compile && correctness && performance command chain from the task config.
   This enables geak_v3_triton to optimize vllm/rocmbench triton kernels
   that use task_runner.py instead of a harness.

3. Change task_type from triton_geak to triton2triton in all 23 geak_eval
   kernel configs. This ensures other prompt-based agents (cursor, claude,
   single_llm_call) work with these tasks through the standard evaluation
   flow.

Tested:
- geak_eval harness task (gemm_a16w16_atomic): harness path resolves
  correctly, correctness passes, AKA baseline=0.2275ms, speedup=1.003x
- vllm command task (triton_apply_grammar_bitmask): command chain builds
  correctly, compile passes, AKA baseline=0.0309ms, speedup=1.013x
- geak_v3 files: identical to geak_benchmark branch (diff = 0)

Made-with: Cursor
Simplify agent_config.yaml to sensible defaults that work out of the
box without any env var overrides:

- GEAK_MAX_ROUNDS=3 (heterogeneous multi-round optimization)
- Single model claude-opus-4.6 (no ensemble, fair comparison)
- timeout 3600s (matches HIP geak_v3)
- Remove unused orchestrate/preprocess/configs sections

The reviewer just needs:
  agent: { template: geak_v3_triton }
  tasks: [ triton2triton/... ]
No GEAK-specific flags needed.

Made-with: Cursor
Correctness was only testing 25 sampled configs while benchmark tested
all configs. This allowed agents to produce kernels that pass correctness
on the sample but crash on untested shapes during benchmark (e.g.
mla_decode GPU memory fault on full 320 configs).

Now correctness uses the same full config set as benchmark/full-benchmark,
ensuring any shape-dependent bug is caught before performance measurement.

Made-with: Cursor
The last-resort worktree scan was applying the first modified kernel
it found (sorted by slot number), regardless of which strategy produced
it. This caused regressions when a low-speedup slot's kernel was picked
over a high-speedup slot's kernel.

Example: gemm had slot_0 (1.0x strategy, regressed to 0.41x) and
slot_3 (1.87x strategy). The old code applied slot_0 first.

Fix: read best_results.json per strategy to get speedup, map to
worktree slots via task logs, and try kernels in descending speedup
order. The highest-speedup kernel that passes correctness gets applied.

Made-with: Cursor
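The selection order can be sketched as follows (names are illustrative; the real code first maps strategies to worktree slots via task logs):

```python
from typing import Callable, Iterable, Optional, Tuple

def pick_best_kernel(candidates: Iterable[Tuple[float, str]],
                     passes_correctness: Callable[[str], bool]) -> Optional[str]:
    """Try candidate (speedup, kernel_path) pairs in descending
    speedup order; apply the first one that passes correctness."""
    for _speedup, kernel_path in sorted(candidates,
                                        key=lambda c: c[0], reverse=True):
        if passes_correctness(kernel_path):
            return kernel_path
    return None
```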
AKA set HIP_VISIBLE_DEVICES for baseline measurement but unset it
before post-agent evaluation. This caused evaluate_kernel to run
on all visible GPUs without pinning, leading to GPU contention with
leftover GEAK worker processes and severely degraded performance
measurements.

Example: gemm_a16w16_atomic manual test shows 0.069ms (3.31x speedup)
but AKA measured 0.969ms (0.24x) due to multi-GPU contention during
the unpinned evaluation.

Fix: set HIP_VISIBLE_DEVICES=baseline_gpu before evaluate_kernel()
so both baseline and optimized measurements use the same single GPU.

Made-with: Cursor
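The pinning can be sketched as a context manager, assuming the evaluator sets the variable in its own process environment (illustrative, not the actual AKA code):

```python
import os
from contextlib import contextmanager

@contextmanager
def pinned_gpu(gpu_id: str, var: str = "HIP_VISIBLE_DEVICES"):
    """Pin baseline and post-agent measurement to the same single GPU
    so leftover worker processes on other devices cannot skew timings;
    restores the previous value on exit."""
    previous = os.environ.get(var)
    os.environ[var] = gpu_id
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(var, None)
        else:
            os.environ[var] = previous
```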
_apply_best_patch preferred verified_speedup over benchmark_speedup
when selecting the best round. Due to GEAK's CWD bug (PR #118),
verified_speedup is always ~1.0x regardless of actual improvement.
This caused round selection to pick by noise (~1.007 vs ~1.003)
instead of by actual optimization quality (1.07x vs 1.79x).

Fix: use max(verified, benchmark) so the task-local benchmark_speedup
(which correctly reflects the optimization) takes priority when
verified_speedup is inaccurate.

Made-with: Cursor
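The round-selection metric described here can be sketched as follows (key names follow the JSON fields mentioned in earlier commits; the helper is illustrative):

```python
def round_score(round_eval: dict) -> float:
    """Score a round by max(verified_speedup, benchmark_speedup),
    treating missing values as 0.0, so a flat ~1.0x verified figure
    cannot mask a real task-local improvement."""
    verified = round_eval.get("verified_speedup") or 0.0
    benchmark = round_eval.get("benchmark_speedup") or 0.0
    return max(float(verified), float(benchmark))
```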