Geak triton common benchmark #28

Open
iraj465 wants to merge 79 commits into main from geak-triton-common-benchmark
Conversation


@iraj465 iraj465 commented Apr 6, 2026

Adds Triton tasks to Arena.

yueliu14 and others added 30 commits March 8, 2026 21:52
- Add geak_v3_triton agent: full geak-preprocess + geak-orchestrate
  pipeline with patch application from worktree evaluation
- Add 8 Triton eval kernels (L1/L2/L3) with harnesses and configs
  including compile_command for AKA evaluator compatibility
- Add run_geak_triton.sh for dual-stream parallel execution
- Add config files for Triton and HIP benchmark runs
- Switch geak_v3 HIP agent from 'mini' to 'geak' entrypoint
- Fix GPU baseline measurement: set HIP_VISIBLE_DEVICES during
  compilation and performance measurement
- Register geak_v3_triton in module_registration.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- geak_v3_triton now calls `geak --kernel-url --harness` (same entrypoint
  as HIP/geak_v3) instead of separate preprocessor + orchestrator calls
- Both Triton and HIP agents use the unified geak CLI
- Update README with instructions for both HIP and Triton runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- geak_v3 (HIP): now calls `geak --kernel-url <kernel> --eval "<commands>"`
  instead of `geak -t <task_prompt.md>` (Path B)
- geak_v3_triton: uses `--eval <harness>` instead of `--harness`
- Both agents use the same unified geak CLI with --eval auto-detection
  (file path → harness mode, shell commands → command mode)
- Updated README with instructions for both HIP and Triton runs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
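The `--eval` auto-detection described above can be sketched in a few lines (the function name here is hypothetical; the real CLI's logic may be more involved):

```python
import os

def detect_eval_mode(eval_arg: str) -> str:
    """Hypothetical sketch of --eval auto-detection: an existing file
    path selects harness mode, anything else is treated as a shell
    command chain."""
    if os.path.isfile(eval_arg):
        return "harness"
    return "command"
```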
The geak CLI writes results directly to logs_dir/ (not logs_dir/preprocess/).
The launcher was looking in logs_dir/preprocess/ which doesn't exist with the
new unified CLI, causing _apply_best_patch to never run and all AKA speedups
to show 0.0x despite real GEAK optimizations.

Fix: check for final_report.json in logs_dir first, fall back to preprocess
subdir for backward compatibility.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
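A minimal sketch of the lookup order described above (file and directory names are taken from this message; the function name is illustrative):

```python
from pathlib import Path
from typing import Optional

def find_final_report(logs_dir: str) -> Optional[Path]:
    """Check logs_dir first, since the unified geak CLI writes
    final_report.json there; fall back to the legacy preprocess/
    subdirectory for backward compatibility."""
    for candidate in (Path(logs_dir) / "final_report.json",
                      Path(logs_dir) / "preprocess" / "final_report.json"):
        if candidate.is_file():
            return candidate
    return None
```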
… score

launch_agent.py: read full_benchmark.verified_speedup from round
evaluation JSONs instead of benchmark_speedup. The select_agent
score can be inflated (e.g. 2.53x) while the actual FULL_BENCHMARK
verification shows regression (0.96x).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…arch

Two fixes for _apply_best_patch():
1. Worktree search: Use rglob() to find kernel.py recursively under
   slot_* dirs (files are nested at tasks/triton2triton/geak_eval/.../kernel.py)
2. Patch strip: Try -p1 through -p8 since GEAK patches have nested paths
   like a/tasks/triton2triton/geak_eval/L2/topk/kernel.py (-p6 needed)

Previously all patches failed with "can't find file to patch" because
-p1 only stripped the git a/ prefix, leaving the full tasks/... path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
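The multi-strip retry can be illustrated by probing which strip level makes a path from the diff resolve inside the worktree (a simplified stand-in for running `patch -pN`; names are hypothetical):

```python
from pathlib import Path
from typing import Optional

def choose_strip_level(patch_path: str, workdir: str,
                       max_strip: int = 8) -> Optional[int]:
    """Strip leading components from a path named in the diff until
    the result exists under the worktree, mirroring trying -p1
    through -p8. Returns the first working strip level, else None."""
    parts = patch_path.split("/")
    for strip in range(1, min(max_strip, len(parts)) + 1):
        if Path(workdir, *parts[strip:]).is_file():
            return strip
    return None
```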
The .to('cuda') call on a requires_grad tensor creates a non-leaf tensor,
so .grad is never populated during backward(). Fixed by creating directly
on device='cuda'.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When AKA's performance_command parser fails to extract test cases from
harness output, fall back to reading GEAK's final_report.json which
contains already-verified baseline_ms, candidate_ms, and verified_speedup.

This ensures speedup_ratio in task_result.yaml reflects GEAK's actual
verified results instead of always being 0.0.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reports benchmark_speedup, best_task, best_round, and full round_history
from GEAK's final_report.json so AKA captures both the task-local
benchmark speedup and the verified FULL_BENCHMARK speedup per round.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
pid_grid was a plain Python function called from a @triton.jit kernel; it
needs the @triton.jit decorator itself. The EVEN_M_N heuristic was defined
but never used in the kernel body, and caused a KeyError on newer Triton
versions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consistent with GEAK's results.py — use whichever measurement is
higher to avoid undercounting on noisy tiny-kernel benchmarks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Consistent with GEAK — FULL_BENCHMARK verified_speedup is the
independently reproducible ground truth.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add 8 new kernel tasks from AIG-Eval (sdubagun/fix-kernel-harness-parity):
- L1: fused_append_shared_experts, mla_decode
- L2: rope
- L3: gemm, gemm_a16w16_atomic, fused_qk_rope_cache_mla,
      fused_mxfp4_quant_moe_sort, fused_moe_mxfp4

New tasks use aiter_commit field in config.yaml for reproducible
benchmarks. When aiter_commit is present, AKA evaluator automatically
runs harness commands inside Docker via docker exec, with the correct
aiter version checked out.

Framework changes:
- evaluator_utils.py: add docker_container param to run_command(),
  add checkout_aiter() for pinned aiter versions
- evaluator.py: thread docker_container through evaluate functions
- performance.py: thread docker_container + GEAK_RESULT_LATENCY_MS parsing
- main.py: detect aiter_commit, checkout aiter, pass docker_container

Backward compatible: existing 8 kernels run on host unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
iraj465 and others added 20 commits March 28, 2026 13:43
Add ws_mem* to .gitignore and remove tracked workspace files.
Update kernel task config prompts with architecture-specific guidance.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add do_task.sh, traj.json, baseline_metrics.json, and profile.json
to prevent git diff pollution in GEAK worktrees.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
HARNESS_SHAPES now equals ALL_SHAPES in 4 harness files (topk,
moe_routing, fused_qkv_rope, gemm_a16wfp4). Workers optimize on the
same shape set used for verification, eliminating subset-mismatch
speedup drift. Iteration counts reduced to keep benchmark runtime
comparable. Also includes launch_agent fixes: .gitignore before git
init, best-verified-round patch selection, and geak_summary.json.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Ensures --benchmark and --full-benchmark run on the identical shape
set in every harness file. This eliminates shape-subset mismatch
between task-local and verified speedups for all 15 remaining kernels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add batch 1 and batch 2 slot configs for GEAK triton kernel
optimization runs. Update README with per-slot kernel lists,
launch commands, and monitoring instructions. Whitelist GEAK
triton mem configs in .gitignore.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Per INSTRUCTIONS.md, all harnesses must accept --iterations N to
override benchmark iteration count. Ten harnesses only read
GEAK_BENCHMARK_ITERATIONS from the environment and rejected the CLI
flag, causing preprocessing baseline capture to fail and all
overall_speedup values to be null during optimization.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Four changes to align the Triton agent with the HIP agent pattern
and make AKA parse GEAK results directly:

1. agents/geak_v3_triton/launch_agent.py: Replace two-step
   geak-preprocess + geak-orchestrate Python module invocation with
   a single `geak --kernel-url --test-command` CLI call. GEAK main
   auto-promotes the test command to harness mode. Keep workspace
   git init, robust multi-strip patch application, worktree kernel
   copy with correctness verification, and geak_env injection.

2. src/evaluator.py: Add _read_geak_results() with cascading read:
   final_report.json verified_speedup -> benchmark_speedup ->
   round_N_evaluation.json -> best_results.json (homogeneous) ->
   geak_summary.json. Check GEAK JSON first in evaluate_kernel()
   before running any commands. Backward-compatible for non-GEAK
   agents.

3. main.py: Skip AKA baseline measurement for triton_geak tasks
   since GEAK provides verified baseline/candidate/speedup.

4. agent_config.yaml: Add GEAK_MAX_ROUNDS=3 to geak_env.

Made-with: Cursor
When GEAK's full_benchmark verification fails (large config sets timing
out), the evaluator falls back to benchmark_speedup. Previously it used
the final round's value, which can regress. Now scans all round
evaluations and picks the highest benchmark_speedup.

Example: mla_decode had round 2 = 1.09x but round 3 = 1.0x. The old
code returned 1.0x; the fix returns 1.09x.

Made-with: Cursor
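A sketch of the scan, assuming round evaluations are JSON files named round_N_evaluation.json as mentioned in earlier commits (the function name is illustrative):

```python
import json
from pathlib import Path

def best_benchmark_speedup(logs_dir: str) -> float:
    """Scan every round evaluation and return the highest
    benchmark_speedup, rather than the final round's value, which
    can regress (e.g. round 2 = 1.09x, round 3 = 1.0x)."""
    best = 0.0
    for report in Path(logs_dir).glob("round_*_evaluation.json"):
        speedup = json.loads(report.read_text()).get("benchmark_speedup")
        if isinstance(speedup, (int, float)):
            best = max(best, float(speedup))
    return best
```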
When GEAK achieves optimization via config changes (not kernel code
patches), it reports results in total_speedup as a string like "2.03x"
and/or best_speedup as a float, but round_evaluation.benchmark_speedup
may be None. The cascade now parses these fields as a fallback.

Example: gemm kernel got 2.03x via config tuning but AKA reported 0.0x
because it only checked numeric benchmark_speedup fields.

Made-with: Cursor
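The fallback parsing can be sketched as follows (illustrative; the real cascade handles more fields):

```python
import re
from typing import Optional

def parse_speedup(value) -> Optional[float]:
    """Accept a numeric best_speedup or a total_speedup string like
    "2.03x" and normalize to float; return None when unparseable."""
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    if isinstance(value, str):
        m = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*[xX]?\s*", value)
        if m:
            return float(m.group(1))
    return None
```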
When --iterations is inside argparse's mutually_exclusive_group with
--benchmark/--full-benchmark/--correctness/--profile, passing
`--benchmark --iterations 30` fails. GEAK's preprocessor does exactly
this during baseline capture, causing benchmark_baseline.txt to never
be written, which in turn prevents FULL_BENCHMARK verification during
round evaluation.

Fix: change `group.add_argument("--iterations", ...)` to
`parser.add_argument("--iterations", ...)` in 6 harnesses, matching
the pattern already used by the 17 working harnesses.

Affected: mla_decode, mla_prefill_reduce, rope, fused_mxfp4_quant_moe_sort,
gemm, nsa_forward.

Made-with: Cursor
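A minimal reproduction of the argparse behavior, with hypothetical flag wiring modeled on this description:

```python
import argparse

def build_parser(iterations_in_group: bool) -> argparse.ArgumentParser:
    """When --iterations sits inside the mutually exclusive mode
    group, `--benchmark --iterations 30` is rejected; attached to
    the parser itself, the combination parses fine."""
    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--benchmark", action="store_true")
    group.add_argument("--full-benchmark", action="store_true")
    group.add_argument("--correctness", action="store_true")
    target = group if iterations_in_group else parser
    target.add_argument("--iterations", type=int, default=100)
    return parser
```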
- README: rewrite with correct unified geak CLI pipeline, defaults
  table, all 18 kernels with levels and @triton.jit status
- config_geak_triton_slot1.yaml: 9 kernels (L1: 5, L2: 2, L3: 2)
- config_geak_triton_slot2.yaml: 9 kernels (L1: 2, L2: 2, L3: 5)

Defaults: heterogeneous mode (auto-detected for Triton), 5 rounds,
working memory ON, model ensemble gpt-5.2 + claude-opus-4.6.

Made-with: Cursor
Remove the GEAK JSON-only early-exit in evaluate_kernel() and the
baseline skip for triton_geak tasks. For fair cross-agent comparison,
ALL agents (Cursor, Claude Code, SWE Agent, GEAK HIP, GEAK Triton)
must go through the same evaluation path:

  1. AKA measures baseline (before agent)
  2. Agent optimizes kernel
  3. AKA re-evaluates: compile -> correctness -> performance
  4. AKA computes speedup from its own measurements
  5. GEAK JSON fallback only when AKA's performance parsing yields 0

This matches the HIP agent (geak_v3) evaluation pattern exactly.
The GEAK JSON cascade (_read_geak_results) remains as a fallback
at Step 3b when AKA's own performance measurement fails.

Made-with: Cursor
@irvineoy irvineoy self-requested a review April 7, 2026 19:18
iraj465 added 9 commits April 7, 2026 15:09
Three configs had `open("kernel.py")` with unescaped double quotes
inside a double-quoted YAML string, causing `NameError: name 'kernel'
is not defined` when AKA runs the compile check. One config had
escaped double quotes which worked but was inconsistent.

All 4 now use single quotes: `open('kernel.py')` matching the 19
other kernel configs.

Affected: refk_identity, refk_mla_decode, refk_moe, refk_fp8_blockwise_mm.
Made-with: Cursor
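The quoting fix can be illustrated with a hypothetical config.yaml entry (the actual compile check in these configs may differ):

```yaml
# Illustrative only; the real command may differ. The inner Python string
# uses single quotes, so it nests cleanly inside the double-quoted YAML
# scalar and reaches the shell intact:
compile_command: "python -c \"compile(open('kernel.py').read(), 'kernel.py', 'exec')\""
```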
_profile_indices used _n_all which was never defined, causing
NameError at import time. This crashed AKA's baseline measurement
(performance_command fails), resulting in base=0.0 and speedup=0.0.

Add _n_all = len(ALL_SHAPES) or len(ALL_CONFIGS) before the
_profile_indices line in moe_routing_sigmoid_top1, topk, fused_qkv_rope.

Made-with: Cursor
1. Revert geak_v3 to match geak_benchmark branch exactly. The geak_v3
   agent should remain unchanged across branches for fair cross-agent
   comparison. All HIP + existing triton (vllm, rocmbench) tasks work
   with the original geak_v3.

2. Make geak_v3_triton work with ALL triton tasks, not just harness-based
   ones. If harness_path exists in config and the file exists in workspace,
   use --test-command with the harness. Otherwise fall back to building a
   compile && correctness && performance command chain from the task config.
   This enables geak_v3_triton to optimize vllm/rocmbench triton kernels
   that use task_runner.py instead of a harness.

3. Change task_type from triton_geak to triton2triton in all 23 geak_eval
   kernel configs. This ensures other prompt-based agents (cursor, claude,
   single_llm_call) work with these tasks through the standard evaluation
   flow.

Tested:
- geak_eval harness task (gemm_a16w16_atomic): harness path resolves
  correctly, correctness passes, AKA baseline=0.2275ms, speedup=1.003x
- vllm command task (triton_apply_grammar_bitmask): command chain builds
  correctly, compile passes, AKA baseline=0.0309ms, speedup=1.013x
- geak_v3 files: identical to geak_benchmark branch (diff = 0)

Made-with: Cursor
Simplify agent_config.yaml to sensible defaults that work out of the
box without any env var overrides:

- GEAK_MAX_ROUNDS=3 (heterogeneous multi-round optimization)
- Single model claude-opus-4.6 (no ensemble, fair comparison)
- timeout 3600s (matches HIP geak_v3)
- Remove unused orchestrate/preprocess/configs sections

The reviewer just needs:
  agent: { template: geak_v3_triton }
  tasks: [ triton2triton/... ]
No GEAK-specific flags needed.

Made-with: Cursor
Correctness was only testing 25 sampled configs while benchmark tested
all configs. This allowed agents to produce kernels that pass correctness
on the sample but crash on untested shapes during benchmark (e.g.
mla_decode GPU memory fault on full 320 configs).

Now correctness uses the same full config set as benchmark/full-benchmark,
ensuring any shape-dependent bug is caught before performance measurement.

Made-with: Cursor
The last-resort worktree scan was applying the first modified kernel
it found (sorted by slot number), regardless of which strategy produced
it. This caused regressions when a low-speedup slot's kernel was picked
over a high-speedup slot's kernel.

Example: gemm had slot_0 (1.0x strategy, regressed to 0.41x) and
slot_3 (1.87x strategy). The old code applied slot_0 first.

Fix: read best_results.json per strategy to get speedup, map to
worktree slots via task logs, and try kernels in descending speedup
order. The highest-speedup kernel that passes correctness gets applied.

Made-with: Cursor
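The selection order can be sketched as follows (names are illustrative; the real code first maps strategies to worktree slots via task logs):

```python
from typing import Callable, Iterable, Optional, Tuple

def pick_best_kernel(candidates: Iterable[Tuple[float, str]],
                     passes_correctness: Callable[[str], bool]) -> Optional[str]:
    """Try candidate (speedup, kernel_path) pairs in descending
    speedup order; apply the first one that passes correctness."""
    for _speedup, kernel_path in sorted(candidates,
                                        key=lambda c: c[0], reverse=True):
        if passes_correctness(kernel_path):
            return kernel_path
    return None
```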
AKA set HIP_VISIBLE_DEVICES for baseline measurement but unset it
before post-agent evaluation. This caused evaluate_kernel to run
on all visible GPUs without pinning, leading to GPU contention with
leftover GEAK worker processes and severely degraded performance
measurements.

Example: gemm_a16w16_atomic manual test shows 0.069ms (3.31x speedup)
but AKA measured 0.969ms (0.24x) due to multi-GPU contention during
the unpinned evaluation.

Fix: set HIP_VISIBLE_DEVICES=baseline_gpu before evaluate_kernel()
so both baseline and optimized measurements use the same single GPU.

Made-with: Cursor
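The pinning can be sketched as a context manager, assuming the evaluator sets the variable in its own process environment (illustrative, not the actual AKA code):

```python
import os
from contextlib import contextmanager

@contextmanager
def pinned_gpu(gpu_id: str, var: str = "HIP_VISIBLE_DEVICES"):
    """Pin baseline and post-agent measurement to the same single GPU
    so leftover worker processes on other devices cannot skew timings;
    restores the previous value on exit."""
    previous = os.environ.get(var)
    os.environ[var] = gpu_id
    try:
        yield
    finally:
        if previous is None:
            os.environ.pop(var, None)
        else:
            os.environ[var] = previous
```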
_apply_best_patch preferred verified_speedup over benchmark_speedup
when selecting the best round. Due to GEAK's CWD bug (PR #118),
verified_speedup is always ~1.0x regardless of actual improvement.
This caused round selection to pick by noise (~1.007 vs ~1.003)
instead of by actual optimization quality (1.07x vs 1.79x).

Fix: use max(verified, benchmark) so the task-local benchmark_speedup
(which correctly reflects the optimization) takes priority when
verified_speedup is inaccurate.

Made-with: Cursor
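The round-selection metric described here can be sketched as follows (key names follow the JSON fields mentioned in earlier commits; the helper is illustrative):

```python
def round_score(round_eval: dict) -> float:
    """Score a round by max(verified_speedup, benchmark_speedup),
    treating missing values as 0.0, so a flat ~1.0x verified figure
    cannot mask a real task-local improvement."""
    verified = round_eval.get("verified_speedup") or 0.0
    benchmark = round_eval.get("benchmark_speedup") or 0.0
    return max(float(verified), float(benchmark))
```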