[Gluon] [Triton] [MI450] [MI350] Enable Unified Attention option for decode#566
This PR enables the Unified Attention option for decode (Triton for MI350 and Gluon for MI450).
The MI350 implementation is included only to verify the results on that platform; the primary purpose of this PR is to enable Gluon Unified Attention for decode on MI450.
Triton/Gluon Unified Attention currently supports shuffling of both the Key and the Value cache, so I also edited atom/model_engine/model_runner.py to fix the layout of the Value cache. I added an env var ATOM_ENABLE_TRITON_UNIFIED_ATTENTION_DECODE so the user can toggle between Unified Attention and Gluon Paged Attention. The block size has to be fixed to 64 for a BF16 KV cache and 128 for an FP8 KV cache, so I also added that switch in atom/model_ops/attentions/aiter_attention.py (see the sketch below).

This PR depends on ROCm/aiter#2472.
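For reference, a minimal sketch of how the toggle and the block-size constraint could fit together. The env var name and the 64/128 values come from this PR; the helper names, the "1" enabling value, and the dtype strings are assumptions, not the actual code in aiter_attention.py:

```python
import os


def unified_attention_decode_enabled() -> bool:
    # Env var name is from this PR; treating "1" as the enabling value is an assumption.
    return os.environ.get("ATOM_ENABLE_TRITON_UNIFIED_ATTENTION_DECODE", "0") == "1"


def required_block_size(kv_cache_dtype: str) -> int:
    # Block-size constraint described above: 128 for an FP8 KV cache, 64 for BF16.
    return 128 if kv_cache_dtype.startswith("fp8") else 64


if __name__ == "__main__":
    if unified_attention_decode_enabled():
        print("Unified Attention decode, BF16 block size:", required_block_size("bf16"))
    else:
        print("Falling back to Gluon Paged Attention")
```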
Server command:
lm_eval results (TP1):
lm_eval results (TP8):