
[Gluon] [Triton] [MI450] [MI350] Enable Unified Attention option for decode#566

Open
k50112113 wants to merge 4 commits intomainfrom
shaoclee/ua3d-gfx12

Conversation

Contributor

@k50112113 k50112113 commented Apr 14, 2026

This PR enables the Unified Attention option for decode: Triton on MI350 and Gluon on MI450.

On MI350, the implementation only serves to verify the results; the primary purpose of this PR is to enable Gluon Unified Attention for decode on MI450.

Triton/Gluon Unified Attention currently supports shuffling of both the Key and the Value cache, so I also edited atom/model_engine/model_runner.py to fix the layout of the Value cache.

I added an env var ATOM_ENABLE_TRITON_UNIFIED_ATTENTION_DECODE so the user can toggle between Unified Attention and Gluon Paged Attention.
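As a rough illustration, the toggle could be read like this. Only the env var name comes from this PR; the helper and backend names are hypothetical, not ATOM's actual API:

```python
import os

def use_unified_attention_decode() -> bool:
    # ATOM_ENABLE_TRITON_UNIFIED_ATTENTION_DECODE=1 selects Unified Attention
    # for decode; unset or 0 keeps the default Gluon Paged Attention path.
    return os.environ.get("ATOM_ENABLE_TRITON_UNIFIED_ATTENTION_DECODE", "0") == "1"

def select_decode_backend() -> str:
    # Backend labels here are illustrative placeholders.
    return "unified_attention" if use_unified_attention_decode() else "gluon_paged_attention"
```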

The block size has to be fixed at 64 for a BF16 KV cache and 128 for an FP8 KV cache, so I also added the corresponding switch in atom/model_ops/attentions/aiter_attention.py.
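A minimal sketch of that constraint: only the 64/128 values come from this PR; the function name and dtype strings are illustrative, not the actual code in aiter_attention.py:

```python
def required_kv_block_size(kv_cache_dtype: str) -> int:
    # Unified Attention requires a fixed block size per KV-cache dtype:
    # 128 for FP8, 64 for BF16.
    if kv_cache_dtype == "fp8":
        return 128
    if kv_cache_dtype in ("bf16", "bfloat16"):
        return 64
    raise ValueError(f"unsupported kv_cache_dtype: {kv_cache_dtype}")
```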

This PR depends on ROCm/aiter#2472

Server command:

model_path="/data/openai/gpt-oss-120b"
export ATOM_ENABLE_TRITON_UNIFIED_ATTENTION_DECODE=1
python -m atom.entrypoints.openai_server \
  --model $model_path --kv_cache_dtype fp8 --block-size 1024 -tp 1 --torch-profiler-dir /app/_test/trace --mark-trace

lm_eval results (TP1):

local-completions ({'model': '/data/openai/gpt-oss-120b', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 64, 'max_retries': 1, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.4390|±  |0.0137|
|     |       |strict-match    |     3|exact_match|↑  |0.2259|±  |0.0115|

lm_eval results (TP8):

local-completions ({'model': '/data/openai/gpt-oss-120b', 'base_url': 'http://localhost:8000/v1/completions', 'num_concurrent': 64, 'max_retries': 1, 'tokenized_requests': False}), gen_kwargs: ({}), limit: None, num_fewshot: 3, batch_size: 1
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     3|exact_match|↑  |0.4405|±  |0.0137|
|     |       |strict-match    |     3|exact_match|↑  |0.2153|±  |0.0113|

@k50112113 k50112113 requested a review from valarLip April 16, 2026 15:49