
[SDPA][HipDNN] ASM kernel loading and dispatch #5686

Merged
AnaghaRaoAMD merged 5 commits into users/dahawkin/hipdnn-aiter-ck-spda-poc from user/anarao/kernelLoadDispatch
Mar 23, 2026

Conversation


@AnaghaRaoAMD AnaghaRaoAMD commented Mar 20, 2026

Motivation

Implements ASM kernel loading and dispatch

Technical Details

Kernel Execution Implementation

Implements the complete kernel loading and dispatch pipeline for the fwd_hd128_bf16_rtne.co ASM kernel:

  1. SdpaKernelPlan - Kernel execution state (26 member variables):

    • Stores kernel module/function handles
    • Tensor UIDs and metadata (dims, strides)
    • Attention scale
    • execute(): Populates fmha_fwd_v3_args (656 bytes) and launches via hipModuleLaunchKernel()
  2. SdpaKernelPlanBuilder - Plan creation:

    • buildPlan(): Loads kernel via hipModuleLoad() and extracts tensor metadata from graph
    • Parses Q/K/V tensor dimensions and strides from SDPA graph node
    • Computes attention scale (default: 1/√D_qk)
  3. Key Implementation Details:

    • Kernel launch: HIP_LAUNCH_PARAM mechanism for large arg structures
    • Grid dimensions: [ceil(S_q/256), H_q, B]
    • Block dimensions: [512, 1, 1] (fixed for this kernel)
    • Strides: Converted from elements to bytes (stride × 2 for BF16)
    • Module lifecycle: Loaded on plan build, unloaded in destructor
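
The launch-geometry and stride arithmetic listed above can be sketched as follows. This is a minimal self-contained illustration; the struct and function names are hypothetical, not taken from the PR.

```cpp
#include <cstdint>

// Illustrative sketch of the launch-geometry arithmetic described above.
struct Dim3 { uint32_t x, y, z; };

// Grid: one workgroup per 256 query rows, one per query head, one per batch,
// i.e. [ceil(S_q/256), H_q, B].
Dim3 makeGrid(uint32_t seqLenQ, uint32_t numHeadsQ, uint32_t batch) {
    return Dim3{(seqLenQ + 255u) / 256u, numHeadsQ, batch};
}

// Strides come from the graph in elements, but the kernel expects bytes;
// a BF16 element is 2 bytes wide, hence the ×2 conversion.
int64_t strideElemsToBytesBf16(int64_t strideElems) {
    return strideElems * 2;
}
```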

Modified Files:

  • src/SdpaKernelPlan.{hpp,cpp} - Execution implementation
  • src/SdpaKernelPlanBuilder.cpp - Plan building with graph parsing
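
The module lifecycle described above (loaded on plan build, unloaded in the destructor) is a classic RAII pattern. A minimal self-contained sketch, with counters standing in for hipModuleLoad()/hipModuleUnload() so it compiles without the HIP runtime:

```cpp
#include <cassert>

// Stand-in for the loaded-module count the HIP runtime would track.
static int loadedModules = 0;

// Illustrative RAII guard; the real plan class holds a hipModule_t instead.
class ModuleGuard {
public:
    ModuleGuard() { ++loadedModules; }    // stands in for hipModuleLoad()
    ~ModuleGuard() { --loadedModules; }   // stands in for hipModuleUnload()

    // Copying is deleted so two guards can never unload the same module
    // twice; a real implementation would add move operations that null out
    // the source handle (the "move constructor" point raised in review).
    ModuleGuard(const ModuleGuard&) = delete;
    ModuleGuard& operator=(const ModuleGuard&) = delete;
};
```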

Test Plan

Verified with existing unit test infrastructure and integration tests

ninja install
 ./bin/sdpa_kernel_plugin_integration_tests
[==========] Running 2 tests from 2 test suites.
[----------] Global test environment set-up.
[----------] 1 test from IntegrationSdpaKernelNoEngines
[ RUN      ] IntegrationSdpaKernelNoEngines.BatchnormInferenceGraphBuildFails
[       OK ] IntegrationSdpaKernelNoEngines.BatchnormInferenceGraphBuildFails (1476 ms)
[----------] 1 test from IntegrationSdpaKernelNoEngines (1476 ms total)

[----------] 1 test from Smoke/IntegrationGpuSdpaFwdBf16
[ RUN      ] Smoke/IntegrationGpuSdpaFwdBf16.Correctness/0
[       OK ] Smoke/IntegrationGpuSdpaFwdBf16.Correctness/0 (1806 ms)
[----------] 1 test from Smoke/IntegrationGpuSdpaFwdBf16 (1806 ms total)

[----------] Global test environment tear-down
[==========] 2 tests from 2 test suites ran. (3283 ms total)
[  PASSED  ] 2 tests.

Submission Checklist

@AnaghaRaoAMD AnaghaRaoAMD changed the title User/anarao/kernel load dispatch [SDPA][HipDNN] ASM kernel loading and dispatch Mar 20, 2026
@AnaghaRaoAMD AnaghaRaoAMD marked this pull request as ready for review March 20, 2026 21:33

@DarylHawkinsAMD DarylHawkinsAMD left a comment

Looks good, just had one question inline

Comment thread dnn-providers/sdpa-kernel-provider/src/SdpaKernelPlan.cpp

@jerehartAMD jerehartAMD left a comment

Looks good! I have a couple of comments, but I think the only one that's especially important to address is the move constructor one

Comment thread dnn-providers/sdpa-kernel-provider/src/SdpaKernelPlan.cpp
Comment thread dnn-providers/sdpa-kernel-provider/src/SdpaKernelPlan.cpp Outdated
Comment thread dnn-providers/sdpa-kernel-provider/src/SdpaKernelPlan.cpp Outdated
@AnaghaRaoAMD AnaghaRaoAMD merged commit 1f85499 into users/dahawkin/hipdnn-aiter-ck-spda-poc Mar 23, 2026
6 checks passed
@AnaghaRaoAMD AnaghaRaoAMD deleted the user/anarao/kernelLoadDispatch branch March 23, 2026 17:32
DarylHawkinsAMD pushed a commit that referenced this pull request Mar 27, 2026
## Motivation

Implements the complete kernel loading and dispatch pipeline for the
fwd_hd128_bf16_rtne.co ASM kernel, including workspace size calculation.

## Technical Details

**Kernel Execution Implementation**

1. **SdpaFwdPlan** - Kernel execution state:
   - Stores kernel module/function handles, tensor UIDs and metadata
     (dims, strides), and attention scale
   - execute(): Populates fmha_fwd_v3_args (656 bytes) and launches via
     hipModuleLaunchKernel()
   - Grid dimensions: [ceil(S_q/256), H_q, B]
   - Block dimensions: [512, 1, 1] (fixed for this kernel)
   - Strides: Converted from elements to bytes (stride x 2 for BF16)
   - Module lifecycle: Loaded on plan build, unloaded in destructor

2. **SdpaFwdPlanBuilder** - Plan creation:
   - buildPlan(): Loads kernel via hipModuleLoad() and extracts tensor
     metadata from graph
   - Parses Q/K/V tensor dimensions and strides from SDPA graph node
   - Computes attention scale (default: 1/sqrt(D_qk))

3. **Workspace**: Forward-only inference kernel uses 64KB LDS internally
   and requires no external workspace allocation; getWorkspaceSize()
   returns 0. LSE (log-sum-exp) buffer is an optional output tensor
   (stats_tensor_uid), not workspace.

Builds on workspace sizing from PR #5626 and cleanup from PR #5632.
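
The default attention scale mentioned above (1/sqrt(D_qk)) is straightforward to compute; a hypothetical helper, not taken from the PR:

```cpp
#include <cmath>

// Default softmax scale applied to Q·K^T when the caller does not supply
// one: 1/sqrt(D_qk), where D_qk is the Q/K head dimension (128 for this
// kernel). Name is illustrative, not from the PR.
float defaultAttnScale(int headDimQK) {
    return 1.0f / std::sqrt(static_cast<float>(headDimQK));
}
```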

3 participants