CUDA Plugin EP: Core Implementation #27816

Merged: tianleiwu merged 52 commits into main from tlwu/20260320/cuda_plugin on Mar 31, 2026

@tianleiwu tianleiwu commented Mar 23, 2026

Description

This PR adds a standalone CUDA Plugin Execution Provider (CudaPluginExecutionProvider) built as a dynamically loadable shared library (libonnxruntime_providers_cuda_plugin.so) on top of the ORT EP Plugin API. The implementation reuses the existing CUDA kernel stack through adapter/shim layers (force-included headers and macro-based registration overrides), eliminating the need to maintain a parallel copy of 100+ CUDA kernels. CUDA Graph capture/replay is intentionally deferred until the plugin-facing EP API exposes the required session callbacks.

Summary of Changes

Build system and CMake

| File | Change |
| --- | --- |
| cmake/CMakeLists.txt | Adds onnxruntime_BUILD_CUDA_EP_AS_PLUGIN build option, records plugin build info, and includes the plugin-specific CMake file. |
| cmake/onnxruntime_providers_cuda_plugin.cmake | New. Defines the plugin shared-library target: collects .cc/.cu sources from core/providers/cuda/ and contrib_ops/cuda/, applies exclusion filters for incompatible files (tunable, controlflow, registration tables), force-includes adapter headers, and links CUDA/cuDNN/ORT components. |
| cmake/onnxruntime_providers_cuda.cmake | Minor additions to expose include paths needed by plugin builds. |
| cmake/onnxruntime_unittests.cmake | Enables dynamic plugin EP usage in provider tests and fills in missing CUDA include/link settings for the plugin configuration. |
| cmake/external/cuda_configuration.cmake | Adds CUDA configuration support for the plugin build path. |
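
The collect-then-exclude step described for cmake/onnxruntime_providers_cuda_plugin.cmake can be pictured with a small Python sketch. This is only an illustration of the filtering idea, not the actual CMake logic: the function name, the exclusion patterns, and the example paths are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical exclusion patterns, loosely modeled on the categories named
# above (tunable, control flow, registration tables). Not the real list.
EXCLUDE_PATTERNS = [
    "*/tunable/*",
    "*/controlflow/*",
    "*_kernels_registration.cc",
]

SOURCE_SUFFIXES = (".cc", ".cu")


def select_plugin_sources(paths, exclude_patterns=EXCLUDE_PATTERNS):
    """Keep .cc/.cu files that match none of the exclusion patterns."""
    selected = []
    for path in paths:
        # Headers and other non-source files are never compiled directly.
        if not path.endswith(SOURCE_SUFFIXES):
            continue
        # Drop anything that matches an exclusion pattern.
        if any(fnmatchcase(path, pattern) for pattern in exclude_patterns):
            continue
        selected.append(path)
    return selected
```

In the real build the equivalent filtering happens at configure time, so an excluded file never reaches the compiler.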

Plugin runtime implementation (new files)

| File | Purpose |
| --- | --- |
| plugin/cuda_ep_factory.cc/.h | Implements OrtEpFactory: device enumeration, session-option parsing, allocator registration, kernel registry creation, and all static C-compatible plugin callbacks. Thread-safe lazy kernel registry initialization. |
| plugin/cuda_ep.cc/.h | Plugin-side CUDA EP object deriving from ep::adapter::Ep. Carries session-specific Config (NHWC preference, TF32, cuDNN algorithm selection, convolution workspace, attention kernels). |
| plugin/cuda_allocator_plugin.cc/.h | Plugin allocators for device and pinned memory, exposed through the EP API. |
| plugin/cuda_stream_plugin.cc/.h | Plugin-owned CUDA stream, cuBLAS, cuBLASLt, and cuDNN handle management. Provides two stream adapter modes (PluginStreamShim for .cc, OrtStreamAdapter for .cu/.cc contexts). |
| plugin/cuda_data_transfer_plugin.cc/.h | Data transfer bridge for host↔device copies used by plugin-backed tensors and Python bindings. |
| plugin/cuda_memcpy_plugin.cc | MemcpyToHost / MemcpyFromHost kernel implementations for the plugin path. |
| plugin/cuda_controlflow_plugin.cc/.cu/.h | Plugin-native If, Loop, and Scan wrappers that delegate to OrtEpApi control-flow hooks instead of inheriting from in-tree CPU base implementations. |
| plugin/cuda_plugin_ep.cc | Exports the DLL entry points (OrtCreateEpFactory / OrtReleaseEpFactory) used by ORT to create and release the CUDA EP factory. |
| plugin/cuda_kernel_adapter.h | Core shim (1088 lines). Provides CudaKernel base class, error-return macros, type helpers (ToCudaType), handle-management abstractions, and stream adapters. Force-included in all plugin .cc files to transparently adapt existing kernel code. |
| plugin/cuda_plugin_kernels.cu/.h | Aggregates self-registered kernel definitions via PluginKernelCollector macro overrides, replacing the centralized registration tables used in the bundled build. |
| plugin/cuda_plugin_utils.h | Shared utility helpers for the plugin (logging, error checking, config parsing). |
| plugin/provider_api_shims.cc | Stub implementations for shared-provider bridge functions that are not needed in the plugin path. |
| plugin/cuda_plugin_ep_symbols.def | Windows symbol export definitions for the plugin DLL. |
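
The "thread-safe lazy kernel registry initialization" noted for cuda_ep_factory follows a common build-once-on-first-use pattern. The plugin implements this in C++; the Python sketch below only illustrates the general idea with a lock and a re-check, and the class name LazyRegistry is hypothetical rather than anything from the PR.

```python
import threading


class LazyRegistry:
    """Builds a value once, on first use, even under concurrent access."""

    def __init__(self, build):
        self._build = build  # callable that creates the registry
        self._lock = threading.Lock()
        self._value = None
        self._ready = False

    def get(self):
        # Fast path: skip the lock once initialization has completed.
        if not self._ready:
            with self._lock:
                # Re-check under the lock so only one thread runs build().
                if not self._ready:
                    self._value = self._build()
                    self._ready = True
        return self._value
```

Deferring construction this way means sessions that never touch the registry pay nothing for it, while concurrent first callers still see a single shared instance.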

EP adapter and API extensions

| File | Change |
| --- | --- |
| include/onnxruntime/ep/api.h | Makes plugin API initialization thread-safe; preserves access to ORT, EP, and model editor API tables during plugin loading. |
| include/onnxruntime/ep/adapter/node.h | Adds node metadata accessors (operator domain, optional-output handling) needed by reused CUDA kernels. |
| include/onnxruntime/ep/adapter/op_kernel.h | Adds RequiredInput/RequiredOutput helpers and adapter fixes so existing CUDA kernels run against plugin adapter contexts. |
| include/onnxruntime/ep/adapter/op_kernel_info.h | Extends adapter kernel-info with attribute and config accessors required by migrated kernels. |
| include/onnxruntime/ep/adapter/allocator.h | Minor allocator adapter adjustments for plugin compatibility. |
| include/onnxruntime/ep/adapter/kernel_def_builder.h | Adds kernel definition builder hooks for plugin registration. |
| include/onnxruntime/core/framework/tensor.h | Restores a plugin-only Tensor::Create compatibility path for kernels relying on the older static factory form. |
| onnxruntime/core/providers/shared_library/provider_api.h | Turns the shared-provider bridge into a no-op for plugin builds so the EP adapter facade owns type resolution. |

CUDA kernel compatibility migration

  • Adapts ~80 core CUDA and contrib CUDA kernel source files to compile under the plugin build via macro-based registration overrides and targeted compatibility fixes (not operator rewrites).
  • Moves or templates reusable helper logic in shared CPU/CUDA headers (ConstantOfShapeBase, PadBase, SliceBase, SplitBase, ScatterND, UpsampleBase, DeformConvAttributes) so kernels compile in adapter mode.
  • Key contrib kernel adaptations: attention variants (MHA, GQA, paged, sparse, packed), skip-layer-norm, group-norm, MoE, fused-conv, inverse, bias-dropout, matmul-nbits, qordered ops.
  • Key core kernel adaptations: softmax, topk, conv/conv-transpose, batch-norm, instance-norm, pool, RNN, reduction, einsum, matmul, cumsum, identity, pad, split, scatter-nd, slice, upsample, tile, unsqueeze, gather-nd, concat, dropout, non-max-suppression.

Python integration

| File | Change |
| --- | --- |
| onnxruntime/python/onnxruntime_pybind_module.cc | Extends get_available_providers() to surface dynamically registered plugin EPs discovered from OrtEpDevice enumeration. |
| onnxruntime/python/onnxruntime_pybind_state.cc | Allows Python session creation to instantiate providers from registered plugin EP devices, including device_id selection, instead of only built-in or legacy dynamic-load EP paths. |
| onnxruntime/python/onnxruntime_pybind_schema.cc | Adds schema query support for plugin-registered operators. |

Testing and validation

| File | Change |
| --- | --- |
| test/python/transformers/test_cuda_plugin_ep.py | New (1861 lines). Comprehensive test suite covering 5 stages: registration, ONNX ops, NHWC layout preference, contrib ops, and op-level validation. |
| test/python/transformers/cuda_plugin_ep_helper.py | New (192 lines). Utility for transparently routing existing tests to the plugin EP. |
| test/python/transformers/test_gqa.py | Fixes total_sequence_length tensor placement from CUDA to CPU (was causing failures under the plugin EP's stricter memory layout); routes tests through plugin EP. |
| test/python/transformers/test_moe_cuda.py | Routes through plugin EP when available. |
| test/framework/dynamic_plugin_ep_test.cc | New (120 lines). C++ unit test exercising dynamic plugin EP loading and device enumeration. |
| test/unittest_util/base_tester.cc | Routes CUDA test requests to CudaPluginExecutionProvider when registered, allowing existing CUDA provider tests to exercise the plugin path. |
| tools/ci_build/cuda_plugin_parity_report.py | New (737 lines). Comparison script that produces a parity report of ops in bundled-only vs. plugin-only vs. both builds, via static parsing or runtime registry interrogation. |
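
The three-way split that the parity report produces (bundled-only, plugin-only, both) comes down to set arithmetic over the two op lists. The sketch below is a minimal illustration of that classification only; it is not taken from cuda_plugin_parity_report.py, and the function name and dictionary keys are hypothetical.

```python
def parity_report(bundled_ops, plugin_ops):
    """Classify op names by which build registers them."""
    bundled, plugin = set(bundled_ops), set(plugin_ops)
    return {
        # Registered by the bundled CUDA EP but missing from the plugin.
        "bundled_only": sorted(bundled - plugin),
        # Registered by the plugin but missing from the bundled CUDA EP.
        "plugin_only": sorted(plugin - bundled),
        # Registered by both builds.
        "both": sorted(bundled & plugin),
    }
```

A non-empty bundled-only bucket is the interesting one when checking coverage, since it lists ops that would fall back to another provider under the plugin build.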

Documentation

| File | Change |
| --- | --- |
| docs/cuda_plugin_ep/cuda_plugin_ep_design.md | New (990 lines). Plugin architecture, build/deployment flow, operator exclusions, adapter design, and the decision to defer CUDA Graph support. |
| docs/cuda_plugin_ep/QUICK_START.md | New (108 lines). Build instructions, C++ and Python usage examples, and known limitations. |

Other

| File | Change |
| --- | --- |
| tools/python/gen_opkernel_doc.py | Extended to generate documentation for plugin-registered kernels. |
| orttraining/.../reduction_ops.cc | Minor compatibility fix for training reduction ops under the plugin build configuration. |

Testing

  • Build: Configure with --build_cuda_ep_as_plugin (or onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON); verify libonnxruntime_providers_cuda_plugin.so is produced alongside existing CUDA provider artifacts.
  • C++ unit tests: Run onnxruntime_provider_test; BaseTester routes CUDA coverage through CudaPluginExecutionProvider. Run the new dynamic_plugin_ep_test for load/enumerate validation.
  • Python tests: Register the plugin library, confirm onnxruntime.get_available_providers() includes CudaPluginExecutionProvider, and run test_cuda_plugin_ep.py (5-stage suite: registration → ONNX ops → NHWC → contrib ops → op validation).
  • Parity report: Run tools/ci_build/cuda_plugin_parity_report.py to verify kernel coverage parity between bundled and plugin builds.
  • Backward compatibility: Verify unchanged behavior for the in-tree CUDA EP build path (onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=OFF).
  • Known limitation: CUDA graph support remains disabled in the plugin path and is documented as deferred.
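
For the Python test step above, a test harness typically has to decide which provider list to request once it knows what is available. The helper below is a hedged sketch of that decision, not code from the PR: choose_providers and its preference order are assumptions made for illustration, while the provider names themselves come from this description and from ONNX Runtime's built-in providers.

```python
PLUGIN_EP = "CudaPluginExecutionProvider"


def choose_providers(available, preferred=(PLUGIN_EP, "CUDAExecutionProvider")):
    """Return the preferred providers that are available, with CPU as fallback."""
    # Keep the preference order, dropping providers that are not registered.
    chosen = [name for name in preferred if name in available]
    # CPU is always appended so session creation has a fallback.
    chosen.append("CPUExecutionProvider")
    return chosen
```

In an actual run, `available` would be the result of onnxruntime.get_available_providers() after the plugin library has been registered, and the returned list would be passed as the `providers` argument when creating an onnxruntime.InferenceSession.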

Motivation and Context

The CUDA EP is currently compiled into the ORT runtime binary, tightly coupling its release cycle to the core runtime. This PR creates a path to decouple CUDA EP delivery by implementing it as a standalone plugin using the EP Plugin API. The key design tradeoff is reusing the existing 100+ CUDA kernel implementations through force-include adapter headers and macro-based registration overrides, rather than rewriting them. This approach validates the plugin EP against current CUDA coverage without maintaining a second kernel stack, at the cost of introducing adapter/shim complexity. CUDA Graph support is explicitly deferred until the EP Plugin API can represent the capture/replay lifecycle.

Related: PR #27817 (CUDA Plugin EP: Test Coverage & Bug Fixes) is squash-merged into this branch.

Checklist

  • Tests added/updated
  • Documentation updated (if applicable)
  • No breaking changes (or documented in description)
  • CI passes

Copilot AI left a comment

Pull request overview

This PR introduces a standalone, dynamically loaded CUDA Execution Provider plugin and updates core CUDA kernels/adapters, Python bindings, and unit test routing to support running existing CUDA coverage against the plugin EP.

Changes:

  • Add CUDA Plugin EP implementation (factory, stream/handles, allocator, data transfer, control flow wrappers, kernel registry creation).
  • Update Python bindings to surface dynamically registered EPs and create EP instances using session options + logger.
  • Adjust a large set of CUDA kernel codepaths to be dual-compatible (framework vs plugin) via stream/scratch-buffer access changes and build guards.

Reviewed changes

Copilot reviewed 108 out of 108 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/unittest_util/base_tester.cc | Routes CUDA tests to the dynamic CUDA plugin EP when configured. |
| onnxruntime/python/onnxruntime_pybind_state.cc | Allows creating EP factories from registered plugin EP devices; passes session options/logger into factory CreateProvider. |
| onnxruntime/python/onnxruntime_pybind_module.cc | Adds dynamically registered EP names to get_available_providers(). |
| onnxruntime/core/providers/shared_library/provider_api.h | Makes provider_api a no-op for plugin build to avoid SHARED_PROVIDER conflicts. |
| onnxruntime/core/providers/cuda/tensor/upsample.{h,cc} | Reworks antialias lookup-table handling and stream access patterns. |
| onnxruntime/core/providers/cuda/tensor/tile.cc | Adds (currently unused) plugin helper for memcpy fast-path detection. |
| onnxruntime/core/providers/cuda/reduction/reduction_ops.{h,cc} | Refactors ReduceComputeCore to accept raw cudaStream_t + alloc_stream and adds scratch allocation helper. |
| onnxruntime/core/providers/cuda/plugin/* | New CUDA plugin EP core implementation (stream, factory, allocator, transfer, kernels, control flow). |
| onnxruntime/core/providers/cuda/plugin/cuda_controlflow_plugin.{cc,cu,h} | Implements If/Loop/Scan wrappers and a custom GPU transpose helper for Scan. |
| onnxruntime/core/providers/cuda/plugin/cuda_data_transfer_plugin.cc | Implements plugin tensor copies (CPU↔GPU/GPU↔GPU). |
| cmake/onnxruntime_providers_cuda_plugin.cmake | Adds build target for onnxruntime_providers_cuda_plugin and source filtering. |
| cmake/onnxruntime_providers_cuda.cmake | Excludes plugin directory from regular CUDA EP build. |
| cmake/CMakeLists.txt | Adds build option to enable building CUDA EP as plugin. |
| include/onnxruntime/ep/* and include/onnxruntime/core/framework/tensor.h | Extends adapter API surface for plugin compatibility. |

Copilot AI left a comment

Pull request overview

Copilot reviewed 107 out of 107 changed files in this pull request and generated 4 comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 144 out of 144 changed files in this pull request and generated no new comments.


Copilot AI left a comment

Pull request overview

Copilot reviewed 145 out of 145 changed files in this pull request and generated 2 comments.


@tianleiwu tianleiwu requested a review from yuslepukhin March 30, 2026 23:50
@tianleiwu tianleiwu enabled auto-merge (squash) March 31, 2026 18:44

@yuslepukhin yuslepukhin left a comment

There are no blockers, but here are a few things to note if you choose to address them.

LOGS_DEFAULT in plugin path — Logging is partially addressed. The comments indicate ORT logger routing, but ClampCudnnBatchNormEpsilon in cudnn_common.h still has a #ifndef BUILD_CUDA_EP_AS_PLUGIN guard that silences a warning. This is a minor observability gap for debugging, not a correctness issue.

cuDNN Frontend print() unavailable in plugin — Conv/ConvTranspose error messages use #ifdef BUILD_CUDA_EP_AS_PLUGIN to omit the cuDNN frontend graph JSON dump. Acceptable — avoids a link dependency — but reduces debuggability for plugin-path cuDNN failures.

@tianleiwu tianleiwu merged commit 879f659 into main Mar 31, 2026
101 of 104 checks passed
@tianleiwu tianleiwu deleted the tlwu/20260320/cuda_plugin branch March 31, 2026 19:15