CUDA Plugin EP: Core Implementation #27816
Conversation
Pull request overview
This PR introduces a standalone, dynamically loaded CUDA Execution Provider plugin and updates core CUDA kernels/adapters, Python bindings, and unit test routing to support running existing CUDA coverage against the plugin EP.
Changes:
- Add CUDA Plugin EP implementation (factory, stream/handles, allocator, data transfer, control flow wrappers, kernel registry creation).
- Update Python bindings to surface dynamically registered EPs and create EP instances using session options + logger.
- Adjust a large set of CUDA kernel codepaths to be dual-compatible (framework vs plugin) via stream/scratch-buffer access changes and build guards.
Reviewed changes
Copilot reviewed 108 out of 108 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/unittest_util/base_tester.cc | Routes CUDA tests to the dynamic CUDA plugin EP when configured. |
| onnxruntime/python/onnxruntime_pybind_state.cc | Allows creating EP factories from registered plugin EP devices; passes session options/logger into factory CreateProvider. |
| onnxruntime/python/onnxruntime_pybind_module.cc | Adds dynamically registered EP names to get_available_providers(). |
| onnxruntime/core/providers/shared_library/provider_api.h | Makes provider_api a no-op for plugin build to avoid SHARED_PROVIDER conflicts. |
| onnxruntime/core/providers/cuda/tensor/upsample.{h,cc} | Reworks antialias lookup-table handling and stream access patterns. |
| onnxruntime/core/providers/cuda/tensor/tile.cc | Adds (currently unused) plugin helper for memcpy fast-path detection. |
| onnxruntime/core/providers/cuda/reduction/reduction_ops.{h,cc} | Refactors ReduceComputeCore to accept raw cudaStream_t + alloc_stream and adds scratch allocation helper. |
| onnxruntime/core/providers/cuda/plugin/* | New CUDA plugin EP core implementation (stream, factory, allocator, transfer, kernels, control flow). |
| onnxruntime/core/providers/cuda/plugin/cuda_controlflow_plugin.{cc,cu,h} | Implements If/Loop/Scan wrappers and a custom GPU transpose helper for Scan. |
| onnxruntime/core/providers/cuda/plugin/cuda_data_transfer_plugin.cc | Implements plugin tensor copies (CPU↔GPU/GPU↔GPU). |
| cmake/onnxruntime_providers_cuda_plugin.cmake | Adds build target for onnxruntime_providers_cuda_plugin and source filtering. |
| cmake/onnxruntime_providers_cuda.cmake | Excludes plugin directory from regular CUDA EP build. |
| cmake/CMakeLists.txt | Adds build option to enable building CUDA EP as plugin. |
| include/onnxruntime/ep/* and include/onnxruntime/core/framework/tensor.h | Extends adapter API surface for plugin compatibility. |
Pull request overview
Copilot reviewed 107 out of 107 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 144 out of 144 changed files in this pull request and generated no new comments.
Force-pushed from 7c1d851 to 8471fec.
Pull request overview
Copilot reviewed 145 out of 145 changed files in this pull request and generated 2 comments.
yuslepukhin left a comment
There are no blockers, but some things to note if you choose to address them.
LOGS_DEFAULT in plugin path — Logging is partially addressed. The comments indicate ORT logger routing, but ClampCudnnBatchNormEpsilon in cudnn_common.h still has a #ifndef BUILD_CUDA_EP_AS_PLUGIN guard that silences a warning. This is a minor observability gap for debugging, not a correctness issue.
cuDNN Frontend print() unavailable in plugin — Conv/ConvTranspose error messages use #ifdef BUILD_CUDA_EP_AS_PLUGIN to omit the cuDNN frontend graph JSON dump. Acceptable — avoids a link dependency — but reduces debuggability for plugin-path cuDNN failures.
Description
This PR adds a standalone CUDA Plugin Execution Provider (`CudaPluginExecutionProvider`) built as a dynamically loadable shared library (`libonnxruntime_providers_cuda_plugin.so`) on top of the ORT EP Plugin API. The implementation reuses the existing CUDA kernel stack through adapter/shim layers (force-included headers and macro-based registration overrides), eliminating the need to maintain a parallel copy of 100+ CUDA kernels. CUDA Graph capture/replay is intentionally deferred until the plugin-facing EP API exposes the required session callbacks.

Summary of Changes
Build system and CMake
- `cmake/CMakeLists.txt`: Adds the `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN` build option, records plugin build info, and includes the plugin-specific CMake file.
- `cmake/onnxruntime_providers_cuda_plugin.cmake`: New target that collects `.cc`/`.cu` sources from `core/providers/cuda/` and `contrib_ops/cuda/`, applies exclusion filters for incompatible files (tunable, controlflow, registration tables), force-includes adapter headers, and links CUDA/cuDNN/ORT components.
- `cmake/onnxruntime_providers_cuda.cmake`
- `cmake/onnxruntime_unittests.cmake`
- `cmake/external/cuda_configuration.cmake`

Plugin runtime implementation (new files)
- `plugin/cuda_ep_factory.cc/.h`: Implements `OrtEpFactory` — device enumeration, session-option parsing, allocator registration, kernel registry creation, and all static C-compatible plugin callbacks. Thread-safe lazy kernel registry initialization.
- `plugin/cuda_ep.cc/.h`: EP implementation built on `ep::adapter::Ep`. Carries session-specific `Config` (NHWC preference, TF32, cuDNN algorithm selection, convolution workspace, attention kernels).
- `plugin/cuda_allocator_plugin.cc/.h`
- `plugin/cuda_stream_plugin.cc/.h`: Stream handling (`PluginStreamShim` for `.cc`, `OrtStreamAdapter` for `.cu`/`.cc` contexts).
- `plugin/cuda_data_transfer_plugin.cc/.h`
- `plugin/cuda_memcpy_plugin.cc`
- `plugin/cuda_controlflow_plugin.cc/.cu/.h`: `If`, `Loop`, and `Scan` wrappers that delegate to `OrtEpApi` control-flow hooks instead of inheriting from in-tree CPU base implementations.
- `plugin/cuda_plugin_ep.cc`: Plugin entry points (`OrtCreateEpFactory`/`OrtReleaseEpFactory`) used by ORT to create and release the CUDA EP factory.
- `plugin/cuda_kernel_adapter.h`: Adapter `CudaKernel` base class, error-return macros, type helpers (`ToCudaType`), handle-management abstractions, and stream adapters. Force-included in all plugin `.cc` files to transparently adapt existing kernel code.
- `plugin/cuda_plugin_kernels.cu/.h`: Kernel registration via `PluginKernelCollector` macro overrides, replacing the centralized registration tables used in the bundled build.
- `plugin/cuda_plugin_utils.h`
- `plugin/provider_api_shims.cc`
- `plugin/cuda_plugin_ep_symbols.def`

EP adapter and API extensions
- `include/onnxruntime/ep/api.h`
- `include/onnxruntime/ep/adapter/node.h`
- `include/onnxruntime/ep/adapter/op_kernel.h`: Adds `RequiredInput`/`RequiredOutput` helpers and adapter fixes so existing CUDA kernels run against plugin adapter contexts.
- `include/onnxruntime/ep/adapter/op_kernel_info.h`
- `include/onnxruntime/ep/adapter/allocator.h`
- `include/onnxruntime/ep/adapter/kernel_def_builder.h`
- `include/onnxruntime/core/framework/tensor.h`: Adds a `Tensor::Create` compatibility path for kernels relying on the older static factory form.
- `onnxruntime/core/providers/shared_library/provider_api.h`

CUDA kernel compatibility migration
- Migrates shared kernel base classes (`ConstantOfShapeBase`, `PadBase`, `SliceBase`, `SplitBase`, `ScatterND`, `UpsampleBase`, `DeformConvAttributes`) so kernels compile in adapter mode.

Python integration
- `onnxruntime/python/onnxruntime_pybind_module.cc`: Extends `get_available_providers()` to surface dynamically registered plugin EPs discovered from `OrtEpDevice` enumeration.
- `onnxruntime/python/onnxruntime_pybind_state.cc`: Creates EP instances from registered plugin EP devices with `device_id` selection, instead of only built-in or legacy dynamic-load EP paths.
- `onnxruntime/python/onnxruntime_pybind_schema.cc`

Testing and validation
- `test/python/transformers/test_cuda_plugin_ep.py`
- `test/python/transformers/cuda_plugin_ep_helper.py`
- `test/python/transformers/test_gqa.py`: Moves `total_sequence_length` tensor placement from CUDA to CPU (was causing failures under the plugin EP's stricter memory layout); routes tests through the plugin EP.
- `test/python/transformers/test_moe_cuda.py`
- `test/framework/dynamic_plugin_ep_test.cc`
- `test/unittest_util/base_tester.cc`: Routes CUDA tests to `CudaPluginExecutionProvider` when registered, allowing existing CUDA provider tests to exercise the plugin path.
- `tools/ci_build/cuda_plugin_parity_report.py`

Documentation
- `docs/cuda_plugin_ep/cuda_plugin_ep_design.md`
- `docs/cuda_plugin_ep/QUICK_START.md`

Other
- `tools/python/gen_opkernel_doc.py`
- `orttraining/.../reduction_ops.cc`

Testing
- Build with `--build_cuda_ep_as_plugin` (or `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`); verify `libonnxruntime_providers_cuda_plugin.so` is produced alongside existing CUDA provider artifacts.
- Run `onnxruntime_provider_test` — `BaseTester` routes CUDA coverage through `CudaPluginExecutionProvider`. Run the new `dynamic_plugin_ep_test` for load/enumerate validation.
- Verify `onnxruntime.get_available_providers()` includes `CudaPluginExecutionProvider`, and run `test_cuda_plugin_ep.py` (5-stage suite: registration → ONNX ops → NHWC → contrib ops → op validation).
- Run `tools/ci_build/cuda_plugin_parity_report.py` to verify kernel coverage parity between bundled and plugin builds.
- Confirm the bundled build is unaffected when the option is off (`onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=OFF`).

Motivation and Context
The CUDA EP is currently compiled into the ORT runtime binary, tightly coupling its release cycle to the core runtime. This PR creates a path to decouple CUDA EP delivery by implementing it as a standalone plugin using the EP Plugin API. The key design tradeoff is reusing the existing ~100+ CUDA kernel implementations through force-include adapter headers and macro-based registration overrides, rather than rewriting them. This approach validates the plugin EP against current CUDA coverage without maintaining a second kernel stack, at the cost of introducing adapter/shim complexity. CUDA Graph support is explicitly deferred until the EP Plugin API can represent the capture/replay lifecycle.
Related: PR #27817 (CUDA Plugin EP: Test Coverage & Bug Fixes) is squash-merged into this branch.
Checklist