CUDA Plugin EP: Core Implementation #27816
Conversation
Pull request overview
This PR introduces a standalone, dynamically loaded CUDA Execution Provider plugin and updates core CUDA kernels/adapters, Python bindings, and unit test routing to support running existing CUDA coverage against the plugin EP.
Changes:
- Add CUDA Plugin EP implementation (factory, stream/handles, allocator, data transfer, control flow wrappers, kernel registry creation).
- Update Python bindings to surface dynamically registered EPs and create EP instances using session options + logger.
- Adjust a large set of CUDA kernel codepaths to be dual-compatible (framework vs plugin) via stream/scratch-buffer access changes and build guards.
Reviewed changes
Copilot reviewed 108 out of 108 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| onnxruntime/test/unittest_util/base_tester.cc | Routes CUDA tests to the dynamic CUDA plugin EP when configured. |
| onnxruntime/python/onnxruntime_pybind_state.cc | Allows creating EP factories from registered plugin EP devices; passes session options/logger into factory CreateProvider. |
| onnxruntime/python/onnxruntime_pybind_module.cc | Adds dynamically registered EP names to get_available_providers(). |
| onnxruntime/core/providers/shared_library/provider_api.h | Makes provider_api a no-op for plugin build to avoid SHARED_PROVIDER conflicts. |
| onnxruntime/core/providers/cuda/tensor/upsample.{h,cc} | Reworks antialias lookup-table handling and stream access patterns. |
| onnxruntime/core/providers/cuda/tensor/tile.cc | Adds (currently unused) plugin helper for memcpy fast-path detection. |
| onnxruntime/core/providers/cuda/reduction/reduction_ops.{h,cc} | Refactors ReduceComputeCore to accept raw cudaStream_t + alloc_stream and adds scratch allocation helper. |
| onnxruntime/core/providers/cuda/plugin/* | New CUDA plugin EP core implementation (stream, factory, allocator, transfer, kernels, control flow). |
| onnxruntime/core/providers/cuda/plugin/cuda_controlflow_plugin.{cc,cu,h} | Implements If/Loop/Scan wrappers and a custom GPU transpose helper for Scan. |
| onnxruntime/core/providers/cuda/plugin/cuda_data_transfer_plugin.cc | Implements plugin tensor copies (CPU↔GPU/GPU↔GPU). |
| cmake/onnxruntime_providers_cuda_plugin.cmake | Adds build target for onnxruntime_providers_cuda_plugin and source filtering. |
| cmake/onnxruntime_providers_cuda.cmake | Excludes plugin directory from regular CUDA EP build. |
| cmake/CMakeLists.txt | Adds build option to enable building CUDA EP as plugin. |
| include/onnxruntime/ep/* and include/onnxruntime/core/framework/tensor.h | Extends adapter API surface for plugin compatibility. |
Pull request overview
Copilot reviewed 107 out of 107 changed files in this pull request and generated 4 comments.
Pull request overview
Copilot reviewed 144 out of 144 changed files in this pull request and generated no new comments.
Force-pushed from 7c1d851 to 8471fec.
Pull request overview
Copilot reviewed 145 out of 145 changed files in this pull request and generated 2 comments.
yuslepukhin left a comment
There are no blockers, but some things to note if you choose to address them.
LOGS_DEFAULT in plugin path — Logging is partially addressed. The comments indicate ORT logger routing, but ClampCudnnBatchNormEpsilon in cudnn_common.h still has a #ifndef BUILD_CUDA_EP_AS_PLUGIN guard that silences a warning. This is a minor observability gap for debugging, not a correctness issue.
cuDNN Frontend print() unavailable in plugin — Conv/ConvTranspose error messages use #ifdef BUILD_CUDA_EP_AS_PLUGIN to omit the cuDNN frontend graph JSON dump. Acceptable — avoids a link dependency — but reduces debuggability for plugin-path cuDNN failures.
Description
This PR adds a standalone CUDA Plugin Execution Provider (`CudaPluginExecutionProvider`) built as a dynamically loadable shared library (`libonnxruntime_providers_cuda_plugin.so`) on top of the ORT EP Plugin API. The implementation reuses the existing CUDA kernel stack through adapter/shim layers (force-included headers and macro-based registration overrides), eliminating the need to maintain a parallel copy of 100+ CUDA kernels. CUDA Graph capture/replay is intentionally deferred until the plugin-facing EP API exposes the required session callbacks.

Summary of Changes
Build system and CMake
- `cmake/CMakeLists.txt`: Adds the `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN` build option, records plugin build info, and includes the plugin-specific CMake file.
- `cmake/onnxruntime_providers_cuda_plugin.cmake`: New target that collects `.cc`/`.cu` sources from `core/providers/cuda/` and `contrib_ops/cuda/`, applies exclusion filters for incompatible files (tunable, controlflow, registration tables), force-includes adapter headers, and links CUDA/cuDNN/ORT components.
- `cmake/onnxruntime_providers_cuda.cmake`
- `cmake/onnxruntime_unittests.cmake`
- `cmake/external/cuda_configuration.cmake`

Plugin runtime implementation (new files)
- `plugin/cuda_ep_factory.cc/.h`: Implements `OrtEpFactory` — device enumeration, session-option parsing, allocator registration, kernel registry creation, and all static C-compatible plugin callbacks. Thread-safe lazy kernel registry initialization.
- `plugin/cuda_ep.cc/.h`: EP implementation built on `ep::adapter::Ep`. Carries session-specific `Config` (NHWC preference, TF32, cuDNN algorithm selection, convolution workspace, attention kernels).
- `plugin/cuda_allocator_plugin.cc/.h`
- `plugin/cuda_stream_plugin.cc/.h`: Stream handling (`PluginStreamShim` for `.cc`, `OrtStreamAdapter` for `.cu`/`.cc` contexts).
- `plugin/cuda_data_transfer_plugin.cc/.h`
- `plugin/cuda_memcpy_plugin.cc`
- `plugin/cuda_controlflow_plugin.cc/.cu/.h`: `If`, `Loop`, and `Scan` wrappers that delegate to `OrtEpApi` control-flow hooks instead of inheriting from in-tree CPU base implementations.
- `plugin/cuda_plugin_ep.cc`: Plugin entry points (`OrtCreateEpFactory`/`OrtReleaseEpFactory`) used by ORT to create and release the CUDA EP factory.
- `plugin/cuda_kernel_adapter.h`: Adapter `CudaKernel` base class, error-return macros, type helpers (`ToCudaType`), handle-management abstractions, and stream adapters. Force-included in all plugin `.cc` files to transparently adapt existing kernel code.
- `plugin/cuda_plugin_kernels.cu/.h`: Kernel registration via `PluginKernelCollector` macro overrides, replacing the centralized registration tables used in the bundled build.
- `plugin/cuda_plugin_utils.h`
- `plugin/provider_api_shims.cc`
- `plugin/cuda_plugin_ep_symbols.def`

EP adapter and API extensions
- `include/onnxruntime/ep/api.h`
- `include/onnxruntime/ep/adapter/node.h`
- `include/onnxruntime/ep/adapter/op_kernel.h`: Adds `RequiredInput`/`RequiredOutput` helpers and adapter fixes so existing CUDA kernels run against plugin adapter contexts.
- `include/onnxruntime/ep/adapter/op_kernel_info.h`
- `include/onnxruntime/ep/adapter/allocator.h`
- `include/onnxruntime/ep/adapter/kernel_def_builder.h`
- `include/onnxruntime/core/framework/tensor.h`: Adds a `Tensor::Create` compatibility path for kernels relying on the older static factory form.
- `onnxruntime/core/providers/shared_library/provider_api.h`

CUDA kernel compatibility migration
- Migrates shared kernel base classes (`ConstantOfShapeBase`, `PadBase`, `SliceBase`, `SplitBase`, `ScatterND`, `UpsampleBase`, `DeformConvAttributes`) so kernels compile in adapter mode.

Python integration
- `onnxruntime/python/onnxruntime_pybind_module.cc`: Extends `get_available_providers()` to surface dynamically registered plugin EPs discovered from `OrtEpDevice` enumeration.
- `onnxruntime/python/onnxruntime_pybind_state.cc`: Creates EP instances from registered plugin EP devices with `device_id` selection, instead of only built-in or legacy dynamic-load EP paths.
- `onnxruntime/python/onnxruntime_pybind_schema.cc`

Testing and validation
- `test/python/transformers/test_cuda_plugin_ep.py`
- `test/python/transformers/cuda_plugin_ep_helper.py`
- `test/python/transformers/test_gqa.py`: Moves `total_sequence_length` tensor placement from CUDA to CPU (was causing failures under the plugin EP's stricter memory layout); routes tests through the plugin EP.
- `test/python/transformers/test_moe_cuda.py`
- `test/framework/dynamic_plugin_ep_test.cc`
- `test/unittest_util/base_tester.cc`: Routes CUDA tests to `CudaPluginExecutionProvider` when registered, allowing existing CUDA provider tests to exercise the plugin path.
- `tools/ci_build/cuda_plugin_parity_report.py`

Documentation
- `docs/cuda_plugin_ep/cuda_plugin_ep_design.md`
- `docs/cuda_plugin_ep/QUICK_START.md`

Other
- `tools/python/gen_opkernel_doc.py`
- `orttraining/.../reduction_ops.cc`

Testing
- Build with `--build_cuda_ep_as_plugin` (or `onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=ON`); verify `libonnxruntime_providers_cuda_plugin.so` is produced alongside existing CUDA provider artifacts.
- Run `onnxruntime_provider_test` — `BaseTester` routes CUDA coverage through `CudaPluginExecutionProvider`. Run the new `dynamic_plugin_ep_test` for load/enumerate validation.
- Verify `onnxruntime.get_available_providers()` includes `CudaPluginExecutionProvider`, and run `test_cuda_plugin_ep.py` (5-stage suite: registration → ONNX ops → NHWC → contrib ops → op validation).
- Run `tools/ci_build/cuda_plugin_parity_report.py` to verify kernel coverage parity between bundled and plugin builds.
- Confirm the bundled build is unaffected when the option is off (`onnxruntime_BUILD_CUDA_EP_AS_PLUGIN=OFF`).

Motivation and Context
The CUDA EP is currently compiled into the ORT runtime binary, tightly coupling its release cycle to the core runtime. This PR creates a path to decouple CUDA EP delivery by implementing it as a standalone plugin using the EP Plugin API. The key design tradeoff is reusing the existing ~100+ CUDA kernel implementations through force-include adapter headers and macro-based registration overrides, rather than rewriting them. This approach validates the plugin EP against current CUDA coverage without maintaining a second kernel stack, at the cost of introducing adapter/shim complexity. CUDA Graph support is explicitly deferred until the EP Plugin API can represent the capture/replay lifecycle.
Related: PR #27817 (CUDA Plugin EP: Test Coverage & Bug Fixes) is squash-merged into this branch.
Checklist