Annotation based partitioning along with resource accounting#27595
- Add session options string and parsing code along with the unit test
- Introduce layering configuration
- Refine LayeringRuleMatcher and add tests
- Add OrtEpDevice matching logic and tests
- Change the Matcher interface to match one rule against potentially many devices
- Add matching against traditional EPs
- Create LayeringIndex
- Add LayeringIndex and tests
- Adjust config parsing to detect errors
- Adjust Create sig
- Implement WeightsSizeBasedAccountant
Duplicate layering annotations for AddNode in L1 transformers.
since the Update after layout transformation may rely on them. Also: [ RUN ] AttentionTest.Attention3DDefault GPU Compute Capability: SM 6.1 (value: 610) Assertion failed: data.IsUnfused(), file D:\dev\ort_trans\onnxruntime\contrib_ops\cuda\bert\attention_prepare_qkv.cu, line 318. This may be related to uninitialized memory.
and add SessionState partitioning test for layered execution. Add layering configuration file for tiny_gpt2_beamsearch and a script to annotate the model by layers.
instance based on a set of nodes. This is used by the graph partitioner to create a filtered graph viewer. Adjust implementation of the Graph_GetViewer.
Add a no-threshold and no-stat option for the accountant.
Pull request overview
This PR adds layering annotations to ONNX Runtime graphs and uses them to guide graph partitioning across execution providers, alongside enhancements to resource accounting (including an initializer-based fallback when pre-recorded stats aren’t provided).
Changes:
- Introduces node-level layering annotations (loaded from NodeProto metadata "layer_ann") and a LayeringIndex to map annotations/rules to EP assignments during partitioning.
- Extends the graph partitioner APIs to accept an optional LayeringIndex and filters EP capability queries accordingly, with logic to “unassign” nodes not claimed.
- Improves resource accounting to support threshold updates and an initializer-based counting fallback; adds tests and a Python tool for annotating models.
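To make the first two bullets concrete, here is a minimal sketch of mapping annotations to devices by prefix and then filtering the nodes offered to one EP. It is not the PR's LayeringIndex. The rule table `RULES`, the longest-prefix tie-break, the helper names, and the choice to leave unannotated nodes open to every device are all assumptions made for illustration. The actual `session.layer_assignment_settings` syntax is not shown in this conversation.

```python
from typing import Dict, List, Optional

# Hypothetical rule table: device name -> annotation prefixes it should claim.
# This is an illustration only, not the real session option format.
RULES: Dict[str, List[str]] = {
    "gpu": ["layer_0", "layer_1"],
    "cpu": ["layer_"],
}


def match_device(annotation: Optional[str], rules: Dict[str, List[str]]) -> Optional[str]:
    """Return the device whose longest prefix matches the annotation (assumed tie-break)."""
    if not annotation:
        return None
    best_device, best_len = None, -1
    for device, prefixes in rules.items():
        for prefix in prefixes:
            if annotation.startswith(prefix) and len(prefix) > best_len:
                best_device, best_len = device, len(prefix)
    return best_device


def build_index(node_annotations: Dict[str, str], rules: Dict[str, List[str]]) -> Dict[str, str]:
    """Map node name -> device for every annotated node that matches some rule."""
    index = {}
    for name, annotation in node_annotations.items():
        device = match_device(annotation, rules)
        if device is not None:
            index[name] = device
    return index


def candidates_for(device: str, node_names: List[str], index: Dict[str, str]) -> List[str]:
    """Nodes offered to `device`: those indexed to it, plus nodes with no assignment (assumption)."""
    return [n for n in node_names if index.get(n, device) == device]
```

Note that plain prefix matching means an annotation such as `layer_10` also matches the prefix `layer_1`, so annotation names need to be chosen with that in mind.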
Reviewed changes
Copilot reviewed 33 out of 34 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| onnxruntime/test/testdata/layering/tiny_gpt2_beamsearch_layering.txt | Adds test data for layering/annotation scenarios. |
| onnxruntime/test/framework/tensorutils_test.cc | Adds unit tests for extracting layer_ann from NodeProto metadata. |
| onnxruntime/test/framework/session_state_test.cc | Updates partitioning test helper to pass LayeringIndex; adds layering partitioning test. |
| onnxruntime/test/framework/layering_annotations_test.cc | Adds comprehensive unit tests for rule parsing/matching and LayeringIndex behavior. |
| onnxruntime/python/tools/layering/layer_annotate.py | Adds a Python tool to apply layer_ann metadata to ONNX nodes (recurses into subgraphs). |
| onnxruntime/core/session/onnxruntime_c_api.cc | Refactors Graph_GetGraphView subgraph IO detection and node handling. |
| onnxruntime/core/session/inference_session.cc | Builds and passes LayeringIndex from session options; clears annotations post-partitioning to save memory. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Improves threshold handling/logging for resource-aware CUDA capability selection. |
| onnxruntime/core/optimizer/utils.h | Declares DuplicateNodeAnnotation helper for propagating annotations in transforms/fusions. |
| onnxruntime/core/optimizer/utils.cc | Implements DuplicateNodeAnnotation. |
| onnxruntime/core/optimizer/transpose_optimization/ort_optimizer_api_impl.cc | Exposes layering annotation get/set in optimizer API; copies annotation when copying nodes. |
| onnxruntime/core/optimizer/transpose_optimization/optimizer_api.h | Extends NodeRef API with layering annotation get/set. |
| onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc | Propagates annotations to newly created nodes during transpose optimization rewrites. |
| onnxruntime/core/optimizer/reshape_fusion.cc | Copies annotation onto fused reshape node. |
| onnxruntime/core/optimizer/qdq_transformer/where_dummy_dq.cc | Copies annotation to inserted dummy DQ node. |
| onnxruntime/core/optimizer/qdq_transformer/weight_bias_quantization.cc | Copies annotation to inserted Q/DQ and helper nodes. |
| onnxruntime/core/optimizer/qdq_transformer/qdq_propagation.cc | Copies annotation to inserted Q/DQ nodes. |
| onnxruntime/core/optimizer/qdq_transformer/ensure_unique_dq_for_node_unit.cc | Copies annotation when duplicating DQ nodes. |
| onnxruntime/core/optimizer/matmul_add_fusion.cc | Copies annotation to inserted reshape/gemm fusion nodes. |
| onnxruntime/core/optimizer/embed_layer_norm_fusion.cc | Copies annotation to inserted Cast and EmbedLayerNorm fusion node. |
| onnxruntime/core/graph/graph_utils.h | Adds CreateFilteredIndexedGraph helper for building filtered GraphViewer inputs/outputs. |
| onnxruntime/core/graph/graph_utils.cc | Implements CreateFilteredIndexedGraph. |
| onnxruntime/core/graph/graph.cc | Adds Graph::RemoveAllLayeringAnnotations and loads node annotations from NodeProto metadata. |
| onnxruntime/core/framework/tensorprotoutils.h | Adds kNodeProtoLayerAnnotation constant and annotation extraction helper declaration. |
| onnxruntime/core/framework/tensorprotoutils.cc | Implements GetNodeProtoLayeringAnnotation. |
| onnxruntime/core/framework/resource_accountant.cc | Refactors accountant creation; adds initializer-based fallback resource counting. |
| onnxruntime/core/framework/layering_annotations.h | Adds layering rule parsing/matching and LayeringIndex API. |
| onnxruntime/core/framework/layering_annotations.cc | Implements rule parsing, EP matching heuristics, graph indexing, and update/unassign logic. |
| onnxruntime/core/framework/graph_partitioner.h | Extends GraphPartitioner::Partition signature to accept LayeringIndex*. |
| onnxruntime/core/framework/graph_partitioner.cc | Integrates layering-aware filtering into EP capability queries and assignment reset. |
| include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h | Documents new session config session.layer_assignment_settings and updates resource partitioning docs. |
| include/onnxruntime/core/graph/graph.h | Adds Node layering annotation storage/accessors and Graph::RemoveAllLayeringAnnotations declaration. |
| include/onnxruntime/core/framework/resource_accountant.h | Adds SetThreshold, makes ComputeResourceCount non-const, and moves CreateAccountants to a free function. |
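The graph_utils rows above mention building filtered GraphViewer inputs/outputs from a set of nodes. As a rough sketch of that idea only, the boundary of a node subset can be computed as below. The `Node` tuple and `filtered_io` function are hypothetical stand-ins, not the CreateFilteredIndexedGraph implementation, and initializers are treated like any other input here.

```python
from typing import List, NamedTuple, Set, Tuple


class Node(NamedTuple):
    # Hypothetical minimal node record; not an ONNX Runtime type.
    name: str
    inputs: List[str]
    outputs: List[str]


def filtered_io(nodes: List[Node], selected: Set[str],
                graph_outputs: Set[str]) -> Tuple[List[str], List[str]]:
    """Compute boundary inputs and outputs of the selected node subset."""
    chosen = [n for n in nodes if n.name in selected]
    produced = {o for n in chosen for o in n.outputs}

    # Inputs: values a selected node consumes that no selected node produces.
    inputs: List[str] = []
    for n in chosen:
        for value in n.inputs:
            if value and value not in produced and value not in inputs:
                inputs.append(value)

    # Outputs: values produced inside the subset that are consumed by an
    # unselected node or are outputs of the whole graph.
    consumed_outside = {v for n in nodes if n.name not in selected for v in n.inputs}
    outputs = [o for n in chosen for o in n.outputs
               if o in consumed_outside or o in graph_outputs]
    return inputs, outputs
```

For a chain `A -> B -> C`, selecting only `B` yields the value produced by `A` as the single input and the value consumed by `C` as the single output.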
Copilot reviewed 36 out of 37 changed files in this pull request and generated 4 comments.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot reviewed 36 out of 37 changed files in this pull request and generated 6 comments.
Adjust warning Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Adjust ordering Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot reviewed 36 out of 37 changed files in this pull request and generated 3 comments.
Copilot reviewed 36 out of 37 changed files in this pull request and generated 3 comments.
tianleiwu left a comment
Incomplete optimizer coverage: While many optimizers are updated, the codebase has dozens of optimizers under onnxruntime/core/optimizer/. A grep for AddNode or graph.AddNode patterns not covered by this PR would be prudent to ensure no optimizer is creating nodes without propagating annotations. Missing even one optimizer would cause annotation loss for affected nodes, leading to incorrect partitioning in layered mode.
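A coarse version of the suggested grep could look like the sketch below. It flags source files that call `AddNode(` but never mention `DuplicateNodeAnnotation`. This is a file-level heuristic only: a file that propagates annotations some other way, or that propagates for one call site but not another, will be misclassified. The function name and default patterns are assumptions, not part of the PR.

```python
import os
import re
from typing import List, Tuple

# Patterns are illustrative; adjust them to whatever helpers the codebase uses.
ADD_NODE = re.compile(r"\bAddNode\s*\(")
PROPAGATE = re.compile(r"\bDuplicateNodeAnnotation\b")


def find_unpropagated(root: str, exts: Tuple[str, ...] = (".cc", ".h")) -> List[str]:
    """Return paths (relative to root) that call AddNode without the propagation helper."""
    hits = []
    for dirpath, _dirs, files in os.walk(root):
        for fname in files:
            if not fname.endswith(exts):
                continue
            path = os.path.join(dirpath, fname)
            with open(path, encoding="utf-8", errors="replace") as f:
                text = f.read()
            if ADD_NODE.search(text) and not PROPAGATE.search(text):
                hits.append(os.path.relpath(path, root))
    return sorted(hits)
```

Running it as `find_unpropagated("onnxruntime/core/optimizer")` from a checkout would give a list of files to review by hand.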
…new AddNode(), fix session state test
This pull request introduces support for node "layering annotations" and improves resource accounting and memory management during graph partitioning in ONNX Runtime. The changes add new mechanisms for annotating nodes, filtering nodes by annotation during partitioning, and efficiently accounting for resources in fused nodes. Several APIs are extended to support these features, and new configuration options are introduced to guide layer assignment.
Layering annotations & partitioning:
- Added a layering_annotation_ member and associated getter/setter/clear methods to the Node class, allowing nodes to be annotated for layer assignment. Also added a method to clear these annotations after partitioning to save memory. (include/onnxruntime/core/graph/graph.h)
- Filters nodes during partitioning using LayeringIndex, ensuring only nodes matching the current execution provider's assignment are considered. (onnxruntime/core/framework/graph_partitioner.cc)
- Added kOrtSessionOptionsLayerAssignmentSettings to configure layer assignment using annotation prefixes per device. (include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h)

Resource accounting improvements:
- Extended the IResourceAccountant interface to allow resetting and committing pending weights per node, and updated resource accounting logic to correctly sum and commit costs for all constituent nodes in fused nodes, preventing double-counting or undercounting. (include/onnxruntime/core/framework/resource_accountant.h, include/onnxruntime/core/graph/indexed_sub_graph.h, onnxruntime/core/framework/graph_partitioner.cc)

API and code organization:
- Updated the Graph class and related APIs to propagate layering annotations during function inlining and to provide a method for removing all layering annotations after partitioning. (include/onnxruntime/core/graph/graph.h)
- Moved the CreateAccountants function out of the NodeStatsRecorder class to the namespace level for clarity. (include/onnxruntime/core/framework/resource_accountant.h)

These changes enable more flexible and memory-efficient graph partitioning, particularly for scenarios involving hardware-specific layer assignments and dynamic resource constraints.
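The pending/commit accounting for fused nodes can be illustrated with a toy model. The sketch assumes a node's cost is the byte size of the initializers it is first to use, so a weight shared by two constituent nodes is charged once. `WeightAccountant` and `commit_fused` are hypothetical names and do not mirror the IResourceAccountant signatures.

```python
from typing import Dict, Optional, Set


class WeightAccountant:
    """Toy accountant: a node's cost is the size of initializers not yet committed."""

    def __init__(self, threshold: Optional[int] = None):
        self.threshold = threshold          # None means no limit
        self.consumed = 0                   # bytes committed so far
        self._committed: Set[str] = set()   # initializers already charged
        self._pending: Dict[str, Dict[str, int]] = {}  # node -> {initializer: bytes}

    def set_threshold(self, threshold: Optional[int]) -> None:
        self.threshold = threshold

    def compute_cost(self, node: str, initializers: Dict[str, int]) -> int:
        """Record uncommitted initializers as pending for this node; return their size."""
        fresh = {k: v for k, v in initializers.items() if k not in self._committed}
        self._pending[node] = fresh
        return sum(fresh.values())

    def reset_pending(self, node: str) -> None:
        """Drop pending weights, e.g. when the node was not claimed after all."""
        self._pending.pop(node, None)

    def commit(self, node: str) -> int:
        """Charge the node's pending weights, skipping any committed in the meantime."""
        cost = 0
        for name, size in self._pending.pop(node, {}).items():
            if name not in self._committed:
                self._committed.add(name)
                cost += size
        self.consumed += cost
        return cost

    def fits(self, extra: int) -> bool:
        return self.threshold is None or self.consumed + extra <= self.threshold


def commit_fused(acct: WeightAccountant, constituents: Dict[str, Dict[str, int]]) -> int:
    """Sum and commit costs for every node folded into one fused node."""
    total = 0
    for node, initializers in constituents.items():
        acct.compute_cost(node, initializers)
        total += acct.commit(node)
    return total
```

With two constituent nodes that share a 100-byte weight `w`, `commit_fused` charges `w` once rather than twice, which is the double-counting the description refers to.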