Annotation based partitioning along with resource accounting #27595

Merged
tianleiwu merged 63 commits into main from yuslepukhin/layering
Mar 30, 2026

Conversation

Member

@yuslepukhin yuslepukhin commented Mar 9, 2026

This pull request introduces support for node "layering annotations" and improves resource accounting and memory management during graph partitioning in ONNX Runtime. The changes add new mechanisms for annotating nodes, filtering nodes by annotation during partitioning, and efficiently accounting for resources in fused nodes. Several APIs are extended to support these features, and new configuration options are introduced to guide layer assignment.

Layering annotations & partitioning:

  • Added layering_annotation_ member and associated getter/setter/clear methods to the Node class, allowing nodes to be annotated for layer assignment. Also added a method to clear these annotations after partitioning to save memory. (include/onnxruntime/core/graph/graph.h)
  • Extended the graph partitioning logic to support filtering nodes by their layering annotation using a LayeringIndex, ensuring only nodes matching the current execution provider's assignment are considered during partitioning. (onnxruntime/core/framework/graph_partitioner.cc)
  • Added a new session option kOrtSessionOptionsLayerAssignmentSettings to configure layer assignment using annotation prefixes per device. (include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h)
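To make the prefix-based filtering concrete, here is a minimal Python sketch of the idea. It is not the ONNX Runtime implementation: `ToyLayeringIndex`, the `Node` dataclass, the dict-based rule format, and the longest-prefix-wins tie-break are all assumptions made for illustration, and the actual format of kOrtSessionOptionsLayerAssignmentSettings is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    # Loosely mirrors the idea of Node::layering_annotation_; None means "not annotated".
    layering_annotation: Optional[str] = None


class ToyLayeringIndex:
    """Hypothetical stand-in for LayeringIndex: maps annotation prefixes to a device name."""

    def __init__(self, prefix_to_device: dict[str, str]):
        # Assumption of this sketch: the longest matching prefix wins.
        self._rules = sorted(prefix_to_device.items(), key=lambda kv: len(kv[0]), reverse=True)

    def device_for(self, node: Node) -> Optional[str]:
        if node.layering_annotation is None:
            return None
        for prefix, device in self._rules:
            if node.layering_annotation.startswith(prefix):
                return device
        return None

    def filter_for(self, nodes: list[Node], device: str) -> list[Node]:
        """Keep only the nodes whose annotation resolves to `device`."""
        return [n for n in nodes if self.device_for(n) == device]


nodes = [
    Node("embed", "layer_0"),
    Node("attn_1", "layer_1/attn"),
    Node("mlp_1", "layer_1/mlp"),
    Node("unannotated"),
]
index = ToyLayeringIndex({"layer_0": "cpu", "layer_1": "gpu", "layer_1/mlp": "cpu"})
gpu_nodes = [n.name for n in index.filter_for(nodes, "gpu")]
cpu_nodes = [n.name for n in index.filter_for(nodes, "cpu")]
print(gpu_nodes)  # ['attn_1']
print(cpu_nodes)  # ['embed', 'mlp_1']
```

In this sketch an unannotated node matches no device and is simply left out of every filtered view; how the real partitioner treats such nodes is not stated in the description above.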

Resource accounting improvements:

  • Improved the IResourceAccountant interface to allow resetting and committing pending weights per node, and updated resource accounting logic to correctly sum and commit costs for all constituent nodes in fused nodes, preventing double-counting or undercounting. (include/onnxruntime/core/framework/resource_accountant.h, include/onnxruntime/core/graph/indexed_sub_graph.h, onnxruntime/core/framework/graph_partitioner.cc)
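The pending/commit pattern described above can be illustrated with a small Python sketch. `ToyAccountant` and its method names are hypothetical and do not reproduce the IResourceAccountant interface; the sketch only shows why keeping per-node pending costs makes it possible to commit a fused node's constituents exactly once.

```python
class ToyAccountant:
    """Hypothetical accountant; the real IResourceAccountant interface is not reproduced here."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        self.consumed = 0
        self._pending: dict[str, int] = {}  # node name -> cost that is not yet committed

    def add_pending(self, node_name: str, cost: int) -> None:
        # Overwrites rather than accumulates, so re-evaluating a node cannot double-count it.
        self._pending[node_name] = cost

    def reset_pending(self, node_name: str) -> None:
        # Drop the pending cost of a node that ended up not being assigned.
        self._pending.pop(node_name, None)

    def commit_pending(self, node_name: str) -> None:
        # Committing removes the entry, so a second commit of the same node adds nothing.
        self.consumed += self._pending.pop(node_name, 0)

    def commit_fused(self, constituent_names: list[str]) -> None:
        # A fused node commits the cost of every node it replaces, each exactly once.
        for name in dict.fromkeys(constituent_names):
            self.commit_pending(name)


acc = ToyAccountant(threshold=100)
acc.add_pending("matmul", 40)
acc.add_pending("add", 10)
acc.add_pending("relu", 5)
acc.reset_pending("relu")                      # relu was not claimed in this toy scenario
acc.commit_fused(["matmul", "add", "matmul"])  # the repeated name is ignored
print(acc.consumed)  # 50
```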

API and code organization:

  • Updated the Graph class and related APIs to propagate layering annotations during function inlining and to provide a method for removing all layering annotations after partitioning. (include/onnxruntime/core/graph/graph.h)
  • Moved the CreateAccountants function out of the NodeStatsRecorder class to the namespace level for clarity. (include/onnxruntime/core/framework/resource_accountant.h)

These changes enable more flexible and memory-efficient graph partitioning, particularly for scenarios involving hardware-specific layer assignments and dynamic resource constraints.

Add session options string and parsing code along with the unit test
Introduce layering configuration
Refine LayeringRuleMatcher and add tests
Add OrtEpDevice matching logic and tests
Change the Matcher interface to match one rule against potentially many devices
Add matching against traditional EPs
Create LayeringIndex
Add LayeringIndex and tests
Adjust config parsing to detect errors
Adjust Create sig
Implement WeightsSizeBasedAccountant
Duplicate layering annotations for AddNode in L1 transformers.
  since the Update after layout transformation may rely on them.
Also
[ RUN      ] AttentionTest.Attention3DDefault
GPU Compute Capability: SM 6.1 (value: 610)
Assertion failed: data.IsUnfused(), file D:\dev\ort_trans\onnxruntime\contrib_ops\cuda\bert\attention_prepare_qkv.cu, line 318
This may be related to uninitialized memory.
  and add SessionState partitioning test for layered execution.
  Add layering configuration file for tiny_gpt2_beamsearch
  and a script to annotate the model by layers.
instance based on a set of nodes.
This is used by the graph partitioner to create a filtered graph viewer.
Adjust implementation of the Graph_GetViewer.
  Add a no-threshold and no-stat option for the accountant.
Contributor

Copilot AI left a comment

Pull request overview

This PR adds layering annotations to ONNX Runtime graphs and uses them to guide graph partitioning across execution providers, alongside enhancements to resource accounting (including an initializer-based fallback when pre-recorded stats aren’t provided).

Changes:

  • Introduces node-level layering annotations (loaded from NodeProto metadata "layer_ann") and a LayeringIndex to map annotations/rules to EP assignments during partitioning.
  • Extends the graph partitioner APIs to accept an optional LayeringIndex and filters EP capability queries accordingly, with logic to “unassign” nodes not claimed.
  • Improves resource accounting to support threshold updates and an initializer-based counting fallback; adds tests and a Python tool for annotating models.
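As a rough illustration of the initializer-based fallback mentioned above, the following Python sketch estimates a node's cost from the sizes of its initializer inputs when no recorded stats are available. The function names, the `ELEMENT_SIZE` table, and the dict-based data layout are assumptions of this sketch and are not taken from resource_accountant.cc.

```python
from math import prod

# Bytes per element for a few tensor element types (illustrative subset only).
ELEMENT_SIZE = {"float32": 4, "float16": 2, "int64": 8, "int8": 1}


def initializer_bytes(shape, dtype):
    """Size in bytes of a dense tensor with the given shape and element type."""
    return prod(shape) * ELEMENT_SIZE[dtype]


def node_cost(node_inputs, initializers, node_stats=None, node_name=None):
    """Use a recorded stat when one exists; otherwise sum the sizes of initializer inputs."""
    if node_stats is not None and node_name in node_stats:
        return node_stats[node_name]
    return sum(initializer_bytes(*initializers[i]) for i in node_inputs if i in initializers)


# Hypothetical Gemm-like node: X is a runtime input, W and b are initializers.
initializers = {"W": ((768, 768), "float32"), "b": ((768,), "float32")}
fallback_cost = node_cost(["X", "W", "b"], initializers)
recorded_cost = node_cost(["X", "W", "b"], initializers, node_stats={"gemm": 123}, node_name="gemm")
print(fallback_cost)  # 2362368
print(recorded_cost)  # 123
```

The fallback counts only weights, so it ignores activations and workspace memory; a threshold compared against it should be read as a weights budget in this sketch.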

Reviewed changes

Copilot reviewed 33 out of 34 changed files in this pull request and generated 9 comments.

| File | Description |
|---|---|
| onnxruntime/test/testdata/layering/tiny_gpt2_beamsearch_layering.txt | Adds test data for layering/annotation scenarios. |
| onnxruntime/test/framework/tensorutils_test.cc | Adds unit tests for extracting layer_ann from NodeProto metadata. |
| onnxruntime/test/framework/session_state_test.cc | Updates partitioning test helper to pass LayeringIndex; adds layering partitioning test. |
| onnxruntime/test/framework/layering_annotations_test.cc | Adds comprehensive unit tests for rule parsing/matching and LayeringIndex behavior. |
| onnxruntime/python/tools/layering/layer_annotate.py | Adds a Python tool to apply layer_ann metadata to ONNX nodes (recurses into subgraphs). |
| onnxruntime/core/session/onnxruntime_c_api.cc | Refactors Graph_GetGraphView subgraph IO detection and node handling. |
| onnxruntime/core/session/inference_session.cc | Builds and passes LayeringIndex from session options; clears annotations post-partitioning to save memory. |
| onnxruntime/core/providers/cuda/cuda_execution_provider.cc | Improves threshold handling/logging for resource-aware CUDA capability selection. |
| onnxruntime/core/optimizer/utils.h | Declares DuplicateNodeAnnotation helper for propagating annotations in transforms/fusions. |
| onnxruntime/core/optimizer/utils.cc | Implements DuplicateNodeAnnotation. |
| onnxruntime/core/optimizer/transpose_optimization/ort_optimizer_api_impl.cc | Exposes layering annotation get/set in optimizer API; copies annotation when copying nodes. |
| onnxruntime/core/optimizer/transpose_optimization/optimizer_api.h | Extends NodeRef API with layering annotation get/set. |
| onnxruntime/core/optimizer/transpose_optimization/onnx_transpose_optimization.cc | Propagates annotations to newly created nodes during transpose optimization rewrites. |
| onnxruntime/core/optimizer/reshape_fusion.cc | Copies annotation onto fused reshape node. |
| onnxruntime/core/optimizer/qdq_transformer/where_dummy_dq.cc | Copies annotation to inserted dummy DQ node. |
| onnxruntime/core/optimizer/qdq_transformer/weight_bias_quantization.cc | Copies annotation to inserted Q/DQ and helper nodes. |
| onnxruntime/core/optimizer/qdq_transformer/qdq_propagation.cc | Copies annotation to inserted Q/DQ nodes. |
| onnxruntime/core/optimizer/qdq_transformer/ensure_unique_dq_for_node_unit.cc | Copies annotation when duplicating DQ nodes. |
| onnxruntime/core/optimizer/matmul_add_fusion.cc | Copies annotation to inserted reshape/gemm fusion nodes. |
| onnxruntime/core/optimizer/embed_layer_norm_fusion.cc | Copies annotation to inserted Cast and EmbedLayerNorm fusion node. |
| onnxruntime/core/graph/graph_utils.h | Adds CreateFilteredIndexedGraph helper for building filtered GraphViewer inputs/outputs. |
| onnxruntime/core/graph/graph_utils.cc | Implements CreateFilteredIndexedGraph. |
| onnxruntime/core/graph/graph.cc | Adds Graph::RemoveAllLayeringAnnotations and loads node annotations from NodeProto metadata. |
| onnxruntime/core/framework/tensorprotoutils.h | Adds kNodeProtoLayerAnnotation constant and annotation extraction helper declaration. |
| onnxruntime/core/framework/tensorprotoutils.cc | Implements GetNodeProtoLayeringAnnotation. |
| onnxruntime/core/framework/resource_accountant.cc | Refactors accountant creation; adds initializer-based fallback resource counting. |
| onnxruntime/core/framework/layering_annotations.h | Adds layering rule parsing/matching and LayeringIndex API. |
| onnxruntime/core/framework/layering_annotations.cc | Implements rule parsing, EP matching heuristics, graph indexing, and update/unassign logic. |
| onnxruntime/core/framework/graph_partitioner.h | Extends GraphPartitioner::Partition signature to accept LayeringIndex*. |
| onnxruntime/core/framework/graph_partitioner.cc | Integrates layering-aware filtering into EP capability queries and assignment reset. |
| include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h | Documents new session config session.layer_assignment_settings and updates resource partitioning docs. |
| include/onnxruntime/core/graph/graph.h | Adds Node layering annotation storage/accessors and Graph::RemoveAllLayeringAnnotations declaration. |
| include/onnxruntime/core/framework/resource_accountant.h | Adds SetThreshold, makes ComputeResourceCount non-const, and moves CreateAccountants to a free function. |

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 36 out of 37 changed files in this pull request and generated 4 comments.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 36 out of 37 changed files in this pull request and generated 6 comments.

yuslepukhin and others added 4 commits March 27, 2026 11:32
Adjust warning

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Adjust ordering

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 36 out of 37 changed files in this pull request and generated 3 comments.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 36 out of 37 changed files in this pull request and generated 3 comments.

Contributor

@tianleiwu tianleiwu left a comment

Incomplete optimizer coverage: While many optimizers are updated, the codebase has dozens of optimizers under onnxruntime/core/optimizer/. A grep for AddNode or graph.AddNode patterns not covered by this PR would be prudent to ensure no optimizer is creating nodes without propagating annotations. Missing even one optimizer would cause annotation loss for affected nodes, leading to incorrect partitioning in layered mode.
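One way to approximate the suggested grep is a short Python script. This is a coarse heuristic sketched under assumptions: it only checks whether a .cc file that calls `AddNode(` also mentions `DuplicateNodeAnnotation` somewhere, so it will flag files that propagate annotations by other means (for example through the NodeRef get/set API) and will miss files where only some of the `AddNode` calls are covered.

```python
import re
from pathlib import Path

ADD_NODE = re.compile(r"\bAddNode\s*\(")
PROPAGATE = "DuplicateNodeAnnotation"


def files_missing_propagation(root: Path) -> list[Path]:
    """Return .cc files under `root` that call AddNode( but never mention DuplicateNodeAnnotation."""
    suspects = []
    for path in sorted(root.rglob("*.cc")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        if ADD_NODE.search(text) and PROPAGATE not in text:
            suspects.append(path)
    return suspects
```

Calling `files_missing_propagation(Path("onnxruntime/core/optimizer"))` from the repository root would list candidate files for manual review; each hit still needs to be read by hand to decide whether annotation loss is actually possible.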

@tianleiwu tianleiwu merged commit f4bdbb8 into main Mar 30, 2026
104 of 110 checks passed
@tianleiwu tianleiwu deleted the yuslepukhin/layering branch March 30, 2026 08:19
4 participants