
Fix flaky ResourceAwarePartitioning tests by generating node stats dynamically#27877

Closed
tianleiwu wants to merge 1 commit into main from tlwu/20260327/fix_TestResourceAwarePartitioning

Conversation

@tianleiwu
Contributor

Problem

The SessionStateTest.TestResourceAwarePartitioning_CPUOffloaded (and potentially _LargeLimit) tests are flaky on the Linux CUDA CI pipeline. The root cause is a stale static stats file (tiny_gpt2_beamsearch_node_stats.txt) that contains pre-baked node name hashes.

How the stats file works

The resource-aware partitioning feature uses IResourceAccountant::MakeUniqueNodeName(node) to generate node identifiers of the form {node_name}_{MurmurHash3(input_defs + output_defs)}. At runtime, SizeTAccountant::ComputeResourceCount() looks up each node's name in a hash map loaded from the stats file to determine memory cost.

Why it breaks

When graph optimizers on main change node input/output def names (e.g., through fusion or layout transformation), the MurmurHash3 hashes change. This causes a mismatch between the names in the static stats file and the names generated at runtime. Unmatched nodes return 0 cost, which lowers the total below the hard-coded threshold, causing the _CPUOffloaded test to fail because no nodes get offloaded to CPU.

In one observed run, 6 of 60 nodes failed to match, dropping the runtime total from 5,550,436 bytes (the stats-file total) to 4,470,500 bytes — below the 5,120,000-byte threshold used by the test.

This is a pre-existing issue on main, not caused by any specific PR.

Solution

Instead of relying on a pre-baked static stats file, generate the node stats dynamically at test time:

  1. CollectNodeNames() — Recursively walks the graph and all subgraphs, calling IResourceAccountant::MakeUniqueNodeName() for each node. This guarantees the names always match what the runtime will produce.

  2. GenerateDynamicNodeStatsFile() — Loads the model, resolves the graph, collects node names, and writes a temporary CSV stats file with a uniform cost per node. Returns the total cost so tests can set thresholds relative to the actual total.

  3. Tests compute thresholds dynamically:

    • _LargeLimit: threshold = 2× total cost → all nodes stay on CUDA
    • _CPUOffloaded: threshold = 0.5× total cost → some nodes must be offloaded to CPU
  4. Stats files are written to the system temp directory (std::filesystem::temp_directory_path()) instead of testdata/transformers/, avoiding any assumption that the working directory is writable. This works because LoadNodeAllocationStats joins paths with std::filesystem::path::operator/=, which replaces the left-hand path entirely when the appended filename is absolute.

  5. Temp files are cleaned up via std::filesystem::remove() after each test.

Changes

  • onnxruntime/test/framework/session_state_test.cc:
    • Added #include <fstream> and #include "core/framework/resource_accountant.h"
    • Added helper CollectNodeNames() (recursive graph walker)
    • Added helper GenerateDynamicNodeStatsFile() (generates temp stats file)
    • Rewrote TestResourceAwarePartitioning_LargeLimit to use dynamic stats
    • Rewrote TestResourceAwarePartitioning_CPUOffloaded to use dynamic stats
    • TestResourceAwarePartitioning_NoLimit left unchanged (does not use a stats file)

Testing

These tests require a Linux CUDA build to run:

./onnxruntime_test_all --gtest_filter="SessionStateTest.TestResourceAwarePartitioning*"

@tianleiwu tianleiwu requested a review from yuslepukhin March 27, 2026 07:36
@yuslepukhin
Member

These changes were incorporated into #27595

