
Fix flaky ResourceAwarePartitioning tests by generating node stats dynamically#27877

Closed
tianleiwu wants to merge 1 commit into main from tlwu/20260327/fix_TestResourceAwarePartitioning

Conversation

@tianleiwu
Contributor

Problem

The SessionStateTest.TestResourceAwarePartitioning_CPUOffloaded (and potentially _LargeLimit) tests are flaky on the Linux CUDA CI pipeline. The root cause is a stale static stats file (tiny_gpt2_beamsearch_node_stats.txt) that contains pre-baked node name hashes.

How the stats file works

The resource-aware partitioning feature uses IResourceAccountant::MakeUniqueNodeName(node) to generate node identifiers of the form {node_name}_{MurmurHash3(input_defs + output_defs)}. At runtime, SizeTAccountant::ComputeResourceCount() looks up each node's name in a hash map loaded from the stats file to determine memory cost.

Why it breaks

When graph optimizers on main change node input/output def names (e.g., through fusion or layout transformation), the MurmurHash3 hashes change. This causes a mismatch between the names in the static stats file and the names generated at runtime. Unmatched nodes return 0 cost, which lowers the total below the hard-coded threshold, causing the _CPUOffloaded test to fail because no nodes get offloaded to CPU.

In one observed run, 6 of 60 nodes failed to match, dropping the runtime total from 5,550,436 bytes (the stats-file total) to 4,470,500 bytes — below the 5,120,000-byte threshold used by the test.

This is a pre-existing issue on main, not caused by any specific PR.

Solution

Instead of relying on a pre-baked static stats file, generate the node stats dynamically at test time:

  1. CollectNodeNames() — Recursively walks the graph and all subgraphs, calling IResourceAccountant::MakeUniqueNodeName() for each node. This guarantees the names always match what the runtime will produce.

  2. GenerateDynamicNodeStatsFile() — Loads the model, resolves the graph, collects node names, and writes a temporary CSV stats file with a uniform cost per node. Returns the total cost so tests can set thresholds relative to the actual total.

  3. Tests compute thresholds dynamically:

    • _LargeLimit: threshold = 2× total cost → all nodes stay on CUDA
    • _CPUOffloaded: threshold = 0.5× total cost → some nodes must be offloaded to CPU
  4. Stats files are written to the system temp directory (std::filesystem::temp_directory_path()) instead of testdata/transformers/, avoiding any assumption that the working directory is writable. This works because LoadNodeAllocationStats joins paths with std::filesystem::path::operator/=, which replaces the left-hand path entirely when the appended filename is absolute.

  5. Temp files are cleaned up via std::filesystem::remove() after each test.

Changes

  • onnxruntime/test/framework/session_state_test.cc:
    • Added #include <fstream> and #include "core/framework/resource_accountant.h"
    • Added helper CollectNodeNames() (recursive graph walker)
    • Added helper GenerateDynamicNodeStatsFile() (generates temp stats file)
    • Rewrote TestResourceAwarePartitioning_LargeLimit to use dynamic stats
    • Rewrote TestResourceAwarePartitioning_CPUOffloaded to use dynamic stats
    • TestResourceAwarePartitioning_NoLimit left unchanged (does not use a stats file)

Testing

These tests require a Linux CUDA build to run:

./onnxruntime_test_all --gtest_filter="SessionStateTest.TestResourceAwarePartitioning*"

@tianleiwu tianleiwu requested a review from yuslepukhin March 27, 2026 07:36
@yuslepukhin
Member

These changes were incorporated into #27595

