Fix flaky ResourceAwarePartitioning tests by generating node stats dynamically#27877
Closed
These changes were incorporated into #27595.
## Problem

The `SessionStateTest.TestResourceAwarePartitioning_CPUOffloaded` (and potentially `_LargeLimit`) tests are flaky on the Linux CUDA CI pipeline. The root cause is a stale static stats file (`tiny_gpt2_beamsearch_node_stats.txt`) that contains pre-baked node name hashes.

### How the stats file works
The resource-aware partitioning feature uses `IResourceAccountant::MakeUniqueNodeName(node)` to generate node identifiers of the form `{node_name}_{MurmurHash3(input_defs + output_defs)}`. At runtime, `SizeTAccountant::ComputeResourceCount()` looks up each node's name in a hash map loaded from the stats file to determine its memory cost.

### Why it breaks
When graph optimizers on `main` change node input/output def names (e.g., through fusion or layout transformation), the MurmurHash3 hashes change. This causes a mismatch between the names in the static stats file and the names generated at runtime. Unmatched nodes return a cost of 0, which lowers the total below the hard-coded threshold, causing the `_CPUOffloaded` test to fail because no nodes get offloaded to CPU.

For example, 6 out of 60 nodes were failing to match, dropping the runtime total from 5,550,436 bytes (in the stats file) to 4,470,500 bytes, below the 5,120,000-byte threshold used by the test.
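The failure mechanism can be illustrated with a small, self-contained sketch. Everything here is a hypothetical stand-in, not the onnxruntime implementation: `std::hash` replaces MurmurHash3, and `FakeNode` reduces a node to just the fields the naming scheme uses.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for an onnxruntime Node, reduced to the fields
// that feed into the unique-name scheme.
struct FakeNode {
  std::string name;
  std::vector<std::string> input_defs;
  std::vector<std::string> output_defs;
};

// Mimics the {node_name}_{hash(input_defs + output_defs)} scheme, with
// std::hash standing in for MurmurHash3.
std::string MakeUniqueNodeName(const FakeNode& node) {
  std::string defs;
  for (const auto& d : node.input_defs) defs += d;
  for (const auto& d : node.output_defs) defs += d;
  return node.name + "_" + std::to_string(std::hash<std::string>{}(defs));
}

// Mimics the stats-file lookup: a node whose generated name is not in the
// map contributes a cost of 0, silently shrinking the total.
std::size_t ComputeResourceCount(
    const std::unordered_map<std::string, std::size_t>& stats,
    const FakeNode& node) {
  auto it = stats.find(MakeUniqueNodeName(node));
  return it == stats.end() ? 0 : it->second;
}
```

Renaming any input/output def changes the hash, so a stats file baked against an older graph returns 0 for those nodes without any error, which is exactly why the failure is silent and flaky rather than a hard error.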
This is a pre-existing issue on `main`, not caused by any specific PR.

## Solution
Instead of relying on a pre-baked static stats file, generate the node stats dynamically at test time:

- `CollectNodeNames()` — recursively walks the graph and all subgraphs, calling `IResourceAccountant::MakeUniqueNodeName()` for each node. This guarantees the names always match what the runtime will produce.
- `GenerateDynamicNodeStatsFile()` — loads the model, resolves the graph, collects node names, and writes a temporary CSV stats file with a uniform cost per node. It returns the total cost so tests can set thresholds relative to the actual total.

Tests compute thresholds dynamically:
- `_LargeLimit`: threshold = 2× total cost → all nodes stay on CUDA
- `_CPUOffloaded`: threshold = 0.5× total cost → some nodes must be offloaded to CPU

Stats files are written to the system temp directory (`std::filesystem::temp_directory_path()`) instead of `testdata/transformers/`, avoiding assumptions about the working directory being writable. This works because `LoadNodeAllocationStats` uses `std::filesystem::path::operator/=`, which replaces the path when the filename is absolute.

Temp files are cleaned up via `std::filesystem::remove()` after each test.

## Changes
`onnxruntime/test/framework/session_state_test.cc`:

- Added `#include <fstream>` and `#include "core/framework/resource_accountant.h"`
- Added `CollectNodeNames()` (recursive graph walker)
- Added `GenerateDynamicNodeStatsFile()` (generates a temp stats file)
- Updated `TestResourceAwarePartitioning_LargeLimit` to use dynamic stats
- Updated `TestResourceAwarePartitioning_CPUOffloaded` to use dynamic stats
- `TestResourceAwarePartitioning_NoLimit` left unchanged (it does not use a stats file)

## Testing
These tests require a Linux CUDA build to run:

```shell
./onnxruntime_test_all --gtest_filter="SessionStateTest.TestResourceAwarePartitioning*"
```
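The dynamic-stats approach described in the Solution section can be sketched as a small self-contained helper. This is not the actual PR code: the real helper loads and resolves an onnxruntime model and collects names via `MakeUniqueNodeName()`, whereas here the node names are passed in directly and the `name,cost` CSV layout is an assumption.

```cpp
#include <cstddef>
#include <filesystem>
#include <fstream>
#include <string>
#include <vector>

// Hypothetical sketch of GenerateDynamicNodeStatsFile(): writes one
// "name,cost" line per node with a uniform cost into the system temp
// directory, and returns the total cost so the caller can derive
// thresholds relative to it (e.g. 2x total for _LargeLimit, 0.5x total
// for _CPUOffloaded).
std::size_t GenerateDynamicNodeStatsFile(
    const std::vector<std::string>& node_names,
    std::size_t cost_per_node,
    std::filesystem::path& stats_path) {
  // Absolute temp path, so no assumption that the working directory
  // (e.g. testdata/transformers/) is writable.
  stats_path = std::filesystem::temp_directory_path() / "dynamic_node_stats.csv";
  std::ofstream stats(stats_path);
  std::size_t total = 0;
  for (const auto& name : node_names) {
    stats << name << "," << cost_per_node << "\n";
    total += cost_per_node;
  }
  return total;
}
```

Cleanup mirrors the PR: the test calls `std::filesystem::remove(stats_path)` when it finishes, so no state leaks between runs.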