
Bug: Incorrect cross-rank GPU attribution during multinode merged trace analysis, indefinite hangs #581

@brieflynn

Description


Problem

Engineers on my team were unable to use TraceLens to process merged trace files from multi-node workload runs.

Investigation

When PyTorch profiles a single rank, each GPU kernel launch is assigned a correlation ID that is unique only within that profiling session. When traces from K ranks are merged into one file, each rank's correlation IDs restart from the same numeric range, so the merged file contains colliding IDs. add_gpu_ops_to_tree uses these correlation IDs to link CPU runtime events to their GPU kernels, and it was vulnerable to the collisions in two ways.
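To make the collision concrete, here is a minimal sketch of what the merged trace looks like at the event level. The field layout follows the Chrome-trace JSON that the PyTorch profiler emits; the two-rank setup, pids, and values are illustrative, not taken from a real trace:

```python
# Two ranks merged into one file: correlation ID 42 appears under both,
# because each rank's profiler session numbered its launches independently.
# Field layout follows PyTorch's Chrome-trace JSON; the values are made up.
merged_events = [
    # rank 0 (pid 1001)
    {"name": "cudaLaunchKernel", "cat": "cuda_runtime", "pid": 1001, "args": {"correlation": 42}},
    {"name": "gemm_kernel",      "cat": "kernel",       "pid": 1001, "args": {"correlation": 42}},
    # rank 1 (pid 2001) -- its correlation IDs restart from the same range
    {"name": "cudaLaunchKernel", "cat": "cuda_runtime", "pid": 2001, "args": {"correlation": 42}},
    {"name": "flash_attn_kernel","cat": "kernel",       "pid": 2001, "args": {"correlation": 42}},
]
```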

First, _get_graph_gpu_events looked up GPU kernels by correlation ID alone, so in a merged trace every graph launch event claimed the kernels from all K ranks sharing that ID, producing incorrect cross-rank attribution.
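As a rough illustration of that failure mode, and of one possible way to scope the lookup, indexing kernels by correlation ID alone mixes the ranks, while keying on a (pid, correlation) pair keeps them apart. This is only a sketch of the idea, not the actual _get_graph_gpu_events code and not necessarily the approach taken in #577:

```python
from collections import defaultdict

# Kernel events from the sketch above, repeated so this runs standalone.
kernels = [
    {"name": "gemm_kernel",       "pid": 1001, "args": {"correlation": 42}, "dur": 120},
    {"name": "flash_attn_kernel", "pid": 2001, "args": {"correlation": 42}, "dur": 310},
]

# Correlation-only index: both ranks' kernels land under the same key.
by_corr = defaultdict(list)
for k in kernels:
    by_corr[k["args"]["correlation"]].append(k["name"])
print(by_corr[42])  # ['gemm_kernel', 'flash_attn_kernel'] -- cross-rank mix-up

# Rank-scoped index: each launch event only sees its own rank's kernels.
by_rank_corr = defaultdict(list)
for k in kernels:
    by_rank_corr[(k["pid"], k["args"]["correlation"])].append(k["name"])
print(by_rank_corr[(1001, 42)])  # ['gemm_kernel'] only
```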

Second, the per-kernel ancestor walk had no mechanism to deduplicate GPU events that appeared as children of multiple runtime parents. Because the collisions attach each kernel to launch events from all K ranks, and each attachment is propagated up the tree independently, the work during propagation grows as $O(K^2 \times N_{gpu})$.
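The second problem compounds the first: once a kernel is attached under several runtime parents, an ancestor walk with no memory of what it has already counted repeats the same work and double-counts the same GPU time. Below is a minimal sketch of the seen-set idea, using a hypothetical tree node rather than the real TraceLens structure:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Node:
    # Hypothetical call-tree node, not the TraceLens structure.
    gpu_events: List[Dict[str, Any]] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    gpu_time: int = 0

def propagate_gpu_time(node: Node, seen=None) -> int:
    """Sum kernel durations into each node, counting each GPU event once."""
    if seen is None:
        seen = set()
    total = 0
    for kernel in node.gpu_events:
        # Identify a kernel by (pid, correlation, name) so a duplicate
        # attachment under another runtime parent is skipped.
        key = (kernel["pid"], kernel["args"]["correlation"], kernel["name"])
        if key not in seen:
            seen.add(key)
            total += kernel.get("dur", 0)
    for child in node.children:
        total += propagate_gpu_time(child, seen)
    node.gpu_time = total
    return total
```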

Together, these caused the incorrect cross-rank GPU attribution and the indefinite hangs observed when processing merged trace files from PyTorch profiling of multi-node workloads.

Status

I have a PR open (#577) describing my fixes for the issues my team was facing. With these changes, a large multi-node merged PyTorch trace file that previously hung indefinitely now completes in 235.9 seconds. However, I would like to encourage some discussion of these observations so that we can arrive at the best solution.
