
Bug: Incorrect cross-rank GPU attribution during multinode merged trace analysis, indefinite hangs #581

@brieflynn

Description


Problem

Engineers on my team were unable to use TraceLens to process merged trace files from multi-node workload runs.

Investigation

When PyTorch profiles a single rank, each GPU kernel launch is assigned a correlation ID that is unique only within that profiling session. When traces from K ranks are merged into one file, each rank's correlation IDs restart from the same numeric range, so the merged file contains colliding IDs. add_gpu_ops_to_tree uses these correlation IDs to link CPU runtime events to their GPU kernels, and it was vulnerable to the collisions in two ways.
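To make the collision concrete, here is a minimal sketch of what the merged trace looks like at the event level. The field layout follows the Chrome-trace JSON that the PyTorch profiler emits; the two-rank setup, pids, and values are illustrative, not taken from a real trace:

```python
# Two ranks merged into one file: correlation ID 42 appears under both,
# because each rank's profiler session numbered its launches independently.
# Field layout follows PyTorch's Chrome-trace JSON; the values are made up.
merged_events = [
    # rank 0 (pid 1001)
    {"name": "cudaLaunchKernel", "cat": "cuda_runtime", "pid": 1001, "args": {"correlation": 42}},
    {"name": "gemm_kernel",      "cat": "kernel",       "pid": 1001, "args": {"correlation": 42}},
    # rank 1 (pid 2001) -- its correlation IDs restart from the same range
    {"name": "cudaLaunchKernel", "cat": "cuda_runtime", "pid": 2001, "args": {"correlation": 42}},
    {"name": "flash_attn_kernel","cat": "kernel",       "pid": 2001, "args": {"correlation": 42}},
]
```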

First, _get_graph_gpu_events looked up GPU kernels by correlation ID alone, so in a merged trace every graph launch event claimed the kernels from all K ranks sharing that ID, producing incorrect cross-rank attribution.
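As a rough illustration of that failure mode, and of one possible way to scope the lookup, indexing kernels by correlation ID alone mixes the ranks, while keying on a (pid, correlation) pair keeps them apart. This is only a sketch of the idea, not the actual _get_graph_gpu_events code and not necessarily the approach taken in #577:

```python
from collections import defaultdict

# Kernel events from the sketch above, repeated so this runs standalone.
kernels = [
    {"name": "gemm_kernel",       "pid": 1001, "args": {"correlation": 42}, "dur": 120},
    {"name": "flash_attn_kernel", "pid": 2001, "args": {"correlation": 42}, "dur": 310},
]

# Correlation-only index: both ranks' kernels land under the same key.
by_corr = defaultdict(list)
for k in kernels:
    by_corr[k["args"]["correlation"]].append(k["name"])
print(by_corr[42])  # ['gemm_kernel', 'flash_attn_kernel'] -- cross-rank mix-up

# Rank-scoped index: each launch event only sees its own rank's kernels.
by_rank_corr = defaultdict(list)
for k in kernels:
    by_rank_corr[(k["pid"], k["args"]["correlation"])].append(k["name"])
print(by_rank_corr[(1001, 42)])  # ['gemm_kernel'] only
```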

Second, the per-kernel ancestor walk had no mechanism to deduplicate GPU events that appeared as children of multiple runtime parents. Because the collisions attach each kernel to launch events from all K ranks, and each attachment is propagated up the tree independently, the work during propagation grows as $O(K^2 \times N_{gpu})$.
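The second problem compounds the first: once a kernel is attached under several runtime parents, an ancestor walk with no memory of what it has already counted repeats the same work and double-counts the same GPU time. Below is a minimal sketch of the seen-set idea, using a hypothetical tree node rather than the real TraceLens structure:

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class Node:
    # Hypothetical call-tree node, not the TraceLens structure.
    gpu_events: List[Dict[str, Any]] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)
    gpu_time: int = 0

def propagate_gpu_time(node: Node, seen=None) -> int:
    """Sum kernel durations into each node, counting each GPU event once."""
    if seen is None:
        seen = set()
    total = 0
    for kernel in node.gpu_events:
        # Identify a kernel by (pid, correlation, name) so a duplicate
        # attachment under another runtime parent is skipped.
        key = (kernel["pid"], kernel["args"]["correlation"], kernel["name"])
        if key not in seen:
            seen.add(key)
            total += kernel.get("dur", 0)
    for child in node.children:
        total += propagate_gpu_time(child, seen)
    node.gpu_time = total
    return total
```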

Together, these caused the incorrect cross-rank GPU attribution and the indefinite hangs observed when processing merged trace files from PyTorch profiling of multi-node workloads.

Status

I have a PR open (#577) describing my fixes for the issues my team was facing. With these changes, a large multi-node merged PyTorch trace file that previously hung indefinitely now completes in 235.9 seconds. However, I would like to encourage some discussion of these observations so that we can arrive at the best solution.
