
[PyTorch Debug] Support tensor dump #2645

Open
pggPL wants to merge 36 commits into NVIDIA:main from pggPL:inpsect_tensor_dump_support

Conversation

@pggPL
Collaborator

@pggPL pggPL commented Feb 3, 2026

Description

This PR introduces a new debug feature focused on offline analysis of tensors.
The motivation is to make it easier to inspect and analyze intermediate tensors outside of runtime, especially during quantization debugging.

The new `DumpTensors` feature allows saving:

  • high-precision tensors (before quantization),
  • quantized tensors (after quantization).

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Changes introduced in this PR:

  • Added new debug feature: `transformer_engine.debug.features.dump_tensors.DumpTensors`.
  • Added support for dumping high-precision and quantized tensors via `inspect_tensor`.
  • Added/updated tests in `tests/pytorch/debug/test_log.py` for DumpTensors sanity flow.
  • Updated debug documentation/API listing to include `DumpTensors` in `docs/debug/3_api_features.rst`.
  • Fixed robustness issues found in review:
    • logger re-initialization across debug sessions,
    • dump test validation timing (before temp directory cleanup).

Checklist

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL added 2 commits February 3, 2026 08:54
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL pggPL changed the title from "[Debug] Support tensor dump" to "[PyTorch Debug] Support tensor dump" on Feb 3, 2026
pre-commit-ci bot and others added 6 commits February 3, 2026 10:45
Signed-off-by: root <pgadzinski@nvidia.com>
Signed-off-by: root <pgadzinski@nvidia.com>
Signed-off-by: root <pgadzinski@nvidia.com>
@pggPL pggPL marked this pull request as ready for review March 5, 2026 10:44
@greptile-apps
Contributor

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR introduces the DumpTensors debug feature, which enables saving high-precision and quantized tensors to disk for offline analysis. It also includes a TensorLogger singleton, a sanity test, documentation updates, and a bug fix in api.py that prevents KeyError when new kwargs (quantizer, rowwise_quantized_tensor, columnwise_quantized_tensor) are not present in kwargs_copy for older features.

The implementation has been significantly refined compared to earlier drafts — the previous complex internals extraction logic (_get_quantized_internals, _unpack_uint4_codes, _decode_uint4_e2m1_to_float) has been removed in favour of a clean, simpler design that saves the QuantizedTensor object directly. Many concerns from earlier review rounds have been addressed (e.g., detach().clone() for snapshot safety, ValueError instead of AssertionError, empty-dump log message, weights_only=False documentation).

Key remaining issues:

  • Python 3.9 incompatibility: Optional[torch.Tensor | QuantizedTensor] (PEP 604 union syntax) will raise TypeError at import time on Python 3.9. The fix is Optional[Union[torch.Tensor, QuantizedTensor]] with Union added to the typing import.
  • Filename collision risk via dot→underscore sanitization: Sanitizing . to _ makes block.0.attn and block_0_attn produce the same filename. Using - as the replacement character avoids this since PyTorch module names never contain hyphens.
  • inspect_tensor_enabled is not short-circuited: When both high_precision_tensor and quantized_tensor are False, the feature still returns True based on frequency checks, causing unnecessary dispatch and log noise at every matching step and layer.
  • Misleading dump_dict type annotation: Dict[str, torch.Tensor] should be Dict[str, Union[torch.Tensor, QuantizedTensor]].
  • Docstring load example missing map_location: Loading CUDA-saved tensors without map_location='cpu' will fail in CPU-only offline environments — the primary stated use case.
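The loading concern above is easy to demonstrate. The sketch below shows the load pattern the review asks for in the docstring example, using a locally saved stand-in dump (the path and dict key are illustrative, not the feature's actual layout): `map_location="cpu"` lets CUDA-saved dumps load in CPU-only environments, and `weights_only=False` is needed because quantized dumps pickle full QuantizedTensor objects.

```python
import os
import tempfile

import torch

# Illustrative round trip: save a dump-style dict, then reload it the way an
# offline analysis script should. The "high_precision" key mirrors this PR's
# naming; the file path here is a stand-in for the real dump layout.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "layer1_activation.pt")
    dump = {"high_precision": torch.randn(4, 4)}
    torch.save(dump, path)
    # map_location="cpu" avoids failures on machines without the saving GPU;
    # weights_only=False permits unpickling non-tensor objects in the dict.
    loaded = torch.load(path, map_location="cpu", weights_only=False)
```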

Confidence Score: 2/5

  • Not safe to merge as-is: the PEP 604 union syntax will raise TypeError on Python 3.9, and the dot-to-underscore sanitization silently overwrites dumps for typical hierarchical PyTorch module names.
  • Two of the flagged issues are likely to cause real problems in practice: the Python 3.9 incompatibility is a hard import error that breaks the feature for any user on that Python version, and the filename collision for dot-separated layer names is a data-integrity issue for the primary use case (offline analysis of large models with hierarchical names like transformer.layer.0.attention). The feature's core logic is otherwise sound — the singleton lifecycle, snapshot correctness (detach+clone), and round-trip serialisation all work correctly.
  • transformer_engine/debug/features/dump_tensors.py — Python version compatibility fix and sanitization character choice needed before merge.

Important Files Changed

  • transformer_engine/debug/features/dump_tensors.py
    New DumpTensors feature with several open issues: PEP 604 union syntax breaks Python 3.9, incorrect dump_dict type annotation, inspect_tensor_enabled doesn't short-circuit when neither tensor type is enabled, and dot-to-underscore sanitization creates filename collision risk for PyTorch hierarchical layer names.
  • tests/pytorch/debug/test_log.py
    New DumpTensors sanity test looks solid: validates file creation, checks QuantizedTensor round-trip type preservation, uses torch.equal for exact comparison, and documents the weights_only=False requirement. Assertions happen inside the debug_session context, so the temp directory is valid at check time.
  • transformer_engine/debug/features/api.py
    Correct bug fix: pop(k) → pop(k, None) prevents KeyError when the new DumpTensors kwargs (quantizer, rowwise_quantized_tensor, columnwise_quantized_tensor) are absent from kwargs_copy for older features that don't accept them.
  • transformer_engine/debug/features/log_fp8_tensor_stats.py
    Minor import reordering moves import transformer_engine_torch as tex between two groups of from nvdlfw_inspect imports, breaking PEP 8 import grouping convention. No functional change.
  • docs/debug/3_api_features.rst
    Adds DumpTensors to the debug API documentation listing. Still missing a trailing newline at end of file.
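The api.py fix called out above reduces to a standard dict idiom. A minimal sketch, with placeholder kwargs:

```python
# Sketch of the api.py fix: dict.pop with a default returns None instead of
# raising KeyError when the new DumpTensors kwargs were never passed to an
# older feature that does not accept them.
kwargs_copy = {"tensor": "t", "iteration": 0}
for k in ("quantizer", "rowwise_quantized_tensor", "columnwise_quantized_tensor"):
    kwargs_copy.pop(k, None)  # kwargs_copy.pop(k) would raise KeyError here
```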

Sequence Diagram

sequenceDiagram
    participant Caller as Training Loop
    participant API as TransformerEngineAPI
    participant DT as DumpTensors
    participant TL as TensorLogger (singleton)
    participant FS as File System

    Caller->>API: inspect_tensor(layer_name, tensor_name, iteration, tensor, rowwise_qt, columnwise_qt)
    API->>DT: inspect_tensor_enabled(config, layer_name, tensor_name, iteration)
    DT-->>API: (run_current=True, next_iter)
    API->>DT: inspect_tensor(config, layer_name, tensor_name, ...)
    DT->>DT: validate rowwise == columnwise (or one is None)
    DT->>DT: pick quantized_tensor (rowwise ?? columnwise)
    DT->>TL: ensure_initialized(root_log_dir)
    TL->>FS: makedirs(tensor_dumps/rank_N/)
    DT->>DT: build dump_dict {high_precision, quantized}
    Note over DT: tensor.detach().clone()
    Note over DT: quantized_tensor.detach().clone()
    DT->>TL: save_tensor(dump_dict, layer_name, tensor_name, iteration)
    TL->>TL: _sanitize_name(layer_name), _sanitize_name(tensor_name)
    TL->>FS: makedirs(iter_{iteration:06d}/)
    TL->>FS: torch.save(dump_dict, layer_tensor.pt)
    DT->>API: log_message("Dumped ...")
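The file layout steps in the diagram (rank directory, zero-padded iteration directory, sanitized layer/tensor filename) can be sketched as a single path builder. This is an illustration of the layout only; the function name and argument order are hypothetical, not the feature's API.

```python
import os

# Hypothetical helper mirroring the path layout from the sequence diagram:
# <root>/tensor_dumps/rank_<N>/iter_<iteration:06d>/<layer>_<tensor>.pt
def dump_path(root: str, rank: int, iteration: int, layer: str, tensor: str) -> str:
    d = os.path.join(root, "tensor_dumps", f"rank_{rank}", f"iter_{iteration:06d}")
    return os.path.join(d, f"{layer}_{tensor}.pt")

path = dump_path("/logs", 0, 7, "layer1", "activation")
```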

Last reviewed commit: "[pre-commit.ci] auto..."

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
pggPL and others added 4 commits March 5, 2026 10:57
Signed-off-by: root <pgadzinski@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
pggPL and others added 3 commits March 5, 2026 13:13
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: root <pgadzinski@nvidia.com>
pggPL and others added 2 commits March 5, 2026 14:03
Signed-off-by: root <pgadzinski@nvidia.com>
@pggPL
Collaborator Author

pggPL commented Mar 5, 2026

/te-ci pytorch

pggPL and others added 2 commits March 10, 2026 11:51
Drop the dump_quantized_internals config option, the _get_quantized_internals
method, and all helper functions for extracting scales/raw data from
Float8Tensor, Float8BlockwiseQTensor, MXFP8Tensor, and NVFP4Tensor.

Remove corresponding tests: test_dump_tensors_nvfp4_unpacked_codes and
NVFP4_DUMP_TENSORS_CONFIG, and scale/data assertions from test_dump_tensors_sanity.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
- Add dot ('.') to _sanitize_name to handle common PyTorch dotted layer
  names like 'encoder.layer.0.attention'
- Add docstring note about pickle dependency for the 'quantized' key
- Add comment explaining weights_only=False in test
- Remove redundant local RecipeState import in test_nvfp4_numeric

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
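The dotted-name handling from this commit can be sketched in a few lines. This is a hypothetical stand-in for `_sanitize_name`, not the actual implementation; it follows the review's suggestion of using "-" as the replacement character, which avoids colliding with underscores already present in module names.

```python
import re

# Hypothetical sanitizer: replace any character outside [0-9A-Za-z_] with "-",
# so dotted PyTorch module names like "encoder.layer.0.attention" become
# filesystem-safe without colliding with underscore-separated names.
def sanitize_name(name: str) -> str:
    return re.sub(r"[^0-9A-Za-z_]", "-", name)
```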
pggPL and others added 2 commits March 10, 2026 12:08
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Avoids relying on stale self.rank when ensure_initialized is called
before initialize() has set the rank. Consistent with how nvdlfw_inspect
logger resolves rank.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
pggPL and others added 2 commits March 10, 2026 12:38
Detach both high_precision and quantized tensors before saving to avoid
serializing the autograd graph. For QuantizedTensor this is a zero-copy
view (make_like), so no extra GPU allocation.

Add filename format assertion to test_dump_tensors_sanity to catch
regressions in _sanitize_name or the naming convention.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
pggPL and others added 2 commits March 10, 2026 13:33
Log a message when no tensors are available to dump so the user
has an explicit signal that no file was written.

Assert that the quantized key round-trips as a QuantizedTensor
to catch regressions in detach() or serialisation path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
@pggPL
Collaborator Author

pggPL commented Mar 10, 2026

/te-ci pytorch

Collaborator

@negvet negvet left a comment


Thanks for the contribution! Overall LGTM, minor comments

pggPL and others added 3 commits March 19, 2026 14:44
…st and MSE example

- Organize dumps into per-iteration subdirectories (iter_000000/) to keep
  file count manageable per directory.
- Remove unused self.rank attribute from TensorLogger.
- Add torch.allclose assertion in test to verify serialization correctness.
- Add docstring example showing how to load dumps and compute MSE.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Made-with: Cursor
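The load-and-compute-MSE workflow this commit documents can be sketched as follows. The tensors here are stand-ins (the second one plays the role of a dequantized QuantizedTensor); the real docstring example loads them from a dump file instead.

```python
import torch

# Sketch of the MSE computation from the docstring example, using stand-in
# tensors in place of a loaded dump's high-precision and dequantized values.
hp = torch.randn(8, 8)
quantized_approx = hp + 0.01 * torch.randn(8, 8)  # stand-in for dequantized data
mse = torch.mean((hp.float() - quantized_approx.float()) ** 2)
```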
Using tensor.detach() creates a view sharing the same underlying
storage. If any in-place operation modifies the tensor after the
dump, the saved data would be silently corrupted. Use .clone()
to ensure the dump captures an independent copy of the data.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
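The aliasing hazard this commit fixes is easy to reproduce in isolation:

```python
import torch

# Why the dump uses detach().clone(): a plain detach() is a view over the
# same storage, so an in-place update after the dump point silently changes
# the "snapshot". clone() captures an independent copy.
t = torch.ones(3)
view = t.detach()               # shares storage with t
snapshot = t.detach().clone()   # independent copy
t.add_(1.0)                     # in-place update after the dump point
```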
…nd-trip

The saved tensor is an exact bit-for-bit copy (detach().clone()), so
torch.equal is the correct check. torch.allclose with its default
tolerances could mask a genuine dtype conversion or precision loss
introduced by a future change to the serialisation path.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
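The masking effect described above can be shown with a lossy dtype round trip:

```python
import torch

# torch.allclose with default tolerances passes a lossy bf16 round trip
# that torch.equal correctly rejects, motivating the exact comparison.
a = torch.tensor([1.0000001, 2.0], dtype=torch.float32)
b = a.to(torch.bfloat16).to(torch.float32)  # rounds 1.0000001 to 1.0
close = torch.allclose(a, b)  # passes: diff is within default tolerances
exact = torch.equal(a, b)     # fails: the values are no longer identical
```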
pggPL and others added 2 commits March 20, 2026 13:10
…ll_feature backward compat pop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL
Collaborator Author

pggPL commented Mar 20, 2026

/te-ci pytorch
