
[PyTorch Debug] Support tensor dump #2645

Open
pggPL wants to merge 36 commits into NVIDIA:main from pggPL:inpsect_tensor_dump_support

Conversation

@pggPL
Collaborator

@pggPL pggPL commented Feb 3, 2026

Description

This PR introduces a new debug feature focused on offline analysis of tensors.
The motivation is to make it easier to inspect and analyze intermediate tensors outside of runtime, especially during quantization debugging.

The new `DumpTensors` feature allows saving:

  • high-precision tensors (before quantization),
  • quantized tensors (after quantization).

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Changes introduced in this PR:

  • Added new debug feature: `transformer_engine.debug.features.dump_tensors.DumpTensors`.
  • Added support for dumping high-precision and quantized tensors via `inspect_tensor`.
  • Added/updated tests in `tests/pytorch/debug/test_log.py` for DumpTensors sanity flow.
  • Updated debug documentation/API listing to include `DumpTensors` in `docs/debug/3_api_features.rst`.
  • Fixed robustness issues found in review:
    • logger re-initialization across debug sessions,
    • dump test validation timing (before temp directory cleanup).

Checklist

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

pggPL added 2 commits February 3, 2026 08:54
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL pggPL changed the title from "[Debug] Support tensor dump" to "[PyTorch Debug] Support tensor dump" on Feb 3, 2026
pre-commit-ci bot and others added 6 commits February 3, 2026 10:45
Signed-off-by: root <pgadzinski@nvidia.com>
Signed-off-by: root <pgadzinski@nvidia.com>
Signed-off-by: root <pgadzinski@nvidia.com>
@pggPL pggPL marked this pull request as ready for review March 5, 2026 10:44
@greptile-apps
Contributor

greptile-apps bot commented Mar 5, 2026

Greptile Summary

This PR introduces the DumpTensors debug feature, which enables saving high-precision and quantized tensors to disk for offline analysis. It also includes a TensorLogger singleton, a sanity test, documentation updates, and a bug fix in api.py that prevents KeyError when new kwargs (quantizer, rowwise_quantized_tensor, columnwise_quantized_tensor) are not present in kwargs_copy for older features.

The implementation has been significantly refined compared to earlier drafts — the previous complex internals extraction logic (_get_quantized_internals, _unpack_uint4_codes, _decode_uint4_e2m1_to_float) has been removed in favour of a clean, simpler design that saves the QuantizedTensor object directly. Many concerns from earlier review rounds have been addressed (e.g., detach().clone() for snapshot safety, ValueError instead of AssertionError, empty-dump log message, weights_only=False documentation).

Key remaining issues:

  • Python 3.9 incompatibility: Optional[torch.Tensor | QuantizedTensor] (PEP 604 union syntax) will raise TypeError at import time on Python 3.9. The fix is Optional[Union[torch.Tensor, QuantizedTensor]] with Union added to the typing import.
  • Filename collision risk via dot→underscore sanitization: Sanitizing . to _ makes block.0.attn and block_0_attn produce the same filename. Using - as the replacement character avoids this since PyTorch module names never contain hyphens.
  • inspect_tensor_enabled is not short-circuited: When both high_precision_tensor and quantized_tensor are False, the feature still returns True based on frequency checks, causing unnecessary dispatch and log noise at every matching step and layer.
  • Misleading dump_dict type annotation: Dict[str, torch.Tensor] should be Dict[str, Union[torch.Tensor, QuantizedTensor]].
  • Docstring load example missing map_location: Loading CUDA-saved tensors without map_location='cpu' will fail in CPU-only offline environments — the primary stated use case.
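The loading concern above is easy to demonstrate. The sketch below shows the load pattern the review asks for in the docstring example, using a locally saved stand-in dump (the path and dict key are illustrative, not the feature's actual layout): `map_location="cpu"` lets CUDA-saved dumps load in CPU-only environments, and `weights_only=False` is needed because quantized dumps pickle full QuantizedTensor objects.

```python
import os
import tempfile

import torch

# Illustrative round trip: save a dump-style dict, then reload it the way an
# offline analysis script should. The "high_precision" key mirrors this PR's
# naming; the file path here is a stand-in for the real dump layout.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "layer1_activation.pt")
    dump = {"high_precision": torch.randn(4, 4)}
    torch.save(dump, path)
    # map_location="cpu" avoids failures on machines without the saving GPU;
    # weights_only=False permits unpickling non-tensor objects in the dict.
    loaded = torch.load(path, map_location="cpu", weights_only=False)
```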

Confidence Score: 2/5

  • Not safe to merge as-is: the PEP 604 union syntax will raise TypeError on Python 3.9, and the dot-to-underscore sanitization silently overwrites dumps for typical hierarchical PyTorch module names.
  • Two of the flagged issues are likely to cause real problems in practice: the Python 3.9 incompatibility is a hard import error that breaks the feature for any user on that Python version, and the filename collision for dot-separated layer names is a data-integrity issue for the primary use case (offline analysis of large models with hierarchical names like transformer.layer.0.attention). The feature's core logic is otherwise sound — the singleton lifecycle, snapshot correctness (detach+clone), and round-trip serialisation all work correctly.
  • transformer_engine/debug/features/dump_tensors.py — Python version compatibility fix and sanitization character choice needed before merge.

Important Files Changed

  • transformer_engine/debug/features/dump_tensors.py
    New DumpTensors feature with several open issues: PEP 604 union syntax breaks Python 3.9, incorrect dump_dict type annotation, inspect_tensor_enabled doesn't short-circuit when neither tensor type is enabled, and dot-to-underscore sanitization creates filename collision risk for PyTorch hierarchical layer names.
  • tests/pytorch/debug/test_log.py
    New DumpTensors sanity test looks solid: validates file creation, checks QuantizedTensor round-trip type preservation, uses torch.equal for exact comparison, and documents the weights_only=False requirement. Assertions happen inside the debug_session context, so the temp directory is valid at check time.
  • transformer_engine/debug/features/api.py
    Correct bug fix: pop(k) → pop(k, None) prevents KeyError when the new DumpTensors kwargs (quantizer, rowwise_quantized_tensor, columnwise_quantized_tensor) are absent from kwargs_copy for older features that don't accept them.
  • transformer_engine/debug/features/log_fp8_tensor_stats.py
    Minor import reordering moves import transformer_engine_torch as tex between two groups of from nvdlfw_inspect imports, breaking PEP 8 import grouping convention. No functional change.
  • docs/debug/3_api_features.rst
    Adds DumpTensors to the debug API documentation listing. Still missing a trailing newline at end of file.
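The api.py fix called out above reduces to a standard dict idiom. A minimal sketch, with placeholder kwargs:

```python
# Sketch of the api.py fix: dict.pop with a default returns None instead of
# raising KeyError when the new DumpTensors kwargs were never passed to an
# older feature that does not accept them.
kwargs_copy = {"tensor": "t", "iteration": 0}
for k in ("quantizer", "rowwise_quantized_tensor", "columnwise_quantized_tensor"):
    kwargs_copy.pop(k, None)  # kwargs_copy.pop(k) would raise KeyError here
```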

Sequence Diagram

sequenceDiagram
    participant Caller as Training Loop
    participant API as TransformerEngineAPI
    participant DT as DumpTensors
    participant TL as TensorLogger (singleton)
    participant FS as File System

    Caller->>API: inspect_tensor(layer_name, tensor_name, iteration, tensor, rowwise_qt, columnwise_qt)
    API->>DT: inspect_tensor_enabled(config, layer_name, tensor_name, iteration)
    DT-->>API: (run_current=True, next_iter)
    API->>DT: inspect_tensor(config, layer_name, tensor_name, ...)
    DT->>DT: validate rowwise == columnwise (or one is None)
    DT->>DT: pick quantized_tensor (rowwise ?? columnwise)
    DT->>TL: ensure_initialized(root_log_dir)
    TL->>FS: makedirs(tensor_dumps/rank_N/)
    DT->>DT: build dump_dict {high_precision, quantized}
    Note over DT: tensor.detach().clone()
    Note over DT: quantized_tensor.detach().clone()
    DT->>TL: save_tensor(dump_dict, layer_name, tensor_name, iteration)
    TL->>TL: _sanitize_name(layer_name), _sanitize_name(tensor_name)
    TL->>FS: makedirs(iter_{iteration:06d}/)
    TL->>FS: torch.save(dump_dict, layer_tensor.pt)
    DT->>API: log_message("Dumped ...")
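The file layout steps in the diagram (rank directory, zero-padded iteration directory, sanitized layer/tensor filename) can be sketched as a single path builder. This is an illustration of the layout only; the function name and argument order are hypothetical, not the feature's API.

```python
import os

# Hypothetical helper mirroring the path layout from the sequence diagram:
# <root>/tensor_dumps/rank_<N>/iter_<iteration:06d>/<layer>_<tensor>.pt
def dump_path(root: str, rank: int, iteration: int, layer: str, tensor: str) -> str:
    d = os.path.join(root, "tensor_dumps", f"rank_{rank}", f"iter_{iteration:06d}")
    return os.path.join(d, f"{layer}_{tensor}.pt")

path = dump_path("/logs", 0, 7, "layer1", "activation")
```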

Last reviewed commit: "[pre-commit.ci] auto..."

Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
pggPL and others added 4 commits March 5, 2026 10:57
Signed-off-by: root <pgadzinski@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
pggPL and others added 3 commits March 5, 2026 13:13
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Signed-off-by: root <pgadzinski@nvidia.com>
pggPL and others added 2 commits March 5, 2026 14:03
Signed-off-by: root <pgadzinski@nvidia.com>
@pggPL
Collaborator Author

pggPL commented Mar 5, 2026

/te-ci pytorch

pggPL and others added 2 commits March 10, 2026 11:51
Drop the dump_quantized_internals config option, the _get_quantized_internals
method, and all helper functions for extracting scales/raw data from
Float8Tensor, Float8BlockwiseQTensor, MXFP8Tensor, and NVFP4Tensor.

Remove corresponding tests: test_dump_tensors_nvfp4_unpacked_codes and
NVFP4_DUMP_TENSORS_CONFIG, and scale/data assertions from test_dump_tensors_sanity.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
- Add dot ('.') to _sanitize_name to handle common PyTorch dotted layer
  names like 'encoder.layer.0.attention'
- Add docstring note about pickle dependency for the 'quantized' key
- Add comment explaining weights_only=False in test
- Remove redundant local RecipeState import in test_nvfp4_numeric

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
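The dotted-name handling from this commit can be sketched in a few lines. This is a hypothetical stand-in for `_sanitize_name`, not the actual implementation; it follows the review's suggestion of using "-" as the replacement character, which avoids colliding with underscores already present in module names.

```python
import re

# Hypothetical sanitizer: replace any character outside [0-9A-Za-z_] with "-",
# so dotted PyTorch module names like "encoder.layer.0.attention" become
# filesystem-safe without colliding with underscore-separated names.
def sanitize_name(name: str) -> str:
    return re.sub(r"[^0-9A-Za-z_]", "-", name)
```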
pggPL and others added 2 commits March 10, 2026 12:08
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Avoids relying on stale self.rank when ensure_initialized is called
before initialize() has set the rank. Consistent with how nvdlfw_inspect
logger resolves rank.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
pggPL and others added 2 commits March 10, 2026 12:38
Detach both high_precision and quantized tensors before saving to avoid
serializing the autograd graph. For QuantizedTensor this is a zero-copy
view (make_like), so no extra GPU allocation.

Add filename format assertion to test_dump_tensors_sanity to catch
regressions in _sanitize_name or the naming convention.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
pggPL and others added 2 commits March 10, 2026 13:33
Log a message when no tensors are available to dump so the user
has an explicit signal that no file was written.

Assert that the quantized key round-trips as a QuantizedTensor
to catch regressions in detach() or serialisation path.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
@pggPL
Collaborator Author

pggPL commented Mar 10, 2026

/te-ci pytorch

Collaborator

@negvet negvet left a comment


Thanks for the contribution! Overall LGTM, minor comments

pggPL and others added 3 commits March 19, 2026 14:44
…st and MSE example

- Organize dumps into per-iteration subdirectories (iter_000000/) to keep
  file count manageable per directory.
- Remove unused self.rank attribute from TensorLogger.
- Add torch.allclose assertion in test to verify serialization correctness.
- Add docstring example showing how to load dumps and compute MSE.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Made-with: Cursor
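The load-and-compute-MSE workflow this commit documents can be sketched as follows. The tensors here are stand-ins (the second one plays the role of a dequantized QuantizedTensor); the real docstring example loads them from a dump file instead.

```python
import torch

# Sketch of the MSE computation from the docstring example, using stand-in
# tensors in place of a loaded dump's high-precision and dequantized values.
hp = torch.randn(8, 8)
quantized_approx = hp + 0.01 * torch.randn(8, 8)  # stand-in for dequantized data
mse = torch.mean((hp.float() - quantized_approx.float()) ** 2)
```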
Using tensor.detach() creates a view sharing the same underlying
storage. If any in-place operation modifies the tensor after the
dump, the saved data would be silently corrupted. Use .clone()
to ensure the dump captures an independent copy of the data.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
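The aliasing hazard this commit fixes is easy to reproduce in isolation:

```python
import torch

# Why the dump uses detach().clone(): a plain detach() is a view over the
# same storage, so an in-place update after the dump point silently changes
# the "snapshot". clone() captures an independent copy.
t = torch.ones(3)
view = t.detach()               # shares storage with t
snapshot = t.detach().clone()   # independent copy
t.add_(1.0)                     # in-place update after the dump point
```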
…nd-trip

The saved tensor is an exact bit-for-bit copy (detach().clone()), so
torch.equal is the correct check. torch.allclose with its default
tolerances could mask a genuine dtype conversion or precision loss
introduced by a future change to the serialisation path.

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
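The masking effect described above can be shown with a lossy dtype round trip:

```python
import torch

# torch.allclose with default tolerances passes a lossy bf16 round trip
# that torch.equal correctly rejects, motivating the exact comparison.
a = torch.tensor([1.0000001, 2.0], dtype=torch.float32)
b = a.to(torch.bfloat16).to(torch.float32)  # rounds 1.0000001 to 1.0
close = torch.allclose(a, b)  # passes: diff is within default tolerances
exact = torch.equal(a, b)     # fails: the values are no longer identical
```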
pggPL and others added 2 commits March 20, 2026 13:10
…ll_feature backward compat pop

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
@pggPL
Collaborator Author

pggPL commented Mar 20, 2026

/te-ci pytorch
