Conversation
for more information, see https://pre-commit.ci
Signed-off-by: root <pgadzinski@nvidia.com>
Greptile Summary: This PR introduces the `DumpTensors` debug feature for offline analysis of tensors. The implementation has been significantly refined compared to earlier drafts; the previous complex internals extraction logic for quantized tensor types has been removed. Key remaining issues:
Confidence Score: 2/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
    participant Caller as Training Loop
    participant API as TransformerEngineAPI
    participant DT as DumpTensors
    participant TL as TensorLogger (singleton)
    participant FS as File System
    Caller->>API: inspect_tensor(layer_name, tensor_name, iteration, tensor, rowwise_qt, columnwise_qt)
    API->>DT: inspect_tensor_enabled(config, layer_name, tensor_name, iteration)
    DT-->>API: (run_current=True, next_iter)
    API->>DT: inspect_tensor(config, layer_name, tensor_name, ...)
    DT->>DT: validate rowwise == columnwise (or one is None)
    DT->>DT: pick quantized_tensor (rowwise ?? columnwise)
    DT->>TL: ensure_initialized(root_log_dir)
    TL->>FS: makedirs(tensor_dumps/rank_N/)
    DT->>DT: build dump_dict {high_precision, quantized}
    Note over DT: tensor.detach().clone()
    Note over DT: quantized_tensor.detach().clone()
    DT->>TL: save_tensor(dump_dict, layer_name, tensor_name, iteration)
    TL->>TL: _sanitize_name(layer_name), _sanitize_name(tensor_name)
    TL->>FS: makedirs(iter_{iteration:06d}/)
    TL->>FS: torch.save(dump_dict, layer_tensor.pt)
    DT->>API: log_message("Dumped ...")
```
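The save path in the diagram can be sketched as follows. This is an illustrative reconstruction, not the actual implementation: `save_dump` is a hypothetical helper, and only the directory layout (`tensor_dumps/rank_N/iter_{iteration:06d}/`) and the `{high_precision, quantized}` dump dict are taken from the diagram.

```python
import os
import tempfile

import torch


def save_dump(root_log_dir, dump_dict, layer_name, tensor_name, iteration, rank=0):
    # Directory layout from the diagram: tensor_dumps/rank_N/iter_{iteration:06d}/
    dump_dir = os.path.join(
        root_log_dir, "tensor_dumps", f"rank_{rank}", f"iter_{iteration:06d}"
    )
    os.makedirs(dump_dir, exist_ok=True)
    # One file per (layer, tensor) pair inside the iteration directory.
    path = os.path.join(dump_dir, f"{layer_name}_{tensor_name}.pt")
    torch.save(dump_dict, path)
    return path


root = tempfile.mkdtemp()
t = torch.randn(4, 4)
# Detach and clone before saving, as the diagram notes, so the dump is
# independent of the autograd graph and of later in-place updates.
path = save_dump(
    root,
    {"high_precision": t.detach().clone(), "quantized": None},
    "layer0",
    "activation",
    iteration=3,
)
```

The dump can then be reloaded with `torch.load` for offline inspection.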
Last reviewed commit: "[pre-commit.ci] auto..."
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
/te-ci pytorch
Drop the dump_quantized_internals config option, the _get_quantized_internals method, and all helper functions for extracting scales/raw data from Float8Tensor, Float8BlockwiseQTensor, MXFP8Tensor, and NVFP4Tensor. Remove corresponding tests: test_dump_tensors_nvfp4_unpacked_codes and NVFP4_DUMP_TENSORS_CONFIG, and scale/data assertions from test_dump_tensors_sanity. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
- Add dot ('.') to _sanitize_name to handle common PyTorch dotted layer
names like 'encoder.layer.0.attention'
- Add docstring note about pickle dependency for the 'quantized' key
- Add comment explaining weights_only=False in test
- Remove redundant local RecipeState import in test_nvfp4_numeric
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
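The first bullet above can be illustrated with a regex-based sanitizer. This is a hypothetical sketch: the real `_sanitize_name` may use a different character set, but the point is that `.` must be in the replaced set for dotted PyTorch module paths to produce valid file names.

```python
import re


def sanitize_name(name: str) -> str:
    # Hypothetical: map any character outside [A-Za-z0-9_-], including '.',
    # to '_' so the result is safe to use as a file-name component.
    return re.sub(r"[^0-9A-Za-z_-]", "_", name)


print(sanitize_name("encoder.layer.0.attention"))  # encoder_layer_0_attention
```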
Avoids relying on stale self.rank when ensure_initialized is called before initialize() has set the rank. Consistent with how nvdlfw_inspect logger resolves rank. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Detach both high_precision and quantized tensors before saving to avoid serializing the autograd graph. For QuantizedTensor this is a zero-copy view (make_like), so no extra GPU allocation. Add filename format assertion to test_dump_tensors_sanity to catch regressions in _sanitize_name or the naming convention. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Log a message when no tensors are available to dump so the user has an explicit signal that no file was written. Assert that the quantized key round-trips as a QuantizedTensor to catch regressions in detach() or serialisation path. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
/te-ci pytorch
negvet left a comment:
Thanks for the contribution! Overall LGTM, minor comments
…st and MSE example
- Organize dumps into per-iteration subdirectories (iter_000000/) to keep file count manageable per directory.
- Remove unused self.rank attribute from TensorLogger.
- Add torch.allclose assertion in test to verify serialization correctness.
- Add docstring example showing how to load dumps and compute MSE.
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> Made-with: Cursor
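The MSE computation mentioned in the last bullet presumably looks something like this. The dump path and dict keys follow this PR's format; the tensors here are fabricated in-memory so the snippet is self-contained, and `dequantize()` is an assumption about the quantized tensor's API.

```python
import torch

# Stand-ins for the two copies a dump would contain; in practice you would load
# them from a dump file instead (see the commented lines below).
reference = torch.randn(8, 8)
dequantized = reference + 0.01 * torch.randn(8, 8)

# dump = torch.load("tensor_dumps/rank_0/iter_000000/<layer>_<tensor>.pt",
#                   weights_only=False)
# reference, dequantized = dump["high_precision"], dump["quantized"].dequantize()

# Mean squared error between the high-precision tensor and its quantized copy.
mse = torch.mean((reference - dequantized) ** 2).item()
print(f"MSE: {mse:.6e}")
```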
Using tensor.detach() creates a view sharing the same underlying storage. If any in-place operation modifies the tensor after the dump, the saved data would be silently corrupted. Use .clone() to ensure the dump captures an independent copy of the data. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
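The aliasing hazard described above is easy to demonstrate: a detached view shares storage with the original tensor, so an in-place update after the "dump" silently changes it, while a clone keeps the dumped value.

```python
import torch

t = torch.ones(3)
view = t.detach()          # view: shares storage with t
copy = t.detach().clone()  # independent copy of the data

t.add_(1.0)                # in-place update after the "dump"

print(view)  # tensor([2., 2., 2.]) -- silently corrupted
print(copy)  # tensor([1., 1., 1.]) -- still the dumped value
```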
…nd-trip The saved tensor is an exact bit-for-bit copy (detach().clone()), so torch.equal is the correct check. torch.allclose with its default tolerances could mask a genuine dtype conversion or precision loss introduced by a future change to the serialisation path. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
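As a quick illustration of why `torch.equal` is the stricter check: a perturbation well inside `torch.allclose`'s default tolerances (rtol=1e-5, atol=1e-8) still fails the bit-for-bit comparison.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0])
b = a + 1e-6  # tiny drift, e.g. an accidental dtype round-trip

print(torch.allclose(a, b))  # True  -- default tolerances hide the drift
print(torch.equal(a, b))     # False -- exact comparison catches it
```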
…ll_feature backward compat pop Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
/te-ci pytorch
Description
This PR introduces a new debug feature focused on offline analysis of tensors.
The motivation is to make it easier to inspect and analyze intermediate tensors outside of runtime, especially during quantization debugging.
The new `DumpTensors` feature allows saving, for selected layers and iterations:
- the high-precision tensor, and
- the corresponding quantized tensor (rowwise or columnwise).
Type of change
Changes
Checklist