
Commit a43031e

tohtana and stas00 authored
Add comment explaining nesting torch.autocast (#1000)
* Add comment explaining outer torch.autocast in bf16_master_weight example

  The outer autocast covers loss_fn, which runs outside engine.forward().
  The nested autocast on the model forward is harmless.

  Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

* Update training/bf16_master_weight/train.py

  Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Co-authored-by: Stas Bekman <stas00@users.noreply.github.com>
1 parent ece52bc commit a43031e

1 file changed: 5 additions & 1 deletion

training/bf16_master_weight/train.py
```diff
@@ -292,7 +292,11 @@ def main():
     input_ids = torch.randint(0, actual_vocab_size, (args.batch_size, args.seq_length), device=device)
     labels = torch.randint(0, actual_vocab_size, (args.batch_size, args.seq_length), device=device)

-    # Forward pass with optional autocast
+    # Forward pass with an optional autocast.
+    # DeepSpeed already applies torch.autocast inside engine.forward(), but
+    # we wrap the entire forward+loss block so that loss_fn also runs under
+    # autocast. The nested autocast on engine.forward() is harmless —
+    # PyTorch's torch.autocast is idempotent when nested with the same dtype.
     if use_autocast:
         with torch.autocast(device_type="cuda", dtype=autocast_dtype):
             logits = model_engine(input_ids)
```
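For context, here is a minimal sketch (plain PyTorch, no DeepSpeed; the toy model, shapes, and data below are made up for illustration) of the behavior the new comment describes: nesting torch.autocast regions with the same dtype acts like a single region, so an outer context that also covers the loss does not conflict with an inner one around the model's forward.

```python
import torch
import torch.nn as nn

# Hypothetical toy model and data, for illustration only.
model = nn.Linear(16, 4).cuda()
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(8, 16, device="cuda")
target = torch.randint(0, 4, (8,), device="cuda")

# Outer autocast covers both the forward pass and the loss computation,
# mirroring the train.py pattern where loss_fn runs outside engine.forward().
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    # Inner autocast with the same dtype is redundant but harmless,
    # analogous to the one DeepSpeed applies inside engine.forward().
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        logits = model(x)
    loss = loss_fn(logits, target)

print(logits.dtype)  # torch.bfloat16, same as with a single autocast region
```

Note that autocast decides the compute dtype per operation: the linear layer runs in bfloat16, while the cross-entropy loss is still computed in float32 under CUDA autocast, with or without the nesting.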
