What happened?

On multiple nodes, setting "with_fsdp: false" raises the following error:

Traceback (most recent call last):
File "/users/jkuehner/CODE/WeatherGenerator/src/weathergen/run_train.py", line 199, in run_train
trainer.run(cf, devices)
File "/users/jkuehner/CODE/WeatherGenerator/src/weathergen/train/trainer.py", line 368, in run
self.train(mini_epoch)
File "/users/jkuehner/CODE/WeatherGenerator/src/weathergen/train/trainer.py", line 472, in train
self.grad_scaler.scale(loss).backward()
File "/users/jkuehner/CODE/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
torch.autograd.backward(
File "/users/jkuehner/CODE/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
_engine_run_backward(
File "/users/jkuehner/CODE/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
Parameter at index 5 with name forecast_engine.net.fe_blocks.0.layers.4.bias has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
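For context on the workaround that the error message itself suggests: the internal _set_static_graph() call corresponds to the public static_graph=True argument of DistributedDataParallel. Below is a minimal, self-contained sketch of enabling it in a generic single-process DDP setup (CPU, gloo backend); it is purely illustrative and not WeatherGenerator code.

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Single-process process group so the sketch runs standalone (CPU, gloo).
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

model = torch.nn.Linear(8, 8)
# static_graph=True declares that the set of used parameters and the autograd
# graph do not change across iterations; DDP then tolerates a parameter's
# autograd hook firing more than once (e.g. under activation checkpointing).
ddp_model = DDP(model, static_graph=True)

out = ddp_model(torch.randn(4, 8))
out.sum().backward()
dist.destroy_process_group()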
File "/users/jkuehner/CODE/WeatherGenerator/src/weathergen/run_train.py", line 199, in run_train
trainer.run(cf, devices)
File "/users/jkuehner/CODE/WeatherGenerator/src/weathergen/train/trainer.py", line 368, in run
self.train(mini_epoch)
File "/users/jkuehner/CODE/WeatherGenerator/src/weathergen/train/trainer.py", line 472, in train
self.grad_scaler.scale(loss).backward()
File "/users/jkuehner/CODE/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/_tensor.py", line 626, in backward
torch.autograd.backward(
File "/users/jkuehner/CODE/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/autograd/__init__.py", line 347, in backward
_engine_run_backward(
File "/users/jkuehner/CODE/WeatherGenerator/.venv/lib/python3.12/site-packages/torch/autograd/graph.py", line 823, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
What are the steps to reproduce the bug?

git checkout mk/mh/diffusion-single-sample

In config/config_diffusion.yml, set with_fsdp: False and validate_before_training: False, then run:

uv run --offline torchrun --nproc_per_node 4 src/weathergen/run_train.py train --base-config config/config_diffusion.yml
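As an optional convenience (purely illustrative, not part of the repository), the two overrides can also be applied programmatically; this assumes PyYAML is installed and that both keys sit at the top level of config/config_diffusion.yml.

import yaml

path = "config/config_diffusion.yml"
with open(path) as f:
    cfg = yaml.safe_load(f)

# The two settings used to reproduce the failure (top-level keys assumed).
cfg["with_fsdp"] = False
cfg["validate_before_training"] = False

with open(path, "w") as f:
    yaml.safe_dump(cfg, f, sort_keys=False)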
Hedgedoc link to logs and more information. This ticket is public, do not attach files directly.

No response