(pid=gcs_server) [2025-12-30 16:28:52,683 E 88920 88920] (gcs_server) gcs_server.cc:303: Failed to establish connection to the event+metrics exporter agent. Events and metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(raylet) [2025-12-30 16:28:55,156 E 89285 89285] (raylet) main.cc:1032: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(pid=89391) [2025-12-30 16:28:56,819 E 89391 89782] core_worker_process.cc:842: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
[2025-12-30 16:28:57,010 E 88542 89388] core_worker_process.cc:842: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14
(main_task pid=90685) Detected robot platform from environment: ALOHA
(main_task pid=90685) Using ALOHA constants:
(main_task pid=90685) NUM_ACTIONS_CHUNK = 25
(main_task pid=90685) No CUDA runtime is found, using CUDA_HOME='/usr/local/cuda'
(main_task pid=90685) dataset len: 1000
(main_task pid=90685) dataset len: 256
(main_task pid=90685) Size of train dataloader: 15
(main_task pid=90685) Size of val dataloader: 32
(pid=92031) Detected robot platform from environment: ALOHA
(pid=92031) Using ALOHA constants:
(pid=92031) NUM_ACTIONS_CHUNK = 25
(raylet) A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Lease ID: Worker ID: Node ID: Worker IP address: <my ip> Worker port: <my port> Worker PID: <my pid> Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. Some common causes include: (1) the process was killed by the OOM killer due to high memory usage, (2) ray stop --force was called, or (3) the worker crashed unexpectedly due to SIGSEGV or another unexpected error.
(bundle_reservation_check_func pid=91825) [2025-12-30 16:29:33,885 E 91825 91969] core_worker_process.cc:842: Failed to establish connection to the metrics exporter agent. Metrics will not be exported. Exporter agent status: RpcError: Running out of retries to initialize the metrics agent. rpc_code: 14 [repeated 20x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
Error executing job with overrides: ['data.task_suite_name=robotwin2_place_container_plate', 'data.num_trials_per_task=1000', 'data.n_samples=8', 'data.filter_accuracy=True', 'data.accuracy_lower_bound=0.1', 'data.accuracy_upper_bound=0.9', 'data.oversample_factor=1', 'data.train_batch_size=64', 'data.val_batch_size=8', 'data.max_prompt_length=256', 'data.max_response_length=128', 'actor_rollout_ref.model.path=/home/myname/xlk/SimpleVLA-RL/checkpoints/robotwin2_model', 'actor_rollout_ref.model.vla=openvla-oft', 'actor_rollout_ref.actor.ppo_micro_batch_size=2', 'actor_rollout_ref.actor.fsdp_config.param_offload=False', 'actor_rollout_ref.actor.fsdp_config.grad_offload=False', 'actor_rollout_ref.actor.fsdp_config.optimizer_offload=True', 'actor_rollout_ref.rollout.gpu_memory_utilization=0.4', 'actor_rollout_ref.rollout.val_micro_batch_size=2', 'actor_rollout_ref.rollout.log_prob_micro_batch_size=16', 'actor_rollout_ref.ref.log_prob_micro_batch_size=16', 'actor_rollout_ref.ref.fsdp_config.param_offload=True', 'trainer.logger=[console,wandb]', 'trainer.project_name=SimpleVLA-RL', 'trainer.experiment_name=robotwin2_place_container_plate_seed1k_sft_aloha_25chunks_10k_eval', 'trainer.default_local_dir=/home/myname/xlk/SimpleVLA-RL/results/SimpleVLA-RL/robotwin2_place_container_plate_seed1k_sft_aloha_25chunks_10k_eval', 'trainer.n_gpus_per_node=2', 'trainer.nnodes=1', 'trainer.val_only=True', 'algorithm.adv_estimator=grpo', 'trainer.runtime_env=/home/myname/xlk/SimpleVLA-RL/align.json', 'trainer.wandb_mode=online', 'trainer.val_before_train=True']
Traceback (most recent call last):
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/myname/xlk/SimpleVLA-RL/verl/trainer/main_ppo.py", line 212, in <module>
main()
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
_run_hydra(
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
_run_app(
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
run_and_report(
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
raise ex
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
return func()
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
lambda: hydra.run(
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
_ = ret.return_value
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
raise self._return_value
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
ret.return_value = task_function(task_cfg)
File "/home/myname/xlk/SimpleVLA-RL/verl/trainer/main_ppo.py", line 116, in main
ray.get(main_task.remote(config))
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/ray/_private/worker.py", line 2967, in get
values, debugger_breakpoint = worker.get_objects(
File "/home/myname/.conda/envs/xlk_simplevla/lib/python3.10/site-packages/ray/_private/worker.py", line 1015, in get_objects
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(AssertionError): ray::main_task() (pid=90685, ip=10.244.43.63)
File "/home/myname/xlk/SimpleVLA-RL/verl/trainer/main_ppo.py", line 207, in main_task
trainer.init_workers()
File "/home/myname/xlk/SimpleVLA-RL/verl/trainer/ppo/ray_trainer.py", line 447, in init_workers
wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
File "/home/myname/xlk/SimpleVLA-RL/verl/single_controller/ray/base.py", line 197, in __init__
self._init_with_resource_pool(resource_pool=resource_pool,
File "/home/myname/xlk/SimpleVLA-RL/verl/single_controller/ray/base.py", line 274, in _init_with_resource_pool
assert register_center_actor is not None, f"failed to get register_center_actor: {self.name_prefix}_register_center in {list_named_actors(all_namespaces=True)}"
AssertionError: failed to get register_center_actor: J3cCBZ_register_center in []
(raylet) A worker died or was killed while executing a task by an unexpected system error.
Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file.
Possible causes: (1) OOM killer, (2) ray stop --force, (3) SIGSEGV.
ray.exceptions.RayTaskError(AssertionError): ray::main_task() (pid=5904, ip=10.244.82.123)
File "verl/trainer/main_ppo.py", line 207, in main_task
trainer.init_workers()
File "verl/trainer/ppo/ray_trainer.py", line 447, in init_workers
wg_dict = self.ray_worker_group_cls(resource_pool=resource_pool, ray_cls_with_init=worker_dict_cls)
...
File "verl/single_controller/ray/base.py", line 274, in _init_with_resource_pool
assert register_center_actor is not None, f"failed to get register_center_actor: {self.name_prefix}_register_center in {list_named_actors(all_namespaces=True)}"
AssertionError: failed to get register_center_actor: GfZJZT_register_center in []
Description
I encountered a persistent crash when running the evaluation script bash examples/run_openvla_oft_rl_twin2.sh for OpenVLA-OFT on the Robotwin platform. The trainer fails during worker initialization because a Ray worker process exits unexpectedly, causing the register_center_actor to be missing from the registry.
Environment
torch: 2.4.0
cuda: 12.2
tensorflow: 2.15.0
verl: 0.2.0.post2
ray: 2.52.1
Reproduction Script
bash examples/run_openvla_oft_rl_twin2.sh
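In case it helps anyone compare their setup against the Environment list above, here is a minimal sketch that prints the installed versions using only the standard library. It assumes the pip distribution names match the names used in this report (torch, tensorflow, verl, ray). The CUDA version is not a Python package, so it is not covered here.

```python
from importlib import metadata

# Assumption: these pip distribution names match the packages listed in this report.
PACKAGES = ["torch", "tensorflow", "verl", "ray"]


def collect_versions(packages=PACKAGES):
    """Return a dict mapping each package name to its version or 'not installed'."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is missing from the active environment.
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version}")
```

Running this inside the same conda environment that launches the script should show whether the versions differ from the ones reported here.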
Full Error Logs
Error Summary
The execution fails during trainer.init_workers(). The root cause is that a Ray worker dies unexpectedly, which prevents the register_center_actor from being found.
1. Ray Worker Exit:
2. Resulting Traceback:
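For triage, the exit type and exit detail can be pulled out of the raylet message programmatically. The following is a rough sketch, assuming the message keeps the "Worker exit type: ..." and "Worker exit detail: ..." wording seen in the logs in this report. The function name parse_worker_exit is made up for illustration and is not part of Ray.

```python
import re

# Patterns are based on the raylet message wording in this report (an assumption
# about the format, not a documented Ray log schema).
EXIT_TYPE_RE = re.compile(r"Worker exit type: (\S+)")
EXIT_DETAIL_RE = re.compile(
    r"Worker exit detail: (.+?)(?= Some common causes|$)", re.MULTILINE
)


def parse_worker_exit(text):
    """Extract the worker exit type and detail from raylet log text, if present."""
    exit_type = EXIT_TYPE_RE.search(text)
    exit_detail = EXIT_DETAIL_RE.search(text)
    return {
        "exit_type": exit_type.group(1) if exit_type else None,
        "exit_detail": exit_detail.group(1).strip() if exit_detail else None,
    }


if __name__ == "__main__":
    sample = (
        "(raylet) A worker died or was killed while executing a task by an "
        "unexpected system error. Worker exit type: SYSTEM_ERROR Worker exit "
        "detail: Worker unexpectedly exits with a connection error code 2. "
        "End of file. Some common causes include: (1) the process was killed "
        "by the OOM killer due to high memory usage"
    )
    print(parse_worker_exit(sample))
```

This only summarizes what the raylet already reported. It does not tell which of the listed causes (OOM killer, ray stop --force, SIGSEGV) applies, so the dead worker's own log still needs to be checked.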
Request for Assistance
I have spent considerable time trying to debug this issue by adjusting GPU memory utilization, batch sizes, and FSDP offloading settings, but the worker crash during initialization remains persistent.
Since this is happening during the provided evaluation script on a standard 8-GPU setup, I would greatly appreciate it if you could provide some guidance or suggestions on how to resolve this.
Thank you very much for your time and for this project!