Hello,
I’m testing whether a Ray cluster can return the files of trials that run on a worker node. Files such as the TensorBoard event files are returned to the source (head) node, but the checkpoint folders are not. Reference case: if I run the same script without specifying an address, i.e. plain ray.init(), I get all the files, including the checkpoints.
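For context, the reference case looks roughly like the sketch below. The CartPole environment and the exact config/stop values are placeholders standing in for my actual environment and settings:

```python
import ray
from ray import tune

# Single-node reference run: no cluster address, so the driver, the trials
# and ~/ray_results/ are all on the same machine and every file shows up.
ray.init()

tune.run(
    "A3C",
    name="test-local",
    config={"env": "CartPole-v0", "lr": 0.01},  # placeholder config
    stop={"training_iteration": 20},            # placeholder stopping rule
    checkpoint_at_end=True,
)
```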
My Ray cluster is set up manually.
Head:
ray start --head --port=6379 --num-cpus=0
Worker:
ray start --address='<IP>:6379' --redis-password='<PASS>'
I run my script from the head node.
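Inside the script I connect to the already-running cluster roughly like this (a sketch; "auto" stands in for however the address is actually passed, e.g. the explicit '<IP>:6379'):

```python
import ray

# Attach to the cluster started with `ray start --head ...`
# instead of launching a new local Ray instance.
ray.init(address="auto")
```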
When the trials are running I can see them on the dashboard. All trials run on the worker because the head was started with --num-cpus=0.
The checkpoint folder for every trial is left behind on the worker node, inside ~/ray_results/. Every trial shows the following error message on the head node:
2021-04-10 15:26:32,590 ERROR trial_runner.py:783 -- Trial A3C_MyAgentEnv_d1678_00000: Error handling checkpoint /home/rick.lan/ray_results/test-ray-cluster/A3C_MyAgentEnv_d1678_00000_0_lr=0.01_2021-04-10_15-24-18/checkpoint_20/checkpoint-20
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 778, in _process_trial_save
    checkpoint=trial.saving_to)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/callback.py", line 204, in on_checkpoint
    callback.on_checkpoint(**info)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/syncer.py", line 455, in on_checkpoint
    self._sync_trial_checkpoint(trial, checkpoint)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/syncer.py", line 430, in _sync_trial_checkpoint
    trial, checkpoint.value))
ray.tune.error.TuneError: Trial A3C_MyAgentEnv_d1678_00000: Checkpoint path /home/rick.lan/ray_results/test-ray-cluster/A3C_MyAgentEnv_d1678_00000_0_lr=0.01_2021-04-10_15-24-18/checkpoint_20/checkpoint-20 not found after successful sync down.
Using Ray 1.2.0.
I’m calling Ray Tune like this:
results = tune.run(
    args.run,
    name=args.name,
    config=config,
    stop=stop,
    checkpoint_at_end=True,
)
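I’m not overriding local_dir, so results go to the default ~/ray_results on whichever node a trial runs. Spelled out explicitly, the same call would look like this sketch (args.run, config and stop are as in the snippet above; the path is just the default written out):

```python
results = tune.run(
    args.run,
    name=args.name,
    config=config,
    stop=stop,
    checkpoint_at_end=True,
    local_dir="~/ray_results",  # default location; trials on the worker write here locally
)
```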
Edit: code highlighting