Using Ray Tune in a Ray Cluster, checkpoints not synced back to source node

Hello,

I’m testing whether a Ray Cluster returns the files of trials that run on a Worker node. Most files (e.g. the TensorBoard event files) do come back to the source node, but the checkpoint folders do not. Reference case: if I run the script without specifying an address, i.e. plain ray.init(), I get all the files, including the checkpoints.

My Ray Cluster is set up manually.
Head:

ray start --head --port=6379 --num-cpus=0

Worker:

ray start --address='<IP>:6379' --redis-password='<PASS>'

I run my script from the Head node.
While the trials are running I can see them on the dashboard. All trials run on the Worker because the Head was started with --num-cpus=0.

The checkpoint folder for every trial is left behind on the Worker node, inside ~/ray_results/. Every trial produces the following error message on the Head node:

2021-04-10 15:26:32,590 ERROR trial_runner.py:783 -- Trial A3C_MyAgentEnv_d1678_00000: Error handling checkpoint /home/rick.lan/ray_results/test-ray-cluster/A3C_MyAgentEnv_d1678_00000_0_lr=0.01_2021-04-10_15-24-18/checkpoint_20/checkpoint-20
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 778, in _process_trial_save
    checkpoint=trial.saving_to)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/callback.py", line 204, in on_checkpoint
    callback.on_checkpoint(**info)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/syncer.py", line 455, in on_checkpoint
    self._sync_trial_checkpoint(trial, checkpoint)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/syncer.py", line 430, in _sync_trial_checkpoint
    trial, checkpoint.value))
ray.tune.error.TuneError: Trial A3C_MyAgentEnv_d1678_00000: Checkpoint path /home/rick.lan/ray_results/test-ray-cluster/A3C_MyAgentEnv_d1678_00000_0_lr=0.01_2021-04-10_15-24-18/checkpoint_20/checkpoint-20 not found after successful sync down.

Using Ray 1.2.0.

I’m calling Ray Tune like this:

results = tune.run(
  args.run,
  name=args.name,
  config=config,
  stop=stop,
  checkpoint_at_end=True,
)



I am able to duplicate both the good and bad behavior using the following code from the documentation:

https://docs.ray.io/en/releases-1.2.0/tune/examples/mnist_pytorch.html?highlight=mnist_pytorch#mnist-pytorch

I added checkpoint_at_end=True to the tune.run call:

analysis = tune.run(
    train_mnist,
    # ... other arguments unchanged from the docs example ...
    config={
        "lr": tune.loguniform(1e-4, 1e-2),
        "momentum": tune.uniform(0.1, 0.9),
    },
    checkpoint_at_end=True,  # <----- My change
)

print("Best config is:", analysis.best_config)

Run command:

python ./script.py --ray-address='auto'
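
For context, the docs script only attaches to the already-running cluster when --ray-address is passed; paraphrasing the example (not a verbatim copy, with the argument parsing reduced to that one flag):

import argparse

import ray

parser = argparse.ArgumentParser()
parser.add_argument("--ray-address", type=str, default=None)
args = parser.parse_args()

if args.ray_address:
    # Attach to the cluster started with `ray start` (the failing case).
    ray.init(address=args.ray_address)
else:
    # Start a fresh local Ray instance (the working reference case).
    ray.init()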

Error messages:

2021-04-10 16:34:40,000 ERROR trial_runner.py:783 -- Trial train_mnist_8d669_00021: Error handling checkpoint /home/rick.lan/ray_results/exp/train_mnist_8d669_00021_21_lr=0.0040155,momentum=0.86611_2021-04-10_16-34-18/checkpoint_-1/
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 778, in _process_trial_save
    checkpoint=trial.saving_to)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/callback.py", line 204, in on_checkpoint
    callback.on_checkpoint(**info)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/syncer.py", line 455, in on_checkpoint
    self._sync_trial_checkpoint(trial, checkpoint)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/syncer.py", line 430, in _sync_trial_checkpoint
    trial, checkpoint.value))
ray.tune.error.TuneError: Trial train_mnist_8d669_00021: Checkpoint path /home/rick.lan/ray_results/exp/train_mnist_8d669_00021_21_lr=0.0040155,momentum=0.86611_2021-04-10_16-34-18/checkpoint_-1/ not found after successful sync down.

As a side note, if I run the script locally, the checkpoint_-1 folders are empty. I think that is expected.

What happens if you disable checkpoint_at_end?

Does your training script use tune.checkpoint_dir inside?

With checkpoint_at_end=False, neither script shows the error messages. Of course, the checkpoint files aren’t there either. As before, the usual files do come back:

$ ls
events.out.tfevents.1618109586  params.json  params.pkl  progress.csv  result.json

No.

I did notice that on the Worker the trial folder names are all shown with quotes, but on the source node they are not.

'train_mnist_fa29f_00045_45_lr=0.0076456,momentum=0.32658_2021-04-11_02-46-12'
'train_mnist_fa29f_00046_46_lr=0.0025491,momentum=0.17656_2021-04-11_02-46-12'
'train_mnist_fa29f_00047_47_lr=0.00011224,momentum=0.76445_2021-04-11_02-46-13'
'train_mnist_fa29f_00048_48_lr=0.00096124,momentum=0.74367_2021-04-11_02-46-13'
'train_mnist_fa29f_00049_49_lr=0.00010617,momentum=0.11767_2021-04-11_02-46-13'

Could it be that the path joining for the checkpoint folders is getting messed up somewhere?

I’m running Debian 9 and Python 3.7.8.

@rliaw I need the checkpoints transferred back to the head node. What workarounds could I try? Thanks!

Can you please try adding checkpointing support as shown here:

https://docs.ray.io/en/master/tune/examples/pbt_convnet_function_example.html

(ignore the usage of the scheduler).
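
The relevant pattern from that example, condensed into a minimal sketch (the real ConvNet model is replaced with a tiny placeholder; the train_convnet name and the checkpoint_dir handling follow the linked example):

import os

import torch
import torch.nn as nn
from ray import tune


def train_convnet(config, checkpoint_dir=None):
    # Placeholder model/optimizer instead of the ConvNet from the example.
    model = nn.Linear(8, 2)
    optimizer = torch.optim.SGD(model.parameters(), lr=config["lr"])

    step = 0
    if checkpoint_dir:
        # Restore state if Tune restarts the trial from a checkpoint.
        state = torch.load(os.path.join(checkpoint_dir, "checkpoint"))
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        step = state["step"] + 1

    while True:
        # ... one real training step would go here ...
        if step % 5 == 0:
            # This call creates the checkpoint_<step> folder that Tune
            # is supposed to sync back to the driver node.
            with tune.checkpoint_dir(step=step) as ckpt_dir:
                torch.save(
                    {"model": model.state_dict(),
                     "optimizer": optimizer.state_dict(),
                     "step": step},
                    os.path.join(ckpt_dir, "checkpoint"))
        tune.report(mean_accuracy=0.0)  # replace with a real metric
        step += 1


# analysis = tune.run(train_convnet, config={"lr": 0.01},
#                     stop={"training_iteration": 20})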

Thank you @rliaw for referring to an example. That example uses a custom(?) trainable, whereas I’m using the plain A3C trainer from RLlib:

results = tune.run(
  args.run,       # is "A3C"
  name=args.name, # is "experiment_A3C"
  config=config,
  stop=stop,
  checkpoint_at_end=True,
)
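
For what it’s worth, with a built-in RLlib trainable like "A3C", checkpointing is requested purely through tune.run arguments; a periodic-checkpoint variant of the call above would look roughly like this (a sketch reusing the existing args/config/stop; it does not by itself change the syncing behavior discussed here):

results = tune.run(
  args.run,               # "A3C"
  name=args.name,         # "experiment_A3C"
  config=config,
  stop=stop,
  checkpoint_freq=5,      # also save a checkpoint every 5 training iterations
  checkpoint_at_end=True,
)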

@sven1977 Do you have any insights? Thanks!

Is it possible that there is no SSH access across your different nodes?

Could you save the checkpoint files to NFS instead?

There is SSH access in both directions: same user name, authenticating with RSA keys.

Or will what I’m trying to do simply not work without NFS?

Well, it should work without NFS. However, it seems like there’s some issue with syncing the checkpoints back to the head node.

This may be because you’ve started the Ray nodes manually (as opposed to using the Ray cluster launcher). Your options, I think, are:

  1. use NFS or S3 (via a durable trainable)
  2. set up your cluster with the Ray cluster launcher
  3. implement your own syncing function (Execution (tune.run, tune.Experiment) — Ray v2.0.0.dev0); see the sketch below
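
For option 3, the syncing behavior can be overridden through tune.SyncConfig passed to tune.run. A rough sketch (double-check the exact SyncConfig fields against the 1.2 API reference; the rsync command shown is only an illustration, and config/stop are assumed to be defined as in your script):

from ray import tune

# "{source}" and "{target}" are filled in by Tune with the remote trial dir
# and the local destination on the driver node.
sync_config = tune.SyncConfig(
    sync_to_driver="rsync -avz -e ssh {source} {target}",
)

results = tune.run(
    "A3C",
    name="experiment_A3C",
    config=config,
    stop=stop,
    checkpoint_at_end=True,
    sync_config=sync_config,
)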

Let me know if you have any other questions.