Hello,
I’m testing whether a Ray cluster can return the files of trials that run on a worker node. Files such as the TensorBoard event files are returned to the source (head) node, but the checkpoint folders are not. Reference case: if I run the same script without specifying an address, i.e. plain ray.init(), I get all the files, including the checkpoints.
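For context, the reference case looks roughly like the sketch below. The CartPole environment and the exact config/stop values are placeholders standing in for my actual environment and settings:

```python
import ray
from ray import tune

# Single-node reference run: no cluster address, so the driver, the trials
# and ~/ray_results/ are all on the same machine and every file shows up.
ray.init()

tune.run(
    "A3C",
    name="test-local",
    config={"env": "CartPole-v0", "lr": 0.01},  # placeholder config
    stop={"training_iteration": 20},            # placeholder stopping rule
    checkpoint_at_end=True,
)
```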
My Ray cluster is set up manually.
Head:
ray start --head --port=6379 --num-cpus=0
Worker:
ray start --address='<IP>:6379' --redis-password='<PASS>'
I run my script from the head node.
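Inside the script I connect to the already-running cluster roughly like this (a sketch; "auto" stands in for however the address is actually passed, e.g. the explicit '<IP>:6379'):

```python
import ray

# Attach to the cluster started with `ray start --head ...`
# instead of launching a new local Ray instance.
ray.init(address="auto")
```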
When the trials are running I can see them on the dashboard. All trials run on the worker because the head was started with --num-cpus=0.
The checkpoint folder for every trial is left behind on the worker node, inside ~/ray_results/. Every trial shows the following error message on the head node:
2021-04-10 15:26:32,590 ERROR trial_runner.py:783 -- Trial A3C_MyAgentEnv_d1678_00000: Error handling checkpoint /home/rick.lan/ray_results/test-ray-cluster/A3C_MyAgentEnv_d1678_00000_0_lr=0.01_2021-04-10_15-24-18/checkpoint_20/checkpoint-20
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/trial_runner.py", line 778, in _process_trial_save
    checkpoint=trial.saving_to)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/callback.py", line 204, in on_checkpoint
    callback.on_checkpoint(**info)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/syncer.py", line 455, in on_checkpoint
    self._sync_trial_checkpoint(trial, checkpoint)
  File "/opt/conda/lib/python3.7/site-packages/ray/tune/syncer.py", line 430, in _sync_trial_checkpoint
    trial, checkpoint.value))
ray.tune.error.TuneError: Trial A3C_MyAgentEnv_d1678_00000: Checkpoint path /home/rick.lan/ray_results/test-ray-cluster/A3C_MyAgentEnv_d1678_00000_0_lr=0.01_2021-04-10_15-24-18/checkpoint_20/checkpoint-20 not found after successful sync down.
Using Ray 1.2.0.
I’m calling Ray Tune like this:
results = tune.run(
    args.run,
    name=args.name,
    config=config,
    stop=stop,
    checkpoint_at_end=True,
)
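I’m not overriding local_dir, so results go to the default ~/ray_results on whichever node a trial runs. Spelled out explicitly, the same call would look like this sketch (args.run, config and stop are as in the snippet above; the path is just the default written out):

```python
results = tune.run(
    args.run,
    name=args.name,
    config=config,
    stop=stop,
    checkpoint_at_end=True,
    local_dir="~/ray_results",  # default location; trials on the worker write here locally
)
```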
Edit: code highlighting