BOHB restarting from scratch rather than last checkpoint with GCP syncing

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi! I am experiencing some issues with TuneBOHB restarting from scratch rather than from the last checkpoint.

I run the provided bohb_example.py from the Ray Tune documentation and add the following:

sync_config = tune.SyncConfig(
    upload_dir="gs://path/to/bucket",
)

I then change the tune.run call accordingly, adding a GPU resource and the sync config:

analysis = tune.run(
    MyTrainableClass,
    name="bohb_test_2",
    config=config,
    resources_per_trial={"cpu": 1, "gpu": 1},
    scheduler=bohb_hyperband,
    search_alg=bohb_search,
    num_samples=10,
    stop={"training_iteration": 100},
    metric="episode_reward_mean",
    mode="max",
    sync_config=sync_config,
)

Running the script then results in the following error:

2022-03-24 13:31:33,617 ERROR worker.py:84 -- Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::MyTrainableClass.restore_from_object() (pid=3348, ip=10.132.0.37, repr=<bohb_example.MyTrainableClass object at 0x7fcd74934f10>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: MyTrainableClass
        actor_id: 00a7f10f4c41bb83e908f38d01000000
        pid: 3148
        namespace: 1d48f0ea-41b0-4996-977a-5a395ef4a183
        ip: 10.132.0.37
The actor is dead because its worker process has died. Worker exit type: PLACEMENT_GROUP_REMOVED

Followed by this:

2022-03-24 13:31:33,717 INFO hyperband.py:453 -- Restoring from a previous point in time. Previous=4; Now=1

So rather than resuming from training iteration 4, it restarts from iteration 1 (4 is just an example). This would make sense if the checkpoint did not exist; however, the checkpoint exists both locally and in the GCS bucket.
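
For reference, this is roughly how I verify that the checkpoint exists in both places (a quick sketch; ~/ray_results/bohb_test_2 is the default local results directory for this run, and the bucket path is the same placeholder as above):

import os
import subprocess

# Placeholders: the default Tune results dir for this experiment and the bucket.
local_dir = os.path.expanduser("~/ray_results/bohb_test_2")
remote_dir = "gs://path/to/bucket/bohb_test_2"

# Tune writes checkpoints into checkpoint_* subdirectories of each trial dir.
for root, dirs, _files in os.walk(local_dir):
    for d in sorted(dirs):
        if d.startswith("checkpoint_"):
            print("local:", os.path.join(root, d))

# And the synced copies in the bucket.
subprocess.run(["gsutil", "ls", "-r", remote_dir], check=True)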

When I run the script without syncing to GCP, the error above does not occur, and Tune is able to resume training from the last checkpoint when needed.

I run everything on a GCP Compute Engine VM (Debian 10) with an attached NVIDIA Tesla K80 GPU, using:

hpbandster v 0.7.4
ray        v 1.11.0

Further investigation shows that when a trial reaches its milestone, as per this printout:

Bracket(Max Size (n)=3, Milestone (r)=27, completed=8.9%): {PAUSED: 1, RUNNING: 2}

a new experiment is started for the same trial rather than continuing from the milestone, as can be seen in the progress.csv file (a short sketch of how I pull these columns out follows the table):

training_iteration trial_id experiment_id
1 2c23d664 22471147c2164f4089b328e859490325
1 2c23d664 caffca68824d4f69af93b458c2498831
2 2c23d664 caffca68824d4f69af93b458c2498831
3 2c23d664 caffca68824d4f69af93b458c2498831
1 2c23d664 bdf72789f5834cc68d6be31805db2c1c
2 2c23d664 bdf72789f5834cc68d6be31805db2c1c
3 2c23d664 bdf72789f5834cc68d6be31805db2c1c
4 2c23d664 bdf72789f5834cc68d6be31805db2c1c
5 2c23d664 bdf72789f5834cc68d6be31805db2c1c
6 2c23d664 bdf72789f5834cc68d6be31805db2c1c
7 2c23d664 bdf72789f5834cc68d6be31805db2c1c
8 2c23d664 bdf72789f5834cc68d6be31805db2c1c
9 2c23d664 bdf72789f5834cc68d6be31805db2c1c
1 2c23d664 894266f01b244e03b65f8543c5b7c11f
2 2c23d664 894266f01b244e03b65f8543c5b7c11f
3 2c23d664 894266f01b244e03b65f8543c5b7c11f
4 2c23d664 894266f01b244e03b65f8543c5b7c11f
5 2c23d664 894266f01b244e03b65f8543c5b7c11f
6 2c23d664 894266f01b244e03b65f8543c5b7c11f
7 2c23d664 894266f01b244e03b65f8543c5b7c11f
8 2c23d664 894266f01b244e03b65f8543c5b7c11f
9 2c23d664 894266f01b244e03b65f8543c5b7c11f
10 2c23d664 894266f01b244e03b65f8543c5b7c11f
11 2c23d664 894266f01b244e03b65f8543c5b7c11f
12 2c23d664 894266f01b244e03b65f8543c5b7c11f
13 2c23d664 894266f01b244e03b65f8543c5b7c11f
14 2c23d664 894266f01b244e03b65f8543c5b7c11f
15 2c23d664 894266f01b244e03b65f8543c5b7c11f
16 2c23d664 894266f01b244e03b65f8543c5b7c11f
17 2c23d664 894266f01b244e03b65f8543c5b7c11f
18 2c23d664 894266f01b244e03b65f8543c5b7c11f
19 2c23d664 894266f01b244e03b65f8543c5b7c11f
20 2c23d664 894266f01b244e03b65f8543c5b7c11f
21 2c23d664 894266f01b244e03b65f8543c5b7c11f
22 2c23d664 894266f01b244e03b65f8543c5b7c11f
23 2c23d664 894266f01b244e03b65f8543c5b7c11f
24 2c23d664 894266f01b244e03b65f8543c5b7c11f
25 2c23d664 894266f01b244e03b65f8543c5b7c11f
26 2c23d664 894266f01b244e03b65f8543c5b7c11f
27 2c23d664 894266f01b244e03b65f8543c5b7c11f
1 2c23d664 3037b295da1b4120a6ee082b1929e50b
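
For reference, this is roughly how those columns can be pulled out of the trial's progress.csv (a rough sketch; the path is a placeholder for the actual trial directory, and pandas is assumed to be available):

import pandas as pd

# Placeholder path: progress.csv sits inside the trial's directory under
# ~/ray_results/bohb_test_2/.
df = pd.read_csv("progress.csv")

cols = df[["training_iteration", "trial_id", "experiment_id"]]
print(cols.to_string(index=False))

# The experiment_id changes on every restart-from-scratch, so the maximum
# training_iteration per experiment_id shows how far each attempt got.
print(cols.groupby("experiment_id")["training_iteration"].max())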

I’m assuming this behaviour is not intended?

I don't know if this related stack trace might help as well:

(optimize_nc pid=2023) 2022-03-23 17:08:37,312  ERROR worker.py:430 -- SystemExit was raised from the worker.
(optimize_nc pid=2023) Traceback (most recent call last):
(optimize_nc pid=2023)   File "python/ray/_raylet.pyx", line 774, in ray._raylet.task_execution_handler
(optimize_nc pid=2023)   File "python/ray/_raylet.pyx", line 595, in ray._raylet.execute_task
(optimize_nc pid=2023)   File "python/ray/_raylet.pyx", line 633, in ray._raylet.execute_task
(optimize_nc pid=2023)   File "python/ray/_raylet.pyx", line 640, in ray._raylet.execute_task
(optimize_nc pid=2023)   File "python/ray/_raylet.pyx", line 644, in ray._raylet.execute_task
(optimize_nc pid=2023)   File "python/ray/_raylet.pyx", line 593, in ray._raylet.execute_task.function_executor
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/_private/function_manager.py", line 648, in actor_method_executor
(optimize_nc pid=2023)     return method(__ray_actor, *args, **kwargs)
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(optimize_nc pid=2023)     return method(self, *_args, **_kwargs)
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 457, in save_to_object
(optimize_nc pid=2023)     checkpoint_path = self.save()
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(optimize_nc pid=2023)     return method(self, *_args, **_kwargs)
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 452, in save
(optimize_nc pid=2023)     self._maybe_save_to_cloud(parent_dir)
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(optimize_nc pid=2023)     return method(self, *_args, **_kwargs)
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/trainable.py", line 425, in _maybe_save_to_cloud
(optimize_nc pid=2023)     self.storage_client.wait()
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/sync_client.py", line 287, in wait
(optimize_nc pid=2023)     _, error_msg = self.cmd_process.communicate()
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/subprocess.py", line 1018, in communicate
(optimize_nc pid=2023)     stderr = self.stderr.read()
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/worker.py", line 427, in sigterm_handler
(optimize_nc pid=2023)     sys.exit(1)
(optimize_nc pid=2023) SystemExit: 1

Hm yes, it looks like the sync to Google Cloud is not working properly, or maybe gsutil is not installed at all. On the remote node, can you check whether gsutil ls gs://path/to/bucket works?

Yes, the command works, and the existing subdirectories are listed.
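
Something like the following would also rule out a write/permission problem, since the sync needs to upload and not just list (a rough sketch; the bucket path is the same placeholder as above):

import subprocess
import tempfile

bucket = "gs://path/to/bucket"  # placeholder, same as in the SyncConfig above

# Listing works, as mentioned.
subprocess.run(["gsutil", "ls", bucket], check=True)

# Try an actual upload, since syncing needs write access as well.
with tempfile.NamedTemporaryFile(suffix=".txt") as f:
    f.write(b"sync test\n")
    f.flush()
    subprocess.run(["gsutil", "cp", f.name, bucket + "/sync_test.txt"], check=True)

# Remove the test object again.
subprocess.run(["gsutil", "rm", bucket + "/sync_test.txt"], check=True)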

Looking into update_trial_stats(self, trial: Trial, result: Dict) in hyperband.py, it looks like the trial object is at a later point in time than the result object when the error occurs.

E.g., using training_iteration as the measure of time, the trial object would be at training_iteration: 27 while the result object would be at training_iteration: 1.
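
Roughly the kind of comparison I mean, as a standalone sketch (not actual code from hyperband.py):

from typing import Dict

def result_went_backwards(trial_last_result: Dict, result: Dict) -> bool:
    # Hypothetical debug helper: an incoming result should never be older
    # than the last result already recorded for the trial.
    prev_iter = trial_last_result.get("training_iteration", 0)
    new_iter = result.get("training_iteration", 0)
    return new_iter < prev_iter

# The values I see when the error occurs after a milestone restore:
print(result_went_backwards({"training_iteration": 27}, {"training_iteration": 1}))
# True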

I have only noticed this behaviour when syncing with GCP, and it only happens when the trial is being restored after reaching a milestone.