BOHB restarting from scratch rather than last checkpoint with GCP syncing

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi! I am experiencing some issues with TuneBOHB restarting from scratch rather than from the last checkpoint.

I run the provided bohb_example.py from the Ray Tune documentation and add the following:

sync_config = tune.SyncConfig(
    upload_dir="gs://path/to/bucket",
)

I then change the tune.run call accordingly, adding a GPU resource and the sync config:

analysis = tune.run(
    MyTrainableClass,
    name="bohb_test_2",
    config=config,
    resources_per_trial={"cpu": 1, "gpu": 1},
    scheduler=bohb_hyperband,
    search_alg=bohb_search,
    num_samples=10,
    stop={"training_iteration": 100},
    metric="episode_reward_mean",
    mode="max",
    sync_config=sync_config,
)

Running the script then results in the following error:

2022-03-24 13:31:33,617 ERROR worker.py:84 -- Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::MyTrainableClass.restore_from_object() (pid=3348, ip=10.132.0.37, repr=<bohb_example.MyTrainableClass object at 0x7fcd74934f10>)
  At least one of the input arguments for this task could not be computed:
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: MyTrainableClass
        actor_id: 00a7f10f4c41bb83e908f38d01000000
        pid: 3148
        namespace: 1d48f0ea-41b0-4996-977a-5a395ef4a183
        ip: 10.132.0.37
The actor is dead because its worker process has died. Worker exit type: PLACEMENT_GROUP_REMOVED

Followed by this:

2022-03-24 13:31:33,717 INFO hyperband.py:453 -- Restoring from a previous point in time. Previous=4; Now=1

So rather than resuming from training iteration 4, it restarts from iteration 1 (4 is just an example). This would make sense if the checkpoint did not exist; however, the checkpoint exists both locally and in the GCS bucket.
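
For reference, this is roughly how I verify that the checkpoint exists in both places (a quick sketch; ~/ray_results/bohb_test_2 is the default local results directory for this run, and the bucket path is the same placeholder as above):

import os
import subprocess

# Placeholders: the default Tune results dir for this experiment and the bucket.
local_dir = os.path.expanduser("~/ray_results/bohb_test_2")
remote_dir = "gs://path/to/bucket/bohb_test_2"

# Tune writes checkpoints into checkpoint_* subdirectories of each trial dir.
for root, dirs, _files in os.walk(local_dir):
    for d in sorted(dirs):
        if d.startswith("checkpoint_"):
            print("local:", os.path.join(root, d))

# And the synced copies in the bucket.
subprocess.run(["gsutil", "ls", "-r", remote_dir], check=True)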

When I run the script without syncing to GCP, the error above does not occur, and Tune is able to resume training from the last checkpoint when needed.

I run everything on a GCP Compute Engine VM (Debian 10) with an attached NVIDIA Tesla K80 GPU, using:

hpbandster v 0.7.4
ray        v 1.11.0

Further investigation shows that when a trial reaches its milestone, as per this printout:

Bracket(Max Size (n)=3, Milestone (r)=27, completed=8.9%): {PAUSED: 1, RUNNING: 2}

a new experiment is started for the same trial rather than continuing from the milestone, as can be seen in the progress.csv file (a short sketch of how I pull these columns out follows the table):

training_iteration trial_id experiment_id
1 2c23d664 22471147c2164f4089b328e859490325
1 2c23d664 caffca68824d4f69af93b458c2498831
2 2c23d664 caffca68824d4f69af93b458c2498831
3 2c23d664 caffca68824d4f69af93b458c2498831
1 2c23d664 bdf72789f5834cc68d6be31805db2c1c
2 2c23d664 bdf72789f5834cc68d6be31805db2c1c
3 2c23d664 bdf72789f5834cc68d6be31805db2c1c
4 2c23d664 bdf72789f5834cc68d6be31805db2c1c
5 2c23d664 bdf72789f5834cc68d6be31805db2c1c
6 2c23d664 bdf72789f5834cc68d6be31805db2c1c
7 2c23d664 bdf72789f5834cc68d6be31805db2c1c
8 2c23d664 bdf72789f5834cc68d6be31805db2c1c
9 2c23d664 bdf72789f5834cc68d6be31805db2c1c
1 2c23d664 894266f01b244e03b65f8543c5b7c11f
2 2c23d664 894266f01b244e03b65f8543c5b7c11f
3 2c23d664 894266f01b244e03b65f8543c5b7c11f
4 2c23d664 894266f01b244e03b65f8543c5b7c11f
5 2c23d664 894266f01b244e03b65f8543c5b7c11f
6 2c23d664 894266f01b244e03b65f8543c5b7c11f
7 2c23d664 894266f01b244e03b65f8543c5b7c11f
8 2c23d664 894266f01b244e03b65f8543c5b7c11f
9 2c23d664 894266f01b244e03b65f8543c5b7c11f
10 2c23d664 894266f01b244e03b65f8543c5b7c11f
11 2c23d664 894266f01b244e03b65f8543c5b7c11f
12 2c23d664 894266f01b244e03b65f8543c5b7c11f
13 2c23d664 894266f01b244e03b65f8543c5b7c11f
14 2c23d664 894266f01b244e03b65f8543c5b7c11f
15 2c23d664 894266f01b244e03b65f8543c5b7c11f
16 2c23d664 894266f01b244e03b65f8543c5b7c11f
17 2c23d664 894266f01b244e03b65f8543c5b7c11f
18 2c23d664 894266f01b244e03b65f8543c5b7c11f
19 2c23d664 894266f01b244e03b65f8543c5b7c11f
20 2c23d664 894266f01b244e03b65f8543c5b7c11f
21 2c23d664 894266f01b244e03b65f8543c5b7c11f
22 2c23d664 894266f01b244e03b65f8543c5b7c11f
23 2c23d664 894266f01b244e03b65f8543c5b7c11f
24 2c23d664 894266f01b244e03b65f8543c5b7c11f
25 2c23d664 894266f01b244e03b65f8543c5b7c11f
26 2c23d664 894266f01b244e03b65f8543c5b7c11f
27 2c23d664 894266f01b244e03b65f8543c5b7c11f
1 2c23d664 3037b295da1b4120a6ee082b1929e50b
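
For reference, this is roughly how those columns can be pulled out of the trial's progress.csv (a rough sketch; the path is a placeholder for the actual trial directory, and pandas is assumed to be available):

import pandas as pd

# Placeholder path: progress.csv sits inside the trial's directory under
# ~/ray_results/bohb_test_2/.
df = pd.read_csv("progress.csv")

cols = df[["training_iteration", "trial_id", "experiment_id"]]
print(cols.to_string(index=False))

# The experiment_id changes on every restart-from-scratch, so the maximum
# training_iteration per experiment_id shows how far each attempt got.
print(cols.groupby("experiment_id")["training_iteration"].max())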

I’m assuming this behaviour is not intended?

I don't know if this related stack trace might help as well:

(optimize_nc pid=2023) 2022-03-23 17:08:37,312  ERROR worker.py:430 -- SystemExit was raised from the worker.
(optimize_nc pid=2023) Traceback (most recent call last):
(optimize_nc pid=2023)   File "python/ray/_raylet.pyx", line 774, in ray._raylet.task_execution_handler
(optimize_nc pid=2023)   File "python/ray/_raylet.pyx", line 595, in ray._raylet.execute_task
(optimize_nc pid=2023)   File "python/ray/_raylet.pyx", line 633, in ray._raylet.execute_task
(optimize_nc pid=2023)   File "python/ray/_raylet.pyx", line 640, in ray._raylet.execute_task
(optimize_nc pid=2023)   File "python/ray/_raylet.pyx", line 644, in ray._raylet.execute_task
(optimize_nc pid=2023)   File "python/ray/_raylet.pyx", line 593, in ray._raylet.execute_task.function_executor
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/_private/function_manager.py", line 648, in actor_method_executor
(optimize_nc pid=2023)     return method(__ray_actor, *args, **kwargs)
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(optimize_nc pid=2023)     return method(self, *_args, **_kwargs)
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 457, in save_to_object
(optimize_nc pid=2023)     checkpoint_path = self.save()
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(optimize_nc pid=2023)     return method(self, *_args, **_kwargs)
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/function_runner.py", line 452, in save
(optimize_nc pid=2023)     self._maybe_save_to_cloud(parent_dir)
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/util/tracing/tracing_helper.py", line 451, in _resume_span
(optimize_nc pid=2023)     return method(self, *_args, **_kwargs)
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/trainable.py", line 425, in _maybe_save_to_cloud
(optimize_nc pid=2023)     self.storage_client.wait()
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/tune/sync_client.py", line 287, in wait
(optimize_nc pid=2023)     _, error_msg = self.cmd_process.communicate()
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/subprocess.py", line 1018, in communicate
(optimize_nc pid=2023)     stderr = self.stderr.read()
(optimize_nc pid=2023)   File "/opt/conda/lib/python3.8/site-packages/ray/worker.py", line 427, in sigterm_handler
(optimize_nc pid=2023)     sys.exit(1)
(optimize_nc pid=2023) SystemExit: 1

Hm yes, it looks like the sync to Google Cloud is not working properly, or maybe gsutil is not installed at all. On the remote node, can you check whether gsutil ls gs://path/to/bucket works?

Yes, the command works, and the existing subdirectories are listed.
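
Something like the following would also rule out a write/permission problem, since the sync needs to upload and not just list (a rough sketch; the bucket path is the same placeholder as above):

import subprocess
import tempfile

bucket = "gs://path/to/bucket"  # placeholder, same as in the SyncConfig above

# Listing works, as mentioned.
subprocess.run(["gsutil", "ls", bucket], check=True)

# Try an actual upload, since syncing needs write access as well.
with tempfile.NamedTemporaryFile(suffix=".txt") as f:
    f.write(b"sync test\n")
    f.flush()
    subprocess.run(["gsutil", "cp", f.name, bucket + "/sync_test.txt"], check=True)

# Remove the test object again.
subprocess.run(["gsutil", "rm", bucket + "/sync_test.txt"], check=True)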

Looking into update_trial_stats(self, trial: Trial, result: Dict) in hyperband.py, it looks like the trial object is at a later point in time than the result object when the error occurs.

E.g., using training_iteration as the measure of time, the trial object would be at training_iteration: 27 while the result object would be at training_iteration: 1.
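
Roughly the kind of comparison I mean, as a standalone sketch (not actual code from hyperband.py):

from typing import Dict

def result_went_backwards(trial_last_result: Dict, result: Dict) -> bool:
    # Hypothetical debug helper: an incoming result should never be older
    # than the last result already recorded for the trial.
    prev_iter = trial_last_result.get("training_iteration", 0)
    new_iter = result.get("training_iteration", 0)
    return new_iter < prev_iter

# The values I see when the error occurs after a milestone restore:
print(result_went_backwards({"training_iteration": 27}, {"training_iteration": 1}))
# True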

I have only noticed this behaviour when syncing with GCP, and it only happens when the trial is being restored after reaching a milestone.