How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi! I am experiencing some issues with TuneBOHB restarting from scratch rather than the last checkpoint.
I run the provided bohb_example.py from the Ray Tune documentation and add the following:
sync_config = tune.SyncConfig(
    upload_dir="gs://path/to/bucket",
)
I then change the tune.run call accordingly, adding a GPU resource and the sync config:
analysis = tune.run(
    MyTrainableClass,
    name="bohb_test_2",
    config=config,
    resources_per_trial={"cpu": 1, "gpu": 1},
    scheduler=bohb_hyperband,
    search_alg=bohb_search,
    num_samples=10,
    stop={"training_iteration": 100},
    metric="episode_reward_mean",
    mode="max",
    sync_config=sync_config,
)
Running the script then results in the following error:
2022-03-24 13:31:33,617 ERROR worker.py:84 -- Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::MyTrainableClass.restore_from_object() (pid=3348, ip=10.132.0.37, repr=<bohb_example.MyTrainableClass object at 0x7fcd74934f10>)
At least one of the input arguments for this task could not be computed:
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: MyTrainableClass
actor_id: 00a7f10f4c41bb83e908f38d01000000
pid: 3148
namespace: 1d48f0ea-41b0-4996-977a-5a395ef4a183
ip: 10.132.0.37
The actor is dead because its worker process has died. Worker exit type: PLACEMENT_GROUP_REMOVED
This is followed by:
2022-03-24 13:31:33,717 INFO hyperband.py:453 -- Restoring from a previous point in time. Previous=4; Now=1
So rather than restarting from training iteration 4 (4 is just an example), the trial restarts from iteration 1. This would make sense if the checkpoint did not exist. However, the checkpoint exists both locally and in GCP.
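In case it helps with reproducing this, here is a minimal sketch for finding every such rollback in a saved copy of the driver output. The names find_rollbacks and ROLLBACK_RE are made up for illustration, and the regex assumes the message has exactly the format of the hyperband.py line quoted above.

```python
import re

# Assumption: the message format matches the hyperband.py log line quoted above.
ROLLBACK_RE = re.compile(
    r"Restoring from a previous point in time\. Previous=(\d+); Now=(\d+)"
)


def find_rollbacks(log_text):
    """Return a (previous, now) iteration pair for every restore message."""
    return [(int(prev), int(now)) for prev, now in ROLLBACK_RE.findall(log_text)]


if __name__ == "__main__":
    sample = (
        "2022-03-24 13:31:33,717 INFO hyperband.py:453 -- "
        "Restoring from a previous point in time. Previous=4; Now=1"
    )
    for previous, now in find_rollbacks(sample):
        print(f"trial rolled back from iteration {previous} to {now}")
```

Running it over the full log of a run with and without sync_config should show whether the rollbacks only appear when syncing to the bucket.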
When I run the script without syncing to GCP, the aforementioned error does not occur, and when needed, Tune is indeed able to restart the training from the last checkpoint.
I run everything on a GCP Compute Engine VM (Debian 10) with an attached NVIDIA Tesla K80 GPU, using:
hpbandster v 0.7.4
ray v 1.11.0