Hello everyone,
The Problem:
When running population-based training (PBT) with Ray Tune, I get the following error message:
ERROR trial_runner.py:856 -- Trial train_model_46427_00002: Error processing result.
Traceback (most recent call last):
  File "C:\Users\Dirk\anaconda3\lib\site-packages\ray\tune\trial_runner.py", line 854, in _process_trial_save
    checkpoint_value = self.trial_executor.fetch_result(trial)
  File "C:\Users\Dirk\anaconda3\lib\site-packages\ray\tune\ray_trial_executor.py", line 489, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "C:\Users\Dirk\anaconda3\lib\site-packages\ray\worker.py", line 1452, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::ImplicitFunc.save() (pid=6664, ip=192.168.178.20)
  File "python\ray\_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "C:\Users\Dirk\anaconda3\lib\site-packages\ray\function_manager.py", line 553, in actor_method_executor
    return method(actor, *args, **kwargs)
  File "C:\Users\Dirk\anaconda3\lib\site-packages\ray\tune\function_runner.py", line 434, in save
    checkpoint_path = TrainableUtil.process_checkpoint(
  File "C:\Users\Dirk\anaconda3\lib\site-packages\ray\tune\trainable.py", line 41, in process_checkpoint
    raise ValueError(
ValueError: The returned checkpoint path must be within the given checkpoint dir C:\Users\Dirk/ray_results\2021-01-05_11-24-15iiap2bxm\checkpoint_4: C:\Users\Dirk\ray_results\2021-01-05_11-24-15iiap2bxm\checkpoint_4
I only get this error on some trials during the optimization; the others train without any issues.
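One thing that stands out in the ValueError: the two paths differ only in a single separator (`Dirk/ray_results` versus `Dirk\ray_results`). I have not checked how Ray compares them, so treating the check as a plain string prefix test is purely an assumption on my part. Under that assumption, this small sketch shows that the raw strings do not match while the normalized ones do. It uses `ntpath` so it behaves the same on any OS:

```python
import ntpath  # Windows path rules, importable on any operating system

# Both paths are copied verbatim from the ValueError message above.
checkpoint_dir = r"C:\Users\Dirk/ray_results\2021-01-05_11-24-15iiap2bxm\checkpoint_4"
returned_path = r"C:\Users\Dirk\ray_results\2021-01-05_11-24-15iiap2bxm\checkpoint_4"

# Assumption: a naive prefix check on the raw strings (not necessarily what Ray does).
raw_match = returned_path.startswith(checkpoint_dir)

# Same check after normalizing separators with Windows rules.
normalized_match = ntpath.normpath(returned_path).startswith(
    ntpath.normpath(checkpoint_dir)
)

print("raw strings match:       ", raw_match)
print("normalized strings match:", normalized_match)
```

If that is really what happens, the error might come from mixed separators rather than from my training function, but I cannot confirm this.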
My Code:
https://drive.google.com/drive/folders/19PfCL4MIU65Raq83kCLCpk1LOiBUyXOG?usp=sharing
Information about my setup:
Windows 10
PyCharm 2020.3
Python 3.8
ray 1.0.1.post1
PyTorch 1.7
Does anyone have an idea what causes this strange error? I think it has something to do with the perturbation and checkpointing, but I can’t find anything about it.