ValueError: The returned checkpoint path must be within the given checkpoint dir

Hello everyone,

The Problem:

When running population based training (PBT) with Ray Tune, I get the following error message:

ERROR trial_runner.py:856 -- Trial train_model_46427_00002: Error processing result.
Traceback (most recent call last):
  File "C:\Users\Dirk\anaconda3\lib\site-packages\ray\tune\trial_runner.py", line 854, in _process_trial_save
    checkpoint_value = self.trial_executor.fetch_result(trial)
  File "C:\Users\Dirk\anaconda3\lib\site-packages\ray\tune\ray_trial_executor.py", line 489, in fetch_result
    result = ray.get(trial_future[0], timeout=DEFAULT_GET_TIMEOUT)
  File "C:\Users\Dirk\anaconda3\lib\site-packages\ray\worker.py", line 1452, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(ValueError): ray::ImplicitFunc.save() (pid=6664, ip=192.168.178.20)
  File "python\ray\_raylet.pyx", line 482, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 436, in ray._raylet.execute_task.function_executor
  File "C:\Users\Dirk\anaconda3\lib\site-packages\ray\function_manager.py", line 553, in actor_method_executor
    return method(actor, *args, **kwargs)
  File "C:\Users\Dirk\anaconda3\lib\site-packages\ray\tune\function_runner.py", line 434, in save
    checkpoint_path = TrainableUtil.process_checkpoint(
  File "C:\Users\Dirk\anaconda3\lib\site-packages\ray\tune\trainable.py", line 41, in process_checkpoint
    raise ValueError(
ValueError: The returned checkpoint path must be within the given checkpoint dir C:\Users\Dirk/ray_results\2021-01-05_11-24-15iiap2bxm\checkpoint_4: C:\Users\Dirk\ray_results\2021-01-05_11-24-15iiap2bxm\checkpoint_4

However, the error occurs only for some trials during the optimization; the others train without any issues.

My Code:
https://drive.google.com/drive/folders/19PfCL4MIU65Raq83kCLCpk1LOiBUyXOG?usp=sharing

Information about my setup:
Windows 10
PyCharm 2020.3
Python 3.8
Ray 1.0.1.post1
PyTorch 1.7

Does anyone have an idea what causes this strange error? I think it has something to do with the perturbation and checkpointing, but I can’t find anything about it.

hey, I can’t access your code – can you post your tune.run command and also the snippets where you call tune.report and tune.checkpoint?

I put the code into a gist:

Can you see it now?

The checkpointing can be found in the train_model function. However, I am not using tune.checkpoint, but tune.checkpoint_dir together with torch.save…

You are using Windows 10, so maybe check whether your paths are OK.

C:\Users\Dirk/ray_results\2021-01-05_11-24-15iiap2bxm\checkpoint_4

This path mixes forward slashes and backslashes.
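To illustrate why mixed separators could matter, here is a minimal sketch. It assumes that the "must be within" check is a plain string-prefix comparison, which this thread does not confirm; the helper names is_within_naive and is_within_normalized are hypothetical and not part of Ray. It uses ntpath (the Windows flavour of os.path) so that it behaves the same on any operating system.

```python
import ntpath  # Windows path semantics, importable on every platform

# The two paths exactly as they appear in the error message above.
checkpoint_dir = r"C:\Users\Dirk/ray_results\2021-01-05_11-24-15iiap2bxm\checkpoint_4"
returned_path = r"C:\Users\Dirk\ray_results\2021-01-05_11-24-15iiap2bxm\checkpoint_4"


def is_within_naive(path, parent):
    """Hypothetical check: raw string prefix comparison (assumed, not confirmed)."""
    return path.startswith(parent)


def is_within_normalized(path, parent):
    """Hypothetical check: normalize separators and case before comparing."""
    path = ntpath.normcase(ntpath.normpath(path))
    parent = ntpath.normcase(ntpath.normpath(parent))
    try:
        return ntpath.commonpath([path, parent]) == parent
    except ValueError:
        # Raised when the paths are on different drives or mix absolute and relative.
        return False


print(is_within_naive(returned_path, checkpoint_dir))       # False
print(is_within_normalized(returned_path, checkpoint_dir))  # True
```

Both strings point to the same directory, but only the normalized comparison recognizes that.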

Peter

Hey Peter,

I was also curious about that, but the paths are managed by Ray Tune itself. Furthermore, this does not seem to be a problem for some trials during the optimization, because they train to the end.

I suggested this as a possible problem because I have often had trouble with paths when switching between Anaconda on Ubuntu and on Windows 10.

Ah, I think this is a bug on the Ray side. Could you file an issue on GitHub?

Hey Richard, I opened an issue on GitHub!