Issue with Checkpointing in Ray 2.9.1 on Windows 11 while Training PPO Algorithm

hermmanhender · January 29, 2024, 9:34pm

I am currently working with Ray 2.9.1 on Windows 11 to train a Proximal Policy Optimization (PPO) algorithm. However, I have run into an issue with the checkpointing mechanism that did not occur in previous versions.

The error message I am encountering is as follows:

2024-01-29 22:20:07,251 ERROR tune_controller.py:1374 -- Trial task failed for trial PPO_EPEnv_e6ca1616
Traceback (most recent call last):
  File "C:\Users\grhen\anaconda3\envs\ray291\lib\site-packages\ray\air\execution\_internal\event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "C:\Users\grhen\anaconda3\envs\ray291\lib\site-packages\ray\_private\auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "C:\Users\grhen\anaconda3\envs\ray291\lib\site-packages\ray\_private\client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\grhen\anaconda3\envs\ray291\lib\site-packages\ray\_private\worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(FileNotFoundError): ray::PPO.save() (pid=11124, ip=127.0.0.1, actor_id=984bb80f54af807c18b1405e01000000, repr=PPO)
  File "python\ray\_raylet.pyx", line 1813, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1754, in ray._raylet.execute_task.function_executor
  File "C:\Users\grhen\anaconda3\envs\ray291\lib\site-packages\ray\_private\function_manager.py", line 726, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "C:\Users\grhen\anaconda3\envs\ray291\lib\site-packages\ray\util\tracing\tracing_helper.py", line 467, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\grhen\anaconda3\envs\ray291\lib\site-packages\ray\tune\trainable\trainable.py", line 480, in save
    persisted_checkpoint = self._storage.persist_current_checkpoint(
  File "C:\Users\grhen\anaconda3\envs\ray291\lib\site-packages\ray\train\_internal\storage.py", line 558, in persist_current_checkpoint
    _pyarrow_fs_copy_files(
  File "C:\Users\grhen\anaconda3\envs\ray291\lib\site-packages\ray\train\_internal\storage.py", line 110, in _pyarrow_fs_copy_files
    return pyarrow.fs.copy_files(
  File "C:\Users\grhen\anaconda3\envs\ray291\lib\site-packages\pyarrow\fs.py", line 244, in copy_files
    _copy_files_selector(source_fs, source_sel,
  File "pyarrow\_fs.pyx", line 1229, in pyarrow._fs._copy_files_selector
  File "pyarrow\error.pxi", line 110, in pyarrow.lib.check_status
FileNotFoundError: [WinError 206] Cannot create directory 'C:/Users/grhen/ray_results/PPO_2024-01-29_22-11-43/PPO_EPEnv_e6ca1616_1_type=StochasticSampling,disable_action_flattening=False,disable_execution_plan_api=True,disable_initialize_lo_2024-01-29_22-11-43/checkpoint_000000/learner/module_state/default_policy'. Detail: [Windows error 206] The file name or extension is too long.

I am seeking guidance on resolving this issue. In particular, I would like to know whether there is a way to reduce the amount of information included in the automatically assigned trial folder name. I tried renaming the experiment folder through air.RunConfig, but this only modifies the experiment directory name, not the trial directory name.
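
To see how close the failing directory is to the Windows limit, a quick check like the one below can help. This is only a sketch: it assumes long-path support is not enabled, and it uses the commonly documented limits of 260 characters for a full path (MAX_PATH) and MAX_PATH minus 12 for a directory path. The helper name `path_budget` is made up for illustration.

```python
# Commonly documented Windows limits (assumption: long paths are not enabled).
MAX_PATH = 260
MAX_DIR_PATH = MAX_PATH - 12  # a directory path must leave room for an 8.3 file name


def path_budget(path: str, limit: int = MAX_DIR_PATH) -> int:
    """Return how many characters are left before `path` reaches `limit`.

    A negative result means the path is already over the limit.
    """
    return limit - len(path)


# The directory from the error message above, split only for readability.
failing_dir = (
    "C:/Users/grhen/ray_results/PPO_2024-01-29_22-11-43/"
    "PPO_EPEnv_e6ca1616_1_type=StochasticSampling,disable_action_flattening=False,"
    "disable_execution_plan_api=True,disable_initialize_lo_2024-01-29_22-11-43/"
    "checkpoint_000000/learner/module_state/default_policy"
)

print("path length:", len(failing_dir))
print("characters left for a directory path:", path_budget(failing_dir))
```

Most of the length comes from the auto-generated trial directory name, so shortening that one component is where the budget can be recovered.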

Any assistance or insights regarding how to rectify this matter would be greatly appreciated.

Thank you.
Best regards, Germán

hermmanhender · January 30, 2024, 9:03am

Hi again! I solved the problem with the trial_name_creator and trial_dirname_creator options of tune.TuneConfig. Example below:

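from ray import tune

def trial_str_creator(trial):
    # Short, fixed-format name: "<trainable_name>_<trial_id>_123"
    return "{}_{}_123".format(trial.trainable_name, trial.trial_id)

tune.Tuner(
    algorithm,  # the trainable to tune, defined elsewhere
    tune_config=tune.TuneConfig(
        # Use the same short string for the trial name and its directory
        trial_name_creator=trial_str_creator,
        trial_dirname_creator=trial_str_creator,
    ),
)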
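
If the trainable name itself can be long, the same idea can be extended with a hard cap on the directory name length. The following is a minimal sketch, not part of Ray: `short_trial_dirname` and `MAX_DIRNAME_LEN` are hypothetical names, the 40-character budget is arbitrary, and a `SimpleNamespace` stands in for the trial object so the snippet runs without Ray installed. It relies only on the `trainable_name` and `trial_id` attributes already used above.

```python
from types import SimpleNamespace

MAX_DIRNAME_LEN = 40  # hypothetical budget; pick whatever fits your base path


def short_trial_dirname(trial, max_len: int = MAX_DIRNAME_LEN) -> str:
    """Build "<trainable_name>_<trial_id>", truncating the name but never the id."""
    trial_id = str(trial.trial_id)
    # Room left for the trainable name after the id and the "_" separator.
    room = max(max_len - len(trial_id) - 1, 0)
    prefix = str(trial.trainable_name)[:room]
    return f"{prefix}_{trial_id}" if prefix else trial_id


# Stand-in for a trial object, for illustration only.
fake_trial = SimpleNamespace(trainable_name="PPO", trial_id="e6ca1616")
print(short_trial_dirname(fake_trial))  # PPO_e6ca1616
```

Keeping the trial id intact means two trials cannot end up with the same directory name, even when the name part is cut. The function can be passed as `trial_dirname_creator` in the same way as `trial_str_creator`.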
