How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I am working on a customised Gymnasium environment trained with SAC under RLlib, using Ray Tune for hyperparameter tuning, on Windows 10/11. After around 10,000 steps, regardless of how long the expected episode length is (I have tried more than a million), the following error messages always appear:
(SAC pid=[id]) 2023-06-24 09:43:58,534 WARNING policy.py:134 -- Can not figure out a durable policy name for <class 'ray.rllib.policy.eager_tf_policy.SACTFPolicy_eager'>. You are probably trying to checkpoint a custom policy. Raw policy class may cause problems when the checkpoint needs to be loaded in the future. To fix this, make sure you add your custom policy in rllib.algorithms.registry.POLICIES.
Followed by:
(SAC pid=[id]) 2023-06-24 09:44:00,449 ERROR syncer.py:466 -- Caught sync error: Sync process failed: [WinError 32] Failed copying 'C:/Users/[username]/ray_results/SAC/SAC_Env_fbfe2_00000_0_train_batch_size=16_[date][time]/checkpoint_000100/.is_checkpoint' to 'c:///Users/[username]/ray_results/SAC/SAC_Env_fbfe2_00000_0_train_batch_size=16[date]_[time]/checkpoint_000100/.is_checkpoint'. Detail: [Windows error 32] The process cannot access the file because it is being used by another process.
I have spent a bit of time eliminating the following as possible causes (a simplified sketch of the relevant configuration follows this list):
- policy_model_config and q_model_config under RLlib's SACConfig().training(): I originally had grid searches on fcnet_hiddens and fcnet_activation (otherwise defaults), but I saw the same error even with defaults only.
- local_dir under Tune's RunConfig: using the default location yielded the same result.
- checkpoint_frequency under CheckpointConfig (inside Tune's RunConfig): nothing between the default value and 100,000 helped.
- tune.SyncConfig parameters under Tune's RunConfig: I set sync_period and sync_timeout to very high values (12,000 to 60,000) but still observed the same error.
- The same thing happened whether or not I used TensorBoard during training.
- Running as administrator (I am using VS Code) did not help either.
- Running on single or multiple GPUs did not matter.
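For reference, below is a simplified sketch of the setup. It is not my exact script: "Pendulum-v1" stands in for my custom Gymnasium environment and the concrete values are placeholders, but the structure (SACConfig, Tuner, RunConfig, CheckpointConfig, SyncConfig) matches what I am running.

```python
from ray import air, tune
from ray.rllib.algorithms.sac import SACConfig

# Simplified sketch only: "Pendulum-v1" stands in for my custom Gymnasium
# environment, and the numeric values are placeholders, not my exact settings.
config = (
    SACConfig()
    .environment(env="Pendulum-v1")
    .training(
        train_batch_size=tune.grid_search([16, 32]),
        # policy_model_config / q_model_config (fcnet_hiddens, fcnet_activation)
        # were also grid-searched at first, but the error shows up with the
        # defaults as well.
    )
)

tuner = tune.Tuner(
    "SAC",
    param_space=config.to_dict(),
    run_config=air.RunConfig(
        # local_dir left at its default (~/ray_results); pointing it elsewhere
        # made no difference.
        checkpoint_config=air.CheckpointConfig(checkpoint_frequency=100),
        sync_config=tune.SyncConfig(sync_period=12_000, sync_timeout=60_000),
    ),
)
results = tuner.fit()
```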
I was tempted to pass local_mode=True to ray.init(), but I am quite sure Ray already knows I am running locally (I always see "Started a local Ray instance" at start-up), and since local mode is now deprecated I did not actually try it.
It seems quite clear to me that the syncer is trying to copy .is_checkpoint onto itself, which causes the second error and stops me from completing the training. But since I am running entirely locally, I am not sure why the syncer comes into play at all - I was under the impression that it is meant for distributed setups.
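If the syncer really is redundant for a purely local run, would disabling it entirely be a sane workaround? I have in mind something like the sketch below (untested, and assuming SyncConfig(syncer=None) is still how syncing is turned off in my Ray version):

```python
from ray import tune

# Untested idea: disable trial syncing entirely, since everything runs on one
# local Windows machine. Assumes SyncConfig(syncer=None) turns the syncer off
# in this Ray version.
sync_config = tune.SyncConfig(syncer=None)
```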
On the other hand, the first warning message is also perplexing: it pops up even when I use the default policy settings - does it appear because I am using a customised environment?
Does anyone know what the root cause(s) could possibly be? Any help or hints would be much appreciated.