Ray Tune copies checkpoint to the same location when running locally

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am working on a customised gymnasium SAC environment under RLlib, using Ray Tune for hyperparameter tuning, on Windows 10/11. After around 10,000 steps, regardless of the expected episode length (I have tried values of over a million), the following error messages always appear:

(SAC pid=[id]) 2023-06-24 09:43:58,534 WARNING policy.py:134 -- Can not figure out a durable policy name for <class 'ray.rllib.policy.eager_tf_policy.SACTFPolicy_eager'>. You are probably trying to checkpoint a custom policy. Raw policy class may cause problems when the checkpoint needs to be loaded in the future. To fix this, make sure you add your custom policy in rllib.algorithms.registry.POLICIES.

Followed by:

(SAC pid=[id]) 2023-06-24 09:44:00,449 ERROR syncer.py:466 -- Caught sync error: Sync process failed: [WinError 32] Failed copying 'C:/Users/[username]/ray_results/SAC/SAC_Env_fbfe2_00000_0_train_batch_size=16_[date][time]/checkpoint_000100/.is_checkpoint' to 'c:///Users/[username]/ray_results/SAC/SAC_Env_fbfe2_00000_0_train_batch_size=16[date]_[time]/checkpoint_000100/.is_checkpoint'. Detail: [Windows error 32] The process cannot access the file because it is being used by another process.
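For context, the custom environment itself is a plain gymnasium environment registered with RLlib; a minimal sketch of that kind of setup, with MyCustomEnv as a hypothetical stand-in for my actual environment:

    import gymnasium as gym
    from ray.tune.registry import register_env

    class MyCustomEnv(gym.Env):
        """Hypothetical stand-in for my actual environment."""

        def __init__(self, config=None):
            self.observation_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(4,))
            self.action_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(2,))

        def reset(self, *, seed=None, options=None):
            super().reset(seed=seed)
            return self.observation_space.sample(), {}

        def step(self, action):
            # Returns (observation, reward, terminated, truncated, info).
            return self.observation_space.sample(), 0.0, False, False, {}

    # "[env_name]" is the string later passed to SACConfig().environment(...).
    register_env("[env_name]", lambda config: MyCustomEnv(config))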

I have spent a bit of time eliminating the following as possible causes:

  1. policy_model_config and q_model_config under RLlib's SACConfig().training(): I originally had grid searches on fcnet_hiddens and fcnet_activation (otherwise using defaults), but I saw the same error even when using only the defaults (see the sketch after this list).
  2. local_dir under Tune's RunConfig: the default location yielded the same result.
  3. checkpoint_frequency under CheckpointConfig in Tune's RunConfig: trying anything from the default value up to 100,000 did not help.
  4. tune.SyncConfig parameters under Tune's RunConfig: I set sync_period and sync_timeout to very high values (12,000 to 60,000) but still observed the same error.
  5. The same thing happened whether or not I used TensorBoard during training.
  6. Running the environment in administrator mode (I am using VS Code) did not help either.
  7. Running on a single GPU or on multiple GPUs did not matter.
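For reference, a rough sketch of the grid search mentioned in (1) and the SyncConfig settings mentioned in (4); the exact values are illustrative:

    from ray import tune

    # (1) Grid searches originally placed on the model configs (illustrative values):
    policy_model_config = {
        "fcnet_hiddens": tune.grid_search([[32, 32], [64, 64]]),
        "fcnet_activation": tune.grid_search(["tanh", "relu"]),
    }

    # (4) SyncConfig with very high sync_period / sync_timeout (in seconds):
    sync_config = tune.SyncConfig(
        sync_period=12000,
        sync_timeout=60000,
    )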

I was tempted to set local_mode = True in ray.init(), but I am fairly sure Ray already knows I am running locally (I always see "Started a local Ray instance" at startup), and since local mode is now deprecated I did not actually try it.

It seems quite clear to me that the syncer is trying to copy .is_checkpoint onto itself, which causes the second error and stops me from completing training. But since I am running entirely locally, I am not sure why the syncer comes into play at all - I was under the impression that it is meant for distributed environments.

On the other hand, the first warning message is also perplexing: it pops up even when I use the default policies. Does it appear because I am using a customised environment?

Does anyone know what the root cause(s) might be? Any help or hint would be much appreciated.

Regarding the sync error: if you didn't specify an upload_dir or a remote storage_path, there shouldn't be a syncer at all. Can you post how you initialize your Tuner(), including the RunConfig and SyncConfig (if specified)?

In any case, you can pass SyncConfig(syncer=None) to disable syncing. But again, it would be good to know what it currently looks like so we can see what the problem may be. It might be a Windows-specific problem.
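For reference, a minimal sketch of passing that in, assuming the Ray 2.6-style ray.air.RunConfig / tune.SyncConfig API:

    from ray import air, tune

    # Disable checkpoint syncing entirely (Ray 2.6-style API).
    run_config = air.RunConfig(
        sync_config=tune.SyncConfig(syncer=None),
    )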

The sync error you’re seeing is just a logger message though. Training should continue as normal afterwards.

For the RLlib warning I'll defer to @arturn or @sven1977, who may be able to help you.

PS: Ray local_mode is something different - it uses a different execution backend for tasks and is generally only recommended for development and testing.
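For reference, this is what enabling it would look like (not recommended here, and deprecated in recent Ray versions):

    import ray

    # Runs Ray tasks and actors serially in a single local process; only useful for debugging.
    ray.init(local_mode=True)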


The Tuner was originally initialized as below:

    tuner = tune.Tuner("SAC", 
        param_space = sac_config,
        tune_config = TuneConfig(
            scheduler = ASHAScheduler(metric = "episode_reward_max", mode = "max", grace_period = 5),
            num_samples = 6,
            ),
        run_config = RunConfig(
            name = f"SAC_[project_name]",
            stop = {"timesteps_total": 2000000},
            local_dir = ospath.dirname(__file__), #more on this below #1
            storage_path = ospath.dirname(__file__), #again more on this below #1
            verbose = 1,
            log_to_file = True,
            checkpoint_config = CheckpointConfig(
                checkpoint_frequency = 100,
                num_to_keep = 5,
                checkpoint_score_attribute = "episode_reward_max",
                checkpoint_score_order = "max",
                checkpoint_at_end = True,
                ),
            sync_config = tune.SyncConfig(
              syncer = "auto", 
              ),
            ),
        )

And sac_config is:

sac_config = SACConfig().environment("[env_name]")
sac_config = sac_config.framework(
    framework ="tf2",
    eager_tracing= False,
    )
sac_config = sac_config.debugging(seed = seed)
policy_model_config = {
    "fcnet_hiddens": [32, 32],
    "fcnet_activation": "tanh", 
}
q_model_config = {
    "fcnet_hiddens": [32, 32],
    "fcnet_activation": "relu",
}
sac_config = sac_config.training(
    clip_actions = True,
    gamma = 0.99, 
    optimization_config = {
        "actor_learning_rate": 1e-5, 
        "critic_learning_rate": 1e-2, 
        "entropy_learning_rate": 1e-2, 
    },
    policy_model_config = policy_model_config,
    q_model_config = q_model_config,
    store_buffer_in_checkpoints = True,
    target_network_update_freq = 1,
    tau = 0.005,
    train_batch_size = 16,
    twin_q = True,
    num_steps_sampled_before_learning_starts = 20000,
)
sac_config = sac_config.resources(num_gpus = 1, num_cpus_per_worker = 1, num_gpus_per_worker = 1)

I undertook a few tests based on your reply:

  1. I did specify both local_dir and storage_path earlier (when syncer = "auto"), because if I did not, a ray_results folder is created in C:\Users\[username]\ray_results and I prefer having it in my project folder (excuse my OCD).

  2. When I tried SyncConfig(syncer=None) with only local_dir set, Ray churned out a UserWarning saying I should use RunConfig.storage_path, followed by a ValueError: upload_dir enables syncing to cloud storage, but syncer=None disables syncing. Either remove the upload_dir, or set syncer to 'auto' or a custom syncer. No training could happen before Ray terminated.

  3. When I tried SyncConfig(syncer=None) with only storage_path set, Ray gave the same ValueError: upload_dir enables syncing to cloud storage, but syncer=None disables syncing. Either remove the upload_dir, or set syncer to 'auto' or a custom syncer. Again, no training could happen before Ray terminated.

  4. Then I left local_dir and storage_path at their defaults, with SyncConfig(syncer=None); the workers could train for a while, but later terminated at around 10,000 steps with the following error message:

Failure # 1 (occurred at [date]_[time])
ray::SAC.save() (pid=26308, ip=127.0.0.1, actor_id=cf8caf47aab9584028d46e7101000000, repr=SAC)
  File "python\ray\_raylet.pyx", line 1434, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1438, in ray._raylet.execute_task
  File "python\ray\_raylet.pyx", line 1378, in ray._raylet.execute_task.function_executor
  File "C:\Users\[username]\AppData\Local\Programs\Python\Python39\lib\site-packages\ray\_private\function_manager.py", line 724, in actor_method_executor
    return method(__ray_actor, *args, **kwargs)
  File "C:\Users\[username]\AppData\Local\Programs\Python\Python39\lib\site-packages\ray\util\tracing\tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\[username]\AppData\Local\Programs\Python\Python39\lib\site-packages\ray\tune\trainable\trainable.py", line 542, in save
    self._maybe_save_to_cloud(checkpoint_dir)
  File "C:\Users\[username]\AppData\Local\Programs\Python\Python39\lib\site-packages\ray\util\tracing\tracing_helper.py", line 464, in _resume_span
    return method(self, *_args, **_kwargs)
  File "C:\Users\[username]\AppData\Local\Programs\Python\Python39\lib\site-packages\ray\tune\trainable\trainable.py", line 659, in _maybe_save_to_cloud
    assert syncer
AssertionError

I am facing the exact same syncing error on Windows. I have tried both Ray Train and Ray Tune.

Since my last post I have tried Stable Baselines 3, and the environment works perfectly there, so my customised environment is probably not the culprit. The error only appears when I use newer versions of Ray (I tried 2.4, 2.6.0 and 2.6.1). As @Sushwyzr mentioned, the same error pops up regardless of whether Ray Tune is used, and it still happens even under WSL on Windows.

I tried setting up a Ray cluster, and I still see a sync error on GetFileInfo(), which closely resembles this potential bug.

For now I have downgraded from Ray 2.6.1 to 2.3.0, and it is crunching numbers happily. @Sushwyzr, you may want to try that and see if it works.

For the Ray team's information @kai, to me this is a bug - having dug into the code, specifying SyncConfig as None throws an error at this line, and if the default syncer is used, the remote directory is the same as the local directory here, regardless of whether a Ray cluster is launched. It looks like somewhere in the code the remote path is set to the local path, which Windows does not like.

I am happy to help the team debug this - I will also note it on the GitHub issue above.

Hi @Teenforever, thank you for your response. I am using Ray Datasets in my project, and it looks like Datasets have undergone significant changes between 2.3 and 2.6, which breaks the data-processing parts of my project.
This leaves me with two options to explore:

  1. Change the data processing module to be compatible with 2.3.
  2. Try cloud storage, since in future releases syncing of checkpoints will be deprecated when the storage directory is not cloud/NFS (https://github.com/ray-project/ray/issues/37177); a rough sketch is given after this list.

Either way, I will make a point to update here on how it goes. Thanks again!
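A minimal sketch of what option 2 might look like, assuming a hypothetical S3 bucket (my-bucket) and the Ray 2.6-style API:

    from ray import air, tune

    # Hypothetical: point storage_path at cloud storage so checkpoints are synced
    # to a location every node can reach, instead of a local Windows path.
    tuner = tune.Tuner(
        "SAC",
        run_config=air.RunConfig(
            storage_path="s3://my-bucket/ray_results",
            name="SAC_experiment",  # hypothetical experiment name
        ),
    )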

Thanks for raising this and following up. This is indeed a bug, and it should be fixed here: [train/tune] Use posix paths throughout library code by krfricke · Pull Request #38319 · ray-project/ray · GitHub

The fix will be included in Ray 2.7.

As a workaround, you should be able to set the storage_path to a relative directory, which will not trigger the buggy code path:

from ray import air, tune

tuner = tune.Tuner(
    train_fn,
    run_config=air.RunConfig(storage_path="./")
)
tuner.fit()

Hi Kai, thank you for looking into this. Do you suggest we wait for Ray 2.7? The storage_path setting seems to give the same error again. I am using ray[air]==2.6.1.