Random permission error while checkpointing

I’m getting intermittent PermissionError (Errno 13) errors when tuning my model.

The error occurs when Ray tries to save the experiment state to experiment_state-2024-09-19_09-44-55…json in the temp folder (the file name obviously changes from run to run).

A few hypotheses I have already eliminated:

  • Antivirus getting paranoid (disabling it changes nothing)
  • A potential problem with TensorBoard accessing the files
  • A lack of space on the hard drive (500 GB free should be more than enough)

The randomness of the error (it happens roughly once every 5 or 6 experiments) leads me to believe there is a race condition involved somewhere. The question is: am I the one who introduced it, or is there a bug somewhere in Ray?

I have “solved” the problem with a (very, very) ugly workaround in tune_controller.py: around line 342, where the error occurs, I wrapped the write in a while loop that retries until it succeeds… and it works :sweat:

# Retry the experiment-state dump until it succeeds (around line 342 of tune_controller.py)
is_exception = True
while is_exception:
    is_exception = False
    try:
        with open(
            Path(driver_staging_path, self.experiment_state_file_name),
            "w",
        ) as f:
            json.dump(runner_state, f, cls=TuneFunctionEncoder)
    except Exception:
        is_exception = True

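For what it’s worth, here is a sketch of a slightly less scary version of the same retry idea, with a bounded number of attempts and a short pause instead of a potentially infinite loop. The retry count and delay are arbitrary values I picked; driver_staging_path, runner_state and TuneFunctionEncoder are the names already in scope at that spot in tune_controller.py:

import time  # json and Path are already imported in tune_controller.py

# Sketch only: retry the experiment-state dump a bounded number of times,
# with a short pause between attempts, instead of looping forever.
# max_retries and delay are arbitrary values.
max_retries = 5
delay = 0.5  # seconds between attempts

for attempt in range(max_retries):
    try:
        with open(
            Path(driver_staging_path, self.experiment_state_file_name),
            "w",
        ) as f:
            json.dump(runner_state, f, cls=TuneFunctionEncoder)
        break  # write succeeded, stop retrying
    except PermissionError:
        if attempt == max_retries - 1:
            raise  # give up and surface the original error
        time.sleep(delay)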
I’m running on a Windows local cluster with ray[tune] 2.36.0, and here is how I launch the training:


from ray import tune
from ray.train import CheckpointConfig, RunConfig
from ray.tune import ResultGrid, TuneConfig, Tuner

trainable_with_resources = tune.with_resources(
    lambda ray_parameters: _do_train(
        model_clazz,
        train_dataset,
        dev_dataset,
        label_encoder,
        ray_parameters,
    ),
    {"cpu": 10, "gpu": 1},
)

tuner = Tuner(
    trainable_with_resources,
    tune_config=TuneConfig(
        num_samples=training_description.num_samples,
        scheduler=training_description.scheduler,
        search_alg=training_description.search_algo,
        trial_dirname_creator=_create_trial_dirname,
        max_concurrent_trials=1,
    ),
    run_config=RunConfig(
        storage_path=str(ray_run_path),
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_at_end=False,
            checkpoint_score_attribute=training_description.checkpoint_metric,
            checkpoint_score_order=training_description.checkpoint_mode,
        ),
    ),
    param_space=param_space,
)

result: ResultGrid = tuner.fit()

Obviously, I’m very unhappy with this workaround, and I’d rather understand what’s actually going on ^^’

Thanks in advance to anyone who can give me a clue about what I’m doing wrong… or not!