Random permission error while checkpointing

I’m getting intermittent PermissionError (Errno 13) errors when tuning my model.

The error occurs when Ray tries to save the experiment state to experiment_state-2024-09-19_09-44-55…json in the temp folder (the file name obviously changes from run to run).

A few hypotheses I have already eliminated:

  • Antivirus getting paranoid (disabling it changes nothing)
  • A potential problem with TensorBoard accessing the files
  • A lack of space on the hard drive (500 GB free should be more than enough)

The randomness of the error (it happens roughly once every 5 or 6 experiments) leads me to believe there is a race condition involved somewhere. The question is: am I the one who introduced it, or is there a bug somewhere in Ray?

I have “solved” the problem with a (very, very) ugly workaround in tune_controller.py: around line 342, where the error occurs, I wrapped the write in a while loop that retries until it succeeds… and it works :sweat:

# Retry the experiment-state dump until it succeeds (around line 342 of tune_controller.py)
is_exception = True
while is_exception:
    is_exception = False
    try:
        with open(
            Path(driver_staging_path, self.experiment_state_file_name),
            "w",
        ) as f:
            json.dump(runner_state, f, cls=TuneFunctionEncoder)
    except Exception:
        is_exception = True

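For what it’s worth, here is a sketch of a slightly less scary version of the same retry idea, with a bounded number of attempts and a short pause instead of a potentially infinite loop. The retry count and delay are arbitrary values I picked; driver_staging_path, runner_state and TuneFunctionEncoder are the names already in scope at that spot in tune_controller.py:

import time  # json and Path are already imported in tune_controller.py

# Sketch only: retry the experiment-state dump a bounded number of times,
# with a short pause between attempts, instead of looping forever.
# max_retries and delay are arbitrary values.
max_retries = 5
delay = 0.5  # seconds between attempts

for attempt in range(max_retries):
    try:
        with open(
            Path(driver_staging_path, self.experiment_state_file_name),
            "w",
        ) as f:
            json.dump(runner_state, f, cls=TuneFunctionEncoder)
        break  # write succeeded, stop retrying
    except PermissionError:
        if attempt == max_retries - 1:
            raise  # give up and surface the original error
        time.sleep(delay)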
I’m running on a Windows local cluster with ray[tune] 2.36.0, and here is how I launch the training:


from ray import tune
from ray.train import CheckpointConfig, RunConfig
from ray.tune import ResultGrid, TuneConfig, Tuner

trainable_with_resources = tune.with_resources(
    lambda ray_parameters: _do_train(
        model_clazz,
        train_dataset,
        dev_dataset,
        label_encoder,
        ray_parameters,
    ),
    {"cpu": 10, "gpu": 1},
)

tuner = Tuner(
    trainable_with_resources,
    tune_config=TuneConfig(
        num_samples=training_description.num_samples,
        scheduler=training_description.scheduler,
        search_alg=training_description.search_algo,
        trial_dirname_creator=_create_trial_dirname,
        max_concurrent_trials=1,
    ),
    run_config=RunConfig(
        storage_path=str(ray_run_path),
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_at_end=False,
            checkpoint_score_attribute=training_description.checkpoint_metric,
            checkpoint_score_order=training_description.checkpoint_mode,
        ),
    ),
    param_space=param_space,
)

result: ResultGrid = tuner.fit()

Obviously, I’m very unhappy with this workaround, and I’d rather understand what’s actually going on ^^’

Thanks in advance to anyone who can give me a clue about what I’m doing wrong… or not!