I’m getting random PermissionError (Errno 13) failures when tuning my model.
The error occurs when Ray tries to save the experiment state to experiment_state-2024-09-19_09-44-55…json (for example; the name obviously changes from run to run) in the temp folder.
A few hypotheses I have already eliminated:
- Antivirus getting paranoid (disabling it changes nothing)
- A potential problem with tensorboard accessing the files
- A lack of space on the hard drive (500 GB of free space should be more than enough)
Now, the randomness of the error (it happens roughly once every 5 or 6 experiments) leads me to believe there is a race condition involved somewhere. The question being: am I the one who introduced it, or is there a bug somewhere in Ray?
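To try to catch the racer in the act, something like the following could list which processes hold a handle on the state file at the moment the error fires. This is only a sketch I haven’t wired into Ray: who_has_open is a hypothetical helper of mine, it relies on psutil, and on Windows open_files() can miss handles without admin rights.

```python
import psutil

def who_has_open(target_path: str) -> list[str]:
    """List processes currently holding an open handle on target_path."""
    holders = []
    for proc in psutil.process_iter(["pid", "name"]):
        try:
            for f in proc.open_files():
                if f.path == target_path:
                    holders.append(f"{proc.info['name']} (pid={proc.info['pid']})")
        except (psutil.AccessDenied, psutil.NoSuchProcess):
            continue  # processes we are not allowed to inspect
    return holders
```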
I have “solved” the problem with a (very, very) ugly workaround in the tune_controller.py code: around line 342, where the error occurs, I wrapped the write in a while loop so it retries until it succeeds … and it works:
```python
is_exception = True
while is_exception:
    is_exception = False
    try:
        with open(
            Path(driver_staging_path, self.experiment_state_file_name),
            "w",
        ) as f:
            json.dump(runner_state, f, cls=TuneFunctionEncoder)
    except Exception:
        # Keep retrying until the write goes through.
        is_exception = True
```
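If I have to keep a retry, a slightly less brutal variant would at least bound the attempts and only swallow the transient Errno 13; this is just a sketch (save_with_retries is a name I made up, nothing from Ray’s API):

```python
import errno
import json
import time
from pathlib import Path

def save_with_retries(path: Path, runner_state, encoder, attempts: int = 5) -> None:
    """Retry the state dump a few times, but only for transient EACCES."""
    for attempt in range(attempts):
        try:
            with open(path, "w") as f:
                json.dump(runner_state, f, cls=encoder)
            return
        except PermissionError as e:
            # Re-raise anything that is not Errno 13, and the final failure.
            if e.errno != errno.EACCES or attempt == attempts - 1:
                raise
            time.sleep(0.2 * (attempt + 1))  # give the other handle time to close
```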
I’m running on a Windows local cluster with ray[tune] 2.36.0, and here is how I launch my training:
```python
trainable_with_resources = tune.with_resources(
    lambda ray_parameters: _do_train(
        model_clazz,
        train_dataset,
        dev_dataset,
        label_encoder,
        ray_parameters,
    ),
    {"cpu": 10, "gpu": 1},
)

tuner = Tuner(
    trainable_with_resources,
    tune_config=TuneConfig(
        num_samples=training_description.num_samples,
        scheduler=training_description.scheduler,
        search_alg=training_description.search_algo,
        trial_dirname_creator=_create_trial_dirname,
        max_concurrent_trials=1,
    ),
    run_config=RunConfig(
        storage_path=str(ray_run_path),
        checkpoint_config=CheckpointConfig(
            num_to_keep=1,
            checkpoint_at_end=False,
            checkpoint_score_attribute=training_description.checkpoint_metric,
            checkpoint_score_order=training_description.checkpoint_mode,
        ),
    ),
    param_space=param_space,
)

result: ResultGrid = tuner.fit()
```
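For what it’s worth, one thing I haven’t tried yet is moving Ray’s session/temp directory out of the default %TEMP% location, which Windows indexers and AV tools tend to watch; _temp_dir is a real ray.init() parameter, and the path below is just an example:

```python
import ray

# Point Ray's session files at a directory excluded from scanners/indexers
# (hypothetical path; adjust to your machine).
ray.init(_temp_dir=r"D:\ray_tmp")
```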
So obviously I’m very unhappy with the only workaround I found, and I’d rather understand what’s actually going on there ^^’
Thanks in advance to anyone who can give me a clue about what I’m doing wrong … or not!