The trial fails to load the checkpoint and fails with:
2022-10-31 21:07:39,962 WARNING trial_runner.py:879 -- Trial Runner checkpointing failed: [Errno 2] No such file or directory: '/scratch/user/sage-multiple-gpu--6010/.tmp_checkpoint' -> '/scratch/user/sage-multiple-gpu-6010/experiment_state-2022-10-31_21-07-04.json'
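One thing that stands out in the warning as pasted: the source path contains `sage-multiple-gpu--6010` (two hyphens) while the destination contains `sage-multiple-gpu-6010` (one hyphen). Here is a minimal sketch, assuming the checkpoint step boils down to an `os.replace` of a temp file into the experiment directory (an assumption, not confirmed against Ray's source). It shows that a rename whose source directory does not exist raises an error of the same shape. The directory names and the `try_replace` helper are hypothetical and only for illustration.

```python
import os
import tempfile


def try_replace(src, dst):
    """Attempt os.replace and return the error text, or None on success."""
    try:
        os.replace(src, dst)
        return None
    except FileNotFoundError as exc:
        return str(exc)


with tempfile.TemporaryDirectory() as root:
    # Hypothetical names: one directory exists, the double-hyphen one is never created.
    exp_dir = os.path.join(root, "exp-6010")
    typo_dir = os.path.join(root, "exp--6010")
    os.makedirs(exp_dir)

    src = os.path.join(typo_dir, ".tmp_checkpoint")
    dst = os.path.join(exp_dir, "experiment_state.json")

    # The source file is missing, so the rename fails and reports both paths.
    message = try_replace(src, dst)
    print(message)
```

If the two directory names really do differ on disk, checking where the name passed to `tune.run` is built would be a reasonable first step.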
Here’s how I start the tuner:
resources_per_trial = {
    "cpu": args.cpus,
    "gpu": args.gpus,
}

# trainable = tune.with_parameters(main, callbacks=callbacks)

# Note that you can use any scheduler here as a superclass
class ModifiedFIFO(FIFOScheduler):
    def on_trial_add(self, trial_runner, trial):
        trial.config["trial_name"] = str(trial)
        trial.config["trial_id"] = str(trial.trial_id)
        trial.config["custom_dirname"] = str(trial.custom_dirname)
        super().on_trial_add(trial_runner, trial)

tune.run(
    main,
    name=args.name,
    local_dir=args.log_dir,
    config=config,
    resources_per_trial=resources_per_trial,
    resume=args.resume,
    verbose=1,
    scheduler=ModifiedFIFO(),
)
And I run the script with:
srun python --gpus 4 --cpus 24 --other --args --for --training
Any thoughts? Thank you.