Error Resuming Trials with Multiple GPUs per Trial using PyTorch Lightning

When resuming, the trial fails to load its checkpoint and fails with:

2022-10-31 21:07:39,962 WARNING trial_runner.py:879 -- Trial Runner checkpointing failed: [Errno 2] No such file or directory: '/scratch/user/sage-multiple-gpu--6010/.tmp_checkpoint' -> '/scratch/user/sage-multiple-gpu-6010/experiment_state-2022-10-31_21-07-04.json'
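
One detail in the warning may be worth checking: the source and destination of the rename appear to sit in different directories (`sage-multiple-gpu--6010` vs `sage-multiple-gpu-6010`). This could be a real mismatch in how the experiment name or log directory is built, or just an artifact of how the log line was pasted. A minimal sketch that extracts the two paths from the message and compares their parent directories (the `warning` string is copied from the log above; nothing here calls Ray):

```python
import posixpath
import re

# Warning text copied from the log line above.
warning = (
    "[Errno 2] No such file or directory: "
    "'/scratch/user/sage-multiple-gpu--6010/.tmp_checkpoint' -> "
    "'/scratch/user/sage-multiple-gpu-6010/experiment_state-2022-10-31_21-07-04.json'"
)

# Pull out the two quoted paths (source and destination of the rename).
src, dst = re.findall(r"'([^']+)'", warning)

src_dir = posixpath.dirname(src)
dst_dir = posixpath.dirname(dst)

print("source dir:     ", src_dir)
print("destination dir:", dst_dir)
print("same directory: ", src_dir == dst_dir)
```

If the directories really do differ on disk, comparing `args.name` and `args.log_dir` between the original run and the resumed run would be a reasonable next step.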

Here’s how I start the tuner:

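    from ray import tune
    from ray.tune.schedulers import FIFOScheduler

    # Resources requested for each trial, taken from the command-line arguments
    resources_per_trial = {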
        "cpu": args.cpus,
        "gpu": args.gpus,
    }
    # trainable = tune.with_parameters(main, callbacks=callbacks)
    # Note that you can use any scheduler here as a superclass
    class ModifiedFIFO(FIFOScheduler):
        def on_trial_add(self, trial_runner, trial):
            trial.config["trial_name"] = str(trial)
            trial.config["trial_id"] = str(trial.trial_id)
            trial.config["custom_dirname"] = str(trial.custom_dirname)
            super().on_trial_add(trial_runner, trial)

    tune.run(
        main,
        name=args.name,
        local_dir=args.log_dir,
        config=config,
        resources_per_trial=resources_per_trial,
        resume=args.resume,
        verbose=1,
        scheduler=ModifiedFIFO(),
    )

And I run the script with:

srun python --gpus 4 --cpus 24 --other --args --for --training
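
For context, the `args` object in the snippet comes from the command-line flags. A minimal sketch of a parser that would supply the fields used above follows. Only `--gpus` and `--cpus` appear in the command; the `--name`, `--log-dir`, and `--resume` flags, their types, and all defaults are assumptions inferred from the attribute names in the `tune.run` call, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser providing the fields read by the tuner snippet."""
    parser = argparse.ArgumentParser(description="Launch a Ray Tune run")
    # Flags that appear in the srun command.
    parser.add_argument("--cpus", type=int, default=1, help="CPUs per trial")
    parser.add_argument("--gpus", type=int, default=0, help="GPUs per trial")
    # Assumed flags, inferred from args.name / args.log_dir / args.resume.
    parser.add_argument("--name", type=str, default="experiment")
    parser.add_argument("--log-dir", dest="log_dir", type=str, default="./ray_results")
    parser.add_argument("--resume", action="store_true")
    return parser


# Parse the two flags shown in the command above.
args = build_parser().parse_args(["--gpus", "4", "--cpus", "24"])

resources_per_trial = {"cpu": args.cpus, "gpu": args.gpus}
print(resources_per_trial)
```

With these values each trial requests all 4 GPUs and 24 CPUs, so only one trial can run at a time on a single 4-GPU allocation.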

Any thoughts? Thank you.