Checkpoints not appearing when tune run through bash script

The bash script runs train.py with input args for config file and results directory.

When the process is run in an interactive slurm session, checkpoints appear in the results directory and output appears in the logfile. However, when the same command is run through a bash script, only the final weights folder is created (it is never updated, and the process seems to run endlessly when it should finish in about 3 hours), and the results directory remains empty (no checkpoints appear). There is also no logging output except an initial print of the config settings.

What I have tried:

  • changing the permissions of the bash script
  • piping output explicitly to another log file
  • explicitly defining all required and optional args in the bash script command
  • changing the temp directory for ray init to the results directory where I expect temporary checkpoints
  • adding nohup to the bash script command in case the process was quitting after logout
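For reference, the temp-directory change was made roughly like this (a minimal sketch; the path is a placeholder, not my actual results directory):

```python
import ray

# Placeholder for the real results directory passed in via the script args.
RESULTS_DIR = "/scratch/myuser/results"

# Point Ray's session/temp directory at the results directory so that
# temporary checkpoints land somewhere visible. In ray 1.13 this is the
# underscore-prefixed _temp_dir argument to ray.init().
ray.init(_temp_dir=RESULTS_DIR)
```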

To summarize: the process writes logging output and intermediate checkpoints when run via an interactive slurm session. However, when the same process is run via a shell script, there are no logging updates, no intermediate checkpoints, and no final (best) weights.

ray 1.13.0
torch 1.10.2
python 3.8.5

Please let me know if more info is needed.

Hi @nis, which train.py file are you referring to?

Generally this sounds like a setup issue on the slurm/console side. We don’t do anything in Ray or Ray Tune to special case these things.

One thing I’d look into is which user the script is executed as. Ray Tune defaults to ~/ray_results as the experiment directory, and that path resolves differently per user. You can change this to an absolute path with the local_dir argument for tune.run()/air.RunConfig.
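For example, a minimal sketch of pinning the experiment directory to an absolute path (the trainable, experiment name, and path here are placeholders, not taken from your train.py):

```python
from ray import tune


def trainable(config):
    # Placeholder training function standing in for whatever train.py runs.
    ...


# In ray 1.13, tune.run accepts local_dir. An absolute path avoids
# ~/ray_results being resolved against whichever user/home the bash
# script happens to run under.
analysis = tune.run(
    trainable,
    local_dir="/absolute/path/to/results",  # placeholder path
    name="my_experiment",
)
```

With an absolute local_dir, the checkpoint location no longer depends on the environment the script inherits from slurm.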