Checkpoints not appearing when tune run through bash script

The bash script runs train.py with input args for config file and results directory.

When the process is run in an interactive slurm session, checkpoints appear in the results directory and output appears in the logfile. However, when the same command is run through a bash script, only the final weights folder is created (it is never updated, and the process seems to run endlessly when it should finish in about 3 hours), and the results directory remains empty (no checkpoints appear). There is also no logging output except an initial print of the config settings.

What I have tried:

  • changing the permissions of the bash script
  • piping output explicitly to another log file
  • explicitly defining all required and optional args in the bash script command
  • changing the temp directory for ray init to the results directory where I expect temporary checkpoints
  • adding nohup to the bash script command in case the process was quitting after logout
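For reference, the temp-directory change was made roughly like this (a minimal sketch; the path is a placeholder, not my actual results directory):

```python
import ray

# Placeholder for the real results directory passed in via the script args.
RESULTS_DIR = "/scratch/myuser/results"

# Point Ray's session/temp directory at the results directory so that
# temporary checkpoints land somewhere visible. In ray 1.13 this is the
# underscore-prefixed _temp_dir argument to ray.init().
ray.init(_temp_dir=RESULTS_DIR)
```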

To summarize: the process writes logging output and intermediate checkpoints when run via an interactive slurm session. However, when the same process is run via a shell script, there are no logging updates, no intermediate checkpoints, and no final (best) weights.

ray 1.13.0
torch 1.10.2
python 3.8.5

Please let me know if more info is needed.

Hi @nis, which train.py file are you referring to?

Generally this sounds like a setup issue on the slurm/console side. We don’t do anything in Ray or Ray Tune to special case these things.

One thing I’d look into is which user the script is executed as. Ray Tune defaults to ~/ray_results as the experiment directory, and that path resolves differently per user. You can change this to an absolute path with the local_dir argument for tune.run()/air.RunConfig.
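For example, a minimal sketch of pinning the experiment directory to an absolute path (the trainable, experiment name, and path here are placeholders, not taken from your train.py):

```python
from ray import tune


def trainable(config):
    # Placeholder training function standing in for whatever train.py runs.
    ...


# In ray 1.13, tune.run accepts local_dir. An absolute path avoids
# ~/ray_results being resolved against whichever user/home the bash
# script happens to run under.
analysis = tune.run(
    trainable,
    local_dir="/absolute/path/to/results",  # placeholder path
    name="my_experiment",
)
```

With an absolute local_dir, the checkpoint location no longer depends on the environment the script inherits from slurm.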