Runtime Minimization Sweeps

Aidan_McLaughlin · April 10, 2023, 3:37pm

Priority: Medium

I’m using WANDB to sweep the following:

num_workers
num_gpus
num_envs_per_worker

The sweep inevitably chooses parameters that crash or freeze the process: sometimes workers die, CUDA runs out of memory, etc. Ideally, the sweep would log these failures and then try a different parameter set. Instead, a failed run may hang forever.

Is there a native way to tune this that catches such crashes and moves on? Or should I implement a separate forked Ray process that enforces a time limit on each run?

Best,

Aidan

arturn · June 20, 2023, 7:23pm

Hi @Aidan_McLaughlin, this is not implemented and not easy to do, since it goes against our goal of being resilient. For example, if a worker dies, most users would want to retry instead of having the whole trial crash.

If I were you, I’d check whether the trial in question reports any timesteps. If not (after, say, 10 minutes), kill it and assume that the configuration is not valid.
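
A minimal sketch of that watchdog idea, using only the Python standard library rather than any Ray or W&B API. It assumes each trial runs as its own process and that the training script creates a marker file once it has reported its first timesteps; `run_trial_with_watchdog`, the marker file, and the status strings are all hypothetical names chosen for illustration.

```python
import subprocess
import sys
import tempfile
import time
from pathlib import Path


def run_trial_with_watchdog(cmd, progress_file, timeout_s=600.0, poll_s=1.0):
    """Run `cmd` as a child process and kill it if it never reports progress.

    Assumption: the trial creates `progress_file` after its first reported
    timesteps. Returns "ok", "crashed", or "no_progress".
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + timeout_s
    reported = False
    try:
        while True:
            # Once the marker exists, the deadline no longer applies.
            if not reported and Path(progress_file).exists():
                reported = True
            exit_code = proc.poll()
            if exit_code is not None:
                # The process ended on its own: a clean exit or a crash.
                return "ok" if exit_code == 0 else "crashed"
            if not reported and time.monotonic() > deadline:
                # Nothing reported in time: treat the configuration as invalid.
                proc.kill()
                proc.wait()
                return "no_progress"
            time.sleep(poll_s)
    finally:
        # Do not leave a stray process behind if the watchdog is interrupted.
        if proc.poll() is None:
            proc.kill()
            proc.wait()


if __name__ == "__main__":
    # Stand-in for a bad configuration: a process that hangs and never reports.
    with tempfile.TemporaryDirectory() as tmp:
        marker = Path(tmp) / "first_result"
        hanging = [sys.executable, "-c", "import time; time.sleep(60)"]
        print(run_trial_with_watchdog(hanging, marker, timeout_s=2.0, poll_s=0.2))
```

The sweep function could call this once per parameter set, log any status other than "ok" as a failed configuration, and move on to the next one. Use a fresh marker path for each trial so that a stale file from an earlier run does not mask a hang.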
