Runtime Minimization Sweeps

Priority: Medium

I’m using WANDB to sweep the following:

  1. num_workers
  2. num_gpus
  3. num_envs_per_worker

The sweep inevitably chooses parameters that crash/freeze the process. Sometimes, workers die, CUDA runs out of memory, etc. Ideally, the sweep logs these failures and then tries a different parameter set. However, the run may go on forever.

Is there a native way to tune this that catches and moves on from such crashes? Is there a fancy forked ray process for temporal management that I should implement?

Best,

Aidan

Hi @Aidan_McLaughlin , this is not implemented and not easy to do since it goes against our goal of being resilient. For example, if a worker dies, most users would want to retry instead of having the whole trial crash.

If I where you, I’d see if the trial in question reports any timesteps. If not (after, say, 10 minutes), kill it and assume that the configuration is not valid.