Runtime Minimization Sweeps

Priority: Medium

I’m using a Weights & Biases (wandb) sweep to minimize runtime over the following parameters:

  1. num_workers
  2. num_gpus
  3. num_envs_per_worker

The sweep inevitably picks parameter combinations that crash or freeze the process: sometimes workers die, sometimes CUDA runs out of memory, and so on. Ideally, the sweep would log each failure and then try a different parameter set. In practice, a frozen run never exits, so it can go on forever and the sweep never moves on.

Is there a native way to tune this that catches such crashes and hangs and moves on to the next parameter set? Or should I implement something like a forked Ray process that enforces a time limit on each trial?
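
To make the question concrete, here is a minimal sketch of the fallback I have in mind, using only the Python standard library rather than any wandb or Ray API. It assumes each trial can be launched as its own command; `run_trial`, `sweep`, and `make_cmd` are hypothetical names, and the timeout value is arbitrary. Each trial runs in a child process, so a crash shows up as a non-zero exit code and a freeze is cut off by the timeout.

```python
import subprocess
import sys
import time


def run_trial(cmd, timeout_s):
    """Run one trial command in a child process and classify the outcome.

    Returns a dict with status "ok", "crashed", or "timeout". The call
    returns shortly after timeout_s at the latest, because
    subprocess.run kills the child when the timeout expires.
    """
    start = time.perf_counter()
    try:
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The trial froze or was simply too slow; record it and move on.
        return {"status": "timeout", "runtime_s": None}
    elapsed = time.perf_counter() - start
    if proc.returncode != 0:
        # Non-zero exit: an uncaught exception, a CUDA OOM, a dead worker, etc.
        return {"status": "crashed", "runtime_s": None}
    return {"status": "ok", "runtime_s": elapsed}


def sweep(configs, make_cmd, timeout_s):
    """Try every config in turn; failures are logged, not fatal.

    make_cmd is a hypothetical callback that turns a config dict into
    the command line for one trial.
    """
    results = []
    for config in configs:
        outcome = run_trial(make_cmd(config), timeout_s)
        results.append({**config, **outcome})
    finished = [r for r in results if r["status"] == "ok"]
    best = min(finished, key=lambda r: r["runtime_s"]) if finished else None
    return best, results


if __name__ == "__main__":
    # Stand-in trials: one succeeds, one crashes, one hangs.
    demo = {
        "good": "pass",
        "crash": "import sys; sys.exit(3)",
        "hang": "import time; time.sleep(60)",
    }
    configs = [{"name": name} for name in demo]
    best, results = sweep(
        configs,
        make_cmd=lambda c: [sys.executable, "-c", demo[c["name"]]],
        timeout_s=2,
    )
    for r in results:
        print(r["name"], r["status"])
```

In a real sweep, `make_cmd` would build the training command from num_workers, num_gpus, and num_envs_per_worker. One caveat I am aware of: the timeout only kills the direct child, so any worker processes that child started may need their own cleanup.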