I’m using WANDB to sweep the following:
The sweep inevitably picks parameter sets that crash or freeze the process: workers die, CUDA runs out of memory, and so on. Ideally, the sweep would log each failure and then try a different parameter set. Instead, a frozen run may go on forever.
Is there a native way to run this tuning so that it catches such crashes and moves on? Or should I implement something like a forked Ray process with a timeout to manage run time myself?
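To make the question concrete, this is roughly the kind of wrapper I have in mind. It is only a minimal sketch using the standard library, not anything W&B provides: `run_trial` is a hypothetical helper, and it assumes each trial can be launched as its own command (for example a training script that takes the sampled parameters as arguments).

```python
import subprocess
import sys


def run_trial(cmd, timeout_s):
    """Run one trial as a child process and classify how it ended.

    `cmd` is the command line for a single trial (hypothetical; e.g. a
    training script plus the parameters sampled for this run).
    """
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # if it has not finished after timeout_s seconds.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "timeout"
    # A non-zero exit code covers ordinary crashes such as an
    # uncaught out-of-memory error in the training script.
    return "ok" if result.returncode == 0 else "crashed"


if __name__ == "__main__":
    # Stand-in for a frozen trial: sleeps far longer than the timeout.
    frozen = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_trial(frozen, timeout_s=2))
```

The idea would be to call something like this from the function the sweep runs for each parameter set and log the returned status, so a crash or hang is recorded as a failed run instead of blocking everything. One caveat I am aware of: the timeout only kills the direct child, so any worker processes it spawned might need separate cleanup.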