Priority: Medium
I’m using Weights & Biases (wandb) sweeps to tune the following parameters:
- num_workers
- num_gpus
- num_envs_per_worker
The sweep inevitably picks parameter combinations that crash or freeze the process: sometimes workers die, sometimes CUDA runs out of memory, and so on. Ideally, the sweep would log these failures and move on to a different parameter set. Instead, a frozen run can hang indefinitely, and the sweep never advances.
Is there a native way to run this kind of tuning that catches such crashes and moves on? Or should I implement my own forked Ray process that enforces a time limit on each trial?
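To make the desired behavior concrete, here is a minimal sketch of the fallback I have in mind, using only the Python standard library rather than any wandb or Ray API. Each trial runs in its own process with a hard time limit, so a crash or a freeze is recorded and the loop continues. The `train.py` script name, the `--key=value` flag format, and the helper names `run_trial` and `sweep` are all hypothetical placeholders.

```python
import subprocess
import sys


def run_trial(cmd, timeout_s):
    """Run one trial command in its own process; return (status, detail)."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising.
        # Processes the trial itself spawned (e.g. workers) may need
        # separate cleanup; that is not handled in this sketch.
        return "timeout", f"no result after {timeout_s}s"
    if proc.returncode != 0:
        # Non-zero exit: keep the tail of stderr for the failure log.
        return "crashed", proc.stderr.strip()[-500:]
    return "ok", proc.stdout.strip()


def sweep(configs, timeout_s=600, script="train.py"):
    """Try each config in turn, recording failures instead of stopping."""
    results = []
    for config in configs:
        # Hypothetical CLI: python train.py --num_workers=4 ...
        cmd = [sys.executable, script]
        cmd += [f"--{key}={value}" for key, value in config.items()]
        status, detail = run_trial(cmd, timeout_s)
        results.append((config, status, detail))
    return results
```

Calling `sweep([{"num_workers": 4, "num_gpus": 1, "num_envs_per_worker": 8}])` would return one `(config, status, detail)` tuple per parameter set, with `status` being `"ok"`, `"crashed"`, or `"timeout"`. I would rather not maintain something like this myself if a built-in mechanism already exists.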
Best,
Aidan