Hello,
I’m using Ray TorchTrainer() to train my model on a multi-node cluster setup. I’m getting some errors, and for debugging I want to set TORCH_DISTRIBUTED_DEBUG=INFO and also somehow pass find_unused_parameters to PyTorch DDP. Is this possible when I use TorchTrainer()? If not, is there any other way I can debug my code?
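For context, this is roughly what I’m trying to do. The env var part via runtime_env is just my guess at how it might work, and I’m not sure whether parallel_strategy_kwargs is the right place for the DDP argument:

```python
import ray
import ray.train.torch
import torch.nn as nn
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    model = nn.Linear(10, 1)  # placeholder for my actual model

    # Would this forward find_unused_parameters to DistributedDataParallel?
    model = ray.train.torch.prepare_model(
        model,
        parallel_strategy_kwargs={"find_unused_parameters": True},
    )
    # ... rest of my training loop ...


# My guess: set the env var through the runtime environment so that
# every Ray worker process sees TORCH_DISTRIBUTED_DEBUG.
ray.init(runtime_env={"env_vars": {"TORCH_DISTRIBUTED_DEBUG": "INFO"}})

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```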
Thanks,