Hello,
I’m using Ray TorchTrainer() to train my model on a multi-node cluster setup. I’m getting some errors, and for debugging I want to set TORCH_DISTRIBUTED_DEBUG=INFO and also somehow pass find_unused_parameters to PyTorch DDP. Is this possible when I use TorchTrainer()? If not, is there any other way I can debug my code?
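For context, this is roughly what I’m trying to do. The env var part via runtime_env is just my guess at how it might work, and I’m not sure whether parallel_strategy_kwargs is the right place for the DDP argument:

```python
import ray
import ray.train.torch
import torch.nn as nn
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    model = nn.Linear(10, 1)  # placeholder for my actual model

    # Would this forward find_unused_parameters to DistributedDataParallel?
    model = ray.train.torch.prepare_model(
        model,
        parallel_strategy_kwargs={"find_unused_parameters": True},
    )
    # ... rest of my training loop ...


# My guess: set the env var through the runtime environment so that
# every Ray worker process sees TORCH_DISTRIBUTED_DEBUG.
ray.init(runtime_env={"env_vars": {"TORCH_DISTRIBUTED_DEBUG": "INFO"}})

trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```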
Thanks,