Distributed training in PyTorch and init_process_group

vblagoje · August 27, 2021, 9:09am

Hey guys,

I can run single-node distributed training in the PyTorch toy example.

However, our own distributed training setup calls init_process_group itself, and Ray appears to call it as well, so the default process group gets initialized twice. The error I get is:
RuntimeError: trying to initialize the default process group twice!

Does Tune offer a way for us to hook in our own init_process_group call?

Cheers,
Vladimir
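
A minimal sketch of one way to avoid the double initialization, assuming the training script should work both standalone and when a framework has already created the default process group. The helper name ensure_process_group is hypothetical. It relies only on torch.distributed.is_initialized() and torch.distributed.init_process_group(); the module is passed in as an argument so the helper can be exercised without PyTorch installed.

```python
def ensure_process_group(dist, **init_kwargs):
    """Initialize the default process group only if it does not exist yet.

    `dist` is expected to be `torch.distributed` (hypothetical helper: the
    module is injected so this function has no hard PyTorch dependency).
    Returns True if this call performed the initialization, False otherwise.
    """
    if dist.is_initialized():
        # Someone else (for example the launching framework) already did it.
        return False
    dist.init_process_group(**init_kwargs)
    return True


# Intended usage inside a training script (not executed here):
#   from torch import distributed as dist
#   ensure_process_group(dist, backend="nccl")  # plus any rank/world_size/init_method you need
```

With this guard, a second call becomes a no-op instead of raising the "initialize the default process group twice" error.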

rliaw · August 27, 2021, 4:25pm

If you want to call init_process_group yourself, perhaps it’s more reasonable to use Ray actors manually?

rliaw · August 27, 2021, 4:26pm

BTW, any reason why you wouldn’t want Ray to manage the init_process_group?

vblagoje · August 27, 2021, 5:07pm

Richard, good points. I’ll investigate over the weekend and report back on Monday.
Cheers,
Vladimir

vblagoje · August 30, 2021, 8:10am

Richard,

Yes, it seems I can let Ray manage init_process_group. Using dist from:

from torch import distributed as dist

I can get the rank of each process and assign it the appropriate device. Even dist.get_world_size() correctly returns 8. However, as soon as I try to place a tensor on a specific device I get:

(pid=151214) RuntimeError: CUDA error: invalid device ordinal

Most likely the process spawned by Ray Tune (is it default_worker.py?) either does not make the CUDA devices visible to itself or exposes only a subset of “wrong” devices. Any advice?

Cheers,
Vladimir
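
One plausible reading of this error, assuming each worker process was started with CUDA_VISIBLE_DEVICES restricted to the GPU(s) assigned to it (the thread does not confirm the actual values): CUDA renumbers the visible devices starting from 0, so a worker that sees only one GPU must use ordinal 0, and using the global rank as the device index fails for every rank above 0. The sketch below shows the mapping; visible_gpu_ids and local_ordinal are hypothetical helper names, and the modulo mapping is an illustrative choice.

```python
import os


def visible_gpu_ids(env=None):
    """Return the ids listed in CUDA_VISIBLE_DEVICES, or None if it is unset."""
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES")
    if raw is None:
        return None
    return [tok.strip() for tok in raw.split(",") if tok.strip()]


def local_ordinal(rank, env=None):
    """Map a global rank to a device ordinal that is valid in this process."""
    ids = visible_gpu_ids(env)
    if ids is None:
        # No restriction set: assume ordinals match ranks on a single node.
        return rank
    if not ids:
        raise RuntimeError("no CUDA devices are visible to this process")
    # Ordinals inside the process always run from 0 to len(ids) - 1.
    return rank % len(ids)


# Intended usage (not executed here):
#   device = torch.device("cuda", local_ordinal(dist.get_rank()))
```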

matthewdeng · August 30, 2021, 8:09pm

Hey @vblagoje,

Can you provide some additional background info on the overall logic you are trying to perform and use Ray for? Is it distributed training, distributed tuning, or distributed training with distributed tuning?

vblagoje · August 31, 2021, 1:26pm

Hey @matthewdeng,

I am doing single-node distributed language model pre-training. Now I’d like to do hyper-parameter tuning for pre-training. Following the MNIST DDP example, I have adapted our script. For starters, I would like to run single-node distributed training with sequential tuning trials and the Median Stopping Rule. Later on I’d experiment with other schedulers, but for now I am simply trying to replicate our setup for single-node distributed language model pre-training.

Thanks in advance,
Vladimir
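
For readers unfamiliar with the Median Stopping Rule mentioned above, here is a rough sketch of the idea as it is commonly described: a trial is stopped at step t when its best result so far is worse than the median of the other trials' running averages over their first t results. The function name, the argument layout, and the "lower is better" default are assumptions for illustration. This is not Ray's scheduler, which has further options that the sketch does not model.

```python
from statistics import median


def should_stop(trial_history, other_histories, mode="min"):
    """Decide whether a trial should stop under a simplified median rule.

    trial_history: metric values reported so far by the trial under test.
    other_histories: one list of metric values per other trial.
    mode: "min" if lower is better, "max" if higher is better.
    """
    step = len(trial_history)
    if step == 0:
        return False
    # Running average of every other trial over its first `step` results.
    averages = [sum(h[:step]) / step for h in other_histories if len(h) >= step]
    if not averages:
        return False  # nothing to compare against yet
    best = min(trial_history) if mode == "min" else max(trial_history)
    med = median(averages)
    return best > med if mode == "min" else best < med
```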

matthewdeng · August 31, 2021, 8:44pm

Could you share some more details around the need to place a tensor on a specific device, and what your training function code looks like?

To further debug, can you try adding the following line to the start of your training function and sharing the results (for the different workers)?

import os
print(os.environ.pop("CUDA_VISIBLE_DEVICES", None))

Because pop also removes the variable, this line may let the workers access the “invalid” devices as well.

vblagoje · September 3, 2021, 11:27am

Hey Matthew,

This actually worked, and now training starts on all processes. However, it soon hangs on a distributed.barrier call and eventually times out with a stack trace. Our training code is rather custom and synchronizes with all_reduce and barrier. Investigating further, I discovered that Ray uses the gloo torch.distributed backend by default, and gloo does not support barrier on GPUs. See the supported feature matrix here. Can we switch Ray to use nccl?

Best,
Vladimir
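
The backend choice above can be sketched as a small lookup. The table covers only the two collectives named in this post and reflects the PyTorch backend support matrix as understood at the time of the thread, so it should be checked against the documentation for your PyTorch version. SUPPORT and backends_for are hypothetical names.

```python
# (backend, device) -> collectives supported, limited to the ops in this thread.
# Assumption: transcribed from the torch.distributed backend table; verify it
# against the docs for the PyTorch version you actually run.
SUPPORT = {
    ("gloo", "cpu"): {"all_reduce", "barrier"},
    ("gloo", "gpu"): {"all_reduce"},
    ("nccl", "cpu"): set(),
    ("nccl", "gpu"): {"all_reduce", "barrier"},
}


def backends_for(device, needed_ops):
    """Return the backends that support every op in `needed_ops` on `device`."""
    needed = set(needed_ops)
    return sorted(
        backend
        for (backend, dev), ops in SUPPORT.items()
        if dev == device and needed <= ops
    )
```

For a GPU job that needs both all_reduce and barrier, only nccl qualifies under this table, which matches the timeout seen with gloo.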

vblagoje · September 3, 2021, 2:22pm

RTFM, thank you guys!

vblagoje · September 7, 2021, 11:24am

Hey @matthewdeng and @rliaw,

I am finally able to run the distributed single-machine LM pre-training tuning process, except for the most important part: reporting the metrics back. As soon as I add a tune.report(metrics) call, invoked only on the process with rank 0, the report call blocks indefinitely. If I remove just those two lines of code, everything works as expected: the trials finish without a hiccup, they run sequentially as I want, and all x trial runs complete fine. Of course, without reported metrics the final report looks like this:

== Status ==
Memory usage on this node: 25.7/440.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/92 CPUs, 0/8 GPUs, 0.0/298.28 GiB heap, 0.0/131.83 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /home/vblagoje/ray_results/WrappedDistributedTorchTrainable_2021-09-07_03-42-10
Number of trials: 5/5 (5 TERMINATED)
+----------------------------------------------+------------+-------+
| Trial name                                   | status     | loc   |
|----------------------------------------------+------------+-------|
| WrappedDistributedTorchTrainable_411cd_00000 | TERMINATED |       |
| WrappedDistributedTorchTrainable_411cd_00001 | TERMINATED |       |
| WrappedDistributedTorchTrainable_411cd_00002 | TERMINATED |       |
| WrappedDistributedTorchTrainable_411cd_00003 | TERMINATED |       |
| WrappedDistributedTorchTrainable_411cd_00004 | TERMINATED |       |
+----------------------------------------------+------------+-------+


2021-09-07 04:17:32,372	INFO tune.py:561 -- Total run time: 2122.22 seconds (2122.02 seconds for the tuning loop).
2021-09-07 04:17:32,380	WARNING experiment_analysis.py:645 -- Could not find best trial. Did you pass the correct `metric` parameter?

matthewdeng · September 7, 2021, 2:53pm

Hey @vblagoje, the reason for this behavior is that under the hood each iteration will wait for all processes to report metrics via tune.report before continuing training (though ultimately only the metrics from the worker process with rank 0 will be propagated up) - you can think of this as a way to ensure that all processes are synchronized.

Would you be able to invoke tune.report on all workers?
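
The synchronization described above can be modelled with a plain barrier. This is a toy model built on Python threads, not Ray's implementation: run_iteration and its parameters are hypothetical, and a short timeout stands in for the indefinite blocking seen in the real run.

```python
import threading


def run_iteration(world_size, reporting_ranks, timeout_s=0.5):
    """Simulate one iteration in which some ranks call a blocking 'report'.

    Every reporting rank waits at a barrier sized for all workers, so the
    barrier only releases when every rank reports.
    """
    barrier = threading.Barrier(world_size)
    outcome = {}

    def worker(rank):
        if rank not in reporting_ranks:
            return  # this rank skips the report call entirely
        try:
            barrier.wait(timeout=timeout_s)
            outcome[rank] = "reported"
        except threading.BrokenBarrierError:
            # Raised when the wait times out because other ranks never arrived.
            outcome[rank] = "timed out"

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome
```

When only rank 0 reports, it waits for peers that never arrive. In the real training function the corresponding change is to call tune.report on every rank rather than only on rank 0.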

vblagoje · September 7, 2021, 6:27pm

Hey @matthewdeng yes I can. Now everything works as expected. Thank you!
