Distributed training in PyTorch and init_process_group

Hey guys,

I can run single-node distributed training using the PyTorch toy example.

However, in our distributed training setup we call init_process_group ourselves, and it seems Ray handles this as well, leading to a double initialization of the process group. The error I get is:
RuntimeError: trying to initialize the default process group twice!

Does Tune provide a way for us to hook init_process_group ourselves somehow?
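
For context, our script currently initializes the process group itself, roughly like this (a simplified sketch, not our exact code):

import torch.distributed as dist

def setup_distributed(rank, world_size):
    # Rendezvous env vars (MASTER_ADDR/MASTER_PORT) are set elsewhere and omitted here.
    # This is the call that collides with the one Ray makes under the hood.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)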

Cheers,
Vladimir

If you want to call init_process_group yourself, perhaps it’s more reasonable to use Ray actors manually?

BTW, any reason why you wouldn’t want Ray to manage the init_process_group?
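
Roughly, the manual-actor approach could look something like this (just a sketch; the actor layout, backend, addresses, and ports are placeholders):

import os
import ray
import torch.distributed as dist

ray.init()

@ray.remote(num_gpus=1)
class TrainingWorker:
    def setup(self, rank, world_size, master_addr, master_port):
        # Each actor initializes the process group itself, so you keep full control.
        os.environ["MASTER_ADDR"] = master_addr
        os.environ["MASTER_PORT"] = str(master_port)
        dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

    def train(self):
        ...  # your existing training loop goes here

workers = [TrainingWorker.remote() for _ in range(8)]
ray.get([w.setup.remote(i, 8, "127.0.0.1", 29500) for i, w in enumerate(workers)])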

Richard, good points. I’ll investigate over the weekend and report back on Monday.
Cheers,
Vladimir

Richard,

Yes, it seems I can let Ray manage init_process_group. Using dist from:

from torch import distributed as dist

I can get hold of the rank of each process and assign it an appropriate device. Even dist.get_world_size() correctly returns 8. However, as soon as I try to place a tensor on a specific device, I get:

(pid=151214) RuntimeError: CUDA error: invalid device ordinal
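
Roughly, the device placement that triggers this looks like the following (simplified sketch; the tensor is just an example):

import torch
from torch import distributed as dist

rank = dist.get_rank()
# This is the line that fails: cuda:<rank> may not exist inside this worker's
# CUDA_VISIBLE_DEVICES subset, hence the "invalid device ordinal" error.
device = torch.device(f"cuda:{rank}")
x = torch.zeros(1).to(device)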

Most likely the spawned Ray Tune process (is it default_worker.py?) is somehow not making the CUDA devices visible to itself, or it is exposing the “wrong” subset of devices? Any advice?

Cheers,
Vladimir

Hey @vblagoje,

Can you provide some additional background info on the overall logic you are trying to perform and use Ray for? Is it distributed training, distributed tuning, or distributed training with distributed tuning?

Hey @matthewdeng,

I am doing single-node distributed language model pre-training. Now, I’d like to do hyper-parameter tuning for pre-training. Following the mnist ddp example, I have adapted our script. For starters, I would like to do single-node distributed training using sequential tuning attempts with the Median Stopping Rule. Later on, I’d experiment with other schedulers, but for now I’m simply trying to replicate our setup for single-node distributed language model pre-training.
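
Roughly, the tuning setup I’m aiming for looks like this (a sketch; the trainable, metric name, and search space are just placeholders):

from ray import tune
from ray.tune.schedulers import MedianStoppingRule

# "trainable" stands in for the distributed trainable wrapping our pre-training function.
analysis = tune.run(
    trainable,
    config={"lr": tune.loguniform(1e-5, 1e-3)},
    scheduler=MedianStoppingRule(metric="loss", mode="min"),
    num_samples=5,
)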

Thanks in advance,
Vladimir

Could you share some more details around the need to place a tensor on a specific device, and what your training function code looks like?

To further debug, can you try adding the following to the start of your training function and sharing the results (for the different workers)?

import os

print(os.environ.pop("CUDA_VISIBLE_DEVICES", None))

Since this line removes the set values, it may also enable the workers to access the “invalid” devices.

Hey Matthew,

This actually worked, and now the training starts on all processes. However, it soon stops at the distributed.barrier call and eventually times out with a stack trace. Our training code is rather custom and uses an all_reduce and barrier synchronization approach. Investigating further, I discovered that Ray uses the gloo torch.distributed backend by default, and barrier is not supported on GPUs with gloo. See the supported feature matrix here. Can we switch Ray to use nccl?

Best,
Vladimir

RTFM, thank you guys!
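
For anyone who hits the same thing: if I’m reading the docs correctly, the backend can be chosen when creating the distributed trainable, roughly like this (a sketch; exact argument names may differ between Ray versions):

from ray.tune.integration.torch import DistributedTrainableCreator

trainable = DistributedTrainableCreator(
    train_func,        # placeholder name for the training function
    num_workers=8,
    backend="nccl",    # NCCL supports barrier/all_reduce on GPU; gloo is the default
)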

Hey @matthewdeng and @rliaw,

I am finally able to run the distributed single-machine LM pre-training tuning process, except for the most important part: reporting back the metrics. As soon as I introduce the tune.report(metrics) invocation, which I invoke only for the process with rank 0, the report call blocks indefinitely. If I remove just those two lines of code, everything works as expected: the trial runs finish without a hiccup, they are sequential as I want them to be, and all of the trial runs complete as well. Of course, the final report then looks like this:

== Status ==
Memory usage on this node: 25.7/440.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/92 CPUs, 0/8 GPUs, 0.0/298.28 GiB heap, 0.0/131.83 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /home/vblagoje/ray_results/WrappedDistributedTorchTrainable_2021-09-07_03-42-10
Number of trials: 5/5 (5 TERMINATED)
+----------------------------------------------+------------+-------+
| Trial name                                   | status     | loc   |
|----------------------------------------------+------------+-------|
| WrappedDistributedTorchTrainable_411cd_00000 | TERMINATED |       |
| WrappedDistributedTorchTrainable_411cd_00001 | TERMINATED |       |
| WrappedDistributedTorchTrainable_411cd_00002 | TERMINATED |       |
| WrappedDistributedTorchTrainable_411cd_00003 | TERMINATED |       |
| WrappedDistributedTorchTrainable_411cd_00004 | TERMINATED |       |
+----------------------------------------------+------------+-------+


2021-09-07 04:17:32,372	INFO tune.py:561 -- Total run time: 2122.22 seconds (2122.02 seconds for the tuning loop).
2021-09-07 04:17:32,380	WARNING experiment_analysis.py:645 -- Could not find best trial. Did you pass the correct `metric` parameter?

Hey @vblagoje, the reason for this behavior is that under the hood each iteration waits for all processes to report metrics via tune.report before continuing training (though ultimately only the metrics from the worker process with rank 0 are propagated up). You can think of this as a way to ensure that all processes stay synchronized.

Would you be able to invoke tune.report on all workers?
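
In other words, something along these lines inside the training function, executed by every worker (a sketch; train_one_epoch is a placeholder for your actual training step):

from ray import tune

def train_func(config):
    for epoch in range(config["epochs"]):
        loss = train_one_epoch()  # placeholder for the actual training step
        # Every rank calls tune.report, not just rank 0: Tune uses the call as a
        # synchronization point, and only rank 0's metrics are propagated up.
        tune.report(loss=loss)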

Hey @matthewdeng, yes I can. Now everything works as expected. Thank you!