I can run single-node distributed training in the PyTorch toy example.
However, in our distributed training setup we call init_process_group ourselves, and it seems Ray handles this part as well, leading to a double initialization of the process group. The error I get is: RuntimeError: trying to initialize the default process group twice!
Does Tune allow us to hook into init_process_group ourselves somehow?
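In the meantime, one workaround I'm considering on our side (a sketch of our own setup code, not a Tune API) is to guard our call so it only runs when no default process group exists yet:

from torch import distributed as dist

def maybe_init_process_group(backend="nccl"):
    # Skip initialization if Ray (or anything else) has already created the
    # default process group; dist.is_initialized() is the standard check.
    if not dist.is_initialized():
        # Relies on the usual env:// variables (MASTER_ADDR, MASTER_PORT,
        # RANK, WORLD_SIZE) being set by whoever launches the processes.
        dist.init_process_group(backend=backend)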
Yes, it seems I can let Ray manage init_process_group. Using dist from:
from torch import distributed as dist
I can get hold of the rank of each process and assign it the appropriate device. Even dist.get_world_size() correctly returns 8. However, as soon as I try to place a tensor on a particular device I get:
(pid=151214) RuntimeError: CUDA error: invalid device ordinal
Most likely the process Ray Tune spawns (is it default_worker.py?) somehow isn't making the CUDA devices visible to itself, or it is exposing the "wrong" subset of devices. Any advice?
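For reference, here is roughly how I currently pick the device inside the training function (a simplified sketch; pick_device is just an illustrative helper). I suspect indexing GPUs by global rank breaks when the worker process only sees a subset of the devices:

import torch
from torch import distributed as dist

def pick_device():
    rank = dist.get_rank()
    visible = torch.cuda.device_count()
    # If Ray restricts CUDA_VISIBLE_DEVICES per worker process, each worker may
    # only see one GPU, so cuda:{rank} for rank > 0 raises "invalid device ordinal".
    # Indexing modulo the visible device count (or just using cuda:0) avoids that.
    return torch.device(f"cuda:{rank % visible}" if visible > 0 else "cpu")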
Can you provide some additional background on what you are trying to do and how you want to use Ray for it? Is it distributed training, distributed tuning, or distributed training with distributed tuning?
I am doing single-node distributed language model pre-training. Now I'd like to do hyper-parameter tuning for that pre-training. Following the mnist ddp example, I have adapted our script. For starters, I would like to do single-node distributed training with sequential tuning trials and the Median Stopping Rule. Later on I'd experiment with other schedulers, but for now I'm simply trying to replicate our existing setup for single-node distributed language model pre-training.
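To be concrete, the driver side of my adaptation looks roughly like this (a sketch based on the mnist ddp example; train_lm is a placeholder for our pre-training function, and the exact keyword arguments of DistributedTrainableCreator may differ between Ray versions, so please check the docs for your release):

from ray import tune
from ray.tune.schedulers import MedianStoppingRule
from ray.tune.integration.torch import DistributedTrainableCreator

# train_lm(config, checkpoint_dir=None) is a placeholder for our pre-training function.
trainable = DistributedTrainableCreator(
    train_lm,
    num_workers=8,           # one process per GPU on the node
    num_gpus_per_worker=1,
)

analysis = tune.run(
    trainable,
    config={"lr": tune.loguniform(1e-5, 1e-3)},   # example search space
    num_samples=5,
    scheduler=MedianStoppingRule(metric="loss", mode="min"),
)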
This actually worked, and training now starts on all processes. However, it soon stops on a distributed.barrier call and eventually times out with a stack trace. Our training code is rather custom and relies on an all_reduce and barrier synchronization approach. Investigating further, I discovered that Ray uses the gloo torch.distributed backend by default, and barrier is not supported on GPU tensors with gloo - see the supported feature matrix here. Can we switch Ray to use nccl?
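From a quick look at the API it seems DistributedTrainableCreator accepts a backend argument, so something like the following might be enough (untested on my side; the keyword name may vary across Ray versions):

trainable = DistributedTrainableCreator(
    train_lm,                 # placeholder for our training function
    num_workers=8,
    num_gpus_per_worker=1,
    backend="nccl",           # replace the default gloo backend with NCCL
)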
I am finally able to run the distributed single-machine LM pre-training tuning process, except for the most important part - reporting the metrics back. As soon as I introduce the tune.report(metrics) invocation, and I invoke it only in the rank 0 process, the report call blocks indefinitely. If I remove just those two lines of code, everything works as expected - the trial runs finish without a hiccup, they run sequentially as I want them to, and all of the trial runs complete OK as well. Of course, without any reporting, the final summary looks like this:
== Status ==
Memory usage on this node: 25.7/440.7 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/92 CPUs, 0/8 GPUs, 0.0/298.28 GiB heap, 0.0/131.83 GiB objects (0.0/1.0 accelerator_type:V100)
Result logdir: /home/vblagoje/ray_results/WrappedDistributedTorchTrainable_2021-09-07_03-42-10
Number of trials: 5/5 (5 TERMINATED)
+----------------------------------------------+------------+-------+
| Trial name | status | loc |
|----------------------------------------------+------------+-------|
| WrappedDistributedTorchTrainable_411cd_00000 | TERMINATED | |
| WrappedDistributedTorchTrainable_411cd_00001 | TERMINATED | |
| WrappedDistributedTorchTrainable_411cd_00002 | TERMINATED | |
| WrappedDistributedTorchTrainable_411cd_00003 | TERMINATED | |
| WrappedDistributedTorchTrainable_411cd_00004 | TERMINATED | |
+----------------------------------------------+------------+-------+
2021-09-07 04:17:32,372 INFO tune.py:561 -- Total run time: 2122.22 seconds (2122.02 seconds for the tuning loop).
2021-09-07 04:17:32,380 WARNING experiment_analysis.py:645 -- Could not find best trial. Did you pass the correct `metric` parameter?
Hey @vblagoje, the reason for this behavior is that, under the hood, each iteration waits for all processes to report metrics via tune.report before continuing training (though ultimately only the metrics from the rank 0 worker process will be propagated up) - you can think of this as a way to ensure that all processes stay synchronized.
Would you be able to invoke tune.report on all workers?
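Roughly, the per-worker training function would look like this (a sketch; train_one_epoch and the loss metric name are just placeholders):

from ray import tune

def train_lm(config, checkpoint_dir=None):
    for epoch in range(config.get("epochs", 3)):
        loss = train_one_epoch(config)   # placeholder for the actual training step
        # Every rank calls tune.report; Tune waits for all workers each iteration
        # and only propagates the rank 0 metrics to the driver.
        tune.report(loss=float(loss))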