[SGD][Tune] Do I need to specify cuda device id in TrainOperator

I’m using RaySGD on multiple GPUs. I’m wondering, in training_operator.py, do I need to set the device id?

        if self.use_gpu:
            features = [
                feature.cuda(non_blocking=True) for feature in features
            ]
            target = target.cuda(non_blocking=True)

No, usually you shouldn’t need to set this - are you trying to use multiple GPUs per model replica?

Okay, no.

How does Ray tell which GPU to copy to?
Also, I ran into an error when I call dist.barrier() during the training loop:

 Default process group is not initialized

Isn’t the process group initialized by RaySGD?

Ray automatically sets the CUDA_VISIBLE_DEVICES environment variable so that each process/worker has a unique visible GPU.
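For example (a minimal sketch, not taken from the thread), inside a worker only one device ordinal is visible, so a plain .cuda() call already lands on the GPU Ray assigned to that worker:

    import os
    import torch

    # Inside a RaySGD worker, Ray has already restricted visibility, e.g. "2".
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))

    # Because only one GPU is visible to this process, cuda:0 *is* the
    # assigned GPU, so no explicit device id is needed.
    features = torch.zeros(8, 16).cuda(non_blocking=True)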

Where are you calling dist.barrier()? Do you have a full stack trace?


I called dist.barrier() on the driver side. I can probably reproduce this with the RaySGD example; to simulate what I did, based on this example: ray/cifar_pytorch_example.py at 35ec91c4e04c67adc7123aa8461cf50923a316b4 · ray-project/ray · GitHub

    for i in pbar:
        info = {"num_steps": 1} if args.smoke_test else {}
        info["epoch_idx"] = i
        info["num_epochs"] = args.num_epochs
        # Increase `max_retries` to turn on fault tolerance.
        trainer1.train(max_retries=1, info=info)
        dist.barrier()
        # do something.
        val_stats = trainer1.validate()

Ah ok, the process group is not initialized on the driver process, only on the workers.

If you need the process group initialized on the driver, then you can pass in use_local=True to your TorchTrainer. This will make the rank 0 worker run on the driver process.
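A minimal sketch of what that could look like (the operator class name is hypothetical, and the constructor arguments assume the training_operator_cls-style RaySGD TorchTrainer API):

    from ray.util.sgd import TorchTrainer

    trainer1 = TorchTrainer(
        training_operator_cls=MyTrainingOperator,  # hypothetical TrainingOperator subclass
        num_workers=2,
        use_gpu=True,
        use_local=True,  # the rank 0 worker runs on the driver process
    )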


Cool, that makes sense! Thanks.

Btw, does running a worker on the driver side have any side effects on performance, reliability, etc.?

I don’t think it should have any impact on reliability or performance. But please let us know if you see any such effects.

Thanks.

Even after turning on use_local, I still get the same error:

    dist.barrier() 
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 1423, in barrier
    _check_default_pg()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 192, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

Hmm, interesting. Where in the code are you calling dist.barrier()? This is after TorchTrainer instantiation, correct? Also, you are using more than 1 worker, right?

I call it after trainer.train(), and so far I’ve only been using 1 worker.

Ok yeah, no process group is created if you only use 1 worker. If you use more than 1 worker and also have use_local set, then a process group will be initialized on the driver.
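One way to make the driver-side barrier robust to that, a sketch using only public torch.distributed calls, is to guard it:

    import torch.distributed as dist

    # Only synchronize if a default process group actually exists
    # (i.e. more than 1 worker and use_local=True on the driver).
    if dist.is_available() and dist.is_initialized():
        dist.barrier()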
