[SGD][Tune] Do I need to specify cuda device id in TrainOperator

I’m using RaySGD on multiple GPUs. I’m wondering, in training_operator.py, do I need to set the device id?

        if self.use_gpu:
            features = [
                feature.cuda(non_blocking=True) for feature in features
            ]
            target = target.cuda(non_blocking=True)

No, usually you shouldn’t need to set this - are you trying to use multiple GPUs per model replica?

Okay, no.

How does Ray tell which GPU to copy to?
Also, I ran into an error when I call dist.barrier() during the training loop:

 Default process group is not initialized

Isn’t the process group initialized by RaySGD?

Ray automatically sets the CUDA_VISIBLE_DEVICES environment variable so that each process/worker has a unique visible GPU.
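For example (a minimal sketch, not taken from the thread), inside a worker only one device ordinal is visible, so a plain .cuda() call already lands on the GPU Ray assigned to that worker:

    import os
    import torch

    # Inside a RaySGD worker, Ray has already restricted visibility, e.g. "2".
    print(os.environ.get("CUDA_VISIBLE_DEVICES"))

    # Because only one GPU is visible to this process, cuda:0 *is* the
    # assigned GPU, so no explicit device id is needed.
    features = torch.zeros(8, 16).cuda(non_blocking=True)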

Where are you calling dist.barrier()? Do you have a full stack trace?


I called dist.barrier() on the driver side. I can probably reproduce this with the RaySGD example; to simulate what I did, based on this example: ray/cifar_pytorch_example.py at 35ec91c4e04c67adc7123aa8461cf50923a316b4 · ray-project/ray · GitHub

    for i in pbar:
        info = {"num_steps": 1} if args.smoke_test else {}
        info["epoch_idx"] = i
        info["num_epochs"] = args.num_epochs
        # Increase `max_retries` to turn on fault tolerance.
        trainer1.train(max_retries=1, info=info)
        dist.barrier()
        # do something.
        val_stats = trainer1.validate()

Ah ok, the process group is not initialized on the driver process, only on the workers.

If you need the process group initialized on the driver, then you can pass in use_local=True to your TorchTrainer. This will make the rank 0 worker run on the driver process.
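A minimal sketch of what that could look like (the operator class name is hypothetical, and the constructor arguments assume the training_operator_cls-style RaySGD TorchTrainer API):

    from ray.util.sgd import TorchTrainer

    trainer1 = TorchTrainer(
        training_operator_cls=MyTrainingOperator,  # hypothetical TrainingOperator subclass
        num_workers=2,
        use_gpu=True,
        use_local=True,  # the rank 0 worker runs on the driver process
    )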


Cool, that makes sense! Thanks.

Btw, does running a worker on the driver side have any side effects on performance, reliability, etc.?

I don’t think it should have any impact on reliability or performance. But please let us know if you see any such effects.

Thanks.

Even after turning on use_local, I still get the same error:

    dist.barrier() 
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 1423, in barrier
    _check_default_pg()
  File "/usr/local/lib/python3.7/dist-packages/torch/distributed/distributed_c10d.py", line 192, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

Hmm, interesting. Where in the code are you calling dist.barrier()? This is after TorchTrainer instantiation, correct? Also, you are using more than 1 worker, right?

I call it after trainer.train(), and so far I’ve only been using 1 worker.

Ok yeah, no process group is created if you only use 1 worker. If you use more than 1 worker and also have use_local set, then a process group will be initialized on the driver.
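One way to make the driver-side barrier robust to that, a sketch using only public torch.distributed calls, is to guard it:

    import torch.distributed as dist

    # Only synchronize if a default process group actually exists
    # (i.e. more than 1 worker and use_local=True on the driver).
    if dist.is_available() and dist.is_initialized():
        dist.barrier()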
