- Can Ray automatically transfer tensors between CPU memory and GPU memory, without explicitly calling `tensor.cuda()`?
- How does Ray do cross-GPU tensor communication? Can I use NCCL for high-performance training? (A rough sketch of the kind of NCCL setup I have in mind is below.)
- Can I split a model into multiple parts, have Ray schedule the parts onto different GPUs, and train the model automatically?
- I followed the example in Parameter Server — Ray v1.4.1 and added the `@ray.remote(num_gpus=1)` decorator to the server and the workers, but I found the throughput to be quite low (a condensed sketch of my modification is also below). How can I use Ray to do high-performance training of DL models (like AlexNet) with PyTorch?
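
To make the last question concrete, here is a condensed sketch of roughly what I did to the parameter server example. The tiny `Net` model, the fake batch, and the hyperparameters are simplified stand-ins for the ConvNet/MNIST setup in the docs; the point is the `@ray.remote(num_gpus=1)` decorators and the explicit `.cuda()` / `.cpu()` transfers I currently write by hand.

```python
import ray
import torch
import torch.nn as nn
import torch.nn.functional as F

ray.init()

class Net(nn.Module):
    """Tiny stand-in for the ConvNet used in the docs example."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 2)

    def forward(self, x):
        return self.fc(x)

@ray.remote(num_gpus=1)            # <- the decorator I added
class ParameterServer:
    def __init__(self, lr):
        self.model = Net().cuda()  # explicit CPU -> GPU transfer
        self.optimizer = torch.optim.SGD(self.model.parameters(), lr=lr)

    def apply_gradients(self, *gradients):
        # Gradients arrive as CPU numpy arrays; I move them to GPU by hand.
        summed = [torch.from_numpy(sum(g)).cuda() for g in zip(*gradients)]
        for p, g in zip(self.model.parameters(), summed):
            p.grad = g
        self.optimizer.step()
        self.optimizer.zero_grad()
        return self.get_weights()

    def get_weights(self):
        # Weights go back to CPU before being shipped to the workers.
        return {k: v.cpu() for k, v in self.model.state_dict().items()}

@ray.remote(num_gpus=1)            # <- and here
class DataWorker:
    def __init__(self):
        self.model = Net().cuda()

    def compute_gradients(self, weights):
        self.model.load_state_dict({k: v.cuda() for k, v in weights.items()})
        self.model.zero_grad()
        x = torch.randn(32, 10).cuda()          # fake batch instead of MNIST
        y = torch.randint(0, 2, (32,)).cuda()
        F.cross_entropy(self.model(x), y).backward()
        return [p.grad.cpu().numpy() for p in self.model.parameters()]

ps = ParameterServer.remote(lr=0.01)
workers = [DataWorker.remote() for _ in range(2)]   # needs 3 GPUs in total
weights = ps.get_weights.remote()
for _ in range(3):                 # a few synchronous steps, as in the docs
    grads = [w.compute_gradients.remote(weights) for w in workers]
    weights = ps.apply_gradients.remote(*grads)
ray.get(weights)
```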
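
And to clarify what I mean by NCCL-based cross-GPU communication in the second question: the rough setup I have in mind is each GPU actor joining a `torch.distributed` NCCL process group and all-reducing gradients directly between GPUs, instead of staging them through the object store. The address, port, and world size below are placeholder values for illustration.

```python
import ray
import torch
import torch.distributed as dist

ray.init()

@ray.remote(num_gpus=1)
class TrainerActor:
    def __init__(self, rank, world_size, master_addr, master_port):
        # Each actor owns one GPU (Ray sets CUDA_VISIBLE_DEVICES) and joins
        # an NCCL process group so tensors move GPU-to-GPU directly.
        dist.init_process_group(
            backend="nccl",
            init_method=f"tcp://{master_addr}:{master_port}",
            rank=rank,
            world_size=world_size,
        )
        self.device = torch.device("cuda:0")   # the single GPU Ray assigned

    def allreduce_step(self):
        # Toy all-reduce standing in for a gradient-synchronization step.
        t = torch.ones(4, device=self.device)
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        return t.cpu().numpy()

world_size = 2                                  # placeholder: 2 GPUs
actors = [
    TrainerActor.remote(rank, world_size, "127.0.0.1", 29500)
    for rank in range(world_size)
]
print(ray.get([a.allreduce_step.remote() for a in actors]))
```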
Thanks a lot!
Got a reply in Question about ray DL training (e.g. alexnet with pytorch) · Issue #16979 · ray-project/ray · GitHub