- Can Ray automatically transfer tensors between CPU memory and GPU memory, without me doing it explicitly?
- How does Ray do cross-GPU tensor communication? Can I use NCCL for high-performance training?
- Can I split a model into multiple parts, have Ray schedule the parts onto different GPUs, and train the model automatically?
- I followed the example in Parameter Server — Ray v1.4.1 and added the `@ray.remote(num_gpus=1)` decorator to the server and the workers, but I found the throughput was quite low. How can I use Ray to do high-performance training of DL models (like AlexNet) with PyTorch?
Thanks a lot!