Several questions about DL training (e.g. AlexNet with PyTorch)

  1. Can Ray automatically transfer tensors between CPU memory and GPU memory without an explicit call to tensor.cuda()?
  2. How does Ray do cross-GPU tensor communication? Can I use NCCL for high-performance training?
  3. Can I split a model into multiple parts and have Ray schedule the parts onto different GPUs and train the model automatically?
  4. I followed the "Parameter Server" example in the Ray v1.4.1 documentation and added the @ray.remote(num_gpus=1) decorator to the server and the workers, but I found that the throughput is quite low. How can I use Ray for high-performance training of DL models (like AlexNet) with PyTorch?

