For domestic GPUs such as the Sugon DCU K100, when the Ray cluster is started through k8s, the Ray head's autoscaler cannot automatically recognize the device as a GPU, so resources='{"DCU": 1}' is configured on the Ray cluster. When running training tests through the Ray Train interface with resources_per_worker={"NPU": 1}, how should backend and use_gpu in TorchConfig be configured?
It seems likely that for non-NVIDIA accelerators like the Sugon DCU (scheduled through a custom resource such as "DCU"), Ray's autoscaler and Ray Train do not natively recognize them as GPUs. The official Ray documentation and code only mention "GPU", "TPU", and "HPU" as supported accelerator types, and the use_gpu flag is specifically for NVIDIA GPUs. There is no explicit mention in the data sources of support for "DCU" or "NPU" as accelerator types in Ray Train or TorchConfig. Therefore, setting use_gpu=True would not have the intended effect for DCUs/NPUs, and backend selection (e.g., NCCL, Gloo) is only documented for PyTorch with NVIDIA GPUs or CPUs.
My understanding is that you should not set use_gpu=True in ScalingConfig or TorchConfig for DCUs/NPUs, and you should not expect Ray Train to automatically handle device placement or backend selection for these accelerators. You may need to manage device selection and the distributed backend manually in your training function, and use resources_per_worker={"DCU": 1} to schedule the Ray actors onto nodes with DCU resources. There is no data source indicating that Ray Train or TorchConfig supports a backend argument for DCUs/NPUs, nor that use_gpu can be used for non-NVIDIA devices. Would you like more detail on how to manually handle device placement and distributed training in this scenario?
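For example, here is a minimal sketch of what that manual approach might look like. It assumes (not confirmed by the Ray docs) that the DCU build of PyTorch exposes the devices through the usual torch.cuda API, uses "gloo" as a safe backend fallback, and picks an arbitrary worker count:

```python
# Minimal sketch: Ray Train schedules workers onto DCU nodes via the custom
# "DCU" resource, while device selection and DDP setup are handled manually
# in the training function. torch.cuda usage below is an assumption about the
# DCU-enabled PyTorch build, not something documented by Ray.
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, TorchConfig


def train_func():
    local_rank = ray.train.get_context().get_local_rank()

    # Manual device selection by local rank, since Ray does not set any
    # visibility environment variables for a custom "DCU" resource.
    if torch.cuda.is_available():
        device = torch.device(f"cuda:{local_rank}")
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    model = nn.Linear(10, 1).to(device)
    # Ray Train has already initialized the torch.distributed process group
    # with the backend chosen in TorchConfig, so DDP can be set up directly.
    model = DistributedDataParallel(
        model, device_ids=[local_rank] if device.type == "cuda" else None
    )
    # ... training loop ...


trainer = TorchTrainer(
    train_func,
    # "gloo" is the conservative fallback; "nccl" may work if the DCU software
    # stack ships an RCCL-compatible library (an assumption).
    torch_config=TorchConfig(backend="gloo"),
    scaling_config=ScalingConfig(
        num_workers=2,                     # hypothetical worker count
        use_gpu=False,                     # DCUs are not visible to Ray as "GPU"
        resources_per_worker={"DCU": 1},   # schedule onto nodes advertising DCUs
    ),
)
result = trainer.fit()
```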
The DCU should also fall under the ROCm category, similar to AMD GPUs. Can Ray Train be adapted to such devices?
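If that is the case, one speculative workaround (an assumption on my side, not something confirmed by the Ray documentation) would be to advertise the DCUs to Ray as ordinary GPU resources, e.g. via num-gpus in the k8s rayStartParams, so that the standard use_gpu=True path applies and the "nccl" backend resolves to RCCL on the ROCm-like stack. Roughly:

```python
# Speculative sketch: assumes the DCUs have been manually advertised to Ray as
# "GPU" resources (e.g. num-gpus in the k8s rayStartParams) and that the DCU
# PyTorch build routes torch.cuda and the "nccl" backend through RCCL/ROCm.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, TorchConfig


def train_func():
    # With use_gpu=True, Ray Train assigns one "GPU" (here: a DCU) per worker,
    # so the usual torch.cuda device handling would be expected to apply.
    ...


trainer = TorchTrainer(
    train_func,
    torch_config=TorchConfig(backend="nccl"),  # expected to resolve to RCCL
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
```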