For domestic GPUs such as the Sugon DCU K100, when the Ray cluster is started through k8s, the Ray head's autoscaler cannot automatically recognize the device as a GPU, so resources='{"DCU": 1}' is configured on the Ray cluster. When running training tests through the Ray Train interface with resources_per_worker={"NPU": 1}, how should backend and use_gpu in TorchConfig be configured?
It seems likely that for non-NVIDIA accelerators like the Sugon DCU (scheduled through a custom resource such as "DCU"), Ray's autoscaler and Ray Train do not natively recognize them as GPUs. The official Ray documentation and code only mention "GPU", "TPU", and "HPU" as supported accelerator types, and the use_gpu flag is specifically for NVIDIA GPUs. There is no explicit mention in the data sources of support for "DCU" or "NPU" as accelerator types in Ray Train or TorchConfig. Therefore, setting use_gpu=True would not have the intended effect for DCUs/NPUs, and backend selection (e.g., NCCL, Gloo) is only documented for PyTorch with NVIDIA GPUs or CPUs.
My understanding is that you should not set use_gpu=True in ScalingConfig or TorchConfig for DCUs/NPUs, and you should not expect Ray Train to automatically handle device placement or backend selection for these accelerators. You may need to manage device selection and the distributed backend manually in your training function, and use resources_per_worker={"DCU": 1} to schedule the Ray actors onto nodes with DCU resources. There is no data source indicating that Ray Train or TorchConfig supports a backend argument for DCUs/NPUs, nor that use_gpu can be used for non-NVIDIA devices. Would you like more detail on how to manually handle device placement and distributed training in this scenario?
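For example, here is a minimal sketch of what that manual approach might look like. It assumes (not confirmed by the Ray docs) that the DCU build of PyTorch exposes the devices through the usual torch.cuda API, uses "gloo" as a safe backend fallback, and picks an arbitrary worker count:

```python
# Minimal sketch: Ray Train schedules workers onto DCU nodes via the custom
# "DCU" resource, while device selection and DDP setup are handled manually
# in the training function. torch.cuda usage below is an assumption about the
# DCU-enabled PyTorch build, not something documented by Ray.
import torch
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel

import ray.train
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, TorchConfig


def train_func():
    local_rank = ray.train.get_context().get_local_rank()

    # Manual device selection by local rank, since Ray does not set any
    # visibility environment variables for a custom "DCU" resource.
    if torch.cuda.is_available():
        device = torch.device(f"cuda:{local_rank}")
        torch.cuda.set_device(device)
    else:
        device = torch.device("cpu")

    model = nn.Linear(10, 1).to(device)
    # Ray Train has already initialized the torch.distributed process group
    # with the backend chosen in TorchConfig, so DDP can be set up directly.
    model = DistributedDataParallel(
        model, device_ids=[local_rank] if device.type == "cuda" else None
    )
    # ... training loop ...


trainer = TorchTrainer(
    train_func,
    # "gloo" is the conservative fallback; "nccl" may work if the DCU software
    # stack ships an RCCL-compatible library (an assumption).
    torch_config=TorchConfig(backend="gloo"),
    scaling_config=ScalingConfig(
        num_workers=2,                     # hypothetical worker count
        use_gpu=False,                     # DCUs are not visible to Ray as "GPU"
        resources_per_worker={"DCU": 1},   # schedule onto nodes advertising DCUs
    ),
)
result = trainer.fit()
```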
The DCU should also fall under the ROCm category, similar to AMD GPUs. Can Ray Train be adapted to such devices?
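If that is the case, one speculative workaround (an assumption on my side, not something confirmed by the Ray documentation) would be to advertise the DCUs to Ray as ordinary GPU resources, e.g. via num-gpus in the k8s rayStartParams, so that the standard use_gpu=True path applies and the "nccl" backend resolves to RCCL on the ROCm-like stack. Roughly:

```python
# Speculative sketch: assumes the DCUs have been manually advertised to Ray as
# "GPU" resources (e.g. num-gpus in the k8s rayStartParams) and that the DCU
# PyTorch build routes torch.cuda and the "nccl" backend through RCCL/ROCm.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, TorchConfig


def train_func():
    # With use_gpu=True, Ray Train assigns one "GPU" (here: a DCU) per worker,
    # so the usual torch.cuda device handling would be expected to apply.
    ...


trainer = TorchTrainer(
    train_func,
    torch_config=TorchConfig(backend="nccl"),  # expected to resolve to RCCL
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
```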