Heterogeneous GPU distributed training / batch

Thank you for developing Ray!

I wanted to ask the following question about Ray Train: is there anything that would prevent the following scenarios?

  • Client-based distributed training across a K8s cluster with heterogeneous GPUs - some workers are running GTX 1080 Tis and others RTX Ampere GPUs (A4000/A5000). If this is not allowed, are there any instructions for limiting an interactive client session (for code development) to a single GPU compute capability (e.g. Ampere)?

  • Batch job submission across the same cluster of heterogeneous GPUs

I thought to ask first before spending time trying to make this work, so I would appreciate any pointers.

Thank you

How severely does this issue affect your experience of using Ray?

  • Low: It annoys or frustrates me for a moment.

Hey @pantelis,

some workers are running with GTX 1080Tis and some others with RTX Ampere GPUs (A4000/5000)

Just to double check, are the “workers” here referring to nodes or to the distributed training workers? If it’s the former, you should be able to define your cluster configuration with custom resources and request them in your Ray Train job.
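
In case it helps, here is a minimal sketch of the custom-resource approach. It assumes a hypothetical custom resource named `ampere_gpu` that you would define yourself when starting the Ampere nodes (e.g. `ray start --resources='{"ampere_gpu": 1}'`, or the equivalent `rayStartParams.resources` entry in a KubeRay worker group spec); requesting it per training worker should keep the job on Ampere nodes only:

```python
# Sketch only: "ampere_gpu" is a hypothetical custom resource you define when
# starting the Ampere nodes, e.g.
#   ray start --resources='{"ampere_gpu": 1}'
# or via rayStartParams.resources in the KubeRay worker group config.
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    # Placeholder for the real per-worker training code.
    pass


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(
        num_workers=4,
        use_gpu=True,
        # Each training worker requests one GPU plus one unit of the custom
        # resource, so workers are only scheduled on nodes that advertise
        # "ampere_gpu".
        resources_per_worker={"GPU": 1, "ampere_gpu": 1},
    ),
)
result = trainer.fit()
```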