On-premise cluster: different worker node types

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

Hello,

I have an on-premise Ray cluster (i.e. provider.type: local). I’d like to declare different worker node types in the cluster (e.g. “CPU only” nodes and “GPU” nodes). I tried adding an available_node_types section to the cluster configuration as follows.

# Rest excluded for brevity

provider:
    type: local
    head_ip: 10.1.0.1
    worker_ips: [10.1.0.2, 10.1.0.3, 10.1.0.4]

# Rest excluded for brevity

available_node_types:
    head_node:
        min_workers: 0
        max_workers: 0
        resources: {"CPU": 2}
    cpu_node:
        min_workers: 1
        max_workers: 1
        resources: {"CPU": 6}
    gpu_node:
        min_workers: 2
        max_workers: 2
        resources: {"CPU": 6, "GPU": 1}
head_node_type: head_node

# Rest excluded for brevity

When I ran ray up to start the cluster, I got the following error:

The field available_node_types is not supported for on-premise clusters.

Is there a way to declare different node types on on-premise clusters?

The workaround I’m considering at the moment is to create a separate Ray cluster for each worker node type and have something external to Ray schedule workloads to the correct cluster.
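
For concreteness, here is a rough sketch of that workaround; the cluster addresses are placeholders, and the external piece would simply pick the cluster that matches the workload type and submit through the Ray Client.

# Sketch of the one-cluster-per-node-type workaround. The addresses are
# placeholders; each would point at the head node of a separate Ray cluster.
import ray

CLUSTER_ADDRESSES = {
    "cpu": "ray://cpu-head.example:10001",  # head of the CPU-only cluster (placeholder)
    "gpu": "ray://gpu-head.example:10001",  # head of the GPU cluster (placeholder)
}

def submit(workload_type, fn, *args):
    """Connect to the cluster matching the workload type and run fn there."""
    ray.init(address=CLUSTER_ADDRESSES[workload_type])
    try:
        result = ray.get(ray.remote(fn).remote(*args))
    finally:
        ray.shutdown()
    return result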

Thanks

available_node_types is not supported for static on-prem clusters.
Could you explain why you need it?
You should be able to just specify the worker_ips.

Ray should detect node resources correctly; if that’s not the case, then we need an interface for per-node resource overrides.
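
If you want to double-check what Ray actually detected on the static cluster, attaching a driver and printing the resource totals is usually enough, e.g. something like:

# Attach to the already-running on-prem cluster and print what Ray detected.
import ray

ray.init(address="auto")            # connect to the existing cluster
print(ray.cluster_resources())      # total detected resources (CPU, GPU, memory, ...)
print(ray.available_resources())    # resources currently free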

Hi Dmitri, and sorry for the late reply.

I have two types of workload that I’m planning to run on the Ray cluster:

  1. “CPU only” workload: CPU only, long running; high latency (i.e. the time from when a job is submitted until it starts executing) is acceptable.
  2. “GPU” workload: runs on a GPU, short running; low latency is required.

There are two types of nodes that I can use as cluster nodes:

  1. CPU only nodes
  2. Nodes that have GPUs

Without a way to force the “CPU only” workload onto the “CPU only” nodes, there is a risk of ending up in the following unwanted situation (a sketch of the scheduling I’d like to express follows the list):

  • Several “CPU only” jobs (workload type 1) might end up occupying all of the cluster nodes, including the “GPU” ones.
  • New “GPU” jobs would then experience very high latency, since the whole cluster (including the GPU workers) is occupied by the “CPU only” workload.
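
To make the intent concrete, this is roughly what I’d like to express if node types (or equivalent per-node custom resources) were available. The custom resource name cpu_only_node is hypothetical and would have to be attached to the CPU-only workers somehow.

# Sketch only: "cpu_only_node" is a hypothetical custom resource that would
# need to be attached to the CPU-only workers; num_gpus=1 already restricts
# the GPU workload to nodes that actually have a GPU.
import ray

@ray.remote(resources={"cpu_only_node": 1})
def long_cpu_job(data):
    # long-running, latency-tolerant CPU work
    ...

@ray.remote(num_gpus=1)
def short_gpu_job(batch):
    # short, latency-sensitive GPU work
    ...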