How severely does this issue affect your experience of using Ray?
Medium: It contributes to significant difficulty in completing my task, but I can work around it.
Hello,
I have an on-premise Ray cluster (i.e. provider.type: local). I’d like to declare different worker node types in the cluster (e.g. “CPU-only nodes” and “GPU nodes”), so I tried adding an available_node_types section to the cluster configuration.
When I ran ray up to start the cluster, I got the following error:
The field available_node_types is not supported for on-premise clusters.
Is there a way to declare different node types on on-premise clusters?
The workaround I’m considering at the moment is to create a separate Ray cluster for each worker node type and have something external to Ray route each workload to the correct cluster, as in the sketch below.
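For concreteness, a minimal sketch of that external routing layer, assuming one cluster per node type. The head-node addresses and the needs_gpu flag are hypothetical placeholders, not anything Ray provides out of the box:

```python
import ray

# Hypothetical Ray Client addresses of the two single-node-type clusters.
CPU_CLUSTER = "ray://cpu-head.example.internal:10001"
GPU_CLUSTER = "ray://gpu-head.example.internal:10001"

def run_on_matching_cluster(fn, *args, needs_gpu=False):
    """Connect to whichever cluster fits the workload, run fn there, disconnect."""
    address = GPU_CLUSTER if needs_gpu else CPU_CLUSTER
    ray.init(address=address)
    try:
        task = ray.remote(num_gpus=1 if needs_gpu else 0)(fn)
        return ray.get(task.remote(*args))
    finally:
        ray.shutdown()
```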
We are running into a similar problem, except ours revolves around trying to launch multiple worker nodes on the same server in order to isolate resources.
Right now, some of the nodes in our cluster have 32 CPUs and 3 GPUs. If we let such a node be a single worker with 32 CPUs and 3 GPUs, then any task that needs 1 CPU and 1 GPU takes a worker process and starts a PyTorch process on one of the GPUs. As the node runs multiple trainings, it hands training tasks to any of the 32 CPUs, and each of those processes caches PyTorch memory on one of the 3 GPUs (caching is a performance feature of PyTorch). This leads to an OOM on the GPUs, since all 32 CPU processes end up reserving their own memory on the 3 GPUs.
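To make the failure mode concrete, here is a stripped-down sketch of the pattern; the resource requests, model, and data are placeholder assumptions, not our real training code:

```python
import ray
import torch

ray.init(address="auto")  # join the existing cluster

# Each task runs in a Ray worker process. Any process that touches CUDA keeps
# its own PyTorch caching allocator, so "freed" GPU memory stays reserved for
# that process as long as it is alive.
@ray.remote(num_cpus=1, num_gpus=1)
def train_once(n_samples):
    model = torch.nn.Linear(512, 512).cuda()   # placeholder model
    x = torch.randn(n_samples, 512).cuda()     # placeholder batch
    loss = model(x).sum()
    loss.backward()
    return float(loss)

# Over many submissions, Ray can hand tasks to many different worker processes
# (up to one per CPU), and each of them ends up holding cached CUDA memory on
# one of the 3 GPUs, which is what eventually OOMs them.
losses = ray.get([train_once.remote(64) for _ in range(100)])
```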
One way around this is to set PYTORCH_NO_CUDA_MEMORY_CACHING=1, but that causes severe performance degradation, up to 50% in our case, since the flag is really only intended for debugging.
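If anyone wants to try that flag anyway, one way to apply it (assuming pushing it through runtime_env env_vars is acceptable) is to set it for every worker process of a job:

```python
import ray

# Disable PyTorch's CUDA caching allocator in every worker process of this
# job, at the cost of the performance hit described above.
ray.init(
    address="auto",
    runtime_env={"env_vars": {"PYTORCH_NO_CUDA_MEMORY_CACHING": "1"}},
)
```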
So the final solution seems to be to isolate resources by running two workers on a single node: a “trainer” worker with 3 CPUs and 3 GPUs, and a “cpu” worker with 29 CPUs, so that single-CPU tasks can still be highly parallelized.
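A sketch of how tasks could then be pinned to the right logical worker. It assumes each ray start on the node also advertises a hypothetical custom resource: “trainer” (3 units) on the 3-CPU/3-GPU worker and “cpu_pool” (29 units) on the 29-CPU worker; those names and counts are illustrative assumptions:

```python
import ray

ray.init(address="auto")

# Training tasks are confined to the 3-CPU / 3-GPU "trainer" worker, so at
# most three processes can be caching CUDA memory at any one time.
@ray.remote(num_cpus=1, num_gpus=1, resources={"trainer": 1})
def train(shard_id):
    return f"trained shard {shard_id}"        # placeholder training body

# Pure CPU work targets the 29-CPU "cpu_pool" worker and never touches a GPU.
@ray.remote(num_cpus=1, resources={"cpu_pool": 1})
def preprocess(item_id):
    return f"preprocessed item {item_id}"     # placeholder CPU-only body

ray.get([preprocess.remote(i) for i in range(29)] +
        [train.remote(j) for j in range(3)])
```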