Ray failing to find 4 V100 gpus on node

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m trying to run a Tune search on a single node with 4 V100 GPUs. I call ray.init(num_cpus=4, num_gpus=4) at the start of the search and specify resources_per_trial = {"cpu": 1, "gpu": 1} for each trial, so I expect 4 workers to start and 4 trials to run simultaneously. However, that doesn’t happen: I only get a single worker/trial. On closer inspection, the status banner indicates Ray thinks the cluster only has a single V100.

== Status ==
Current time: 2022-05-22 07:50:01 (running for 00:00:00.12)
Memory usage on this node: 37.0/311.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/4 GPUs, 0.0/189.48 GiB heap, 0.0/85.2 GiB objects (0.0/1.0 accelerator_type:V100)
...

So the cluster reports 4 CPUs and 4 GPUs, matching how I called ray.init, but only one V100 accelerator.

I think ray is getting confused in the backend when creating the cluster, but I’m having trouble figuring out where these resources are enumerated in the code.
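For reference, here is the shape of the resource dict I believe ray.cluster_resources() returns on this node (values are taken from the status banner above; treat the exact key names as my reading of that output, not something I verified against the source):

```python
# Sketch of the cluster resource dict as reported on my node.
# "accelerator_type:V100" is the suspicious entry: it reads 1.0 even
# though the machine has 4 V100 devices.
cluster_resources = {
    "CPU": 4.0,
    "GPU": 4.0,
    "accelerator_type:V100": 1.0,
}

# The banner's "0.0/1.0 accelerator_type:V100" corresponds to this entry.
print(cluster_resources["accelerator_type:V100"])  # 1.0, not 4.0
```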

In that status banner, I found that the Resources requested line comes from the following path:

  • ResourceUpdater.update_avail_resources (ray/resource_updater.py at master · ray-project/ray · GitHub).
  • ray.cluster_resources is then called (internally calling GlobalState.cluster_resources)
  • GlobalState.node_table is then called
  • The resource component of that table is filled by GlobalState.node_resource_table
  • which gets its data from GlobalStateAccessor.get_node_resource_info, which originates in ray._raylet.
  • We are now in the C/C++ components of Ray; however, I couldn’t find which C/C++ call maps onto get_node_resource_info.
  • The nearest next step I can find is the C/C++ entry point that creates a ray::gcs::GlobalStateAccessor, named ray::gcs::CreateGlobalStateAccessor.
  • Ultimately, the GlobalStateAccessor seems to make a call to some kind of GCS RPC accessor to retrieve the node data, but there are no references to GPUs that seem promising in src/ray/gcs: ray/src/ray/gcs at master · ray-project/ray · GitHub

So I’m at a dead end, not sure where or how Ray determines the amount of these ‘custom resources’. If I knew, I might be able to propose a patch.

I tried to post this on GitHub, but the issue template required a reproducible script; since this is a hardware-dependent problem, I didn’t think that was possible.

Hmm, I think the V100 custom resource here is a bit of a red herring: by convention it’s set to 1 rather than to the number of V100 devices. Since the trial resources don’t request it, it shouldn’t factor into scheduling.
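To illustrate the point (a toy pure-Python sketch of the feasibility math, not Ray’s actual scheduler code): a trial is only matched against the resources it explicitly requests, so the accelerator_type marker never enters the calculation:

```python
# Toy feasibility check: how many copies of `request` fit in `cluster`?
# This is an illustration of the scheduling intuition, not Ray internals.
def max_concurrent_trials(cluster, request):
    return min(int(cluster[key] // amount) for key, amount in request.items())

cluster = {"CPU": 4.0, "GPU": 4.0, "accelerator_type:V100": 1.0}
request = {"CPU": 1, "GPU": 1}  # the trial never asks for the V100 marker

print(max_concurrent_trials(cluster, request))  # 4
```

The `accelerator_type:V100` entry only matters if a trial explicitly requests it, which yours doesn’t.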

I believe you should be seeing 4 concurrent trials. Does the following show 4 concurrent trials successfully?

from ray import tune
import ray

def func(*args):
    import time
    time.sleep(100)  # sleep so we can observe how many trials run at once

ray.init(num_cpus=4, num_gpus=4)
tune.run(func, num_samples=10, resources_per_trial={"cpu": 1, "gpu": 1})

@ericl Interesting test, it worked for me, which is strange… Any ideas for next steps?

@ericl Ah okay, so I noticed my script passes local_mode=True to ray.init. I added it to your test above, and voilà, it only runs a single trial at a time!

Indeed, closer inspection shows that local_mode specifically makes it run serially. I forgot about this. Sorry!
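In other words, with local_mode everything runs in the driver process, so the effective concurrency collapses to 1 no matter what resources were declared. A toy way to picture it (my own sketch, not Ray internals):

```python
# Toy model of the observed behavior: in local mode, trials execute
# serially in the driver process regardless of declared resources.
def effective_concurrency(declared_parallelism, local_mode):
    return 1 if local_mode else declared_parallelism

print(effective_concurrency(4, local_mode=True))   # 1
print(effective_concurrency(4, local_mode=False))  # 4
```

So the fix here is simply to drop local_mode=True from the ray.init call.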