Ray failing to find 4 V100 gpus on node

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m trying to run a Tune search on a single node with 4 V100 GPUs. I call ray.init(num_cpus=4, num_gpus=4) at the start of the search and specify resources_per_trial = {"cpu": 1, "gpu": 1} for each trial, so I expect 4 workers to start and 4 trials to run simultaneously. However, that doesn’t happen: I only get a single worker/trial. On closer inspection, the status banner indicates Ray thinks the cluster only has a single V100.

== Status ==
Current time: 2022-05-22 07:50:01 (running for 00:00:00.12)
Memory usage on this node: 37.0/311.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/4 GPUs, 0.0/189.48 GiB heap, 0.0/85.2 GiB objects (0.0/1.0 accelerator_type:V100)
...

So the cluster reports 4 CPUs and 4 GPUs, matching how I called ray.init, but only one V100 accelerator.

I think ray is getting confused in the backend when creating the cluster, but I’m having trouble figuring out where these resources are enumerated in the code.
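For reference, here is the shape of the resource dict I believe ray.cluster_resources() returns on this node (values are taken from the status banner above; treat the exact key names as my reading of that output, not something I verified against the source):

```python
# Sketch of the cluster resource dict as reported on my node.
# "accelerator_type:V100" is the suspicious entry: it reads 1.0 even
# though the machine has 4 V100 devices.
cluster_resources = {
    "CPU": 4.0,
    "GPU": 4.0,
    "accelerator_type:V100": 1.0,
}

# The banner's "0.0/1.0 accelerator_type:V100" corresponds to this entry.
print(cluster_resources["accelerator_type:V100"])  # 1.0, not 4.0
```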

In that status banner, I found that the Resources requested line comes from the following path:

  • ResourceUpdater.update_avail_resources (ray/resource_updater.py at master · ray-project/ray · GitHub).
  • ray.cluster_resources is then called (internally calling GlobalState.cluster_resources)
  • GlobalState.node_table is then called
  • The resource component of that table is filled by GlobalState.node_resource_table
  • which gets its data from GlobalStateAccessor.get_node_resource_info, which originates in ray._raylet.
  • We are now in the C/C++ components of Ray; however, I couldn’t find which C/C++ call maps onto get_node_resource_info.
  • The nearest next step I can find is the C/C++ entry point that creates a ray::gcs::GlobalStateAccessor, named ray::gcs::CreateGlobalStateAccessor.
  • Ultimately, the GlobalStateAccessor seems to make a call to some kind of GCS RPC accessor to retrieve the node data, but there are no references to GPUs that seem promising in src/ray/gcs: ray/src/ray/gcs at master · ray-project/ray · GitHub

So I’m at a dead end, not sure where or how Ray determines the amount of these ‘custom resources’. If I knew, I might be able to propose a patch.

I tried to post this on GitHub, but the issue template required a reproducible script; since this is a hardware-dependent problem, I didn’t think that was possible.

Hmm, I think the V100 custom resource here is a bit of a red herring: by convention it’s set to 1 rather than to the number of V100 devices. Since the trial resources don’t request it, it shouldn’t factor into scheduling.
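To illustrate the point (a toy pure-Python sketch of the feasibility math, not Ray’s actual scheduler code): a trial is only matched against the resources it explicitly requests, so the accelerator_type marker never enters the calculation:

```python
# Toy feasibility check: how many copies of `request` fit in `cluster`?
# This is an illustration of the scheduling intuition, not Ray internals.
def max_concurrent_trials(cluster, request):
    return min(int(cluster[key] // amount) for key, amount in request.items())

cluster = {"CPU": 4.0, "GPU": 4.0, "accelerator_type:V100": 1.0}
request = {"CPU": 1, "GPU": 1}  # the trial never asks for the V100 marker

print(max_concurrent_trials(cluster, request))  # 4
```

The `accelerator_type:V100` entry only matters if a trial explicitly requests it, which yours doesn’t.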

I believe you should be seeing 4 concurrent trials. Does the following show 4 concurrent trials successfully?

from ray import tune
import ray

def func(*args):
    import time
    time.sleep(100)  # sleep so we can observe how many trials run at once

ray.init(num_cpus=4, num_gpus=4)
tune.run(func, num_samples=10, resources_per_trial={"cpu": 1, "gpu": 1})

@ericl Interesting test, it worked for me, which is strange… Any ideas for next steps?

@ericl Ah okay, so I noticed my script passes local_mode=True to ray.init. I added it to your test above, and voilà, it only runs a single trial at a time!

Indeed, closer inspection shows that local_mode specifically makes it run serially. I forgot about this. Sorry!
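In other words, with local_mode everything runs in the driver process, so the effective concurrency collapses to 1 no matter what resources were declared. A toy way to picture it (my own sketch, not Ray internals):

```python
# Toy model of the observed behavior: in local mode, trials execute
# serially in the driver process regardless of declared resources.
def effective_concurrency(declared_parallelism, local_mode):
    return 1 if local_mode else declared_parallelism

print(effective_concurrency(4, local_mode=True))   # 1
print(effective_concurrency(4, local_mode=False))  # 4
```

So the fix here is simply to drop local_mode=True from the ray.init call.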