How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
I’m trying to run a Tune search on a single node with 4 V100 GPUs. I call ray.init(num_cpus=4, num_gpus=4) at the start of the search and specify resources_per_trial = {"cpu": 1, "gpu": 1} for each trial, so I expect 4 workers to start and 4 trials to run simultaneously. However, this doesn’t happen: I only get a single worker/trial. Upon closer inspection, the status banner indicates that Ray thinks the cluster only has a single V100.
== Status ==
Current time: 2022-05-22 07:50:01 (running for 00:00:00.12)
Memory usage on this node: 37.0/311.5 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/4 CPUs, 0/4 GPUs, 0.0/189.48 GiB heap, 0.0/85.2 GiB objects (0.0/1.0 accelerator_type:V100)
...
So we can see that the cluster reports 4 CPUs and 4 GPUs, matching how I called ray.init, but only one accelerator_type:V100.
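To make the mismatch concrete, here is a minimal sketch that parses the "Resources requested" line from the banner and works out how many trials should fit. It is plain string processing, not Ray's own scheduling logic; the helper names parse_totals and max_concurrent_trials are hypothetical, and the per-trial request is rewritten with the banner's labels ("CPUs", "GPUs") as an assumption for illustration.

```python
import re

# Line copied verbatim from the status banner in this post.
BANNER = (
    "Resources requested: 0/4 CPUs, 0/4 GPUs, 0.0/189.48 GiB heap, "
    "0.0/85.2 GiB objects (0.0/1.0 accelerator_type:V100)"
)

# Matches "<used>/<total> <name>", where the name is either "GiB <word>"
# or a single token such as "CPUs" or "accelerator_type:V100".
PAIR = re.compile(r"(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?) (GiB \w+|[\w:]+)")


def parse_totals(line):
    """Return {resource name: total amount} for each used/total pair."""
    return {name: float(total) for _used, total, name in PAIR.findall(line)}


def max_concurrent_trials(totals, per_trial):
    """How many trials fit if each one needs `per_trial` of every resource."""
    return min(int(totals[name] // need) for name, need in per_trial.items())


totals = parse_totals(BANNER)
print(totals)

# resources_per_trial = {"cpu": 1, "gpu": 1}, expressed with banner labels.
print(max_concurrent_trials(totals, {"CPUs": 1, "GPUs": 1}))
```

By this arithmetic the CPU and GPU totals alone leave room for 4 concurrent trials, so the only figure in the banner that does not match the hardware is the accelerator_type:V100 entry.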
I think ray is getting confused in the backend when creating the cluster, but I’m having trouble figuring out where these resources are enumerated in the code.
In that status banner, I found that the Resources requested line comes from the following path:
- ResourceUpdater.update_avail_resources (ray/resource_updater.py at master · ray-project/ray · GitHub).
- ray.cluster_resources is then called (internally calling GlobalState.cluster_resources).
- GlobalState.node_table is then called.
- The resource component of that table is filled by GlobalState.node_resource_table,
- which gets its data from GlobalStateAccessor.get_node_resource_info, which originates in ray._raylet.
- We are now in the C/C++ components of Ray; however, I couldn’t find which C/C++ call maps onto get_node_resource_info.
- The nearest next step I can find is the C/C++ function that creates a ray::gcs::GlobalStateAccessor, named ray::gcs::CreateGlobalStateAccessor.
- Ultimately, the GlobalStateAccessor seems to make a call to some kind of GCS RPC accessor to retrieve the node data, but there are no references to GPUs that seem promising in src/ray/gcs: ray/src/ray/gcs at master · ray-project/ray · GitHub
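The Python end of that path can be inspected without reading the C++ side. The sketch below compares ray.cluster_resources() with the per-node "Resources" entries returned by ray.nodes(), to see which node reports the accelerator_type entry and with what amount. The helper names (sum_node_resources, accelerator_resources, inspect_live_cluster) are hypothetical, and the "Alive" and "Resources" keys are assumed to match the node dictionaries of the installed Ray version.

```python
from collections import defaultdict


def sum_node_resources(nodes):
    """Sum the 'Resources' dict of every alive entry in a ray.nodes()-style list."""
    totals = defaultdict(float)
    for node in nodes:
        if not node.get("Alive", False):
            continue  # skip dead nodes so stale entries are not counted
        for name, amount in node.get("Resources", {}).items():
            totals[name] += amount
    return dict(totals)


def accelerator_resources(resources):
    """Keep only the accelerator_type:* entries of a resource dict."""
    return {k: v for k, v in resources.items() if k.startswith("accelerator_type:")}


def inspect_live_cluster():
    """Print both views of the cluster; needs Ray installed, run on the affected node."""
    import ray

    ray.init(num_cpus=4, num_gpus=4)
    try:
        cluster = ray.cluster_resources()
        per_node = sum_node_resources(ray.nodes())
        print("cluster_resources:", cluster)
        print("summed node table:", per_node)
        print("accelerator entries:", accelerator_resources(per_node))
    finally:
        ray.shutdown()
```

Calling inspect_live_cluster() on the 4-GPU machine should show whether the value of 1.0 is already present in the node table, or only appears once the totals are aggregated for the banner.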
So I’m at a dead end: I’m not sure where or how Ray determines the amount of these ‘custom resources’. If I knew, I might be able to propose a patch.
I tried to post this on GitHub, but that required a reproducible script, and since this is a hardware-dependent problem, I didn’t think that was possible.