How severely does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
I am getting started with Ray and want to use it to scale the training of my PyTorch neural network. Before using Ray with a PyTorch network class (i.e., nn.Module), I want to make sure I can use Ray with a simple tensor. Therefore, please see the snippet below:
import ray
import torch

ray.init()

@ray.remote
class Counter(object):
    def __init__(self):
        self.tensor = torch.ones((1, 3))
        self.device = "cuda:0"

    def move_and_increment(self):
        self.tensor.to(self.device)
        self.tensor += 1

    def print(self):
        return self.tensor

print(f"torch.cuda.is_available(): {torch.cuda.is_available()}")

counters = [Counter.remote() for i in range(2)]
[c.move_and_increment.remote() for c in counters]
futures = [c.print.remote() for c in counters]
print(ray.get(futures))

ray.shutdown()
I tried running it, but it failed miserably. Please see the reported error below:
$ python ray_test.py
2022-09-29 23:22:25,112 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
torch.cuda.is_available(): True
2022-09-29 23:22:26,434 ERROR worker.py:399 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::Counter.move_and_increment() (pid=6008, ip=192.168.10.5, repr=<ray_test.Counter object at 0x7fa39c2db490>)
File "/home/ravi/learning_ray/ray_test.py", line 14, in move_and_increment
self.tensor.to(self.device)
File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
2022-09-29 23:22:26,474 ERROR worker.py:399 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::Counter.move_and_increment() (pid=6006, ip=192.168.10.5, repr=<ray_test.Counter object at 0x7f9fe462b520>)
File "/home/ravi/learning_ray/ray_test.py", line 14, in move_and_increment
self.tensor.to(self.device)
File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
[tensor([[1., 1., 1.]]), tensor([[1., 1., 1.]])]
I have enough free memory in my graphics card, as shown below:
$ nvidia-smi
Thu Sep 29 23:26:14 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:01:00.0 Off | N/A |
| N/A 45C P8 7W / N/A | 443MiB / 7982MiB | 6% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1529 G /usr/lib/xorg/Xorg 45MiB |
| 0 N/A N/A 2812 G /usr/lib/xorg/Xorg 161MiB |
| 0 N/A N/A 2987 G /usr/bin/gnome-shell 105MiB |
| 0 N/A N/A 6112 G ...252301518872410907,131072 80MiB |
| 0 N/A N/A 8280 G ...RendererForSitePerProcess 36MiB |
+-----------------------------------------------------------------------------+
I think the error is related to lazy initialization. Can you please confirm and provide a way to fix it?
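In case it helps: from reading the docs, my assumption is that Ray controls which GPUs each worker process may see through the CUDA_VISIBLE_DEVICES environment variable, and that an actor created with a bare @ray.remote reserves zero GPUs, so inside the actor torch finds an empty device list even though the driver sees the GPU. A tiny stdlib-only sketch of the mechanism I mean (the environment variable is real; the zero-GPU scenario is my guess about what Ray does here):

```python
import os

# My understanding: Ray launches each worker with CUDA_VISIBLE_DEVICES
# restricted to the GPUs that worker has reserved. An actor created with a
# bare @ray.remote reserves num_gpus=0, so the variable is empty and any
# CUDA runtime initialized afterwards reports zero available devices.
os.environ["CUDA_VISIBLE_DEVICES"] = ""  # what a 0-GPU worker would see
visible = [d for d in os.environ["CUDA_VISIBLE_DEVICES"].split(",") if d]
print(len(visible))  # 0 devices visible
```

If that is right, I would guess the fix is to declare the resource on the actor, e.g. @ray.remote(num_gpus=0.5) so my two actors can share the single GPU, and also to reassign self.tensor = self.tensor.to(self.device), since Tensor.to() returns a new tensor rather than moving the original in place. But I would appreciate confirmation.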