[Ray Core] RuntimeError: No CUDA GPUs are available

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

I am getting started with Ray and want to use it to scale the training of my PyTorch neural network. Before using Ray with a PyTorch network class (i.e., nn.Module), I want to make sure I can use Ray with a simple tensor. Please see the snippet below:

import ray
import torch

ray.init()

@ray.remote
class Counter(object):
    def __init__(self):
        self.tensor = torch.ones((1, 3))
        self.device = "cuda:0"

    def move_and_increment(self):
        self.tensor.to(self.device)
        self.tensor += 1

    def print(self):
        return self.tensor

print(f"torch.cuda.is_available(): {torch.cuda.is_available()}")

counters = [Counter.remote() for i in range(2)]
[c.move_and_increment.remote() for c in counters]
futures = [c.print.remote() for c in counters]
print(ray.get(futures))

ray.shutdown()

I tried running it, but it failed miserably. Please see the reported error below:

$ python ray_test.py 
2022-09-29 23:22:25,112	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
torch.cuda.is_available(): True
2022-09-29 23:22:26,434	ERROR worker.py:399 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::Counter.move_and_increment() (pid=6008, ip=192.168.10.5, repr=<ray_test.Counter object at 0x7fa39c2db490>)
  File "/home/ravi/learning_ray/ray_test.py", line 14, in move_and_increment
    self.tensor.to(self.device)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
2022-09-29 23:22:26,474	ERROR worker.py:399 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::Counter.move_and_increment() (pid=6006, ip=192.168.10.5, repr=<ray_test.Counter object at 0x7f9fe462b520>)
  File "/home/ravi/learning_ray/ray_test.py", line 14, in move_and_increment
    self.tensor.to(self.device)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: No CUDA GPUs are available
[tensor([[1., 1., 1.]]), tensor([[1., 1., 1.]])]

I have enough memory on my graphics card, as shown below:

$ nvidia-smi 
Thu Sep 29 23:26:14 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   45C    P8     7W /  N/A |    443MiB /  7982MiB |      6%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1529      G   /usr/lib/xorg/Xorg                 45MiB |
|    0   N/A  N/A      2812      G   /usr/lib/xorg/Xorg                161MiB |
|    0   N/A  N/A      2987      G   /usr/bin/gnome-shell              105MiB |
|    0   N/A  N/A      6112      G   ...252301518872410907,131072       80MiB |
|    0   N/A  N/A      8280      G   ...RendererForSitePerProcess       36MiB |
+-----------------------------------------------------------------------------+

I think the error is related to lazy initialization. Can you please confirm and provide a way to fix it?

Hey @ravi, thanks for posting the question and providing all the code/error context!

A couple of thoughts from me:

  1. Have you tried adding num_gpus=1 to your class definition? Like below:
@ray.remote(num_gpus=1)
class Counter(object):
 ....

This tells Ray that the Counter actor needs to be scheduled on a node with access to GPUs.

  2. Also, could you help me by running ray.cluster_resources() after ray.init() to make sure Ray is seeing the correct resources? (See the sketch below.)
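
Something along these lines is all I mean, just a quick sanity check:

import ray

ray.init()

# Print the resources Ray detected on this machine; with one visible GPU you
# should see an entry like 'GPU': 1.0 in the returned dict.
print(ray.cluster_resources())

ray.shutdown()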

Thanks @rickyyx for the quick response. I appreciate it. I am sorry for not adding num_gpus=1 to my actor; thanks for the sharp observation. I have now added it, along with ray.cluster_resources() as suggested. Please see below:

import ray
import torch

ray.init()
ray.cluster_resources() 

@ray.remote(num_gpus=1)
class Counter(object):
    def __init__(self):
        self.tensor = torch.ones((1, 3))
        self.device = "cuda:0"

    def move_and_increment(self):
        self.tensor.to(self.device)
        self.tensor += 1

    def print(self):
        return self.tensor


print(f"torch.cuda.is_available(): {torch.cuda.is_available()}")

counters = [Counter.remote() for i in range(2)]
[c.move_and_increment.remote() for c in counters]
futures = [c.print.remote() for c in counters]
print(ray.get(futures))

ray.shutdown()

Unfortunately, the above code does not return, so I had to press CTRL+C after 2-3 minutes to stop it. Below is the information printed in the terminal:

$ python ray_test.py 
2022-09-30 13:57:55,626	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
torch.cuda.is_available(): True
(scheduler +16s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(scheduler +16s) Warning: The following resource request cannot be scheduled right now: {'GPU': 1.0, 'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(scheduler +51s) Warning: The following resource request cannot be scheduled right now: {'GPU': 1.0, 'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(scheduler +1m26s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0, 'GPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
^CTraceback (most recent call last):
  File "/home/ravi/learning_ray/ray_test.py", line 26, in <module>
    print(ray.get(futures))
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/worker.py", line 2269, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/worker.py", line 669, in get_objects
    data_metadata_pairs = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 1211, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 173, in ray._raylet.check_status
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/node.py", line 1143, in _kill_process_type
    self._kill_process_impl(
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/node.py", line 1199, in _kill_process_impl
    process.wait(timeout_seconds)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/subprocess.py", line 1189, in wait
    return self._wait(timeout=timeout)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/subprocess.py", line 1911, in _wait
    time.sleep(delay)
KeyboardInterrupt

During the above execution, I noticed that my simple actor is consuming 1089 MiB of GPU memory. Isn't that too much for such a simple actor? Please see the output of nvidia-smi below:

$ nvidia-smi 
Fri Sep 30 13:58:45 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8     8W /  N/A |   1513MiB /  7982MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1529      G   /usr/lib/xorg/Xorg                 45MiB |
|    0   N/A  N/A      2812      G   /usr/lib/xorg/Xorg                161MiB |
|    0   N/A  N/A      2987      G   /usr/bin/gnome-shell               56MiB |
|    0   N/A  N/A      6112      G   ...252301518872410907,131072      144MiB |
|    0   N/A  N/A     35286      C   ray::Counter                     1089MiB |
+-----------------------------------------------------------------------------+

Finally, below is the output of ray status:

$ ray status
======== Autoscaler status: 2022-09-30 14:02:37.908976 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_64c97f1b08cc9131d1f160698cae490c845ebb2e1ef0cdfb98e086b4
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 1.0/16.0 CPU
 1.0/1.0 GPU
 0.0/1.0 accelerator_type:RTX
 0.00/15.066 GiB memory
 0.00/7.533 GiB object_store_memory

Demands:
 {'CPU': 1.0, 'GPU': 1.0}: 1+ pending tasks/actors

Please note that I am using Ray v2.0.0. Thank you very much.

Hey @ravi, thanks for the iteration and the context.

I see 2 issues that you are running into:

  1. The program gets stuck:

I think this is because the Ray cluster only sees 1 GPU available (from the ray status output), while you are trying to run 2 Counter actors, each of which requires 1 GPU. So the second Counter actor cannot be scheduled, and the script gets stuck at the ray.get(futures) call.
You could either

  • schedule just 1 Counter actor
  • or change num_gpus=1 to num_gpus=0.5 for logical accounting (see the sketch after this list).
    See more details here.
  2. The Counter actor is using ~1 GiB of GPU memory: let's see what this number is once you can successfully run the script above, but it seems like a bug (or some obvious optimization we should be doing).
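
For the fractional-GPU route, here is roughly what I have in mind (just a sketch based on your snippet, not tested on your setup; keep in mind that num_gpus=0.5 is only logical accounting, so Ray does not partition the GPU memory between the two actors):

import ray
import torch

ray.init()

# num_gpus=0.5 is logical accounting only: two such actors can be scheduled on
# the single physical GPU, but Ray does not isolate GPU memory for them.
@ray.remote(num_gpus=0.5)
class Counter(object):
    def __init__(self):
        # Ray sets CUDA_VISIBLE_DEVICES for this actor, so "cuda:0" refers to
        # the GPU assigned to it.
        self.device = "cuda:0"
        self.tensor = torch.ones((1, 3))

    def move_and_increment(self):
        # .to() returns a new tensor, so assign it back to actually move it.
        self.tensor = self.tensor.to(self.device)
        self.tensor += 1

    def print(self):
        # Return a CPU copy so the driver does not need the actor's GPU device.
        return self.tensor.cpu()

counters = [Counter.remote() for _ in range(2)]
ray.get([c.move_and_increment.remote() for c in counters])
print(ray.get([c.print.remote() for c in counters]))

ray.shutdown()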

Thanks @rickyyx for your wonderful response.

Both suggestions work like a charm. However, I have the following three things to say:

  1. In my simple program, the actor and the main function live in the same file, and the number of actors is only a handful. Both of these make it easy to update the num_gpus parameter. But how do you set this parameter (and others, such as num_cpus) in a large project spanning multiple files? (See the sketch after this list for the kind of thing I mean.)

  2. Consider an RTX 3090 with 24 GB of GPU memory, while my tensor is tiny. In this case, if I use num_gpus=1 (instead of 0.5) and run two actors, shouldn't Ray automatically notice the free memory on the GPU and place the second actor on the same GPU to save resources? That way, I could run a large number of actors on a single GPU.

  3. Finally, the memory usage of one actor in the new program is the same as above, i.e., 1089 MiB. It turns out that most of this memory is consumed by loading the CUDA context (kernels, etc.). However, when using two actors, the memory usage doubles, as I can see two processes reported by nvidia-smi. Does this mean that Ray loads the CUDA context twice? GPU memory is more precious than anything else in the world!
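
Regarding point 1, the kind of thing I have in mind is setting the resources at the call site via .options() instead of hard-coding them in every decorator, roughly like the sketch below (I am not sure whether this is the recommended pattern for larger projects):

import ray

ray.init()

@ray.remote  # no resources hard-coded where the actor is defined
class Counter(object):
    def __init__(self):
        self.value = 0

    def increment(self):
        self.value += 1
        return self.value

# The caller decides the resource request, e.g. from a project-wide config, so
# individual modules do not have to be edited when the hardware changes.
gpu_share = 0.5  # hypothetical value that could come from a config file
counter = Counter.options(num_gpus=gpu_share).remote()
print(ray.get(counter.increment.remote()))

ray.shutdown()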

For the sake of completeness, I am reporting the output of nvidia-smi below:

$ nvidia-smi 
Tue Oct  4 16:13:30 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0  On |                  N/A |
| N/A   52C    P0    28W /  N/A |   3398MiB /  7982MiB |     20%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1554      G   /usr/lib/xorg/Xorg                160MiB |
|    0   N/A  N/A      2820      G   /usr/lib/xorg/Xorg                688MiB |
|    0   N/A  N/A      3001      G   /usr/bin/gnome-shell              111MiB |
|    0   N/A  N/A      3614      G   ...763400436228628087,131072      172MiB |
|    0   N/A  N/A     39131      G   ...RendererForSitePerProcess       78MiB |
|    0   N/A  N/A    143170      C   ...nter.move_and_increment()     1087MiB |
|    0   N/A  N/A    143171      C   ...nter.move_and_increment()     1085MiB |
+-----------------------------------------------------------------------------+

I am looking forward to hearing your wonderful suggestions. Thank you, again.

Thanks @rickyyx for your valuable time.

Can you provide any suggestions, please?

Thanks