[Ray Core] RuntimeError: No CUDA GPUs are available

Thanks @rickyyx for the quick response; I appreciate it. Sorry for not adding num_gpus=1 to my actor, and thanks for the sharp observation. I have now added it and also call ray.cluster_resources() as suggested. Please see below:

import ray
import torch

ray.init()
print(ray.cluster_resources())  # show what Ray detected (CPUs, GPUs, memory, etc.)

@ray.remote(num_gpus=1)
class Counter(object):
    def __init__(self):
        self.tensor = torch.ones((1, 3))
        self.device = "cuda:0"

    def move_and_increment(self):
        # .to() returns a new tensor, so reassign it; otherwise the tensor stays on the CPU
        self.tensor = self.tensor.to(self.device)
        self.tensor += 1

    def print(self):
        return self.tensor


print(f"torch.cuda.is_available(): {torch.cuda.is_available()}")

# Create two actors; each one requests a full GPU via num_gpus=1
counters = [Counter.remote() for _ in range(2)]
[c.move_and_increment.remote() for c in counters]
futures = [c.print.remote() for c in counters]
print(ray.get(futures))

ray.shutdown()

Unfortunately, the above code does not return, so I have to press CTRL+C after 2-3 minutes to stop it. Below is the output printed in the terminal:

$ python ray_test.py 
2022-09-30 13:57:55,626	INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
torch.cuda.is_available(): True
(scheduler +16s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(scheduler +16s) Warning: The following resource request cannot be scheduled right now: {'GPU': 1.0, 'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(scheduler +51s) Warning: The following resource request cannot be scheduled right now: {'GPU': 1.0, 'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(scheduler +1m26s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0, 'GPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
^CTraceback (most recent call last):
  File "/home/ravi/learning_ray/ray_test.py", line 26, in <module>
    print(ray.get(futures))
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/worker.py", line 2269, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/worker.py", line 669, in get_objects
    data_metadata_pairs = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 1211, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 173, in ray._raylet.check_status
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/node.py", line 1143, in _kill_process_type
    self._kill_process_impl(
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/node.py", line 1199, in _kill_process_impl
    process.wait(timeout_seconds)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/subprocess.py", line 1189, in wait
    return self._wait(timeout=timeout)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/subprocess.py", line 1911, in _wait
    time.sleep(delay)
KeyboardInterrupt
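
To avoid having to kill the script by hand next time, I am planning to put a timeout on the ray.get call and dump what the scheduler still has free when it fires. This is just a sketch that would replace the print(ray.get(futures)) line in the script above; the 60-second timeout is an arbitrary value I picked:

try:
    # Wait at most 60 seconds instead of blocking forever
    print(ray.get(futures, timeout=60))
except ray.exceptions.GetTimeoutError:
    # Some futures are still pending (e.g. an actor could not be scheduled),
    # so show what the cluster still has free to help debug the scheduling
    print("ray.get timed out; available resources:", ray.available_resources())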

While the script was running, I noticed that this very simple actor consumes 1089 MiB of GPU memory. Isn't that too much for such a simple actor? Please see the output of nvidia-smi below:

$ nvidia-smi 
Fri Sep 30 13:58:45 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  GeForce RTX 208...  On   | 00000000:01:00.0 Off |                  N/A |
| N/A   47C    P8     8W /  N/A |   1513MiB /  7982MiB |      2%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1529      G   /usr/lib/xorg/Xorg                 45MiB |
|    0   N/A  N/A      2812      G   /usr/lib/xorg/Xorg                161MiB |
|    0   N/A  N/A      2987      G   /usr/bin/gnome-shell               56MiB |
|    0   N/A  N/A      6112      G   ...252301518872410907,131072      144MiB |
|    0   N/A  N/A     35286      C   ray::Counter                     1089MiB |
+-----------------------------------------------------------------------------+
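
To figure out how much of that 1089 MiB is actually the tensor versus CUDA context and other framework overhead, I am thinking of adding a small helper method to the Counter class and calling it via ray.get. This is only a sketch, and report_memory is a name I made up:

    def report_memory(self):
        # Bytes PyTorch has actually allocated / reserved on this GPU;
        # presumably the rest of what nvidia-smi reports is CUDA context and overhead
        return {
            "allocated_bytes": torch.cuda.memory_allocated(self.device),
            "reserved_bytes": torch.cuda.memory_reserved(self.device),
        }

I would then call it with something like print(ray.get(counters[0].report_memory.remote())).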

Finally, below is the output of ray status:

$ ray status
======== Autoscaler status: 2022-09-30 14:02:37.908976 ========
Node status
---------------------------------------------------------------
Healthy:
 1 node_64c97f1b08cc9131d1f160698cae490c845ebb2e1ef0cdfb98e086b4
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 1.0/16.0 CPU
 1.0/1.0 GPU
 0.0/1.0 accelerator_type:RTX
 0.00/15.066 GiB memory
 0.00/7.533 GiB object_store_memory

Demands:
 {'CPU': 1.0, 'GPU': 1.0}: 1+ pending tasks/actors
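
Given the Demands line above (one actor still pending while the only GPU is already claimed), I wonder whether I should request a fractional GPU so that both actors can share the single card. If I understand the docs correctly, something like the sketch below would do that; 0.5 is just an example fraction:

# Each actor asks for half a GPU, so two of them can fit on the one physical card
@ray.remote(num_gpus=0.5)
class Counter(object):
    ...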

Please note that I am using Ray v2.0.0. Thank you very much.