Thanks @rickyyx for the quick response, I appreciate it. Sorry for not adding num_gpus=1 to my actor, and thanks for the sharp observation. I have now added it, along with ray.cluster_resources() as suggested. Please see below:
import ray
import torch

ray.init()
ray.cluster_resources()

@ray.remote(num_gpus=1)
class Counter(object):
    def __init__(self):
        self.tensor = torch.ones((1, 3))
        self.device = "cuda:0"

    def move_and_increment(self):
        self.tensor.to(self.device)
        self.tensor += 1

    def print(self):
        return self.tensor

print(f"torch.cuda.is_available(): {torch.cuda.is_available()}")

counters = [Counter.remote() for i in range(2)]
[c.move_and_increment.remote() for c in counters]

futures = [c.print.remote() for c in counters]
print(ray.get(futures))

ray.shutdown()
Unfortunately, the above code does not return, so I have to press CTRL+C after 2-3 minutes to stop it. Below is the information printed in the terminal:
$ python ray_test.py
2022-09-30 13:57:55,626 INFO worker.py:1509 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
torch.cuda.is_available(): True
(scheduler +16s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(scheduler +16s) Warning: The following resource request cannot be scheduled right now: {'GPU': 1.0, 'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(scheduler +51s) Warning: The following resource request cannot be scheduled right now: {'GPU': 1.0, 'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
(scheduler +1m26s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0, 'GPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
^CTraceback (most recent call last):
  File "/home/ravi/learning_ray/ray_test.py", line 26, in <module>
    print(ray.get(futures))
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
    return func(*args, **kwargs)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/worker.py", line 2269, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/worker.py", line 669, in get_objects
    data_metadata_pairs = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 1211, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 173, in ray._raylet.check_status
KeyboardInterrupt
^CError in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/node.py", line 1143, in _kill_process_type
    self._kill_process_impl(
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/site-packages/ray/_private/node.py", line 1199, in _kill_process_impl
    process.wait(timeout_seconds)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/subprocess.py", line 1189, in wait
    return self._wait(timeout=timeout)
  File "/home/ravi/anaconda/envs/ray/lib/python3.9/subprocess.py", line 1911, in _wait
    time.sleep(delay)
KeyboardInterrupt
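Looking at the scheduler warning, my guess is that each actor claims a full GPU (num_gpus=1), so the two actors together need two GPUs while my machine has only one, and the second actor can never be scheduled. Would a fractional request be the right way to let both actors share the single GPU? Below is a sketch of what I have in mind (untested; I also assign the result of .to() back, since it returns a new tensor rather than moving the existing one in place):

import ray
import torch

ray.init()

# Sketch only: my assumption is that with num_gpus=0.5 two actors can share one physical GPU.
@ray.remote(num_gpus=0.5)
class Counter(object):
    def __init__(self):
        self.tensor = torch.ones((1, 3))
        self.device = "cuda:0"

    def move_and_increment(self):
        # .to() returns a new tensor, so keep the returned value.
        self.tensor = self.tensor.to(self.device)
        self.tensor += 1

    def print(self):
        return self.tensor

counters = [Counter.remote() for i in range(2)]
[c.move_and_increment.remote() for c in counters]
print(ray.get([c.print.remote() for c in counters]))
ray.shutdown()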
During this run, I also noticed that my simplest actor is consuming 1089 MiB of GPU memory. Isn't that too much for such a simple actor? Please see the output of nvidia-smi below:
$ nvidia-smi
Fri Sep 30 13:58:45 2022
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 GeForce RTX 208... On | 00000000:01:00.0 Off | N/A |
| N/A 47C P8 8W / N/A | 1513MiB / 7982MiB | 2% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 1529 G /usr/lib/xorg/Xorg 45MiB |
| 0 N/A N/A 2812 G /usr/lib/xorg/Xorg 161MiB |
| 0 N/A N/A 2987 G /usr/bin/gnome-shell 56MiB |
| 0 N/A N/A 6112 G ...252301518872410907,131072 144MiB |
| 0 N/A N/A 35286 C ray::Counter 1089MiB |
+-----------------------------------------------------------------------------+
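Regarding the 1089 MiB: to see how much of it is really my (1, 3) tensor versus general CUDA/PyTorch overhead, I am thinking of adding a small helper like the one below as an extra actor method (just a sketch, not run yet; the name report_memory is mine):

import torch

def report_memory(device: str = "cuda:0") -> dict:
    # Bytes currently occupied by live tensors on this device.
    allocated = torch.cuda.memory_allocated(device)
    # Bytes held by PyTorch's caching allocator (a superset of the above);
    # I assume the rest of what nvidia-smi shows is mostly the CUDA context
    # and the kernels PyTorch loads on first use of the GPU.
    reserved = torch.cuda.memory_reserved(device)
    return {"allocated_bytes": allocated, "reserved_bytes": reserved}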
Finally, below is the output of ray status:
$ ray status
======== Autoscaler status: 2022-09-30 14:02:37.908976 ========
Node status
---------------------------------------------------------------
Healthy:
1 node_64c97f1b08cc9131d1f160698cae490c845ebb2e1ef0cdfb98e086b4
Pending:
(no pending nodes)
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
1.0/16.0 CPU
1.0/1.0 GPU
0.0/1.0 accelerator_type:RTX
0.00/15.066 GiB memory
0.00/7.533 GiB object_store_memory
Demands:
{'CPU': 1.0, 'GPU': 1.0}: 1+ pending tasks/actors
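In case it is more convenient than ray status, I believe the same numbers can also be queried from Python; this is how I understand the API (a sketch, I have not verified the exact output while the actor is holding the GPU):

import ray

ray.init(address="auto")  # attach to the already running local cluster

# Everything the cluster knows about (CPUs, GPUs, memory, ...).
print(ray.cluster_resources())

# What is still free right now; while one Counter actor is alive,
# I would expect the GPU to no longer show up here.
print(ray.available_resources())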
Please note that I am using Ray v2.0.0. Thank you very much.