Lots of ray::IDLE processes occupy GPU memory

  • High: It blocks me from completing my task.

I have some training tasks running on a Ray cluster. After some of the tasks finish, the GPUs are left with many ray::IDLE processes that occupy a lot of GPU memory, and new tasks are blocked.

How can I deal with this?

Here is the output of nvidia-smi:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A   2263371      C   ray::IDLE                        1750MiB |
|    0   N/A  N/A   2276403      C   ray::IDLE                        1750MiB |
|    0   N/A  N/A   2290531      C   ray::IDLE                        1752MiB |
|    1   N/A  N/A   2263372      C   ray::IDLE                        1750MiB |
|    1   N/A  N/A   2275104      C   ray::IDLE                        1750MiB |
|    1   N/A  N/A   2290532      C   ray::IDLE                        1750MiB |
|    2   N/A  N/A   2263373      C   ray::IDLE                        1752MiB |
|    2   N/A  N/A   2272750      C   ray::IDLE                        1750MiB |
|    2   N/A  N/A   2281196      C   ray::IDLE                        1750MiB |
|    3   N/A  N/A   2263374      C   ray::IDLE                        1750MiB |
|    3   N/A  N/A   2273212      C   ray::IDLE                        1752MiB |
|    4   N/A  N/A   2263376      C   ray::IDLE                        1750MiB |
|    4   N/A  N/A   2282263      C   ray::IDLE                        1752MiB |
|    4   N/A  N/A   2290533      C   ray::IDLE                        1750MiB |
|    5   N/A  N/A   2263377      C   ray::IDLE                        1752MiB |
|    5   N/A  N/A   2274407      C   ray::IDLE                        1750MiB |
|    5   N/A  N/A   2290534      C   ray::IDLE                        1750MiB |
|    6   N/A  N/A   2263378      C   ray::IDLE                        1750MiB |
|    6   N/A  N/A   2278193      C   ray::IDLE                        1750MiB |
|    6   N/A  N/A   2290535      C   ray::IDLE                        1752MiB |
|    7   N/A  N/A   2263379      C   ray::IDLE                        1750MiB |
|    7   N/A  N/A   2290536      C   ray::IDLE                        1752MiB |
+-----------------------------------------------------------------------------+
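To see how much memory the idle workers hold on each GPU, the process table above can be summarized with a short script. This is only a minimal sketch, assuming the row layout printed by nvidia-smi above and process names without spaces; the function name `summarize_idle_workers` is hypothetical and not part of Ray or the NVIDIA tools.

```python
import re
from collections import defaultdict

# Matches one row of the "Processes" table, e.g.
# |    0   N/A  N/A   2263371      C   ray::IDLE                        1750MiB |
ROW = re.compile(
    r"^\|\s+(?P<gpu>\d+)\s+\S+\s+\S+\s+(?P<pid>\d+)\s+\S+\s+"
    r"(?P<name>\S+)\s+(?P<mem>\d+)MiB\s+\|$"
)


def summarize_idle_workers(smi_text, name="ray::IDLE"):
    """Return {gpu_index: (pids, total_MiB)} for processes called `name`."""
    pids = defaultdict(list)
    mem = defaultdict(int)
    for line in smi_text.splitlines():
        m = ROW.match(line.strip())
        # Header, separator, and non-matching rows are skipped.
        if m and m.group("name") == name:
            gpu = int(m.group("gpu"))
            pids[gpu].append(int(m.group("pid")))
            mem[gpu] += int(m.group("mem"))
    return {gpu: (pids[gpu], mem[gpu]) for gpu in sorted(pids)}
```

Passing the three GPU 0 rows from the table above returns `{0: ([2263371, 2276403, 2290531], 5252)}`, that is, about 5.1 GiB held by idle workers on that GPU alone.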

More information: I call torch.cuda.empty_cache() in each task.

I updated to Python 3.12 and Ray 2.32.0, but the problem still exists.

How do I remove these ray::IDLE processes? They occupy too much GPU memory.

Are you using the Ray Train library?
Are these actors or tasks? You can also use py-spy to examine the stack of the idle processes for more information, and maybe post the output here.
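One way to collect those stacks is to run `py-spy dump --pid <PID>` against each ray::IDLE PID reported by nvidia-smi. The sketch below wraps that command in Python; the helper names `py_spy_dump_cmd` and `dump_stacks` are hypothetical, and it assumes py-spy is installed on the node and that the user is allowed to attach to the worker processes (on Linux this may require root).

```python
import shutil
import subprocess


def py_spy_dump_cmd(pid):
    """Build the `py-spy dump` command line for one process ID."""
    return ["py-spy", "dump", "--pid", str(pid)]


def dump_stacks(pids):
    """Run py-spy on each PID and return {pid: stack text or error text}."""
    if shutil.which("py-spy") is None:
        raise RuntimeError("py-spy not found; install it with `pip install py-spy`")
    out = {}
    for pid in pids:
        res = subprocess.run(py_spy_dump_cmd(pid), capture_output=True, text=True)
        # Keep stderr on failure (for example, a permission error) so it can be reported.
        out[pid] = res.stdout if res.returncode == 0 else res.stderr
    return out
```

For example, `dump_stacks([2263371, 2276403])` would return the current Python stacks of two of the idle workers on GPU 0, which can then be pasted into the thread.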

Hi,

Thank you for sharing this information; it is really useful.