- High: It blocks me from completing my task.
I have some training tasks running on a Ray cluster. After some of the tasks finish, the GPUs are left with many ray::IDLE processes that occupy a lot of GPU memory, and new tasks are blocked.
How can I deal with this?
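For context, the tasks are submitted roughly like this (a minimal sketch assuming one GPU per task; `train_task`, the model, and the step count are placeholders, not the actual training code):

```python
import ray
import torch

ray.init(address="auto")  # connect to the running cluster

@ray.remote(num_gpus=1)
def train_task(num_steps: int) -> float:
    # Stand-in for the real training loop: builds a small model on the GPU
    # and runs a few optimizer steps.
    model = torch.nn.Linear(1024, 1024).cuda()
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss = torch.zeros((), device="cuda")
    for _ in range(num_steps):
        x = torch.randn(64, 1024, device="cuda")
        loss = model(x).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return float(loss.item())

# Submit a batch of tasks; after ray.get() returns, the worker processes that
# ran them show up in nvidia-smi as ray::IDLE and keep holding GPU memory.
results = ray.get([train_task.remote(100) for _ in range(8)])
```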
Here is the output of nvidia-smi:
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 2263371 C ray::IDLE 1750MiB |
| 0 N/A N/A 2276403 C ray::IDLE 1750MiB |
| 0 N/A N/A 2290531 C ray::IDLE 1752MiB |
| 1 N/A N/A 2263372 C ray::IDLE 1750MiB |
| 1 N/A N/A 2275104 C ray::IDLE 1750MiB |
| 1 N/A N/A 2290532 C ray::IDLE 1750MiB |
| 2 N/A N/A 2263373 C ray::IDLE 1752MiB |
| 2 N/A N/A 2272750 C ray::IDLE 1750MiB |
| 2 N/A N/A 2281196 C ray::IDLE 1750MiB |
| 3 N/A N/A 2263374 C ray::IDLE 1750MiB |
| 3 N/A N/A 2273212 C ray::IDLE 1752MiB |
| 4 N/A N/A 2263376 C ray::IDLE 1750MiB |
| 4 N/A N/A 2282263 C ray::IDLE 1752MiB |
| 4 N/A N/A 2290533 C ray::IDLE 1750MiB |
| 5 N/A N/A 2263377 C ray::IDLE 1752MiB |
| 5 N/A N/A 2274407 C ray::IDLE 1750MiB |
| 5 N/A N/A 2290534 C ray::IDLE 1750MiB |
| 6 N/A N/A 2263378 C ray::IDLE 1750MiB |
| 6 N/A N/A 2278193 C ray::IDLE 1750MiB |
| 6 N/A N/A 2290535 C ray::IDLE 1752MiB |
| 7 N/A N/A 2263379 C ray::IDLE 1750MiB |
| 7 N/A N/A 2290536 C ray::IDLE 1752MiB |
+-----------------------------------------------------------------------------+
For more context: I already call torch.cuda.empty_cache() in each task.
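The call happens right before each task returns, roughly like this (a sketch; the helper name, its arguments, and the del/gc.collect() ordering are illustrative, not the actual code):

```python
import gc
import torch

def release_gpu_memory(model, optimizer):
    # Drop Python references to the GPU tensors so the caching allocator
    # can reclaim their blocks.
    del model, optimizer
    gc.collect()
    # Return cached blocks to the driver. This only clears PyTorch's cache
    # inside the worker process; it does not end the process or release the
    # CUDA context that the ray::IDLE worker still holds.
    torch.cuda.empty_cache()
```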