Ray worker nodes showing high memory usage even though no tasks running on cluster

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I’m currently seeing high memory usage on our cluster at times when the cluster is inactive. This feels like a leak, but I’m not sure where to begin investigating.

(base) ➜  ghost git:(master) kubectl top pods
NAME                                 CPU(cores)   MEMORY(bytes)
ray-cluster-head-glhcj               233m         8814Mi
ray-cluster-worker-group-cpu-2dfxw   78m          16759Mi
ray-cluster-worker-group-cpu-2trpl   98m          29028Mi
ray-cluster-worker-group-cpu-btnh9   93m          19810Mi
ray-cluster-worker-group-cpu-hw26f   78m          14724Mi
ray-cluster-worker-group-cpu-jn52z   103m         16448Mi
ray-cluster-worker-group-cpu-p59cx   78m          22880Mi
ray-cluster-worker-group-cpu-tmkzw   98m          30623Mi
ray-cluster-worker-group-cpu-v5k2t   78m          17013Mi
ray-cluster-worker-group-cpu-zpmjx   89m          22920Mi
ray-cluster-worker-group-cpu-zpw4m   109m         21765Mi
ray-cluster-worker-group-gpu-6kbvd   99m          14251Mi
ray-cluster-worker-group-gpu-7qnhq   130m         26942Mi
ray-cluster-worker-group-gpu-brlql   98m          19955Mi
ray-cluster-worker-group-gpu-j7vs4   78m          17171Mi
ray-cluster-worker-group-gpu-m2mjd   61m          575Mi
ray-cluster-worker-group-gpu-pjb6r   114m         30894Mi
ray-cluster-worker-group-gpu-prh5d   67m          618Mi
ray-cluster-worker-group-gpu-r9rpd   103m         16480Mi
ray-cluster-worker-group-gpu-tjfkk   82m          16220Mi
ray-cluster-worker-group-gpu-xxfwl   83m          16104Mi
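To put the numbers above in perspective, here is a small, hypothetical helper (plain Python, nothing Ray-specific) that totals the MEMORY(bytes) column of `kubectl top pods` per worker group, so you can see how much the idle cluster is holding in aggregate:

```python
from collections import defaultdict

# A few lines copied from the `kubectl top pods` output above, for illustration.
SAMPLE = """\
ray-cluster-head-glhcj               233m         8814Mi
ray-cluster-worker-group-cpu-2dfxw   78m          16759Mi
ray-cluster-worker-group-gpu-m2mjd   61m          575Mi
"""

def totals_by_group(top_output):
    """Sum memory (MiB) per pod group, where the group is the pod name
    minus its random suffix."""
    totals = defaultdict(int)
    for line in top_output.strip().splitlines():
        name, _cpu, mem = line.split()
        group = name.rsplit("-", 1)[0]          # strip the random pod suffix
        totals[group] += int(mem.rstrip("Mi"))  # values are reported in Mi
    return dict(totals)

print(totals_by_group(SAMPLE))
```

Run against the full output, this makes it easy to compare the CPU and GPU groups, and to spot outliers like the two ~600 Mi pods among the 15–30 Gi ones.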

If I jump into one of the nodes, I see the following:

(base) root@ray-cluster-worker-group-cpu-zpmjx:/opt/ray# ray memory
======== Object references status: 2023-05-30 10:58:59.884291 ========
Grouping by node address...        Sorting by object size...        Display all entries per group...

--- Summary for node address: ---
Mem Used by Objects  Local References  Pinned        Used by task   Captured in Objects  Actor Handles
0.0 B                0, (0.0 B)        0, (0.0 B)    0, (0.0 B)     0, (0.0 B)           1, (0.0 B)

--- Object references for node address: ---
IP Address    | PID      | Type    | Call Site | Status    | Size     | Reference Type | Object Ref
              | 62305    | Worker  | disabled  | -         | ?        | ACTOR_HANDLE   | ffffffffffffffffed3ba63749898995a8fd4f2ee405000001000000

To record callsite information for each ObjectRef created, set env variable RAY_record_ref_creation_sites=1

--- Aggregate object store stats across all nodes ---
Plasma memory usage 0 MiB, 0 objects, 0.0% full, 0.0% needed
Objects consumed by Ray tasks: 1232 MiB.
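As the note in the output says, callsite recording is disabled by default, which is why the Call Site column reads "disabled". A sketch of enabling it (assuming you can restart the Ray processes on each node, since the variable must be set before they start):

```shell
# Assumption: run this on each Ray node before the Ray processes start,
# so `ray memory` can show where each ObjectRef was created.
export RAY_record_ref_creation_sites=1
# Then restart Ray so the raylet and workers pick it up, e.g.:
# ray stop && ray start --head    # (head node; workers rejoin similarly)
```

With that set, the lone ACTOR_HANDLE reference above would show which code created it, which helps decide whether it is expected (e.g. a long-lived named actor) or leaked.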

Attaching a screenshot of that node's heap memory:

  1. Is it possible to share the output of “node memory by component” (see Getting Started — Ray 3.0.0.dev0)?
  2. Also, is this the memory usage after running workloads, or on a fresh cluster?
  3. What’s the total memory usage per node?

I also encountered the same problem. Have you solved it yet? I am using version 2.4.0, and I found that after a job runs to completion, memory usage is higher than before the job ran, so usage keeps climbing over time.

Hello, I have encountered the same issue. I am using version 2.4.0, and I declare 8 GB of memory for the head node and 24 GB for the GPU group. I have attached a screenshot to my previous response. Looking forward to receiving your response.

Can you please share “node memory by component”? It is impossible for us to troubleshoot anything with just the memory usage graph. We need to at least understand which processes use more memory.