Large memory usage by dashboard

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes significant difficulty to completing my task, but I can work around it.
  • High: It blocks me from completing my task (unless I can disable it on the workers).

I’m running some RLlib + Tune workloads on multiple nodes.

After a day or so I’m getting:

(_PackActor pid=233066, ip=10.10.4.2) ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ***** is used (357.77 / 376.36 GB). The top 10 memory consumers are:
(_PackActor pid=233066, ip=10.10.4.2)   
(_PackActor pid=233066, ip=10.10.4.2) PID       MEM     COMMAND
(_PackActor pid=233066, ip=10.10.4.2) 256207    127.58GiB       /usr/local/bin/python3 -u /usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-ad
(_PackActor pid=233066, ip=10.10.4.2) 256148    3.14GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/
(_PackActor pid=233066, ip=10.10.4.2) 256060    1.71GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20
(_PackActor pid=233066, ip=10.10.4.2) 256351    1.51GiB python3 -u scripts/train.py --logdir logs/hyperion/exp-set-04/exp-01 --exp-name exp-gnn/exp-set-04/e
(_PackActor pid=233066, ip=10.10.4.2) 260405    1.25GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 258394    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 260391    1.24GiB ray::RolloutWorker                                                                                                                                         
(_PackActor pid=233066, ip=10.10.4.2) 258389    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 260392    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 258382    1.24GiB ray::RolloutWorker  

That doesn’t look healthy.
I can start the workers with ray start <...> --include-dashboard false (according to the docs), but this might indicate a deeper issue.
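For reference, when Ray is started from inside a script rather than via ray start, the equivalent switch is the include_dashboard argument of ray.init. A minimal sketch (note: this only disables the dashboard head process for a local session; it is not a substitute for the CLI flag on a multi-node cluster, and it does not touch the per-node agent):

  import ray

  # Start a local Ray instance without the dashboard process.
  ray.init(include_dashboard=False)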

Related dashboard memory leak issue: [Dashboard] The dashboard.py process leaks memory · Issue #26568 · ray-project/ray · GitHub
cc: @aguo @architkulkarni

@vakker00, sorry you’re running into this and thanks for reporting! This looks like a new (but possibly related) issue: here it’s the agent.py process leaking memory, not the dashboard.py process. One function of the dashboard agent.py is to handle runtime environments; by any chance, are you using runtime_env in your Ray code?
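For clarity, runtime_env usage typically looks something like the sketch below (the dependency and environment variable here are just placeholders); if nothing like this appears in your code, you aren’t using runtime_env:

  import ray

  # Cluster-wide runtime environment passed at connection time.
  ray.init(runtime_env={"pip": ["requests"]})

  # Or a per-task / per-actor runtime environment.
  @ray.remote(runtime_env={"env_vars": {"MY_FLAG": "1"}})
  def f():
      return "ok"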

Do you mind creating an issue on GitHub, ideally with a way for us to reproduce it? We can follow up there and investigate.

Thanks for the reply.
I’m not using runtime_env anywhere.

I can create an issue, but I’m not sure how to reproduce this with minimal code. I’m running my experiments on a Slurm cluster with multiple nodes.
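In the meantime, even without a minimal repro, a memory trace of the agent process over time might be useful to attach to the issue. A rough sketch, assuming psutil is installed and the agent shows up with dashboard/agent.py in its command line:

  import time
  import psutil

  # Periodically log the RSS of any process whose command line mentions
  # "dashboard/agent.py", to show how its memory grows over time.
  while True:
      for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
          cmdline = " ".join(proc.info["cmdline"] or [])
          if "dashboard/agent.py" in cmdline:
              rss_gib = proc.info["memory_info"].rss / (1024 ** 3)
              print(f"{time.strftime('%H:%M:%S')} pid={proc.info['pid']} rss={rss_gib:.2f} GiB")
      time.sleep(60)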

Edit:
I’ve just submitted [Dashboard] The agent.py process leaks memory · Issue #29199 · ray-project/ray · GitHub