Large memory usage by dashboard

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes significant difficulty to completing my task, but I can work around it.
  • High: It blocks me from completing my task (unless I can disable it on the workers).

I’m running some RLlib + Tune workloads on multiple nodes.

After a day or so I’m getting:

(_PackActor pid=233066, ip=10.10.4.2) ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ***** is used (357.77 / 376.36 GB). The top 10 memory consumers are:
(_PackActor pid=233066, ip=10.10.4.2)   
(_PackActor pid=233066, ip=10.10.4.2) PID       MEM     COMMAND
(_PackActor pid=233066, ip=10.10.4.2) 256207    127.58GiB       /usr/local/bin/python3 -u /usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-ad
(_PackActor pid=233066, ip=10.10.4.2) 256148    3.14GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/
(_PackActor pid=233066, ip=10.10.4.2) 256060    1.71GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20
(_PackActor pid=233066, ip=10.10.4.2) 256351    1.51GiB python3 -u scripts/train.py --logdir logs/hyperion/exp-set-04/exp-01 --exp-name exp-gnn/exp-set-04/e
(_PackActor pid=233066, ip=10.10.4.2) 260405    1.25GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 258394    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 260391    1.24GiB ray::RolloutWorker                                                                                                                                         
(_PackActor pid=233066, ip=10.10.4.2) 258389    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 260392    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 258382    1.24GiB ray::RolloutWorker  

That doesn’t look healthy.
I can start the workers with ray start <...> --include-dashboard false (according to the docs), but this might indicate a deeper issue.
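For reference, when Ray is started from inside a script rather than via ray start, the equivalent switch is the include_dashboard argument of ray.init. A minimal sketch (note: this only disables the dashboard head process for a local session; it is not a substitute for the CLI flag on a multi-node cluster, and it does not touch the per-node agent):

  import ray

  # Start a local Ray instance without the dashboard process.
  ray.init(include_dashboard=False)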

Related dashboard memory leak issue: [Dashboard] The dashboard.py process leaks memory · Issue #26568 · ray-project/ray · GitHub
cc: @aguo @architkulkarni

@vakker00, sorry you’re running into this and thanks for reporting! This looks like a new (but possibly related) issue: here it’s the agent.py process leaking memory, not the dashboard.py process. One function of the dashboard agent.py is to handle runtime environments; by any chance, are you using runtime_env in your Ray code?
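For clarity, runtime_env usage typically looks something like the sketch below (the dependency and environment variable here are just placeholders); if nothing like this appears in your code, you aren’t using runtime_env:

  import ray

  # Cluster-wide runtime environment passed at connection time.
  ray.init(runtime_env={"pip": ["requests"]})

  # Or a per-task / per-actor runtime environment.
  @ray.remote(runtime_env={"env_vars": {"MY_FLAG": "1"}})
  def f():
      return "ok"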

Do you mind creating an issue on GitHub, ideally with a way for us to reproduce it? We can follow up there and investigate.

Thanks for the reply.
I’m not using runtime_env anywhere.

I can create an issue, but I’m not sure how to reproduce this with minimal code. I’m running my experiments on a Slurm cluster with multiple nodes.
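In the meantime, even without a minimal repro, a memory trace of the agent process over time might be useful to attach to the issue. A rough sketch, assuming psutil is installed and the agent shows up with dashboard/agent.py in its command line:

  import time
  import psutil

  # Periodically log the RSS of any process whose command line mentions
  # "dashboard/agent.py", to show how its memory grows over time.
  while True:
      for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
          cmdline = " ".join(proc.info["cmdline"] or [])
          if "dashboard/agent.py" in cmdline:
              rss_gib = proc.info["memory_info"].rss / (1024 ** 3)
              print(f"{time.strftime('%H:%M:%S')} pid={proc.info['pid']} rss={rss_gib:.2f} GiB")
      time.sleep(60)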

Edit:
I’ve just submitted [Dashboard] The agent.py process leaks memory · Issue #29199 · ray-project/ray · GitHub