Large memory usage by dashboard

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity
  • Low: It annoys or frustrates me for a moment.
  • Medium: It contributes to significant difficulty in completing my task, but I can work around it.
  • High: It blocks me from completing my task. (unless I can disable it on the workers)

I’m running some RLlib + Tune workloads on multiple nodes.

After a day or so I’m getting:

(_PackActor pid=233066, ip= ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ***** is used (357.77 / 376.36 GB). The top 10 memory consumers are:
(_PackActor pid=233066, ip=   
(_PackActor pid=233066, ip= PID       MEM     COMMAND
(_PackActor pid=233066, ip= 256207    127.58GiB       /usr/local/bin/python3 -u /usr/local/lib/python3.9/site-packages/ray/dashboard/ --node-ip-ad
(_PackActor pid=233066, ip= 256148    3.14GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/
(_PackActor pid=233066, ip= 256060    1.71GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20
(_PackActor pid=233066, ip= 256351    1.51GiB python3 -u scripts/ --logdir logs/hyperion/exp-set-04/exp-01 --exp-name exp-gnn/exp-set-04/e
(_PackActor pid=233066, ip= 260405    1.25GiB ray::RolloutWorker
(_PackActor pid=233066, ip= 258394    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip= 260391    1.24GiB ray::RolloutWorker                                                                                                                                         
(_PackActor pid=233066, ip= 258389    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip= 260392    1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip= 258382    1.24GiB ray::RolloutWorker  

That doesn’t look healthy.
I can start the workers with ray start <...> --include-dashboard false (according to the docs), but this might indicate a deeper issue.
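
To put the table above into numbers, here is a rough sketch that parses the "top memory consumers" rows out of the error text and computes each process's share of the listed total. The function names (parse_consumers, share_by_pid) and the regular expression are made up for illustration and are not part of Ray; the sample only includes a few rows copied from the log above, and the truncated command lines are kept as they appear.

```python
import re

# Matches rows like "256207    127.58GiB       /usr/local/bin/python3 -u ..."
# anywhere in a line, so the "(_PackActor pid=..., ip=...)" prefix is skipped.
ROW = re.compile(r"\b(\d+)\s+(\d+(?:\.\d+)?)GiB\s+(\S.*)")


def parse_consumers(log_text):
    """Return a list of (pid, mem_gib, command) tuples found in the log text."""
    rows = []
    for line in log_text.splitlines():
        match = ROW.search(line)
        if match:
            pid = int(match.group(1))
            mem_gib = float(match.group(2))
            command = match.group(3).strip()
            rows.append((pid, mem_gib, command))
    return rows


def share_by_pid(rows):
    """Return {pid: fraction of the summed memory of the listed rows}."""
    total = sum(mem for _, mem, _ in rows)
    return {pid: mem / total for pid, mem, _ in rows}


# A few rows copied from the error message above (not the full table).
SAMPLE = """\
(_PackActor pid=233066, ip= PID       MEM     COMMAND
(_PackActor pid=233066, ip= 256207    127.58GiB       /usr/local/bin/python3 -u /usr/local/lib/python3.9/site-packages/ray/dashboard/ --node-ip-ad
(_PackActor pid=233066, ip= 256148    3.14GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/
(_PackActor pid=233066, ip= 260405    1.25GiB ray::RolloutWorker
"""

rows = parse_consumers(SAMPLE)
shares = share_by_pid(rows)
for pid, mem_gib, command in rows:
    print(f"{pid:>8}  {mem_gib:8.2f} GiB  {shares[pid]:6.1%}  {command[:40]}")
```

Note that the shares are relative to the rows listed in the message only, not to the node's total memory.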

Related dashboard memory leak issue: [Dashboard] The process leaks memory · Issue #26568 · ray-project/ray · GitHub
cc: @aguo @architkulkarni

@vakker00, sorry you’re running into this and thanks for reporting! This looks like a new (but possibly related) issue (here it’s the process leaking memory, not the process). One function of the dashboard is to handle runtime environments; by chance, are you using runtime_env in your Ray code?

Do you mind creating an issue on GitHub, ideally with a way for us to reproduce it? We can follow up there and investigate.

Thanks for the reply.
I’m not using runtime_env anywhere.

I can create an issue, but I’m not sure how to reproduce this with minimal code. I’m running my experiments on a Slurm cluster with multiple nodes.
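
In the absence of a minimal reproduction, one option is to log the resident memory of the suspect process over time and attach that to the issue. The sketch below assumes a Linux node (it reads /proc/<pid>/status) and that the PID of the dashboard process is looked up separately, e.g. from the error message above. The function names are hypothetical helpers, not a Ray API.

```python
import time


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value in KiB from the text of /proc/<pid>/status, or None."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:\t  123456 kB".
            return int(line.split()[1])
    return None


def read_rss_kib(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kib(f.read())


def sample_rss(pid, interval_s=60.0, samples=5):
    """Collect (unix_time, rss_kib) pairs for the given PID."""
    readings = []
    for i in range(samples):
        readings.append((time.time(), read_rss_kib(pid)))
        if i + 1 < samples:
            time.sleep(interval_s)
    return readings


# Example usage (the PID is whatever the dashboard process has on that node):
#   for ts, rss in sample_rss(256207, interval_s=600, samples=144):
#       print(ts, rss)
```

A steadily growing series from a run like this, together with the Ray version and the launch commands, may be enough to investigate even without a small script that triggers the leak.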

I’ve just submitted [Dashboard] The process leaks memory · Issue #29199 · ray-project/ray · GitHub