How severely does this issue affect your experience of using Ray?
- None: Just asking a question out of curiosity
- Low: It annoys or frustrates me for a moment.
- Medium: It contributes to significant difficulty in completing my task, but I can work around it.
- High: It blocks me from completing my task (unless I can disable it on the workers).
I’m running some RLlib + Tune workloads on multiple nodes.
After a day or so I’m getting:
(_PackActor pid=233066, ip=10.10.4.2) ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ***** is used (357.77 / 376.36 GB). The top 10 memory consumers are:
(_PackActor pid=233066, ip=10.10.4.2)
(_PackActor pid=233066, ip=10.10.4.2) PID MEM COMMAND
(_PackActor pid=233066, ip=10.10.4.2) 256207 127.58GiB /usr/local/bin/python3 -u /usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-ad
(_PackActor pid=233066, ip=10.10.4.2) 256148 3.14GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/
(_PackActor pid=233066, ip=10.10.4.2) 256060 1.71GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_20
(_PackActor pid=233066, ip=10.10.4.2) 256351 1.51GiB python3 -u scripts/train.py --logdir logs/hyperion/exp-set-04/exp-01 --exp-name exp-gnn/exp-set-04/e
(_PackActor pid=233066, ip=10.10.4.2) 260405 1.25GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 258394 1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 260391 1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 258389 1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 260392 1.24GiB ray::RolloutWorker
(_PackActor pid=233066, ip=10.10.4.2) 258382 1.24GiB ray::RolloutWorker
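That doesn’t look healthy: the dashboard agent (ray/dashboard/agent.py, PID 256207) alone is holding 127.58 GiB, far more than any other process in the list.
According to the docs, I can start the workers with ray start <...> --include-dashboard false,
but memory usage this high in the agent might indicate a deeper issue.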
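To compare this across nodes and over time, here is a minimal sketch that parses the "top memory consumers" table out of the log and ranks the processes. The regex, the function name parse_consumers, and the SAMPLE excerpt (three rows copied from the log above, with the command lines truncated exactly as they were printed) are my own illustration, not anything Ray provides, and the sketch assumes every row reports memory in GiB as in this output.

```python
import re

# Matches table rows such as:
#   (_PackActor pid=233066, ip=10.10.4.2) 256207 127.58GiB /usr/local/bin/python3 ...
# Assumption: memory is always printed with a "GiB" suffix, as in the log above.
ROW = re.compile(r"\)\s+(?P<pid>\d+)\s+(?P<mem>\d+(?:\.\d+)?)GiB\s+(?P<cmd>.+)$")


def parse_consumers(log_text):
    """Return (pid, mem_gib, command) tuples, largest consumer first."""
    rows = []
    for line in log_text.splitlines():
        match = ROW.search(line)
        if match:
            rows.append(
                (int(match["pid"]), float(match["mem"]), match["cmd"].strip())
            )
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Three rows copied from the log above; commands are truncated as printed.
SAMPLE = """\
(_PackActor pid=233066, ip=10.10.4.2) PID MEM COMMAND
(_PackActor pid=233066, ip=10.10.4.2) 256207 127.58GiB /usr/local/bin/python3 -u /usr/local/lib/python3.9/site-packages/ray/dashboard/agent.py --node-ip-ad
(_PackActor pid=233066, ip=10.10.4.2) 256148 3.14GiB /usr/local/lib/python3.9/site-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/
(_PackActor pid=233066, ip=10.10.4.2) 260405 1.25GiB ray::RolloutWorker
"""

consumers = parse_consumers(SAMPLE)
listed_total = sum(mem for _, mem, _ in consumers)

for pid, mem, cmd in consumers:
    # Share is relative to the rows that were parsed, not to total node memory.
    print(f"{pid:>8} {mem:8.2f} GiB {100 * mem / listed_total:5.1f}%  {cmd[:60]}")
```

Running this over the same log captured at different times should show whether agent.py keeps growing while the RolloutWorker processes stay flat.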