Ray head node excessive memory usage

How severely does this issue affect your experience of using Ray? High

We have observed different memory consumption for the Ray head when running on different Ubuntu hosts.
One host is a VDI (Virtual Desktop Infrastructure) while the other is a cloud machine. Both hosts have similar configurations, running Ubuntu 20.04 with 64 GB of RAM and 16 cores.
In our setup, we deploy the Ray head and worker as pods within a Kubernetes cluster (we are not using KubeRay).
We set a memory limit of 1.5 GB for the head pod. However, when both the head and worker pods are launched, the memory usage of the head pod on the VDI host remains below the limit at approximately 0.8 GB.
On the cloud machine, however, the memory usage of the Ray head exceeds 2 GB.
The versions of Ray and Python we are using are Ray 2.20.0 and Python 3.10.14, respectively.
We start the Ray head with the following command:
ray start --head --metrics-export-port=8090 --port=6385 --redis-shard-ports=6380,6381 --num-cpus=0 --object-manager-port=22345 --object-store-memory=200000000 --node-manager-port=22346 --dashboard-host=0.0.0.0 --block
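For scale, the --object-store-memory value above is specified in bytes, so the shared object store is capped far below the 1.5 GB pod limit; a quick conversion (variable name is just for illustration):

# --object-store-memory is specified in bytes.
object_store_bytes = 200_000_000
print(object_store_bytes / (1024 ** 3))  # ~0.19 GiB, well under the 1.5 GB head-pod limit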
We have analyzed the memory consumption of processes running inside the Ray head pod and identified significant differences in the following processes:
Raylet: 43 MB on VDI vs. 240 MB on cloud
dashboard.py: 95 MB on VDI vs. 323 MB on cloud
log_monitor.py: 59 MB on VDI vs. 156 MB on cloud
ray.util.client.server (port 23000): 124 MB on VDI vs. 398 MB on cloud
ray.util.client.server (port 10001): 72 MB on VDI vs. 301 MB on cloud
monitor.py: 60 MB on VDI vs. 169 MB on cloud
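For reference, these per-process figures can be reproduced from inside the head pod; below is a minimal sketch assuming psutil is available in the pod image (our actual numbers were taken from top, so the exact method may differ):

import psutil

# Print resident set size (RSS) for Ray-related processes in the head pod.
# The substring filter is a heuristic and only covers the processes listed above.
RAY_MARKERS = ("raylet", "gcs_server", "dashboard.py", "log_monitor.py",
               "monitor.py", "ray.util.client.server")

for proc in psutil.process_iter(attrs=["pid", "cmdline", "memory_info"]):
    cmdline = " ".join(proc.info["cmdline"] or [])
    if any(marker in cmdline for marker in RAY_MARKERS):
        rss_mb = proc.info["memory_info"].rss / (1024 * 1024)
        print(f"{proc.info['pid']:>6}  {rss_mb:8.1f} MB  {cmdline[:80]}")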

Below is the list of all processes running inside the pod:
%Cpu(s): 14.6 us, 7.9 sy, 0.0 ni, 76.8 id, 0.4 wa, 0.0 hi, 0.4 si, 0.0 st
MiB Mem : 64299.9 total, 3853.5 free, 24531.9 used, 35914.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 33835.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10 root 20 0 714716 291648 16296 S 1.0 0.4 0:07.58 /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2024-07-09_16-00-26_967883_7/logs --config_list=eyJvYmplY3Rf+
272 root 20 0 3722744 101960 38472 S 1.0 0.2 0:10.48 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/dashboard/agent.py --node-ip-address=10.244.0.68 --metrics-export-port=8090 --dashboard-agen+
82 root 20 0 781912 331036 40924 S 0.7 0.5 0:06.22 /usr/bin/python3.10 /usr/local/lib/python3.10/dist-packages/ray/dashboard/dashboard.py --host=0.0.0.0 --port=8265 --port-retries=0 --temp-dir=/tmp/ray --log-di+
201 root 20 0 1074516 245636 16728 S 0.7 0.4 0:07.97 /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2024-07-09_16-00-26_967883_7/sockets/raylet --stor+
366 root 20 0 1923340 407560 56268 S 0.7 0.6 0:07.23 /usr/bin/python3.10 -m ray.util.client.server --address=10.244.0.68:6385 --host=0.0.0.0 --port=23000 --mode=specific-server
81 root 20 0 759044 308396 35396 S 0.3 0.5 0:05.03 /usr/bin/python3.10 -m ray.util.client.server --address=10.244.0.68:6385 --host=0.0.0.0 --port=10001 --mode=proxy --runtime-env-agent-address=http://10.244.0.6+
202 root 20 0 347940 159668 25948 S 0.3 0.2 0:04.56 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/_private/log_monitor.py --session-dir=/tmp/ray/session_2024-07-09_16-00-26_967883_7 --logs-d+
274 root 20 0 1698388 64352 26168 S 0.3 0.1 0:00.68 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/_private/runtime_env/agent/main.py --node-ip-address=10.244.0.68 --runtime-env-agent-port=53+
7 root 20 0 1727884 83440 35744 S 0.0 0.1 0:01.23 /usr/bin/python3.10 /usr/local/bin/ray start --head --tracing-startup-hook=opwi_detection_core_infra.otel.otel_tracing:setup_python_and_cpp_otel_tracing --metr+
80 root 20 0 390960 172772 26536 S 0.0 0.3 0:01.66 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/monitor.py --logs-dir=/tmp/ray/session_2024-07-09_16-00-26_967883_7/logs+
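For context, the two ray.util.client.server entries above (--mode=proxy on port 10001 and --mode=specific-server on port 23000) are the Ray Client entry points; jobs that connect via Ray Client would reach the head roughly like this (a minimal sketch, with the head address as a placeholder):

import ray

# Connect through the Ray Client proxy that `ray start --head` launches on port 10001.
# "<head-pod-address>" is a placeholder for the head pod's service IP or hostname.
ray.init("ray://<head-pod-address>:10001")
print(ray.cluster_resources())  # quick sanity check of the connection
ray.shutdown()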

The Ubuntu, Kubernetes, and Docker versions are identical on both hosts.
Could you please help us identify the potential factors causing this variance in memory consumption among the Ray head processes?

Which cloud and which machine type, and are you deploying directly to VMs or on top of KubeRay?

*The cloud provider is Azure (we have servers allocated to our company that are on all of the time, so the same set of servers is used on all runs).
*We are working directly on VMs (not using KubeRay).

Any difference in OS versions between the cloud and the on-premises VDI?

The OS is the same version; the kernel is almost the same:
VDI: 5.15.0-94-generic
Cloud: 5.15.0-1053-azure

Are you running on AKS, or on VMs directly with your own K8s substrate deployed and managed on top?

On VMs directly, with our own K8s managed on top, which is the same deployment we use on the on-prem VDIs.

Have you had any luck using KubeRay as a middleware orchestrator?

We don't have any use for the additional capabilities it brings; even though we are using Kubernetes, our system is very strict (fixed number of nodes and fixed placement of where the Ray nodes should run).