How severely does this issue affect your experience of using Ray? High
We have observed different memory consumption for the Ray head when running on different Ubuntu hosts.
One host is a VDI (Virtual Desktop Infrastructure) while the other is a cloud machine. Both hosts have similar configurations, running Ubuntu 20.04 with 64 GB of RAM and 16 cores.
In our setup, we deploy the Ray head and worker as pods within a Kubernetes cluster (we are not using KubeRay).
We set a memory limit of 1.5 GB for the head pod. However, when both the head and worker pods are launched, the memory usage of the head pod on the VDI host remains below the limit at approximately 0.8 GB.
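For context, this is a hedged sketch of how the effective limit and usage can be confirmed from inside the container; the cgroup paths depend on whether the node uses cgroup v1 or v2, and only one of each pair will exist:

```shell
# Confirm the memory limit the container actually sees, plus current cgroup
# usage (the values the kubelet/OOM killer act on). First path is cgroup v1,
# second is cgroup v2.
for f in /sys/fs/cgroup/memory/memory.limit_in_bytes /sys/fs/cgroup/memory.max; do
    if [ -f "$f" ]; then
        echo "limit: $(cat "$f")"   # ~1610612736 bytes for a 1.5 GiB limit
    fi
done
for f in /sys/fs/cgroup/memory/memory.usage_in_bytes /sys/fs/cgroup/memory.current; do
    if [ -f "$f" ]; then
        echo "usage: $(cat "$f")"
    fi
done
```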
On the cloud machine, however, the memory usage of the Ray head exceeds 2 GB.
The versions of Ray and Python we are using are Ray 2.20.0 and Python 3.10.14, respectively.
We start the Ray head with the following command:
ray start --head --metrics-export-port=8090 --port=6385 --redis-shard-ports=6380,6381 --num-cpus=0 --object-manager-port=22345 --object-store-memory=200000000 --node-manager-port=22346 --dashboard-host=0.0.0.0 --block
We have analyzed the memory consumption of processes running inside the Ray head pod and identified significant differences in the following processes:
raylet: 43 MB on VDI vs. 240 MB on cloud
dashboard.py: 95 MB on VDI vs. 323 MB on cloud
log_monitor.py: 59 MB on VDI vs. 156 MB on cloud
ray.util.client.server (port 23000): 124 MB on VDI vs. 398 MB on cloud
ray.util.client.server (port 10001): 72 MB on VDI vs. 301 MB on cloud
monitor.py: 60 MB on VDI vs. 169 MB on cloud
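As a hedged sketch of how such per-process numbers can be collected inside the head pod (our actual measurement tooling may differ; these are standard procps fields):

```shell
# List the largest processes by resident set size; the RSS column is in KiB.
ps -eo pid,rss,args --sort=-rss | head -n 12
# Total RSS across all Ray-related processes, reported in MiB:
ps -eo rss,args | awk '/ray/ && !/awk/ {sum += $1} END {printf "%.0f MiB\n", sum/1024}'
```

Note that summing RSS over-counts pages shared between processes (e.g. common libraries); per-process PSS from /proc/PID/smaps_rollup gives a fairer attribution if a more precise comparison is needed.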
Below is the full process list inside the head pod (output of top; the RES values match the cloud-host figures above):
%Cpu(s): 14.6 us, 7.9 sy, 0.0 ni, 76.8 id, 0.4 wa, 0.0 hi, 0.4 si, 0.0 st
MiB Mem : 64299.9 total, 3853.5 free, 24531.9 used, 35914.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 33835.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
10 root 20 0 714716 291648 16296 S 1.0 0.4 0:07.58 /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2024-07-09_16-00-26_967883_7/logs --config_list=eyJvYmplY3Rf+
272 root 20 0 3722744 101960 38472 S 1.0 0.2 0:10.48 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/dashboard/agent.py --node-ip-address=10.244.0.68 --metrics-export-port=8090 --dashboard-agen+
82 root 20 0 781912 331036 40924 S 0.7 0.5 0:06.22 /usr/bin/python3.10 /usr/local/lib/python3.10/dist-packages/ray/dashboard/dashboard.py --host=0.0.0.0 --port=8265 --port-retries=0 --temp-dir=/tmp/ray --log-di+
201 root 20 0 1074516 245636 16728 S 0.7 0.4 0:07.97 /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2024-07-09_16-00-26_967883_7/sockets/raylet --stor+
366 root 20 0 1923340 407560 56268 S 0.7 0.6 0:07.23 /usr/bin/python3.10 -m ray.util.client.server --address=10.244.0.68:6385 --host=0.0.0.0 --port=23000 --mode=specific-server
81 root 20 0 759044 308396 35396 S 0.3 0.5 0:05.03 /usr/bin/python3.10 -m ray.util.client.server --address=10.244.0.68:6385 --host=0.0.0.0 --port=10001 --mode=proxy --runtime-env-agent-address=http://10.244.0.6+
202 root 20 0 347940 159668 25948 S 0.3 0.2 0:04.56 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/_private/log_monitor.py --session-dir=/tmp/ray/session_2024-07-09_16-00-26_967883_7 --logs-d+
274 root 20 0 1698388 64352 26168 S 0.3 0.1 0:00.68 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/_private/runtime_env/agent/main.py --node-ip-address=10.244.0.68 --runtime-env-agent-port=53+
7 root 20 0 1727884 83440 35744 S 0.0 0.1 0:01.23 /usr/bin/python3.10 /usr/local/bin/ray start --head --tracing-startup-hook=opwi_detection_core_infra.otel.otel_tracing:setup_python_and_cpp_otel_tracing --metr+
80 root 20 0 390960 172772 26536 S 0.0 0.3 0:01.66 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/monitor.py --logs-dir=/tmp/ray/session_2024-07-09_16-00-26_967883_7/logs+
The Ubuntu, Kubernetes, and Docker versions are identical on both hosts.
Could you please help identify the factors that could cause this difference in memory consumption of the Ray head processes between the two hosts?