Variance in Ray head memory consumption

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

We have observed different memory consumption for the Ray head when running on different Ubuntu hosts. One host is a VDI (Virtual Desktop Infrastructure) while the other is a cloud machine. Both hosts have similar configurations, running Ubuntu 20.04 with 64 GB of RAM and 16 cores.

In our setup, we deploy the Ray head and worker as pods within a Kubernetes cluster, with a memory limit of 1.5 GB on the head pod. When both the head and worker pods are launched, the memory usage of the head pod on the VDI host stays below the limit, at approximately 0.8 GB; on the cloud machine, however, the memory usage of the Ray head exceeds 2 GB.
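
For anyone who wants to reproduce the measurement, something like the following can read the limit and current usage from inside the pod. This is a minimal sketch assuming cgroup v1; under cgroup v2 the equivalent files are /sys/fs/cgroup/memory.max and /sys/fs/cgroup/memory.current (and memory.max can be the literal string "max" when unlimited):

```python
from pathlib import Path

def read_bytes(path: str) -> int:
    return int(Path(path).read_text().strip())

# cgroup v1 paths; see the note above for the cgroup v2 equivalents.
limit = read_bytes("/sys/fs/cgroup/memory/memory.limit_in_bytes")
usage = read_bytes("/sys/fs/cgroup/memory/memory.usage_in_bytes")
print(f"limit: {limit / 2**30:.2f} GiB, usage: {usage / 2**30:.2f} GiB")
```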

We are using Ray 2.20.0 and Python 3.10.14. We start the Ray head with the following command:
ray start --head --metrics-export-port=8090 --port=6385 --redis-shard-ports=6380,6381 --num-cpus=0 --object-manager-port=22345 --object-store-memory=200000000 --node-manager-port=22346 --dashboard-host=0.0.0.0 --block
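
As a sanity check that the head actually applied the 200000000-byte (~190 MiB) object store setting, the running cluster can be queried from a Python client. A sketch, assuming the client runs on the head node and that Ray 2.x reports these resource totals in bytes:

```python
import ray

# Attach to the already-running cluster instead of starting a new one.
ray.init(address="auto")

res = ray.cluster_resources()
# Assumption: in Ray 2.x these totals are reported in bytes.
print(f"object_store_memory: {res['object_store_memory'] / 2**20:.0f} MiB")
print(f"memory:              {res['memory'] / 2**30:.2f} GiB")
```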

We have analyzed the memory consumption of the processes running inside the Ray head pod and identified significant differences in the following processes (a sketch for collecting these per-process numbers follows the list):

raylet: 43 MB on VDI vs. 240 MB on cloud
dashboard.py: 95 MB on VDI vs. 323 MB on cloud
log_monitor.py: 59 MB on VDI vs. 156 MB on cloud
ray.util.client.server (port 23000): 124 MB on VDI vs. 398 MB on cloud
ray.util.client.server (port 10001): 72 MB on VDI vs. 301 MB on cloud
monitor.py: 60 MB on VDI vs. 169 MB on cloud
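
Here is roughly how per-process RSS numbers like the above can be collected; the psutil dependency and the command-line patterns are assumptions, not part of our deployment:

```python
import psutil  # assumption: pip install psutil inside the pod

# Command-line substrings to match; adjust to your deployment.
PATTERNS = ["raylet", "dashboard.py", "log_monitor.py",
            "ray.util.client.server", "monitor.py"]

for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
    cmd = " ".join(proc.info["cmdline"] or [])
    if any(p in cmd for p in PATTERNS):
        rss_mb = proc.info["memory_info"].rss / 2**20
        print(f"{proc.info['pid']:>6}  {rss_mb:8.1f} MB  {cmd[:80]}")
```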

Below is the full list of processes running inside the head pod, as reported by top:

%Cpu(s): 14.6 us, 7.9 sy, 0.0 ni, 76.8 id, 0.4 wa, 0.0 hi, 0.4 si, 0.0 st
MiB Mem : 64299.9 total, 3853.5 free, 24531.9 used, 35914.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 33835.4 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                          
 10 root      20   0  714716 291648  16296 S   1.0   0.4   0:07.58 /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2024-07-09_16-00-26_967883_7/logs --config_list=eyJvYmplY3Rf+ 
272 root      20   0 3722744 101960  38472 S   1.0   0.2   0:10.48 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/dashboard/agent.py --node-ip-address=10.244.0.68 --metrics-export-port=8090 --dashboard-agen+ 
 82 root      20   0  781912 331036  40924 S   0.7   0.5   0:06.22 /usr/bin/python3.10 /usr/local/lib/python3.10/dist-packages/ray/dashboard/dashboard.py --host=0.0.0.0 --port=8265 --port-retries=0 --temp-dir=/tmp/ray --log-di+ 
201 root      20   0 1074516 245636  16728 S   0.7   0.4   0:07.97 /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2024-07-09_16-00-26_967883_7/sockets/raylet --stor+ 
366 root      20   0 1923340 407560  56268 S   0.7   0.6   0:07.23 /usr/bin/python3.10 -m ray.util.client.server --address=10.244.0.68:6385 --host=0.0.0.0 --port=23000 --mode=specific-server                                      
 81 root      20   0  759044 308396  35396 S   0.3   0.5   0:05.03 /usr/bin/python3.10 -m ray.util.client.server --address=10.244.0.68:6385 --host=0.0.0.0 --port=10001 --mode=proxy --runtime-env-agent-address=http://10.244.0.6+ 
202 root      20   0  347940 159668  25948 S   0.3   0.2   0:04.56 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/_private/log_monitor.py --session-dir=/tmp/ray/session_2024-07-09_16-00-26_967883_7 --logs-d+ 
274 root      20   0 1698388  64352  26168 S   0.3   0.1   0:00.68 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/_private/runtime_env/agent/main.py --node-ip-address=10.244.0.68 --runtime-env-agent-port=53+ 
  7 root      20   0 1727884  83440  35744 S   0.0   0.1   0:01.23 /usr/bin/python3.10 /usr/local/bin/ray start --head --tracing-startup-hook=opwi_detection_core_infra.otel.otel_tracing:setup_python_and_cpp_otel_tracing --metr+ 
 80 root      20   0  390960 172772  26536 S   0.0   0.3   0:01.66 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/monitor.py --logs-dir=/tmp/ray/session_2024-07-09_16-00-26_967883_7/logs+ 

The Ubuntu, Kubernetes, and Docker versions are identical on both hosts.

Could you please assist in identifying the potential factors that could be causing the variance in memory consumption among the Ray head processes?

I’m not an expert, but my understanding is that different virtualisation technologies may account for memory differently. VDI is typically OS-level virtualisation, which may under-report resource utilisation, while cloud machines are usually based on hardware virtualisation.
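
One quick way to check whether this is just an accounting difference is to compare RSS with PSS for one of the hot processes: RSS counts every shared page fully for each process that maps it, while PSS divides each shared page among its mappers, so a large RSS-to-PSS gap points at shared mappings being double-counted. A rough sketch, assuming a Linux kernel new enough (4.14+) to expose /proc/<pid>/smaps_rollup:

```python
import sys

# Parse /proc/<pid>/smaps_rollup; values are reported in kB.
def smaps_rollup(pid: int) -> dict:
    fields = {}
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2 and parts[1].isdigit():
                fields[parts[0].rstrip(":")] = int(parts[1])
    return fields

pid = int(sys.argv[1])  # usage: python rss_vs_pss.py <pid>
m = smaps_rollup(pid)
print(f"RSS: {m['Rss'] / 1024:.1f} MB, PSS: {m['Pss'] / 1024:.1f} MB")
```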

I know I’m not answering your question directly, but I’m wondering: what is your goal in analysing this?


Similar Q to @lobanov: are you trying to get this working on on-prem infra, which is why you’re looking at the VDI option?