Variance in Ray head memory consumption

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

We have observed different memory consumption for the Ray head when running on different Ubuntu hosts. One host is a VDI (Virtual Desktop Infrastructure) while the other is a cloud machine. Both hosts have similar configurations, running Ubuntu 20.04 with 64 GB of RAM and 16 cores.

In our setup, we deploy the Ray head and worker as pods within a Kubernetes cluster, with a memory limit of 1.5 GB on the head pod. When both the head and worker pods are launched, the memory usage of the head pod on the VDI host stays below the limit, at approximately 0.8 GB; on the cloud machine, however, the memory usage of the Ray head exceeds 2 GB.
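
For anyone who wants to reproduce the measurement, something like the following can read the limit and current usage from inside the pod. This is a minimal sketch assuming cgroup v1; under cgroup v2 the equivalent files are /sys/fs/cgroup/memory.max and /sys/fs/cgroup/memory.current (and memory.max can be the literal string "max" when unlimited):

```python
from pathlib import Path

def read_bytes(path: str) -> int:
    return int(Path(path).read_text().strip())

# cgroup v1 paths; see the note above for the cgroup v2 equivalents.
limit = read_bytes("/sys/fs/cgroup/memory/memory.limit_in_bytes")
usage = read_bytes("/sys/fs/cgroup/memory/memory.usage_in_bytes")
print(f"limit: {limit / 2**30:.2f} GiB, usage: {usage / 2**30:.2f} GiB")
```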

We are using Ray 2.20.0 and Python 3.10.14. We start the Ray head with the following command:
ray start --head --metrics-export-port=8090 --port=6385 --redis-shard-ports=6380,6381 --num-cpus=0 --object-manager-port=22345 --object-store-memory=200000000 --node-manager-port=22346 --dashboard-host=0.0.0.0 --block
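
As a sanity check that the head actually applied the 200000000-byte (~190 MiB) object store setting, the running cluster can be queried from a Python client. A sketch, assuming the client runs on the head node and that Ray 2.x reports these resource totals in bytes:

```python
import ray

# Attach to the already-running cluster instead of starting a new one.
ray.init(address="auto")

res = ray.cluster_resources()
# Assumption: in Ray 2.x these totals are reported in bytes.
print(f"object_store_memory: {res['object_store_memory'] / 2**20:.0f} MiB")
print(f"memory:              {res['memory'] / 2**30:.2f} GiB")
```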

We have analyzed the memory consumption of the processes running inside the Ray head pod and identified significant differences in the following processes (a sketch for collecting these per-process numbers follows the list):

raylet: 43 MB on VDI vs. 240 MB on cloud
dashboard.py: 95 MB on VDI vs. 323 MB on cloud
log_monitor.py: 59 MB on VDI vs. 156 MB on cloud
ray.util.client.server (port 23000): 124 MB on VDI vs. 398 MB on cloud
ray.util.client.server (port 10001): 72 MB on VDI vs. 301 MB on cloud
monitor.py: 60 MB on VDI vs. 169 MB on cloud
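
Here is roughly how per-process RSS numbers like the above can be collected; the psutil dependency and the command-line patterns are assumptions, not part of our deployment:

```python
import psutil  # assumption: pip install psutil inside the pod

# Command-line substrings to match; adjust to your deployment.
PATTERNS = ["raylet", "dashboard.py", "log_monitor.py",
            "ray.util.client.server", "monitor.py"]

for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
    cmd = " ".join(proc.info["cmdline"] or [])
    if any(p in cmd for p in PATTERNS):
        rss_mb = proc.info["memory_info"].rss / 2**20
        print(f"{proc.info['pid']:>6}  {rss_mb:8.1f} MB  {cmd[:80]}")
```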

Below is the full list of processes running inside the head pod, as reported by top:

%Cpu(s): 14.6 us, 7.9 sy, 0.0 ni, 76.8 id, 0.4 wa, 0.0 hi, 0.4 si, 0.0 st
MiB Mem : 64299.9 total, 3853.5 free, 24531.9 used, 35914.5 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 33835.4 avail Mem

PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                                                                                                                                                          
 10 root      20   0  714716 291648  16296 S   1.0   0.4   0:07.58 /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/gcs/gcs_server --log_dir=/tmp/ray/session_2024-07-09_16-00-26_967883_7/logs --config_list=eyJvYmplY3Rf+ 
272 root      20   0 3722744 101960  38472 S   1.0   0.2   0:10.48 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/dashboard/agent.py --node-ip-address=10.244.0.68 --metrics-export-port=8090 --dashboard-agen+ 
 82 root      20   0  781912 331036  40924 S   0.7   0.5   0:06.22 /usr/bin/python3.10 /usr/local/lib/python3.10/dist-packages/ray/dashboard/dashboard.py --host=0.0.0.0 --port=8265 --port-retries=0 --temp-dir=/tmp/ray --log-di+ 
201 root      20   0 1074516 245636  16728 S   0.7   0.4   0:07.97 /usr/local/lib/python3.10/dist-packages/ray/core/src/ray/raylet/raylet --raylet_socket_name=/tmp/ray/session_2024-07-09_16-00-26_967883_7/sockets/raylet --stor+ 
366 root      20   0 1923340 407560  56268 S   0.7   0.6   0:07.23 /usr/bin/python3.10 -m ray.util.client.server --address=10.244.0.68:6385 --host=0.0.0.0 --port=23000 --mode=specific-server                                      
 81 root      20   0  759044 308396  35396 S   0.3   0.5   0:05.03 /usr/bin/python3.10 -m ray.util.client.server --address=10.244.0.68:6385 --host=0.0.0.0 --port=10001 --mode=proxy --runtime-env-agent-address=http://10.244.0.6+ 
202 root      20   0  347940 159668  25948 S   0.3   0.2   0:04.56 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/_private/log_monitor.py --session-dir=/tmp/ray/session_2024-07-09_16-00-26_967883_7 --logs-d+ 
274 root      20   0 1698388  64352  26168 S   0.3   0.1   0:00.68 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/_private/runtime_env/agent/main.py --node-ip-address=10.244.0.68 --runtime-env-agent-port=53+ 
  7 root      20   0 1727884  83440  35744 S   0.0   0.1   0:01.23 /usr/bin/python3.10 /usr/local/bin/ray start --head --tracing-startup-hook=opwi_detection_core_infra.otel.otel_tracing:setup_python_and_cpp_otel_tracing --metr+ 
 80 root      20   0  390960 172772  26536 S   0.0   0.3   0:01.66 /usr/bin/python3.10 -u /usr/local/lib/python3.10/dist-packages/ray/autoscaler/_private/monitor.py --logs-dir=/tmp/ray/session_2024-07-09_16-00-26_967883_7/logs+ 

The Ubuntu, Kubernetes, and Docker versions are identical on both hosts.

Could you please assist in identifying the potential factors that could be causing the variance in memory consumption among the Ray head processes?

I’m not an expert, but my understanding is that different virtualisation technologies may account for memory differently. VDI is typically OS-level virtualisation, which may under-report resource utilisation, while cloud machines are usually based on hardware virtualisation.
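
One quick way to check whether this is just an accounting difference is to compare RSS with PSS for one of the hot processes: RSS counts every shared page fully for each process that maps it, while PSS divides each shared page among its mappers, so a large RSS-to-PSS gap points at shared mappings being double-counted. A rough sketch, assuming a Linux kernel new enough (4.14+) to expose /proc/<pid>/smaps_rollup:

```python
import sys

# Parse /proc/<pid>/smaps_rollup; values are reported in kB.
def smaps_rollup(pid: int) -> dict:
    fields = {}
    with open(f"/proc/{pid}/smaps_rollup") as f:
        for line in f:
            parts = line.split()
            if len(parts) >= 2 and parts[1].isdigit():
                fields[parts[0].rstrip(":")] = int(parts[1])
    return fields

pid = int(sys.argv[1])  # usage: python rss_vs_pss.py <pid>
m = smaps_rollup(pid)
print(f"RSS: {m['Rss'] / 1024:.1f} MB, PSS: {m['Pss'] / 1024:.1f} MB")
```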

I know I’m not answering your question directly, but I’m wondering: what is your goal in analysing this?


Similar Q to @lobanov: are you trying to get this working on on-prem infra, which is why you’re looking at the VDI option?