Hello,
I’m using KubeRay on an internal k8s cluster with really beefy machines (hundreds of cores, 2.3TiB RAM).
Setting up clusters through the Helm chart and the KubeRay operator generally works without a problem:
helm install testcluster kuberay/ray-cluster -f values.yaml --version 1.2.2
But as soon as I raise the Ray worker memory limit to a value equal to or above 4GiB, no worker is able to connect to the head pod anymore. This is entirely reproducible, and the boundary between where it works and where it doesn't is extremely sharp.
worker:
  resources:
    limits:
      cpu: "1"
      memory: "4Gi" # <-- this setting is changed
    requests:
      cpu: "1"
      memory: "1G"
This is what it looks like when limits is below 4GiB.
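The output below comes straight from kubectl logs on the worker pod; I don't select a container, hence the "Defaulted container" line (the pod name is a placeholder):

kubectl logs <worker-pod-name>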
Defaulted container "ray-worker" out of: ray-worker, wait-gcs-ready (init)
2024-12-12 04:31:44,878 INFO scripts.py:946 -- Local node IP: 10.236.217.236
[2024-12-12 04:31:44,938 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 6fdc2334df20c69d556553032fb4b03e7e9cfa0f53d9aba053669b63
2024-12-12 04:31:45,940 SUCC scripts.py:959 -- --------------------
2024-12-12 04:31:45,941 SUCC scripts.py:960 -- Ray runtime started.
2024-12-12 04:31:45,941 SUCC scripts.py:961 -- --------------------
2024-12-12 04:31:45,941 INFO scripts.py:963 -- To terminate the Ray runtime, run
2024-12-12 04:31:45,941 INFO scripts.py:964 -- ray stop
2024-12-12 04:31:45,941 INFO scripts.py:972 -- --block
2024-12-12 04:31:45,941 INFO scripts.py:973 -- This command will now block forever until terminated by a signal.
2024-12-12 04:31:45,941 INFO scripts.py:976 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
And this is what it looks like when it's equal to or above 4GiB:
Defaulted container "ray-worker" out of: ray-worker, wait-gcs-ready (init)
2024-12-12 04:56:06,823 INFO scripts.py:946 -- Local node IP: 10.236.219.78
2024-12-12 04:56:06,843 WARNING services.py:2009 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 1000001536 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=1.28gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[2024-12-12 04:56:06,933 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:07,934 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:08,935 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:09,936 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:10,937 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:11,938 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:12,939 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:13,940 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:14,941 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:15,941 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2612, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/scripts/scripts.py", line 948, in start
    node = ray._private.node.Node(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/node.py", line 349, in __init__
    node_info = ray._private.services.get_node(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/services.py", line 486, in get_node
    return global_state.get_node(node_id)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/state.py", line 806, in get_node
    return self.global_state_accessor.get_node(node_id)
  File "python/ray/includes/global_state_accessor.pxi", line 270, in ray._raylet.GlobalStateAccessor.get_node
RuntimeError: b'GCS cannot find the node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13'
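The only difference I can spot before things fall over is the /dev/shm warning at the top of the failing log, so maybe the object store falling back to /tmp plays a role. If a larger /dev/shm is what's needed, I assume the usual Kubernetes pattern applies, i.e. a Memory-backed emptyDir mounted at /dev/shm in the worker pod template. Purely as a sketch of that pattern (I haven't checked which values.yaml keys of the chart this maps to):

# Generic pod-spec sketch, not chart-specific keys
volumes:
  - name: dshm
    emptyDir:
      medium: Memory    # tmpfs; counts against the pod's memory limit
      sizeLimit: 2Gi    # example size
containers:
  - name: ray-worker
    volumeMounts:
      - name: dshm
        mountPath: /dev/shm

That said, the main question remains: why does crossing the 4GiB worker memory limit make the node unable to register with the GCS at all? Any pointers would be appreciated.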