KubeRay clusters fail to start when worker memory limit is >= 4GiB

Hello,

I’m using KubeRay on an internal k8s cluster with really beefy machines (hundreds of cores, 2.3TiB RAM).

Setting up clusters through the Helm chart and the KubeRay operator generally works without a problem:

helm install testcluster kuberay/ray-cluster -f values.yaml --version 1.2.2 

But as soon as I raise the Ray worker memory limit to a value equal to or above 4GiB, no worker is able to connect to the head pod anymore. This is entirely reproducible, and there is an extremely sharp boundary between where it works and where it doesn't.

worker:
  resources:
    limits:
      cpu: "1"
      memory: "4Gi"  # <-- this setting is changed
    requests:
      cpu: "1"
      memory: "1G"

This is how it looks when the limit is below 4GiB:

Defaulted container "ray-worker" out of: ray-worker, wait-gcs-ready (init)
[2024-12-12 04:31:44,938 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 6fdc2334df20c69d556553032fb4b03e7e9cfa0f53d9aba053669b63
2024-12-12 04:31:44,878	INFO scripts.py:946 -- Local node IP: 10.236.217.236
2024-12-12 04:31:45,940	SUCC scripts.py:959 -- --------------------
2024-12-12 04:31:45,941	SUCC scripts.py:960 -- Ray runtime started.
2024-12-12 04:31:45,941	SUCC scripts.py:961 -- --------------------
2024-12-12 04:31:45,941	INFO scripts.py:963 -- To terminate the Ray runtime, run
2024-12-12 04:31:45,941	INFO scripts.py:964 --   ray stop
2024-12-12 04:31:45,941	INFO scripts.py:972 -- --block
2024-12-12 04:31:45,941	INFO scripts.py:973 -- This command will now block forever until terminated by a signal.
2024-12-12 04:31:45,941	INFO scripts.py:976 -- Running subprocesses are monitored and a message will be printed if any of them terminate unexpectedly. Subprocesses exit with SIGTERM will be treated as graceful, thus NOT reported.
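
For completeness: both log dumps in this post are just plain kubectl logs on the respective worker pod (the pod name below is a placeholder):

# kubectl picks the ray-worker container by default, hence the "Defaulted container" line
kubectl logs <worker-pod-name>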

This is how it looks when the limit is equal to or above 4GiB:

Defaulted container "ray-worker" out of: ray-worker, wait-gcs-ready (init)
2024-12-12 04:56:06,843	WARNING services.py:2009 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 1000001536 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=1.28gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
[2024-12-12 04:56:06,933 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:07,934 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:08,935 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:09,936 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:10,937 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:11,938 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:12,939 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:13,940 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:14,941 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
[2024-12-12 04:56:15,941 W 8 8] global_state_accessor.cc:437: Retrying to get node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13
Traceback (most recent call last):
  File "/home/ray/anaconda3/bin/ray", line 8, in <module>
2024-12-12 04:56:06,823	INFO scripts.py:946 -- Local node IP: 10.236.219.78
    sys.exit(main())
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/scripts/scripts.py", line 2612, in main
    return cli()
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/autoscaler/_private/cli_logger.py", line 856, in wrapper
    return f(*args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/scripts/scripts.py", line 948, in start
    node = ray._private.node.Node(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/node.py", line 349, in __init__
    node_info = ray._private.services.get_node(
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/services.py", line 486, in get_node
    return global_state.get_node(node_id)
  File "/home/ray/anaconda3/lib/python3.10/site-packages/ray/_private/state.py", line 806, in get_node
    return self.global_state_accessor.get_node(node_id)
  File "python/ray/includes/global_state_accessor.pxi", line 270, in ray._raylet.GlobalStateAccessor.get_node
RuntimeError: b'GCS cannot find the node with node ID 09be14ce03b62ea760f96cce6619bc102a8c41ab9121ec92ce821c13'
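
One more observation: the warning at the top of the failing log says /dev/shm only has 1000001536 bytes available, which happens to be almost exactly the 1G memory request rather than the 4Gi limit. To see how large the shared-memory mount actually is inside a worker pod, something like this works (pod name is again a placeholder):

# Check the actual /dev/shm size in the worker container
kubectl exec <worker-pod-name> -- df -h /dev/shm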

Hi, I cannot reproduce this error in my own cluster. Could you provide your full RayCluster yaml file or your values.yaml?

Sorry, since the spam filter blocked this thread at first for some reason, I created a GitHub issue instead.

I think the problem is understood and potentially already solved there. The issue: [Bug] KubeRay cluster fails to start whenever worker memory limits >=4GiB · Issue #2640 · ray-project/kuberay · GitHub

Thank you for your reply!