Trying to run distributed training with Docker containers

I am trying to run Ray through Docker on two different nodes. When I initialize one container and run the Ray head it works, but when I try to run the worker on the other node, it works for a couple of seconds and then crashes. According to the logs I have checked, it has to do with a lack of memory. I run the container with:

```
sudo -E docker run \
  -p 6381:6381 \
  -p 8265:8265 \
  --user root \
  --shm-size=10G \
  --runtime=nvidia \
  -it \
  -v /workspace1/sdonoso:/workspace1/sdonoso \
  -v /shared/data:/shared/data \
  -e CUDA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES" \
  -e LD_LIBRARY_PATH="$LD_LIBRARY_PATH" \
  rayproject/ray:nightly-py311-cu124
```

and the Ray head:

```
CUDA_VISIBLE_DEVICES=1 ray start --head --port=6381 --num-gpus=1 --num-cpus=8 --temp-dir=/workspace1/sdonoso/ray_tmp
```

and the worker:

```
CUDA_VISIBLE_DEVICES=1 ray start --address="146.155.155.83:6381" --num-gpus=1 --num-cpus=8 --temp-dir=/workspace1/sdonoso/ray_tmp
```
Here is an example of my logs after the crash. I tried increasing --shm-size but I get the same error.

```
[2025-02-11 11:17:17,855 C 139 139] node_manager.cc:1015: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
[...]

[raylet.out]
[2025-02-11 11:17:00,787 W 139 139] store_runner.cc:66: System memory request exceeds memory available in /dev/shm. The request is for 10200547328 bytes, and the amount available is 9663676416 bytes. You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
[...]
[state-dump] InitialConfigResources: {accelerator_type:A100: 10000, memory: 10716252200960000, object_store_memory: 102005473280000, node:172.17.0.2: 10000, GPU: 10000, CPU: 30000}
[...]
[2025-02-11 11:17:17,855 C 139 139] node_manager.cc:1015: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
```

Hi sdonoso! Welcome to the Ray community 🙂

There are a few different reasons this can happen, and most of them come down to resource management.

Since you already tried bumping up --shm-size, make sure it actually meets the needs of your task. From your error, it seems like you need at least 10 GB, so double-check that your shared memory size is set to at least that value. You can also try scaling back the number of concurrent tasks by requesting more CPUs per task; for instance, if a task is memory-intensive, assign more CPUs to it with ray.remote(num_cpus=<value>).
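One quick sanity check is to confirm the setting actually took effect inside the container (a minimal check; /dev/shm is Docker's default shared-memory mount):

```
# Inside the worker container: the "Size" column should match the --shm-size
# passed to `docker run`, and should be at least as large as what the object
# store is requesting (~10.2 GB in your error message).
df -h /dev/shm
```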

Some other things might help too: you can tweak the RAY_memory_usage_threshold environment variable when you start Ray. This can help ward off workers being killed prematurely due to memory spikes.
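For example, something along these lines on the worker (just a sketch: 0.95 is an arbitrary illustrative value, and the other flags mirror the worker command you posted):

```
# Start the worker raylet with a higher memory-usage threshold so the memory
# monitor tolerates more node memory usage before it starts killing workers.
RAY_memory_usage_threshold=0.95 ray start \
  --address="146.155.155.83:6381" \
  --num-gpus=1 --num-cpus=8 \
  --temp-dir=/workspace1/sdonoso/ray_tmp
```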

Ray also has some helpful tools for debugging. In your case, the ray memory command will help uncover memory usage and spot lingering object references.
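For example, from any node attached to the cluster (assuming the Ray CLI is available inside the container):

```
# Summarize object store usage and outstanding ObjectRef references across the cluster.
ray memory

# Also handy: show cluster resources and node status as the GCS sees them.
ray status
```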

Here’s our doc on memory resource management:

Thanks for the response.
I tried increasing --shm-size to 100 GB (my nodes have 1 TB of RAM), but I got the same error. Additionally, I set RAY_memory_usage_threshold=0.1, but when the threshold is exceeded, the logs indicate that there is no Ray process that can be killed.

Maybe I didn’t explain myself well in the previous post, but the worker crashes before it can start any training. This means that I initialize the Ray head process inside the container, and once it is running, I initialize the worker in the container located on the other node. The worker manages to start, but after 2 seconds, it crashes and outputs the following log:

```
[2025-02-13 12:08:15,896 I 141 141] main.cc:204: Setting cluster ID to: 718887e4b3ef3b1434d496ca9a6f72d2c318acb810fb66f263dc11f3
[2025-02-13 12:08:15,905 I 141 141] main.cc:319: Raylet is not set to kill unknown children.
[2025-02-13 12:08:15,905 I 141 141] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2025-02-13 12:08:15,905 I 141 141] main.cc:449: Setting node ID node_id=9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e
[2025-02-13 12:08:15,906 I 141 141] store_runner.cc:33: Allowing the Plasma store to use up to 9GB of memory.
[2025-02-13 12:08:15,906 I 141 141] store_runner.cc:49: Starting object store with directory /dev/shm, fallback /workspace1/sdonoso/ray_tmp, and huge page support disabled
[2025-02-13 12:08:15,906 I 141 174] dlmalloc.cc:154: create_and_mmap_buffer(9000058888, /dev/shm/plasmaXXXXXX)
[2025-02-13 12:08:15,907 I 141 174] store.cc:564: Plasma store debug dump:
Current usage: 0 / 9 GB
- num bytes created total: 0
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 0
- bytes in use: 0
- objects evictable: 0
- bytes evictable: 0

- objects created by worker: 0
- bytes created by worker: 0
- objects restored: 0
- bytes restored: 0
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0

[2025-02-13 12:08:15,908 I 141 141] grpc_server.cc:135: ObjectManager server started, listening on port 43089.
[2025-02-13 12:08:15,909 I 141 141] worker_killing_policy.cc:101: Running GroupByOwner policy.
[2025-02-13 12:08:15,911 I 141 141] memory_monitor.cc:47: MemoryMonitor initialized with usage threshold at 108197437440 bytes (0.10 system memory), total system memory bytes: 1081974374400
[2025-02-13 12:08:15,911 I 141 141] node_manager.cc:296: Initializing NodeManager node_id=9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e
[2025-02-13 12:08:15,912 I 141 141] grpc_server.cc:135: NodeManager server started, listening on port 33895.
[2025-02-13 12:08:15,916 I 141 202] agent_manager.cc:78: Monitor agent process with name dashboard_agent/424238335
[2025-02-13 12:08:15,917 I 141 204] agent_manager.cc:78: Monitor agent process with name runtime_env_agent
[2025-02-13 12:08:15,918 I 141 141] event.cc:496: Ray Event initialized for RAYLET
[2025-02-13 12:08:15,918 I 141 141] event.cc:327: Set ray event level to warning
[2025-02-13 12:08:15,919 I 141 141] memory_monitor.cc:88: Node memory usage above threshold, used: 855384137728, threshold_bytes: 108197437440, total bytes: 1081974374400, threshold fraction: 0.1
[2025-02-13 12:08:15,926 W 141 141] node_manager.cc:2995: Memory usage above threshold but no workers are available for killing.This could be due to worker memory leak andidle worker are occupying most of the memory.
[2025-02-13 12:08:15,926 I 141 141] raylet.cc:134: Raylet of id, 9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e started. Raylet consists of node_manager and object_manager. node_manager address: 0.0.0.0:33895 object_manager address: 0.0.0.0:43089 hostname: db32f45f72fd
[2025-02-13 12:08:15,929 I 141 141] node_manager.cc:533: [state-dump] NodeManager:
[state-dump] Node ID: 9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e
[state-dump] Node name: 0.0.0.0
[state-dump] InitialConfigResources: {CPU: 80000, object_store_memory: 90000000000000, node:0.0.0.0: 10000, accelerator_type:A100: 10000, GPU: 10000, memory: 10728238709760000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state:
[state-dump] Local id: 8000749833399851366 Local resources: {"total":{object_store_memory: [90000000000000], node:0.0.0.0: [10000], CPU: [80000], GPU: [10000], accelerator_type:A100: [10000], memory: [10728238709760000]}}, "available": {object_store_memory: [90000000000000], node:0.0.0.0: [10000], CPU: [80000], GPU: [10000], accelerator_type:A100: [10000], memory: [10728238709760000]}}, "labels":{"ray.io/node_id":"9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e",} is_draining: 0 is_idle: 1 Cluster resources: node id: 8000749833399851366{"total":{node:0.0.0.0: 10000, memory: 10728238709760000, accelerator_type:A100: 10000, GPU: 10000, object_store_memory: 90000000000000, CPU: 80000}}, "available": {node:0.0.0.0: 10000, memory: 10728238709760000, accelerator_type:A100: 10000, GPU: 10000, object_store_memory: 90000000000000, CPU: 80000}}, "labels":{"ray.io/node_id":"9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1} { "placment group locations": [], "node to bundles": []}
[state-dump] Waiting tasks size: 0
[state-dump] Number of executing tasks: 0
[state-dump] Number of pinned task arguments: 0
[state-dump] Number of total spilled tasks: 0
[state-dump] Number of spilled waiting tasks: 0
[state-dump] Number of spilled unschedulable tasks: 0
[state-dump] Resource usage {
[state-dump] }
[state-dump] Backlog Size per scheduling descriptor :{workerId: num backlogs}:
[state-dump]
[state-dump] Running tasks by scheduling class:
[state-dump] ==================================================
[state-dump]
[state-dump] ClusterResources:
[state-dump] LocalObjectManager:
[state-dump] - num pinned objects: 0
[state-dump] - pinned objects size: 0
[state-dump] - num objects pending restore: 0
[state-dump] - num objects pending spill: 0
[state-dump] - num bytes pending spill: 0
[state-dump] - num bytes currently spilled: 0
[state-dump] - cumulative spill requests: 0
[state-dump] - cumulative restore requests: 0
[state-dump] - spilled objects pending delete: 0
[state-dump]
[state-dump] ObjectManager:
[state-dump] - num local objects: 0
[state-dump] - num unfulfilled push requests: 0
[state-dump] - num object pull requests: 0
[state-dump] - num chunks received total: 0
[state-dump] - num chunks received failed (all): 0
[state-dump] - num chunks received failed / cancelled: 0
[state-dump] - num chunks received failed / plasma error: 0
[state-dump] Event stats:
[state-dump] Global stats: 0 total (0 active)
[state-dump] Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] Execution time:  mean = -nan s, total = 0.000 s
[state-dump] Event stats:
[state-dump] PushManager:
[state-dump] - num pushes in flight: 0
[state-dump] - num chunks in flight: 0
[state-dump] - num chunks remaining: 0
[state-dump] - max chunks allowed: 409
[state-dump] OwnershipBasedObjectDirectory:
[state-dump] - num listeners: 0
[state-dump] - cumulative location updates: 0
[state-dump] - num location updates per second: 70262595245916000.000
[state-dump] - num location lookups per second: 70262595245904000.000
[state-dump] - num locations added per second: 0.000
[state-dump] - num locations removed per second: 0.000
[state-dump] BufferPool:
[state-dump] - create buffer state map size: 0
[state-dump] PullManager:
[state-dump] - num bytes available for pulled objects: 9000000000
[state-dump] - num bytes being pulled (all): 0
[state-dump] - num bytes being pulled / pinned: 0
[state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - first get request bundle: N/A
[state-dump] - first wait request bundle: N/A
[state-dump] - first task request bundle: N/A
[state-dump] - num objects queued: 0
[state-dump] - num objects actively pulled (all): 0
[state-dump] - num objects actively pulled / pinned: 0
[state-dump] - num bundles being pulled: 0
[state-dump] - num pull retries: 0
[state-dump] - max timeout seconds: 0
[state-dump] - max timeout request is already processed. No entry.
[state-dump]
[state-dump] WorkerPool:
[state-dump] - registered jobs: 0
[state-dump] - process_failed_job_config_missing: 0
[state-dump] - process_failed_rate_limited: 0
[state-dump] - process_failed_pending_registration: 0
[state-dump] - process_failed_runtime_env_setup_failed: 0
[state-dump] - num PYTHON workers: 0
[state-dump] - num PYTHON drivers: 0
[state-dump] - num PYTHON pending start requests: 0
[state-dump] - num PYTHON pending registration requests: 0
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num idle workers: 0
[state-dump] TaskDependencyManager:
[state-dump] - task deps map size: 0
[state-dump] - get req map size: 0
[state-dump] - wait req map size: 0
[state-dump] - local objects map size: 0
[state-dump] WaitManager:
[state-dump] - num active wait requests: 0
[state-dump] Subscriber:
[state-dump] Channel WORKER_OBJECT_EVICTION
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_REF_REMOVED_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] num async plasma notifications: 0
[state-dump] Remote node managers:
[state-dump] Event stats:
[state-dump] Global stats: 27 total (13 active)
[state-dump] Queueing time: mean = 2.539 ms, max = 15.363 ms, min = 14.930 us, total = 68.563 ms
[state-dump] Execution time:  mean = 1.043 ms, total = 28.166 ms
[state-dump] Event stats:
[state-dump] 	PeriodicalRunner.RunFnPeriodically - 11 total (2 active, 1 running), Execution time: mean = 903.800 us, total = 9.942 ms, Queueing time: mean = 5.619 ms, max = 15.363 ms, min = 22.380 us, total = 61.809 ms
[state-dump] 	NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 216.720 us, total = 216.720 us, Queueing time: mean = 6.739 ms, max = 6.739 ms, min = 6.739 ms, total = 6.739 ms
[state-dump] 	ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 1.423 ms, total = 1.423 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 13.241 ms, total = 13.241 ms, Queueing time: mean = 14.930 us, max = 14.930 us, min = 14.930 us, total = 14.930 us
[state-dump] 	ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 1.489 ms, total = 1.489 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeManager.deadline_timer.record_metrics - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 1.854 ms, total = 1.854 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] 	ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] DebugString() time ms: 0
[state-dump]
[state-dump]
[2025-02-13 12:08:15,930 I 141 141] accessor.cc:777: Received notification for node, IsAlive = 1 node_id=55d4e7fb9e048cde1e170bab6ff9d4e58de597ec9560b9928b29caf1
[2025-02-13 12:08:15,930 I 141 141] accessor.cc:777: Received notification for node, IsAlive = 1 node_id=9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e
[2025-02-13 12:08:20,946 I 141 141] memory_monitor.cc:88: Node memory usage above threshold, used: 855545192448, threshold_bytes: 108197437440, total bytes: 1081974374400, threshold fraction: 0.1
[2025-02-13 12:08:20,960 W 141 141] node_manager.cc:2995: Memory usage above threshold but no workers are available for killing.This could be due to worker memory leak andidle worker are occupying most of the memory.
[2025-02-13 12:08:21,224 W 141 141] memory_monitor.cc:324: Got zero used memory for smap file /proc/306/smaps_rollup
[2025-02-13 12:08:25,947 I 141 141] memory_monitor.cc:88: Node memory usage above threshold, used: 855431766016, threshold_bytes: 108197437440, total bytes: 1081974374400, threshold fraction: 0.1
[2025-02-13 12:08:26,218 W 141 141] node_manager.cc:2995: Memory usage above threshold but no workers are available for killing.This could be due to worker memory leak andidle worker are occupying most of the memory.
[2025-02-13 12:08:31,158 I 141 141] memory_monitor.cc:88: Node memory usage above threshold, used: 855438663680, threshold_bytes: 108197437440, total bytes: 1081974374400, threshold fraction: 0.1
[2025-02-13 12:08:31,427 W 141 141] node_manager.cc:2995: Memory usage above threshold but no workers are available for killing.This could be due to worker memory leak andidle worker are occupying most of the memory.
[2025-02-13 12:08:32,923 I 141 141] accessor.cc:777: Received notification for node, IsAlive = 0 node_id=9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e
[2025-02-13 12:08:32,977 C 141 141] node_manager.cc:1015: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xdb334a) [0x556a05a7934a] ray::operator<<()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xdb57d1) [0x556a05a7b7d1] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x32e062) [0x556a04ff4062] ray::raylet::NodeManager::NodeRemoved()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x55ca55) [0x556a05222a55] ray::gcs::NodeInfoAccessor::HandleNotification()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x6da028) [0x556a053a0028] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x6d501e) [0x556a0539b01e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x6d5496) [0x556a0539b496] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xd9153b) [0x556a05a5753b] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xd93ac9) [0x556a05a59ac9] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xd93fe2) [0x556a05a59fe2] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x1ecd5f) [0x556a04eb2d5f] main
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fce925cbd90]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fce925cbe40] __libc_start_main
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x2470a7) [0x556a04f0d0a7]
```

Hi @sdonoso and @christina - I want to amplify this issue.

We’ve been using Ray for more than a year now with multiple kuberay clusters on GCP. This week I started looking into ways to attach containerised edge-workers to the GCP-located head node. This is the first time we want to use this pattern at a larger scale. I ran into exactly the same behaviour.

The edge workers are well provisioned with 32 CPUs, 2 GPUs.
About 20 s after connecting successfully to the head node, the connection drops. To be more specific, the worker shows up in the dashboard after a manual ray start command within the container (pointing at the head node's address), but then drops out.

The cluster appears to identify the worker properly and assigns it to currently running jobs. An example log from such a job is shown below. We can see the worker getting attached to the cluster, followed by a crash report, and a downscaling of the cluster.

(autoscaler +49m0s) Resized to 32 CPUs, 2 GPUs.
(raylet) The node with node id: b7ac2bd2621e1fbced76f4809e252bbd4811ce4c9992cb3abf0e03f7 and address: 172.17.0.2 and node name: 172.17.0.2 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a 	(1) raylet crashes unexpectedly (OOM, preempted node, etc.)
	(2) raylet has lagging heartbeats due to slow network or busy workload.
(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    
    [2025-02-26 14:09:00,733 I 38 38] (raylet) accessor.cc:630: Received notification for node id = c30ce4ecb55d6c9c84e4b4c6febadd103f8d70c9ccd17a5093e96cda, IsAlive = 0
    [2025-02-26 14:09:00,791 C 38 38] (raylet) node_manager.cc:1028: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
    *** StackTrace Information ***
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xbaed2a) [0x626a0ae1cd2a] ray::operator<<()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xbb04e7) [0x626a0ae1e4e7] ray::SpdLogMessage::Flush()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xbb0987) [0x626a0ae1e987] ray::RayLog::~RayLog()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x2f88b1) [0x626a0a5668b1] ray::raylet::NodeManager::NodeRemoved()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x4d5a37) [0x626a0a743a37] ray::gcs::NodeInfoAccessor::HandleNotification()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x5dcdde) [0x626a0a84adde] EventTracker::RecordExecution()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x5d61ce) [0x626a0a8441ce] std::_Function_handler<>::_M_invoke()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x5d6646) [0x626a0a844646] boost::asio::detail::completion_handler<>::do_complete()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xc9092b) [0x626a0aefe92b] boost::asio::detail::scheduler::do_run_one()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xc92eb9) [0x626a0af00eb9] boost::asio::detail::scheduler::run()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xc933d2) [0x626a0af013d2] boost::asio::io_context::run()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x1d6556) [0x626a0a444556] main
    /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7beb6482ad90]
    /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7beb6482ae40] __libc_start_main
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x22cdd7) [0x626a0a49add7]
    

(raylet) Raylet is terminated. Termination is unexpected. Possible reasons include: (1) SIGKILL by the user or system OOM killer, (2) Invalid memory access from Raylet causing SIGSEGV or SIGBUS, (3) Other termination signals. Last 20 lines of the Raylet logs:
    
    [2025-02-26 14:34:31,241 I 36 36] (raylet) accessor.cc:630: Received notification for node id = b7ac2bd2621e1fbced76f4809e252bbd4811ce4c9992cb3abf0e03f7, IsAlive = 0
    [2025-02-26 14:34:31,300 C 36 36] (raylet) node_manager.cc:1028: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
    *** StackTrace Information ***
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xbaed2a) [0x573f71eaad2a] ray::operator<<()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xbb04e7) [0x573f71eac4e7] ray::SpdLogMessage::Flush()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xbb0987) [0x573f71eac987] ray::RayLog::~RayLog()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x2f88b1) [0x573f715f48b1] ray::raylet::NodeManager::NodeRemoved()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x4d5a37) [0x573f717d1a37] ray::gcs::NodeInfoAccessor::HandleNotification()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x5dcdde) [0x573f718d8dde] EventTracker::RecordExecution()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x5d61ce) [0x573f718d21ce] std::_Function_handler<>::_M_invoke()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x5d6646) [0x573f718d2646] boost::asio::detail::completion_handler<>::do_complete()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xc9092b) [0x573f71f8c92b] boost::asio::detail::scheduler::do_run_one()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xc92eb9) [0x573f71f8eeb9] boost::asio::detail::scheduler::run()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xc933d2) [0x573f71f8f3d2] boost::asio::io_context::run()
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x1d6556) [0x573f714d2556] main
    /usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7c3d6bba9d90]
    /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7c3d6bba9e40] __libc_start_main
    /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x22cdd7) [0x573f71528dd7]
    

(raylet, ip=172.17.0.2) [2025-02-26 14:34:31,300 C 36 36] (raylet) node_manager.cc:1028: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded. [repeated 2x across cluster]
(raylet, ip=172.17.0.2) *** StackTrace Information *** [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xbaed2a) [0x573f71eaad2a] ray::operator<<() [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xbb04e7) [0x573f71eac4e7] ray::SpdLogMessage::Flush() [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xbb0987) [0x573f71eac987] ray::RayLog::~RayLog() [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x2f88b1) [0x573f715f48b1] ray::raylet::NodeManager::NodeRemoved() [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x4d5a37) [0x573f717d1a37] ray::gcs::NodeInfoAccessor::HandleNotification() [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x5dcdde) [0x573f718d8dde] EventTracker::RecordExecution() [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x5d61ce) [0x573f718d21ce] std::_Function_handler<>::_M_invoke() [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x5d6646) [0x573f718d2646] boost::asio::detail::completion_handler<>::do_complete() [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xc9092b) [0x573f71f8c92b] boost::asio::detail::scheduler::do_run_one() [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xc92eb9) [0x573f71f8eeb9] boost::asio::detail::scheduler::run() [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0xc933d2) [0x573f71f8f3d2] boost::asio::io_context::run() [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/local/lib/python3.11/dist-packages/ray/core/src/ray/raylet/raylet(+0x1d6556) [0x573f714d2556] main [repeated 2x across cluster]
(raylet, ip=172.17.0.2) /usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7c3d6bba9e40] __libc_start_main [repeated 2x across cluster]
(raylet, ip=172.17.0.2) [repeated 6x across cluster]
(autoscaler +49m30s) Resized to 0 CPUs.

Related issues I have looked at include notes from the Cohere team a couple of months ago when setting up a manual cluster:

Node dying because of missing too many heartbeats · Issue #44680 · ray-project/ray
[gcp] Node mistakenly marked dead: increase heartbeat timeout? · Issue #16945 · ray-project/ray

More details about my reproducible example:

ARG NVCR_CONTAINER_REPO=nvcr.io
FROM $NVCR_CONTAINER_REPO/nvidia/cuda:12.6.3-devel-ubuntu22.04

# Python
WORKDIR /opt
RUN apt-get update && \
    apt-get install -y python3.11 python3.11-distutils python3.11-dev python3-pip && \
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1 && \
    update-alternatives --set python3 /usr/bin/python3.11

# Other libraries, incl. Ray
RUN --mount=from=ghcr.io/astral-sh/uv,source=/uv,target=/bin/uv \
    uv pip install --system ray[default]==2.23.0

# NVCC and other ENVs
ENV CUDA_HOME=/usr/local/cuda
ENV PATH=${CUDA_HOME}/bin:${PATH}
ENV LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}

# Entrypoint
COPY ./halo/entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh

EXPOSE 8000
EXPOSE 8265
ENTRYPOINT ["/entrypoint.sh"]

The entrypoint.sh effectively just calls bash in this minimal reprex, and I am connecting manually with:

ray start --address=XYZ:6379 --verbose --block

The head node is running on GCP via KubeRay with the rayproject/ray:2.40.0 base image.

My test worker is running on a Lambda workstation with 2x RTX 4090 cards. I noticed that my worker is running Ray 2.23.0.

I am now looking into bumping the version and will report back.

@sdonoso - are you running the same Ray version in both of your containers?
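(For reference, a quick way to compare is to run this inside each container:)

```
# Print the Ray version installed on this node; head and workers should match.
ray --version
python3 -c "import ray; print(ray.__version__)"
```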

Hi @Niklas_Rindtorff and @sdonoso! I just wanted to let you know that I'm actively looking into this and will reply as soon as I get an answer. Thank you for escalating it!
