Thanks for the response.
I tried increasing `--shm-size` to 100GB (my nodes have 1TB of RAM), but I got the same error. I also set `RAY_memory_usage_threshold=0.1`, but when that threshold is exceeded, the logs say there is no Ray process that can be killed.
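For reference, this is roughly how I bring everything up on both nodes; the image name, network options, port, and head address below are placeholders rather than my exact values:

```bash
# Head node: container started with a large shared-memory segment (image name is a placeholder)
docker run -it --network host --shm-size=100g my-ray-image bash
# inside the head container
RAY_memory_usage_threshold=0.1 ray start --head --port=6379

# Worker node: same image and --shm-size, pointing at the head's address (placeholder IP)
docker run -it --network host --shm-size=100g my-ray-image bash
# inside the worker container
RAY_memory_usage_threshold=0.1 ray start --address=<head-node-ip>:6379
```

The exact image and extra flags differ a bit, but `--shm-size` and the environment variable are set the same way on both nodes.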
Maybe I didn’t explain myself well in the previous post: the worker crashes before it can start any training. Concretely, I start the Ray head process inside the container on one node, and once it is running, I start the worker in the container on the other node. The worker comes up, but after 2 seconds it crashes and prints the following log:
```
[2025-02-13 12:08:15,896 I 141 141] main.cc:204: Setting cluster ID to: 718887e4b3ef3b1434d496ca9a6f72d2c318acb810fb66f263dc11f3
[2025-02-13 12:08:15,905 I 141 141] main.cc:319: Raylet is not set to kill unknown children.
[2025-02-13 12:08:15,905 I 141 141] io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2025-02-13 12:08:15,905 I 141 141] main.cc:449: Setting node ID node_id=9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e
[2025-02-13 12:08:15,906 I 141 141] store_runner.cc:33: Allowing the Plasma store to use up to 9GB of memory.
[2025-02-13 12:08:15,906 I 141 141] store_runner.cc:49: Starting object store with directory /dev/shm, fallback /workspace1/sdonoso/ray_tmp, and huge page support disabled
[2025-02-13 12:08:15,906 I 141 174] dlmalloc.cc:154: create_and_mmap_buffer(9000058888, /dev/shm/plasmaXXXXXX)
[2025-02-13 12:08:15,907 I 141 174] store.cc:564: Plasma store debug dump:
Current usage: 0 / 9 GB
- num bytes created total: 0
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 0
- bytes in use: 0
- objects evictable: 0
- bytes evictable: 0
- objects created by worker: 0
- bytes created by worker: 0
- objects restored: 0
- bytes restored: 0
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0
[2025-02-13 12:08:15,908 I 141 141] grpc_server.cc:135: ObjectManager server started, listening on port 43089.
[2025-02-13 12:08:15,909 I 141 141] worker_killing_policy.cc:101: Running GroupByOwner policy.
[2025-02-13 12:08:15,911 I 141 141] memory_monitor.cc:47: MemoryMonitor initialized with usage threshold at 108197437440 bytes (0.10 system memory), total system memory bytes: 1081974374400
[2025-02-13 12:08:15,911 I 141 141] node_manager.cc:296: Initializing NodeManager node_id=9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e
[2025-02-13 12:08:15,912 I 141 141] grpc_server.cc:135: NodeManager server started, listening on port 33895.
[2025-02-13 12:08:15,916 I 141 202] agent_manager.cc:78: Monitor agent process with name dashboard_agent/424238335
[2025-02-13 12:08:15,917 I 141 204] agent_manager.cc:78: Monitor agent process with name runtime_env_agent
[2025-02-13 12:08:15,918 I 141 141] event.cc:496: Ray Event initialized for RAYLET
[2025-02-13 12:08:15,918 I 141 141] event.cc:327: Set ray event level to warning
[2025-02-13 12:08:15,919 I 141 141] memory_monitor.cc:88: Node memory usage above threshold, used: 855384137728, threshold_bytes: 108197437440, total bytes: 1081974374400, threshold fraction: 0.1
[2025-02-13 12:08:15,926 W 141 141] node_manager.cc:2995: Memory usage above threshold but no workers are available for killing.This could be due to worker memory leak andidle worker are occupying most of the memory.
[2025-02-13 12:08:15,926 I 141 141] raylet.cc:134: Raylet of id, 9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e started. Raylet consists of node_manager and object_manager. node_manager address: 0.0.0.0:33895 object_manager address: 0.0.0.0:43089 hostname: db32f45f72fd
[2025-02-13 12:08:15,929 I 141 141] node_manager.cc:533: [state-dump] NodeManager:
[state-dump] Node ID: 9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e
[state-dump] Node name: 0.0.0.0
[state-dump] InitialConfigResources: {CPU: 80000, object_store_memory: 90000000000000, node:0.0.0.0: 10000, accelerator_type:A100: 10000, GPU: 10000, memory: 10728238709760000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state:
[state-dump] Local id: 8000749833399851366 Local resources: {"total":{object_store_memory: [90000000000000], node:0.0.0.0: [10000], CPU: [80000], GPU: [10000], accelerator_type:A100: [10000], memory: [10728238709760000]}}, "available": {object_store_memory: [90000000000000], node:0.0.0.0: [10000], CPU: [80000], GPU: [10000], accelerator_type:A100: [10000], memory: [10728238709760000]}}, "labels":{"ray.io/node_id":"9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e",} is_draining: 0 is_idle: 1 Cluster resources: node id: 8000749833399851366{"total":{node:0.0.0.0: 10000, memory: 10728238709760000, accelerator_type:A100: 10000, GPU: 10000, object_store_memory: 90000000000000, CPU: 80000}}, "available": {node:0.0.0.0: 10000, memory: 10728238709760000, accelerator_type:A100: 10000, GPU: 10000, object_store_memory: 90000000000000, CPU: 80000}}, "labels":{"ray.io/node_id":"9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e",}, "is_draining": 0, "draining_deadline_timestamp_ms": -1} { "placment group locations": [], "node to bundles": []}
[state-dump] Waiting tasks size: 0
[state-dump] Number of executing tasks: 0
[state-dump] Number of pinned task arguments: 0
[state-dump] Number of total spilled tasks: 0
[state-dump] Number of spilled waiting tasks: 0
[state-dump] Number of spilled unschedulable tasks: 0
[state-dump] Resource usage {
[state-dump] }
[state-dump] Backlog Size per scheduling descriptor :{workerId: num backlogs}:
[state-dump]
[state-dump] Running tasks by scheduling class:
[state-dump] ==================================================
[state-dump]
[state-dump] ClusterResources:
[state-dump] LocalObjectManager:
[state-dump] - num pinned objects: 0
[state-dump] - pinned objects size: 0
[state-dump] - num objects pending restore: 0
[state-dump] - num objects pending spill: 0
[state-dump] - num bytes pending spill: 0
[state-dump] - num bytes currently spilled: 0
[state-dump] - cumulative spill requests: 0
[state-dump] - cumulative restore requests: 0
[state-dump] - spilled objects pending delete: 0
[state-dump]
[state-dump] ObjectManager:
[state-dump] - num local objects: 0
[state-dump] - num unfulfilled push requests: 0
[state-dump] - num object pull requests: 0
[state-dump] - num chunks received total: 0
[state-dump] - num chunks received failed (all): 0
[state-dump] - num chunks received failed / cancelled: 0
[state-dump] - num chunks received failed / plasma error: 0
[state-dump] Event stats:
[state-dump] Global stats: 0 total (0 active)
[state-dump] Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] Execution time: mean = -nan s, total = 0.000 s
[state-dump] Event stats:
[state-dump] PushManager:
[state-dump] - num pushes in flight: 0
[state-dump] - num chunks in flight: 0
[state-dump] - num chunks remaining: 0
[state-dump] - max chunks allowed: 409
[state-dump] OwnershipBasedObjectDirectory:
[state-dump] - num listeners: 0
[state-dump] - cumulative location updates: 0
[state-dump] - num location updates per second: 70262595245916000.000
[state-dump] - num location lookups per second: 70262595245904000.000
[state-dump] - num locations added per second: 0.000
[state-dump] - num locations removed per second: 0.000
[state-dump] BufferPool:
[state-dump] - create buffer state map size: 0
[state-dump] PullManager:
[state-dump] - num bytes available for pulled objects: 9000000000
[state-dump] - num bytes being pulled (all): 0
[state-dump] - num bytes being pulled / pinned: 0
[state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - first get request bundle: N/A
[state-dump] - first wait request bundle: N/A
[state-dump] - first task request bundle: N/A
[state-dump] - num objects queued: 0
[state-dump] - num objects actively pulled (all): 0
[state-dump] - num objects actively pulled / pinned: 0
[state-dump] - num bundles being pulled: 0
[state-dump] - num pull retries: 0
[state-dump] - max timeout seconds: 0
[state-dump] - max timeout request is already processed. No entry.
[state-dump]
[state-dump] WorkerPool:
[state-dump] - registered jobs: 0
[state-dump] - process_failed_job_config_missing: 0
[state-dump] - process_failed_rate_limited: 0
[state-dump] - process_failed_pending_registration: 0
[state-dump] - process_failed_runtime_env_setup_failed: 0
[state-dump] - num PYTHON workers: 0
[state-dump] - num PYTHON drivers: 0
[state-dump] - num PYTHON pending start requests: 0
[state-dump] - num PYTHON pending registration requests: 0
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num idle workers: 0
[state-dump] TaskDependencyManager:
[state-dump] - task deps map size: 0
[state-dump] - get req map size: 0
[state-dump] - wait req map size: 0
[state-dump] - local objects map size: 0
[state-dump] WaitManager:
[state-dump] - num active wait requests: 0
[state-dump] Subscriber:
[state-dump] Channel WORKER_OBJECT_EVICTION
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_REF_REMOVED_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] num async plasma notifications: 0
[state-dump] Remote node managers:
[state-dump] Event stats:
[state-dump] Global stats: 27 total (13 active)
[state-dump] Queueing time: mean = 2.539 ms, max = 15.363 ms, min = 14.930 us, total = 68.563 ms
[state-dump] Execution time: mean = 1.043 ms, total = 28.166 ms
[state-dump] Event stats:
[state-dump] PeriodicalRunner.RunFnPeriodically - 11 total (2 active, 1 running), Execution time: mean = 903.800 us, total = 9.942 ms, Queueing time: mean = 5.619 ms, max = 15.363 ms, min = 22.380 us, total = 61.809 ms
[state-dump] NodeManager.deadline_timer.spill_objects_when_over_threshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode.OnReplyReceived - 1 total (0 active), Execution time: mean = 216.720 us, total = 216.720 us, Queueing time: mean = 6.739 ms, max = 6.739 ms, min = 6.739 ms, total = 6.739 ms
[state-dump] ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig - 1 total (0 active), Execution time: mean = 1.423 ms, total = 1.423 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.GCTaskFailureReason - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::InternalKVGcsService.grpc_client.GetInternalConfig.OnReplyReceived - 1 total (0 active), Execution time: mean = 13.241 ms, total = 13.241 ms, Queueing time: mean = 14.930 us, max = 14.930 us, min = 14.930 us, total = 14.930 us
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), Execution time: mean = 1.489 ms, total = 1.489 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] NodeManager.ScheduleAndDispatchTasks - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch.OnReplyReceived - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ray::rpc::NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), Execution time: mean = 1.854 ms, total = 1.854 ms, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] ClusterResourceManager.ResetRemoteNodeView - 1 total (1 active), Execution time: mean = 0.000 s, total = 0.000 s, Queueing time: mean = 0.000 s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] DebugString() time ms: 0
[state-dump]
[state-dump]
[2025-02-13 12:08:15,930 I 141 141] accessor.cc:777: Received notification for node, IsAlive = 1 node_id=55d4e7fb9e048cde1e170bab6ff9d4e58de597ec9560b9928b29caf1
[2025-02-13 12:08:15,930 I 141 141] accessor.cc:777: Received notification for node, IsAlive = 1 node_id=9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e
[2025-02-13 12:08:20,946 I 141 141] memory_monitor.cc:88: Node memory usage above threshold, used: 855545192448, threshold_bytes: 108197437440, total bytes: 1081974374400, threshold fraction: 0.1
[2025-02-13 12:08:20,960 W 141 141] node_manager.cc:2995: Memory usage above threshold but no workers are available for killing.This could be due to worker memory leak andidle worker are occupying most of the memory.
[2025-02-13 12:08:21,224 W 141 141] memory_monitor.cc:324: Got zero used memory for smap file /proc/306/smaps_rollup
[2025-02-13 12:08:25,947 I 141 141] memory_monitor.cc:88: Node memory usage above threshold, used: 855431766016, threshold_bytes: 108197437440, total bytes: 1081974374400, threshold fraction: 0.1
[2025-02-13 12:08:26,218 W 141 141] node_manager.cc:2995: Memory usage above threshold but no workers are available for killing.This could be due to worker memory leak andidle worker are occupying most of the memory.
[2025-02-13 12:08:31,158 I 141 141] memory_monitor.cc:88: Node memory usage above threshold, used: 855438663680, threshold_bytes: 108197437440, total bytes: 1081974374400, threshold fraction: 0.1
[2025-02-13 12:08:31,427 W 141 141] node_manager.cc:2995: Memory usage above threshold but no workers are available for killing.This could be due to worker memory leak andidle worker are occupying most of the memory.
[2025-02-13 12:08:32,923 I 141 141] accessor.cc:777: Received notification for node, IsAlive = 0 node_id=9dcf25989a44f0aef4c4c1dcfc31fa4f7ad50bdeb67b8f3dadf4364e
[2025-02-13 12:08:32,977 C 141 141] node_manager.cc:1015: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS failed to check the health of this node for 5 times. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xdb334a) [0x556a05a7934a] ray::operator<<()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xdb57d1) [0x556a05a7b7d1] ray::RayLog::~RayLog()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x32e062) [0x556a04ff4062] ray::raylet::NodeManager::NodeRemoved()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x55ca55) [0x556a05222a55] ray::gcs::NodeInfoAccessor::HandleNotification()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x6da028) [0x556a053a0028] EventTracker::RecordExecution()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x6d501e) [0x556a0539b01e] std::_Function_handler<>::_M_invoke()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x6d5496) [0x556a0539b496] boost::asio::detail::completion_handler<>::do_complete()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xd9153b) [0x556a05a5753b] boost::asio::detail::scheduler::do_run_one()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xd93ac9) [0x556a05a59ac9] boost::asio::detail::scheduler::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0xd93fe2) [0x556a05a59fe2] boost::asio::io_context::run()
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x1ecd5f) [0x556a04eb2d5f] main
/usr/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7fce925cbd90]
/usr/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7fce925cbe40] __libc_start_main
/home/ray/anaconda3/lib/python3.11/site-packages/ray/core/src/ray/raylet/raylet(+0x2470a7) [0x556a04f0d0a7]
```
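To spell out what I read in that log: the memory monitor does pick up `RAY_memory_usage_threshold=0.1` (the threshold of 108197437440 bytes is exactly 0.1 of the reported 1081974374400-byte total), and it sees roughly 855 GB already in use on the node before any Ray worker processes exist, which is why it reports that there are no workers available to kill. The fatal error at the very end, though, is the raylet exiting because the GCS marked the node dead after failing its health check 5 times.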