How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi,
I'm trying to set up a small Ray cluster on our local company network. As soon as I connect a worker node (on another machine) to the head node, it shows up in the dashboard with all of its node info, but then dies after about 20 seconds. It's also not possible to open that node's logs in the dashboard. I tried the same setup on my private home network (same OS, same Python environment) and there it works. I added an inbound firewall rule allowing all connections for ray.exe, python.exe, raylet.exe and gcs_server.exe.
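To rule out a plain connectivity problem despite that firewall rule, a quick TCP check I can run from each machine against the other's Ray ports (a minimal sketch; connectivity_check.py is just an illustrative name, 5201 is the GCS port from my setup below, and the worker's raylet ports have to be read from raylet.out further down because they are picked at random on every restart):

# connectivity_check.py -- minimal sketch: verify plain TCP reachability between the nodes.
# Run on the worker against the head's GCS port (5201), and on the head against the
# worker's raylet ports from raylet.out (58389/58391 in the run below).
import socket
import sys

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host = sys.argv[1]
    for port in sys.argv[2:]:
        ok = can_connect(host, int(port))
        print(f"{host}:{port} -> {'open' if ok else 'blocked or closed'}")

For example, python connectivity_check.py 10.14.228.50 5201 on the worker, and the same script with the worker's IP and raylet ports on the head.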
My setup:
- Windows 10 Pro
- Python 3.7.9
- Ray 2.1.0 (tried 3.0.0 as well)

Head node:
ray start --head --node-ip-address=10.14.228.50 --port 5201

Worker node:
ray start --address=10.14.228.50:5201 --node-ip-address=10.14.228.74
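To see exactly when the worker drops out of the cluster, I can also watch the membership from a driver (a minimal sketch, assuming the head address is reachable from wherever it runs; check_nodes.py is just an illustrative name):

# check_nodes.py -- minimal sketch: poll cluster membership and watch the worker
# node flip from alive to dead shortly after it registers.
import time
import ray

ray.init(address="10.14.228.50:5201")  # same address the worker node joins

for _ in range(12):
    for node in ray.nodes():
        state = "alive" if node["Alive"] else "DEAD"
        print(node["NodeManagerAddress"], state)
    print("---")
    time.sleep(5)  # the node dies roughly 20-30 s after joining, so poll for about a minute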
Worker Node raylet.out
[2022-11-18 15:10:14,388 I 14604 2452] (raylet.exe) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-11-18 15:10:14,389 I 14604 2452] (raylet.exe) store_runner.cc:32: Allowing the Plasma store to use up to 38.4538GB of memory.
[2022-11-18 15:10:14,389 I 14604 2452] (raylet.exe) store_runner.cc:48: Starting object store with directory S:\Temp, fallback S:\Temp\ray, and huge page support disabled
[2022-11-18 15:10:14,390 I 14604 2088] (raylet.exe) store.cc:551: ========== Plasma store: =================
Current usage: 0 / 38.4538 GB
- num bytes created total: 0
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 0
- bytes in use: 0
- objects evictable: 0
- bytes evictable: 0
- objects created by worker: 0
- bytes created by worker: 0
- objects restored: 0
- bytes restored: 0
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0
[2022-11-18 15:10:14,392 I 14604 2452] (raylet.exe) grpc_server.cc:120: ObjectManager server started, listening on port 58389.
[2022-11-18 15:10:14,404 W 14604 2452] (raylet.exe) memory_monitor.cc:65: Not running MemoryMonitor. It is currently supported only on Linux.
[2022-11-18 15:10:14,404 I 14604 2452] (raylet.exe) node_manager.cc:345: Initializing NodeManager with ID 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa
[2022-11-18 15:10:14,405 I 14604 2452] (raylet.exe) grpc_server.cc:120: NodeManager server started, listening on port 58391.
[2022-11-18 15:10:14,575 I 14604 8848] (raylet.exe) agent_manager.cc:109: Monitor agent process with id 15724, register timeout 100000ms.
[2022-11-18 15:10:14,579 I 14604 2452] (raylet.exe) raylet.cc:114: Raylet of id, 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa started. Raylet consists of node_manager and object_manager. node_manager address: 10.14.228.74:58391 object_manager address: 10.14.228.74:58389 hostname: 10.14.228.74
[2022-11-18 15:10:14,581 I 14604 2452] (raylet.exe) node_manager.cc:559: [state-dump] Event stats:
[state-dump]
[state-dump]
[state-dump] Global stats: 20 total (10 active)
[state-dump] Queueing time: mean = 27.177 ms, max = 189.309 ms, min = 9.200 us, total = 543.547 ms
[state-dump] Execution time: mean = 9.659 ms, total = 193.170 ms
[state-dump] Event stats:
[state-dump] PeriodicalRunner.RunFnPeriodically - 8 total (1 active, 1 running), CPU time: mean = 252.338 us, total = 2.019 ms
[state-dump] UNKNOWN - 3 total (3 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 190.864 ms, total = 190.864 ms
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 280.800 us, total = 280.800 us
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] ObjectManager.UpdateAvailableMemory - 1 total (0 active), CPU time: mean = 7.200 us, total = 7.200 us
[state-dump] NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]
[state-dump] NodeManager:
[state-dump] Node ID: 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa
[state-dump] Node name: 10.14.228.74
[state-dump] InitialConfigResources: {node:10.14.228.74: 10000, CPU: 160000, GPU: 10000, memory: 897254457350000, object_store_memory: 384537624570000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state:
[state-dump] Local id: -7075803485337003867 Local resources: {object_store_memory: [384537624570000]/[384537624570000], node:10.14.228.74: [10000]/[10000], GPU: [10000]/[10000], CPU: [160000]/[160000], memory: [897254457350000]/[897254457350000]}node id: -7075803485337003867{node:10.14.228.74: 10000/10000, object_store_memory: 384537624570000/384537624570000, memory: 897254457350000/897254457350000, CPU: 160000/160000, GPU: 10000/10000}{ "placment group locations": [], "node to bundles": []}
[state-dump] Waiting tasks size: 0
[state-dump] Number of executing tasks: 0
[state-dump] Number of pinned task arguments: 0
[state-dump] Number of total spilled tasks: 0
[state-dump] Number of spilled waiting tasks: 0
[state-dump] Number of spilled unschedulable tasks: 0
[state-dump] Resource usage {
[state-dump] }
[state-dump] Running tasks by scheduling class:
[state-dump] ==================================================
[state-dump]
[state-dump] ClusterResources:
[state-dump] LocalObjectManager:
[state-dump] - num pinned objects: 0
[state-dump] - pinned objects size: 0
[state-dump] - num objects pending restore: 0
[state-dump] - num objects pending spill: 0
[state-dump] - num bytes pending spill: 0
[state-dump] - num bytes currently spilled: 0
[state-dump] - cumulative spill requests: 0
[state-dump] - cumulative restore requests: 0
[state-dump] - spilled objects pending delete: 0
[state-dump]
[state-dump] ObjectManager:
[state-dump] - num local objects: 0
[state-dump] - num unfulfilled push requests: 0
[state-dump] - num object pull requests: 0
[state-dump] - num chunks received total: 0
[state-dump] - num chunks received failed (all): 0
[state-dump] - num chunks received failed / cancelled: 0
[state-dump] - num chunks received failed / plasma error: 0
[state-dump] Event stats:
[state-dump] Global stats: 0 total (0 active)
[state-dump] Queueing time: mean = -nan(ind) s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] Execution time: mean = -nan(ind) s, total = 0.000 s
[state-dump] Event stats:
[state-dump] PushManager:
[state-dump] - num pushes in flight: 0
[state-dump] - num chunks in flight: 0
[state-dump] - num chunks remaining: 0
[state-dump] - max chunks allowed: 409
[state-dump] OwnershipBasedObjectDirectory:
[state-dump] - num listeners: 0
[state-dump] - cumulative location updates: 0
[state-dump] - num location updates per second: 7500.000
[state-dump] - num location lookups per second: 0.000
[state-dump] - num locations added per second: 0.000
[state-dump] - num locations removed per second: 0.000
[state-dump] BufferPool:
[state-dump] - create buffer state map size: 0
[state-dump] PullManager:
[state-dump] - num bytes available for pulled objects: 38453762457
[state-dump] - num bytes being pulled (all): 0
[state-dump] - num bytes being pulled / pinned: 0
[state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - first get request bundle: N/A
[state-dump] - first wait request bundle: N/A
[state-dump] - first task request bundle: N/A
[state-dump] - num objects queued: 0
[state-dump] - num objects actively pulled (all): 0
[state-dump] - num objects actively pulled / pinned: 0
[state-dump] - num bundles being pulled: 0
[state-dump] - num pull retries: 0
[state-dump] - max timeout seconds: 0
[state-dump] - max timeout request is already processed. No entry.
[state-dump]
[state-dump] WorkerPool:
[state-dump] - registered jobs: 0
[state-dump] - process_failed_job_config_missing: 0
[state-dump] - process_failed_rate_limited: 0
[state-dump] - process_failed_pending_registration: 0
[state-dump] - process_failed_runtime_env_setup_failed: 0
[state-dump] - num PYTHON workers: 0
[state-dump] - num PYTHON drivers: 0
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num idle workers: 0
[state-dump] TaskDependencyManager:
[state-dump] - task deps map size: 0
[state-dump] - get req map size: 0
[state-dump] - wait req map size: 0
[state-dump] - local objects map size: 0
[state-dump] WaitManager:
[state-dump] - num active wait requests: 0
[state-dump] Subscriber:
[state-dump] Channel WORKER_REF_REMOVED_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_EVICTION
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] num async plasma notifications: 0
[state-dump] Remote node managers:
[state-dump] Event stats:
[state-dump] Global stats: 20 total (10 active)
[state-dump] Queueing time: mean = 27.177 ms, max = 189.309 ms, min = 9.200 us, total = 543.547 ms
[state-dump] Execution time: mean = 9.659 ms, total = 193.170 ms
[state-dump] Event stats:
[state-dump] PeriodicalRunner.RunFnPeriodically - 8 total (1 active, 1 running), CPU time: mean = 252.338 us, total = 2.019 ms
[state-dump] UNKNOWN - 3 total (3 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 190.864 ms, total = 190.864 ms
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 280.800 us, total = 280.800 us
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] ObjectManager.UpdateAvailableMemory - 1 total (0 active), CPU time: mean = 7.200 us, total = 7.200 us
[state-dump] NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]
[2022-11-18 15:10:14,582 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa, IsAlive = 1
[2022-11-18 15:10:14,582 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = d95f9e10cddcc512ed19448bcfa76c6660cdbaf48f60b7c5a481d044, IsAlive = 1
[2022-11-18 15:10:15,718 I 14604 2452] (raylet.exe) agent_manager.cc:40: HandleRegisterAgent, ip: 10.14.228.74, port: 59081, id: 15724
[2022-11-18 15:10:43,587 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa, IsAlive = 0
[2022-11-18 15:10:43,591 C 14604 2452] (raylet.exe) node_manager.cc:1057: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
BaseThreadInitThunk
RtlUserThreadStart
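The last lines of raylet.out show that the GCS on the head marked this node as dead after 30 s without heartbeats, even though registration from the worker to the head clearly worked. Since the raylet's NodeManager gRPC port (58391 in this run) and ObjectManager port are chosen at random on every start, my guess is that traffic from the head back to the worker is being filtered somewhere on the company network. A quick way to test that direction while the worker is still up (a minimal sketch; grpcio is already installed as a Ray dependency, and 58391 has to be read from the current raylet.out because it changes on each restart):

# grpc_reach_check.py -- minimal sketch: run on the head machine while the worker is still alive.
import grpc

channel = grpc.insecure_channel("10.14.228.74:58391")  # worker's NodeManager port from raylet.out
try:
    grpc.channel_ready_future(channel).result(timeout=5)
    print("head -> worker raylet gRPC port is reachable")
except grpc.FutureTimeoutError:
    print("head -> worker raylet gRPC port is NOT reachable (filtered somewhere?)")

If that times out, pinning the worker's ports when starting it (for example with --node-manager-port, --object-manager-port and --min-worker-port/--max-worker-port) and opening those ports explicitly in the firewall would be the next thing I'd try.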
Worker Node debug_state.txt
NodeManager:
Node ID: 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa
Node name: 10.14.228.74
InitialConfigResources: {node:10.14.228.74: 10000, CPU: 160000, GPU: 10000, memory: 897254457350000, object_store_memory: 384537624570000}
ClusterTaskManager:
========== Node: 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa =================
Infeasible queue length: 0
Schedule queue length: 0
Dispatch queue length: 0
num_waiting_for_resource: 0
num_waiting_for_plasma_memory: 0
num_waiting_for_remote_node_resources: 0
num_worker_not_started_by_job_config_not_exist: 0
num_worker_not_started_by_registration_timeout: 0
num_tasks_waiting_for_workers: 0
num_cancelled_tasks: 0
cluster_resource_scheduler state:
Local id: -7075803485337003867 Local resources: {object_store_memory: [384537624570000]/[384537624570000], node:10.14.228.74: [10000]/[10000], GPU: [10000]/[10000], CPU: [160000]/[160000], memory: [897254457350000]/[897254457350000]}node id: -7075803485337003867{node:10.14.228.74: 10000/10000, object_store_memory: 384537624570000/384537624570000, memory: 897254457350000/897254457350000, CPU: 160000/160000, GPU: 10000/10000}node id: -4510166795099005845{object_store_memory: 328991858680000/328991858680000, node:10.14.228.50: 10000/10000, memory: 667647670280000/667647670280000, CPU: 160000/160000, GPU: 10000/10000}{ "placment group locations": [], "node to bundles": []}
Waiting tasks size: 0
Number of executing tasks: 0
Number of pinned task arguments: 0
Number of total spilled tasks: 0
Number of spilled waiting tasks: 0
Number of spilled unschedulable tasks: 0
Resource usage {
}
Running tasks by scheduling class:
==================================================
ClusterResources:
LocalObjectManager:
- num pinned objects: 0
- pinned objects size: 0
- num objects pending restore: 0
- num objects pending spill: 0
- num bytes pending spill: 0
- num bytes currently spilled: 0
- cumulative spill requests: 0
- cumulative restore requests: 0
- spilled objects pending delete: 0
ObjectManager:
- num local objects: 0
- num unfulfilled push requests: 0
- num object pull requests: 0
- num chunks received total: 0
- num chunks received failed (all): 0
- num chunks received failed / cancelled: 0
- num chunks received failed / plasma error: 0
Event stats:
Global stats: 0 total (0 active)
Queueing time: mean = -nan(ind) s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time: mean = -nan(ind) s, total = 0.000 s
Event stats:
PushManager:
- num pushes in flight: 0
- num chunks in flight: 0
- num chunks remaining: 0
- max chunks allowed: 409
OwnershipBasedObjectDirectory:
- num listeners: 0
- cumulative location updates: 0
- num location updates per second: 0.000
- num location lookups per second: 0.000
- num locations added per second: 0.000
- num locations removed per second: 0.000
BufferPool:
- create buffer state map size: 0
PullManager:
- num bytes available for pulled objects: 38453762457
- num bytes being pulled (all): 0
- num bytes being pulled / pinned: 0
- get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
- wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
- task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
- first get request bundle: N/A
- first wait request bundle: N/A
- first task request bundle: N/A
- num objects queued: 0
- num objects actively pulled (all): 0
- num objects actively pulled / pinned: 0
- num bundles being pulled: 0
- num pull retries: 0
- max timeout seconds: 0
- max timeout request is already processed. No entry.
WorkerPool:
- registered jobs: 0
- process_failed_job_config_missing: 0
- process_failed_rate_limited: 0
- process_failed_pending_registration: 0
- process_failed_runtime_env_setup_failed: 0
- num PYTHON workers: 0
- num PYTHON drivers: 0
- num object spill callbacks queued: 0
- num object restore queued: 0
- num util functions queued: 0
- num idle workers: 0
TaskDependencyManager:
- task deps map size: 0
- get req map size: 0
- wait req map size: 0
- local objects map size: 0
WaitManager:
- num active wait requests: 0
Subscriber:
Channel WORKER_REF_REMOVED_CHANNEL
- cumulative subscribe requests: 0
- cumulative unsubscribe requests: 0
- active subscribed publishers: 0
- cumulative published messages: 0
- cumulative processed messages: 0
Channel WORKER_OBJECT_EVICTION
- cumulative subscribe requests: 0
- cumulative unsubscribe requests: 0
- active subscribed publishers: 0
- cumulative published messages: 0
- cumulative processed messages: 0
Channel WORKER_OBJECT_LOCATIONS_CHANNEL
- cumulative subscribe requests: 0
- cumulative unsubscribe requests: 0
- active subscribed publishers: 0
- cumulative published messages: 0
- cumulative processed messages: 0
num async plasma notifications: 0
Remote node managers:
d95f9e10cddcc512ed19448bcfa76c6660cdbaf48f60b7c5a481d044
Event stats:
Global stats: 565 total (9 active)
Queueing time: mean = 1.236 ms, max = 189.309 ms, min = -0.001 s, total = 698.304 ms
Execution time: mean = 352.527 us, total = 199.178 ms
Event stats:
UNKNOWN - 221 total (3 active), CPU time: mean = 7.037 us, total = 1.555 ms
ObjectManager.UpdateAvailableMemory - 200 total (0 active), CPU time: mean = 3.954 us, total = 790.843 us
RayletWorkerPool.deadline_timer.kill_idle_workers - 100 total (1 active), CPU time: mean = 10.397 us, total = 1.040 ms
NodeManager.deadline_timer.flush_free_objects - 20 total (1 active), CPU time: mean = 9.862 us, total = 197.241 us
PeriodicalRunner.RunFnPeriodically - 8 total (0 active), CPU time: mean = 324.363 us, total = 2.595 ms
NodeManager.deadline_timer.record_metrics - 4 total (1 active), CPU time: mean = 176.853 us, total = 707.414 us
NodeManager.deadline_timer.debug_state_dump - 2 total (1 active, 1 running), CPU time: mean = 231.992 us, total = 463.984 us
InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 114.250 us, total = 228.500 us
NodeManager.deadline_timer.print_event_loop_stats - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 190.864 ms, total = 190.864 ms
NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 164.600 us, total = 164.600 us
NodeResourceInfoGcsService.grpc_client.GetResources - 1 total (0 active), CPU time: mean = 48.600 us, total = 48.600 us
NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 280.800 us, total = 280.800 us
InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 229.645 us, total = 229.645 us
JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 12.800 us, total = 12.800 us
DebugString() time ms: 0