How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi,
I'm trying to set up a small Ray cluster on our local company network. As soon as I connect a worker node (on another machine) to the head node, it shows up in the dashboard with all of its node info, but then dies after about 20 seconds. It's also not possible to open that node's logs in the dashboard. I tried the same setup on my private home network (same OS, same Python environment) and there it works. I added an inbound firewall rule allowing all connections for ray.exe, python.exe, raylet.exe and gcs_server.exe.
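To rule out a plain connectivity problem despite that firewall rule, a quick TCP check I can run from each machine against the other's Ray ports (a minimal sketch; connectivity_check.py is just an illustrative name, 5201 is the GCS port from my setup below, and the worker's raylet ports have to be read from raylet.out further down because they are picked at random on every restart):

# connectivity_check.py -- minimal sketch: verify plain TCP reachability between the nodes.
# Run on the worker against the head's GCS port (5201), and on the head against the
# worker's raylet ports from raylet.out (58389/58391 in the run below).
import socket
import sys

def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    host = sys.argv[1]
    for port in sys.argv[2:]:
        ok = can_connect(host, int(port))
        print(f"{host}:{port} -> {'open' if ok else 'blocked or closed'}")

For example, python connectivity_check.py 10.14.228.50 5201 on the worker, and the same script with the worker's IP and raylet ports on the head.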
My setup:
- Windows 10 Pro
- Python 3.7.9
- Ray 2.1.0 (tried 3.0.0 as well)

Head node:
ray start --head --node-ip-address=10.14.228.50 --port 5201

Worker node:
ray start --address=10.14.228.50:5201 --node-ip-address=10.14.228.74
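To see exactly when the worker drops out of the cluster, I can also watch the membership from a driver (a minimal sketch, assuming the head address is reachable from wherever it runs; check_nodes.py is just an illustrative name):

# check_nodes.py -- minimal sketch: poll cluster membership and watch the worker
# node flip from alive to dead shortly after it registers.
import time
import ray

ray.init(address="10.14.228.50:5201")  # same address the worker node joins

for _ in range(12):
    for node in ray.nodes():
        state = "alive" if node["Alive"] else "DEAD"
        print(node["NodeManagerAddress"], state)
    print("---")
    time.sleep(5)  # the node dies roughly 20-30 s after joining, so poll for about a minute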
Worker Node raylet.out
[2022-11-18 15:10:14,388 I 14604 2452] (raylet.exe) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2022-11-18 15:10:14,389 I 14604 2452] (raylet.exe) store_runner.cc:32: Allowing the Plasma store to use up to 38.4538GB of memory.
[2022-11-18 15:10:14,389 I 14604 2452] (raylet.exe) store_runner.cc:48: Starting object store with directory S:\Temp, fallback S:\Temp\ray, and huge page support disabled
[2022-11-18 15:10:14,390 I 14604 2088] (raylet.exe) store.cc:551: ========== Plasma store: =================
Current usage: 0 / 38.4538 GB
- num bytes created total: 0
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 0
- bytes in use: 0
- objects evictable: 0
- bytes evictable: 0
- objects created by worker: 0
- bytes created by worker: 0
- objects restored: 0
- bytes restored: 0
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0
[2022-11-18 15:10:14,392 I 14604 2452] (raylet.exe) grpc_server.cc:120: ObjectManager server started, listening on port 58389.
[2022-11-18 15:10:14,404 W 14604 2452] (raylet.exe) memory_monitor.cc:65: Not running MemoryMonitor. It is currently supported only on Linux.
[2022-11-18 15:10:14,404 I 14604 2452] (raylet.exe) node_manager.cc:345: Initializing NodeManager with ID 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa
[2022-11-18 15:10:14,405 I 14604 2452] (raylet.exe) grpc_server.cc:120: NodeManager server started, listening on port 58391.
[2022-11-18 15:10:14,575 I 14604 8848] (raylet.exe) agent_manager.cc:109: Monitor agent process with id 15724, register timeout 100000ms.
[2022-11-18 15:10:14,579 I 14604 2452] (raylet.exe) raylet.cc:114: Raylet of id, 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa started. Raylet consists of node_manager and object_manager. node_manager address: 10.14.228.74:58391 object_manager address: 10.14.228.74:58389 hostname: 10.14.228.74
[2022-11-18 15:10:14,581 I 14604 2452] (raylet.exe) node_manager.cc:559: [state-dump] Event stats:
[state-dump]
[state-dump]
[state-dump] Global stats: 20 total (10 active)
[state-dump] Queueing time: mean = 27.177 ms, max = 189.309 ms, min = 9.200 us, total = 543.547 ms
[state-dump] Execution time: mean = 9.659 ms, total = 193.170 ms
[state-dump] Event stats:
[state-dump] PeriodicalRunner.RunFnPeriodically - 8 total (1 active, 1 running), CPU time: mean = 252.338 us, total = 2.019 ms
[state-dump] UNKNOWN - 3 total (3 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 190.864 ms, total = 190.864 ms
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 280.800 us, total = 280.800 us
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] ObjectManager.UpdateAvailableMemory - 1 total (0 active), CPU time: mean = 7.200 us, total = 7.200 us
[state-dump] NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]
[state-dump] NodeManager:
[state-dump] Node ID: 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa
[state-dump] Node name: 10.14.228.74
[state-dump] InitialConfigResources: {node:10.14.228.74: 10000, CPU: 160000, GPU: 10000, memory: 897254457350000, object_store_memory: 384537624570000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state:
[state-dump] Local id: -7075803485337003867 Local resources: {object_store_memory: [384537624570000]/[384537624570000], node:10.14.228.74: [10000]/[10000], GPU: [10000]/[10000], CPU: [160000]/[160000], memory: [897254457350000]/[897254457350000]}node id: -7075803485337003867{node:10.14.228.74: 10000/10000, object_store_memory: 384537624570000/384537624570000, memory: 897254457350000/897254457350000, CPU: 160000/160000, GPU: 10000/10000}{ "placment group locations": [], "node to bundles": []}
[state-dump] Waiting tasks size: 0
[state-dump] Number of executing tasks: 0
[state-dump] Number of pinned task arguments: 0
[state-dump] Number of total spilled tasks: 0
[state-dump] Number of spilled waiting tasks: 0
[state-dump] Number of spilled unschedulable tasks: 0
[state-dump] Resource usage {
[state-dump] }
[state-dump] Running tasks by scheduling class:
[state-dump] ==================================================
[state-dump]
[state-dump] ClusterResources:
[state-dump] LocalObjectManager:
[state-dump] - num pinned objects: 0
[state-dump] - pinned objects size: 0
[state-dump] - num objects pending restore: 0
[state-dump] - num objects pending spill: 0
[state-dump] - num bytes pending spill: 0
[state-dump] - num bytes currently spilled: 0
[state-dump] - cumulative spill requests: 0
[state-dump] - cumulative restore requests: 0
[state-dump] - spilled objects pending delete: 0
[state-dump]
[state-dump] ObjectManager:
[state-dump] - num local objects: 0
[state-dump] - num unfulfilled push requests: 0
[state-dump] - num object pull requests: 0
[state-dump] - num chunks received total: 0
[state-dump] - num chunks received failed (all): 0
[state-dump] - num chunks received failed / cancelled: 0
[state-dump] - num chunks received failed / plasma error: 0
[state-dump] Event stats:
[state-dump] Global stats: 0 total (0 active)
[state-dump] Queueing time: mean = -nan(ind) s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] Execution time: mean = -nan(ind) s, total = 0.000 s
[state-dump] Event stats:
[state-dump] PushManager:
[state-dump] - num pushes in flight: 0
[state-dump] - num chunks in flight: 0
[state-dump] - num chunks remaining: 0
[state-dump] - max chunks allowed: 409
[state-dump] OwnershipBasedObjectDirectory:
[state-dump] - num listeners: 0
[state-dump] - cumulative location updates: 0
[state-dump] - num location updates per second: 7500.000
[state-dump] - num location lookups per second: 0.000
[state-dump] - num locations added per second: 0.000
[state-dump] - num locations removed per second: 0.000
[state-dump] BufferPool:
[state-dump] - create buffer state map size: 0
[state-dump] PullManager:
[state-dump] - num bytes available for pulled objects: 38453762457
[state-dump] - num bytes being pulled (all): 0
[state-dump] - num bytes being pulled / pinned: 0
[state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - first get request bundle: N/A
[state-dump] - first wait request bundle: N/A
[state-dump] - first task request bundle: N/A
[state-dump] - num objects queued: 0
[state-dump] - num objects actively pulled (all): 0
[state-dump] - num objects actively pulled / pinned: 0
[state-dump] - num bundles being pulled: 0
[state-dump] - num pull retries: 0
[state-dump] - max timeout seconds: 0
[state-dump] - max timeout request is already processed. No entry.
[state-dump]
[state-dump] WorkerPool:
[state-dump] - registered jobs: 0
[state-dump] - process_failed_job_config_missing: 0
[state-dump] - process_failed_rate_limited: 0
[state-dump] - process_failed_pending_registration: 0
[state-dump] - process_failed_runtime_env_setup_failed: 0
[state-dump] - num PYTHON workers: 0
[state-dump] - num PYTHON drivers: 0
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num idle workers: 0
[state-dump] TaskDependencyManager:
[state-dump] - task deps map size: 0
[state-dump] - get req map size: 0
[state-dump] - wait req map size: 0
[state-dump] - local objects map size: 0
[state-dump] WaitManager:
[state-dump] - num active wait requests: 0
[state-dump] Subscriber:
[state-dump] Channel WORKER_REF_REMOVED_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_EVICTION
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] num async plasma notifications: 0
[state-dump] Remote node managers:
[state-dump] Event stats:
[state-dump] Global stats: 20 total (10 active)
[state-dump] Queueing time: mean = 27.177 ms, max = 189.309 ms, min = 9.200 us, total = 543.547 ms
[state-dump] Execution time: mean = 9.659 ms, total = 193.170 ms
[state-dump] Event stats:
[state-dump] PeriodicalRunner.RunFnPeriodically - 8 total (1 active, 1 running), CPU time: mean = 252.338 us, total = 2.019 ms
[state-dump] UNKNOWN - 3 total (3 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 190.864 ms, total = 190.864 ms
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 280.800 us, total = 280.800 us
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] ObjectManager.UpdateAvailableMemory - 1 total (0 active), CPU time: mean = 7.200 us, total = 7.200 us
[state-dump] NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]
[2022-11-18 15:10:14,582 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa, IsAlive = 1
[2022-11-18 15:10:14,582 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = d95f9e10cddcc512ed19448bcfa76c6660cdbaf48f60b7c5a481d044, IsAlive = 1
[2022-11-18 15:10:15,718 I 14604 2452] (raylet.exe) agent_manager.cc:40: HandleRegisterAgent, ip: 10.14.228.74, port: 59081, id: 15724
[2022-11-18 15:10:43,587 I 14604 2452] (raylet.exe) accessor.cc:608: Received notification for node id = 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa, IsAlive = 0
[2022-11-18 15:10:43,591 C 14604 2452] (raylet.exe) node_manager.cc:1057: [Timeout] Exiting because this node manager has mistakenly been marked as dead by the GCS: GCS didn't receive heartbeats from this node for 30000 ms. This is likely because the machine or raylet has become overloaded.
*** StackTrace Information ***
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
unknown
BaseThreadInitThunk
RtlUserThreadStart
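The last lines of raylet.out show that the GCS on the head marked this node as dead after 30 s without heartbeats, even though registration from the worker to the head clearly worked. Since the raylet's NodeManager gRPC port (58391 in this run) and ObjectManager port are chosen at random on every start, my guess is that traffic from the head back to the worker is being filtered somewhere on the company network. A quick way to test that direction while the worker is still up (a minimal sketch; grpcio is already installed as a Ray dependency, and 58391 has to be read from the current raylet.out because it changes on each restart):

# grpc_reach_check.py -- minimal sketch: run on the head machine while the worker is still alive.
import grpc

channel = grpc.insecure_channel("10.14.228.74:58391")  # worker's NodeManager port from raylet.out
try:
    grpc.channel_ready_future(channel).result(timeout=5)
    print("head -> worker raylet gRPC port is reachable")
except grpc.FutureTimeoutError:
    print("head -> worker raylet gRPC port is NOT reachable (filtered somewhere?)")

If that times out, pinning the worker's ports when starting it (for example with --node-manager-port, --object-manager-port and --min-worker-port/--max-worker-port) and opening those ports explicitly in the firewall would be the next thing I'd try.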
Worker Node debug_state.txt
NodeManager:
Node ID: 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa
Node name: 10.14.228.74
InitialConfigResources: {node:10.14.228.74: 10000, CPU: 160000, GPU: 10000, memory: 897254457350000, object_store_memory: 384537624570000}
ClusterTaskManager:
========== Node: 8cc668ae845aca22294702a29e89487ae7d358b4bf7d5ba5d59ff9fa =================
Infeasible queue length: 0
Schedule queue length: 0
Dispatch queue length: 0
num_waiting_for_resource: 0
num_waiting_for_plasma_memory: 0
num_waiting_for_remote_node_resources: 0
num_worker_not_started_by_job_config_not_exist: 0
num_worker_not_started_by_registration_timeout: 0
num_tasks_waiting_for_workers: 0
num_cancelled_tasks: 0
cluster_resource_scheduler state:
Local id: -7075803485337003867 Local resources: {object_store_memory: [384537624570000]/[384537624570000], node:10.14.228.74: [10000]/[10000], GPU: [10000]/[10000], CPU: [160000]/[160000], memory: [897254457350000]/[897254457350000]}node id: -7075803485337003867{node:10.14.228.74: 10000/10000, object_store_memory: 384537624570000/384537624570000, memory: 897254457350000/897254457350000, CPU: 160000/160000, GPU: 10000/10000}node id: -4510166795099005845{object_store_memory: 328991858680000/328991858680000, node:10.14.228.50: 10000/10000, memory: 667647670280000/667647670280000, CPU: 160000/160000, GPU: 10000/10000}{ "placment group locations": [], "node to bundles": []}
Waiting tasks size: 0
Number of executing tasks: 0
Number of pinned task arguments: 0
Number of total spilled tasks: 0
Number of spilled waiting tasks: 0
Number of spilled unschedulable tasks: 0
Resource usage {
}
Running tasks by scheduling class:
==================================================
ClusterResources:
LocalObjectManager:
- num pinned objects: 0
- pinned objects size: 0
- num objects pending restore: 0
- num objects pending spill: 0
- num bytes pending spill: 0
- num bytes currently spilled: 0
- cumulative spill requests: 0
- cumulative restore requests: 0
- spilled objects pending delete: 0
ObjectManager:
- num local objects: 0
- num unfulfilled push requests: 0
- num object pull requests: 0
- num chunks received total: 0
- num chunks received failed (all): 0
- num chunks received failed / cancelled: 0
- num chunks received failed / plasma error: 0
Event stats:
Global stats: 0 total (0 active)
Queueing time: mean = -nan(ind) s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
Execution time: mean = -nan(ind) s, total = 0.000 s
Event stats:
PushManager:
- num pushes in flight: 0
- num chunks in flight: 0
- num chunks remaining: 0
- max chunks allowed: 409
OwnershipBasedObjectDirectory:
- num listeners: 0
- cumulative location updates: 0
- num location updates per second: 0.000
- num location lookups per second: 0.000
- num locations added per second: 0.000
- num locations removed per second: 0.000
BufferPool:
- create buffer state map size: 0
PullManager:
- num bytes available for pulled objects: 38453762457
- num bytes being pulled (all): 0
- num bytes being pulled / pinned: 0
- get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
- wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
- task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
- first get request bundle: N/A
- first wait request bundle: N/A
- first task request bundle: N/A
- num objects queued: 0
- num objects actively pulled (all): 0
- num objects actively pulled / pinned: 0
- num bundles being pulled: 0
- num pull retries: 0
- max timeout seconds: 0
- max timeout request is already processed. No entry.
WorkerPool:
- registered jobs: 0
- process_failed_job_config_missing: 0
- process_failed_rate_limited: 0
- process_failed_pending_registration: 0
- process_failed_runtime_env_setup_failed: 0
- num PYTHON workers: 0
- num PYTHON drivers: 0
- num object spill callbacks queued: 0
- num object restore queued: 0
- num util functions queued: 0
- num idle workers: 0
TaskDependencyManager:
- task deps map size: 0
- get req map size: 0
- wait req map size: 0
- local objects map size: 0
WaitManager:
- num active wait requests: 0
Subscriber:
Channel WORKER_REF_REMOVED_CHANNEL
- cumulative subscribe requests: 0
- cumulative unsubscribe requests: 0
- active subscribed publishers: 0
- cumulative published messages: 0
- cumulative processed messages: 0
Channel WORKER_OBJECT_EVICTION
- cumulative subscribe requests: 0
- cumulative unsubscribe requests: 0
- active subscribed publishers: 0
- cumulative published messages: 0
- cumulative processed messages: 0
Channel WORKER_OBJECT_LOCATIONS_CHANNEL
- cumulative subscribe requests: 0
- cumulative unsubscribe requests: 0
- active subscribed publishers: 0
- cumulative published messages: 0
- cumulative processed messages: 0
num async plasma notifications: 0
Remote node managers:
d95f9e10cddcc512ed19448bcfa76c6660cdbaf48f60b7c5a481d044
Event stats:
Global stats: 565 total (9 active)
Queueing time: mean = 1.236 ms, max = 189.309 ms, min = -0.001 s, total = 698.304 ms
Execution time: mean = 352.527 us, total = 199.178 ms
Event stats:
UNKNOWN - 221 total (3 active), CPU time: mean = 7.037 us, total = 1.555 ms
ObjectManager.UpdateAvailableMemory - 200 total (0 active), CPU time: mean = 3.954 us, total = 790.843 us
RayletWorkerPool.deadline_timer.kill_idle_workers - 100 total (1 active), CPU time: mean = 10.397 us, total = 1.040 ms
NodeManager.deadline_timer.flush_free_objects - 20 total (1 active), CPU time: mean = 9.862 us, total = 197.241 us
PeriodicalRunner.RunFnPeriodically - 8 total (0 active), CPU time: mean = 324.363 us, total = 2.595 ms
NodeManager.deadline_timer.record_metrics - 4 total (1 active), CPU time: mean = 176.853 us, total = 707.414 us
NodeManager.deadline_timer.debug_state_dump - 2 total (1 active, 1 running), CPU time: mean = 231.992 us, total = 463.984 us
InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 114.250 us, total = 228.500 us
NodeManager.deadline_timer.print_event_loop_stats - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 190.864 ms, total = 190.864 ms
NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 164.600 us, total = 164.600 us
NodeResourceInfoGcsService.grpc_client.GetResources - 1 total (0 active), CPU time: mean = 48.600 us, total = 48.600 us
NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 280.800 us, total = 280.800 us
InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
AgentManagerService.grpc_server.RegisterAgent - 1 total (0 active), CPU time: mean = 229.645 us, total = 229.645 us
JobInfoGcsService.grpc_client.GetAllJobInfo - 1 total (0 active), CPU time: mean = 12.800 us, total = 12.800 us
DebugString() time ms: 0