How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi, I ran into a problem when using Ray. (The same code runs correctly on another machine with the same Ray version in my tests.)
2023-04-07 13:39:46,137 INFO worker.py:1544 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(raylet) [2023-04-07 13:39:55,968 E 3658508 3658521] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-04-07_13-39-43_510811_3620699 is over 95% full, available space: 21831651328; capacity: 470428008448. Object creation will fail if spilling is required.
(raylet) [2023-04-07 13:39:56,141 E 3658508 3658552] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
[2023-04-07 13:39:56,281 E 3620699 3658577] core_worker.cc:569: :info_message: Attempting to recover 2 lost objects by resubmitting their tasks. To disable object reconstruction, set @ray.remote(max_retries=0).
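The agent-failure message above suggests checking the installed `grpcio` version first. A small stdlib-only helper (the name `package_version` is mine, not part of Ray) does this from Python without shelling out to `pip freeze`:

```python
import importlib.metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        return "not installed"


# Compare this against the grpcio version pinned by your Ray release.
print("grpcio:", package_version("grpcio"))
```

For the "over 95% full" warning, relocating Ray's session directory to a filesystem with free space (e.g. `ray.init(_temp_dir="/data/ray_tmp")`, where the path is a placeholder) may also be worth trying, since object spilling will fail on a full disk.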
The log file raylet.log is:
[2023-04-07 13:39:45,962 I 3658508 3658508] (raylet) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-04-07 13:39:45,963 I 3658508 3658508] (raylet) store_runner.cc:32: Allowing the Plasma store to use up to 76.9443GB of memory.
[2023-04-07 13:39:45,963 I 3658508 3658508] (raylet) store_runner.cc:48: Starting object store with directory /dev/shm, fallback /tmp/ray, and huge page support disabled
[2023-04-07 13:39:45,963 I 3658508 3658520] (raylet) dlmalloc.cc:154: create_and_mmap_buffer(76944375816, /dev/shm/plasmaXXXXXX)
[2023-04-07 13:39:45,963 E 3658508 3658520] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-04-07_13-39-43_510811_3620699 is over 95% full, available space: 21831839744; capacity: 470428008448. Object creation will fail if spilling is required.
[2023-04-07 13:39:45,963 I 3658508 3658520] (raylet) store.cc:554: ========== Plasma store: =================
Current usage: 0 / 76.9443 GB
- num bytes created total: 0
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 0
- bytes in use: 0
- objects evictable: 0
- bytes evictable: 0
- objects created by worker: 0
- bytes created by worker: 0
- objects restored: 0
- bytes restored: 0
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0
[2023-04-07 13:39:45,965 I 3658508 3658508] (raylet) grpc_server.cc:140: ObjectManager server started, listening on port 41985.
[2023-04-07 13:39:45,970 I 3658508 3658508] (raylet) worker_killing_policy.cc:100: Running GroupByOwner policy.
[2023-04-07 13:39:45,970 I 3658508 3658508] (raylet) memory_monitor.cc:47: MemoryMonitor initialized with usage threshold at 512650182656 bytes (0.95 system memory), total system memory bytes: 539631767552
[2023-04-07 13:39:45,970 I 3658508 3658508] (raylet) node_manager.cc:294: Initializing NodeManager with ID 03faa854da38b999de7cb54e4570720ba36273cd4a972ed6076e7189
[2023-04-07 13:39:45,971 I 3658508 3658508] (raylet) grpc_server.cc:140: NodeManager server started, listening on port 45963.
[2023-04-07 13:39:45,977 I 3658508 3658552] (raylet) agent_manager.cc:109: Monitor agent process with id 424238335, register timeout 30000ms.
[2023-04-07 13:39:45,978 I 3658508 3658508] (raylet) raylet.cc:115: Raylet of id, 03faa854da38b999de7cb54e4570720ba36273cd4a972ed6076e7189 started. Raylet consists of node_manager and object_manager. node_manager address: 162.105.250.159:45963 object_manager address: 162.105.250.159:41985 hostname: 162.105.250.159
[2023-04-07 13:39:45,981 I 3658508 3658508] (raylet) node_manager.cc:506: [state-dump] Event stats:
[state-dump]
[state-dump]
[state-dump] Global stats: 20 total (11 active)
[state-dump] Queueing time: mean = 1.646 ms, max = 14.738 ms, min = 11.394 us, total = 32.922 ms
[state-dump] Execution time: mean = 876.586 us, total = 17.532 ms
[state-dump] Event stats:
[state-dump] PeriodicalRunner.RunFnPeriodically - 8 total (1 active, 1 running), CPU time: mean = 248.649 us, total = 1.989 ms
[state-dump] UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 173.036 us, total = 173.036 us
[state-dump] NodeManagerService.grpc_server.RequestResourceReport - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 15.369 ms, total = 15.369 ms
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]
[state-dump] NodeManager:
[state-dump] Node ID: 03faa854da38b999de7cb54e4570720ba36273cd4a972ed6076e7189
[state-dump] Node name: 162.105.250.159
[state-dump] InitialConfigResources: {accelerator_type:G: 10000, memory: 1695365863430000, CPU: 400000, node:162.105.250.159: 10000, object_store_memory: 769442512890000, GPU: 80000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 03faa854da38b999de7cb54e4570720ba36273cd4a972ed6076e7189 =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state:
[state-dump] Local id: 8836576097972133795 Local resources: {object_store_memory: [769442512890000]/[769442512890000], accelerator_type:G: [10000]/[10000], node:162.105.250.159: [10000]/[10000], GPU: [10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000]/[10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000], memory: [1695365863430000]/[1695365863430000], CPU: [400000]/[400000]}node id: 8836576097972133795{object_store_memory: 769442512890000/769442512890000, node:162.105.250.159: 10000/10000, accelerator_type:G: 10000/10000, GPU: 80000/80000, memory: 1695365863430000/1695365863430000, CPU: 400000/400000}{ "placment group locations": [], "node to bundles": []}
[state-dump] Waiting tasks size: 0
[state-dump] Number of executing tasks: 0
[state-dump] Number of pinned task arguments: 0
[state-dump] Number of total spilled tasks: 0
[state-dump] Number of spilled waiting tasks: 0
[state-dump] Number of spilled unschedulable tasks: 0
[state-dump] Resource usage {
[state-dump] }
[state-dump] Running tasks by scheduling class:
[state-dump] ==================================================
[state-dump]
[state-dump] ClusterResources:
[state-dump] LocalObjectManager:
[state-dump] - num pinned objects: 0
[state-dump] - pinned objects size: 0
[state-dump] - num objects pending restore: 0
[state-dump] - num objects pending spill: 0
[state-dump] - num bytes pending spill: 0
[state-dump] - num bytes currently spilled: 0
[state-dump] - cumulative spill requests: 0
[state-dump] - cumulative restore requests: 0
[state-dump] - spilled objects pending delete: 0
[state-dump]
[state-dump] ObjectManager:
[state-dump] - num local objects: 0
[state-dump] - num unfulfilled push requests: 0
[state-dump] - num object pull requests: 0
[state-dump] - num chunks received total: 0
[state-dump] - num chunks received failed (all): 0
[state-dump] - num chunks received failed / cancelled: 0
[state-dump] - num chunks received failed / plasma error: 0
[state-dump] Event stats:
[state-dump] Global stats: 0 total (0 active)
[state-dump] Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] Execution time: mean = -nan s, total = 0.000 s
[state-dump] Event stats:
[state-dump] PushManager:
[state-dump] - num pushes in flight: 0
[state-dump] - num chunks in flight: 0
[state-dump] - num chunks remaining: 0
[state-dump] - max chunks allowed: 409
[state-dump] OwnershipBasedObjectDirectory:
[state-dump] - num listeners: 0
[state-dump] - cumulative location updates: 0
[state-dump] - num location updates per second: 0.000
[state-dump] - num location lookups per second: 0.000
[state-dump] - num locations added per second: 0.000
[state-dump] - num locations removed per second: 0.000
[state-dump] BufferPool:
[state-dump] - create buffer state map size: 0
[state-dump] PullManager:
[state-dump] - num bytes available for pulled objects: 76944251289
[state-dump] - num bytes being pulled (all): 0
[state-dump] - num bytes being pulled / pinned: 0
[state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - first get request bundle: N/A
[state-dump] - first wait request bundle: N/A
[state-dump] - first task request bundle: N/A
[state-dump] - num objects queued: 0
[state-dump] - num objects actively pulled (all): 0
[state-dump] - num objects actively pulled / pinned: 0
[state-dump] - num bundles being pulled: 0
[state-dump] - num pull retries: 0
[state-dump] - max timeout seconds: 0
[state-dump] - max timeout request is already processed. No entry.
[state-dump]
[state-dump] WorkerPool:
[state-dump] - registered jobs: 0
[state-dump] - process_failed_job_config_missing: 0
[state-dump] - process_failed_rate_limited: 0
[state-dump] - process_failed_pending_registration: 0
[state-dump] - process_failed_runtime_env_setup_failed: 0
[state-dump] - num PYTHON workers: 0
[state-dump] - num PYTHON drivers: 0
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num idle workers: 0
[state-dump] TaskDependencyManager:
[state-dump] - task deps map size: 0
[state-dump] - get req map size: 0
[state-dump] - wait req map size: 0
[state-dump] - local objects map size: 0
[state-dump] WaitManager:
[state-dump] - num active wait requests: 0
[state-dump] Subscriber:
[state-dump] Channel WORKER_OBJECT_EVICTION
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_REF_REMOVED_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] num async plasma notifications: 0
[state-dump] Remote node managers:
[state-dump] Event stats:
[state-dump] Global stats: 20 total (11 active)
[state-dump] Queueing time: mean = 1.646 ms, max = 14.738 ms, min = 11.394 us, total = 32.922 ms
[state-dump] Execution time: mean = 876.586 us, total = 17.532 ms
[state-dump] Event stats:
[state-dump] PeriodicalRunner.RunFnPeriodically - 8 total (1 active, 1 running), CPU time: mean = 248.649 us, total = 1.989 ms
[state-dump] UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 173.036 us, total = 173.036 us
[state-dump] NodeManagerService.grpc_server.RequestResourceReport - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 15.369 ms, total = 15.369 ms
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]
[2023-04-07 13:39:45,981 I 3658508 3658508] (raylet) accessor.cc:590: Received notification for node id = 03faa854da38b999de7cb54e4570720ba36273cd4a972ed6076e7189, IsAlive = 1
[2023-04-07 13:39:46,151 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658578, the token is 0
[2023-04-07 13:39:46,153 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658579, the token is 1
[2023-04-07 13:39:46,154 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658580, the token is 2
[2023-04-07 13:39:46,156 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658581, the token is 3
[2023-04-07 13:39:46,158 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658582, the token is 4
[2023-04-07 13:39:46,159 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658583, the token is 5
[2023-04-07 13:39:46,160 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658584, the token is 6
[2023-04-07 13:39:46,161 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658585, the token is 7
[2023-04-07 13:39:46,162 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658586, the token is 8
[2023-04-07 13:39:46,163 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658587, the token is 9
[2023-04-07 13:39:46,164 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658588, the token is 10
[2023-04-07 13:39:46,165 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658589, the token is 11
[2023-04-07 13:39:46,166 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658590, the token is 12
[2023-04-07 13:39:46,167 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658591, the token is 13
[2023-04-07 13:39:46,169 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658592, the token is 14
[2023-04-07 13:39:46,170 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658593, the token is 15
[2023-04-07 13:39:46,171 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658594, the token is 16
[2023-04-07 13:39:46,172 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658595, the token is 17
[2023-04-07 13:39:46,174 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658596, the token is 18
[2023-04-07 13:39:46,175 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658597, the token is 19
[2023-04-07 13:39:46,176 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658598, the token is 20
[2023-04-07 13:39:46,177 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658599, the token is 21
[2023-04-07 13:39:46,178 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658600, the token is 22
[2023-04-07 13:39:46,180 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658601, the token is 23
[2023-04-07 13:39:46,181 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658602, the token is 24
[2023-04-07 13:39:46,182 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658603, the token is 25
[2023-04-07 13:39:46,183 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658604, the token is 26
[2023-04-07 13:39:46,184 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658605, the token is 27
[2023-04-07 13:39:46,186 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658606, the token is 28
[2023-04-07 13:39:46,187 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658607, the token is 29
[2023-04-07 13:39:46,188 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658608, the token is 30
[2023-04-07 13:39:46,189 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658609, the token is 31
[2023-04-07 13:39:46,190 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658610, the token is 32
[2023-04-07 13:39:46,191 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658611, the token is 33
[2023-04-07 13:39:46,192 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658612, the token is 34
[2023-04-07 13:39:46,193 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658613, the token is 35
[2023-04-07 13:39:46,195 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658614, the token is 36
[2023-04-07 13:39:46,196 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658615, the token is 37
[2023-04-07 13:39:46,197 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658616, the token is 38
[2023-04-07 13:39:46,198 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658617, the token is 39
[2023-04-07 13:39:46,871 I 3658508 3658520] (raylet) object_store.cc:35: Object store current usage 8e-09 / 76.9443 GB.
[2023-04-07 13:39:47,076 I 3658508 3658508] (raylet) agent_manager.cc:40: HandleRegisterAgent, ip: 162.105.250.159, port: 49926, id: 424238335
[2023-04-07 13:39:47,166 I 3658508 3658508] (raylet) node_manager.cc:590: New job has started. Job id 01000000 Driver pid 3620699 is dead: 0 driver address: 162.105.250.159
[2023-04-07 13:39:47,166 I 3658508 3658508] (raylet) worker_pool.cc:653: Job 01000000 already started in worker pool.
[2023-04-07 13:39:55,968 E 3658508 3658521] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-04-07_13-39-43_510811_3620699 is over 95% full, available space: 21831651328; capacity: 470428008448. Object creation will fail if spilling is required.
[2023-04-07 13:39:56,141 I 3658508 3658552] (raylet) agent_manager.cc:131: Agent process with id 424238335 exited, exit code 0. ip 162.105.250.159. id 424238335
[2023-04-07 13:39:56,141 E 3658508 3658552] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
- The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
- The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
- The agent is killed by the OS (e.g., out of memory).
[2023-04-07 13:39:56,141 I 3658508 3658508] (raylet) main.cc:300: Raylet received SIGTERM, shutting down...
[2023-04-07 13:39:56,141 I 3658508 3658508] (raylet) accessor.cc:435: Unregistering node info, node id = 03faa854da38b999de7cb54e4570720ba36273cd4a972ed6076e7189
[2023-04-07 13:39:56,141 W 3658508 3658514] (raylet) metric_exporter.cc:209: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2023-04-07 13:39:56,142 I 3658508 3658508] (raylet) io_service_pool.cc:47: IOServicePool is stopped.
My code is:
import time
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

import numpy as np
import ray

# distance() and Graph are defined elsewhere in my project.

@ray.remote
def cal_distance(a,
                 b,
                 alpha: Optional[float] = 0.5,
                 ) -> Tuple[float, np.ndarray]:
    return distance(a, b, alpha=alpha)

@ray.remote
def compute_distances(start, end, elements) -> List:
    distances = []
    for i in range(start, end):
        for j in range(i + 1, len(elements)):
            # the remote call returns a ray.ObjectRef
            distances.append((cal_distance.remote(elements[i], elements[j]), i, j))
    return distances

def parallel_compute_distances(elements: List[Graph],
                               cpus: Optional[int] = -1,
                               ) -> Tuple[np.ndarray, Dict]:
    start = time.time()
    ray.shutdown()
    if cpus > 0:
        ray.init(num_cpus=cpus)
    else:
        ray.init()
    # compute distances in parallel, one row chunk per task
    results = []
    step = max(int(len(elements) / ray.cluster_resources()['CPU']), 1)
    for i in range(0, len(elements), step):
        end = min(i + step, len(elements))
        results.append(compute_distances.remote(i, end, elements))
    # unpack into a symmetric matrix
    distance_matrix = np.zeros((len(elements), len(elements)))
    match_dict = defaultdict(dict)  # typing.Dict() cannot be instantiated; use a nested dict
    for r in results:  # one row chunk
        for d, i, j in ray.get(r):
            # print(f"i={i},j={j},d={d}")
            distance, match = ray.get(d)
            distance_matrix[i][j] = distance
            distance_matrix[j][i] = distance
            match_dict[i][j] = match
    ray.shutdown()
    return distance_matrix, match_dict