How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi, I ran into a problem when using Ray. (The same code runs correctly on another machine with the same Ray version in my tests.)
2023-04-07 13:39:46,137 INFO worker.py:1544 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(raylet) [2023-04-07 13:39:55,968 E 3658508 3658521] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-04-07_13-39-43_510811_3620699 is over 95% full, available space: 21831651328; capacity: 470428008448. Object creation will fail if spilling is required.
(raylet) [2023-04-07 13:39:56,141 E 3658508 3658552] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
(raylet) - The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
(raylet) - The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
(raylet) - The agent is killed by the OS (e.g., out of memory).
[2023-04-07 13:39:56,281 E 3620699 3658577] core_worker.cc:569: :info_message: Attempting to recover 2 lost objects by resubmitting their tasks. To disable object reconstruction, set @ray.remote(max_retries=0).
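The agent-failure message above suggests checking the installed `grpcio` version first. A small stdlib-only helper (the name `package_version` is mine, not part of Ray) does this from Python without shelling out to `pip freeze`:

```python
import importlib.metadata


def package_version(name: str) -> str:
    """Return the installed version of a package, or 'not installed'."""
    try:
        return importlib.metadata.version(name)
    except importlib.metadata.PackageNotFoundError:
        return "not installed"


# Compare this against the grpcio version pinned by your Ray release.
print("grpcio:", package_version("grpcio"))
```

For the "over 95% full" warning, relocating Ray's session directory to a filesystem with free space (e.g. `ray.init(_temp_dir="/data/ray_tmp")`, where the path is a placeholder) may also be worth trying, since object spilling will fail on a full disk.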
The log file raylet.log is:
[2023-04-07 13:39:45,962 I 3658508 3658508] (raylet) io_service_pool.cc:35: IOServicePool is running with 1 io_service.
[2023-04-07 13:39:45,963 I 3658508 3658508] (raylet) store_runner.cc:32: Allowing the Plasma store to use up to 76.9443GB of memory.
[2023-04-07 13:39:45,963 I 3658508 3658508] (raylet) store_runner.cc:48: Starting object store with directory /dev/shm, fallback /tmp/ray, and huge page support disabled
[2023-04-07 13:39:45,963 I 3658508 3658520] (raylet) dlmalloc.cc:154: create_and_mmap_buffer(76944375816, /dev/shm/plasmaXXXXXX)
[2023-04-07 13:39:45,963 E 3658508 3658520] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-04-07_13-39-43_510811_3620699 is over 95% full, available space: 21831839744; capacity: 470428008448. Object creation will fail if spilling is required.
[2023-04-07 13:39:45,963 I 3658508 3658520] (raylet) store.cc:554: ========== Plasma store: =================
Current usage: 0 / 76.9443 GB
- num bytes created total: 0
0 pending objects of total size 0MB
- objects spillable: 0
- bytes spillable: 0
- objects unsealed: 0
- bytes unsealed: 0
- objects in use: 0
- bytes in use: 0
- objects evictable: 0
- bytes evictable: 0
- objects created by worker: 0
- bytes created by worker: 0
- objects restored: 0
- bytes restored: 0
- objects received: 0
- bytes received: 0
- objects errored: 0
- bytes errored: 0
[2023-04-07 13:39:45,965 I 3658508 3658508] (raylet) grpc_server.cc:140: ObjectManager server started, listening on port 41985.
[2023-04-07 13:39:45,970 I 3658508 3658508] (raylet) worker_killing_policy.cc:100: Running GroupByOwner policy.
[2023-04-07 13:39:45,970 I 3658508 3658508] (raylet) memory_monitor.cc:47: MemoryMonitor initialized with usage threshold at 512650182656 bytes (0.95 system memory), total system memory bytes: 539631767552
[2023-04-07 13:39:45,970 I 3658508 3658508] (raylet) node_manager.cc:294: Initializing NodeManager with ID 03faa854da38b999de7cb54e4570720ba36273cd4a972ed6076e7189
[2023-04-07 13:39:45,971 I 3658508 3658508] (raylet) grpc_server.cc:140: NodeManager server started, listening on port 45963.
[2023-04-07 13:39:45,977 I 3658508 3658552] (raylet) agent_manager.cc:109: Monitor agent process with id 424238335, register timeout 30000ms.
[2023-04-07 13:39:45,978 I 3658508 3658508] (raylet) raylet.cc:115: Raylet of id, 03faa854da38b999de7cb54e4570720ba36273cd4a972ed6076e7189 started. Raylet consists of node_manager and object_manager. node_manager address: 162.105.250.159:45963 object_manager address: 162.105.250.159:41985 hostname: 162.105.250.159
[2023-04-07 13:39:45,981 I 3658508 3658508] (raylet) node_manager.cc:506: [state-dump] Event stats:
[state-dump]
[state-dump]
[state-dump] Global stats: 20 total (11 active)
[state-dump] Queueing time: mean = 1.646 ms, max = 14.738 ms, min = 11.394 us, total = 32.922 ms
[state-dump] Execution time: mean = 876.586 us, total = 17.532 ms
[state-dump] Event stats:
[state-dump] PeriodicalRunner.RunFnPeriodically - 8 total (1 active, 1 running), CPU time: mean = 248.649 us, total = 1.989 ms
[state-dump] UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 173.036 us, total = 173.036 us
[state-dump] NodeManagerService.grpc_server.RequestResourceReport - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 15.369 ms, total = 15.369 ms
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump]
[state-dump] NodeManager:
[state-dump] Node ID: 03faa854da38b999de7cb54e4570720ba36273cd4a972ed6076e7189
[state-dump] Node name: 162.105.250.159
[state-dump] InitialConfigResources: {accelerator_type:G: 10000, memory: 1695365863430000, CPU: 400000, node:162.105.250.159: 10000, object_store_memory: 769442512890000, GPU: 80000}
[state-dump] ClusterTaskManager:
[state-dump] ========== Node: 03faa854da38b999de7cb54e4570720ba36273cd4a972ed6076e7189 =================
[state-dump] Infeasible queue length: 0
[state-dump] Schedule queue length: 0
[state-dump] Dispatch queue length: 0
[state-dump] num_waiting_for_resource: 0
[state-dump] num_waiting_for_plasma_memory: 0
[state-dump] num_waiting_for_remote_node_resources: 0
[state-dump] num_worker_not_started_by_job_config_not_exist: 0
[state-dump] num_worker_not_started_by_registration_timeout: 0
[state-dump] num_tasks_waiting_for_workers: 0
[state-dump] num_cancelled_tasks: 0
[state-dump] cluster_resource_scheduler state:
[state-dump] Local id: 8836576097972133795 Local resources: {object_store_memory: [769442512890000]/[769442512890000], accelerator_type:G: [10000]/[10000], node:162.105.250.159: [10000]/[10000], GPU: [10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000]/[10000, 10000, 10000, 10000, 10000, 10000, 10000, 10000], memory: [1695365863430000]/[1695365863430000], CPU: [400000]/[400000]}node id: 8836576097972133795{object_store_memory: 769442512890000/769442512890000, node:162.105.250.159: 10000/10000, accelerator_type:G: 10000/10000, GPU: 80000/80000, memory: 1695365863430000/1695365863430000, CPU: 400000/400000}{ "placment group locations": [], "node to bundles": []}
[state-dump] Waiting tasks size: 0
[state-dump] Number of executing tasks: 0
[state-dump] Number of pinned task arguments: 0
[state-dump] Number of total spilled tasks: 0
[state-dump] Number of spilled waiting tasks: 0
[state-dump] Number of spilled unschedulable tasks: 0
[state-dump] Resource usage {
[state-dump] }
[state-dump] Running tasks by scheduling class:
[state-dump] ==================================================
[state-dump]
[state-dump] ClusterResources:
[state-dump] LocalObjectManager:
[state-dump] - num pinned objects: 0
[state-dump] - pinned objects size: 0
[state-dump] - num objects pending restore: 0
[state-dump] - num objects pending spill: 0
[state-dump] - num bytes pending spill: 0
[state-dump] - num bytes currently spilled: 0
[state-dump] - cumulative spill requests: 0
[state-dump] - cumulative restore requests: 0
[state-dump] - spilled objects pending delete: 0
[state-dump]
[state-dump] ObjectManager:
[state-dump] - num local objects: 0
[state-dump] - num unfulfilled push requests: 0
[state-dump] - num object pull requests: 0
[state-dump] - num chunks received total: 0
[state-dump] - num chunks received failed (all): 0
[state-dump] - num chunks received failed / cancelled: 0
[state-dump] - num chunks received failed / plasma error: 0
[state-dump] Event stats:
[state-dump] Global stats: 0 total (0 active)
[state-dump] Queueing time: mean = -nan s, max = -0.000 s, min = 9223372036.855 s, total = 0.000 s
[state-dump] Execution time: mean = -nan s, total = 0.000 s
[state-dump] Event stats:
[state-dump] PushManager:
[state-dump] - num pushes in flight: 0
[state-dump] - num chunks in flight: 0
[state-dump] - num chunks remaining: 0
[state-dump] - max chunks allowed: 409
[state-dump] OwnershipBasedObjectDirectory:
[state-dump] - num listeners: 0
[state-dump] - cumulative location updates: 0
[state-dump] - num location updates per second: 0.000
[state-dump] - num location lookups per second: 0.000
[state-dump] - num locations added per second: 0.000
[state-dump] - num locations removed per second: 0.000
[state-dump] BufferPool:
[state-dump] - create buffer state map size: 0
[state-dump] PullManager:
[state-dump] - num bytes available for pulled objects: 76944251289
[state-dump] - num bytes being pulled (all): 0
[state-dump] - num bytes being pulled / pinned: 0
[state-dump] - get request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - wait request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - task request bundles: BundlePullRequestQueue{0 total, 0 active, 0 inactive, 0 unpullable}
[state-dump] - first get request bundle: N/A
[state-dump] - first wait request bundle: N/A
[state-dump] - first task request bundle: N/A
[state-dump] - num objects queued: 0
[state-dump] - num objects actively pulled (all): 0
[state-dump] - num objects actively pulled / pinned: 0
[state-dump] - num bundles being pulled: 0
[state-dump] - num pull retries: 0
[state-dump] - max timeout seconds: 0
[state-dump] - max timeout request is already processed. No entry.
[state-dump]
[state-dump] WorkerPool:
[state-dump] - registered jobs: 0
[state-dump] - process_failed_job_config_missing: 0
[state-dump] - process_failed_rate_limited: 0
[state-dump] - process_failed_pending_registration: 0
[state-dump] - process_failed_runtime_env_setup_failed: 0
[state-dump] - num PYTHON workers: 0
[state-dump] - num PYTHON drivers: 0
[state-dump] - num object spill callbacks queued: 0
[state-dump] - num object restore queued: 0
[state-dump] - num util functions queued: 0
[state-dump] - num idle workers: 0
[state-dump] TaskDependencyManager:
[state-dump] - task deps map size: 0
[state-dump] - get req map size: 0
[state-dump] - wait req map size: 0
[state-dump] - local objects map size: 0
[state-dump] WaitManager:
[state-dump] - num active wait requests: 0
[state-dump] Subscriber:
[state-dump] Channel WORKER_OBJECT_EVICTION
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_REF_REMOVED_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] Channel WORKER_OBJECT_LOCATIONS_CHANNEL
[state-dump] - cumulative subscribe requests: 0
[state-dump] - cumulative unsubscribe requests: 0
[state-dump] - active subscribed publishers: 0
[state-dump] - cumulative published messages: 0
[state-dump] - cumulative processed messages: 0
[state-dump] num async plasma notifications: 0
[state-dump] Remote node managers:
[state-dump] Event stats:
[state-dump] Global stats: 20 total (11 active)
[state-dump] Queueing time: mean = 1.646 ms, max = 14.738 ms, min = 11.394 us, total = 32.922 ms
[state-dump] Execution time: mean = 876.586 us, total = 17.532 ms
[state-dump] Event stats:
[state-dump] PeriodicalRunner.RunFnPeriodically - 8 total (1 active, 1 running), CPU time: mean = 248.649 us, total = 1.989 ms
[state-dump] UNKNOWN - 2 total (2 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.RegisterNode - 1 total (0 active), CPU time: mean = 173.036 us, total = 173.036 us
[state-dump] NodeManagerService.grpc_server.RequestResourceReport - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.flush_free_objects - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] MemoryMonitor.CheckIsMemoryUsageAboveThreshold - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.debug_state_dump - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeManager.deadline_timer.record_metrics - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] NodeInfoGcsService.grpc_client.GetInternalConfig - 1 total (0 active), CPU time: mean = 15.369 ms, total = 15.369 ms
[state-dump] RayletWorkerPool.deadline_timer.kill_idle_workers - 1 total (1 active), CPU time: mean = 0.000 s, total = 0.000 s
[state-dump] DebugString() time ms: 1
[state-dump]
[state-dump]
[2023-04-07 13:39:45,981 I 3658508 3658508] (raylet) accessor.cc:590: Received notification for node id = 03faa854da38b999de7cb54e4570720ba36273cd4a972ed6076e7189, IsAlive = 1
[2023-04-07 13:39:46,151 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658578, the token is 0
[2023-04-07 13:39:46,153 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658579, the token is 1
[2023-04-07 13:39:46,154 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658580, the token is 2
[2023-04-07 13:39:46,156 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658581, the token is 3
[2023-04-07 13:39:46,158 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658582, the token is 4
[2023-04-07 13:39:46,159 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658583, the token is 5
[2023-04-07 13:39:46,160 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658584, the token is 6
[2023-04-07 13:39:46,161 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658585, the token is 7
[2023-04-07 13:39:46,162 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658586, the token is 8
[2023-04-07 13:39:46,163 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658587, the token is 9
[2023-04-07 13:39:46,164 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658588, the token is 10
[2023-04-07 13:39:46,165 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658589, the token is 11
[2023-04-07 13:39:46,166 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658590, the token is 12
[2023-04-07 13:39:46,167 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658591, the token is 13
[2023-04-07 13:39:46,169 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658592, the token is 14
[2023-04-07 13:39:46,170 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658593, the token is 15
[2023-04-07 13:39:46,171 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658594, the token is 16
[2023-04-07 13:39:46,172 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658595, the token is 17
[2023-04-07 13:39:46,174 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658596, the token is 18
[2023-04-07 13:39:46,175 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658597, the token is 19
[2023-04-07 13:39:46,176 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658598, the token is 20
[2023-04-07 13:39:46,177 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658599, the token is 21
[2023-04-07 13:39:46,178 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658600, the token is 22
[2023-04-07 13:39:46,180 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658601, the token is 23
[2023-04-07 13:39:46,181 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658602, the token is 24
[2023-04-07 13:39:46,182 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658603, the token is 25
[2023-04-07 13:39:46,183 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658604, the token is 26
[2023-04-07 13:39:46,184 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658605, the token is 27
[2023-04-07 13:39:46,186 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658606, the token is 28
[2023-04-07 13:39:46,187 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658607, the token is 29
[2023-04-07 13:39:46,188 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658608, the token is 30
[2023-04-07 13:39:46,189 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658609, the token is 31
[2023-04-07 13:39:46,190 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658610, the token is 32
[2023-04-07 13:39:46,191 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658611, the token is 33
[2023-04-07 13:39:46,192 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658612, the token is 34
[2023-04-07 13:39:46,193 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658613, the token is 35
[2023-04-07 13:39:46,195 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658614, the token is 36
[2023-04-07 13:39:46,196 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658615, the token is 37
[2023-04-07 13:39:46,197 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658616, the token is 38
[2023-04-07 13:39:46,198 I 3658508 3658508] (raylet) worker_pool.cc:470: Started worker process with pid 3658617, the token is 39
[2023-04-07 13:39:46,871 I 3658508 3658520] (raylet) object_store.cc:35: Object store current usage 8e-09 / 76.9443 GB.
[2023-04-07 13:39:47,076 I 3658508 3658508] (raylet) agent_manager.cc:40: HandleRegisterAgent, ip: 162.105.250.159, port: 49926, id: 424238335
[2023-04-07 13:39:47,166 I 3658508 3658508] (raylet) node_manager.cc:590: New job has started. Job id 01000000 Driver pid 3620699 is dead: 0 driver address: 162.105.250.159
[2023-04-07 13:39:47,166 I 3658508 3658508] (raylet) worker_pool.cc:653: Job 01000000 already started in worker pool.
[2023-04-07 13:39:55,968 E 3658508 3658521] (raylet) file_system_monitor.cc:105: /tmp/ray/session_2023-04-07_13-39-43_510811_3620699 is over 95% full, available space: 21831651328; capacity: 470428008448. Object creation will fail if spilling is required.
[2023-04-07 13:39:56,141 I 3658508 3658552] (raylet) agent_manager.cc:131: Agent process with id 424238335 exited, exit code 0. ip 162.105.250.159. id 424238335
[2023-04-07 13:39:56,141 E 3658508 3658552] (raylet) agent_manager.cc:135: The raylet exited immediately because the Ray agent failed. The raylet fate shares with the agent. This can happen because the Ray agent was unexpectedly killed or failed. Agent can fail when
- The version of `grpcio` doesn't follow Ray's requirement. Agent can segfault with the incorrect `grpcio` version. Check the grpcio version `pip freeze | grep grpcio`.
- The agent failed to start because of unexpected error or port conflict. Read the log `cat /tmp/ray/session_latest/dashboard_agent.log`. You can find the log file structure here https://docs.ray.io/en/master/ray-observability/ray-logging.html#logging-directory-structure.
- The agent is killed by the OS (e.g., out of memory).
[2023-04-07 13:39:56,141 I 3658508 3658508] (raylet) main.cc:300: Raylet received SIGTERM, shutting down...
[2023-04-07 13:39:56,141 I 3658508 3658508] (raylet) accessor.cc:435: Unregistering node info, node id = 03faa854da38b999de7cb54e4570720ba36273cd4a972ed6076e7189
[2023-04-07 13:39:56,141 W 3658508 3658514] (raylet) metric_exporter.cc:209: [1] Export metrics to agent failed: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details: . This won't affect Ray, but you can lose metrics from the cluster.
[2023-04-07 13:39:56,142 I 3658508 3658508] (raylet) io_service_pool.cc:47: IOServicePool is stopped.
My code is:
import time
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

import numpy as np
import ray

# distance() and Graph are defined elsewhere in my project.

@ray.remote
def cal_distance(a,
                 b,
                 alpha: Optional[float] = 0.5,
                 ) -> Tuple[float, np.ndarray]:
    return distance(a, b, alpha=alpha)

@ray.remote
def compute_distances(start, end, elements) -> List:
    distances = []
    for i in range(start, end):
        for j in range(i + 1, len(elements)):
            # the remote call returns a ray.ObjectRef
            distances.append((cal_distance.remote(elements[i], elements[j]), i, j))
    return distances

def parallel_compute_distances(elements: List[Graph],
                               cpus: Optional[int] = -1,
                               ) -> Tuple[np.ndarray, Dict]:
    start = time.time()
    ray.shutdown()
    if cpus > 0:
        ray.init(num_cpus=cpus)
    else:
        ray.init()
    # compute distances in parallel, one row chunk per task
    results = []
    step = max(int(len(elements) / ray.cluster_resources()['CPU']), 1)
    for i in range(0, len(elements), step):
        end = min(i + step, len(elements))
        results.append(compute_distances.remote(i, end, elements))
    # unpack into a symmetric matrix
    distance_matrix = np.zeros((len(elements), len(elements)))
    match_dict = defaultdict(dict)  # typing.Dict() cannot be instantiated; use a nested dict
    for r in results:  # one row chunk
        for d, i, j in ray.get(r):
            # print(f"i={i},j={j},d={d}")
            distance, match = ray.get(d)
            distance_matrix[i][j] = distance
            distance_matrix[j][i] = distance
            match_dict[i][j] = match
    ray.shutdown()
    return distance_matrix, match_dict