Hello,
The run had stopped when I checked this morning.
The original error:
2022-09-13 21:31:06,324 WARNING worker.py:1829 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffc00821a3d8c8e614fe5fde3601000000 Worker ID: 4534ef165311b4d3047042e27ef5b0521e4eb43f3ba671105d592847 Node ID: a561b0101bc5614e82e2d5b4f90dec765174a37a62e847e9de0dc96f Worker IP address: 192.168.0.42 Worker port: 10005 Worker PID: 374 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2022-09-13 21:31:12,929 ERROR trial_runner.py:980 -- Trial PPO(nv-1000_nl-3_nz-10_ls-5000.0_nb-1_bls-500.0_blp-2750.0_bll-0_bdur-END_puf-1_ec-0.3_df-0.5): Error processing event.
ray.tune.error._TuneNoNextExecutorEventError: Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/tune/execution/ray_trial_executor.py", line 989, in get_next_executor_event
future_result = ray.get(ready_future)
File "/usr/local/lib/python3.6/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/ray/_private/worker.py", line 2277, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: PPO
actor_id: c00821a3d8c8e614fe5fde3601000000
pid: 374
namespace: 2a3ac498-b1a4-4122-a924-87602cd4a194
ip: 192.168.0.42
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
2022-09-13 21:32:09,939 ERROR ray_trial_executor.py:104 -- An exception occurred when trying to stop the Ray actor:Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/ray/tune/execution/ray_trial_executor.py", line 94, in _post_stop_cleanup
ray.get(future, timeout=0)
File "/usr/local/lib/python3.6/dist-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/ray/_private/worker.py", line 2277, in get
raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: PPO
actor_id: c00821a3d8c8e614fe5fde3601000000
pid: 374
namespace: 2a3ac498-b1a4-4122-a924-87602cd4a194
ip: 192.168.0.42
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
File "train.py", line 326, in <module>
_main()
File "train.py", line 318, in _main
local_dir='~/devel/rllibsumoutils/pheromone-RL/pheromone-PPO/tune-results',
File "/usr/local/lib/python3.6/dist-packages/ray/tune/tune.py", line 752, in run
raise TuneError("Trials did not complete", incomplete_trials)
ray.tune.error.TuneError: ('Trials did not complete', [PPO(nv-1000_nl-3_nz-10_ls-5000.0_nb-1_bls-500.0_blp-2750.0_bll-0_bdur-END_puf-1_ec-0.3_df-0.5)])
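The exit detail above lists the OOM killer as one of the potential root causes, so one thing I plan to do is look for OOM-killer messages in the kernel log with something like the quick sketch below (it assumes dmesg is readable from inside the container, which may require root or kernel.dmesg_restrict=0, and that the ring buffer still covers the time of the crash):

# oom_check.py -- look for OOM-killer messages in the kernel log.
# Quick sketch only; dmesg may need elevated permissions to be readable.
import subprocess

def find_oom_lines():
    """Return dmesg lines that look like OOM-killer activity."""
    out = subprocess.check_output(["dmesg"], universal_newlines=True)
    keywords = ("out of memory", "oom-killer", "killed process")
    return [line for line in out.splitlines()
            if any(k in line.lower() for k in keywords)]

if __name__ == "__main__":
    hits = find_oom_lines()
    if hits:
        print("Possible OOM-killer events:")
        for line in hits:
            print("  " + line)
    else:
        print("No OOM-killer entries found in dmesg output.")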
Separately, there are two Python core driver logs for some reason. I don't understand why; both of them start at the same time.
python-core-driver-01000000ffffffffffffffffffffffffffffffffffffffffffffffff_279.log - PID: 279
python-core-driver-02000000ffffffffffffffffffffffffffffffffffffffffffffffff_58.log - PID: 58
Some error messages appear in the first one, but they are timestamped 2022-09-13 21:31:03, i.e. yesterday evening, when the run still seemed to be going smoothly:
[2022-09-13 21:31:03,394 I 279 301] task_manager.cc:446: Task failed: GrpcUnavailable: RPC Error message: Socket closed; RPC Error details: : Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.rllib.algorithms.ppo.ppo, class_name=PPO, function_name=train, function_hash=}, task_id=dc746dc61b2c1923c00821a3d8c8e614fe5fde3601000000, task_name=PPO.train(), job_id=01000000, num_args=0, num_returns=2, depth=0, actor_task_spec={actor_id=c00821a3d8c8e614fe5fde3601000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff01000000, actor_counter=19}
[2022-09-13 21:31:03,396 I 279 301] direct_actor_task_submitter.cc:539: PushActorTask failed because of network error, this task will be stashed away and waiting for Death info from GCS, task_id=dc746dc61b2c1923c00821a3d8c8e614fe5fde3601000000, wait_queue_size=1
[2022-09-13 21:31:03,407 I 279 301] actor_manager.cc:214: received notification on actor, state: DEAD, actor_id: c00821a3d8c8e614fe5fde3601000000, ip address: 192.168.0.42, port: 10005, worker_id: 4534ef165311b4d3047042e27ef5b0521e4eb43f3ba671105d592847, raylet_id: a561b0101bc5614e82e2d5b4f90dec765174a37a62e847e9de0dc96f, num_restarts: 0, death context type=ActorDiedErrorContext
[2022-09-13 21:31:03,407 I 279 301] direct_actor_task_submitter.cc:278: Failing pending tasks for actor c00821a3d8c8e614fe5fde3601000000 because the actor is already dead.
[2022-09-13 21:31:12,998 I 279 279] task_manager.cc:446: Task failed: IOError: cancelling task of dead actor: Type=ACTOR_TASK, Language=PYTHON, Resources: {}, function_descriptor={type=PythonFunctionDescriptor, module_name=ray.rllib.algorithms.ppo.ppo, class_name=PPO, function_name=stop, function_hash=}, task_id=c76a79b2875a7251c00821a3d8c8e614fe5fde3601000000, task_name=PPO.stop(), job_id=01000000, num_args=0, num_returns=2, depth=0, actor_task_spec={actor_id=c00821a3d8c8e614fe5fde3601000000, actor_caller_id=ffffffffffffffffffffffffffffffffffffffff01000000, actor_counter=20}
[2022-09-13 21:31:49,659 I 279 301] core_worker.cc:476: Event stats:
Queueing time: mean = 151.141 us, max = 4.874 s, min = -0.094 s, total = 123.346 s
Execution time: mean = 82.065 us, total = 66.973 s
Event stats:
UNKNOWN - 641306 total (4 active, 1 running), CPU time: mean = 18.916 us, total = 12.131 s
CoreWorker.deadline_timer.flush_profiling_events - 58242 total (1 active), CPU time: mean = 4.910 us, total = 285.959 ms
NodeManagerService.grpc_client.ReportWorkerBacklog - 58241 total (0 active), CPU time: mean = 8.039 us, total = 468.190 ms
CoreWorkerService.grpc_server.GetCoreWorkerStats - 57882 total (0 active), CPU time: mean = 51.602 us, total = 2.987 s
InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 120 total (1 active), CPU time: mean = 295.524 us, total = 35.463 ms
CoreWorkerService.grpc_server.LocalGC - 97 total (0 active), CPU time: mean = 526.200 ms, total = 51.041 s
CoreWorkerService.grpc_server.PubsubCommandBatch - 55 total (0 active), CPU time: mean = 52.427 us, total = 2.883 ms
CoreWorkerService.grpc_server.PubsubLongPolling - 45 total (0 active), CPU time: mean = 7.810 us, total = 351.431 us
CoreWorkerService.grpc_server.UpdateObjectLocationBatch - 27 total (0 active), CPU time: mean = 66.997 us, total = 1.809 ms
CoreWorkerService.grpc_client.PushTask - 21 total (0 active), CPU time: mean = 494.822 us, total = 10.391 ms
CoreWorkerDirectActorTaskSubmitter::SubmitTask - 20 total (0 active), CPU time: mean = 179.994 us, total = 3.600 ms
StatsGcsService.grpc_client.AddProfileData - 20 total (0 active), CPU time: mean = 13.241 us, total = 264.825 us
PeriodicalRunner.RunFnPeriodically - 5 total (0 active), CPU time: mean = 44.210 us, total = 221.050 us
Subscriber.HandlePublishedMessage_GCS_ACTOR_CHANNEL - 4 total (0 active), CPU time: mean = 630.415 us, total = 2.522 ms
InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 2 total (0 active), CPU time: mean = 100.660 us, total = 201.320 us
CoreWorkerService.grpc_server.WaitForActorOutOfScope - 1 total (0 active), CPU time: mean = 12.157 us, total = 12.157 us
NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 36.506 us, total = 36.506 us
ActorInfoGcsService.grpc_client.CreateActor - 1 total (0 active), CPU time: mean = 32.697 us, total = 32.697 us
NodeManagerService.grpc_client.RequestWorkerLease - 1 total (0 active), CPU time: mean = 260.194 us, total = 260.194 us
PlacementGroupInfoGcsService.grpc_client.CreatePlacementGroup - 1 total (0 active), CPU time: mean = 66.894 us, total = 66.894 us
ActorCreator.AsyncRegisterActor - 1 total (0 active), CPU time: mean = 126.684 us, total = 126.684 us
CoreWorker.SubmitTask - 1 total (0 active), CPU time: mean = 713.898 us, total = 713.898 us
ActorInfoGcsService.grpc_client.RegisterActor - 1 total (0 active), CPU time: mean = 94.697 us, total = 94.697 us
NodeManagerService.grpc_client.ReturnWorker - 1 total (0 active), CPU time: mean = 5.393 us, total = 5.393 us
ActorInfoGcsService.grpc_client.GetActorInfo - 1 total (0 active), CPU time: mean = 469.095 us, total = 469.095 us
WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 11.321 us, total = 11.321 us
[2022-09-13 21:32:11,078 I 279 279] core_worker.cc:593: Disconnecting to the raylet.
[2022-09-13 21:32:11,078 I 279 279] raylet_client.cc:163: RayletClient::Disconnect, exit_type=INTENDED_USER_EXIT, exit_detail=Shutdown by ray.shutdown()., has creation_task_exception_pb_bytes=0
[2022-09-13 21:32:11,078 I 279 279] core_worker.cc:540: Shutting down a core worker.
[2022-09-13 21:32:11,078 I 279 279] core_worker.cc:564: Disconnecting a GCS client.
[2022-09-13 21:32:11,078 I 279 279] core_worker.cc:568: Waiting for joining a core worker io thread. If it hangs here, there might be deadlock or a high load in the core worker io service.
[2022-09-13 21:32:11,078 I 279 301] core_worker.cc:691: Core worker main io service stopped.
[2022-09-13 21:32:11,079 I 279 279] core_worker.cc:577: Core worker ready to be deallocated.
[2022-09-13 21:32:11,079 I 279 279] core_worker.cc:531: Core worker is destructed
[2022-09-13 21:32:11,226 I 279 279] core_worker_process.cc:144: Destructing CoreWorkerProcessImpl. pid: 279
[2022-09-13 21:32:11,226 I 279 279] io_service_pool.cc:47: IOServicePool is stopped.
Below is a comparison of the event stats at two points in time (earlier in the run and at the end) from the log file for the worker ID noted in the original error (4534ef165311b4d3047042e27ef5b0521e4eb43f3ba671105d592847), i.e. the file python-core-worker-4534ef165311b4d3047042e27ef5b0521e4eb43f3ba671105d592847_374.log:
[2022-09-13 08:04:50,628 I 374 394] core_worker.cc:476: Event stats:
Global stats: 147632 total (7 active)
Queueing time: mean = 270.607 us, max = 210.064 ms, min = -0.011 s, total = 39.950 s
Execution time: mean = 42.538 us, total = 6.280 s
Event stats:
UNKNOWN - 117965 total (5 active, 1 running), CPU time: mean = 20.876 us, total = 2.463 s
CoreWorker.deadline_timer.flush_profiling_events - 9837 total (1 active), CPU time: mean = 8.733 us, total = 85.909 ms
NodeManagerService.grpc_client.ReportWorkerBacklog - 9836 total (0 active), CPU time: mean = 10.073 us, total = 99.076 ms
CoreWorkerService.grpc_server.GetCoreWorkerStats - 9787 total (0 active), CPU time: mean = 103.127 us, total = 1.009 s
CoreWorkerService.grpc_server.PubsubCommandBatch - 39 total (0 active), CPU time: mean = 40.725 us, total = 1.588 ms
CoreWorkerService.grpc_server.PubsubLongPolling - 29 total (0 active), CPU time: mean = 9.669 us, total = 280.390 us
InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 24 total (1 active), CPU time: mean = 178.458 us, total = 4.283 ms
CoreWorkerService.grpc_server.UpdateObjectLocationBatch - 21 total (0 active), CPU time: mean = 64.287 us, total = 1.350 ms
CoreWorkerService.grpc_server.LocalGC - 16 total (0 active), CPU time: mean = 163.032 ms, total = 2.609 s
CoreWorkerService.grpc_client.PushTask - 15 total (0 active), CPU time: mean = 104.711 us, total = 1.571 ms
CoreWorkerDirectActorTaskSubmitter::SubmitTask - 15 total (0 active), CPU time: mean = 90.475 us, total = 1.357 ms
StatsGcsService.grpc_client.AddProfileData - 11 total (0 active), CPU time: mean = 26.087 us, total = 286.960 us
Subscriber.HandlePublishedMessage_GCS_ACTOR_CHANNEL - 6 total (0 active), CPU time: mean = 98.578 us, total = 591.468 us
PeriodicalRunner.RunFnPeriodically - 6 total (0 active), CPU time: mean = 44.730 us, total = 268.378 us
NodeManagerService.grpc_client.PinObjectIDs - 5 total (0 active), CPU time: mean = 20.112 us, total = 100.562 us
CoreWorkerService.grpc_server.PushTask - 5 total (0 active), CPU time: mean = 60.971 us, total = 304.856 us
InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 3 total (0 active), CPU time: mean = 72.790 us, total = 218.371 us
ActorInfoGcsService.grpc_client.GetActorInfo - 2 total (0 active), CPU time: mean = 385.013 us, total = 770.026 us
ActorInfoGcsService.grpc_client.RegisterActor - 2 total (0 active), CPU time: mean = 331.548 us, total = 663.095 us
ActorInfoGcsService.grpc_client.CreateActor - 2 total (0 active), CPU time: mean = 86.492 us, total = 172.983 us
ActorCreator.AsyncRegisterActor - 2 total (0 active), CPU time: mean = 274.019 us, total = 548.038 us
CoreWorkerService.grpc_server.WaitForActorOutOfScope - 2 total (0 active), CPU time: mean = 14.777 us, total = 29.555 us
NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 41.465 us, total = 41.465 us
WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 11.189 us, total = 11.189 us
.
.
.
Global stats: 872170 total (8 active)
Queueing time: mean = 416.588 us, max = 1.372 s, min = -0.027 s, total = 363.336 s
Execution time: mean = 63.779 us, total = 55.626 s
Event stats:
UNKNOWN - 696904 total (5 active, 1 running), CPU time: mean = 23.622 us, total = 16.462 s
CoreWorker.deadline_timer.flush_profiling_events - 58166 total (1 active), CPU time: mean = 7.395 us, total = 430.154 ms
NodeManagerService.grpc_client.ReportWorkerBacklog - 58165 total (0 active), CPU time: mean = 11.117 us, total = 646.627 ms
CoreWorkerService.grpc_server.GetCoreWorkerStats - 57825 total (0 active), CPU time: mean = 106.332 us, total = 6.149 s
CoreWorkerService.grpc_server.PubsubCommandBatch - 264 total (0 active), CPU time: mean = 56.212 us, total = 14.840 ms
CoreWorkerService.grpc_server.PubsubLongPolling - 194 total (0 active), CPU time: mean = 16.010 us, total = 3.106 ms
CoreWorkerService.grpc_server.UpdateObjectLocationBatch - 143 total (0 active), CPU time: mean = 139.546 us, total = 19.955 ms
InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 106 total (1 active), CPU time: mean = 317.772 us, total = 33.684 ms
CoreWorkerService.grpc_server.LocalGC - 97 total (0 active), CPU time: mean = 328.146 ms, total = 31.830 s
CoreWorkerService.grpc_client.PushTask - 78 total (1 active), CPU time: mean = 222.635 us, total = 17.366 ms
CoreWorkerDirectActorTaskSubmitter::SubmitTask - 78 total (0 active), CPU time: mean = 147.249 us, total = 11.485 ms
StatsGcsService.grpc_client.AddProfileData - 70 total (0 active), CPU time: mean = 21.250 us, total = 1.487 ms
NodeManagerService.grpc_client.PinObjectIDs - 32 total (0 active), CPU time: mean = 32.621 us, total = 1.044 ms
CoreWorkerService.grpc_server.PushTask - 21 total (0 active), CPU time: mean = 85.503 us, total = 1.796 ms
Subscriber.HandlePublishedMessage_GCS_ACTOR_CHANNEL - 6 total (0 active), CPU time: mean = 98.578 us, total = 591.468 us
PeriodicalRunner.RunFnPeriodically - 6 total (0 active), CPU time: mean = 44.730 us, total = 268.378 us
InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 3 total (0 active), CPU time: mean = 72.790 us, total = 218.371 us
ActorInfoGcsService.grpc_client.GetActorInfo - 2 total (0 active), CPU time: mean = 385.013 us, total = 770.026 us
ActorInfoGcsService.grpc_client.RegisterActor - 2 total (0 active), CPU time: mean = 331.548 us, total = 663.095 us
ActorInfoGcsService.grpc_client.CreateActor - 2 total (0 active), CPU time: mean = 86.492 us, total = 172.983 us
ActorCreator.AsyncRegisterActor - 2 total (0 active), CPU time: mean = 274.019 us, total = 548.038 us
CoreWorkerService.grpc_server.WaitForActorOutOfScope - 2 total (0 active), CPU time: mean = 14.777 us, total = 29.555 us
NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 41.465 us, total = 41.465 us
WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 11.189 us, total = 11.189 us
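The two snapshots above were excerpted by hand; to see how these totals grow over the whole run, I can pull the recurring stats lines out of the worker log with a small script like the one below (a quick sketch; the path assumes Ray's default /tmp/ray/session_latest/logs directory on this node and the worker log name from the original error):

# stats_trend.py -- extract the recurring event-stats lines from the core
# worker log to see how the totals grow over the run.
LOG_PATH = ("/tmp/ray/session_latest/logs/"
            "python-core-worker-4534ef165311b4d3047042e27ef5b0521e4eb43f3ba671105d592847_374.log")

MARKERS = ("core_worker.cc:476", "Global stats:", "Queueing time:")

with open(LOG_PATH) as log_file:
    for line in log_file:
        # Print the timestamped "Event stats:" header plus the two summary lines.
        if any(marker in line for marker in MARKERS):
            print(line.rstrip())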
I am not able to figure out the actual nature of the problem, although the error message hints that it has something to do with memory. I still have all the logs; any suggestions on what I should be looking at from here?
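In the meantime, since the hint points at memory, I am going to log node-wide memory usage alongside the run so that a future crash can be lined up against memory pressure. This is only a minimal sketch; it assumes psutil is installed, and the output path and interval are placeholders I picked:

# memory_watch.py -- periodically record node-wide memory usage.
import time
import psutil  # assumed to be available in the container

LOG_FILE = "memory_usage.log"  # placeholder output path
INTERVAL_S = 30                # placeholder sampling interval in seconds

def main():
    with open(LOG_FILE, "a") as f:
        while True:
            vm = psutil.virtual_memory()
            f.write("{} used={:.2f} GiB avail={:.2f} GiB ({:.1f}%)\n".format(
                time.strftime("%Y-%m-%d %H:%M:%S"),
                (vm.total - vm.available) / 2 ** 30,
                vm.available / 2 ** 30,
                vm.percent,
            ))
            f.flush()
            time.sleep(INTERVAL_S)

if __name__ == "__main__":
    main()

Thank you.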