We are running a Ray Tune job with Ray 2.1 and see this error:
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
class_name: RayTrainWorker
actor_id: 689396639442604bd3ecda2203000000
namespace: 85f118ec-0996-4649-a5af-650b677be9b2
ip: 100.97.243.215
The actor is dead because its node has died. Node Id: 9e902b5752f374fda97028c09b7654d8cce33d1f56825df84c657d23
The actor never ran - it was cancelled before it started running.
GCS log:
[2023-02-06 19:02:02,770 I 40 40] (gcs_server) gcs_server.cc:186: GcsNodeManager:
- RegisterNode request count: 7
- DrainNode request count: 0
- GetAllNodeInfo request count: 120346
- GetInternalConfig request count: 13
GcsActorManager:
- RegisterActor request count: 23
- CreateActor request count: 23
- GetActorInfo request count: 23
- GetNamedActorInfo request count: 1
- GetAllActorInfo request count: 1
- KillActor request count: 10
- ListNamedActors request count: 0
- Registered actors count: 5
- Destroyed actors count: 18
- Named actors count: 1
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 3
- owners_: 2
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 18
GcsResourceManager:
- GetResources request count: 42
- GetAllAvailableResources request count: 1
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 48067
GcsPlacementGroupManager:
- CreatePlacementGroup request count: 7
- RemovePlacementGroup request count: 6
- GetPlacementGroup request count: 28
- GetAllPlacementGroup request count: 1
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 55986
- Registered placement groups count: 1
- Named placement group count: 1
- Pending placement groups count: 0
- Infeasible placement groups count: 0
GcsPublisher {}
[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:
GrpcBasedResourceBroadcaster:
- Tracked nodes: 6
[2023-02-06 19:02:02,770 I 40 40] (gcs_server) gcs_server.cc:736: Event stats:
Global stats: 51990358 total (8 active)
Queueing time: mean = 14.341 us, max = 224.680 ms, min = -0.001 s, total = 745.585 s
Execution time: mean = 19.440 us, total = 1010.715 s
Event stats:
NodeManagerService.grpc_client.UpdateResourceUsage - 14431916 total (0 active), CPU time: mean = 5.685 us, total = 82.043 s
NodeManagerService.grpc_client.RequestResourceReport - 14194457 total (0 active), CPU time: mean = 27.345 us, total = 388.152 s
ResourceUpdate - 14194172 total (0 active), CPU time: mean = 16.054 us, total = 227.874 s
RaySyncer.deadline_timer.report_resource_report - 2406898 total (1 active), CPU time: mean = 16.135 us, total = 38.835 s
NodeManagerService.grpc_client.GetResourceLoad - 1443994 total (0 active), CPU time: mean = 5.802 us, total = 8.378 s
GcsInMemoryStore.Get - 1240217 total (0 active), CPU time: mean = 57.093 us, total = 70.807 s
InternalKVGcsService.grpc_server.InternalKVGet - 1240203 total (0 active), CPU time: mean = 23.666 us, total = 29.351 s
GcsInMemoryStore.Put - 1057590 total (0 active), CPU time: mean = 37.481 us, total = 39.639 s
InternalKVGcsService.grpc_server.InternalKVPut - 576996 total (0 active), CPU time: mean = 25.472 us, total = 14.697 s
StatsGcsService.grpc_server.AddProfileData - 480384 total (0 active), CPU time: mean = 34.541 us, total = 16.593 s
RayletLoadPulled - 240719 total (1 active), CPU time: mean = 238.224 us, total = 57.345 s
NodeResourceInfoGcsService.grpc_server.GetGcsSchedulingStats - 230908 total (0 active), CPU time: mean = 44.740 us, total = 10.331 s
NodeInfoGcsService.grpc_server.GetAllNodeInfo - 120346 total (0 active), CPU time: mean = 73.181 us, total = 8.807 s
GcsPlacementGroupManager.SchedulePendingPlacementGroups - 54420 total (0 active), CPU time: mean = 2.071 us, total = 112.716 ms
NodeResourceInfoGcsService.grpc_server.GetAllResourceUsage - 48067 total (0 active), CPU time: mean = 137.989 us, total = 6.633 s
GCSServer.deadline_timer.debug_state_dump - 24078 total (1 active), CPU time: mean = 401.596 us, total = 9.670 s
GCSServer.deadline_timer.debug_state_event_stats_print - 4013 total (1 active, 1 running), CPU time: mean = 342.855 us, total = 1.376 s
NodeManagerService.grpc_client.RequestWorkerLease - 164 total (0 active), CPU time: mean = 43.989 us, total = 7.214 ms
GcsInMemoryStore.Keys - 139 total (0 active), CPU time: mean = 20.924 us, total = 2.908 ms
InternalKVGcsService.grpc_server.InternalKVKeys - 135 total (0 active), CPU time: mean = 13.228 us, total = 1.786 ms
GcsInMemoryStore.GetAll - 83 total (0 active), CPU time: mean = 164.924 us, total = 13.689 ms
JobInfoGcsService.grpc_server.GetAllJobInfo - 75 total (0 active), CPU time: mean = 46.367 us, total = 3.478 ms
NodeResourceInfoGcsService.grpc_server.GetResources - 42 total (0 active), CPU time: mean = 31.875 us, total = 1.339 ms
WorkerInfoGcsService.grpc_server.AddWorkerInfo - 36 total (0 active), CPU time: mean = 25.788 us, total = 928.368 us
PlacementGroupInfoGcsService.grpc_server.GetPlacementGroup - 28 total (0 active), CPU time: mean = 56.384 us, total = 1.579 ms
WorkerInfoGcsService.grpc_server.ReportWorkerFailure - 26 total (0 active), CPU time: mean = 89.464 us, total = 2.326 ms
ActorInfoGcsService.grpc_server.GetActorInfo - 23 total (0 active), CPU time: mean = 29.231 us, total = 672.320 us
ActorInfoGcsService.grpc_server.CreateActor - 23 total (0 active), CPU time: mean = 195.973 us, total = 4.507 ms
ActorInfoGcsService.grpc_server.RegisterActor - 23 total (0 active), CPU time: mean = 330.017 us, total = 7.590 ms
CoreWorkerService.grpc_client.WaitForActorOutOfScope - 22 total (4 active), CPU time: mean = 218.624 us, total = 4.810 ms
CoreWorkerService.grpc_client.PushTask - 22 total (0 active), CPU time: mean = 297.975 us, total = 6.555 ms
GcsInMemoryStore.BatchDelete - 19 total (0 active), CPU time: mean = 2.571 us, total = 48.858 us
NodeManagerService.grpc_client.PrepareBundleResources - 14 total (0 active), CPU time: mean = 18.408 us, total = 257.715 us
NodeManagerService.grpc_client.CommitBundleResources - 14 total (0 active), CPU time: mean = 46.366 us, total = 649.121 us
NodeInfoGcsService.grpc_server.GetInternalConfig - 13 total (0 active), CPU time: mean = 17.101 us, total = 222.309 us
CoreWorkerService.grpc_client.KillActor - 12 total (0 active), CPU time: mean = 71.586 us, total = 859.035 us
NodeManagerService.grpc_client.CancelResourceReserve - 11 total (0 active), CPU time: mean = 12.699 us, total = 139.690 us
ActorInfoGcsService.grpc_server.KillActorViaGcs - 10 total (0 active), CPU time: mean = 474.954 us, total = 4.750 ms
PlacementGroupInfoGcsService.grpc_server.CreatePlacementGroup - 7 total (0 active), CPU time: mean = 26.808 us, total = 187.654 us
NodeInfoGcsService.grpc_server.RegisterNode - 7 total (0 active), CPU time: mean = 75.305 us, total = 527.136 us
PlacementGroupInfoGcsService.grpc_server.RemovePlacementGroup - 6 total (0 active), CPU time: mean = 153.198 us, total = 919.188 us
PeriodicalRunner.RunFnPeriodically - 4 total (0 active), CPU time: mean = 55.205 us, total = 220.820 us
JobInfoGcsService.grpc_server.GetNextJobID - 3 total (0 active), CPU time: mean = 33.248 us, total = 99.743 us
GcsInMemoryStore.Exists - 3 total (0 active), CPU time: mean = 38.510 us, total = 115.530 us
JobInfoGcsService.grpc_server.AddJob - 3 total (0 active), CPU time: mean = 62.363 us, total = 187.089 us
InternalKVGcsService.grpc_server.InternalKVExists - 3 total (0 active), CPU time: mean = 28.281 us, total = 84.843 us
GcsInMemoryStore.Delete - 2 total (0 active), CPU time: mean = 472.273 us, total = 944.546 us
InternalKVGcsService.grpc_server.InternalKVDel - 1 total (0 active), CPU time: mean = 18.735 us, total = 18.735 us
ActorInfoGcsService.grpc_server.GetNamedActorInfo - 1 total (0 active), CPU time: mean = 43.538 us, total = 43.538 us
JobInfoGcsService.grpc_server.MarkJobFinished - 1 total (0 active), CPU time: mean = 7.775 us, total = 7.775 us
ActorInfoGcsService.grpc_server.GetAllActorInfo - 1 total (0 active), CPU time: mean = 61.957 us, total = 61.957 us
NodeResourceInfoGcsService.grpc_server.GetAllAvailableResources - 1 total (0 active), CPU time: mean = 66.423 us, total = 66.423 us
GcsServer.NodeDeathCallback - 1 total (0 active), CPU time: mean = 465.493 us, total = 465.493 us
PlacementGroupInfoGcsService.grpc_server.GetAllPlacementGroup - 1 total (0 active), CPU time: mean = 10.850 us, total = 10.850 us
JobInfoGcsService.grpc_server.ReportJobError - 1 total (0 active), CPU time: mean = 129.891 us, total = 129.891 us
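The event stats above are long, so a quick way to see which handlers dominate is to rank them by total CPU time. Below is a minimal sketch that assumes each entry has the `name - N total (...), CPU time: mean = ..., total = ...` shape shown in this log; `parse_event_stats` and `UNIT_TO_SECONDS` are hypothetical helper names, not part of Ray.

```python
import re

# Multipliers to convert the units that appear in the log to seconds.
UNIT_TO_SECONDS = {"us": 1e-6, "ms": 1e-3, "s": 1.0}

# Matches per-handler lines such as:
#   Foo.bar - 12 total (0 active), CPU time: mean = 1.0 us, total = 12.0 us
# Summary lines ("Global stats: ...", "Queueing time: ...") do not match.
LINE_RE = re.compile(
    r"^(?P<name>\S+) - (?P<count>\d+) total .*?"
    r"total = (?P<value>[\d.]+) (?P<unit>us|ms|s)\s*$"
)


def parse_event_stats(lines):
    """Return (name, call_count, total_cpu_seconds) tuples, largest total first."""
    rows = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip headers and anything that is not a handler entry
        seconds = float(match["value"]) * UNIT_TO_SECONDS[match["unit"]]
        rows.append((match["name"], int(match["count"]), seconds))
    return sorted(rows, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    import sys

    # Usage (assumed): python rank_gcs_stats.py < gcs_server.out
    for name, count, seconds in parse_event_stats(sys.stdin)[:10]:
        print(f"{seconds:12.3f} s  {count:>10}  {name}")
```

Ranking the entries this way only shows where the GCS spent CPU time; it does not by itself explain why the node died.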
The Ray Tune log shows that the job is still running, so the failed trial appears to have been retried:
(TunerInternal pid=1296) Current time: 2023-02-06 19:08:50 (running for 2 days, 18:57:19.65)
(TunerInternal pid=1296) Memory usage on this node: 95.3/251.4 GiB
(TunerInternal pid=1296) Using FIFO scheduling algorithm.
(TunerInternal pid=1296) Resources requested: 2.0/13 CPUs, 4.0/10 GPUs, 0.0/328.0 GiB heap, 0.0/98.03 GiB objects (0.0/5.0 accelerator_type:V100)
(TunerInternal pid=1296) Current best trial: 799c8_00001 with val_accuracy=0.925000011920929 and parameters={'train_loop_config': {'learning_rate': 0.001}}
(TunerInternal pid=1296) Result logdir: /home/jobuser/ray_results/TensorflowTrainer_2023-02-04_00-11-29
(TunerInternal pid=1296) Number of trials: 6/6 (1 RUNNING, 5 TERMINATED)
(TunerInternal pid=1296) +-------------------------------+------------+---------------------+------------------------+--------+------------------+----------+------------+----------+
(TunerInternal pid=1296) | Trial name | status | loc | train_loop_config/le | iter | total time (s) | loss | accuracy | auc |
(TunerInternal pid=1296) | | | | arning_rate | | | | | |
(TunerInternal pid=1296) |-------------------------------+------------+---------------------+------------------------+--------+------------------+----------+------------+----------|
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00000 | RUNNING | 100.97.236.141:302 | 0.1 | | | | | |
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00001 | TERMINATED | 100.97.236.141:408 | 0.001 | 10 | 397.492 | 0.618917 | 0.915625 | 0.435659 |
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00003 | TERMINATED | 100.98.5.123:1018 | 0.001 | 10 | 374.472 | 0.641801 | 0.909375 | 0.532172 |
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00004 | TERMINATED | 100.96.199.114:1432 | 0.1 | 10 | 323.039 | 0.313237 | 0.91875 | 0.436271 |
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00005 | TERMINATED | 100.96.199.114:1673 | 0.001 | 10 | 354.687 | 0.580083 | 0.934375 | 0.529861 |
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00002 | TERMINATED | 100.98.5.123:1241 | 0.1 | 10 | 310.033 | 0.279023 | 0.921875 | 0.575028 |
(TunerInternal pid=1296) +-------------------------------+------------+---------------------+------------------------+--------+------------------+----------+------------+----------+
(TunerInternal pid=1296) Number of errored trials: 1
(TunerInternal pid=1296) +-------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
(TunerInternal pid=1296) | Trial name | # failures | error file |
(TunerInternal pid=1296) |-------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00002 | 1 | /home/jobuser/ray_results/TensorflowTrainer_2023-02-04_00-11-29/TensorflowTrainer_799c8_00002_2_learning_rate=0.1000_2023-02-04_00-14-08/error.txt |
(TunerInternal pid=1296) +-------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
(TunerInternal pid=1296)
What could cause the error "The actor died unexpectedly before finishing this task."?