[tune][ray 2.1] The actor died unexpectedly before finishing this task

We are running a Ray Tune job with Ray 2.1.
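For reference, the setup looks roughly like the following (a simplified sketch; the training loop, resource numbers, and sample counts here are placeholders, not our exact config):

from ray import tune
from ray.air.config import ScalingConfig
from ray.train.tensorflow import TensorflowTrainer
from ray.tune.tuner import Tuner

def train_loop_per_worker(config):
    # Build and fit the Keras model here; config["learning_rate"] comes
    # from the search space below.
    ...

trainer = TensorflowTrainer(
    train_loop_per_worker=train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)

tuner = Tuner(
    trainer,
    param_space={
        "train_loop_config": {"learning_rate": tune.grid_search([0.1, 0.001])}
    },
    tune_config=tune.TuneConfig(num_samples=3),
)
results = tuner.fit()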

We see this error:

ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
	class_name: RayTrainWorker
	actor_id: 689396639442604bd3ecda2203000000
	namespace: 85f118ec-0996-4649-a5af-650b677be9b2
	ip: 100.97.243.215
The actor is dead because its node has died. Node Id: 9e902b5752f374fda97028c09b7654d8cce33d1f56825df84c657d23
The actor never ran - it was cancelled before it started running.

GCS log:

[2023-02-06 19:02:02,770 I 40 40] (gcs_server) gcs_server.cc:186: GcsNodeManager:
- RegisterNode request count: 7
- DrainNode request count: 0
- GetAllNodeInfo request count: 120346
- GetInternalConfig request count: 13

GcsActorManager:
- RegisterActor request count: 23
- CreateActor request count: 23
- GetActorInfo request count: 23
- GetNamedActorInfo request count: 1
- GetAllActorInfo request count: 1
- KillActor request count: 10
- ListNamedActors request count: 0
- Registered actors count: 5
- Destroyed actors count: 18
- Named actors count: 1
- Unresolved actors count: 0
- Pending actors count: 0
- Created actors count: 3
- owners_: 2
- actor_to_register_callbacks_: 0
- actor_to_create_callbacks_: 0
- sorted_destroyed_actor_list_: 18

GcsResourceManager:
- GetResources request count: 42
- GetAllAvailableResources request count: 1
- ReportResourceUsage request count: 0
- GetAllResourceUsage request count: 48067

GcsPlacementGroupManager:
- CreatePlacementGroup request count: 7
- RemovePlacementGroup request count: 6
- GetPlacementGroup request count: 28
- GetAllPlacementGroup request count: 1
- WaitPlacementGroupUntilReady request count: 0
- GetNamedPlacementGroup request count: 0
- Scheduling pending placement group count: 55986
- Registered placement groups count: 1
- Named placement group count: 1
- Pending placement groups count: 0
- Infeasible placement groups count: 0

GcsPublisher {}

[runtime env manager] ID to URIs table:
[runtime env manager] URIs reference table:

GrpcBasedResourceBroadcaster:
- Tracked nodes: 6
[2023-02-06 19:02:02,770 I 40 40] (gcs_server) gcs_server.cc:736: Event stats:


Global stats: 51990358 total (8 active)
Queueing time: mean = 14.341 us, max = 224.680 ms, min = -0.001 s, total = 745.585 s
Execution time:  mean = 19.440 us, total = 1010.715 s
Event stats:
	NodeManagerService.grpc_client.UpdateResourceUsage - 14431916 total (0 active), CPU time: mean = 5.685 us, total = 82.043 s
	NodeManagerService.grpc_client.RequestResourceReport - 14194457 total (0 active), CPU time: mean = 27.345 us, total = 388.152 s
	ResourceUpdate - 14194172 total (0 active), CPU time: mean = 16.054 us, total = 227.874 s
	RaySyncer.deadline_timer.report_resource_report - 2406898 total (1 active), CPU time: mean = 16.135 us, total = 38.835 s
	NodeManagerService.grpc_client.GetResourceLoad - 1443994 total (0 active), CPU time: mean = 5.802 us, total = 8.378 s
	GcsInMemoryStore.Get - 1240217 total (0 active), CPU time: mean = 57.093 us, total = 70.807 s
	InternalKVGcsService.grpc_server.InternalKVGet - 1240203 total (0 active), CPU time: mean = 23.666 us, total = 29.351 s
	GcsInMemoryStore.Put - 1057590 total (0 active), CPU time: mean = 37.481 us, total = 39.639 s
	InternalKVGcsService.grpc_server.InternalKVPut - 576996 total (0 active), CPU time: mean = 25.472 us, total = 14.697 s
	StatsGcsService.grpc_server.AddProfileData - 480384 total (0 active), CPU time: mean = 34.541 us, total = 16.593 s
	RayletLoadPulled - 240719 total (1 active), CPU time: mean = 238.224 us, total = 57.345 s
	NodeResourceInfoGcsService.grpc_server.GetGcsSchedulingStats - 230908 total (0 active), CPU time: mean = 44.740 us, total = 10.331 s
	NodeInfoGcsService.grpc_server.GetAllNodeInfo - 120346 total (0 active), CPU time: mean = 73.181 us, total = 8.807 s
	GcsPlacementGroupManager.SchedulePendingPlacementGroups - 54420 total (0 active), CPU time: mean = 2.071 us, total = 112.716 ms
	NodeResourceInfoGcsService.grpc_server.GetAllResourceUsage - 48067 total (0 active), CPU time: mean = 137.989 us, total = 6.633 s
	GCSServer.deadline_timer.debug_state_dump - 24078 total (1 active), CPU time: mean = 401.596 us, total = 9.670 s
	GCSServer.deadline_timer.debug_state_event_stats_print - 4013 total (1 active, 1 running), CPU time: mean = 342.855 us, total = 1.376 s
	NodeManagerService.grpc_client.RequestWorkerLease - 164 total (0 active), CPU time: mean = 43.989 us, total = 7.214 ms
	GcsInMemoryStore.Keys - 139 total (0 active), CPU time: mean = 20.924 us, total = 2.908 ms
	InternalKVGcsService.grpc_server.InternalKVKeys - 135 total (0 active), CPU time: mean = 13.228 us, total = 1.786 ms
	GcsInMemoryStore.GetAll - 83 total (0 active), CPU time: mean = 164.924 us, total = 13.689 ms
	JobInfoGcsService.grpc_server.GetAllJobInfo - 75 total (0 active), CPU time: mean = 46.367 us, total = 3.478 ms
	NodeResourceInfoGcsService.grpc_server.GetResources - 42 total (0 active), CPU time: mean = 31.875 us, total = 1.339 ms
	WorkerInfoGcsService.grpc_server.AddWorkerInfo - 36 total (0 active), CPU time: mean = 25.788 us, total = 928.368 us
	PlacementGroupInfoGcsService.grpc_server.GetPlacementGroup - 28 total (0 active), CPU time: mean = 56.384 us, total = 1.579 ms
	WorkerInfoGcsService.grpc_server.ReportWorkerFailure - 26 total (0 active), CPU time: mean = 89.464 us, total = 2.326 ms
	ActorInfoGcsService.grpc_server.GetActorInfo - 23 total (0 active), CPU time: mean = 29.231 us, total = 672.320 us
	ActorInfoGcsService.grpc_server.CreateActor - 23 total (0 active), CPU time: mean = 195.973 us, total = 4.507 ms
	ActorInfoGcsService.grpc_server.RegisterActor - 23 total (0 active), CPU time: mean = 330.017 us, total = 7.590 ms
	CoreWorkerService.grpc_client.WaitForActorOutOfScope - 22 total (4 active), CPU time: mean = 218.624 us, total = 4.810 ms
	CoreWorkerService.grpc_client.PushTask - 22 total (0 active), CPU time: mean = 297.975 us, total = 6.555 ms
	GcsInMemoryStore.BatchDelete - 19 total (0 active), CPU time: mean = 2.571 us, total = 48.858 us
	NodeManagerService.grpc_client.PrepareBundleResources - 14 total (0 active), CPU time: mean = 18.408 us, total = 257.715 us
	NodeManagerService.grpc_client.CommitBundleResources - 14 total (0 active), CPU time: mean = 46.366 us, total = 649.121 us
	NodeInfoGcsService.grpc_server.GetInternalConfig - 13 total (0 active), CPU time: mean = 17.101 us, total = 222.309 us
	CoreWorkerService.grpc_client.KillActor - 12 total (0 active), CPU time: mean = 71.586 us, total = 859.035 us
	NodeManagerService.grpc_client.CancelResourceReserve - 11 total (0 active), CPU time: mean = 12.699 us, total = 139.690 us
	ActorInfoGcsService.grpc_server.KillActorViaGcs - 10 total (0 active), CPU time: mean = 474.954 us, total = 4.750 ms
	PlacementGroupInfoGcsService.grpc_server.CreatePlacementGroup - 7 total (0 active), CPU time: mean = 26.808 us, total = 187.654 us
	NodeInfoGcsService.grpc_server.RegisterNode - 7 total (0 active), CPU time: mean = 75.305 us, total = 527.136 us
	PlacementGroupInfoGcsService.grpc_server.RemovePlacementGroup - 6 total (0 active), CPU time: mean = 153.198 us, total = 919.188 us
	PeriodicalRunner.RunFnPeriodically - 4 total (0 active), CPU time: mean = 55.205 us, total = 220.820 us
	JobInfoGcsService.grpc_server.GetNextJobID - 3 total (0 active), CPU time: mean = 33.248 us, total = 99.743 us
	GcsInMemoryStore.Exists - 3 total (0 active), CPU time: mean = 38.510 us, total = 115.530 us
	JobInfoGcsService.grpc_server.AddJob - 3 total (0 active), CPU time: mean = 62.363 us, total = 187.089 us
	InternalKVGcsService.grpc_server.InternalKVExists - 3 total (0 active), CPU time: mean = 28.281 us, total = 84.843 us
	GcsInMemoryStore.Delete - 2 total (0 active), CPU time: mean = 472.273 us, total = 944.546 us
	InternalKVGcsService.grpc_server.InternalKVDel - 1 total (0 active), CPU time: mean = 18.735 us, total = 18.735 us
	ActorInfoGcsService.grpc_server.GetNamedActorInfo - 1 total (0 active), CPU time: mean = 43.538 us, total = 43.538 us
	JobInfoGcsService.grpc_server.MarkJobFinished - 1 total (0 active), CPU time: mean = 7.775 us, total = 7.775 us
	ActorInfoGcsService.grpc_server.GetAllActorInfo - 1 total (0 active), CPU time: mean = 61.957 us, total = 61.957 us
	NodeResourceInfoGcsService.grpc_server.GetAllAvailableResources - 1 total (0 active), CPU time: mean = 66.423 us, total = 66.423 us
	GcsServer.NodeDeathCallback - 1 total (0 active), CPU time: mean = 465.493 us, total = 465.493 us
	PlacementGroupInfoGcsService.grpc_server.GetAllPlacementGroup - 1 total (0 active), CPU time: mean = 10.850 us, total = 10.850 us
	JobInfoGcsService.grpc_server.ReportJobError - 1 total (0 active), CPU time: mean = 129.891 us, total = 129.891 us

The Ray Tune log shows the job is still running, so the failed trial is retried:

(TunerInternal pid=1296) Current time: 2023-02-06 19:08:50 (running for 2 days, 18:57:19.65)
(TunerInternal pid=1296) Memory usage on this node: 95.3/251.4 GiB
(TunerInternal pid=1296) Using FIFO scheduling algorithm.
(TunerInternal pid=1296) Resources requested: 2.0/13 CPUs, 4.0/10 GPUs, 0.0/328.0 GiB heap, 0.0/98.03 GiB objects (0.0/5.0 accelerator_type:V100)
(TunerInternal pid=1296) Current best trial: 799c8_00001 with val_accuracy=0.925000011920929 and parameters={'train_loop_config': {'learning_rate': 0.001}}
(TunerInternal pid=1296) Result logdir: /home/jobuser/ray_results/TensorflowTrainer_2023-02-04_00-11-29
(TunerInternal pid=1296) Number of trials: 6/6 (1 RUNNING, 5 TERMINATED)
(TunerInternal pid=1296) +-------------------------------+------------+---------------------+------------------------+--------+------------------+----------+------------+----------+
(TunerInternal pid=1296) | Trial name                    | status     | loc                 |   train_loop_config/le |   iter |   total time (s) |     loss |   accuracy |      auc |
(TunerInternal pid=1296) |                               |            |                     |            arning_rate |        |                  |          |            |          |
(TunerInternal pid=1296) |-------------------------------+------------+---------------------+------------------------+--------+------------------+----------+------------+----------|
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00000 | RUNNING    | 100.97.236.141:302  |                  0.1   |        |                  |          |            |          |
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00001 | TERMINATED | 100.97.236.141:408  |                  0.001 |     10 |          397.492 | 0.618917 |   0.915625 | 0.435659 |
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00003 | TERMINATED | 100.98.5.123:1018   |                  0.001 |     10 |          374.472 | 0.641801 |   0.909375 | 0.532172 |
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00004 | TERMINATED | 100.96.199.114:1432 |                  0.1   |     10 |          323.039 | 0.313237 |   0.91875  | 0.436271 |
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00005 | TERMINATED | 100.96.199.114:1673 |                  0.001 |     10 |          354.687 | 0.580083 |   0.934375 | 0.529861 |
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00002 | TERMINATED | 100.98.5.123:1241   |                  0.1   |     10 |          310.033 | 0.279023 |   0.921875 | 0.575028 |
(TunerInternal pid=1296) +-------------------------------+------------+---------------------+------------------------+--------+------------------+----------+------------+----------+
(TunerInternal pid=1296) Number of errored trials: 1
(TunerInternal pid=1296) +-------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
(TunerInternal pid=1296) | Trial name                    |   # failures | error file                                                                                                                                         |
(TunerInternal pid=1296) |-------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------|
(TunerInternal pid=1296) | TensorflowTrainer_799c8_00002 |            1 | /home/jobuser/ray_results/TensorflowTrainer_2023-02-04_00-11-29/TensorflowTrainer_799c8_00002_2_learning_rate=0.1000_2023-02-04_00-14-08/error.txt |
(TunerInternal pid=1296) +-------------------------------+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
(TunerInternal pid=1296)

What are possible causes of this error, "The actor died unexpectedly before finishing this task."?

Are you using spot instances?

No. It runs in our own k8s cluster.

Is it always happening at the start of one (and only one) of the trials? And do the other trials finish fine?
When this happens, can you also check the status of all the nodes in your cluster? Is one of the nodes actually gone? (See the snippet below for a quick way to check.)

Do you always see “The actor never ran - it was cancelled before it started running”?
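To check node status you can run `ray status` on the head node, or something like this from a driver connected to the cluster (ray.nodes() returns one entry per node that has ever joined, with an Alive flag):

import ray

ray.init(address="auto")  # connect to the running cluster
for node in ray.nodes():
    # Each entry includes the node id, IP, and whether the node is still alive.
    status = "alive" if node["Alive"] else "DEAD"
    print(node["NodeID"], node["NodeManagerAddress"], status)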

It happens occasionally, and after that the Tuner retries the trial.

Is it always happening at the start of one (and only one) of the trials? And do other trials finish fine? → No

When this happens, can you also check the status of all the nodes in your cluster? Is one of the nodes actually gone? → I will check that the next time it happens.

Do you always see “The actor never ran - it was cancelled before it started running”? → It happens occasionally.

OK, I see. I suspect that occasionally the node that one of the training workers is on goes away, resulting in the Tuner retrying the corresponding trial.
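If that turns out to be the case, besides investigating why the k8s node disappears, you can make the retry behavior explicit with FailureConfig in the RunConfig passed to the Tuner. A minimal sketch (the max_failures value here is just an example):

from ray.air.config import FailureConfig, RunConfig
from ray.tune.tuner import Tuner

tuner = Tuner(
    trainer,  # the TensorflowTrainer defined as before
    param_space=param_space,
    run_config=RunConfig(
        # Retry a failed trial up to 3 times (e.g. after a node loss);
        # -1 retries indefinitely, 0 disables retries.
        failure_config=FailureConfig(max_failures=3),
    ),
)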
