Creating an Actor on Worker nodes takes a long time

Hey guys,

Creating an actor on a worker node after the cluster has been running for a while takes 5-7 minutes. There is no problem creating actors on the head node.

python-core-worker log:

[2022-02-06 07:02:12,887 I 11033 11033] core_worker.cc:2320: Creating actor: b0f9db98654cd3b2d462f7d630000000
[2022-02-06 07:09:35,118 I 11033 11080] actor_manager.cc:228: received notification on actor, state: ALIVE, actor_id: a6c31f8498b256bc927bad0518000000, ip address: 10.0.1.157, port: 10112, worker_id: 9ad6581ff8f2ea6a612bc17cf34a79605bc592aa642b1f096809188f, raylet_id: 182d8e77a79f9f9bd7b4ca8666021fda57e426e62928917ff90ea5a1, num_restarts: 0, death context type=CONTEXT_NOT_SET

raylet.out log:

[2022-02-06 07:02:08,393 I 567 567] worker_pool.cc:418: Started worker process of 1 worker(s) with pid 11033

gcs_server.out log on head node:

[2022-02-06 07:02:12,226 I 336273 336273] gcs_actor_scheduler.cc:328: Start creating actor b0f9db98654cd3b2d462f7d630000000 on worker 8ce8e8cf274cf9d9a9d5d42176abe7c623a93ac980341014b55dee60 at node 182d8e77a79f9f9bd7b4ca8666021fda57e426e62928917ff90ea5a1, job id = 30000000
[2022-02-06 07:09:39,618 I 336273 336273] gcs_actor_scheduler.cc:367: Succeeded in creating actor b0f9db98654cd3b2d462f7d630000000 on worker 8ce8e8cf274cf9d9a9d5d42176abe7c623a93ac980341014b55dee60 at node 182d8e77a79f9f9bd7b4ca8666021fda57e426e62928917ff90ea5a1, job id = 30000000

Everything works smoothly when I start the cluster, but after some time actor creation gets delayed. The longer the cluster runs, the bigger the delay. Would someone be able to help me debug it?

OK, I think I know where the problem is. I enabled debug logging and checked the logs carefully, and these are the lines that get delayed:

gcs_server.out on head node starts leasing a worker:

[2022-02-07 12:18:23,902 I 22 22] gcs_actor_scheduler.cc:213: Start leasing worker from node a11b49d4b78608d66c5a77426d9dface0dcdf39a531c9531b6067362 for actor cca798cde7e4d97c2538a85901000000, job id = 01000000
[2022-02-07 12:18:28,134 I 22 22] gcs_actor_scheduler.cc:536: Finished leasing worker from a11b49d4b78608d66c5a77426d9dface0dcdf39a531c9531b6067362 for actor cca798cde7e4d97c2538a85901000000, job id = 01000000
[2022-02-07 12:18:28,134 I 22 22] gcs_actor_scheduler.cc:328: Start creating actor cca798cde7e4d97c2538a85901000000 on worker 1b7b70073364c0daf878fa8aa6ca2bed44673c49bbbbe2c941c95e99 at node a11b49d4b78608d66c5a77426d9dface0dcdf39a531c9531b6067362, job id = 01000000
[2022-02-07 12:18:29,712 I 22 22] gcs_actor_scheduler.cc:367: Succeeded in creating actor cca798cde7e4d97c2538a85901000000 on worker 1b7b70073364c0daf878fa8aa6ca2bed44673c49bbbbe2c941c95e99 at node a11b49d4b78608d66c5a77426d9dface0dcdf39a531c9531b6067362, job id = 01000000
[2022-02-07 12:18:29,712 I 22 22] gcs_actor_manager.cc:990: Actor created successfully, actor id = cca798cde7e4d97c2538a85901000000, job id = 01000000
[2022-02-07 12:18:29,713 I 22 22] gcs_actor_manager.cc:226: Finished creating actor, job id = 01000000, actor id = cca798cde7e4d97c2538a85901000000

raylet.out on worker node is popping a worker:

[2022-02-07 12:18:24,186 D 20 20] worker_pool.cc:1086: Pop worker for task ffffffffffffffffcca798cde7e4d97c2538a85901000000 task name {type=PythonFunctionDescriptor, module_name=__main__, class_name=Some, function_name=__init__, function_hash=faac983683bf4ff3aa8db07d16532523}
[2022-02-07 12:18:28,003 D 20 20] cluster_task_manager.cc:256: Dispatching task ffffffffffffffffcca798cde7e4d97c2538a85901000000 to worker 1b7b70073364c0daf878fa8aa6ca2bed44673c49bbbbe2c941c95e99

and this is where it gets delayed: between popping the worker and dispatching the task, and also between dispatching the task and finishing it, even though the task is as simple as creating an actor that does nothing. The Python core worker indeed starts a few seconds later:

[2022-02-07 12:18:27,689 I 109 109] core_worker_process.cc:120: Constructing CoreWorkerProcess. pid: 109
[2022-02-07 12:18:27,691 D 109 118] core_worker_process.cc:196: Getting system config from raylet, remaining retries = 10

and it also shows this delay (taken from another example):

[2022-02-07 17:29:39,887 I 1727 1727] core_worker.cc:2106: Creating actor: bf1a2a3f03b2705a93ee00ea05000000
[2022-02-07 17:29:55,708 D 1727 1727] core_worker.cc:2175: Finished executing task ffffffffffffffffbf1a2a3f03b2705a93ee00ea05000000, status=OK

So it seems like the delay is caused first by the pop-worker step at worker_pool.cc:1086, and then by the time it takes to create the actor / execute the __init__ task.

Would anyone be able to tell me what's going on? This is all on a fresh cluster with a dummy class that doesn't do anything. In my application, where the cluster runs for a few hours, the delay reaches a few minutes :frowning:

I would appreciate any suggestions. Thanks.

I added a third node in the same location as the head node, and actor creation works smoothly between those two; the third node, in a different location, still lags. Could the distance between nodes be an issue? They are on different continents.

Hi Dawid, from your logs I see a 4 s delay between popping the worker and dispatching the task. Do you have detailed logs for the minutes-long delay?

I’ve been running the cluster and apps for about 3 hours now, and the lag is currently about 1.5 minutes. Tomorrow it will be more.

raylet.out on worker node:

[2022-02-07 19:15:44,431 D 20 20] cluster_task_manager.cc:478: Queuing and scheduling task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000
[2022-02-07 19:15:44,431 D 20 20] cluster_task_manager.cc:77: Scheduling pending task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000
[2022-02-07 19:15:44,431 D 20 20] cluster_task_manager.cc:146: No args, task can be dispatched ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000
[2022-02-07 19:15:44,431 D 20 20] cluster_task_manager.cc:568: RayTask ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000 has args of size 0
[2022-02-07 19:15:44,431 D 20 20] worker_pool.cc:1086: Pop worker for task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000 task name {type=PythonFunctionDescriptor, module_name=__main__, class_name=Some, function_name=__init__, function_hash=853acd2ec7984d378dd3957968770881}
[2022-02-07 19:15:48,112 D 20 20] cluster_task_manager.cc:256: Dispatching task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000 to worker 351989f32558c72a09428a88f3a128391d68c36977f1428bb3ece8ee
[2022-02-07 19:17:13,393 D 20 20] node_manager.cc:1896: Finished task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000

python-core-worker on worker node:

[2022-02-07 19:15:48,884 D 2070 2070] core_worker.cc:2051: Executing task, task info = Type=ACTOR_CREATION_TASK, Language=PYTHON, Resources: {CPU: 1, India: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Some, function_name=__init__, function_hash=853acd2ec7984d378dd3957968770881}, task_id=ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000, task_name=Some.__init__(), job_id=0d000000, num_args=0, num_returns=1, depth=1, actor_creation_task_spec={actor_id=44b6c62a1b2ea9f9d17bbe880d000000, max_restarts=0, max_concurrency=1, is_asyncio_actor=0, is_detached=0}
[2022-02-07 19:15:48,884 D 2070 2070] reference_count.cc:238: Add local reference ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d00000001000000
[2022-02-07 19:15:48,884 D 2070 2070] reference_count.cc:239: REF ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d00000001000000 borrowers: 0 local_ref_count: 1 submitted_count: 0 contained_in_owned: 0 contained_in_borrowed: 0 contains: 0 stored_in: 0 lineage_ref_count: 0
[2022-02-07 19:15:48,884 I 2070 2070] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 44b6c62a1b2ea9f9d17bbe880d000000
[2022-02-07 19:15:48,884 D 2070 2070] direct_actor_task_submitter.cc:166: Connecting to actor 44b6c62a1b2ea9f9d17bbe880d000000 at worker 351989f32558c72a09428a88f3a128391d68c36977f1428bb3ece8ee
[2022-02-07 19:15:48,885 D 2070 2070] sequential_actor_submit_queue.cc:86: Resetting caller starts at for actor 44b6c62a1b2ea9f9d17bbe880d000000 from 0 to 0
[2022-02-07 19:15:48,885 I 2070 2070] direct_actor_task_submitter.cc:214: Connecting to actor 44b6c62a1b2ea9f9d17bbe880d000000 at worker 351989f32558c72a09428a88f3a128391d68c36977f1428bb3ece8ee
[2022-02-07 19:15:48,885 D 2070 2070] reference_count.cc:104: Adding borrowed object ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d00000001000000
[2022-02-07 19:15:48,885 I 2070 2070] core_worker.cc:2106: Creating actor: 44b6c62a1b2ea9f9d17bbe880d000000
[2022-02-07 19:17:13,329 D 2070 2070] core_worker.cc:2175: Finished executing task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000, status=OK
[2022-02-07 19:17:13,329 I 2070 2070] direct_actor_transport.cc:139: Actor creation task finished, task_id: ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000, actor_id: 44b6c62a1b2ea9f9d17bbe880d000000

gcs_server.out on head node:

[2022-02-07 19:15:44,302 I 22 22] gcs_actor_manager.cc:197: Registering actor, job id = 0d000000, actor id = 44b6c62a1b2ea9f9d17bbe880d000000
[2022-02-07 19:15:44,302 D 22 22] gcs_actor_manager.cc:244: Getting actor info, job id = 0d000000, actor id = 44b6c62a1b2ea9f9d17bbe880d000000
[2022-02-07 19:15:44,302 D 22 22] gcs_actor_manager.cc:259: Finished getting actor info, job id = 0d000000, actor id = 44b6c62a1b2ea9f9d17bbe880d000000
[2022-02-07 19:15:44,302 I 22 22] gcs_actor_manager.cc:202: Registered actor, job id = 0d000000, actor id = 44b6c62a1b2ea9f9d17bbe880d000000
[2022-02-07 19:15:44,303 I 22 22] gcs_actor_manager.cc:221: Creating actor, job id = 0d000000, actor id = 44b6c62a1b2ea9f9d17bbe880d000000
[2022-02-07 19:15:44,303 I 22 22] gcs_actor_scheduler.cc:213: Start leasing worker from node e4d7f45979199c0c70bca7d2369be1cdf348d3c719d0dd50044673c9 for actor 44b6c62a1b2ea9f9d17bbe880d000000, job id = 0d000000
[2022-02-07 19:15:44,309 I 22 22] gcs_actor_scheduler.cc:536: Finished leasing worker from e4d7f45979199c0c70bca7d2369be1cdf348d3c719d0dd50044673c9 for actor 44b6c62a1b2ea9f9d17bbe880d000000, job id = 0d000000
[2022-02-07 19:15:44,309 I 22 22] gcs_actor_scheduler.cc:213: Start leasing worker from node 45fa84c3558b67b00af8a9890b1018eb74256198b625060f5bfaf6ab for actor 44b6c62a1b2ea9f9d17bbe880d000000, job id = 0d000000
[2022-02-07 19:15:48,244 I 22 22] gcs_actor_scheduler.cc:536: Finished leasing worker from 45fa84c3558b67b00af8a9890b1018eb74256198b625060f5bfaf6ab for actor 44b6c62a1b2ea9f9d17bbe880d000000, job id = 0d000000
[2022-02-07 19:15:48,244 I 22 22] gcs_actor_scheduler.cc:328: Start creating actor 44b6c62a1b2ea9f9d17bbe880d000000 on worker 351989f32558c72a09428a88f3a128391d68c36977f1428bb3ece8ee at node 45fa84c3558b67b00af8a9890b1018eb74256198b625060f5bfaf6ab, job id = 0d000000
[2022-02-07 19:17:13,463 I 22 22] gcs_actor_scheduler.cc:367: Succeeded in creating actor 44b6c62a1b2ea9f9d17bbe880d000000 on worker 351989f32558c72a09428a88f3a128391d68c36977f1428bb3ece8ee at node 45fa84c3558b67b00af8a9890b1018eb74256198b625060f5bfaf6ab, job id = 0d000000
[2022-02-07 19:17:13,463 I 22 22] gcs_actor_manager.cc:990: Actor created successfully, actor id = 44b6c62a1b2ea9f9d17bbe880d000000, job id = 0d000000
[2022-02-07 19:17:13,464 I 22 22] gcs_actor_manager.cc:226: Finished creating actor, job id = 0d000000, actor id = 44b6c62a1b2ea9f9d17bbe880d000000

and the code of the actor is as simple as:

@ray.remote
class Some:
    def __init__(self):
        self.x = 0
    
    def get_x(self):
        return self.x
    
    def set_x(self, x):
        self.x = x
        return self.x

Invocation

some = Some.options(resources={'Tokyo': 1}).remote()
some2 = Some.options(resources={'India': 1}).remote()

The one on the head node starts immediately; the second one, which starts on the worker node, has this lag.
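(To make the difference concrete, this is roughly how the lag can be timed from the driver side — just an illustration reusing the class above, not my actual application code:)

import time

start = time.time()
some2 = Some.options(resources={'India': 1}).remote()
ray.get(some2.get_x.remote())  # blocks until __init__ on the worker node has finished
print(f"actor on worker node ready after {time.time() - start:.1f} s")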

It seems like the lag is caused not by popping the worker but by dispatching the __init__ task. Every actor method works smoothly: I can call actors and their methods on the worker node and get replies immediately; there is only a delay during initialization.

It hits 3.5 min now.

[2022-02-08 09:30:11,663 D 20 20] cluster_task_manager.cc:478: Queuing and scheduling task ffffffffffffffff5d01acb35c352b1e27805ab90d000000
[2022-02-08 09:30:11,663 D 20 20] cluster_task_manager.cc:77: Scheduling pending task ffffffffffffffff5d01acb35c352b1e27805ab90d000000
[2022-02-08 09:30:11,663 D 20 20] cluster_task_manager.cc:146: No args, task can be dispatched ffffffffffffffff5d01acb35c352b1e27805ab90d000000
[2022-02-08 09:30:11,664 D 20 20] cluster_task_manager.cc:568: RayTask ffffffffffffffff5d01acb35c352b1e27805ab90d000000 has args of size 0
[2022-02-08 09:30:11,664 D 20 20] worker_pool.cc:1086: Pop worker for task ffffffffffffffff5d01acb35c352b1e27805ab90d000000 task name {type=PythonFunctionDescriptor, module_name=__main__, class_name=Some, function_name=__init__, function_hash=035caf6558f6471f9989f4c028eccb60}
[2022-02-08 09:30:15,391 D 20 20] cluster_task_manager.cc:256: Dispatching task ffffffffffffffff5d01acb35c352b1e27805ab90d000000 to worker abe58f07c2b31aa0768894cd31521a5f5f8893a7cf9338070e084acc
[2022-02-08 09:34:41,813 D 20 20] node_manager.cc:1896: Finished task ffffffffffffffff5d01acb35c352b1e27805ab90d000000

Thanks for the detailed report @Dawid!! This is really helpful!! What version of Ray are you using, btw?

It seems like the root problem is here:

[2022-02-07 19:15:48,885 I 2070 2070] core_worker.cc:2106: Creating actor: 44b6c62a1b2ea9f9d17bbe880d000000
[2022-02-07 19:17:13,329 D 2070 2070] core_worker.cc:2175: Finished executing task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000, status=OK

What happens between these two log lines is the actor initialization; it basically goes into our Cython code to handle it. My suspicion is that some operation during actor initialization has a significant delay (maybe a call to the GCS server).

Is there a way to reproduce this issue, btw? Otherwise, I think we can try adding some debug logs in this code path, and you could help tell us where the bottleneck is instead.
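For example, a minimal script along these lines (just a sketch based on your snippet — the resource label and the sleep interval are placeholders) run periodically against a long-lived cluster should be enough to show the delay growing:

import time
import ray

ray.init(address="auto")  # connect to the already-running cluster

# Same dummy actor as in your snippet.
@ray.remote
class Some:
    def __init__(self):
        self.x = 0

    def get_x(self):
        return self.x

# Every 10 minutes, create an actor on the worker node and measure how long
# it takes until the first method call returns (i.e. until __init__ finished).
while True:
    start = time.time()
    some = Some.options(resources={"India": 1}).remote()
    ray.get(some.get_x.remote())
    print(f"{time.ctime()}: actor ready after {time.time() - start:.1f} s")
    ray.kill(some)  # clean up so actors don't accumulate between iterations
    time.sleep(600)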


This is basically the method that is getting delayed for 2 minutes here.

As Sang said, we need to add more logs between

[2022-02-07 19:15:48,885 I 2070 2070] core_worker.cc:2106: Creating actor: 44b6c62a1b2ea9f9d17bbe880d000000
[2022-02-07 19:17:13,329 D 2070 2070] core_worker.cc:2175: Finished executing task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000, status=OK

to see where the time is spent. Would it be possible for us to give you a custom wheel with more logs added, and for you to run with it?

Hi @sangcho and @jjyao. Thank you for your time and responses.

The bug is solved in the new 2.0.0 release. Earlier I was using 1.10.0. When I compiled 2.0.0 with additional logs, the delay stopped happening, and the cluster has been running smoothly for over a day now. Both versions were running on Python 3.7 in a Docker container. Do you want me to run the tests on 1.10.0?

The only thing I wasn’t able to figure out is why I’m getting these strange messages. They were occurring on 1.10.0 too. These actors do not call any remote tasks, but they are called a lot; when called they use only their own methods and do not wait for any results. These actors run on the head node and are called from head-node actors.

Warning: More than 5000 tasks are pending submission to actor 89c8f50530c69a01126ca2cf10000000. To reduce memory usage, wait for these tasks to finish before sending more.
Warning: More than 40000 tasks are pending submission to actor 89c8f50530c69a01126ca2cf10000000. To reduce memory usage, wait for these tasks to finish before sending more.
Warning: More than 5000 tasks are pending submission to actor 89c8f50530c69a01126ca2cf10000000. To reduce memory usage, wait for these tasks to finish before sending more.
Warning: More than 20000 tasks are pending submission to actor 89c8f50530c69a01126ca2cf10000000. To reduce memory usage, wait for these tasks to finish before sending more.

This is an event-stats log from the core worker of this actor.

Global stats: 1172898 total (7 active)
Queueing time: mean = 166.327 us, max = 834.979 ms, min = -0.000 s, total = 195.085 s
Execution time: mean = 90.791 us, total = 106.489 s
Event stats:
	CoreWorkerService.grpc_server.PushTask - 881397 total (0 active), CPU time: mean = 34.850 us, total = 30.716 s
	UNKNOWN - 91338 total (4 active, 1 running), CPU time: mean = 32.571 us, total = 2.975 s
	GcsClient.deadline_timer.check_gcs_service_address - 44555 total (1 active), CPU time: mean = 4.877 us, total = 217.292 ms
	NodeManagerService.grpc_client.ReportWorkerBacklog - 44554 total (0 active), CPU time: mean = 5.782 us, total = 257.631 ms
	CoreWorker.deadline_timer.flush_profiling_events - 44554 total (1 active), CPU time: mean = 323.688 us, total = 14.422 s
	StatsGcsService.grpc_client.AddProfileData - 44458 total (0 active), CPU time: mean = 54.349 us, total = 2.416 s
	CoreWorkerService.grpc_server.GetCoreWorkerStats - 20805 total (0 active), CPU time: mean = 58.701 us, total = 1.221 s
	InternalPubSubGcsService.grpc_client.GcsSubscriberPoll - 1154 total (1 active), CPU time: mean = 73.683 us, total = 85.030 ms
	CoreWorkerService.grpc_server.LocalGC - 74 total (0 active), CPU time: mean = 732.136 ms, total = 54.178 s
	PeriodicalRunner.RunFnPeriodically - 6 total (0 active), CPU time: mean = 28.214 us, total = 169.285 us
	InternalPubSubGcsService.grpc_client.GcsSubscriberCommandBatch - 1 total (0 active), CPU time: mean = 108.990 us, total = 108.990 us
	NodeInfoGcsService.grpc_client.GetAllNodeInfo - 1 total (0 active), CPU time: mean = 34.077 us, total = 34.077 us
	WorkerInfoGcsService.grpc_client.AddWorkerInfo - 1 total (0 active), CPU time: mean = 5.242 us, total = 5.242 us

Is there a way to check what these tasks are and why they are waiting for submission?

Hmm, this is pretty interesting actually. I’ve heard about this issue from other users as well.

I think if you turn on debug logs, you can see this log:

  RAY_LOG(DEBUG) << "Pushing task " << task_id << " to actor " << actor_id
                 << " actor counter " << actor_counter << " seq no "
                 << request->sequence_number() << " num queued " << num_queued;

Are you seeing a lot of this?

Also, is it possible to share your script?
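In the meantime: that warning just means the caller is submitting calls to the actor faster than the actor executes them, so they queue up on the caller side. A common way to bound this (a rough sketch, not specific to your application) is to cap the number of in-flight calls with ray.wait before submitting more:

import ray

ray.init(address="auto")

@ray.remote
class Some:
    def __init__(self):
        self.x = 0

    def set_x(self, x):
        self.x = x
        return self.x

actor = Some.remote()
MAX_IN_FLIGHT = 1000  # illustrative limit, tune for your workload

refs = []
for i in range(100_000):
    if len(refs) >= MAX_IN_FLIGHT:
        # Block until at least one earlier call finishes before submitting more.
        _, refs = ray.wait(refs, num_returns=1)
    refs.append(actor.set_x.remote(i))

ray.get(refs)  # drain the remaining in-flight calls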