I've been running the cluster and apps for about 3 hours, and right now the lag is about 1.5 minutes. Tomorrow it will be more.
raylet.out on worker node:
69935[2022-02-07 19:15:44,431 D 20 20] cluster_task_manager.cc:478: Queuing and scheduling task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000
69936[2022-02-07 19:15:44,431 D 20 20] cluster_task_manager.cc:77: Scheduling pending task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000
69948[2022-02-07 19:15:44,431 D 20 20] cluster_task_manager.cc:146: No args, task can be dispatched ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000
69949[2022-02-07 19:15:44,431 D 20 20] cluster_task_manager.cc:568: RayTask ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000 has args of size 0
69951[2022-02-07 19:15:44,431 D 20 20] worker_pool.cc:1086: Pop worker for task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000 task name {type=PythonFunctionDescriptor, module_name=__main__, class_name=Some, function_name=__init__, function_hash=853acd2ec7984d378dd3957968770881}
70153[2022-02-07 19:15:48,112 D 20 20] cluster_task_manager.cc:256: Dispatching task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000 to worker 351989f32558c72a09428a88f3a128391d68c36977f1428bb3ece8ee
71038[2022-02-07 19:17:13,393 D 20 20] node_manager.cc:1896: Finished task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000
python-core-worker on worker node:
16[2022-02-07 19:15:48,884 D 2070 2070] core_worker.cc:2051: Executing task, task info = Type=ACTOR_CREATION_TASK, Language=PYTHON, Resources: {CPU: 1, India: 1, }, function_descriptor={type=PythonFunctionDescriptor, module_name=__main__, class_name=Some, function_name=__init__, function_hash=853acd2ec7984d378dd3957968770881}, task_id=ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000, task_name=Some.__init__(), job_id=0d000000, num_args=0, num_returns=1, depth=1, actor_creation_task_spec={actor_id=44b6c62a1b2ea9f9d17bbe880d000000, max_restarts=0, max_concurrency=1, is_asyncio_actor=0, is_detached=0}
17[2022-02-07 19:15:48,884 D 2070 2070] reference_count.cc:238: Add local reference ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d00000001000000
18[2022-02-07 19:15:48,884 D 2070 2070] reference_count.cc:239: REF ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d00000001000000 borrowers: 0 local_ref_count: 1 submitted_count: 0 contained_in_owned: 0 contained_in_borrowed: 0 contains: 0 stored_in: 0 lineage_ref_count: 0
19[2022-02-07 19:15:48,884 I 2070 2070] direct_actor_task_submitter.cc:33: Set max pending calls to -1 for actor 44b6c62a1b2ea9f9d17bbe880d000000
20[2022-02-07 19:15:48,884 D 2070 2070] direct_actor_task_submitter.cc:166: Connecting to actor 44b6c62a1b2ea9f9d17bbe880d000000 at worker 351989f32558c72a09428a88f3a128391d68c36977f1428bb3ece8ee
22[2022-02-07 19:15:48,885 D 2070 2070] sequential_actor_submit_queue.cc:86: Resetting caller starts at for actor 44b6c62a1b2ea9f9d17bbe880d000000 from 0 to 0
23[2022-02-07 19:15:48,885 I 2070 2070] direct_actor_task_submitter.cc:214: Connecting to actor 44b6c62a1b2ea9f9d17bbe880d000000 at worker 351989f32558c72a09428a88f3a128391d68c36977f1428bb3ece8ee
24[2022-02-07 19:15:48,885 D 2070 2070] reference_count.cc:104: Adding borrowed object ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d00000001000000
25[2022-02-07 19:15:48,885 I 2070 2070] core_worker.cc:2106: Creating actor: 44b6c62a1b2ea9f9d17bbe880d000000
29[2022-02-07 19:17:13,329 D 2070 2070] core_worker.cc:2175: Finished executing task ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000, status=OK
30[2022-02-07 19:17:13,329 I 2070 2070] direct_actor_transport.cc:139: Actor creation task finished, task_id: ffffffffffffffff44b6c62a1b2ea9f9d17bbe880d000000, actor_id: 44b6c62a1b2ea9f9d17bbe880d000000
gcs_server.out on head node:
865750[2022-02-07 19:15:44,302 I 22 22] gcs_actor_manager.cc:197: Registering actor, job id = 0d000000, actor id = 44b6c62a1b2ea9f9d17bbe880d000000
865751[2022-02-07 19:15:44,302 D 22 22] gcs_actor_manager.cc:244: Getting actor info, job id = 0d000000, actor id = 44b6c62a1b2ea9f9d17bbe880d000000
865752[2022-02-07 19:15:44,302 D 22 22] gcs_actor_manager.cc:259: Finished getting actor info, job id = 0d000000, actor id = 44b6c62a1b2ea9f9d17bbe880d000000
865754[2022-02-07 19:15:44,302 I 22 22] gcs_actor_manager.cc:202: Registered actor, job id = 0d000000, actor id = 44b6c62a1b2ea9f9d17bbe880d000000
865757[2022-02-07 19:15:44,303 I 22 22] gcs_actor_manager.cc:221: Creating actor, job id = 0d000000, actor id = 44b6c62a1b2ea9f9d17bbe880d000000
865758[2022-02-07 19:15:44,303 I 22 22] gcs_actor_scheduler.cc:213: Start leasing worker from node e4d7f45979199c0c70bca7d2369be1cdf348d3c719d0dd50044673c9 for actor 44b6c62a1b2ea9f9d17bbe880d000000, job id = 0d000000
865759[2022-02-07 19:15:44,309 I 22 22] gcs_actor_scheduler.cc:536: Finished leasing worker from e4d7f45979199c0c70bca7d2369be1cdf348d3c719d0dd50044673c9 for actor 44b6c62a1b2ea9f9d17bbe880d000000, job id = 0d000000
865760[2022-02-07 19:15:44,309 I 22 22] gcs_actor_scheduler.cc:213: Start leasing worker from node 45fa84c3558b67b00af8a9890b1018eb74256198b625060f5bfaf6ab for actor 44b6c62a1b2ea9f9d17bbe880d000000, job id = 0d000000
866401[2022-02-07 19:15:48,244 I 22 22] gcs_actor_scheduler.cc:536: Finished leasing worker from 45fa84c3558b67b00af8a9890b1018eb74256198b625060f5bfaf6ab for actor 44b6c62a1b2ea9f9d17bbe880d000000, job id = 0d000000
866403[2022-02-07 19:15:48,244 I 22 22] gcs_actor_scheduler.cc:328: Start creating actor 44b6c62a1b2ea9f9d17bbe880d000000 on worker 351989f32558c72a09428a88f3a128391d68c36977f1428bb3ece8ee at node 45fa84c3558b67b00af8a9890b1018eb74256198b625060f5bfaf6ab, job id = 0d000000
879425[2022-02-07 19:17:13,463 I 22 22] gcs_actor_scheduler.cc:367: Succeeded in creating actor 44b6c62a1b2ea9f9d17bbe880d000000 on worker 351989f32558c72a09428a88f3a128391d68c36977f1428bb3ece8ee at node 45fa84c3558b67b00af8a9890b1018eb74256198b625060f5bfaf6ab, job id = 0d000000
879426[2022-02-07 19:17:13,463 I 22 22] gcs_actor_manager.cc:990: Actor created successfully, actor id = 44b6c62a1b2ea9f9d17bbe880d000000, job id = 0d000000
879427[2022-02-07 19:17:13,464 I 22 22] gcs_actor_manager.cc:226: Finished creating actor, job id = 0d000000, actor id = 44b6c62a1b2ea9f9d17bbe880d000000
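To see where the time goes, the timestamps from the worker-node logs above can be lined up and subtracted. This is only a rough sketch: the event labels are my own shorthand, the timestamps are copied by hand from the raylet.out and python-core-worker excerpts, and the helper names (`parse_ts`, `stage_gaps`) are made up for illustration.

```python
from datetime import datetime

# Timestamps copied by hand from the worker-node log excerpts above.
# The labels are informal shorthand, not Ray terminology.
EVENTS = [
    ("raylet: queuing and scheduling task", "2022-02-07 19:15:44,431"),
    ("raylet: dispatching task to worker", "2022-02-07 19:15:48,112"),
    ("core worker: executing task", "2022-02-07 19:15:48,884"),
    ("core worker: creating actor", "2022-02-07 19:15:48,885"),
    ("core worker: finished executing task", "2022-02-07 19:17:13,329"),
    ("raylet: finished task", "2022-02-07 19:17:13,393"),
]


def parse_ts(text):
    """Parse a timestamp in the 'YYYY-MM-DD HH:MM:SS,mmm' form used in the logs."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def stage_gaps(events):
    """Return (label, seconds) for each pair of consecutive events."""
    gaps = []
    for (prev_name, prev_ts), (name, ts) in zip(events, events[1:]):
        delta = (parse_ts(ts) - parse_ts(prev_ts)).total_seconds()
        gaps.append((f"{prev_name} -> {name}", delta))
    return gaps


GAPS = stage_gaps(EVENTS)
for label, seconds in GAPS:
    print(f"{seconds:8.3f} s  {label}")
```

With these timestamps, about 84 of the roughly 89 seconds fall between "Creating actor" and "Finished executing task" in the core worker log, while getting a worker from the pool takes under 4 seconds. So the lag appears to be inside the worker process while it runs the actor creation task, not in scheduling or leasing.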
The code of the Actor is as simple as:
import ray

@ray.remote
class Some:
    def __init__(self):
        self.x = 0

    def get_x(self):
        return self.x

    def set_x(self, x):
        self.x = x
        return self.x
Invocation:
some = Some.options(resources={'Tokyo': 1}).remote()
some2 = Some.options(resources={'India': 1}).remote()
The first one, on the head node, starts immediately; the second one, which starts on the worker node, has this lag.