Those log entries indicate that the current pool size is 0 and that one new worker process is being started, so I don’t think there is any inconsistent state between those two log statements.
That makes sense.
I believe that prestarting num_cpus workers at node startup time only happens if you start Ray on a single node (your local Mac).
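For reference, here is a minimal sketch (assuming the standard Ray CLI and Python API) of the two startup modes I mean:

import ray

# Single node (e.g. your local Mac): ray.init() starts a fresh local Ray
# instance, and this is the case where I'd expect num_cpus workers to be
# prestarted at startup.
ray.init()

# Multi-node cluster: the nodes are started beforehand with
#   ray start --head                      (on the head node)
#   ray start --address=<head-ip>:6379    (on the second node)
# and the driver then only connects to the already-running cluster:
# ray.init(address="auto")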
If I start Ray with ray start on a cluster (two nodes), do you mean no worker processes are pre-started?
I assume that you’re not giving a runtime environment in ray.init()?
No
Also, are you seeing those log entries at node startup time (time of ray.init()), or when submitting a task?
At node startup time, i.e., ray.init()
When submitting a task, the raylet.out has the following logs:
[2022-01-10 11:56:30,493 D 14563 5040023] cluster_task_manager.cc:478: Queuing and scheduling task d92fce596cade5cd17b0cea61222523a1df0bddd01000000
[2022-01-10 11:56:30,493 D 14563 5040023] cluster_task_manager.cc:77: Scheduling pending task d92fce596cade5cd17b0cea61222523a1df0bddd01000000
[2022-01-10 11:56:30,493 D 14563 5040023] scheduling_policy.cc:113: Node 734144275483469687 is available for request {(10000) (0) (0) (0) } []
with critical resource utilization 0 based on local view {
CPU: (120000:120000)
MEM: (107432263680000:107432263680000)
GPU: (0:0)
OBJECT_STORE_MEM: (53716131840000:53716131840000)
-1:(10000:10000)
}
[2022-01-10 11:56:30,493 D 14563 5040023] cluster_resource_scheduler.cc:262: Scheduling decision. forcing spillback: 0. Best node: 734144275483469687 0078b396f8673cd20ba4618b5619014b663ff36296adcf957b05bce5, is infeasible: 0
[2022-01-10 11:56:30,493 D 14563 5040023] cluster_task_manager.cc:146: No args, task can be dispatched d92fce596cade5cd17b0cea61222523a1df0bddd01000000
[2022-01-10 11:56:30,493 D 14563 5040023] cluster_task_manager.cc:568: RayTask d92fce596cade5cd17b0cea61222523a1df0bddd01000000 has args of size 0
[2022-01-10 11:56:30,493 D 14563 5040023] cluster_task_manager.cc:570: Size of pinned task args is now 0
[2022-01-10 11:56:30,493 D 14563 5040023] worker_pool.cc:1086: Pop worker for task d92fce596cade5cd17b0cea61222523a1df0bddd01000000 task name {type=PythonFunctionDescriptor, module_name=__main__, class_name=, function_name=fun, function_hash=15b1bb967e104b9ab04cc38f4714aba6}
[2022-01-10 11:56:30,493 D 14563 5040023] cluster_task_manager.cc:256: Dispatching task d92fce596cade5cd17b0cea61222523a1df0bddd01000000 to worker 7dfc1e0e15023caed4a34f99120ff5e38bd6e62f33bab81d5237d2bc
[2022-01-10 11:56:30,502 D 14563 5040023] node_manager.cc:997: [Worker] Message NotifyDirectCallTaskUnblocked(12) from worker with PID 14577
[2022-01-10 11:56:30,502 D 14563 5040023] dependency_manager.cc:148: Canceling get request for worker 7dfc1e0e15023caed4a34f99120ff5e38bd6e62f33bab81d5237d2bc
[2022-01-10 11:56:30,506 D 14563 5040023] node_manager.cc:1888: Finished task d92fce596cade5cd17b0cea61222523a1df0bddd01000000
[2022-01-10 11:56:30,506 D 14563 5040023] dependency_manager.cc:94: Canceling wait request for worker 7dfc1e0e15023caed4a34f99120ff5e38bd6e62f33bab81d5237d2bc
[2022-01-10 11:56:30,506 D 14563 5040023] dependency_manager.cc:148: Canceling get request for worker 7dfc1e0e15023caed4a34f99120ff5e38bd6e62f33bab81d5237d2bc
[2022-01-10 11:56:30,507 D 14563 5040023] node_manager.cc:997: [Worker] Message NotifyUnblocked(10) from worker with PID nil
[2022-01-10 11:56:30,507 D 14563 5040023] dependency_manager.cc:148: Canceling get request for worker 01000000ffffffffffffffffffffffffffffffffffffffffffffffff
[2022-01-10 12:02:43,534 I 14563 5040023] node_manager.cc:628: Sending Python GC request to 13 local workers to clean up Python cyclic references.
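A minimal driver that would produce a trace like the one above (a no-arg task named fun submitted after connecting to the running cluster) looks roughly like this; this is a simplified sketch, not my exact script:

import ray

# Connect to the cluster that was already started with `ray start`.
ray.init(address="auto")

@ray.remote
def fun():
    # No-arg task, matching the "has args of size 0" lines in the log.
    return "done"

# Submitting the task triggers the "Pop worker for task ..." and
# "Dispatching task ..." entries in raylet.out.
print(ray.get(fun.remote()))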