I have configured `upscaling_speed` to 10000, but the cluster still only starts about 5 nodes at the beginning and after that launches just one node at a time, which is very slow because each node has to download a large image and run its setup. Only when that node is up does the next one get launched. Any idea why this happens? It looks like some sequential operations are blocking each other.
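For context, here is a trimmed sketch of the cluster config I am using with `ray up` (node type names and per-worker resources are taken from the monitor log below; the docker image, provider details, machine types, and worker counts are placeholders, the line I actually changed is `upscaling_speed`):

```yaml
# Trimmed cluster YAML sketch -- only upscaling_speed matters for this question;
# names, machine types, zone/project and worker counts are placeholders.
cluster_name: gpu-cluster            # placeholder
max_workers: 20                      # placeholder upper bound
upscaling_speed: 10000               # set very high, expecting many workers to launch in parallel
idle_timeout_minutes: 5

docker:
    image: my-registry/my-image:latest   # placeholder; this is the large image pulled during setup
    container_name: ray_container

provider:
    type: gcp                        # GCP node provider (matches wait_for_compute_zone_operation in the log)
    region: europe-west4             # placeholder
    availability_zone: europe-west4-a
    project_id: my-project           # placeholder

auth:
    ssh_user: ubuntu                 # placeholder

available_node_types:
    ray_head_default:
        resources: {"CPU": 4}
        node_config:
            machineType: n1-standard-4       # placeholder
    ray_worker_gpu:
        min_workers: 0
        max_workers: 20                      # placeholder
        resources: {"CPU": 4, "GPU": 1}      # matches 4 CPUs / 1 V100 per worker in the log
        node_config:
            machineType: n1-standard-4       # placeholder; actual instance has a V100 attached

head_node_type: ray_head_default
```

My expectation was that with `upscaling_speed` set this high, the autoscaler would request all the workers it needs at once rather than one at a time.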
Here is the output of the monitor log, where you can see the workers starting one after the other:

```
======== Autoscaler status: 2021-12-08 12:37:06.532972 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
6 ray_worker_gpu
Pending:
10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/28.0 CPU
6.0/6.0 GPU
0.0/6.0 accelerator_type:V100
0.00/115.699 GiB memory
0.00/50.212 GiB object_store_memory
Demands:
{'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:13,469 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:37:13.469184 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
6 ray_worker_gpu
Pending:
10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/28.0 CPU
6.0/6.0 GPU
0.0/6.0 accelerator_type:V100
0.00/115.699 GiB memory
0.00/50.212 GiB object_store_memory
Demands:
{'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:20,402 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:37:20.402410 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
6 ray_worker_gpu
Pending:
10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/28.0 CPU
6.0/6.0 GPU
0.0/6.0 accelerator_type:V100
0.00/115.699 GiB memory
0.00/50.212 GiB object_store_memory
Demands:
{'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:27,304 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:37:27.304457 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
6 ray_worker_gpu
Pending:
10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/28.0 CPU
6.0/6.0 GPU
0.0/6.0 accelerator_type:V100
0.00/115.699 GiB memory
0.00/50.212 GiB object_store_memory
Demands:
{'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:34,355 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:37:34.355468 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
6 ray_worker_gpu
Pending:
10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/28.0 CPU
6.0/6.0 GPU
0.0/6.0 accelerator_type:V100
0.00/115.699 GiB memory
0.00/50.212 GiB object_store_memory
Demands:
{'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:41,281 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:37:41.280858 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
6 ray_worker_gpu
Pending:
10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/28.0 CPU
6.0/6.0 GPU
0.0/6.0 accelerator_type:V100
0.00/115.699 GiB memory
0.00/50.212 GiB object_store_memory
Demands:
{'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:48,215 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:37:48.215529 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
6 ray_worker_gpu
Pending:
10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/28.0 CPU
6.0/6.0 GPU
0.0/6.0 accelerator_type:V100
0.00/115.699 GiB memory
0.00/50.212 GiB object_store_memory
Demands:
{'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:53,299 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1638963472720-5d2a0eb9761e3-846b123f-20dc80ef to finish...
2021-12-08 12:37:58,883 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1638963472720-5d2a0eb9761e3-846b123f-20dc80ef finished.
2021-12-08 12:38:00,159 INFO autoscaler.py:990 -- StandardAutoscaler: Queue 1 new nodes for launch
2021-12-08 12:38:00,165 INFO node_launcher.py:99 -- NodeLauncher0: Got 1 nodes to launch.
2021-12-08 12:38:00,781 INFO node_launcher.py:99 -- NodeLauncher0: Launching 1 nodes, type ray_worker_gpu.
2021-12-08 12:38:02,265 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1638963480799-5d2a0ec12a99a-8365dbff-4b7c25a5 to finish...
2021-12-08 12:38:39,737 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1638963480799-5d2a0ec12a99a-8365dbff-4b7c25a5 finished.
2021-12-08 12:38:40,366 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:38:40.366262 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
7 ray_worker_gpu
Pending:
10.164.0.63: ray_worker_gpu, uninitialized
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/32.0 CPU
7.0/7.0 GPU
0.0/7.0 accelerator_type:V100
0.00/133.519 GiB memory
0.00/57.849 GiB object_store_memory
Demands:
{'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:38:40,690 INFO monitor.py:328 -- :event_summary:Resized to 32 CPUs, 7 GPUs.
2021-12-08 12:38:40,690 INFO monitor.py:328 -- :event_summary:Adding 1 nodes of type ray_worker_gpu.
2021-12-08 12:38:46,952 INFO autoscaler.py:942 -- Creating new (spawn_updater) updater thread for node ray-dirk-default-worker-d4143c34-compute.
2021-12-08 12:38:47,909 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1638963527285-5d2a0eed7fa2a-5458f1d2-52536285 to finish...
2021-12-08 12:38:53,487 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1638963527285-5d2a0eed7fa2a-5458f1d2-52536285 finished.
2021-12-08 12:38:53,829 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:38:53.829788 ========
Node status
---------------------------------------------------------------
Healthy:
1 ray_head_default
7 ray_worker_gpu
Pending:
10.164.0.63: ray_worker_gpu, waiting-for-ssh
Recent failures:
(no failures)
Resources
---------------------------------------------------------------
Usage:
0.0/32.0 CPU
7.0/7.0 GPU
0.0/7.0 accelerator_type:V100
0.00/133.519 GiB memory
0.00/57.849 GiB object_store_memory
Demands:
{'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:39:00,995 INFO autoscaler.py:267 --
```