[GCP] Ray Cluster on GCP scales up very slowly

I have set `upscaling_speed` to 10000, but the cluster still only starts about 5 nodes at the beginning and then launches just one node at a time. This is very slow because each node has to download a large image and run its setup, and only once it is up is the next one scheduled. Any idea why this happens? It looks like some sequential operations are blocking each other.

Here is the monitor log output, where you can see the workers starting sequentially:

```
======== Autoscaler status: 2021-12-08 12:37:06.532972 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
 6 ray_worker_gpu
Pending:
 10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/28.0 CPU
 6.0/6.0 GPU
 0.0/6.0 accelerator_type:V100
 0.00/115.699 GiB memory
 0.00/50.212 GiB object_store_memory

Demands:
 {'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:13,469 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:37:13.469184 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
 6 ray_worker_gpu
Pending:
 10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/28.0 CPU
 6.0/6.0 GPU
 0.0/6.0 accelerator_type:V100
 0.00/115.699 GiB memory
 0.00/50.212 GiB object_store_memory

Demands:
 {'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:20,402 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:37:20.402410 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
 6 ray_worker_gpu
Pending:
 10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/28.0 CPU
 6.0/6.0 GPU
 0.0/6.0 accelerator_type:V100
 0.00/115.699 GiB memory
 0.00/50.212 GiB object_store_memory

Demands:
 {'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:27,304 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:37:27.304457 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
 6 ray_worker_gpu
Pending:
 10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/28.0 CPU
 6.0/6.0 GPU
 0.0/6.0 accelerator_type:V100
 0.00/115.699 GiB memory
 0.00/50.212 GiB object_store_memory

Demands:
 {'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:34,355 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:37:34.355468 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
 6 ray_worker_gpu
Pending:
 10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/28.0 CPU
 6.0/6.0 GPU
 0.0/6.0 accelerator_type:V100
 0.00/115.699 GiB memory
 0.00/50.212 GiB object_store_memory

Demands:
 {'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:41,281 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:37:41.280858 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
 6 ray_worker_gpu
Pending:
 10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/28.0 CPU
 6.0/6.0 GPU
 0.0/6.0 accelerator_type:V100
 0.00/115.699 GiB memory
 0.00/50.212 GiB object_store_memory

Demands:
 {'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:48,215 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:37:48.215529 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
 6 ray_worker_gpu
Pending:
 10.164.0.62: ray_worker_gpu, setting-up
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/28.0 CPU
 6.0/6.0 GPU
 0.0/6.0 accelerator_type:V100
 0.00/115.699 GiB memory
 0.00/50.212 GiB object_store_memory

Demands:
 {'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:37:53,299 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1638963472720-5d2a0eb9761e3-846b123f-20dc80ef to finish...
2021-12-08 12:37:58,883 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1638963472720-5d2a0eb9761e3-846b123f-20dc80ef finished.
2021-12-08 12:38:00,159 INFO autoscaler.py:990 -- StandardAutoscaler: Queue 1 new nodes for launch
2021-12-08 12:38:00,165 INFO node_launcher.py:99 -- NodeLauncher0: Got 1 nodes to launch.
2021-12-08 12:38:00,781 INFO node_launcher.py:99 -- NodeLauncher0: Launching 1 nodes, type ray_worker_gpu.
2021-12-08 12:38:02,265 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1638963480799-5d2a0ec12a99a-8365dbff-4b7c25a5 to finish...
2021-12-08 12:38:39,737 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1638963480799-5d2a0ec12a99a-8365dbff-4b7c25a5 finished.
2021-12-08 12:38:40,366 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:38:40.366262 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
 7 ray_worker_gpu
Pending:
 10.164.0.63: ray_worker_gpu, uninitialized
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/32.0 CPU
 7.0/7.0 GPU
 0.0/7.0 accelerator_type:V100
 0.00/133.519 GiB memory
 0.00/57.849 GiB object_store_memory

Demands:
 {'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:38:40,690 INFO monitor.py:328 -- :event_summary:Resized to 32 CPUs, 7 GPUs.
2021-12-08 12:38:40,690 INFO monitor.py:328 -- :event_summary:Adding 1 nodes of type ray_worker_gpu.
2021-12-08 12:38:46,952 INFO autoscaler.py:942 -- Creating new (spawn_updater) updater thread for node ray-dirk-default-worker-d4143c34-compute.
2021-12-08 12:38:47,909 INFO node.py:295 -- wait_for_compute_zone_operation: Waiting for operation operation-1638963527285-5d2a0eed7fa2a-5458f1d2-52536285 to finish...
2021-12-08 12:38:53,487 INFO node.py:307 -- wait_for_compute_zone_operation: Operation operation-1638963527285-5d2a0eed7fa2a-5458f1d2-52536285 finished.
2021-12-08 12:38:53,829 INFO autoscaler.py:267 --
======== Autoscaler status: 2021-12-08 12:38:53.829788 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
 7 ray_worker_gpu
Pending:
 10.164.0.63: ray_worker_gpu, waiting-for-ssh
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/32.0 CPU
 7.0/7.0 GPU
 0.0/7.0 accelerator_type:V100
 0.00/133.519 GiB memory
 0.00/57.849 GiB object_store_memory

Demands:
 {'GPU': 1.0}: 1+ pending tasks/actors
2021-12-08 12:39:00,995 INFO autoscaler.py:267 --
```

I guess I figured out the answer:
In the first round, the autoscaler creates only 5 worker nodes of the requested type unless `min_workers` is set to something higher in the config. This number is unfortunately hard-coded. After the first round, the autoscaler can scale directly to the number of nodes required by your job, so be patient.

I think it would be better if the autoscaler immediately assumed num_workers=1 and applied the upscaling factor from there instead of using the hard-coded initial batch.
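Another option might be to tell the autoscaler about the full demand up front via `ray.autoscaler.sdk.request_resources`, so it does not have to discover the demand task by task. This is just a sketch, assuming the Ray 1.x autoscaler SDK; the target of 20 single-GPU bundles is made up for illustration:

```python
import ray
from ray.autoscaler.sdk import request_resources

ray.init(address="auto")  # assumes you are connected to the running cluster

# Ask the autoscaler to scale to enough nodes for 20 one-GPU bundles right
# away, instead of letting it discover the demand one pending task at a time.
# (20 is a placeholder; use your real target.)
request_resources(bundles=[{"GPU": 1}] * 20)

# ... submit the actual workload here ...

# A later call to request_resources replaces the previous request, so the
# cluster can scale back down once the explicit request is dropped (check
# the Ray docs for the exact semantics in your version).
```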

The sequential creation of workers was our fault. We were calling our ray.remote function in a loop and waiting for each result, which blocks until the cluster has scaled up to meet the resource demand. I also think this blocking shouldn't happen. The workaround is to submit the ray.remote calls in parallel in the background and gather the results afterwards, as in the sketch below.
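For illustration, here is a minimal sketch of the two submission patterns; the `train_shard` task, `num_gpus=1`, and the count of 20 tasks are placeholders, not our actual workload:

```python
import ray

ray.init(address="auto")  # assumes a running Ray cluster

@ray.remote(num_gpus=1)
def train_shard(shard_id):
    # placeholder for the real GPU work
    return shard_id

# Blocking anti-pattern: ray.get() inside the loop waits for each task (and
# possibly for a new node to come up) before the next task is submitted, so
# the autoscaler only ever sees a demand of one GPU at a time.
# results = [ray.get(train_shard.remote(i)) for i in range(20)]

# Workaround: submit all tasks first so the autoscaler sees the full
# {'GPU': 1} x 20 demand at once, then gather the results.
refs = [train_shard.remote(i) for i in range(20)]
results = ray.get(refs)
```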