[Clusters] [SGD] Cluster setup speed

Hi! We are using RaySGD+GCP for distributed neural network training. Our goal is to test hypotheses quickly (1-2 hours of training per run) using a cluster of about 16 instances. The issue is that such a cluster starts up slowly: currently about 30 minutes, i.e. 20-30% of the total usage time. We see two main problems:

  1. Workers are currently started only after the head node is completely set up.
  2. GCP doesn't create all workers at once; they are divided into groups of 4-5 instances, and these groups are started sequentially.

So the question is: can we start the head node and all workers simultaneously? Any other suggestions on speeding up cluster setup are also welcome.

@Ameer_Haj_Ali are there specific options that should be set to increase the speed of scaling up the cluster?

Hi @Vanster and @eoakes . Thanks for asking.

  1. You can set upscaling_speed to 10000 in the cluster YAML to make node scale-up faster.
  2. Unfortunately, the head node and the workers can't be started simultaneously as things stand now.
  3. To make startup faster, bake all of your setup/init commands into the Docker image; things will be much faster.

Hi @Ameer_Haj_Ali. Thanks for your answer.
It seems I really did miss the upscaling_speed parameter. But I've tried it and it didn't help :frowning: Probably because there is no scaling in our use case: we just create a cluster with min_workers=max_workers and never scale it at runtime.
We have already built an image with all the needed libraries and our repository; the only thing left to do on each worker is a git pull, so our setup/init commands are not slow.
Here is some more detail on the timings. ray up takes 3-5 minutes, which is fast enough. But when I attach to the head node and initialize the session with ray.init(address='auto'), it takes 10-20 minutes for all cluster resources to gather. Meanwhile, the worker nodes are up 3-5 minutes after the head node. So there is an interval of 5-15 minutes when all machines are already running but not all cluster resources have been collected. Can we somehow speed up this part?
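To make the symptom concrete, this is roughly the kind of wait loop we end up putting before training can start (a minimal sketch of our own polling, not a Ray feature; EXPECTED_GPUS and TIMEOUT_S are placeholder values for what we request):

import time
import ray

# Placeholders for illustration: total GPUs we expect and how long we are willing to wait.
EXPECTED_GPUS = 16
TIMEOUT_S = 30 * 60

ray.init(address='auto')
deadline = time.time() + TIMEOUT_S
while time.time() < deadline:
    # cluster_resources() only reflects nodes that have already attached to the head's GCS.
    gpus = ray.cluster_resources().get('GPU', 0)
    print(f'{int(gpus)}/{EXPECTED_GPUS} GPUs attached')
    if gpus >= EXPECTED_GPUS:
        break
    time.sleep(10)
else:
    raise TimeoutError('cluster did not assemble within the timeout')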

Are you saying that after the workers are up and running and connected to the cluster there is a 5-15 minute delay before the application starts running? This is very unexpected. Do you have a yaml + repro code so we can investigate?

CC @ijrsvt

After the workers are up (marked green in the GCP console), I start a Ray session, but len(ray.nodes()) doesn't report all of them for another 5-15 minutes. Code:

ray up issue.yaml -y
ray attach issue.yaml
python

import ray
import time

ray.init(address='auto')
while True:
    # Number of nodes the cluster currently knows about.
    print(len(ray.nodes()))
    time.sleep(10)
YAML:
cluster_name: issue

max_workers: 15

upscaling_speed: 10000

provider:
    type: gcp
    region: asia-east1
    availability_zone: asia-east1-a
    project_id: your_id

auth:
    ssh_user: ubuntu

available_node_types:
    node:        
        min_workers: 15
        max_workers: 15
        resources: {"CPU": 4, "GPU": 1}
        node_config:
            machineType: n1-standard-4
            disks:
              - type: PERSISTENT
                initializeParams:
                  sourceImage: global/images/your_image
                boot: True
                autoDelete: true
            guestAccelerators:
              - acceleratorType: projects/your_id/zones/asia-east1-a/acceleratorTypes/nvidia-tesla-p100
                acceleratorCount: 1
            scheduling:
              - preemptible: true
              - onHostMaintenance: TERMINATE 
            serviceAccounts:
              - email: ray-autoscaler-sa-v1@your_id.iam.gserviceaccount.com
                scopes:
                  - https://www.googleapis.com/auth/cloud-platform
        
head_node_type: node

initialization_commands:
    - >-
      timeout 300 bash -c "
          command -v nvidia-smi && nvidia-smi
          until [ \$? -eq 0 ]; do
              command -v nvidia-smi && nvidia-smi
          done"

setup_commands:
  - git --git-dir=/home/ubuntu/your_project/.git --work-tree=/home/ubuntu/your_project pull

head_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --head
      --port=6379
      --object-manager-port=8076
      --autoscaling-config=~/ray_bootstrap_config.yaml

worker_start_ray_commands:
    - ray stop
    - >-
      ulimit -n 65536;
      ray start
      --address=$RAY_HEAD_IP:6379
      --object-manager-port=8076

To use it as is, you will need to create your own GCP project, repository, and image. Our image is based on pytorch-latest-gpu-debian-10, with the NVIDIA driver installed and the repository and libraries (including the Ray nightly version) already downloaded.

Hi @Vanster, being marked green in the GCP console does not mean the nodes are attached to the GCS address. Can you remove all your setup and initialization commands and try again (how long does it take after you delete them)?
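If it helps, you can check which nodes have actually registered with the GCS (as opposed to just being green in the GCP console) with ray status on the head node, or from Python with something like the sketch below (the field names come from ray.nodes() and may differ slightly between Ray versions):

import ray

ray.init(address='auto')
# ray.nodes() lists every node the GCS has seen, including ones it has marked dead.
for node in ray.nodes():
    state = 'alive' if node['Alive'] else 'dead'
    print(node['NodeManagerAddress'], state, node.get('Resources', {}))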

Hi @Ameer_Haj_Ali! I removed my setup and initialization commands; it gave only about a 1-minute speedup. I also tried removing head_start_ray_commands and worker_start_ray_commands, with the same results.
As far as I understand, we start paying for an instance from the moment it is marked green, and we would like to pay less for the downtime while all the resources are being collected.

Here is some additional information. I start my Ray session on nodes with the n1-standard-32 machine type and 4 GPUs each (16 nodes in total, so 64 GPUs and 512 CPUs expected). Logs:

2021-04-06 09:42:57,999 INFO worker.py:654 – Connecting to existing Ray cluster at address: 10.140.0.3:6379
(autoscaler +3m3s) Tip: use ray status to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +3m3s) Resized to 0 CPUs.
(autoscaler +3m3s) Adding 15 nodes of type node.
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,035 C 4422 4422] service_based_gcs_client.cc:228: Couldn’t reconnect to GCS server. The last attempted GCS server address was 10.140.0.3:33897
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,035 E 4422 4422] logging.cc:441: *** Aborted at 1617702575 (unix time) try “date -d 1617702575” if you are using GNU date ***
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,035 E 4422 4422] logging.cc:441: PC: 0x0 (unknown)
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,078 E 4422 4422] logging.cc:441: *** SIGABRT (0x3f600001146) received by PID 4422 (TID 0x7fb9f347c800) from PID 4422; stack trace: ***
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,152 E 4422 4422] logging.cc:441: 0x55cfb950374f google::(anonymous namespace)::FailureSignalHandler()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,152 E 4422 4422] logging.cc:441: 0x7fb9f397f730 (unknown)
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,153 E 4422 4422] logging.cc:441: 0x7fb9f34b87bb gsignal
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,153 E 4422 4422] logging.cc:441: 0x7fb9f34a3535 abort
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,297 E 4422 4422] logging.cc:441: 0x55cfb94ef8be ray::SpdLogMessage::Flush()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,309 E 4422 4422] logging.cc:441: 0x55cfb94ef98d ray::RayLog::~RayLog()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,312 E 4422 4422] logging.cc:441: 0x55cfb924a72f ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,312 E 4422 4422] logging.cc:441: 0x55cfb924a845 ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,313 E 4422 4422] logging.cc:441: 0x55cfb924a9bb ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,361 E 4422 4422] logging.cc:441: 0x55cfb94aefe4 ray::PeriodicalRunner::DoRunFnPeriodically()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,363 E 4422 4422] logging.cc:441: 0x55cfb94af9af ray::PeriodicalRunner::RunFnPeriodically()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,363 E 4422 4422] logging.cc:441: 0x55cfb924c0e4 ray::gcs::ServiceBasedGcsClient::Connect()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,375 E 4422 4422] logging.cc:441: 0x55cfb90b0071 main
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,375 E 4422 4422] logging.cc:441: 0x7fb9f34a509b __libc_start_main
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,377 E 4422 4422] logging.cc:441: 0x55cfb90c9425 (unknown)
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,243 C 4555 4555] service_based_gcs_client.cc:228: Couldn’t reconnect to GCS server. The last attempted GCS server address was 10.140.0.3:33897
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,243 E 4555 4555] logging.cc:441: *** Aborted at 1617702720 (unix time) try “date -d 1617702720” if you are using GNU date ***
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,243 E 4555 4555] logging.cc:441: PC: 0x0 (unknown)
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,268 E 4555 4555] logging.cc:441: *** SIGABRT (0x3f6000011cb) received by PID 4555 (TID 0x7f5c112de800) from PID 4555; stack trace: ***
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,296 E 4555 4555] logging.cc:441: 0x5640b22bb74f google::(anonymous namespace)::FailureSignalHandler()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,296 E 4555 4555] logging.cc:441: 0x7f5c117e1730 (unknown)
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,296 E 4555 4555] logging.cc:441: 0x7f5c1131a7bb gsignal
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,296 E 4555 4555] logging.cc:441: 0x7f5c11305535 abort
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,314 E 4555 4555] logging.cc:441: 0x5640b22a78be ray::SpdLogMessage::Flush()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,336 E 4555 4555] logging.cc:441: 0x5640b22a798d ray::RayLog::~RayLog()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,337 E 4555 4555] logging.cc:441: 0x5640b200272f ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,338 E 4555 4555] logging.cc:441: 0x5640b2002845 ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,338 E 4555 4555] logging.cc:441: 0x5640b20029bb ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,353 E 4555 4555] logging.cc:441: 0x5640b2266fe4 ray::PeriodicalRunner::DoRunFnPeriodically()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,355 E 4555 4555] logging.cc:441: 0x5640b22679af ray::PeriodicalRunner::RunFnPeriodically()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,355 E 4555 4555] logging.cc:441: 0x5640b20040e4 ray::gcs::ServiceBasedGcsClient::Connect()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,356 E 4555 4555] logging.cc:441: 0x5640b1e68071 main
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,356 E 4555 4555] logging.cc:441: 0x7f5c1130709b __libc_start_main
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,360 E 4555 4555] logging.cc:441: 0x5640b1e81425 (unknown)
(autoscaler +10m4s) Resized to 32 CPUs, 4 GPUs.
2021-04-06 09:54:54,440 WARNING worker.py:1083 – The node with node id: 8400e8523ff9c2753e8daa9f2ee059b59ec847a29764bb65570916d8 and ip: 10.140.15.193 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
2021-04-06 09:55:38,858 WARNING worker.py:1083 – The node with node id: 7a60d785df8bade24cec57153493b1629775b383b4a74c60c28a66d7 and ip: 10.140.15.241 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +14m9s) Resized to 416 CPUs, 52 GPUs.
(autoscaler +14m9s) Removing 5 nodes of type node (launch failed).
2021-04-06 09:57:02,668 WARNING worker.py:1083 – The node with node id: b37507e5e5c725189c5db589610f3e918556402e8fcb20f17334adb7 and ip: 10.140.0.8 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
2021-04-06 09:58:00,769 WARNING worker.py:1083 – The node with node id: 21824064185463d6cfcc6efe9e43bf9c33e76e32f8b9eb34df6b7f18 and ip: 10.140.0.29 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +16m53s) Resized to 448 CPUs, 56 GPUs.
(autoscaler +16m53s) Adding 5 nodes of type node.
(autoscaler +16m53s) Removing 1 nodes of type node (launch failed).
(autoscaler +20m36s) Resized to 352 CPUs, 44 GPUs.
(autoscaler +20m36s) Adding 1 nodes of type node.
2021-04-06 10:04:34,218 WARNING worker.py:1083 – The node with node id: 6c1d29173a83b1e44253ff52cb222ae179eb0bda7c22c41ddbc8853c and ip: 10.140.15.216 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +22m16s) Removing 1 nodes of type node (launch failed).
2021-04-06 10:07:49,024 WARNING worker.py:1083 – The node with node id: 5483f66cc0e9258881ebd26e0e95039e7569780851e3726b8ad9cffa and ip: 10.140.15.212 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +25m40s) Resized to 480 CPUs, 60 GPUs.
(autoscaler +25m40s) Adding 1 nodes of type node.
(autoscaler +25m40s) Removing 2 nodes of type node (launch failed).
2021-04-06 10:08:33,630 WARNING worker.py:1083 – The node with node id: 7edb23299ab153380f30f7736f1f5c9093c40c4a67701b1ac0e64ffa and ip: 10.140.15.208 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +28m4s) Adding 2 nodes of type node.
2021-04-06 10:11:57,501 WARNING worker.py:1083 – The node with node id: 14df820b807a5c6f15b2ed12b2ff50f0afe252185cc4a45ea8885d15 and ip: 10.140.15.205 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +29m14s) Resized to 448 CPUs, 56 GPUs.
(autoscaler +29m14s) Removing 1 nodes of type node (launch failed).
(autoscaler +31m30s) Adding 1 nodes of type node.
(autoscaler +32m31s) Resized to 480 CPUs, 60 GPUs.
(autoscaler +32m31s) Removing 1 nodes of type node (launch failed).

So besides some other output we can see the autoscaler messages. They show that only 13 nodes are attached 14 minutes after the session starts, while 5 nodes are removed and then relaunched. We got our 64 GPUs after about 40 minutes of such downtime, and we were paying for the instances the whole time. Any suggestions on how to fix this?

@ijrsvt @Alex , Do you know what is going on? I don’t know why it takes so long for the instances to launch.