Here is some additional information. I start my Ray session on nodes with the n1-standard-32 machine type and 4 GPUs each (16 nodes in total, so 64 GPUs and 512 CPUs expected). Logs:
2021-04-06 09:42:57,999 INFO worker.py:654 – Connecting to existing Ray cluster at address: 10.140.0.3:6379
(autoscaler +3m3s) Tip: use ray status to view detailed autoscaling status. To disable autoscaler event messages, you can set AUTOSCALER_EVENTS=0.
(autoscaler +3m3s) Resized to 0 CPUs.
(autoscaler +3m3s) Adding 15 nodes of type node.
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,035 C 4422 4422] service_based_gcs_client.cc:228: Couldn’t reconnect to GCS server. The last attempted GCS server address was 10.140.0.3:33897
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,035 E 4422 4422] logging.cc:441: *** Aborted at 1617702575 (unix time) try “date -d 1617702575” if you are using GNU date ***
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,035 E 4422 4422] logging.cc:441: PC: 0x0 (unknown)
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,078 E 4422 4422] logging.cc:441: *** SIGABRT (0x3f600001146) received by PID 4422 (TID 0x7fb9f347c800) from PID 4422; stack trace: ***
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,152 E 4422 4422] logging.cc:441: 0x55cfb950374f google::(anonymous namespace)::FailureSignalHandler()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,152 E 4422 4422] logging.cc:441: 0x7fb9f397f730 (unknown)
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,153 E 4422 4422] logging.cc:441: 0x7fb9f34b87bb gsignal
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,153 E 4422 4422] logging.cc:441: 0x7fb9f34a3535 abort
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,297 E 4422 4422] logging.cc:441: 0x55cfb94ef8be ray::SpdLogMessage::Flush()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,309 E 4422 4422] logging.cc:441: 0x55cfb94ef98d ray::RayLog::~RayLog()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,312 E 4422 4422] logging.cc:441: 0x55cfb924a72f ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,312 E 4422 4422] logging.cc:441: 0x55cfb924a845 ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,313 E 4422 4422] logging.cc:441: 0x55cfb924a9bb ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,361 E 4422 4422] logging.cc:441: 0x55cfb94aefe4 ray::PeriodicalRunner::DoRunFnPeriodically()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,363 E 4422 4422] logging.cc:441: 0x55cfb94af9af ray::PeriodicalRunner::RunFnPeriodically()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,363 E 4422 4422] logging.cc:441: 0x55cfb924c0e4 ray::gcs::ServiceBasedGcsClient::Connect()
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,375 E 4422 4422] logging.cc:441: 0x55cfb90b0071 main
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,375 E 4422 4422] logging.cc:441: 0x7fb9f34a509b __libc_start_main
(raylet, ip=10.140.0.61) [2021-04-06 09:49:35,377 E 4422 4422] logging.cc:441: 0x55cfb90c9425 (unknown)
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,243 C 4555 4555] service_based_gcs_client.cc:228: Couldn’t reconnect to GCS server. The last attempted GCS server address was 10.140.0.3:33897
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,243 E 4555 4555] logging.cc:441: *** Aborted at 1617702720 (unix time) try “date -d 1617702720” if you are using GNU date ***
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,243 E 4555 4555] logging.cc:441: PC: 0x0 (unknown)
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,268 E 4555 4555] logging.cc:441: *** SIGABRT (0x3f6000011cb) received by PID 4555 (TID 0x7f5c112de800) from PID 4555; stack trace: ***
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,296 E 4555 4555] logging.cc:441: 0x5640b22bb74f google::(anonymous namespace)::FailureSignalHandler()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,296 E 4555 4555] logging.cc:441: 0x7f5c117e1730 (unknown)
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,296 E 4555 4555] logging.cc:441: 0x7f5c1131a7bb gsignal
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,296 E 4555 4555] logging.cc:441: 0x7f5c11305535 abort
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,314 E 4555 4555] logging.cc:441: 0x5640b22a78be ray::SpdLogMessage::Flush()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,336 E 4555 4555] logging.cc:441: 0x5640b22a798d ray::RayLog::~RayLog()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,337 E 4555 4555] logging.cc:441: 0x5640b200272f ray::gcs::ServiceBasedGcsClient::ReconnectGcsServer()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,338 E 4555 4555] logging.cc:441: 0x5640b2002845 ray::gcs::ServiceBasedGcsClient::GcsServiceFailureDetected()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,338 E 4555 4555] logging.cc:441: 0x5640b20029bb ray::gcs::ServiceBasedGcsClient::PeriodicallyCheckGcsServerAddress()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,353 E 4555 4555] logging.cc:441: 0x5640b2266fe4 ray::PeriodicalRunner::DoRunFnPeriodically()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,355 E 4555 4555] logging.cc:441: 0x5640b22679af ray::PeriodicalRunner::RunFnPeriodically()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,355 E 4555 4555] logging.cc:441: 0x5640b20040e4 ray::gcs::ServiceBasedGcsClient::Connect()
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,356 E 4555 4555] logging.cc:441: 0x5640b1e68071 main
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,356 E 4555 4555] logging.cc:441: 0x7f5c1130709b __libc_start_main
(raylet, ip=10.140.15.207) [2021-04-06 09:52:00,360 E 4555 4555] logging.cc:441: 0x5640b1e81425 (unknown)
(autoscaler +10m4s) Resized to 32 CPUs, 4 GPUs.
2021-04-06 09:54:54,440 WARNING worker.py:1083 – The node with node id: 8400e8523ff9c2753e8daa9f2ee059b59ec847a29764bb65570916d8 and ip: 10.140.15.193 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
2021-04-06 09:55:38,858 WARNING worker.py:1083 – The node with node id: 7a60d785df8bade24cec57153493b1629775b383b4a74c60c28a66d7 and ip: 10.140.15.241 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +14m9s) Resized to 416 CPUs, 52 GPUs.
(autoscaler +14m9s) Removing 5 nodes of type node (launch failed).
2021-04-06 09:57:02,668 WARNING worker.py:1083 – The node with node id: b37507e5e5c725189c5db589610f3e918556402e8fcb20f17334adb7 and ip: 10.140.0.8 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
2021-04-06 09:58:00,769 WARNING worker.py:1083 – The node with node id: 21824064185463d6cfcc6efe9e43bf9c33e76e32f8b9eb34df6b7f18 and ip: 10.140.0.29 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +16m53s) Resized to 448 CPUs, 56 GPUs.
(autoscaler +16m53s) Adding 5 nodes of type node.
(autoscaler +16m53s) Removing 1 nodes of type node (launch failed).
(autoscaler +20m36s) Resized to 352 CPUs, 44 GPUs.
(autoscaler +20m36s) Adding 1 nodes of type node.
2021-04-06 10:04:34,218 WARNING worker.py:1083 – The node with node id: 6c1d29173a83b1e44253ff52cb222ae179eb0bda7c22c41ddbc8853c and ip: 10.140.15.216 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +22m16s) Removing 1 nodes of type node (launch failed).
2021-04-06 10:07:49,024 WARNING worker.py:1083 – The node with node id: 5483f66cc0e9258881ebd26e0e95039e7569780851e3726b8ad9cffa and ip: 10.140.15.212 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +25m40s) Resized to 480 CPUs, 60 GPUs.
(autoscaler +25m40s) Adding 1 nodes of type node.
(autoscaler +25m40s) Removing 2 nodes of type node (launch failed).
2021-04-06 10:08:33,630 WARNING worker.py:1083 – The node with node id: 7edb23299ab153380f30f7736f1f5c9093c40c4a67701b1ac0e64ffa and ip: 10.140.15.208 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +28m4s) Adding 2 nodes of type node.
2021-04-06 10:11:57,501 WARNING worker.py:1083 – The node with node id: 14df820b807a5c6f15b2ed12b2ff50f0afe252185cc4a45ea8885d15 and ip: 10.140.15.205 has been marked dead because the detector has missed too many heartbeats from it. This can happen when a raylet crashes unexpectedly or has lagging heartbeats.
(autoscaler +29m14s) Resized to 448 CPUs, 56 GPUs.
(autoscaler +29m14s) Removing 1 nodes of type node (launch failed).
(autoscaler +31m30s) Adding 1 nodes of type node.
(autoscaler +32m31s) Resized to 480 CPUs, 60 GPUs.
(autoscaler +32m31s) Removing 1 nodes of type node (launch failed).
So, among the other logs, we can see the autoscaler messages. They show that 13 nodes (416 CPUs) are attached only 14 minutes after the session starts, and that 5 nodes are then removed and restarted. We only got our 64 GPUs after 40 minutes of this downtime, and we were paying for the machines during that whole time. Any suggestions on how to fix it?
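To make the timeline easier to read, here is a minimal sketch that pulls the "Resized to" events out of a log like the one above and converts each into elapsed seconds and an estimated node count. The names `RESIZE_RE`, `CPUS_PER_NODE` and `parse_resizes` are my own, not part of Ray. The node estimate assumes every node contributes 32 CPUs (n1-standard-32), and the log does not say whether the head node is included in the totals. The optional hours part of the timestamp is also an assumption, since the log only shows minutes and seconds.

```python
import re

# Matches lines such as "(autoscaler +14m9s) Resized to 416 CPUs, 52 GPUs."
# The "h" group is an assumption; only minutes and seconds appear in the log above.
RESIZE_RE = re.compile(
    r"\(autoscaler \+(?:(\d+)h)?(?:(\d+)m)?(?:(\d+)s)?\) "
    r"Resized to (\d+) CPUs(?:, (\d+) GPUs)?"
)

CPUS_PER_NODE = 32  # assumption: every node is an n1-standard-32


def parse_resizes(log_text):
    """Return a list of (elapsed_seconds, cpus, gpus, estimated_nodes)."""
    events = []
    for line in log_text.splitlines():
        match = RESIZE_RE.search(line)
        if match is None:
            continue  # not a resize event
        hours, minutes, seconds, cpus, gpus = match.groups()
        elapsed = int(hours or 0) * 3600 + int(minutes or 0) * 60 + int(seconds or 0)
        cpus = int(cpus)
        gpus = int(gpus or 0)  # "Resized to 0 CPUs." has no GPU part
        events.append((elapsed, cpus, gpus, cpus // CPUS_PER_NODE))
    return events


# A few lines copied from the log above.
sample = """\
(autoscaler +3m3s) Resized to 0 CPUs.
(autoscaler +3m3s) Adding 15 nodes of type node.
(autoscaler +10m4s) Resized to 32 CPUs, 4 GPUs.
(autoscaler +14m9s) Resized to 416 CPUs, 52 GPUs.
(autoscaler +14m9s) Removing 5 nodes of type node (launch failed).
(autoscaler +32m31s) Resized to 480 CPUs, 60 GPUs.
"""

for elapsed, cpus, gpus, nodes in parse_resizes(sample):
    print(f"{elapsed:5d}s  {cpus:4d} CPUs  {gpus:3d} GPUs  ~{nodes} nodes")
```

Running this over the full log shows the cluster going up and down between 352 and 480 CPUs after the 14 minute mark, without reaching the expected 512 CPUs by the last message at +32m31s.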