Good GCP head node for clusters with large workloads

Hi,
I wanted to ask what a good GCP head node type would be if I were to run a cluster with thousands of actors and tasks.

I recently ran a large cluster on GCP VMs with 80 connected clients (though only 50 were possible at a time, so there was a queue), each running several actors and scheduling several jobs. I always got worker timeouts after a while.

Thanks!

This is workload dependent, so there isn’t an exact answer. I would recommend watching resource usage of the head node during your workload to see if CPU/memory are being saturated.

One problem you might run into is the head node being overwhelmed with tasks/actors. In that case, you can try setting the resources advertised by the head node lower than the actual amount, which keeps the scheduler from placing too many tasks/actors on the head node.
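For example, a common way to do this is to advertise zero CPUs on the head node in the cluster-launcher config, so Ray schedules no tasks or actors there. A minimal sketch of the relevant fragment (node-type name and machine type are illustrative, adapt them to your own config):

```yaml
available_node_types:
  ray_head_default:          # illustrative name
    node_config:
      machineType: n2-standard-16   # assumed machine type
    # Advertise 0 CPUs so the scheduler won't place tasks/actors here;
    # the head still runs system processes like the GCS and dashboard.
    resources: {"CPU": 0}
head_node_type: ray_head_default
```

If you start the head node manually instead, the equivalent is `ray start --head --num-cpus=0`.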

Hi, thanks!

Actually, that might not have been the main issue after all. The jobs crash because of the following error :confused:

(raylet, ip=10.164.15.201) [2022-08-31 11:40:25,493 E 7466 7466] (raylet) worker_pool.cc:481: Some workers of the worker process(9828) have not registered within the timeout. The process is still alive, probably it’s hanging during start.