Good GCP head node for clusters with large workloads

Hi,
I wanted to ask what a good GCP head node type would be if I were to run a cluster with thousands of actors and tasks.

I recently ran a large cluster on GCP VMs with 80 connected clients (though only 50 were possible at a time, so there was a queue), each running several actors and scheduling several jobs. I always got worker timeouts after a while.

Thanks!

This is workload dependent, so there isn’t an exact answer. I would recommend watching resource usage of the head node during your workload to see if CPU/memory are being saturated.

One problem you might run into is the head node being overwhelmed with tasks/actors. In that case, you can try setting the resources advertised by the head node lower than the actual amount, which keeps the scheduler from placing too many tasks/actors on the head node.
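For example, a common way to do this is to advertise zero CPUs on the head node in the cluster-launcher config, so Ray schedules no tasks or actors there. A minimal sketch of the relevant fragment (node-type name and machine type are illustrative, adapt them to your own config):

```yaml
available_node_types:
  ray_head_default:          # illustrative name
    node_config:
      machineType: n2-standard-16   # assumed machine type
    # Advertise 0 CPUs so the scheduler won't place tasks/actors here;
    # the head still runs system processes like the GCS and dashboard.
    resources: {"CPU": 0}
head_node_type: ray_head_default
```

If you start the head node manually instead, the equivalent is `ray start --head --num-cpus=0`.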

Hi, thanks!

Actually, that might not have been the main issue after all. The jobs crash because of the following error :confused:

(raylet, ip=10.164.15.201) [2022-08-31 11:40:25,493 E 7466 7466] (raylet) worker_pool.cc:481: Some workers of the worker process(9828) have not registered within the timeout. The process is still alive, probably it’s hanging during start.