GCP custom machine type and best practices for node choice

Hi everyone,

I have a question regarding the cluster config example for GCP. The reason is that I run into memory problems when running my RLlib training jobs.

I see in the ray_head_gpu node description:

node_config:
     machineType: custom-6-16384

I guess this custom machine type has 6 cores and 16 GB of memory? Is this the general way to tell Compute Engine how to put together my virtual machine, so that writing custom-6-32768 gives me double the memory?

Are there any guidelines or best practices as to how to choose the memory and core resources on head and worker nodes?

Thanks in advance!
Simon

Your method of selecting cores/memory is correct. The fields under node_config are based on this API: Method: instances.insert | Compute Engine Documentation | Google Cloud

To create a custom machine type, provide a URL to a machine type in the following format, where CPUS is 1 or an even number up to 32 (2, 4, 6, … 24, etc), and MEMORY is the total memory for this instance. Memory must be a multiple of 256 MB and must be supplied in MB (e.g. 5 GB of memory is 5120 MB):

zones/zone/machineTypes/custom-CPUS-MEMORY
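
As a rough illustration of the naming rule quoted above, here is a minimal Python sketch that builds and parses the custom-CPUS-MEMORY string. The function names `custom_machine_type` and `parse_custom_machine_type` are hypothetical helpers, not part of Ray or any Google Cloud library, and the checks only cover the constraints stated in the quoted text (Compute Engine may enforce further limits).

```python
from typing import Tuple


def custom_machine_type(cpus: int, memory_mb: int) -> str:
    """Build a 'custom-CPUS-MEMORY' name, checking only the rules quoted above."""
    # CPUS must be 1 or an even number up to 32.
    if cpus != 1 and (cpus < 2 or cpus > 32 or cpus % 2 != 0):
        raise ValueError(f"CPUS must be 1 or an even number up to 32, got {cpus}")
    # MEMORY is given in MB and must be a multiple of 256 MB.
    if memory_mb <= 0 or memory_mb % 256 != 0:
        raise ValueError(f"MEMORY must be a positive multiple of 256 MB, got {memory_mb}")
    return f"custom-{cpus}-{memory_mb}"


def parse_custom_machine_type(name: str) -> Tuple[int, float]:
    """Return (cpus, memory in GB) for a 'custom-CPUS-MEMORY' name."""
    prefix, cpus, memory_mb = name.split("-")
    if prefix != "custom":
        raise ValueError(f"not a custom machine type: {name}")
    # 1 GB is 1024 MB, matching the "5 GB is 5120 MB" example in the quoted text.
    return int(cpus), int(memory_mb) / 1024
```

For example, `custom_machine_type(6, 16384)` gives the string from the example config, and `parse_custom_machine_type("custom-6-32768")` reports 6 CPUs with 32 GB, i.e. double the memory of custom-6-16384.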

Note that you’ll also want to update this line to match your CPU count.

Resource needs will depend on the application.

Hi @ckw017 ,

thank you for the quick reply and the helpful links. I read about custom machine types on GCP, but what I could not find was any information about the specifics of the naming (custom-cpu-memory). Where does it say that this is how a custom machine type has to be named? If I write e.g. this-is-some-machine, I guess I get an error, since either a custom type or a standard machine type has to be provided.

Second: my question was poorly worded. What I wanted to know was whether there are any guidelines or best practices for choosing node sizes for a Ray cluster (head/worker).

For the first question, the details for custom node naming should be on the page I linked here. I’ve included a screenshot of the section detailing it (try just searching for machineType on the page)

We have some advice for picking node types here. The node types are specific to AWS, but should have equivalents on GCP. The "How many CPUs/GPUs?" section should be useful here. In particular, the Ray Dashboard can give you an idea of what resources your workload is consuming or bottlenecking on.
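
As a complement to the Dashboard, here is a minimal Python sketch that turns total and currently available resource counts into a utilization fraction per resource. The `utilization` helper is hypothetical and not part of Ray; it only assumes two dictionaries mapping resource names to quantities.

```python
from typing import Dict


def utilization(total: Dict[str, float], available: Dict[str, float]) -> Dict[str, float]:
    """Fraction of each resource currently in use, between 0.0 and 1.0."""
    usage = {}
    for name, capacity in total.items():
        if capacity <= 0:
            # Skip resources with no capacity to avoid dividing by zero.
            continue
        # A resource missing from `available` is treated as fully in use.
        free = available.get(name, 0.0)
        usage[name] = (capacity - free) / capacity
    return usage
```

On a running cluster the two dictionaries could come from `ray.cluster_resources()` and `ray.available_resources()`. Note that these report Ray's logical resource accounting for scheduling, not physical memory use, so for memory problems the Dashboard remains the better place to look.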

Hi @ckw017 ,

My fault. I did follow the link, but was unsure what to look for. Thank you for coming back to this, and thank you for the info about Ray cluster nodes.