I have a question regarding the cluster config example for GCP. The reason is that I run into memory problems when running my RLlib training jobs.
I see in the ray_head_gpu node description:
This custom machine type provides, I guess, 6 cores and 16 GB of memory? Is this the general way to tell Compute Engine how to put together my virtual machine, i.e. if I write custom-6-32768, do I get double the memory?
Are there any guidelines or best practices as to how to choose the memory and core resources on head and worker nodes?
To create a custom machine type, provide a URL to a machine type in the following format, where CPUS is 1 or an even number up to 32 (2, 4, 6, … 24, etc), and MEMORY is the total memory for this instance. Memory must be a multiple of 256 MB and must be supplied in MB (e.g. 5 GB of memory is 5120 MB):
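So for a Ray cluster config on GCP, the head node's machine type could look like the following sketch (the node name and values are illustrative; `custom-6-16384` means 6 vCPUs and 16384 MB = 16 GB of RAM, and `custom-6-32768` would indeed double the memory):

```yaml
# Illustrative fragment of a Ray GCP cluster YAML.
ray_head_gpu:
    node_config:
        # Format: custom-<CPUS>-<MEMORY_MB>
        # 6 vCPUs, 16384 MB (16 GB) RAM; custom-6-32768 would be 32 GB.
        machineType: custom-6-16384
```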
Note that you’ll also want to update this line to match your CPU count.
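As a sketch of what keeping the CPU count in sync means (field names follow the standard Ray cluster YAML; the values are assumptions for illustration): if you declare resources explicitly for a node type, they should match the machine type you chose.

```yaml
# Hypothetical fragment: the declared CPU count should match the
# machine type (custom-6-16384 has 6 vCPUs).
ray_head_gpu:
    resources: {"CPU": 6}
    node_config:
        machineType: custom-6-16384
```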
Resource needs will depend on the application.
Thank you for the quick reply and the helpful links. I read the GCP docs on custom machine types, but what I could not find was any information about the specifics of the naming scheme (custom-CPUS-MEMORY). Where is it stated that a custom machine has to be named this way? If I write e.g. this-is-some-machine, I guess I get an error, since either a custom type or a standard machine type has to be provided.
Second: my question was poorly formulated. What I wanted to know was whether there are any guidelines or best practices for choosing node sizes (head/worker) for a Ray cluster.
We have some advice for picking node types here. The node types are specific to AWS, but should have equivalents on GCP. The "How many CPUs/GPUs?" section should be useful here. In particular, the Ray Dashboard can give you an idea of what resources your workload is consuming or bottlenecking on.