Unable to increase the number of cores for head node

Hi, I’m running a Ray cluster on K8s and trying to assign more CPU cores to the head node. I’m using the official chart configuration for deployment, and here’s my values.yaml:
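The relevant part (assuming the standard `podTypes` layout of the Ray 1.9 chart; only the head pod type is shown, and values other than the CPU count are illustrative) looks roughly like this:

```yaml
# Sketch of the relevant values.yaml fields; key names follow the Ray 1.9 Helm chart,
# values other than CPU are illustrative.
image: rayproject/ray:1.9.0
operatorImage: rayproject/ray:1.9.0

headPodType: rayHeadType

podTypes:
  rayHeadType:
    CPU: 2          # the number of cores I want the head node to get
    memory: 2Gi     # illustrative value
    GPU: 0
```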

However, after the cluster is started, the head node’s pod YAML shows a requested CPU that is some weird number which doesn’t appear anywhere in my configs.

As a result, there’s only 1 core assigned to the Ray head.

I don’t know what else might control this setting and hope someone can shed some light.

Thanks,
-BS

That’s weird. I’ll take a look.

Thanks Dmitri. Just for your information, I’m using Ray 1.9 for now; please let me know if you need anything else.

Is the chart pulled from the Ray master branch?

I was not able to reproduce this on my first attempt.

The chart’s logic sets limits equal to requests:
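Paraphrasing the relevant template from memory rather than quoting it exactly, the resource block it renders for a pod type amounts to roughly:

```yaml
# Approximate paraphrase of the chart's rendered output, not the exact template text:
# the same CPU/memory values from the pod type are written into both requests and limits.
resources:
  requests:
    cpu: 2          # from the pod type's CPU field
    memory: 512Mi   # from the pod type's memory field
  limits:
    cpu: 2
    memory: 512Mi
```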

so this is very strange.

The Helm chart configures a “RayCluster” custom resource which is then processed by an operator. Maybe we can take a look at the intermediate RayCluster object first.

After installing the chart, could you run `kubectl -n <your namespace> get raycluster <your release name> -o yaml` and see what the requests and limits look like in that configuration?

`kubectl get raycluster -o yaml` shows the correct CPU request and limit (which is 2 in my case). Only the started head pod’s YAML somehow has a weird request number.

How should I take a look at the ‘RayCluster’ object?

Yes. We use the chart from the ray-1.9.0 release.

I did a file-by-file diff (ray-1.9.0 vs. ours).

One theory is that something in your K8s environment is mutating the requested pod.
What kind of K8s environment are you running in?
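One way to check for that kind of mutation is to list the cluster’s mutating admission webhooks with `kubectl get mutatingwebhookconfigurations` and see whether any of them target pods in your namespace.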

Thanks for the config details!

The operator image should also be pinned to Ray 1.9.0 (`operatorImage: rayproject/ray:1.9.0`).
Is that the case in your configs?

Just as a sanity check, what happens if you try to create a scratch pod (say, with a busybox image) with CPU requests=limits=2?
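Something along these lines (illustrative manifest; the pod name is arbitrary):

```yaml
# Scratch pod with CPU requests equal to limits, for comparison with the Ray head pod.
apiVersion: v1
kind: Pod
metadata:
  name: cpu-sanity-check
spec:
  containers:
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]
    resources:
      requests:
        cpu: "2"
      limits:
        cpu: "2"
```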

I double-checked the operator image; it’s indeed ray:1.9.0. A scratch pod seems to run fine (with the correct number of requested CPUs), and we have a bunch of other services/pods running on this K8s platform.

Regarding the K8s environment, it’s a vendor K8s platform, TKE: GitHub - tkestack/tke: Native Kubernetes container management platform supporting multi-tenant and multi-cluster

Interestingly, we just tried setting the CPU request to 4, and in this case the head pod’s CPU request becomes 1 … not 664m …

I understand you might not be able to upgrade to a newer Ray for your application, but what happens if you use base Ray 1.13.0 images instead?

Thanks Dmitri. After some investigation, it turned out to be an issue with our K8s platform (they put a hard-coded limit on the CPU resources for the test env) …
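(If anyone wants to check for that kind of cap themselves: when it’s implemented with standard Kubernetes objects it shows up under `kubectl -n <namespace> get limitrange,resourcequota -o yaml`; in our case it was something the platform applies to the test environment.)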

Appreciate your help anyway!!

Ok, thanks for letting me know. This is good for my sanity :slight_smile: