Trying to better understand resource utilization in a Kubernetes cluster for Ray Serve deployments. Assume relatively lightweight, CPU-only deployments (e.g., {"num_cpus": 1} per deployment).
How do the worker CPU resource limits in the cluster config YAML interact with my @serve.deployment deployments? For example, if my cluster config sets the worker resource limit to CPU=4, does that mean I can run 4 deployment replicas per worker? At the 5th deployment (/deployment replica), does Ray spin up a new worker (assuming we're below max workers)? Does each deployment "reserve" its num_cpus from the worker pool and hold on to it forever? How would this affect CPU utilization numbers from a Kubernetes admin's viewpoint?
I am asking because we are seeing high CPU consumption, and I was wondering whether my cluster config settings are not optimal/best-practice for this use case. For concreteness, the setup is roughly what's sketched below.
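To make the question concrete, here is a minimal sketch of the setup (field names, class names, and the 4-CPU figure are illustrative, not from our actual config). On the K8s side, each worker pod is capped at 4 CPUs:

```yaml
# Hypothetical fragment of the worker pod spec in the cluster config:
# the container is limited to 4 CPUs, so Ray sees 4 logical CPUs per worker.
containers:
  - name: ray-worker
    resources:
      requests:
        cpu: "4"
      limits:
        cpu: "4"
```

and on the Serve side each replica declares 1 logical CPU:

```python
from ray import serve

# Hypothetical lightweight deployment: each replica asks the Ray
# scheduler for 1 logical CPU.
@serve.deployment(num_replicas=4, ray_actor_options={"num_cpus": 1})
class MyModel:
    def __call__(self, request):
        return "ok"

app = MyModel.bind()  # deployed with e.g. `serve.run(app)` on Ray 2.x
```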
> How would this affect CPU utilization numbers from a Kubernetes admin's viewpoint?
One deployment replica occupies one "logical" CPU, as tracked by the Ray scheduler: the replica's num_cpus is reserved from the node's logical CPU pool for as long as the replica is alive. So yes, a CPU=4 worker fits 4 such replicas, and a 5th replica that can't be scheduled is what prompts the autoscaler to bring up a new worker (up to max workers). The actual CPU usage, which is what a Kubernetes admin sees, depends on what the deployment is actually doing; Ray does not enforce the logical limit.
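A minimal sketch of that accounting using plain Ray actors (the same bookkeeping applies to Serve replicas; the node size and class name are illustrative):

```python
import ray

# Pretend this process is a single 4-CPU worker node.
ray.init(num_cpus=4)

@ray.remote(num_cpus=1)
class Replica:
    def ping(self):
        return "ok"

actors = [Replica.remote() for _ in range(4)]
ray.get([a.ping.remote() for a in actors])  # make sure all 4 are placed

# All 4 logical CPUs are now reserved, even though the actors sit idle:
print(ray.available_resources().get("CPU", 0.0))  # ~0.0

# A 5th actor would stay pending on this node; on a real cluster the
# autoscaler reacts to that pending demand by starting another worker.
```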
Allocating 1 logical CPU to a Ray task/actor that actually uses significantly more than 1 physical core would cause problems: with 4 such over-active tasks/actors in a 4-CPU K8s pod, Kubernetes's CPU throttling mechanisms kick in.
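So the knob to turn is num_cpus. As a hedged sketch (the deployment name and the 2-core figure are hypothetical), a replica that genuinely saturates ~2 cores should declare that, so the scheduler packs only 2 of them per 4-CPU worker:

```python
from ray import serve

# Hypothetical CPU-heavy deployment. Declaring num_cpus=2 makes the Ray
# scheduler place at most 2 such replicas on a 4-CPU worker, keeping
# physical usage near the pod's K8s limit.
@serve.deployment(ray_actor_options={"num_cpus": 2})
class HeavyModel:
    def __call__(self, request):
        return "ok"
```

Note that num_cpus is still only scheduling metadata; it does not cap what the replica actually consumes.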
Thanks @Dmitri, all that information helps a lot. So you are saying we should be careful to set num_cpus on the Ray side to roughly match the work each replica actually does, so that the logical accounting tracks physical usage and additional K8s workers/pods get spun up as/if needed?
Thanks again