RayOutOfMemoryError: Why is autoscaler not creating new pods?

I am running a Ray cluster deployed with the Helm chart: 1 head pod, with worker pods left to autoscale between 0 and 6. The worker pod resource requests and limits are set to 1 CPU and 2048Mi of memory.

When I submit a job, 1 worker pod gets created. However, the Ray CLI job logs show an OOM error:

Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::ConsumerActor.run_behaviour() (pid=586, ip=10.240.0.117, repr=<xxx.actors.consumer_actor.ConsumerActor object at 0x7fcdf1ebf790>)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ray-cluster-ray-worker-type-9w5p5 is used (1.96 / 2.0 GB). The top 10 memory consumers are:

If the memory is not enough for the tasks to execute, should the autoscaler not kick in and create more pods for the tasks?

@Dmitri, can you please take a look?

> If the memory is not enough for the tasks to execute, should the autoscaler not kick in and create more pods for the tasks?

The autoscaler scales based on the logical resources declared in Ray task and actor annotations (@ray.remote).
It does not scale based on actual CPU or memory utilization.

Note that for tasks, @ray.remote is equivalent to @ray.remote(num_cpus=1).
So, if you have workers with 1 CPU and you submit 6 tasks decorated with @ray.remote, the autoscaler will attempt to create 6 workers to run those tasks. (That is assuming the max workers setting is at least 6 in the relevant config.)
However, if each of those tasks uses more memory than its worker has available, it will fail with an OOM exception.
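To make that concrete, here is a minimal sketch (the work function is hypothetical) of how the default annotation, not memory usage, drives scaling:

```python
import ray

ray.init(address="auto")  # connect to the existing cluster

# The bare decorator implicitly requests 1 logical CPU,
# i.e. @ray.remote is the same as @ray.remote(num_cpus=1).
@ray.remote
def work(i):
    return i * i

# Submitting 6 such tasks requests 6 logical CPUs in total. With 1-CPU workers and
# max workers >= 6, the autoscaler will try to bring up 6 worker pods, regardless of
# how much memory each task actually ends up consuming.
futures = [work.remote(i) for i in range(6)]
print(ray.get(futures))
```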

The solution is some manual vertical scaling: try allocating more memory to your workers.
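As a rough sketch (the ConsumerActor name comes from the traceback above; the 3 GiB figure is purely illustrative), once the worker pods have larger memory requests/limits you can also declare the actor's memory need so the scheduler and autoscaler account for it:

```python
import ray

# Illustrative only: declare the actor's memory requirement (in bytes) so Ray will
# only place it on a node with that much memory available, and the autoscaler will
# provision a suitably sized worker. Adjust the figure to the actor's real footprint.
@ray.remote(num_cpus=1, memory=3 * 1024**3)  # 3 GiB
class ConsumerActor:
    def run_behaviour(self):
        ...  # the memory-heavy work from the original job
```

Note that the memory argument is used for scheduling only; it does not cap the actor's actual usage.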

See also this thread: