I am running a Ray cluster deployed with Helm charts: one head pod, and worker pods that autoscale between 0 and 6. The worker pods' resource requests and limits are set to 1 CPU and 2048Mi of memory.
When I submit a job, one worker pod gets created. However, the Ray CLI job logs show an out-of-memory error:
Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::ConsumerActor.run_behaviour() (pid=586, ip=10.240.0.117, repr=<xxx.actors.consumer_actor.ConsumerActor object at 0x7fcdf1ebf790>)
ray._private.memory_monitor.RayOutOfMemoryError: More than 95% of the memory on node ray-cluster-ray-worker-type-9w5p5 is used (1.96 / 2.0 GB). The top 10 memory consumers are:
If the memory is not enough for the tasks to execute, shouldn't the autoscaler kick in and create more pods for the tasks?
> If the memory is not enough for the tasks to execute, shouldn't the autoscaler kick in and create more pods for the tasks?
The autoscaler scales based on the logical resources declared in Ray task and actor annotations (@ray.remote).
It does not scale based on actual resource utilization, such as how much memory a running task consumes.
Note that for tasks, a bare @ray.remote is equivalent to @ray.remote(num_cpus=1).
So, if you have workers with 1 CPU and you submit 6 tasks decorated with @ray.remote, the autoscaler will attempt to create 6 workers to run those tasks (assuming the max workers setting in the relevant config is at least 6), as sketched below.
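For illustration, here is a minimal sketch of what the autoscaler actually sees; the task body is made up, and only the declared num_cpus matters for scaling:

```python
import ray

ray.init()

# A bare @ray.remote implicitly requests 1 logical CPU,
# i.e. it is equivalent to @ray.remote(num_cpus=1).
@ray.remote
def work(i):
    return i * i

# Submitting 6 such tasks requests 6 logical CPUs in total.
# With 1-CPU workers, the autoscaler tries to bring up 6 workers
# (capped by the max workers setting in your cluster config).
futures = [work.remote(i) for i in range(6)]
print(ray.get(futures))
```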
However, if each of those tasks needs more memory than a single worker provides, it will still fail with an OOM error, because memory usage was never part of the scheduling request.
The solution is manual vertical scaling: allocate more memory to your workers, for example by raising the memory request and limit in your Helm values. You can also declare the actor's memory requirement in its @ray.remote annotation so the scheduler accounts for it.
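As a sketch of that second part (the ConsumerActor name is taken from the traceback above; the 3 GiB figure and the method body are only illustrative), you can declare the actor's memory requirement so Ray treats it as a logical resource when scheduling and autoscaling:

```python
import ray

# memory= is a logical scheduling resource, expressed in bytes.
# The actor will only be placed on a node with at least this much
# logical memory available, and the autoscaler will request such a
# node if none exists. It does not enforce a runtime memory limit.
@ray.remote(num_cpus=1, memory=3 * 1024**3)  # ~3 GiB, illustrative
class ConsumerActor:
    def run_behaviour(self):
        ...  # the memory-heavy work that previously hit the OOM error
```

For this to help, the worker pods themselves must also be given more than the current 2048Mi in your Helm values; otherwise no worker in the 0-6 group can ever satisfy the request and the actor will stay pending.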