Is there a way to stall jobs or to check the resource usage before submitting a job?

A question was asked in this Slack thread:

I have a long-running script that keeps pulling messages from Pub/Sub and submits a Ray job per message to the Ray cluster. I have Ray autoscaling in place, but some of the jobs still fail with the error
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory. Is there a way to stall jobs or to check the resource usage before submitting a job?

  1. First, make sure you understand the difference between Ray's logical resources and physical hardware usage: logical resources only drive scheduling and do not limit what a task actually consumes (see the sketch after this list).
  2. When you see Task was killed due to the node running low on memory., it means the Ray memory monitor killed the task because the node's physical memory usage was too high.
  3. The Ray autoscaler makes its scaling decisions based on logical Ray resources, not on physical usage.
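
For illustration, here is a minimal sketch (the process_message task is hypothetical, not from the thread) of why a task can still OOM even when its logical resource request is satisfied: num_cpus only tells the scheduler what to reserve, it does not cap the task's memory.

```python
import ray

ray.init()

# Logical resources (num_cpus here) only tell the scheduler and autoscaler
# how much to reserve; they do not limit what the task physically consumes.
@ray.remote(num_cpus=1)
def process_message(msg):
    # Illustrative heavy allocation: physical memory can spike far beyond
    # anything implied by the single logical CPU requested above.
    data = [msg] * 10_000_000
    return len(data)

print(ray.get(process_message.remote("payload")))
```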

One potential solution

  • Profile the physical usage of a single job and request matching Ray resources for its tasks and actors, so that the logical allocation is consistent with the physical usage. Once they are consistent, the autoscaler can scale up when logical resources run out, which now coincides with high physical resource usage. A sketch follows below.
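
A minimal sketch of that idea, assuming each task peaks at roughly 2 GiB (an assumed figure) and using Ray's memory and num_cpus options together with ray.available_resources(); the handle_message task and has_headroom helper are hypothetical names:

```python
import ray

ray.init()

GIB = 1024 ** 3

# Declare the memory the task actually needs as a logical resource.
# The scheduler then stops packing tasks onto a node once its declared
# memory is exhausted, and the autoscaler sees memory demand it can act on.
@ray.remote(num_cpus=1, memory=2 * GIB)
def handle_message(msg):
    # The real per-message work would go here.
    return msg

# Hypothetical helper: check logical memory headroom before submitting more
# work, so submission can be stalled instead of overloading the cluster.
# ray.available_resources() reports logical resources; memory is in bytes
# on recent Ray versions.
def has_headroom(min_free_bytes=2 * GIB):
    return ray.available_resources().get("memory", 0) >= min_free_bytes

if has_headroom():
    print(ray.get(handle_message.remote("payload")))
```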

cc: @sangcho @jjyao

cc @Chen_Shen @Alex for scheduler and autoscaler decisions based on Ray resources