Is there a way to stall jobs or to check the resource usage before submitting a job?

A question was asked in this Slack thread:

I have a long-running script that keeps pulling messages from Pub/Sub and submits a Ray job per message to the Ray cluster. I have Ray autoscaling in place, but some of the jobs still fail with the error
ray.exceptions.OutOfMemoryError: Task was killed due to the node running low on memory. Is there a way to stall jobs or to check the resource usage before submitting a job?

  1. First, make sure you understand the difference between Ray's logical resources and physical hardware usage: logical resources only drive scheduling and do not limit what a task actually consumes (see the sketch after this list).
  2. When you see Task was killed due to the node running low on memory., it means the Ray memory monitor killed the task because the node's physical memory usage was too high.
  3. The Ray autoscaler makes its scaling decisions based on logical Ray resources, not on physical usage.
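
For illustration, here is a minimal sketch (the process_message task is hypothetical, not from the thread) of why a task can still OOM even when its logical resource request is satisfied: num_cpus only tells the scheduler what to reserve, it does not cap the task's memory.

```python
import ray

ray.init()

# Logical resources (num_cpus here) only tell the scheduler and autoscaler
# how much to reserve; they do not limit what the task physically consumes.
@ray.remote(num_cpus=1)
def process_message(msg):
    # Illustrative heavy allocation: physical memory can spike far beyond
    # anything implied by the single logical CPU requested above.
    data = [msg] * 10_000_000
    return len(data)

print(ray.get(process_message.remote("payload")))
```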

One potential solution

  • Profile the physical usage of a single job and request matching Ray resources for its tasks and actors, so that the logical allocation is consistent with the physical usage. Once they are consistent, the autoscaler can scale up when logical resources run out, which now coincides with high physical resource usage. A sketch follows below.
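
A minimal sketch of that idea, assuming each task peaks at roughly 2 GiB (an assumed figure) and using Ray's memory and num_cpus options together with ray.available_resources(); the handle_message task and has_headroom helper are hypothetical names:

```python
import ray

ray.init()

GIB = 1024 ** 3

# Declare the memory the task actually needs as a logical resource.
# The scheduler then stops packing tasks onto a node once its declared
# memory is exhausted, and the autoscaler sees memory demand it can act on.
@ray.remote(num_cpus=1, memory=2 * GIB)
def handle_message(msg):
    # The real per-message work would go here.
    return msg

# Hypothetical helper: check logical memory headroom before submitting more
# work, so submission can be stalled instead of overloading the cluster.
# ray.available_resources() reports logical resources; memory is in bytes
# on recent Ray versions.
def has_headroom(min_free_bytes=2 * GIB):
    return ray.available_resources().get("memory", 0) >= min_free_bytes

if has_headroom():
    print(ray.get(handle_message.remote("payload")))
```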

cc: @sangcho @jjyao

cc @Chen_Shen @Alex for scheduler and autoscaler decisions based on Ray resources