Ray Blocking Spark Jobs

How severely does this issue affect your experience of using Ray?

  • Medium: It causes significant difficulty in completing my task, but I can work around it.

I have set up a Ray cluster on Databricks Runtime 15.4 LTS without specifying min_worker_nodes and max_worker_nodes, so that resources are allocated dynamically. However, it blocks other Spark tasks/jobs, which wait indefinitely to start.

I was not facing this issue when running on 12.2 LTS.

Ray version: 2.33.0
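
For reference, this is roughly how I start the cluster (a minimal sketch assuming the ray.util.spark API; passing MAX_NUM_WORKER_NODES here is just my stand-in for not capping the worker count):

    from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster, MAX_NUM_WORKER_NODES
    import ray

    # Start a Ray-on-Spark cluster without pinning the worker count;
    # MAX_NUM_WORKER_NODES lets Ray claim every Spark worker it can get.
    setup_ray_cluster(max_worker_nodes=MAX_NUM_WORKER_NODES)

    # Connect to the cluster that was just created.
    ray.init()

    # ... Ray workloads ...

    # Release the Spark resources when finished.
    shutdown_ray_cluster()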

Hi! Are there any error messages while you’re waiting for the Spark jobs to start? I’m guessing this might be due to how Ray is using resources: if Ray is using them all, there may be nothing left for Spark to start with. Have you tried manually setting min_worker_nodes and max_worker_nodes to different values to see if that unblocks it?

I’m not too familiar with 15.4 LTS; did they change resource allocation or anything else when upgrading from 12.2?

There is no error; the Spark jobs are just stuck in a waiting state indefinitely. I have tried setting min_worker_nodes and max_worker_nodes explicitly, and that works fine: Ray scales up/down depending on load, leaving resources for Spark jobs to execute.

When I don’t specify the number of worker nodes, Ray starts with all worker nodes available in the Databricks Spark cluster in order to meet its minimum worker count.
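
As a sketch of what worked for me when setting the limits explicitly (the exact numbers depend on your cluster size):

    from ray.util.spark import setup_ray_cluster
    import ray

    # Cap Ray at 2 Spark workers and let it autoscale between 1 and 2,
    # so at least one worker stays free for regular Spark jobs.
    setup_ray_cluster(
        min_worker_nodes=1,
        max_worker_nodes=2,
    )
    ray.init()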

If you allocate all available workers, the Databricks Spark driver appears to have no resources left, causing it to get stuck.

To avoid this, I use the following approach:

ray_worker_count = max(1, int(total_workers * 0.75))

For instance, when my driver has 3 workers assigned, I set max_worker_nodes to 2, which allows everything to function properly.
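
Putting it together, a rough sketch (total_workers is hardcoded here as a placeholder for your cluster’s worker count; look it up from your own cluster configuration instead):

    from ray.util.spark import setup_ray_cluster, shutdown_ray_cluster
    import ray

    # Number of workers in the Databricks cluster; replace with your own value.
    total_workers = 3

    # Give Ray roughly 75% of the workers, keeping the rest free for Spark jobs.
    ray_worker_count = max(1, int(total_workers * 0.75))

    setup_ray_cluster(max_worker_nodes=ray_worker_count)  # 2 when total_workers == 3
    ray.init()

    # ... Ray workloads ...

    shutdown_ray_cluster()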