Little speedup from 100 to 300 cores

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi Ray team, I’m experiencing a scalability issue when testing my cluster with more nodes. Here’s a bit of context.

  • ray version: 1.9
  • workload: 300 tasks, each of which takes about a minute to complete on a single core
  • cluster: around 300 cores running on a k8s platform. Each node is configured with 15 cores. Before I run my test, all nodes are brought up and all 300 cores are available.
  • execution pattern: I use a very simple parallel map pattern that sends out all 300 tasks at once and waits for them all:
def price(...):
    ...
    # Fan out one remote task per batch, then block until every result is back.
    pricing_tasks = list(
        map(lambda batch: price_remote.remote(batch), pricing_batches)
    )
    return ray.get(pricing_tasks)
  • performance
    • 100 cores:
      • in this case the timing adds up as expected: 300 tasks (each taking about 1 min on a single core) finish in less than 3 min total.
      • in addition, from the Ray Dashboard, I see all nodes become fully utilized quickly.
    • 300 cores:
      • when I increase the cluster to 300 cores, I expect the total time to be around 1 min. However, it takes more than 2 min to finish.
      • in this case, from the Ray Dashboard, I see tasks are dispatched to workers much more slowly than in the 100-core case, and at most around 200 cores are in use at the same time.
      • I also tried capturing timeline stats by setting RAY_PROFILING=1, and the timeline tells a similar story to what I observed on the dashboard (a sketch of how I dump the timeline follows this list)

        Notice that some workers are scheduled much later than others, even though I submit all tasks at the same time and the cluster has enough cores.
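
For reference, this is roughly how I capture the timeline (a minimal sketch; the output path is arbitrary, and RAY_PROFILING=1 is set in the environment before the cluster starts):

import ray

# Connect to the existing cluster (the address here is an assumption for this sketch).
ray.init(address="auto")

# ... run the pricing workload (the price() call shown above) ...

# Dump the recorded profiling events in Chrome tracing format;
# open the file in chrome://tracing to see when each worker actually started.
ray.timeline(filename="/tmp/ray_timeline.json")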

I also tried “gang scheduling” with placement groups, following the instructions from this page (I’m using Ray 1.9): Placement Group Examples — Ray v1.9.2. A sketch of what I tried is below, but it doesn’t make much difference.
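
Roughly what I tried (a minimal sketch from memory; the exact bundle shape and strategy may have differed, but the idea was one 1-CPU bundle per task, reserved up front):

import ray
from ray.util.placement_group import placement_group

# Reserve one 1-CPU bundle per task, spread across nodes on a best-effort basis.
pg = placement_group([{"CPU": 1}] * 300, strategy="SPREAD")
ray.get(pg.ready())  # block until all bundles are reserved

# Pin each pricing task to the placement group (Ray 1.9 .options() API).
pricing_tasks = [
    price_remote.options(placement_group=pg).remote(batch)
    for batch in pricing_batches
]
results = ray.get(pricing_tasks)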

We found this issue during our final regression test before our first go-live with Ray, and it’s now a blocker. I would really appreciate it if anyone could provide suggestions. I’m glad to provide any further information if needed.

Thanks,
-BS

Thanks for all of the detailed information! This definitely should not be happening and I believe you’ve unfortunately triggered a bug in Ray core. I went ahead and filed a GitHub issue on your behalf where we can continue the conversation.

First, could you rerun your workload on the latest Ray version (v1.13) to verify whether the issue still occurs, if possible? Please let us know what you find on GitHub.

In the meantime, once you’ve upgraded to 1.13, you can also try a temporary workaround: force tasks to be spread evenly (round-robin) across your cluster with the SPREAD scheduling strategy. Here’s an example you can use:

import ray

@ray.remote(scheduling_strategy="SPREAD")
def spread_function():
    return 2

# Spread tasks across the cluster.
ray.get([spread_function.remote() for _ in range(100)])
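
For your workload specifically, you shouldn’t need to redefine your task; as a sketch (reusing price_remote and pricing_batches from your snippet), you can also pass the strategy per call with .options():

# Apply the SPREAD strategy per call instead of in the @ray.remote decorator.
pricing_tasks = [
    price_remote.options(scheduling_strategy="SPREAD").remote(batch)
    for batch in pricing_batches
]
results = ray.get(pricing_tasks)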

Thanks Stephanie. I will try upgrading to 1.13, but unfortunately it requires a bit of code change. I’ll post the results as soon as we have them.

The workaround works for now: we upgraded to 1.13 and used the SPREAD scheduling policy. Let me know if you need any more information to identify the root cause.

Great to hear that, @bishao84!

The Ray team will also try to reproduce your original issue on 1.13, but in case you’ve already tried it: did you see the same scheduling issue on 1.13 without the SPREAD workaround?