How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi Ray team, I’m experiencing a scalability issue when testing my cluster with more nodes. Here’s a bit of context.
- ray version: 1.9
- workload: 300 tasks, each of which takes about a minute to complete on a single core
- cluster: around 300 cores running on a k8s platform. Each node is configured with 15 cores. Before I run my test, all nodes are brought up and all 300 cores are available.
- execution pattern: I use a very simple parallel map pattern that sends out all 300 tasks at once and waits for them all.
```python
def price(...):
    ...
    pricing_tasks = list(
        map(lambda batch: price_remote.remote(batch), pricing_batches)
    )
    return ray.get(pricing_tasks)
```
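For reference, a minimal self-contained version of this pattern looks roughly like the sketch below (the `time.sleep(60)` is just a stand-in for the real per-batch pricing work, and the batch contents are placeholders; only the fan-out/`ray.get` structure matches my real code):

```python
import time
import ray

ray.init(address="auto")  # connect to the running cluster

@ray.remote(num_cpus=1)
def price_remote(batch):
    # stand-in for ~1 minute of single-core pricing work
    time.sleep(60)
    return len(batch)

# 300 batches -> 300 tasks, all submitted at once
pricing_batches = [[i] for i in range(300)]
pricing_tasks = [price_remote.remote(batch) for batch in pricing_batches]
results = ray.get(pricing_tasks)
```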
- 100 cores:
- in this case the timing adds up roughly as expected: the 300 tasks (each ~1 min on a single core) take less than 3 min in total to finish.
- in addition, from the Ray Dashboard I see all nodes become fully occupied quickly.
- 300 cores:
- when I increase the cluster to 300 cores, I expect the total time to be around 1 min. However, it takes more than 2 min to finish.
- In this case, from the Ray Dashboard I see tasks being dispatched to workers much more slowly than in the 100-core case, and at peak only around 200 cores are in use at the same time.
- I also tried collecting timeline stats by setting RAY_PROFILING=1, and the timeline tells a similar story to what I observed on the dashboard.
Notice that some workers are scheduled much later than others, even though I submit all tasks at the same time and the cluster has enough cores.
- 100 cores: [timeline screenshot for comparison]
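For completeness, this is roughly how I dump the timeline (RAY_PROFILING=1 is set before the cluster starts; the filename is just an example):

```python
import ray

ray.init(address="auto")

# ... run the pricing workload ...

# dump a Chrome-trace-format timeline, viewable in chrome://tracing
ray.timeline(filename="/tmp/ray-timeline.json")
```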
I also tried “gang scheduling” with a placement group, following the instructions on this page (I’m using Ray 1.9): Placement Group Examples — Ray v1.9.2,
but it doesn’t make much difference.
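Roughly what I tried looks like the sketch below (the bundle sizes and the STRICT_SPREAD strategy reflect my experiment; `price_remote` and `pricing_batches` are the same placeholder names as above):

```python
import ray
from ray.util.placement_group import placement_group

ray.init(address="auto")

# one 15-CPU bundle per node, 20 nodes -> 300 CPUs reserved up front
pg = placement_group([{"CPU": 15}] * 20, strategy="STRICT_SPREAD")
ray.get(pg.ready())  # wait until all bundles are placed

pricing_tasks = [
    price_remote.options(
        placement_group=pg,
        placement_group_bundle_index=i % 20,  # spread tasks across bundles
    ).remote(batch)
    for i, batch in enumerate(pricing_batches)
]
results = ray.get(pricing_tasks)
```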
We found this issue during our final regression test before our first go-live with Ray, and it’s now a blocker. I would really appreciate it if anyone could offer suggestions, and I’m glad to provide further information if needed.