Speeding up the Ray scheduler

Hello everybody!

I have a simple question regarding the Ray scheduler.

A bit of context: my workload consists of many fine-grained tasks, typically 500-600, sometimes 1000, and ideally even more. I host the Ray cluster on Kubernetes, and my autoscaler uses AWS Karpenter to request resources. That part is very fast: I go from 1 CPU to 200-300 or more CPUs quickly. However, once the worker nodes are running, the Ray scheduler seems to trail behind; it takes quite a bit of time to fill the available CPUs with tasks, even though there are plenty of tasks pending. The problem is that this impedes further scaling up, and ideally I would like many more tasks running concurrently, faster than with my current setup.
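For reference, a simplified sketch of how the work is submitted (`process_item` is just a placeholder for my real task, which I've left out here):

```python
import ray

ray.init(address="auto")  # connect to the existing Ray cluster on Kubernetes

# Placeholder for my actual fine-grained task; each one is small and short-lived.
@ray.remote(num_cpus=1)
def process_item(item):
    # ... the real per-item work happens here ...
    return item

# Submit on the order of 500-1000+ tasks at once and wait for all of them.
refs = [process_item.remote(i) for i in range(1000)]
results = ray.get(refs)
```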

My main pain point is getting the Ray scheduler to utilize the available resources faster. Is there, for example, a way to allocate more resources to the scheduler itself? I couldn't find anything like that in the forums or the documentation. I'd be happy to hear how others have solved similar issues.
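To make the "trailing behind" part more concrete, a check roughly like the one below is what shows the gap: the total CPU count jumps up almost immediately after Karpenter adds nodes, but the number of busy CPUs climbs much more slowly, even with plenty of tasks queued (this is a simplified sketch, not my exact monitoring code):

```python
import time
import ray

ray.init(address="auto")

# Watch how quickly the scheduler actually occupies the CPUs the new nodes provide.
for _ in range(30):
    total = ray.cluster_resources().get("CPU", 0)
    free = ray.available_resources().get("CPU", 0)
    print(f"busy CPUs: {total - free:.0f} / {total:.0f}")
    time.sleep(5)
```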