Making the Ray scheduler pack workloads onto nodes

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

We use Ray Serve to serve LLMs, with one Serve deployment per model. Each deployment may require a different number of GPUs (8, 4, or 1). When the small models get spread across the nodes, the deployments of the big models cannot find any single node with enough free GPUs. (We don’t use distributed inference because the network overhead was observed to hurt performance.)

So is there a way to tweak Ray’s scheduling so that it tries to place new deployments (or actors) on nodes that are already in use, i.e. to fill up busy nodes first rather than spreading workloads across the cluster? (I’m asking in the Ray Core category since this concerns the scheduler itself.)