How severely does this issue affect your experience of using Ray?
- Medium: It causes significant difficulty in completing my task, but I can work around it.
Currently, I’m spinning up a Ray Tune job that trains and tunes a very large model with concurrency set to 3. A single training run of the model takes a huge amount of memory, so each worker node is allocated 350 GB of memory. The head node is allocated only 8 GB of memory and 8 CPUs.
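For reference, a minimal sketch of this kind of setup, assuming a function trainable `train_fn` and a `param_space` that are both hypothetical placeholders. It attaches a per-trial resource request with `tune.with_resources` and caps concurrency with `tune.TuneConfig(max_concurrent_trials=3)`. The 8 CPUs and 200 GiB per trial are assumed numbers: the `memory` value is in bytes and has to fit inside the `memory` resource that `ray status` reports for one worker node, which is typically smaller than the node's physical RAM because part of it goes to the object store.

```python
GiB = 1024 ** 3

# Assumed per-trial request; adjust to the model's real footprint. The memory
# value (in bytes) must not exceed the "memory" resource Ray reports for one
# worker node (see `ray status`). The 8 GB head node can never satisfy it.
TRIAL_RESOURCES = {"cpu": 8, "memory": 200 * GiB}
MAX_CONCURRENT_TRIALS = 3


def build_tuner(train_fn, param_space):
    """Wrap a (hypothetical) function trainable with a per-trial resource request."""
    # Ray is imported inside the function so the constants above can be
    # inspected without a Ray installation.
    from ray import tune

    trainable = tune.with_resources(train_fn, TRIAL_RESOURCES)
    return tune.Tuner(
        trainable,
        param_space=param_space,
        tune_config=tune.TuneConfig(max_concurrent_trials=MAX_CONCURRENT_TRIALS),
    )
```

The `memory` entry is a scheduling reservation, not an enforced limit: Ray will not place a trial on a node without that much available `memory` resource, so the head node is excluded, but nothing caps what the trial actually allocates once it is running.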
For some reason, Ray is scheduling tasks or actors that use a lot of memory on the head node, while several worker nodes’ memory remains mostly unused.
I tried to force them onto the worker nodes by setting the CPU count to 0, but that didn’t seem to work: memory usage on the head node still exploded.
I saw the Resources page in the Ray 2.5.1 documentation, which could definitely work for me, but I’m not sure how to apply it to Ray Tune. Where should I pass those resource specifications?
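One way the custom resources described on that page might be applied to Tune, as a sketch under stated assumptions: each worker node is started with `ray start --address=<head-address> --resources='{"worker_node": 1}'` and the head node is not, so only workers advertise the resource. The name `worker_node`, the 8-CPU bundle, and `train_fn`/`param_space` are all hypothetical. Each trial requests one unit of that resource through a `tune.PlacementGroupFactory` passed to `tune.with_resources`, which should keep trials off the head node and limit each worker node to one trial at a time.

```python
# Hypothetical custom resource name. Each worker node is assumed to be started with
#   ray start --address=<head-address> --resources='{"worker_node": 1}'
# while the head node is started without it.
WORKER_RESOURCE = "worker_node"

# One bundle per trial: 8 CPUs (assumed) plus one unit of the worker-only resource.
# Only worker nodes advertise WORKER_RESOURCE, so the head node cannot host a
# trial, and a node advertising 1 unit hosts at most one trial at a time.
TRIAL_BUNDLE = {"CPU": 8, WORKER_RESOURCE: 1}


def build_tuner(train_fn, param_space):
    """Pin every trial of a (hypothetical) function trainable to a worker node."""
    # Ray is imported inside the function so the bundle above can be
    # inspected without a Ray installation.
    from ray import tune

    trainable = tune.with_resources(
        train_fn, tune.PlacementGroupFactory([TRIAL_BUNDLE])
    )
    return tune.Tuner(
        trainable,
        param_space=param_space,
        tune_config=tune.TuneConfig(max_concurrent_trials=3),
    )
```

This only controls where the trials run; the driver script that calls `.fit()` on the tuner still runs on whichever node launched it.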