1. Severity of the issue: (select one)
- Medium: Significantly affects my productivity but can find a workaround.
- High: Completely blocks me.
2. Environment:
- Ray version: 2.48.0
- Python version: 3.12.11
- OS: linux
- Other libs/tools (if relevant): docker
**Hi there.**
I am currently trying to set up a test/PoC cluster on a local machine. We have a bunch of containers, each with its own diverging environment and running its own code, all on the same host machine.
Currently we run a local single-node cluster in each and every container, just so we can work on our big datasets.
This obviously isn't great: we can't run tasks from multiple containers in parallel, because one Ray instance doesn't know about the others, and we keep running into resource issues.
So I want to switch to a single Ray cluster where every container is a Ray worker node, still on the same machine (for now).
Starting the head node and connecting the containers as worker nodes works fine. But every worker obviously reports the full number of CPU cores to the cluster, so the sum of all those is far greater than what the host machine can actually provide.
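To make that concrete, this is roughly what I see when I query the cluster (a minimal sketch; the numbers assume a 32-core host with four containers attached, just as an example):

```python
import ray

# Connect to the running cluster from inside any container.
ray.init(address="auto")

# Each container worker reports the full 32 host cores, so the cluster
# believes it has 4 * 32 = 128 CPUs even though only 32 physically exist.
print(ray.cluster_resources())      # e.g. {'CPU': 128.0, ...}

# Per-node view: every entry shows the same 32 CPUs of the same host.
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Resources"].get("CPU"))
```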
I could limit the actual hardware available to the containers, or just declare fewer cores on each worker (e.g. via `ray start --num-cpus`), but I would like every worker to still be able to use all the cores if they are available. Limiting worker resources directly would leave cores idle when only one container is doing work.
I managed to solve this using placement groups. Setting a placement group's resources to the actually available cores, and then scheduling all remotes into this group, effectively limits the cluster to only use as many resources as I want it to, without limiting any single worker when it is working alone.
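For reference, this is roughly what I'm doing (a simplified sketch; the bundle sizes and the 32-core total are just examples):

```python
import ray
from ray.util.placement_group import placement_group
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy

ray.init(address="auto")

# Bundles adding up to the 32 cores the host really has, spread over the
# worker nodes, even though the workers together report far more CPUs.
pg = placement_group([{"CPU": 8}] * 4, strategy="SPREAD")
ray.get(pg.ready())

@ray.remote(num_cpus=1)
def work(x):
    return x * x

# Every task is scheduled inside the placement group, so the cluster as a
# whole never uses more than the 32 CPUs reserved in the bundles.
futures = [
    work.options(
        scheduling_strategy=PlacementGroupSchedulingStrategy(placement_group=pg)
    ).remote(i)
    for i in range(100)
]
print(ray.get(futures))
```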
But now I can't seem to find a way to use custom resource requirements to guide tasks to specific workers. Without placement groups this was easy, but now I can't figure out how that would work. Also, every combination of resources needs its own bundle, so this stops working completely when we don't define CPU resources.
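This is the kind of pattern that worked before placement groups (a sketch; the resource name `env_imageproc` is just a made-up example for one of our containers):

```python
import ray

# Each container's worker node is started with its own marker resource, e.g.
#   ray start --address=<head-ip>:6379 --resources='{"env_imageproc": 1}'
# (the resource name is only an example).
ray.init(address="auto")

@ray.remote(num_cpus=2, resources={"env_imageproc": 0.001})
def task_for_that_container():
    # Runs only on the worker node that advertises "env_imageproc".
    ...
```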
So is there a solution to this weird issue?
- Is it somehow possible to set cluster-global resource limits?
  This would be the easiest solution: if I could tell the cluster "no matter how many CPU cores have been reported, never use more than 32 in total", I could use all the other resource mechanisms the way they are meant to be used.
- Is it possible to use placement groups and node affinity at the same time?
  As a workaround, so I can limit global resources (and not try to use the same core multiple times) while still having control over the environment a task runs in. (See the node-affinity sketch after this list.)
- Can I set up the cluster so every task is executed by the same worker node it was submitted from?
  That would make no sense on a real distributed system, but it would solve all my issues in this weird single-host, multi-container setup.
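To show what I mean by the node-affinity part, here is the half I do know how to do (a minimal sketch; combining it with a placement group is exactly the part I'm unsure about):

```python
import ray
from ray.util.scheduling_strategies import NodeAffinitySchedulingStrategy

ray.init(address="auto")

# Pin a task to one specific worker node via its node ID; here I simply use
# the node this driver runs on as an example.
node_id = ray.get_runtime_context().get_node_id()

@ray.remote(num_cpus=1)
def pinned():
    import socket
    return socket.gethostname()

ref = pinned.options(
    scheduling_strategy=NodeAffinitySchedulingStrategy(node_id=node_id, soft=False)
).remote()
print(ray.get(ref))
```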
Also, for future use cases where we use VMs or multiple physical machines, it might still be necessary to run multiple worker nodes on the same machine with different environments (especially Docker containers, which go beyond simple package requirements). It would be neat if it were possible to set maximum resources for a group of worker nodes.
I would love to set this up as a test and figure it all out before actually building nodes or going into the cloud. It would also already enable our local setup to deal with resources efficiently.
I hope someone can help me out here!
Greetings, Marcel