How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I’m attempting to perform some large-scale training runs using RLlib’s client-server model. I have an external environment for which I’ve created an API and hooked it into RLlib’s MultiAgentEnv. I run this as the client. Then I have a server that looks much like the one from the CartPole server example.
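For context, each client process drives its environment through RLlib's PolicyClient. A minimal sketch of the per-episode loop (the `run_episode` helper name is mine; the client methods are the documented PolicyClient API, and the env follows the classic gym step/reset interface):

```python
def run_episode(client, env):
    """Roll out one episode, streaming experiences to the policy server.

    `client` follows RLlib's PolicyClient interface (start_episode /
    get_action / log_returns / end_episode); `env` is a gym-style env.
    """
    episode_id = client.start_episode(training_enabled=True)
    obs = env.reset()
    done = False
    total_reward = 0.0
    while not done:
        # Ask the server (or a local copy of the policy) for an action.
        action = client.get_action(episode_id, obs)
        obs, reward, done, _info = env.step(action)
        # Report the reward so the server can build training batches.
        client.log_returns(episode_id, reward)
        total_reward += reward
    client.end_episode(episode_id, obs)
    return total_reward

# Usage against a real server (requires ray[rllib] and a running policy
# server; the address and port below are placeholders):
#
#   from ray.rllib.env.policy_client import PolicyClient
#   import gym
#   client = PolicyClient("http://localhost:9900", inference_mode="remote")
#   while True:
#       run_episode(client, gym.make("CartPole-v0"))
```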
I have access to many compute nodes, and I need to run many instances of my environment simultaneously to meet my training-time goals. So I potentially need to launch hundreds to thousands of instances of my environment across dozens of compute nodes and have them send data to the server node for training.
As a first test, I’m just running the CartPole client-server example, with both client and server on the same compute node. This node has 72 CPUs. I have created a bash script that launches N client processes, each of which creates the CartPole game and generates data. I run the server script directly from the terminal with --num-workers N. In my understanding, every client that generates data needs a worker listening on that client’s port. When I run with 72 workers, everything works out: I can start the server with no errors, and my bash script launches 72 copies of the CartPole client. I’ve added some debugging output to confirm that they are connecting, and it all works.
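That 1-1 mapping comes from how the server example assigns ports: each rollout worker opens its own listener on a distinct port derived from its worker index. A sketch of the arithmetic (assuming base port 9900 as in the CartPole example; the `worker_port` helper name is mine):

```python
BASE_PORT = 9900  # the base port used in the CartPole server example

def worker_port(worker_index, base_port=BASE_PORT):
    """Port that rollout worker `worker_index` (1-based) listens on.

    Mirrors the pattern in the server example, where each worker's
    input connector is opened on base_port + worker_index - 1.
    """
    return base_port + worker_index - 1

# With --num-workers 72, workers 1..72 occupy ports 9900..9971, so each
# client process must be pointed at exactly one worker's port.
ports = [worker_port(i) for i in range(1, 73)]
```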
Now, I attempt to go to 73 workers, which is one more worker than the number of CPUs I have available. I start by launching the server, and I get the following error:
(scheduler +48s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
I do not launch any clients because this error is enough to stop me. It looks like I can only create as many workers as there are CPUs, so I am capped at 72 workers, and therefore at 72 instances of my environment. That wouldn’t be so bad if I only had a single compute node, but I have access to hundreds, and I would like to be able to launch thousands of instances of my environment.
Questions:
- Is there a better design than having a single server handle the training?
- Is there a way for multiple clients to send data to the same worker on the same port? For example, suppose there were three clients sending data to a worker on port 9900. Is this possible in RLlib, or does there have to be a direct 1-1 match-up between workers and clients?
- Can I add an intermediary layer that gathers up the data from multiple clients and sends the batch to a single worker?
- I know that RLlib has num_envs_per_worker. Is there something like this for client-server mode?
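For reference, in normal (non-client-server) mode that knob is just an entry in the Trainer config; a fragment for this 72-CPU node (the values are illustrative, not a recommendation):

```python
config = {
    "num_workers": 72,         # one rollout worker per CPU on this node
    "num_envs_per_worker": 8,  # each worker vectorizes 8 env copies,
                               # giving 72 * 8 = 576 envs without 576 workers
}
```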
When I say “can I do this,” of course the answer is yes, because all it takes is some creativity. What I’m really asking is: “Does RLlib support this off the shelf, and if not, how hard is it to implement?”
Thanks!