Parallel workaround with client server

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I’m attempting to perform some large scale training runs using RLlib’s client-server model. I have an external environment for which I’ve created an API and hooked it into RLlib’s MultiAgentEnv. I run this as the client. Then I have a sever that pretty much looks just like the one from the CartPole server example.

I have access to a lot of compute nodes and I need to run a lot of instances of my environment at the same time in order to achieve the training time goals. So I potentially need to launch 100s-1000s of instances of my environment across dozens of compute nodes and have them send data to the server node for training.

As a first test, I’m just running the CartPole client-server example, where both client and server are running on the same compute node. This is a node with 72 CPUs. I have created a bash script that can launch N client processes, each of which creates the CartPole game and generates data. I run the server script directly from terminal with --num-workers N. In my understanding, for every client that generates data, I need to have a worker listening on that client’s port. When I run with 72 workers, everything works out. I can start the server with no errors and I can run my bash script which launches 72 copies of the CartPole client. I’ve created some debugging output to ensure that they are connecting, and it all works.

Now, I attempt to go to 73 workers, which is one more worker than the number of CPUs I have available. I start by launching the server, and I get the following error:

(scheduler +48s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.

I do not launch any clients because this error is enough to stop me. It looks like I can only create as many workers as CPUs, so I can only create 72 workers, and therefore I can only scale up to 72 instances of my environment. That’s not so bad if I only had a single compute node, but I have access to hundreds, and I would like to be able to launch 1000s of instances of my environment.

Questions:

  1. Is there a better design than to have a single server handling the training?
  2. Is there a way for multiple clients to send data to the same worker on the same port? For example, suppose there were three clients sending data to a worker on port 9900. Is this possible in RLlib, or does there have to be a direct 1-1 match up between workers and clients.
  3. Can I add an intermediary layer that gathers up the the data from multiple clients and sends the batch to a single worker?
  4. I know that RLlib has num_envs_per_worker. Is there something like this for client-server mode?

When I say “can I do this,” of course the answer is yes because all it takes is some creativity. But what I’m really asking is “Does RLlib support this off the shelf, and if not, how hard is it to implement it?”

Thanks!

Pinging this in case it has been glanced over. Would love some guidance here. Thanks!

Hi @rusu24edward ,

This is how it is done in the example. I have not verified this yet but I would think that since the PolicyServerInput class really just implements an HTTP server with a REST API that you could have multiple clients send trajectories to that one server. I suppose it could break because the rollout worker cannot support multiple active envs but is that really different than setting num_envs_per_worker > 1?.

First I would suggest checking if that is true. Can you start the server with num_workers=0 and then start 10 clients that all send data to the same ip:port. If you try this and it works please let us know.

If that is true then the second consideration is that you might run into scaling issues with just one input server if you are really launching 1000s of clients. In that case the next step would be to create 1 PolicyServerInput servers for every n number of clients and have a way to assign a client to a port. Personally, I would benchmark the throughput of 1 PolicyServerInput before scaling up.

1 Like

Hi Manny, thanks for the suggestion. I am able to run multiple clients per trainer worker, and that should help me make progress.

Follow up question, how does RLlib handle num_envs_per_worker?