Collect samples on a remote server train on local

Hi there,

I have a few large cpu servers available to use. My local machine has a 4 cores and a good gpu. It does a good enough job of training, but is severely bottle necked by the lack of cpus to generate samples on. The cpu servers can generate samples 10x faster which is useful for running analysis and experiments on later.

I was wondering how simple it would be to have my cpu servers generate samples to send back to the local machine to be trained on and if there are any examples to work from in the documentation. I’m not really familiar with how ray works as I’ve only used to get access to rllib



Could you ask on the Ray core forum here, how to set up a ray cluster using your local GPU machine as the head node and your CPU servers as the worker nodes? Then you can connect in your RLlib (driver) script to that server via ray.init(...) and RLlib should automatically utilize your CPU servers for doing the sample collection work and your GPU for the training updates.


Hi @sven1977, I’ve managed to get a cluster working on my local network.

I gave it a try with PPO and the sample throughput has halved, despite adding 8 more cores to the cluster. I don’t know why. The performance of A3C is also significantly less with the extra machine. On the second node (it doesn’t have a gpu) the cpu utilisation is pretty low. Is it trying to do the inference on the head node’s GPU?

Any ideas?

Hmm, strange. No PPO does not do inference on the local worker (head node). A3C is a little different, but also would do inference on the workers (as well as gradient calculation, unlike PPO).

  • Could you post your config?
  • What env setup do you use? External env (via client/server setup) or “normal”?

Hi @sven1977,

Here’s my ppo config. It’s the same config I use when training on just one machine. All I have done is start ray on the two machines and then on the head node started the script with the number of workers matching the new number of cpus available

config = {
    "algorithm": "PPO",
    "env": "yaniv",
    "env_config": env_config,
    "framework": "torch",
    "num_gpus": args.num_gpus,
    "num_workers": args.num_workers,
    "num_envs_per_worker": 1,
    "num_cpus_per_worker": 1,
    "num_cpus_for_driver": 1,
    "multiagent": {
        "policies": {
            "policy_1": (None, obs_space, act_space, {}),
            "policy_2": (None, obs_space, act_space, {}),
            "policy_3": (None, obs_space, act_space, {}),
            "policy_4": (None, obs_space, act_space, {}),
        "policy_mapping_fn": policy_mapping_fn,
        "policies_to_train": ["policy_1"],
    "callbacks": YanivCallbacks,
    # "log_level": "INFO",
    "evaluation_num_workers": 0,
    "evaluation_config": {"explore": False},
    "evaluation_interval": args.eval_every,
    "custom_eval_function": make_eval_func(env_config, args.eval_num),
    # hyper params
    "model": {
        "custom_model": "yaniv_mask",
        "fcnet_hiddens": [512, 512],
    "batch_mode": "complete_episodes",
    # A3c
    # "rollout_fragment_length": 50,
    # "train_batch_size": 500,
    # "min_iter_time_s": 10,
    # ppo
    "sgd_minibatch_size": 2048,
    "train_batch_size": 65536,
    "rollout_fragment_length": 100

I think I’m using a normal env set up. It’s just a regular custom multiagent env

I see, so some thoughts:

  • Now that you are running on a cluster, all data from the workers (on your CPU machines) needs to be transferred through the network to your driver (the GPU machine). Your PPO batch size (and sgd minibatch size) seem extremely large. Remember that PPO goes num_sgd_iter times through your entire train_batch_size, splitting it up into sgd_minibatch_size chunks and calculating the PPO’s loss on these.
    Could you check the individual stats and post them here? Like
  • On A3C, having a central GPU machine actually does not make too much sense. Reason: The gradient calculations for this algo happen on the workers. A3C does single local (worker) rollouts, the same worker that did the rollout then calculates the gradients via PG, then these gradients are sent to the central learner to be averaged (in async between all the workers) and applied to the central weights, then the central weights are broadcast back to all workers. So you don’t really use the central GPU, but having many fast CPUs on the workers definitely helps this algo.