Collect samples on remote servers, train on local machine

Hi there,

I have a few large CPU servers available to use. My local machine has 4 cores and a good GPU. It does a good enough job of training, but it is severely bottlenecked by the lack of CPUs to generate samples with. The CPU servers can generate samples about 10x faster, which is useful for the analysis and experiments I run later.

I was wondering how simple it would be to have my CPU servers generate samples and send them back to the local machine to be trained on, and whether there are any examples in the documentation to work from. I’m not really familiar with how Ray works, as I’ve only used it to get access to RLlib.

Cheers,

Rory

Could you ask on the Ray core forum how to set up a Ray cluster using your local GPU machine as the head node and your CPU servers as the worker nodes? Then you can connect your RLlib (driver) script to that cluster via ray.init(...), and RLlib should automatically utilize your CPU servers for the sample collection work and your GPU for the training updates.
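
Something like this, roughly (the IP and port below are just placeholders for your own network):

# On the GPU machine (the head node):
#   ray start --head
# Ray prints the address (e.g. <head-node-ip>:6379) to connect the workers to.
#
# On each CPU server (worker node):
#   ray start --address=<head-node-ip>:6379
#
# Then, in the RLlib driver script run on the head node:
import ray

# Connect to the already running cluster instead of starting a new local Ray instance.
ray.init(address="auto")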


Hi @sven1977, I’ve managed to get a cluster working on my local network.

I gave it a try with PPO and the sample throughput has halved, despite adding 8 more cores to the cluster, and I don’t know why. A3C’s performance is also significantly worse with the extra machine. On the second node (which doesn’t have a GPU) the CPU utilisation is pretty low. Is it trying to do the inference on the head node’s GPU?

Any ideas?

Hmm, strange. No, PPO does not do inference on the local worker (head node). A3C is a little different, but it also does inference on the workers (and, unlike PPO, gradient calculation as well).

  • Could you post your config?
  • What env setup do you use? External env (via client/server setup) or “normal”?

Hi @sven1977,

Here’s my PPO config. It’s the same config I use when training on just one machine. All I have done is start Ray on the two machines and then, on the head node, start the script with the number of workers matching the new number of CPUs available:

config = {
    "algorithm": "PPO",
    "env": "yaniv",
    "env_config": env_config,
    "framework": "torch",
    "num_gpus": args.num_gpus,
    "num_workers": args.num_workers,
    "num_envs_per_worker": 1,
    "num_cpus_per_worker": 1,
    "num_cpus_for_driver": 1,
    "multiagent": {
        "policies": {
            "policy_1": (None, obs_space, act_space, {}),
            "policy_2": (None, obs_space, act_space, {}),
            "policy_3": (None, obs_space, act_space, {}),
            "policy_4": (None, obs_space, act_space, {}),
        },
        "policy_mapping_fn": policy_mapping_fn,
        "policies_to_train": ["policy_1"],
    },
    "callbacks": YanivCallbacks,
    # "log_level": "INFO",
    "evaluation_num_workers": 0,
    "evaluation_config": {"explore": False},
    "evaluation_interval": args.eval_every,
    "custom_eval_function": make_eval_func(env_config, args.eval_num),
    
    # hyper params
    "model": {
        "custom_model": "yaniv_mask",
        "fcnet_hiddens": [512, 512],
    },
    "batch_mode": "complete_episodes",
    # A3c
    # "rollout_fragment_length": 50,
    # "train_batch_size": 500,
    # "min_iter_time_s": 10,
    
    # ppo
    "sgd_minibatch_size": 2048,
    "train_batch_size": 65536,
    "rollout_fragment_length": 100
}

I think I’m using a normal env setup. It’s just a regular custom multi-agent env.
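
In case it helps, this is roughly what I mean by “normal” (the env class and the mapping logic below are simplified placeholders for my actual code):

from ray.tune.registry import register_env

# YanivEnv is my custom MultiAgentEnv subclass (placeholder name here).
register_env("yaniv", lambda env_config: YanivEnv(env_config))

# Placeholder mapping logic; the real function assigns agents to policy_1..policy_4.
def policy_mapping_fn(agent_id):
    return "policy_1"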

I see, so some thoughts:

  • Now that you are running on a cluster, all data from the workers (on your CPU machines) needs to be transferred over the network to your driver (the GPU machine). Your PPO train batch size (and SGD minibatch size) seem extremely large. Remember that PPO goes num_sgd_iter times through your entire train_batch_size, splitting it up into sgd_minibatch_size chunks and calculating the PPO loss on each of these (see the sketch after this list).
    Could you check the individual stats and post them here? E.g.:
            "mean_raw_obs_processing_ms",
            "mean_inference_ms",
            "mean_action_processing_ms",
            "mean_env_wait_ms",
            "mean_env_render_ms"
  • On A3C, having a central GPU machine actually does not make too much sense. Reason: the gradient calculations for this algo happen on the workers. A3C does a local rollout on a worker; the same worker that did the rollout then calculates the gradients via PG; these gradients are sent to the central learner to be averaged (asynchronously across all the workers) and applied to the central weights; the central weights are then broadcast back to all workers. So you don’t really use the central GPU, but having many fast CPUs on the workers definitely helps this algo.
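
To make the batch-size point concrete, here is a rough back-of-the-envelope sketch using the posted config values (num_sgd_iter is assumed to be at its default of 30, since it is not set in the config):

# Per training iteration, with the posted PPO config:
train_batch_size = 65536     # timesteps collected (and shipped over the network) per iteration
sgd_minibatch_size = 2048    # size of each SGD chunk
num_sgd_iter = 30            # assumed RLlib default; not set in the posted config

minibatches_per_epoch = train_batch_size // sgd_minibatch_size   # 32 chunks per pass
total_minibatch_updates = minibatches_per_epoch * num_sgd_iter   # 960 updates per iteration

print(minibatches_per_epoch, total_minibatch_updates)

# All 65536 timesteps have to cross the network from the CPU workers to the GPU
# driver before any of those updates can start, and sampling sits idle while the
# driver works through them. The timing stats listed above should appear in the
# training results, e.g. under result["sampler_perf"] (location may vary by version).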