Very slow run and only NaN results when using a cluster of 64 CPUs

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

Hello, I am using the following code, which works fine on my laptop under both Linux and Windows. But when I try to run it on a cluster of 64 CPUs, it is extremely slow compared to my laptop and it reports only NaN as the episode reward, with episodes_this_iter=0 for every iteration. On my laptop I have been using num_cpus_for_local_worker=10 and num_rollout_workers=2. I also get the warning:

The maximum number of pending trials has been automatically set to the number of available cluster CPUs, which is high (140 CPUs/pending trials). If you’re running an experiment with a large number of trials, this could lead to scheduling overhead. In this case, consider setting the TUNE_MAX_PENDING_TRIALS_PG environment variable to the desired maximum number of concurrent trials.

even though the cluster I use has 64 CPUs (so is Ray seeing 140?). I tried setting TUNE_MAX_PENDING_TRIALS_PG to 1 and to 30, but it made no difference.
I use ray 3.0.0.dev0, because it is the only version that currently supports waterworld_v4.

from ray import air, tune
from ray.tune.registry import register_env
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.wrappers.pettingzoo_env import PettingZooEnv
from pettingzoo.sisl import waterworld_v4


if __name__ == "__main__":
    # RDQN - Rainbow DQN
    # ADQN - Apex DQN
    for i in range(9,11):
        def env_creator(args):
            return PettingZooEnv(waterworld_v4.env(n_pursuers=5, n_evaders=5))

        env = env_creator({})
        register_env("waterworld", env_creator)

        obs_space = env.observation_space
        act_spc = env.action_space

        # A single policy, shared by all agents via the policy_mapping_fn below.
        policies = {"shared_1": (None, obs_space, act_spc, {})
                    # "shared_2": (None, obs_space, act_spc, {})
                    # "pursuer_5": (None, obs_space, act_spc, {})
                    }

        config = (
            PPOConfig()
            .environment("waterworld")
            .resources(num_gpus=0, num_cpus_for_local_worker=32)
            .rollouts(num_rollout_workers=30)  # default = 2 (I should try it)
            .framework("torch")
            .multi_agent(
                policies=policies,
                policy_mapping_fn=(lambda agent_id, *args, **kwargs: "shared_1"),
            )
        )

        tune.Tuner(
            "PPO",
            run_config=air.RunConfig(
                name="waterworld_v4 n5 shared trial {0}w".format(i),
                stop={"training_iteration": 1500},
                checkpoint_config=air.CheckpointConfig(
                    checkpoint_frequency=10,
                ),
            ),
            param_space=config.to_dict(),
        ).fit()

Please let me know if I should use a different resources setting, or whether there are other solutions I could try.
Thanks, George

Hi @george_sk,

I think you must have a lot of policies, a large training batch, and a lot of SGD iterations (num_sgd_iter). You will probably need to tune (not the library) those for your resources.

As a test, how does it change if you set the training argument num_sgd_iter to 1? That is probably lower than you will ultimately want, but it is good for a sanity check.
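
If it helps, here is a minimal sketch of how that could look on the config from your script (only the relevant lines are shown; the value 1 is purely for the sanity test):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("waterworld")
    .framework("torch")
    # Only one SGD pass over each train batch per iteration, purely for the
    # sanity test; PPO's default is considerably higher.
    .training(num_sgd_iter=1)
)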

Hi @mannyv. With num_sgd_iter=1 it runs faster, but I don't think it is faster than my laptop yet. I also tried num_rollout_workers=0 and num_cpus_for_local_worker=64, but it still doesn't seem much faster. What is the factor that makes it run slowly? Is it the large number of num_rollout_workers or something else? I will try Ray 2.3.0 and a GPU as well. But setting num_sgd_iter=1 is not ideal, since it is a learning parameter that can affect training efficiency, I suppose. The weird thing is that on my laptop I use num_rollout_workers=2, num_cpus_for_local_worker=10 and the default num_sgd_iter (I think it is 30), and this is way faster than using num_rollout_workers=2, num_cpus_for_local_worker=62 and the default num_sgd_iter on the cluster (HPC). Do you have any idea why this is happening?
Thanks.

Also, something I forgot: I have read that one solution might be to set TUNE_MAX_PENDING_TRIALS_PG to 1 or to the number of available CPUs. But even though I set it in the environment, I still get the warning:

The maximum number of pending trials has been automatically set to the number of available cluster CPUs, which is high (140 CPUs/pending trials). If you’re running an experiment with a large number of trials, this could lead to scheduling overhead. In this case, consider setting the TUNE_MAX_PENDING_TRIALS_PG environment variable to the desired maximum number of concurrent trials.

So it seems it does not recognise the value I set. Is there a way to set it internally through an RLlib variable? I can't find anything relevant.
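
For reference, this is roughly what I am trying (a sketch only; I am not sure at which point Tune actually reads the variable, so putting it at the very top of the script is an assumption on my part):

import os

# Assumption: set the variable in the driver process before Tune sets up
# the experiment, i.e. at the very top of the training script.
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = "1"

from ray import air, tune  # imported after the variable is set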

Hi @george_sk,

One quick note: changing num_cpus_per_… should not have any effect on performance. The num_cpus values are only used for bookkeeping in Ray; they control how many workers can run simultaneously, which increases parallelism during the sampling (new data collection) phase. But within each worker, adding more CPUs does not increase parallelism in any way.
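
To illustrate (just a sketch, and the numbers are only an example for a single 64-CPU node): the parallelism during sampling comes from the number of rollout workers, each of which usually needs only one CPU, not from giving many CPUs to one worker.

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("waterworld")
    .framework("torch")
    # Sampling parallelism: many rollout workers collecting data at once.
    .rollouts(num_rollout_workers=60)
    .resources(
        num_gpus=0,
        # One CPU per rollout worker is usually enough for sampling.
        num_cpus_per_worker=1,
        # Leave a couple of CPUs for the local (driver/training) worker.
        num_cpus_for_local_worker=2,
    )
)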

Hi @mannyv. Sorry if this is naive, but I am new to parallelism. Increasing parallelism should improve performance, right? It could also increase the amount of sampled data, so it could lead to bottlenecks in some cases, I suppose. I understand what you say about num_cpus_per_worker, since it applies to the rollout workers that do the sampling. But is it the same for this?

num_cpus_for_local_worker – Number of CPUs to allocate for the algorithm.

Because I understand from the documentation that it allocates more CPUs to the worker running the algorithm, so it should increase performance. I haven't used num_cpus_per_… so far.
Generally, how could I utilise all 64 CPUs, and what RLlib parameters should I use?
Should I use num_trainer_workers?

Thanks