Load Balance Issues with RLLIB Job

How severely does this issue affect your experience of using Ray?

  • Medium: It causes significant difficulty in completing my task, but I can work around it.

Hello all,

I am running a large RLlib job on an HPC cluster but am running into load-balancing issues with the workload it produces.

I have a custom env, so unfortunately I cannot post the code here, but it is a rather slow env that calls an external .exe (wrapped in Wine) for its simulation. My current config is rather simple but does spin up quite a large number of nodes:


job_config:
  algorithm: "PPO"
  
  config:
    env: "custom_env"
    framework: "tf"
    num_workers: 238
    num_cpus_per_worker: 3
    num_gpus: 0
    num_cpus_for_driver: 6 
    lr:
      grid_search: [0.00005]
    train_batch_size:
      grid_search: [6000]
    
    placement_strategy: "PACK"
    model:
      fcnet_hiddens: [256, 256]
      fcnet_activation: "tanh"

  run_config:
    name: "custom_env"
    verbose: 0
    log_to_file: True
    stop:
      training_iteration: 20000
      episode_reward_mean: 15

  tune_config:
    max_concurrent_trials: 2
  
  # Checkpoint      
  checkpoint_config:
    checkpoint_frequency: 3
    checkpoint_at_end: True

In total I spin up 238 rollout workers on 20 nodes (36 cores each) and give each rollout worker 3 CPUs so it can call the .exe file.
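That allocation exactly fills the cluster: 238 workers × 3 CPUs = 714 CPUs, plus the 6 CPUs reserved for the driver, makes 720 CPUs = 20 nodes × 36 cores.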

My issue is that the runs are very badly load balanced. When observing the job, sometimes the entire job waits for a single rollout worker to finish before proceeding, which means I can have 19 nodes sitting idle, just waiting for that single process to finish.

My question: How can I improve load-balancing issues like this? Are there better algorithm and config choices I can make to improve the performance of the job?

I will try to come up with a reproducible example, but in the meantime, any thoughts?

Thanks!

@max_ronda it's hard to figure out what the issues are, as we don't know which version of Ray you are running the experiments on.

In general this will be a problem for synchronous algorithms like PPO, as they wait until train_batch_size samples have been collected. In this regard you could try out an asynchronous algorithm, for example APPO, where waiting is less of a problem.
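
For example, here is a rough sketch of such a switch (reusing the env name and resource numbers from your YAML config above as placeholders, not tuned values):

from ray.rllib.algorithms.appo import APPOConfig

# Asynchronous sampling: the learner no longer blocks on the slowest
# rollout worker before it can update.
config = (
    APPOConfig()
    .environment("custom_env")
    .framework("tf")
    .rollouts(num_rollout_workers=238)
    .resources(num_cpus_per_worker=3, num_gpus=0)
    .training(train_batch_size=6000, lr=5e-5)
)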

If you want to stick with PPO, you could check if you are already running on the new stack (i.e. config.experimental(_enable_new_api_stack=True), but on the newest Ray version).

Furthermore, I would check what your memory workload is; this is quite often the more limiting factor.
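
For a quick look at this you could, for example, attach a small script to the running cluster (just a sketch; running ray status on the head node gives a similar overview):

import ray

# Attach to the already running cluster instead of starting a new one.
ray.init(address="auto")

# Total vs. currently free resources; compare the "memory" and
# "object_store_memory" entries to see how much headroom is left.
print(ray.cluster_resources())
print(ray.available_resources())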

Hi @Lars_Simon_Zehnder, thanks for the reply. I updated Ray to the latest 2.9.3 to test this. Could you explain what _enable_new_api_stack will do in PPO? I am running some tests now but want to get an understanding of the setting :slight_smile:

Also, I’ve tried APPO but keep on getting this issue:

Trial task failed for trial APPO_env_84112_00000
Traceback (most recent call last):
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
             ^^^^^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::APPO.train() (pid=182352, ip=10.74.113.160, actor_id=d4c9ebaea7cfbbef5e6d039101000000, repr=APPO)
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 342, in train
    raise skipped from exception_cause(skipped)
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 339, in train
    result = self.step()
             ^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 852, in step
    results, train_iter_ctx = self._run_one_training_iteration()
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 3042, in _run_one_training_iteration
    results = self.training_step()
              ^^^^^^^^^^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/rllib/algorithms/appo/appo.py", line 363, in training_step
    train_results = super().training_step()
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/rllib/algorithms/impala/impala.py", line 698, in training_step
    raise RuntimeError("The learner thread died while training!")
RuntimeError: The learner thread died while training!
Trials did not complete: [APPO_floris_env_84112_00000]

Any helpful thoughts on this? Greatly appreciated!

Hi @max_ronda, since last year we have been working on moving RLlib to a new stack that uses a couple of new (or enhanced) APIs. One is the RLModule API, which replaces ModelV2, is more customizable and faster, and is already in use for PPO and APPO. The other is the EnvRunner API, which replaces the RolloutWorker for sampling from the environment; it is lightweight and can even be used outside of RLlib. Then there is the Learner API, which enables learning in multiple parallel threads (on multiple GPUs).

_enable_new_api_stack=True enables the RLModule API and the Learner API. To use the EnvRunner API you have to import the SingleAgentEnvRunner and define the env_runner_cls in the config:

from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.single_agent_env_runner import SingleAgentEnvRunner

config = (
    PPOConfig()
    .rollouts(env_runner_cls=SingleAgentEnvRunner)
    .experimental(_enable_new_api_stack=True)
    # ...
)
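
From there you can build and train as usual, for example (just a sketch, assuming your custom environment is registered under the name "custom_env"):

# Point the config at the env, build the algorithm, and run one iteration.
algo = config.environment("custom_env").build()
print(algo.train())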

In regard to the error message, again, it's hard to tell without a reproducible example. We do not know what your environment is returning and how the algorithm's methods are called. Here, you could also try the new stack and see if it works there. The implementation of the EnvRunner API is already underway; I assume APPO might come with it in the next minor version.