How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hello all,
I am running a large RLlib job on an HPC cluster but am running into load-balancing issues with the workload it produces.
I have a custom env, so unfortunately I cannot post the code here, but it is a rather slow env that calls an external .exe (wrapped with Wine) for its simulation. My current config is rather simple but does spin up quite a large number of nodes:
```yaml
job_config:
  algorithm: "PPO"
  config:
    env: "custom_env"
    framework: "tf"
    num_workers: 238
    num_cpus_per_worker: 3
    num_gpus: 0
    num_cpus_for_driver: 6
    lr:
      grid_search: [0.00005]
    train_batch_size:
      grid_search: [6000]
    placement_strategy: "PACK"
    model:
      fcnet_hiddens: [256, 256]
      fcnet_activation: "tanh"
run_config:
  name: "custom_env"
  verbose: 0
  log_to_file: True
  stop:
    training_iteration: 20000
    episode_reward_mean: 15
tune_config:
  max_concurrent_trials: 2
# Checkpoint
checkpoint_config:
  checkpoint_frequency: 3
  checkpoint_at_end: True
```
In total I spin up 238 rollout workers on 20 nodes (36 cores each) and give each rollout worker 3 CPUs so it can call the .exe file.
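For context, that YAML corresponds roughly to the following RLlib/Tune setup in Python (a sketch only; "custom_env" is my registered env name, and some option names may differ slightly between Ray versions):

```python
from ray import air, tune
from ray.rllib.algorithms.ppo import PPOConfig

# Rough Python equivalent of the YAML job config above (sketch, not verbatim).
config = (
    PPOConfig()
    .environment("custom_env")  # custom env registered via tune.register_env
    .framework("tf")
    .rollouts(num_rollout_workers=238)
    .resources(
        num_cpus_per_worker=3,        # each worker needs CPUs for the Wine-wrapped .exe
        num_gpus=0,
        num_cpus_for_local_worker=6,  # "num_cpus_for_driver" in the old dict config
        placement_strategy="PACK",
    )
    .training(
        lr=tune.grid_search([0.00005]),
        train_batch_size=tune.grid_search([6000]),
        model={"fcnet_hiddens": [256, 256], "fcnet_activation": "tanh"},
    )
)

tuner = tune.Tuner(
    "PPO",
    param_space=config,
    tune_config=tune.TuneConfig(max_concurrent_trials=2),
    run_config=air.RunConfig(
        name="custom_env",
        verbose=0,
        log_to_file=True,
        stop={"training_iteration": 20000, "episode_reward_mean": 15},
        checkpoint_config=air.CheckpointConfig(
            checkpoint_frequency=3, checkpoint_at_end=True
        ),
    ),
)
results = tuner.fit()
```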
My issue: the runs are very badly load balanced. When observing the job, the entire job sometimes waits for a single rollout worker to finish before proceeding, which means I can have 19 nodes sitting idle while that single process finishes.
My question: how can I improve load-balancing issues like this? Are there better algorithm and config choices I can make to improve the performance of the job?
I will try to come up with a reproducible example, but in the meantime, any thoughts?
Thanks!
@max_ronda it’s hard to figure out what the issues are, as we don’t know which version of Ray you are running the experiments on.
In general this will be a problem for synchronous algorithms like PPO, as they wait until `train_batch_size` samples have been collected. In this regard you could try out an asynchronous algorithm, for example APPO, where waiting is less of a problem.
If you want to stick with PPO, you could check whether you are already running on the new stack (i.e. `config.experimental(_enable_new_api_stack=True)`, but only on the newest Ray version).
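A minimal sketch of what switching this experiment to APPO could look like (assuming the same registered "custom_env" and worker counts from your post; option names may differ slightly between Ray versions):

```python
from ray.rllib.algorithms.appo import APPOConfig

# Sketch only: same env/worker setup as above, but with asynchronous APPO,
# so a single slow rollout worker no longer blocks the whole training step.
config = (
    APPOConfig()
    .environment("custom_env")  # assumes the env is registered under this name
    .framework("tf")
    .rollouts(num_rollout_workers=238)
    .resources(num_cpus_per_worker=3, num_gpus=0)
    .training(train_batch_size=6000, lr=5e-5)
)
```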
Furthermore, I would check what your memory workload is; quite often that is the more limiting factor.
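For a quick look at this, something like the following (run from the driver, assuming you attach to the already running cluster) prints the total and currently available CPU, memory, and object-store resources:

```python
import ray

# Attach to the running cluster from the driver.
ray.init(address="auto")

# Total vs. currently available resources (CPU, memory, object_store_memory).
print(ray.cluster_resources())
print(ray.available_resources())
```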
Hi @Lars_Simon_Zehnder, thanks for the reply. I updated Ray to the latest 2.9.3 to test this. Could you explain what `_enable_new_api_stack` will do in PPO? I am running some tests now but want to get an understanding of the setting.
Also, I’ve tried APPO but keep getting this issue:
```
Trial task failed for trial APPO_env_84112_00000
Traceback (most recent call last):
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
             ^^^^^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/_private/worker.py", line 2624, in get
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RuntimeError): ray::APPO.train() (pid=182352, ip=10.74.113.160, actor_id=d4c9ebaea7cfbbef5e6d039101000000, repr=APPO)
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 342, in train
    raise skipped from exception_cause(skipped)
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/tune/trainable/trainable.py", line 339, in train
    result = self.step()
             ^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 852, in step
    results, train_iter_ctx = self._run_one_training_iteration()
                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/rllib/algorithms/algorithm.py", line 3042, in _run_one_training_iteration
    results = self.training_step()
              ^^^^^^^^^^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/rllib/algorithms/appo/appo.py", line 363, in training_step
    train_results = super().training_step()
                    ^^^^^^^^^^^^^^^^^^^^^^^
  File "/hpcdata/conda/x86_64/envs/lib/python3.11/site-packages/ray/rllib/algorithms/impala/impala.py", line 698, in training_step
    raise RuntimeError("The learner thread died while training!")
RuntimeError: The learner thread died while training!
Trials did not complete: [APPO_floris_env_84112_00000]
```
Any helpful thoughts on this? Greatly appreciated!
Hi @max_ronda, since last year we have been working on moving RLlib to a new stack that uses a couple of new (or enhanced) APIs. One is the `RLModule` API, which replaces `ModelV2`, is better customizable, and is faster; it is already in use for PPO and APPO. Another is the `EnvRunner` API, which replaces the `RolloutWorker` for sampling from the environment; it is lightweight and can even be used outside of RLlib. Then there is the `Learner` API, which enables learning in multiple parallel threads (on multiple GPUs).
`_enable_new_api_stack=True` enables the `RLModule` API and the `Learner` API. To use the `EnvRunner` API you have to import the `SingleAgentEnvRunner` and define the `env_runner_cls` in the config:
```python
from ray.rllib.algorithms.ppo import PPOConfig
from ray.rllib.env.single_agent_env_runner import SingleAgentEnvRunner

config = (
    PPOConfig()
    .rollouts(env_runner_cls=SingleAgentEnvRunner)
    .experimental(_enable_new_api_stack=True)
    # ... rest of your config (environment, resources, training, etc.)
)
```
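From there the config can be used as usual, e.g. (a sketch; `CustomEnv` below is a placeholder for your own gymnasium env class):

```python
from ray import tune

# Register the custom env under the name used in the config, then build and train.
tune.register_env("custom_env", lambda env_cfg: CustomEnv(env_cfg))

algo = config.environment("custom_env").build()
print(algo.train())  # run one training iteration
```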
In regard to the error message, again, it’s hard to tell without a reproducible example. We do not know what your environment is returning or how the algorithm’s methods are called. Here, too, you could try the new stack and see if it works with it. The implementation of the `EnvRunner` API for APPO is already in progress; I assume APPO might come with it in the next minor version.