Impala Bugs and some other observations

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

System:


Python 3.9.16
Ray 2.3.0 and 3.0.0.dev0
Tensorflow 2.11.1

Issue may be related to this PR

Unfortunately, I'm unable to replicate this with one of the lightweight and fast standard toy environments, although I tried. So instead I provide some sampler_perf stats obtained with different settings/algos:

Impala (train_batch_size=500, num_rollout_workers=2, rollout_fragment_length=50):

sampler_perf:
mean_action_processing_ms: 0.1270183308157854
mean_env_render_ms: 0.0
mean_env_wait_ms: 517.0239390965231
mean_inference_ms: 3.660824960339331
mean_raw_obs_processing_ms: 1.627967505159968
sampler_results:
connector_metrics:
ObsPreprocessorConnector_ms: 0.007319450378417969
StateBufferConnector_ms: 0.004112720489501953
ViewRequirementAgentConnector_ms: 0.21414756774902344

Impala (train_batch_size=800, num_rollout_workers=4, rollout_fragment_length=100):

  • only empty lists, dicts and nans returned

PPO reference example given below:

sampler_perf:
mean_action_processing_ms: 0.13562084506648384
mean_env_render_ms: 0.0
mean_env_wait_ms: 333.2677821337623
mean_inference_ms: 3.2553278776970607
mean_raw_obs_processing_ms: 0.832363541727467
sampler_results:
connector_metrics:
StateBufferConnector_ms: 0.010395050048828125
ViewRequirementAgentConnector_ms: 0.19804835319519043

I run a rather slow and computationally heavy custom environment. I've been running it since Ray 2.1.0 (and 2.0.0 before that), although back then as a Gym version, and used the IMPALA algorithm with no issues whatsoever. I've since upgraded to Gymnasium, registered the environment, and got it working in its own right. The environment is vision based and I use the custom model given below with framestack = 4.

The observation space is:
Box(low=0, high=255, shape=(72, 128, 3), dtype=np.uint8)

import tensorflow as tf

from ray.rllib.models.tf.tf_modelv2 import TFModelV2


def conv_layer(depth, name):
    return tf.keras.layers.Conv2D(
        filters=depth, kernel_size=3, strides=1, padding="same", name=name
    )


def residual_block(x, depth, prefix):
    inputs = x
    assert inputs.shape[-1] == depth  # channel count must match the block depth
    x = tf.keras.layers.ReLU()(x)
    x = conv_layer(depth, name=prefix + "_conv0")(x)
    x = tf.keras.layers.ReLU()(x)
    x = conv_layer(depth, name=prefix + "_conv1")(x)
    return x + inputs


def conv_sequence(x, depth, prefix):
    x = conv_layer(depth, prefix + "_conv")(x)
    x = tf.keras.layers.MaxPool3D(pool_size=3, strides=2, padding="same")(x) # 3D for multiframe
    x = residual_block(x, depth, prefix=prefix + "_block0")
    x = residual_block(x, depth, prefix=prefix + "_block1")
    return x


class CustomModel(TFModelV2):
    """Deep residual network that produces logits for policy and value for value-function;
    Based on architecture used in IMPALA paper:https://arxiv.org/abs/1802.01561"""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        super().__init__(obs_space, action_space, num_outputs, model_config, name)

        depths = [16, 32, 32]

        inputs = tf.keras.layers.Input(shape=obs_space.shape, name="observations")
        scaled_inputs = tf.cast(inputs, tf.float32) / 255.0

        x = scaled_inputs
        for i, depth in enumerate(depths):
            x = conv_sequence(x, depth, prefix=f"seq{i}")

        x = tf.keras.layers.Flatten()(x)
        x = tf.keras.layers.ReLU()(x)
        x = tf.keras.layers.Dense(units=256, activation="relu", name="hidden")(x)
        logits = tf.keras.layers.Dense(units=num_outputs, name="pi")(x)
        value = tf.keras.layers.Dense(units=1, name="vf")(x)
        self.base_model = tf.keras.Model(inputs, [logits, value])

    def forward(self, input_dict, state, seq_lens):
        # explicit cast to float32 needed in eager
        obs = tf.cast(input_dict["obs"], tf.float32)
        logits, self._value = self.base_model(obs)
        return logits, state

    def value_function(self):
        return tf.reshape(self._value, [-1])
    
    def import_from_h5(self, h5_file):
        self.base_model.load_weights(h5_file)
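
For completeness, the model is registered via the standard RLlib ModelCatalog API so it can be referenced as "CustomCNN" in the configs below (sketch):

from ray.rllib.models import ModelCatalog

# Make the model available under the name used in model={"custom_model": "CustomCNN"}.
ModelCatalog.register_custom_model("CustomCNN", CustomModel)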

When running RLlib IMPALA in the latest versions, I experience the following (strange…) issues:

1. Training and rollouts:

.rollouts(num_rollout_workers=4) with .training(train_batch_size=800, model={"custom_model": "CustomCNN"}) and .rollouts(rollout_fragment_length=100) results in no metric output (episode_reward_mean etc.). It does appear to be training, as it shows policy loss etc.
Setting it to 2 workers, or train_batch_size/rollout_fragment_length to 500/50 respectively, makes it work and metrics are shown. Increasing num_rollout_workers makes it fail again, as does increasing num_envs_per_worker to more than 1.
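
For reference, a sketch of the failing IMPALA setup described above (written against the Ray 2.3 config API; the env and model registrations are assumed to already exist):

from ray.rllib.algorithms.impala import ImpalaConfig

algo = (
    ImpalaConfig()
    # This combination trains (losses are reported) but yields no episode metrics:
    .training(train_batch_size=800, model={"custom_model": "CustomCNN"})
    .rollouts(num_rollout_workers=4, rollout_fragment_length=100)
    # With 2 workers and train_batch_size=500 / rollout_fragment_length=50
    # the metrics do show up.
    .environment(env="xxxenv", env_config={"dummy_param": "foo"})
    .resources(num_gpus=1)
    .build()
)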

Obviously, I suspected the custom environment at first, but as you can see above it runs under certain conditions. Additionally, when changing to PPO I can push it pretty much to the system limit with no problems, like this:

from ray.rllib.algorithms import ppo

algo = (
    ppo.PPOConfig()
    .training(train_batch_size=2400, model={"custom_model": "CustomCNN"})
    .environment(env="xxxenv", env_config={"dummy_param": "foo"})
    .rollouts(
        num_rollout_workers=6,
        num_envs_per_worker=4,
        rollout_fragment_length=100,
        remote_worker_envs=True,
        remote_env_batch_wait_ms=10,
        preprocessor_pref=None,
        sampler_perf_stats_ema_coef=2 / (200 + 1),  # ~200-sample EMA
    )
    .resources(num_gpus=1, num_cpus_per_worker=5)
    .fault_tolerance(recreate_failed_workers=True, restart_failed_sub_environments=True)
    .build()
)

.rollouts(…,sampler_perf_stats_ema_coef=2/(200+1)) appears to have no effect. The result still appears to be just the mean of the list of episode_rewards
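
For context, what I would expect the coefficient to produce is an exponential moving average rather than a plain mean over the window - roughly (my own sketch, not RLlib source):

coef = 2 / (200 + 1)  # ~200-sample EMA

def ema_update(ema, new_value, coef=coef):
    # The first sample initializes the EMA; later samples are blended in with
    # weight `coef` instead of averaging the whole window.
    return new_value if ema is None else coef * new_value + (1.0 - coef) * ema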

2. Framework

.framework(framework="tf2", eager_tracing=True) results in the following error (also without specifying eager_tracing):

Exception in thread Thread-18:
Traceback (most recent call last):
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/threading.py”, line 980, in _bootstrap_inner
self.run()
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/execution/learner_thread.py”, line 74, in run
self.step()
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/execution/learner_thread.py”, line 91, in step
multi_agent_results = self.local_worker.learn_on_batch(batch)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py”, line 1036, in learn_on_batch
info_out[pid] = policy.learn_on_batch(batch)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py”, line 139, in func
return obj(self_, *args, **kwargs)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py”, line 224, in learn_on_batch
return super(TracedEagerPolicy, self).learn_on_batch(samples)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/utils/threading.py”, line 24, in wrapper
return func(self, *a, **k)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py”, line 628, in learn_on_batch
stats = self._learn_on_batch_helper(postprocessed_batch)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py”, line 97, in _func
return func(*eager_args, **eager_kwargs)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/tensorflow/python/util/traceback_utils.py”, line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/tensorflow/python/eager/execute.py”, line 52, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InternalError: Graph execution error:

Detected at node ‘StatefulPartitionedCall_32’ defined at (most recent call last):
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/threading.py”, line 937, in _bootstrap
self._bootstrap_inner()
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/threading.py”, line 980, in _bootstrap_inner
self.run()
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/execution/learner_thread.py”, line 74, in run
self.step()
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/execution/learner_thread.py”, line 91, in step
multi_agent_results = self.local_worker.learn_on_batch(batch)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/evaluation/rollout_worker.py”, line 1036, in learn_on_batch
info_out[pid] = policy.learn_on_batch(batch)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py”, line 139, in func
return obj(self_, *args, **kwargs)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py”, line 224, in learn_on_batch
return super(TracedEagerPolicy, self).learn_on_batch(samples)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/utils/threading.py”, line 24, in wrapper
return func(self, *a, **k)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py”, line 628, in learn_on_batch
stats = self._learn_on_batch_helper(postprocessed_batch)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy.py”, line 97, in _func
return func(*eager_args, **eager_kwargs)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py”, line 924, in _learn_on_batch_helper
self._apply_gradients_helper(grads_and_vars)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/policy/eager_tf_policy_v2.py”, line 1007, in _apply_gradients_helper
o.apply_gradients(
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py”, line 1140, in apply_gradients
return super().apply_gradients(grads_and_vars, name=name)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py”, line 634, in apply_gradients
iteration = self._internal_apply_gradients(grads_and_vars)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py”, line 1166, in _internal_apply_gradients
return tf.__internal__.distribute.interim.maybe_merge_call(
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py”, line 1216, in _distributed_apply_gradients_fn
distribution.extended.update(
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/keras/optimizers/optimizer_experimental/optimizer.py”, line 1211, in apply_grad_to_update_var
return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: ‘StatefulPartitionedCall_32’
libdevice not found at ./libdevice.10.bc
[[{{node StatefulPartitionedCall_32}}]] [Op:__inference__learn_on_batch_helper_11039]
Traceback (most recent call last):
File “/home/novelty/lupus_gymnasium_dev/rllib_basic_test/impala_test_5.py”, line 141, in
result = algo.train()
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/tune/trainable/trainable.py”, line 384, in train
raise skipped from exception_cause(skipped)
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/tune/trainable/trainable.py”, line 381, in train
result = self.step()
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py”, line 769, in step
results, train_iter_ctx = self._run_one_training_iteration()
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py”, line 2754, in _run_one_training_iteration
results = self.training_step()
File “/home/novelty/miniconda3/envs/lupus_gymnasium/lib/python3.9/site-packages/ray/rllib/algorithms/impala/impala.py”, line 619, in training_step
raise RuntimeError(“The learner thread died while training!”)
RuntimeError: The learner thread died while training!

BR

Jorgen

Hi again.

I finally managed to make a reproduction code which can be found here.

As speculated above, the bug seems to be linked to the speed of the environment. Hence I modified the RandomEnv and added a "sleeping" parameter which is passed through the env_config. This slows down the reset and step functions to mimic a slow environment; a minimal sketch of the idea is shown below.
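
The modification is essentially the following (a minimal sketch of the idea; the actual reproduction env with its debug measures is in the linked repo, and the class/parameter names other than "sleeping" are only illustrative):

import time

import gymnasium as gym
import numpy as np


class SlowRandomEnv(gym.Env):
    """Random-acting env whose reset() and step() sleep to mimic a slow simulator."""

    def __init__(self, config=None):
        config = config or {}
        self.sleeping = config.get("sleeping", 0.5)  # seconds per reset/step call
        self.episode_len = config.get("episode_len", 100)
        self.observation_space = gym.spaces.Box(0, 255, shape=(72, 128, 3), dtype=np.uint8)
        self.action_space = gym.spaces.Discrete(4)
        self._t = 0

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        time.sleep(self.sleeping)  # artificial slowness
        self._t = 0
        return self.observation_space.sample(), {}

    def step(self, action):
        time.sleep(self.sleeping)  # artificial slowness
        self._t += 1
        terminated = self._t >= self.episode_len
        return self.observation_space.sample(), 1.0, terminated, False, {}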

The bug appears when increasing the "sleeping" parameter in the env_config, and the threshold may depend on the given system. There are a number of "debug measures" built into the environment in order to figure out what the Impala algo is actually doing. In my case, with the system above, the bug appears with the default settings as given in the linked files. I find it interesting to note that the Impala algo in fact resets the environments but still doesn't provide the episode_reward. Moreover, setting env_config = {…, "sleeping": 0.0} removes the bug completely, which I believe supports my argument.

Running impala_test_random_env_custom_model.py as is should result in episode rewards being listed when reaching approx. 12k samples, but on my system they don't.

Hopefully, this will enable you to track the problem.

BR

Jorgen

Hi,

Update:

I retested the above today with the latest Ray 3.0.0.dev0 and the errors still persist. Additionally, I now had to reduce the train_batch_size to 600 as opposed to the previous 1200 to avoid a GPU OOM error. The latter runs fine in Ray 2.1.0 as well. In terms of framework, the default now seems to be "torch". I was only able to specify "tf" but not "tf2" to get it running.

BR

Jorgen

Hi again @Jorgen_Svane ,

Thanks for the detailed description of your issue.
The error you posted contains another error outside of RLlib:

return self._update_step_xla(grad, var, id(self._var_key(var)))
Node: ‘StatefulPartitionedCall_32’
libdevice not found at ./libdevice.10.bc

You probably want to investigate that.

Concerning other elements of your post:
IMPALA is asynchronous. Therefore, such issues are more likely to occur than with PPO.
For example, the learner thread can time out (it has a default timeout of 5 minutes).
You can modify that with learner_queue_timeout, for example as sketched below.
A mean env wait time of ~517 ms per step means that your rollout fragment will take > 50 s to collect. If your batch size is even larger and you only have one rollout worker, that makes for a very long sampling time, during which IMPALA will still collect the mean episodic reward as a metric but not have enough samples available to train on, since IMPALA is asynchronous. So that's expected.
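
For example (a sketch against the IMPALA config API):

from ray.rllib.algorithms.impala import ImpalaConfig

config = (
    ImpalaConfig()
    .rollouts(num_rollout_workers=2, rollout_fragment_length=50)
    .training(
        train_batch_size=500,
        # Default is 300 s; give the learner thread longer to wait for samples
        # from very slow envs before it times out.
        learner_queue_timeout=600,
    )
)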

Furthermore, if you are using only one worker, why are you using IMPALA?

Hi @arturn

You are right that the tf2 error originates outside of RLlib. There seems to be some debate on the web on how to solve it (interested users may have a look here), and there appear to be some updates to the TensorFlow installation page as well.
Downgrading to TensorFlow 2.10 solves the issue though - at least for now.

The reason for me using the IMPALA algo is that it is asynchronous. My custom environment suffers from the fact that some state transitions are more computationally expensive than others and thus slower. Additionally, my cluster setup comprises different types of CPUs with different speeds. Hence, if I were using a synchronous algo like PPO, I would have rollout workers sitting idle waiting for the slower ones to finish. As my custom environment is rather slow anyway, this would not be very efficient.

Normally, I run with 28 workers with 4 envs per worker.

I think the "one worker" issue originates from when I started to investigate this issue while trying to migrate from Ray 2.1.0 to 2.2.0 and communicated with @sven1977 on it (more details here). Again, episode_reward_mean etc. kept being just .nan and I tried various rollout worker configurations, including using only one. In this process I also managed to "harass" the Impala algo enough to get the learner queue empty error. But that is not what's happening here. Nevertheless, I still think these issues are related, as my custom environment runs fine in Ray 2.1.0.

I have just been running the reproduction code that I provided a link to above, after downgrading TensorFlow to 2.10, in Ray 2.3.1. It runs fine, but episode_reward_mean etc. should start to show around 12k time steps and again they don't - although the algo resets the environments after termination/truncation, which should indicate that the episodes ended. Moreover, info["learner"], info["learner_queue"] etc. appear to me to be running as expected. See sample output below:

agent_timesteps_total: 36000
connector_metrics: {}
counters:
  num_agent_steps_sampled: 36000
  num_agent_steps_trained: 36000
  num_env_steps_sampled: 36000
  num_env_steps_trained: 36000
  num_samples_added_to_queue: 36000
  num_training_step_calls_since_last_synch_worker_weights: 75883
  num_weight_broadcasts: 26
custom_metrics: {}
date: 2023-04-21_10-10-09
done: false
episode_len_mean: .nan
episode_media: {}
episode_reward_max: .nan
episode_reward_mean: .nan
episode_reward_min: .nan
episodes_this_iter: 0
episodes_total: 0
experiment_id: cbb2c6a61b7a419b96ce250ed0610573
hostname: novelty-TUF-GAMING-X670E-PLUS-1002
info:
  learner:
    default_policy:
      custom_metrics: {}
      diff_num_grad_updates_vs_sampler_policy: 6.5
      grad_gnorm:
      - 1.5110050439834595
      learner_stats:
        cur_lr: 0.0004999120137654245
        entropy: 1.6087641716003418
        entropy_coeff: 0.00499859219416976
        policy_loss: 0.0005777080659754574
        var_gnorm: 29.632728576660156
        vf_explained_var: -0.04690992832183838
        vf_loss: 0.00027035464881919324
      num_agent_steps_trained: 800.0
      num_grad_updates_lifetime: 40.0
  learner_queue:
    size_count: 45
    size_mean: 0.8444444444444444
    size_quantiles: [0.0, 0.0, 1.0, 2.0, 2.0]
    size_std: 0.8152860590152952
  num_agent_steps_sampled: 36000
  num_agent_steps_trained: 36000
  num_env_steps_sampled: 36000
  num_env_steps_trained: 36000
  num_samples_added_to_queue: 36000
  num_training_step_calls_since_last_synch_worker_weights: 75883
  num_weight_broadcasts: 26
  timing_breakdown:
    learner_dequeue_time_ms: 28177.25
    learner_grad_time_ms: 108.405
    learner_load_time_ms: 0.0
    learner_load_wait_time_ms: 0.0
iterations_since_restore: 24
node_ip: 10.0.1.4
num_agent_steps_sampled: 36000
num_agent_steps_trained: 36000
num_env_steps_sampled: 36000
num_env_steps_sampled_this_iter: 0
num_env_steps_trained: 36000
num_env_steps_trained_this_iter: 0
num_faulty_episodes: 0
num_healthy_workers: 6
num_in_flight_async_reqs: 12
num_remote_worker_restarts: 0
num_steps_trained_this_iter: 0
perf:
  cpu_util_percent: 5.211250000000001
  gpu_util_percent0: 0.039625
  ram_util_percent: 56.098749999999995
  vram_util_percent0: 0.9622923790913533
pid: 810428
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf: {}
sampler_results:
  connector_metrics: {}
  custom_metrics: {}
  episode_len_mean: .nan
  episode_media: {}
  episode_reward_max: .nan
  episode_reward_mean: .nan
  episode_reward_min: .nan
  episodes_this_iter: 0
  hist_stats:
    episode_lengths: []
    episode_reward: []
  num_faulty_episodes: 0
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf: {}
time_since_restore: 1680.5213215351105
time_this_iter_s: 70.02380895614624
time_total_s: 1680.5213215351105
timers:
  synch_weights_time_ms: 0.116
  training_iteration_time_ms: 0.212
timestamp: 1682064609
timesteps_since_restore: 0
timesteps_total: 36000
training_iteration: 24
trial_id: default
warmup_time: 18.718221426010132

So something is clearly going on and learning appears to take place, but the RL-relevant metrics are not provided.

BR

Jorgen

Yes, this should not happen. Our release tests all depend on the episode_reward_mean metric, so this problem must be specific to your setup. Could you post one small reproduction script that leaves you with .nan metrics?

Aside from this issue and out of personal interest, why not use APPO?

Hi @arturn

Starting with your last question regarding APPO: I guess I just took the tip to go for IMPALA or PPO instead, which you mention under the APPO algorithm description on your web page, without thinking too much about it. Nevertheless, I just tested the issue by running APPO. As you can see below, the problem is the same, with learning taking place but no RL-specific metrics being provided:

...
info:
  last_target_update_ts: 31200
  learner:
    default_policy:
      custom_metrics: {}
      diff_num_grad_updates_vs_sampler_policy: 8.0
      grad_gnorm:
      - 0.011348158121109009
      learner_stats:
        cur_lr: 0.0004999240045435727
        entropy: 1.6032085418701172
        entropy_coeff: 0.0049987840466201305
        mean_IS: 1.011188268661499
        policy_loss: 0.0026907336432486773
        total_loss: -0.005186212249100208
        var_IS: 0.011368080973625183
        var_gnorm: 29.666349411010742
        vf_explained_var: -0.6952351331710815
        vf_loss: 0.0002742948126979172
      num_agent_steps_trained: 800.0
      num_grad_updates_lifetime: 39.0
  learner_queue:
    size_count: 39
    size_mean: 0.8461538461538461
    size_quantiles: [0.0, 0.0, 1.0, 2.0, 2.0]
    size_std: 0.8332347081677907
  num_agent_steps_sampled: 31200
  num_agent_steps_trained: 31200
  num_env_steps_sampled: 31200
  num_env_steps_trained: 31200
  num_samples_added_to_queue: 31200
  num_target_updates: 22
  num_training_step_calls_since_last_synch_worker_weights: 990
  num_weight_broadcasts: 22
  timing_breakdown:
    learner_dequeue_time_ms: 28847.418
    learner_grad_time_ms: 149.187
    learner_load_time_ms: 0.0
    learner_load_wait_time_ms: 0.0
iterations_since_restore: 20
node_ip: 10.0.1.4
num_agent_steps_sampled: 31200
num_agent_steps_trained: 31200
num_env_steps_sampled: 31200
num_env_steps_sampled_this_iter: 2400
num_env_steps_trained: 31200
num_env_steps_trained_this_iter: 2400
num_faulty_episodes: 0
num_healthy_workers: 6
num_in_flight_async_reqs: 12
num_remote_worker_restarts: 0
num_steps_trained_this_iter: 2400
perf:
  cpu_util_percent: 3.6792682926829268
  gpu_util_percent0: 0.04048780487804879
  ram_util_percent: 57.27439024390244
  vram_util_percent0: 0.9633664772163112
pid: 13766
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf: {}
sampler_results:
  connector_metrics: {}
  custom_metrics: {}
  episode_len_mean: .nan
  episode_media: {}
  episode_reward_max: .nan
  episode_reward_mean: .nan
  episode_reward_min: .nan
  episodes_this_iter: 0
  hist_stats:
    episode_lengths: []
    episode_reward: []
  num_faulty_episodes: 0
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
....

For completeness I also ran the PPO algo which doesn’t suffer from this issue.

The key to reproducing this issue appears to be a slow environment in terms of the reset and step functions. Therefore, I modified the standard RandomEnv to slow these two functions down by adding a sleeping parameter to the env_config.

I can imagine this could also be a problem when interacting with physical systems.

I've updated my GitHub repo with the full reproduction code, which can be found here.

BR

Jorgen

Hi Jorgen,

I reproduced this issue.
The issue is that RLlib collects metrics every time it collects samples.
This happens once after each training_step() call.
Since the environment is very slow and has very long episodes, each training_step() call collects only a tiny batch of samples, but since IMPALA is asynchronous it does not care and simply keeps going, reporting only the very limited metrics it has collected.

If the sampling is extremely slow, I would increase the reporting time with config.reporting(min_time_s_per_iteration=500), as sketched below. It's normally 10, so you should choose something that fits the slowness of your env heuristically. 500 got me the metrics you are looking for when I sped up the env by 10x.
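
In config form (a sketch; the env name, model, and other settings below are placeholders for your own):

from ray.rllib.algorithms.impala import ImpalaConfig

config = (
    ImpalaConfig()
    .environment(env="your_slow_env", env_config={"sleeping": 0.5})
    .rollouts(num_rollout_workers=4, rollout_fragment_length=100)
    .training(train_batch_size=800, model={"custom_model": "CustomCNN"})
    # Let each reported iteration span enough wall-clock time for the slow
    # env to actually finish episodes (default: 10 seconds).
    .reporting(min_time_s_per_iteration=500)
)
algo = config.build()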

Hi @arturn

Thanks a lot!

Although your reply did not solve my problem out of the box it did send me in the right direction. I guess the solution had been staring me in the face all along …

By setting .reporting(metrics_episode_collection_timeout_s=500.0) the issue is solved. I believe this was just a warning in the past when exceeding 60 seconds.

It does appear to slow time_this_iter_s down - but that is not a problem.

For completeness I also retested the setup with the APPO algo which also runs fine.

Anyways, I’m not quite finished nagging you… any comments on this one:

I'm interested in this one, as my custom environment's episode_reward_mean has a rather high variance.

I see I can now upgrade directly to Ray 2.4.0.

Thanks again!

Jorgen

Glad to hear that you could make progress.
episode_reward_mean is not a perf stat as per our definition.
The perf stats should be inside their own sub-tree of your stats, separated from others like episode_reward_mean.