Impala Bugs and some other observations

Hi @arturn

You are right that the tf2 error originates outside of RLlib. There seems to be some debate on the web about how to solve it (interested users may have a look here), and there appear to have been some updates to the Tensorflow installation page as well.
Downgrading to Tensorflow version 2.10 solves the issue, though - at least for now.

The reason I am using the Impala algo is that it is asynchronous. My custom environment suffers from the fact that some state transitions are more computationally expensive than others and therefore slower. Additionally, my cluster setup comprises different types of CPUs with different speeds. Hence, if I were using a synchronous algo like PPO, I would have rollout workers sitting idle, waiting for the slower ones to finish. As my custom environment is rather slow anyway, this would not be very efficient.

Normally, I run with 28 rollout workers and 4 envs per worker.
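
For reference, this is roughly the kind of setup I mean - a minimal sketch against the Ray 2.3 ImpalaConfig API; “MyCustomEnv” and the resource settings are placeholders, not the actual reproduction script linked below:

    from ray.rllib.algorithms.impala import ImpalaConfig

    # Minimal sketch only - "MyCustomEnv" must already be registered with
    # tune.register_env(); the numbers mirror my usual run, not the repro script.
    config = (
        ImpalaConfig()
        .environment(env="MyCustomEnv")   # placeholder name for my custom env
        .framework("tf2")
        .rollouts(
            num_rollout_workers=28,       # asynchronous sampling over heterogeneous CPUs
            num_envs_per_worker=4,
        )
    )
    algo = config.build()
    result = algo.train()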

I think the “one worker” issue dates back to when I first started investigating this while trying to migrate from Ray 2.1.0 to 2.2.0 and communicated with @sven1977 about it (more details here). Back then, episode_reward_mean etc. also kept being just .nan, and I tried various rollout worker configurations, including using only one. In the process I also managed to “harass” the Impala algo enough to get the learner queue empty error, but that is not what's happening here. Nevertheless, I still think these issues are related, as my custom environment runs fine in Ray 2.1.0.

I have just been running the reproduction code that I provided a link to above, on Ray 2.3.1 after downgrading Tensorflow to 2.10. It runs fine, but episode_reward_mean etc. should start to show up around 12k time steps, and again they don't - even though the algo resets the environments after termination/truncation, which indicates that the episodes do end. Moreover, info[“learner”], info[“learner_queue”] etc. appear to me to be updating as expected. See sample output below:

agent_timesteps_total: 36000
connector_metrics: {}
counters:
  num_agent_steps_sampled: 36000
  num_agent_steps_trained: 36000
  num_env_steps_sampled: 36000
  num_env_steps_trained: 36000
  num_samples_added_to_queue: 36000
  num_training_step_calls_since_last_synch_worker_weights: 75883
  num_weight_broadcasts: 26
custom_metrics: {}
date: 2023-04-21_10-10-09
done: false
episode_len_mean: .nan
episode_media: {}
episode_reward_max: .nan
episode_reward_mean: .nan
episode_reward_min: .nan
episodes_this_iter: 0
episodes_total: 0
experiment_id: cbb2c6a61b7a419b96ce250ed0610573
hostname: novelty-TUF-GAMING-X670E-PLUS-1002
info:
  learner:
    default_policy:
      custom_metrics: {}
      diff_num_grad_updates_vs_sampler_policy: 6.5
      grad_gnorm:
      - 1.5110050439834595
      learner_stats:
        cur_lr: 0.0004999120137654245
        entropy: 1.6087641716003418
        entropy_coeff: 0.00499859219416976
        policy_loss: 0.0005777080659754574
        var_gnorm: 29.632728576660156
        vf_explained_var: -0.04690992832183838
        vf_loss: 0.00027035464881919324
      num_agent_steps_trained: 800.0
      num_grad_updates_lifetime: 40.0
  learner_queue:
    size_count: 45
    size_mean: 0.8444444444444444
    size_quantiles: [0.0, 0.0, 1.0, 2.0, 2.0]
    size_std: 0.8152860590152952
  num_agent_steps_sampled: 36000
  num_agent_steps_trained: 36000
  num_env_steps_sampled: 36000
  num_env_steps_trained: 36000
  num_samples_added_to_queue: 36000
  num_training_step_calls_since_last_synch_worker_weights: 75883
  num_weight_broadcasts: 26
  timing_breakdown:
    learner_dequeue_time_ms: 28177.25
    learner_grad_time_ms: 108.405
    learner_load_time_ms: 0.0
    learner_load_wait_time_ms: 0.0
iterations_since_restore: 24
node_ip: 10.0.1.4
num_agent_steps_sampled: 36000
num_agent_steps_trained: 36000
num_env_steps_sampled: 36000
num_env_steps_sampled_this_iter: 0
num_env_steps_trained: 36000
num_env_steps_trained_this_iter: 0
num_faulty_episodes: 0
num_healthy_workers: 6
num_in_flight_async_reqs: 12
num_remote_worker_restarts: 0
num_steps_trained_this_iter: 0
perf:
  cpu_util_percent: 5.211250000000001
  gpu_util_percent0: 0.039625
  ram_util_percent: 56.098749999999995
  vram_util_percent0: 0.9622923790913533
pid: 810428
policy_reward_max: {}
policy_reward_mean: {}
policy_reward_min: {}
sampler_perf: {}
sampler_results:
  connector_metrics: {}
  custom_metrics: {}
  episode_len_mean: .nan
  episode_media: {}
  episode_reward_max: .nan
  episode_reward_mean: .nan
  episode_reward_min: .nan
  episodes_this_iter: 0
  hist_stats:
    episode_lengths: []
    episode_reward: []
  num_faulty_episodes: 0
  policy_reward_max: {}
  policy_reward_mean: {}
  policy_reward_min: {}
  sampler_perf: {}
time_since_restore: 1680.5213215351105
time_this_iter_s: 70.02380895614624
time_total_s: 1680.5213215351105
timers:
  synch_weights_time_ms: 0.116
  training_iteration_time_ms: 0.212
timestamp: 1682064609
timesteps_since_restore: 0
timesteps_total: 36000
training_iteration: 24
trial_id: default
warmup_time: 18.718221426010132

So something is clearly going on, and learning appears to take place, but the RL-relevant metrics are not being reported.
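
For completeness, one way to rule out the environment itself is to step it manually with random actions outside of RLlib and confirm that it really returns terminated/truncated - a minimal sketch, where MyCustomEnv is again just a placeholder name for my Gymnasium-style env class:

    # Sanity check outside of RLlib: does the env ever signal episode end?
    env = MyCustomEnv()  # placeholder for the actual environment class
    obs, info = env.reset()
    for t in range(100000):
        obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
        if terminated or truncated:
            print(f"Episode ended at step {t}: terminated={terminated}, truncated={truncated}")
            break
    # If neither flag is ever set, RLlib will never record a finished episode
    # and episode_reward_mean etc. will stay .nan.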

BR

Jorgen