Advice on debugging an extreme PPO throughput regression when moving to the new API stack

1. Severity of the issue:
High: Completely blocks me.

2. Environment:

  • Ray version: 2.47.1
  • Python version: 3.10
  • OS: Linux
  • Cloud/Infrastructure: HPC systems using Singularity containerization
  • Other libs/tools (if relevant):

Good morning. My team has been locked to Ray/RLlib 2.9.3 for quite a while and recently went through the huge effort of updating our code to Ray/RLlib 2.47.1.

We use PPO essentially 100% of the time, and as far as I can tell everything works as expected on my local laptop (20 env_runners, 1 CPU per env runner, batch size 12000, 1 GPU for the backward pass); throughput is pretty close to 2.9.3.

However, when we move the same setup to an HPC system (150 env_runners, 1 CPU per env runner, batch size 1M, 1 GPU for the backward pass), we suddenly see a massive throughput decrease compared to Ray 2.9.3.
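
For reference, here is a minimal sketch of how I would express the two setups on the new API stack (the env ID and the exact numbers are placeholders, not our actual code):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Minimal sketch of the two setups on the new API stack (default in 2.47.1).
# "MyCompetitiveEnv" is a placeholder for our registered environment.
laptop_config = (
    PPOConfig()
    .environment("MyCompetitiveEnv")
    .env_runners(num_env_runners=20, num_cpus_per_env_runner=1)
    .training(train_batch_size_per_learner=12_000)
    .learners(num_learners=1, num_gpus_per_learner=1)  # 1 GPU for the backward pass
)

# HPC variant: 150 env runners and a ~1M-step train batch, everything else equal.
hpc_config = laptop_config.copy(copy_frozen=False)
hpc_config.env_runners(num_env_runners=150, num_cpus_per_env_runner=1)
hpc_config.training(train_batch_size_per_learner=1_000_000)
```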

The training timing metrics report the following numbers (exact same agent/environment configuration):

Ray/RLlib 2.9.3:

  num_agent_steps_sampled: 1205206
  num_agent_steps_trained: 1205206
  num_env_steps_sampled: 807152
  num_env_steps_trained: 807152
  sampler_perf:
    mean_action_processing_ms: 0.873626950036023
    mean_env_render_ms: 0.0
    mean_env_wait_ms: 49.38276198896556
    mean_inference_ms: 8.130969697180445
    mean_raw_obs_processing_ms: 11.626552869819509
  time_since_restore: 781.5206921100616
  time_this_iter_s: 781.5206921100616
  time_total_s: 781.5206921100616
  timers:
    learn_throughput: 6263.067
    learn_time_ms: 128874.886
    sample_time_ms: 642445.515
    synch_weights_time_ms: 155.402
    training_iteration_time_ms: 772090.611

Note: I'm really not a fan of num_sgd_iter affecting so many of these numbers in the new RLlib; it makes them very difficult to read.

Ray/RLlib 2.47.1:

learner_connector_sum_episodes_length_in: 682386
learner_connector_sum_episodes_length_out: 682386
num_env_steps_trained: 212222046
num_env_steps_trained_lifetime: 212222046
num_env_steps_trained_lifetime_throughput: 2677953.781207868
num_module_steps_trained: 37318802
num_module_steps_trained_lifetime: 37318802
num_module_steps_trained_lifetime_throughput: 470877.56540254137
num_module_steps_trained_throughput: 470877.5038728521
timers:
    env_runner_sampling_timer: 2240.198469405994
    learner_update_timer: 272.7168108657934
    restore_env_runners: 8.420972153544426e-05
    synch_weights: 0.044216078240424395
    training_iteration: 2564.2249606600963
    training_step: 2563.132981158793
learner_connector:
    connector_pipeline_timer: 188.97170732775703
    timers:
        connectors:
            add_columns_from_episodes_to_train_batch: 55.24026194307953
            add_observations_from_episodes_to_batch: 1.0445594252087176
            add_one_ts_to_episodes_and_truncate: 11.530067444778979
            add_states_from_episodes_to_batch: 4.431560833007097
            add_time_dim_to_batch_and_zero_pad: 102.05662126094103
            agent_to_module_mapping: 0.5788974780589342
            batch_individual_items: 6.675551616586745
            general_advantage_estimation: 7.1979670389555395
            numpy_to_tensor: 0.2147320620715618
            remove_batch_data_for_untrained_modules: 1.603970304131508e-05
module_to_env_connector:
    connector_pipeline_timer: 0.0029336511515817205
    timers:
        connectors:
            get_actions: 0.001722823150969477
            listify_data_for_vector_env: 2.7680069598418587e-05
            module_to_agent_unmapping: 1.0319576357620947e-05
            normalize_and_clip_actions: 0.00041019427389107194
            remove_single_ts_time_rank_from_batch: 0.0001966772473970731
            tensor_to_numpy: 0.00021458907787074528
            un_batch_to_individual_items: 0.00014141135139788053
rlmodule_inference_timer: 0.0011713467526806931
sample: 22.25390422265054
time_between_sampling: 20.692558764899296
env_reset_timer: 0.26785381696503074
env_step_timer: 0.04951261894976812
env_to_module_connector:
    connector_pipeline_timer: 0.0022030824949814736
    timers:
        connectors:
            add_observations_from_episodes_to_batch: 3.956975504311733e-05
            add_states_from_episodes_to_batch: 6.706504561875302e-05
            add_time_dim_to_batch_and_zero_pad: 0.00011553349485773592
            agent_to_module_mapping: 1.4613636671375326e-05
            batch_individual_items: 9.242227091070628e-05
            flatten_space_connector: 0.0014116075936986485
            numpy_to_tensor: 0.0001413176729359396
            original_space_flatten_obserations: 8.963949903429677e-05

From the numbers you can see we go from 772 seconds per iteration to 2564 seconds, which unfortunately is a major blocker. I have tried lots of things around max_requests_in_flight_per_env_runner, count_steps_by, and batch_mode, and no combination seems to have any impact on the env_runner_sampling_timer.
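
For context, this is roughly where those knobs live on my 2.47.1 config (example values only, continuing the config sketch above; the placement of count_steps_by under multi_agent() is my understanding of the current API and may differ by version):

```python
# Example values only; I cycled through the obvious combinations of these.
hpc_config.env_runners(
    batch_mode="truncate_episodes",            # also tried "complete_episodes"
    max_requests_in_flight_per_env_runner=1,   # throttle async sample() requests
)
hpc_config.multi_agent(count_steps_by="env_steps")  # also tried "agent_steps"
```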

Our action/obs spaces are complex dictionary spaces (action size ~20 with a mix of continuous and discrete actions plus action masking; obs size ~400).
Our environment is competitive (running in a league-play setup, potentially with a policy map driving self-play).
episode_len_mean is approximately 50 on iteration 1.
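
To give a concrete (but made-up) picture of the kind of spaces involved, they look roughly like this; the real ones are larger and domain-specific:

```python
import numpy as np
from gymnasium import spaces

# Hypothetical stand-in for our spaces: a ~400-float observation plus an action
# mask, and a mixed discrete/continuous dictionary action space.
observation_space = spaces.Dict({
    "observations": spaces.Box(-np.inf, np.inf, shape=(400,), dtype=np.float32),
    "action_mask": spaces.Box(0.0, 1.0, shape=(13,), dtype=np.float32),
})
action_space = spaces.Dict({
    "discrete": spaces.MultiDiscrete([4, 4, 3, 2]),                      # masked via "action_mask"
    "continuous": spaces.Box(-1.0, 1.0, shape=(16,), dtype=np.float32),
})
```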

Our rl_module/neural network is a stateful RNN-type model with two objects in the state dict.
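
Concretely, the state handling looks roughly like the fragment below (hypothetical, not our actual module; LSTM hidden/cell states stand in for our two state objects):

```python
import torch
from ray.rllib.core.rl_module.torch.torch_rl_module import TorchRLModule


class MyRecurrentModule(TorchRLModule):
    """Fragment only: setup() and the forward passes are omitted."""

    LSTM_CELL_SIZE = 256  # placeholder size

    def get_initial_state(self):
        # Two objects in the state dict: hidden state and cell state.
        return {
            "h": torch.zeros(self.LSTM_CELL_SIZE),
            "c": torch.zeros(self.LSTM_CELL_SIZE),
        }
```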

Using the Ray dashboard I was able to see a difference in how episodes are sampled depending on how batch_mode is set, but it didn’t seem to make a difference to the env_runner_sampling_timer when I tried the same thing on the HPC system.

Do you have any advice on how to debug this further and potentially find the cause of our issue? I’m currently going to try attaching a profiler to the entire stack to see if it finds anything.
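
My current plan is something along these lines: profile a single sampling round on an EnvRunner directly (this assumes `algo` is the built PPO Algorithm and that a local env runner exists; the remote runners execute the same sample() code path):

```python
import cProfile
import pstats

# Profile one sampling round through the same connector/inference/env code path
# that the remote EnvRunners run.
profiler = cProfile.Profile()
profiler.enable()
algo.env_runner.sample()
profiler.disable()

# Print the 30 most expensive call sites by cumulative time.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(30)
```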