1. Severity of the issue: (select one)
None: I’m just curious or want clarification.
Low: Annoying but doesn’t hinder my work.
Medium: Significantly affects my productivity but can find a workaround.
High: Completely blocks me.
2. Environment:
- Ray version: 2.47.1
- Python version: 3.10
- OS: Linux
- Cloud/Infrastructure: HPC systems using Singularity containerization
- Other libs/tools (if relevant):
Good morning. My team has been locked to ray/rllib 2.9.3 for quite a while and recently went through the huge effort of updating our code to ray/rllib 2.47.1.
We use PPO essentially 100% of the time, and as far as I can tell everything works as expected on my local laptop (20 env_runners, 1 CPU per env runner, batch size 12,000, 1 GPU for the backward pass); throughput is pretty close to 2.9.3.
However, when we move the same setup to an HPC system (150 env_runners, 1 CPU per env runner, batch size 1M, 1 GPU for the backward pass), we suddenly see a massive throughput decrease compared to ray 2.9.3. A rough sketch of the scaling-related parts of our config is below.
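For reference, here is a minimal sketch of the HPC-side config. Only the scaling knobs mentioned above are shown; the env name and everything else are placeholders, not our real settings:

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("my_league_env")  # placeholder for our registered env
    .env_runners(
        num_env_runners=150,        # 20 on the laptop run
        num_cpus_per_env_runner=1,
    )
    .learners(
        num_learners=1,
        num_gpus_per_learner=1,     # single GPU for the backward pass
    )
    .training(
        train_batch_size_per_learner=1_000_000,  # 12_000 on the laptop run
    )
)
algo = config.build()
```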
Timing metrics from training report the following numbers (exact same agent/environment configuration in both cases):
ray/rllib 2.9.3:
num_agent_steps_sampled: 1205206
num_agent_steps_trained: 1205206
num_env_steps_sampled: 807152
num_env_steps_trained: 807152
sampler_perf:
  mean_action_processing_ms: 0.873626950036023
  mean_env_render_ms: 0.0
  mean_env_wait_ms: 49.38276198896556
  mean_inference_ms: 8.130969697180445
  mean_raw_obs_processing_ms: 11.626552869819509
time_since_restore: 781.5206921100616
time_this_iter_s: 781.5206921100616
time_total_s: 781.5206921100616
timers:
  learn_throughput: 6263.067
  learn_time_ms: 128874.886
  sample_time_ms: 642445.515
  synch_weights_time_ms: 155.402
  training_iteration_time_ms: 772090.611
Note: I'm really not a fan of num_sgd_iter affecting a lot of these numbers in the new RLlib; it makes things very difficult to read.
ray 2.47.1:
learner_connector_sum_episodes_length_in: 682386
learner_connector_sum_episodes_length_out: 682386
num_env_steps_trained: 212222046
num_env_steps_trained_lifetime: 212222046
num_env_steps_trained_lifetime_throughput: 2677953.781207868
num_module_steps_trained: 37318802
num_module_steps_trained_lifetime: 37318802
num_module_steps_trained_lifetime_throughput: 470877.56540254137
num_module_steps_trained_throughput: 470877.5038728521
timers:
  env_runner_sampling_timer: 2240.198469405994
  learner_update_timer: 272.7168108657934
  restore_env_runners: 8.420972153544426e-05
  synch_weights: 0.044216078240424395
  training_iteration: 2564.2249606600963
  training_step: 2563.132981158793
learner_connector:
  connector_pipeline_timer: 188.97170732775703
  timers:
    connectors:
      add_columns_from_episodes_to_train_batch: 55.24026194307953
      add_observations_from_episodes_to_batch: 1.0445594252087176
      add_one_ts_to_episodes_and_truncate: 11.530067444778979
      add_states_from_episodes_to_batch: 4.431560833007097
      add_time_dim_to_batch_and_zero_pad: 102.05662126094103
      agent_to_module_mapping: 0.5788974780589342
      batch_individual_items: 6.675551616586745
      general_advantage_estimation: 7.1979670389555395
      numpy_to_tensor: 0.2147320620715618
      remove_batch_data_for_untrained_modules: 1.603970304131508e-05
module_to_env_connector:
  connector_pipeline_timer: 0.0029336511515817205
  timers:
    connectors:
      get_actions: 0.001722823150969477
      listify_data_for_vector_env: 2.7680069598418587e-05
      module_to_agent_unmapping: 1.0319576357620947e-05
      normalize_and_clip_actions: 0.00041019427389107194
      remove_single_ts_time_rank_from_batch: 0.0001966772473970731
      tensor_to_numpy: 0.00021458907787074528
      un_batch_to_individual_items: 0.00014141135139788053
rlmodule_inference_timer: 0.0011713467526806931
sample: 22.25390422265054
time_between_sampling: 20.692558764899296
env_reset_timer: 0.26785381696503074
env_step_timer: 0.04951261894976812
env_to_module_connector:
  connector_pipeline_timer: 0.0022030824949814736
  timers:
    connectors:
      add_observations_from_episodes_to_batch: 3.956975504311733e-05
      add_states_from_episodes_to_batch: 6.706504561875302e-05
      add_time_dim_to_batch_and_zero_pad: 0.00011553349485773592
      agent_to_module_mapping: 1.4613636671375326e-05
      batch_individual_items: 9.242227091070628e-05
      flatten_space_connector: 0.0014116075936986485
      numpy_to_tensor: 0.0001413176729359396
      original_space_flatten_obserations: 8.963949903429677e-05
From these numbers you can see we go from 772 seconds per iteration to 2564 seconds, which unfortunately is a major blocker. I have tried lots of combinations of max_requests_in_flight_per_env_runner, count_steps_by, and batch_mode (roughly what I've been toggling is sketched below), and no combination seems to make any impact on env_runner_sampling_timer.
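For concreteness, this is the kind of thing I have been changing (values shown are just examples from my sweeps, not an exhaustive list, and the setter locations assume the 2.47 AlgorithmConfig API):

```python
from ray.rllib.algorithms.ppo import PPOConfig

config = PPOConfig()
# Knobs I've been toggling; values here are examples, not the full sweep.
config.env_runners(
    max_requests_in_flight_per_env_runner=1,  # also tried higher values
    batch_mode="truncate_episodes",           # and "complete_episodes"
)
config.multi_agent(
    count_steps_by="env_steps",               # and "agent_steps"
)
```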
Some more context on our setup (a rough sketch of the spaces is below):
- Our action/obs spaces are complex dictionary spaces (roughly 20 action components mixing continuous and discrete with action masking; observation size around 400).
- Our environment is competitive (running in a league-play setup, potentially with a policy map enabling self-play).
- episode_len_mean is approximately 50 on iteration 1.
- Our RLModule/neural network is a stateful RNN-type model with 2 objects in the state dict.
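To give a rough picture of the spaces (names and exact sizes below are illustrative only; the real spaces are larger and nested differently):

```python
import gymnasium as gym
import numpy as np

# Illustrative only -- a Dict obs with an action mask, and a Dict action space
# mixing continuous and discrete parts.
observation_space = gym.spaces.Dict({
    "obs": gym.spaces.Box(-np.inf, np.inf, shape=(400,), dtype=np.float32),
    "action_mask": gym.spaces.MultiBinary(12),  # masks the discrete heads
})
action_space = gym.spaces.Dict({
    "continuous": gym.spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float32),
    "discrete": gym.spaces.MultiDiscrete([5] * 12),
})
```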
Using the Ray dashboard I was able to see a difference in how episodes are sampled depending on how batch_mode is set, but it didn't seem to make any difference to env_runner_sampling_timer when I tried the same thing on the HPC system.
Do you have any advice on how to debug this further and potentially find the cause of our issue? I'm currently going to try attaching a profiler to the entire stack and see if it finds anything; the rough plan is sketched below.
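This is roughly what I have in mind, assuming py-spy is available inside the Singularity containers and that the env-runner actors can be located via the Ray state API (the "EnvRunner" class-name match is a guess for our multi-agent setup):

```python
from ray.util.state import list_actors

# Locate the env-runner actors and print a py-spy command to run on each
# actor's node while an iteration is in flight.
for actor in list_actors(filters=[("state", "=", "ALIVE")], limit=1000):
    if "EnvRunner" in actor.class_name:
        print(
            f"node={actor.node_id}  "
            f"py-spy record --pid {actor.pid} --duration 120 "
            f"-o env_runner_{actor.pid}.svg"
        )
```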