APPO learner spends a really long time in sampling/deserialization

How severe does this issue affect your experience of using Ray?

  • High: It blocks me to complete my task.

I’m running RL training locally on Windows using RLlib and Unreal. I’m using an APPO agent and will eventually move to distributed training. I also have a Linux build in development, where hopefully things will get better.

Before going distributed, I want to make sure I’m not bottlenecked by something else first. Here is what I see (and yes, my workers have generated far more agent timesteps than the learner has consumed).

What I see is that my learner spends a ton of time in dequeue_timer, and it iterates much more slowly when I either add more workers or increase the rollout batch size, even though my computer is only at 30% CPU/memory utilization. The learner actually consumes much less data when I increase the number of workers or the rollout size, but it is not sensitive to its own batch size.

Things I have tried that made no difference:

  • tested different game state sizes: it makes no difference whether my observation is large, high-dimensional ray cast data with a CNN model or just 3 trivial numbers with a simple NN model
  • tested setting a higher num_cpus_for_driver
  • tested setting replay_buffer_num_slots to both small and large values

I’d like some suggestions on how to debug this. I want to add some tracking of the state of APPO’s replay buffer, how much time sampling takes, and so on. Is there some lock involved? Is this a Windows issue related to pickle?
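For the timing question, one generic way to separate "the consumer is blocked waiting for data" from "serialization is slow" is to time those two things independently. The following is a minimal sketch using only the Python standard library. `TimedQueue`, `run_demo`, and `pickle_roundtrip_s` are hypothetical names made up for illustration; none of this hooks into APPO's learner thread or its dequeue_timer, and Ray's own object serialization differs from plain pickle, so the pickle number is only a rough proxy.

```python
import pickle
import queue
import threading
import time


class TimedQueue:
    """Wraps queue.Queue and records how long get() blocks waiting for data."""

    def __init__(self, maxsize=0):
        self._q = queue.Queue(maxsize=maxsize)
        self.wait_s = 0.0  # total wall-clock time spent inside get()
        self.gets = 0      # number of successful get() calls

    def put(self, item):
        self._q.put(item)

    def get(self, timeout=None):
        start = time.perf_counter()
        item = self._q.get(timeout=timeout)  # raises queue.Empty on timeout
        self.wait_s += time.perf_counter() - start
        self.gets += 1
        return item

    def qsize(self):
        return self._q.qsize()


def run_demo(produce_delay_s=0.01, n_items=5):
    """Hypothetical slow producer and fast consumer sharing a TimedQueue."""
    q = TimedQueue()

    def producer():
        for i in range(n_items):
            time.sleep(produce_delay_s)  # stands in for rollout time
            q.put(i)

    t = threading.Thread(target=producer)
    t.start()
    items = [q.get(timeout=5) for _ in range(n_items)]
    t.join()
    return q, items


def pickle_roundtrip_s(obj, repeats=10):
    """Average seconds for one pickle dumps + loads round trip of obj."""
    start = time.perf_counter()
    for _ in range(repeats):
        pickle.loads(pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL))
    return (time.perf_counter() - start) / repeats
```

If most of the time in `get()` is wait time while the queue size stays near zero, the consumer is starved rather than slow. If the queue stays full and the round-trip time per batch is large, serialization is the more likely suspect.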

Anyway, my job is launched in a pretty standard way: config=trial_config, **tune_config)

Here is some chunk of the config that might be relevant.

"disable_env_checking": True, 
"create_env_on_driver": False,
"framework": "torch",
"_disable_execution_plan_api": True,
"num_gpus": 0,
"num_cpus_for_driver": 3,
"num_cpus_per_worker": 1,
"num_workers": 5,
"evaluation_num_workers": 0,
"num_envs_per_worker": 1,
"min_time_s_per_iteration": 1,
# Learning
#"exploration_config": {},
"batch_mode": "truncate_episodes",
"rollout_fragment_length": 128,
"replay_buffer_num_slots": 10000,
"gamma": 0.995,
"vf_loss_coeff": 0.5,
"entropy_coeff": 3e-4,
"clip_param": 0.3,
"grad_clip": 10,
"train_batch_size": 1024,
"lr": 3e-4,

My episodes are truncated to 400 steps max. If I switch rollout_fragment_length from 128 to 256 with the same train_batch_size, the learner does only 60% of the iterations in the same 30 s.
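To put numbers on the two settings being compared, here is a small arithmetic sketch based only on the config values above. The helper names are hypothetical, and the calculation assumes that a train batch is assembled from whole rollout fragments; it is plain arithmetic on the config, not a description of how APPO builds batches internally.

```python
def fragments_per_train_batch(train_batch_size, rollout_fragment_length):
    """Whole fragments needed to fill one train batch (ceiling division)."""
    return -(-train_batch_size // rollout_fragment_length)


def steps_per_sampling_round(num_workers, num_envs_per_worker,
                             rollout_fragment_length):
    """Env steps produced when every worker returns one fragment per env."""
    return num_workers * num_envs_per_worker * rollout_fragment_length


for fragment_length in (128, 256):
    n_frag = fragments_per_train_batch(1024, fragment_length)
    n_steps = steps_per_sampling_round(5, 1, fragment_length)
    print(fragment_length, n_frag, n_steps)
```

With train_batch_size 1024, a fragment length of 128 needs 8 fragments per train batch and 256 needs 4, while one round across the 5 workers yields 640 and 1280 steps respectively. Because the amount of data per fragment changes, comparing the two runs by steps rather than by iteration count gives a fairer picture.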

Any suggestions on how to debug this would be appreciated! :smiling_face_with_tear:

Hi, I want to close this issue now. After understanding that a training_iteration here isn’t a single batch, I found other ways to verify that the learner can indeed keep up with the workers’ data output on a single machine, as it should. I do wish a training iteration had a fixed numeric meaning, though!
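For anyone landing here later, one way to make this kind of check without relying on the iteration count is to compare step counters between two result dicts. This is a minimal sketch, not the method actually used above. It assumes the results expose cumulative counters under keys such as "num_env_steps_sampled" and "num_env_steps_trained" plus "time_total_s"; those key names are assumptions and may differ between RLlib versions, so check what your own result dict contains. The numbers in the example are made up for illustration, not measurements.

```python
def steps_per_second(prev, curr, key, time_key="time_total_s"):
    """Rate of change of a cumulative counter between two result dicts."""
    dt = curr[time_key] - prev[time_key]
    if dt <= 0:
        raise ValueError("curr must be later than prev")
    return (curr[key] - prev[key]) / dt


def learner_keeps_up(prev, curr,
                     sampled_key="num_env_steps_sampled",   # assumed key name
                     trained_key="num_env_steps_trained",   # assumed key name
                     tolerance=0.9):
    """Return (ok, sampled_rate, trained_rate) for the interval prev..curr."""
    sampled = steps_per_second(prev, curr, sampled_key)
    trained = steps_per_second(prev, curr, trained_key)
    return trained >= tolerance * sampled, sampled, trained


# Made-up example values, 30 s apart.
prev = {"time_total_s": 30.0,
        "num_env_steps_sampled": 19200, "num_env_steps_trained": 18432}
curr = {"time_total_s": 60.0,
        "num_env_steps_sampled": 38400, "num_env_steps_trained": 36864}
ok, sampled_rate, trained_rate = learner_keeps_up(prev, curr)
print(ok, sampled_rate, trained_rate)
```

Rates measured in steps per second stay comparable when rollout_fragment_length or the number of workers changes, which the number of training iterations does not.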