Scaling curiosity-like exploration modules on multiple workers


I am thinking about extending the current Curiosity exploration module for: a) multiple workers and b) further algorithms (RND, NGU, etc.).

However, currently there is an assertion that allows only a single worker. Is it because of the extra models (forward, inverse, feature), which are created inside the module and therefore are not shared between workers? Do you have on your roadmap improving this or could you share some ideas so I can work on this and prepare a PR?

As a workaround I am thinking of embedding these 3 models in my policy’s model as separate heads (with, or without, some layers being shared) and using them in the exploration module, instead of creating new models there. What do you think, is it going to work?

Hi @iamhatesz, this would be a super great enhancement to have!
The current limitation of num_workers=0 comes from the fact that, at least for PPO, the updates for the ICM module need to be done quite frequently (more frequently than the “main” policy model’s updates and also w/o the PPO-typical subsampling). This forced us to place the update step into the on_postprocess_trajectory callback. That callback would usually run on the remote workers, so we would then have to broadcast all of the ICM’s weights back to the learner to update it.

I think one possible solution for now is to make the ICM module work just like our A3C (for which the original paper was actually written):

  1. Do the ICM gradient calculations (no complete update steps) inside the on_postprocess_trajectory function calls, but allow these to happen on any number of remote workers.
  2. Send the gradients to the local worker (driver) for averaging and apply them on the main copy of the ICM in the local worker.
  3. Broadcast the “main” ICM’s new weights back to all remote workers’ ICMs.
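
The three steps above can be sketched in plain NumPy, without any RLlib code. This is only an illustration under loud assumptions: the "ICM" is reduced to a single linear forward model, the workers are simulated in a loop, and the averaging is synchronous rather than asynchronous as in A3C. All names (`icm_forward_loss_grad`, `average_gradients`, `true_dynamics`) are made up for this example.

```python
import numpy as np

def icm_forward_loss_grad(weights, features, next_features):
    """Gradient of 0.5 * mean squared forward-model error w.r.t. weights."""
    err = features @ weights - next_features
    return features.T @ err / len(features)

def average_gradients(grads):
    """Step 2: the driver averages the gradients from all workers."""
    return np.mean(grads, axis=0)

rng = np.random.default_rng(0)
dim, num_workers, lr = 4, 3, 0.1
true_dynamics = rng.normal(size=(dim, dim))  # hypothetical environment dynamics
main_weights = np.zeros((dim, dim))          # "main" ICM copy on the driver
worker_weights = [main_weights.copy() for _ in range(num_workers)]

for _ in range(200):
    grads = []
    for w in worker_weights:
        # Step 1: each worker computes gradients only (no update step)
        # on the transitions it collected itself.
        feats = rng.normal(size=(32, dim))
        next_feats = feats @ true_dynamics
        grads.append(icm_forward_loss_grad(w, feats, next_feats))
    # Step 2: average on the driver and apply to the main copy.
    main_weights = main_weights - lr * average_gradients(grads)
    # Step 3: broadcast the new main weights back to every worker.
    worker_weights = [main_weights.copy() for _ in range(num_workers)]

final_error = float(np.abs(main_weights - true_dynamics).max())
print("max weight error:", final_error)
```

In a real setup the inner loop would run on the remote workers and the broadcast would go through whatever weight-syncing mechanism the execution plan provides.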

I realize this is a rough-edged description of what needs to be done. This solution would require users to change the execution plan each time they use curiosity in their algorithms. We don’t have callbacks or hooks inside the execution plan yet, so changing only some part of it always requires you to provide a completely new plan.

As a primer, you can look at A3C’s AsyncGradients mechanism. This mechanism should be applied to the ICM updates. That way, the rest of the algo (policy learning) can stay independent of the ICM.

@sven1977 cool, thanks a lot for your answer. I started looking at the code of A3C and AsyncGradients and I think I’ve got the idea.

One thing I don’t understand from your response is why PPO+ICM requires more frequent updates of the ICM networks. Could you elaborate on this or point me to a reference where I can find more info about this?

Hey @iamhatesz, cool cool, thanks for looking into this.

I did first try to place the ICM update together with the “normal” PPO update (e.g. 4000 batch → learn_on_batch causes both policy and ICM to get updated), but this didn’t work/learn well.

The ICM needs to be updated more frequently, otherwise, the calculated intrinsic rewards are advancing too fast (approaching 0.0 quickly, as the transition dynamics models get very good at predicting the next state) and PPO will not learn (given external rewards are all 0.0).
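
To make the parenthetical concrete, here is a toy sketch of how the intrinsic reward shrinks as a forward model gets better at predicting the next state. It assumes a linear forward model and defines the intrinsic reward as half the squared prediction error in feature space; the data, dimensions, and learning rate are invented for illustration and are not taken from the Curiosity module.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 4
true_dynamics = rng.normal(size=(dim, dim))  # hypothetical dynamics
features = rng.normal(size=(64, dim))        # stand-in for phi(s)
next_features = features @ true_dynamics     # stand-in for phi(s')

def intrinsic_reward(weights):
    """Per-transition reward: 0.5 * squared forward-model prediction error."""
    err = features @ weights - next_features
    return 0.5 * np.sum(err ** 2, axis=1)

weights = np.zeros((dim, dim))
history = []
for _ in range(100):
    history.append(float(intrinsic_reward(weights).mean()))
    # One full-batch gradient step on the forward-model loss.
    grad = features.T @ (features @ weights - next_features) / len(features)
    weights = weights - 0.1 * grad

print("mean intrinsic reward, first vs. last step:", history[0], history[-1])
```

Once the forward model fits the transitions, the reward signal the policy sees is close to zero, which is why the relative timing of ICM and policy updates matters.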

@sven1977 cool, that sounds reasonable. However, I found the following comment in the RND paper (appendix, p. 15):

Initial preliminary experiments with RND were run with only 32 parallel environments. We expected that increasing the number of parallel environments would improve performance by allowing the policy to adapt more quickly to transient intrinsic rewards. This effect could have been mitigated however if the predictor network also learned more quickly. To avoid this situation when scaling up from 32 to 128 environments we kept the effective batch size for the predictor network the same by randomly dropping out elements of the batch with keep probability 0.25. Similarly in our experiments with 256 and 1,024 environments we dropped experience for the predictor with respective probabilities 0.125 and 0.03125.
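
The keep probabilities in the quote are consistent with holding the predictor's effective batch at the 32-environment level: 32/128 = 0.25, 32/256 = 0.125, 32/1024 = 0.03125. A minimal sketch of that subsampling follows; the function names and the flat integer "batch" are assumptions for illustration, not code from the paper.

```python
import numpy as np

BASE_NUM_ENVS = 32  # number of environments in the paper's preliminary runs

def keep_probability(num_envs, base_num_envs=BASE_NUM_ENVS):
    """Probability of keeping a sample so the expected batch size stays fixed."""
    return min(1.0, base_num_envs / num_envs)

def subsample_for_predictor(batch, num_envs, rng):
    """Randomly drop elements of the batch, keeping each independently."""
    mask = rng.random(len(batch)) < keep_probability(num_envs)
    return batch[mask]

rng = np.random.default_rng(0)
batch = np.arange(128 * 16)  # e.g. 16 steps from each of 128 environments
predictor_batch = subsample_for_predictor(batch, num_envs=128, rng=rng)
print(len(batch), "->", len(predictor_batch))
```

The policy would still train on the full `batch`; only the predictor (or, in the ICM case, the forward/inverse/feature models) would see `predictor_batch`.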

This sounds like it addresses a similar problem. Maybe it is worth trying the idea for ICM as well, i.e. training the extra models along with the policy model, but on a limited number of samples. I am thinking of giving it a try by:

a) adding the extra loss calculation to Exploration.get_exploration_loss (batch limitation will happen here)
b) sharing the ICM models by adding them into the PPO’s policy model as different heads, so weights could be synced between workers.
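
A rough sketch of idea b), assuming nothing about RLlib's actual model classes: if the three ICM networks live in the same parameter container as the policy head, a single get/set of weights syncs everything at once. `PolicyModelWithICMHeads` and its parameter names are hypothetical, and each "network" is reduced to a single weight matrix.

```python
import numpy as np

class PolicyModelWithICMHeads:
    """Hypothetical container holding policy and ICM parameters together."""

    def __init__(self, obs_dim, feature_dim, num_actions, seed=0):
        rng = np.random.default_rng(seed)
        self.params = {
            "policy_head": rng.normal(size=(obs_dim, num_actions)),
            "icm_feature": rng.normal(size=(obs_dim, feature_dim)),
            "icm_forward": rng.normal(size=(feature_dim + num_actions, feature_dim)),
            "icm_inverse": rng.normal(size=(2 * feature_dim, num_actions)),
        }

    def get_weights(self):
        # Copies, so the caller cannot mutate the model's parameters in place.
        return {name: value.copy() for name, value in self.params.items()}

    def set_weights(self, weights):
        self.params = {name: value.copy() for name, value in weights.items()}

learner = PolicyModelWithICMHeads(obs_dim=8, feature_dim=4, num_actions=3, seed=0)
worker = PolicyModelWithICMHeads(obs_dim=8, feature_dim=4, num_actions=3, seed=1)

# One sync call now covers the policy head and all three ICM heads.
worker.set_weights(learner.get_weights())
```

Whether the heads share layers with the policy or not would only change what goes into `params`; the sync path stays the same.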

Does it make sense to you?