Scaling curiosity-like exploration modules on multiple workers

Hi.

I am thinking about extending the current Curiosity exploration module to support: a) multiple workers and b) further algorithms (RND, NGU, etc.).

However, there is currently an assertion that allows only a single worker. Is that because of the extra models (forward, inverse, feature), which are created inside the module and therefore not shared between workers? Is improving this on your roadmap, or could you share some ideas so I can work on it and prepare a PR?

As a workaround, I am thinking of embedding these 3 models in my policy’s model as separate heads (with or without some layers being shared) and using them in the exploration module, instead of creating new models there. What do you think, is it going to work?
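Roughly what I have in mind, as a sketch (Torch; the class and attribute names are placeholders, and I’m assuming a discrete action space):

    import numpy as np
    import torch.nn as nn
    from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
    from ray.rllib.models.torch.fcnet import FullyConnectedNetwork

    class PolicyModelWithICMHeads(TorchModelV2, nn.Module):
        def __init__(self, obs_space, action_space, num_outputs, model_config, name):
            TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                                  model_config, name)
            nn.Module.__init__(self)
            # The usual policy/value network.
            self.policy_net = FullyConnectedNetwork(
                obs_space, action_space, num_outputs, model_config, name + "_pi")
            # ICM heads living inside the policy model, so the regular worker
            # weight syncing also covers them.
            obs_dim = int(np.prod(obs_space.shape))
            feat_dim = 64
            self.feature_net = nn.Sequential(nn.Linear(obs_dim, feat_dim), nn.ReLU())
            self.forward_net = nn.Linear(feat_dim + action_space.n, feat_dim)
            self.inverse_net = nn.Linear(2 * feat_dim, action_space.n)

        def forward(self, input_dict, state, seq_lens):
            return self.policy_net(input_dict, state, seq_lens)

        def value_function(self):
            return self.policy_net.value_function()

The model would still have to be registered via ModelCatalog, and the exploration module would then use the heads instead of creating its own nets.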

Hi @iamhatesz, this would be a really great enhancement to have!
The current limitation of num_workers=0 comes from the fact that - at least for PPO - the updates for the ICM module need to happen quite frequently (more frequently than the “main” policy model’s updates, and also without the PPO-typical subsampling). This forced us to place the update step into the on_postprocess_trajectory callback, which would usually run on the remote workers, so we would then have to broadcast all of the ICM’s weights back to the learner to update its copy.
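To make this concrete, the pattern is roughly the following (not the actual Curiosity code; `update_icm` is a made-up helper that would do one SGD step on the ICM nets). With num_workers > 0, this callback would run on every remote worker, so each worker would update its own ICM copy and those weights would have to be synced again.

    from ray.rllib.agents.callbacks import DefaultCallbacks

    class IcmUpdateCallbacks(DefaultCallbacks):
        def on_postprocess_trajectory(self, *, worker, episode, agent_id,
                                      policy_id, policies, postprocessed_batch,
                                      original_batches, **kwargs):
            policy = policies[policy_id]
            # Hypothetical helper: one SGD step on the ICM (feature/forward/
            # inverse) nets, using the just-collected trajectory batch.
            policy.exploration.update_icm(postprocessed_batch)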

I think one possible solution for now is to make the ICM module work just like our A3C (for which the original paper was actually written):

  1. Do the ICM gradient calculations (no complete update steps) inside the on_postprocess_trajectory calls, but allow these to happen on any number of remote workers.
  2. Send the gradients to the local worker (driver) for averaging and apply them to the main copy of the ICM there.
  3. Broadcast the “main” ICM’s new weights back to all remote workers’ ICMs (see the rough sketch below).
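In very rough terms, the data flow would look like this (the `icm_*` worker methods and the ICM helper methods are made up, just for illustration):

    import numpy as np
    import ray

    def icm_update_step(local_icm, remote_workers):
        # 1. Each remote worker computes ICM gradients on its latest batch
        #    (e.g. inside on_postprocess_trajectory) and ships them back.
        grads_per_worker = ray.get(
            [w.icm_compute_gradients.remote() for w in remote_workers])

        # 2. Average the gradients on the driver and apply them to the
        #    main ICM copy held there.
        avg_grads = [
            np.mean([g[i] for g in grads_per_worker], axis=0)
            for i in range(len(grads_per_worker[0]))
        ]
        local_icm.apply_gradients(avg_grads)

        # 3. Broadcast the updated ICM weights back to all remote workers.
        new_weights = local_icm.get_weights()
        for w in remote_workers:
            w.icm_set_weights.remote(new_weights)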

I realize this is a rough description of what needs to be done. This solution would require users to change the execution plan each time they use curiosity in their algorithms. We don’t have callbacks or hooks inside the execution plan yet, so changing only some part of it always requires providing a completely new plan.

As a primer, you can look at A3C’s AsyncGradients mechanism. This mechanism should be applied to the ICM updates. That way, the rest of the algo (policy learning) can stay independent of the ICM.
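For reference, A3C’s execution plan currently looks roughly like this (module paths may differ between RLlib versions); a similar asynchronous-gradient pattern could be used for the ICM updates only:

    from ray.rllib.execution.rollout_ops import AsyncGradients
    from ray.rllib.execution.train_ops import ApplyGradients
    from ray.rllib.execution.metric_ops import StandardMetricsReporting

    def execution_plan(workers, config):
        # Gradients are computed asynchronously on the remote workers ...
        grads = AsyncGradients(workers)
        # ... and applied on the local worker as they arrive (only the sending
        # worker gets the new weights right away).
        train_op = grads.for_each(ApplyGradients(workers, update_all=False))
        return StandardMetricsReporting(train_op, workers, config)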

@sven1977 cool, thanks a lot for your answer. I started looking at the code of A3C and AsyncGradients and I think I’ve got the idea.

One thing I don’t understand from your response is why PPO+ICM requires more frequent updates of the ICM networks. Could you elaborate on this or point me to a reference where I can find more info about it?

Hey @iamhatesz, cool, thanks for looking into this.

I did first try to place the ICM update together with the “normal” PPO update (e.g. a 4000-sample batch → learn_on_batch updates both the policy and the ICM), but this didn’t work/learn well.

The ICM needs to be updated more frequently; otherwise, the calculated intrinsic rewards advance too fast (approaching 0.0 quickly, as the transition-dynamics models get very good at predicting the next state) and PPO will not learn (given the external rewards are all 0.0).

@sven1977 cool, that sounds reasonable. However, I found the following comment in the RND paper (appendix, p. 15):

Initial preliminary experiments with RND were run with only 32 parallel environments. We expected that increasing the number of parallel environments would improve performance by allowing the policy to adapt more quickly to transient intrinsic rewards. This effect could have been mitigated however if the predictor network also learned more quickly. To avoid this situation when scaling up from 32 to 128 environments we kept the effective batch size for the predictor network the same by randomly dropping out elements of the batch with keep probability 0.25. Similarly in our experiments with 256 and 1,024 environments we dropped experience for the predictor with respective probabilities 0.125 and 0.03125.

This sounds like it addresses a similar problem. Maybe it is worth trying the idea for ICM as well, i.e. training the extra models along with the policy model, but on a limited number of samples. I’m thinking of giving it a try by:

a) adding the extra loss calculation to Exploration.get_exploration_loss (the batch limitation would happen here; see the sketch below),
b) sharing the ICM models by adding them to the PPO policy model as separate heads, so the weights can be synced between workers.
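For a), the batch limitation could look something like this (a rough torch sketch; `per_sample_icm_loss` would be the per-sample forward + inverse loss):

    import torch

    def limited_icm_loss(per_sample_icm_loss: torch.Tensor,
                         keep_prob: float = 0.25) -> torch.Tensor:
        # Keep each sample with probability keep_prob, so the ICM effectively
        # trains on a smaller batch than the policy (cf. the RND appendix).
        mask = (torch.rand_like(per_sample_icm_loss) < keep_prob).float()
        # Average only over the kept samples (avoid division by zero).
        return (per_sample_icm_loss * mask).sum() / mask.sum().clamp(min=1.0)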

Does it make sense to you?

I’m facing the same issue; have you gotten any further with this?
Also, GPU training doesn’t work with the ICM, so my training is significantly slower, which isn’t ideal.

I was wondering (based on the comment you quoted from the RND paper): how about having different learning rates instead of dropping elements from the batch? That might be simpler to implement, and it’s a common approach to balancing learning in GANs. For IMPALA there’s already an option to separate the actor and critic optimizers:

    # Set this to true to have two separate optimizers optimize the policy-
    # and value networks.
    "_separate_vf_optimizer": False,
    # If _separate_vf_optimizer is True, define separate learning rate
    # for the value network.
    "_lr_vf": 0.0005,

Yeah, I did some experiments at that time, but without any success, at least for my problem. I used the trick of adding a separate ICM head to the policy model.

Cool, thanks for the info.
By “without any success”, do you mean it didn’t work from an implementation point of view or from an algorithmic point of view (i.e. it didn’t solve the exploration issue)?