However, is there any example reference for exploration functions in the new RLModule syntax? It would be great if users could get an example, at least for the quite common epsilon-greedy variant.
You are correct: in the new RLlib API stack, the exploration_config setting (e.g., for EpsilonGreedy) is deprecated and can no longer be used. Instead, you implement custom exploration directly by overriding the _forward_exploration() method of your RLModule subclass. This method defines how actions are sampled during exploration and is the recommended way to control exploration behavior, including epsilon-greedy, under the RLModule API. The official migration guide has no built-in epsilon-greedy example for the RLModule context; the approach is to implement the epsilon-greedy logic yourself inside _forward_exploration() of your RLModule class, using the current timestep to anneal epsilon as needed. See the migration guide for confirmation and further explanation.
Would you like a step-by-step code example of how to implement epsilon-greedy exploration in a custom RLModule?
Certainly! To implement epsilon-greedy exploration in the new RLModule API, subclass TorchRLModule and override the _forward_exploration method. Inside this method, compute Q-values, sample a random action with probability epsilon, and otherwise select the greedy action. You can use a schedule for epsilon decay as needed. Here’s a minimal example:
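Since a full RLlib/torch setup cannot be assumed here, below is a minimal plain-Python sketch of the two pieces you would put inside `_forward_exploration()`: the epsilon-greedy action pick and a linear epsilon annealing schedule. In an actual `TorchRLModule` subclass, `q_values` would come from your network applied to the batch observations, and the chosen actions would be returned in the forward output dict; the function names and schedule parameters below are illustrative, not RLlib API.

```python
import random

# Sketch of the logic for _forward_exploration() in a custom RLModule.
# In a real TorchRLModule, q_values would be computed by your Q-network
# from the batch observations, and this selection would run per batch row.

def annealed_epsilon(timestep, initial=1.0, final=0.05, anneal_timesteps=10_000):
    """Linearly anneal epsilon from `initial` to `final` over `anneal_timesteps`."""
    frac = min(timestep / anneal_timesteps, 1.0)
    return initial + frac * (final - initial)

def epsilon_greedy_action(q_values, epsilon, rng=random):
    """With probability epsilon pick a uniform random action, else the argmax."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

# Usage inside _forward_exploration() (conceptually):
#   eps = annealed_epsilon(current_timestep)
#   action = epsilon_greedy_action(q_values_for_obs, eps)
```

The same structure carries over to torch: replace the argmax with `q_values.argmax(dim=-1)` and draw the random actions with `torch.randint`, masking between the two with a Bernoulli(epsilon) sample per batch row.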
Certainly! For PPO with discrete action masking in the new RLModule API, you should subclass ActionMaskingTorchRLModule and override the _forward_exploration method to add epsilon-greedy exploration. The key is to mask invalid actions and then sample randomly with probability epsilon, or take the greedy action otherwise. Here’s a step-by-step code example based on action_masking_rlm.py:
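The action_masking_rlm.py example itself is not reproduced here, so the following plain-Python sketch shows only the masked selection step you would perform inside `_forward_exploration()`: restrict both the random draw and the greedy argmax to valid actions. In the real module the mask comes from the observation dict, and in torch you would typically mask by adding a large negative value to invalid logits; the function name and mask convention (1 = valid) below are illustrative assumptions.

```python
import random

# Sketch: epsilon-greedy over a discrete action space with an action mask.
# action_mask[a] == 1 means action a is currently valid.

def masked_epsilon_greedy(q_values, action_mask, epsilon, rng=random):
    """Sample uniformly among valid actions with prob. epsilon,
    otherwise take the valid action with the highest Q-value."""
    valid = [a for a, m in enumerate(action_mask) if m]
    if not valid:
        raise ValueError("action_mask has no valid actions")
    if rng.random() < epsilon:
        return rng.choice(valid)
    return max(valid, key=lambda a: q_values[a])
```

Note that masking must apply to both branches: sampling the random action from the unmasked space would occasionally emit invalid actions even when the greedy path is correctly masked.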