How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
I have a custom model and environment for DQN. I do a constraint check on the logit with the highest value produced by the forward pass of my model. I only want to do this for the logits with the highest values, because my action space is large and evaluating the constraint is relatively expensive.
The strategy I use is: check whether the action with the highest logit satisfies the constraint. If the constraint is not met, set that logit to -inf and check the next-highest logit; if the constraint is met, stop checking. This means there may be logits that violate the constraint but still have a value greater than -inf. That does not matter for the argmax, of course. During exploration with the EpsilonGreedy class, however, those actions could still be chosen, since only logits with a value equal to or smaller than FLOAT_MIN are filtered out (I use PyTorch).
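For reference, the masking loop described above can be sketched in plain PyTorch. The `is_feasible` callback here is a stand-in for my actual (expensive) constraint check:

```python
import torch

def mask_until_feasible(logits, is_feasible):
    """Check actions from highest to lowest logit.

    `is_feasible(action)` is a placeholder for the real constraint
    check. Infeasible logits are set to -inf so argmax skips them;
    logits that were never checked keep their original value.
    """
    logits = logits.clone()
    while True:
        best = torch.argmax(logits).item()
        if logits[best] == float("-inf"):
            raise RuntimeError("no feasible action found")
        if is_feasible(best):
            # Stop at the first feasible action; lower-valued logits
            # are left unchecked (and possibly infeasible).
            return logits, best
        logits[best] = float("-inf")
```

So for `logits = [1.0, 3.0, 2.0]` with only action 2 feasible, action 1 is masked to -inf, action 2 is returned, and action 0 is never checked.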
My idea was to create a custom exploration class and apply the constraint check in its get_exploration_action() method. However, I need the state (input_dict) for my constraint check, and it is not available there.
I have seen that get_exploration_action() is called from the _compute_action_helper() method of the TorchPolicy class. The input_dict is available in that method, so I could override it in combination with a custom exploration class. However, I don't know whether other options might be better; overriding _compute_action_helper() feels like changing RLlib's internal structure.
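One alternative I considered, sketched here outside RLlib's actual class hierarchy (all names below are made up for illustration): let the model cache its latest input during forward(), and have the exploration step read that cached state instead of needing input_dict passed in:

```python
import torch
import torch.nn as nn

class ConstraintAwareModel(nn.Module):
    """Sketch: the model stashes its latest observation so a custom
    exploration step can run the constraint check later."""

    def __init__(self, obs_dim, num_actions):
        super().__init__()
        self.fc = nn.Linear(obs_dim, num_actions)
        self.last_obs = None  # cached state for the exploration step

    def forward(self, obs):
        self.last_obs = obs  # cache the state for later constraint checks
        return self.fc(obs)

def exploration_action(model, logits, epsilon, is_feasible):
    """Epsilon-greedy pick restricted to feasible actions, using the
    state cached on the model (a sketch, not RLlib's EpsilonGreedy)."""
    obs = model.last_obs
    feasible = torch.tensor(
        [is_feasible(obs, a) for a in range(logits.shape[-1])]
    )
    masked = torch.where(feasible, logits, torch.tensor(float("-inf")))
    if torch.rand(()) < epsilon:
        # Random branch: sample uniformly among feasible actions only.
        choices = torch.nonzero(feasible).flatten()
        return choices[torch.randint(len(choices), ())].item()
    # Greedy branch: argmax over the feasibility-masked logits.
    return torch.argmax(masked).item()
```

This would check the constraint for every action rather than stopping at the first feasible one, so it trades my early-exit optimization for not having to touch _compute_action_helper(). I'm not sure whether caching state on the model like this is considered acceptable in RLlib either.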
How could I solve this problem?