Use state for constraint check in exploration

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I have a custom model and environment for DQN. I run a constraint check on the logit with the highest value from my model's forward pass. I only want to do this for the highest-valued logits, because my action space is large and evaluating the constraint takes a relatively long time.

The strategy I use when checking is: check whether the logit with the highest value meets the constraint. If the constraint is not met, I set that logit's value to -inf and check the next one (the new highest-valued logit); if the constraint is met, I stop checking. This means there may be logits that do not meet the constraint but still have a value greater than -inf. That does not matter for the argmax, of course. During exploration with the EpsilonGreedy class, however, these invalid actions could still be chosen, since EpsilonGreedy only filters out logits with a value less than or equal to FLOAT_MIN (I use PyTorch).
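The masking loop described above can be sketched in plain PyTorch. The constraint function and logit values here are made up for illustration; the `FLOAT_MIN` below is a stand-in for the constant RLlib uses for masking:

```python
import torch

FLOAT_MIN = torch.finfo(torch.float32).min  # stand-in for RLlib's masking constant

def mask_until_valid(logits: torch.Tensor, constraint_ok) -> torch.Tensor:
    """Repeatedly set the highest logit to FLOAT_MIN until the current
    argmax satisfies the constraint. Logits ranked below the first valid
    action are never checked, so invalid actions among them keep values
    > FLOAT_MIN. Assumes at least one action satisfies the constraint."""
    logits = logits.clone()
    while True:
        best = torch.argmax(logits).item()
        if constraint_ok(best):
            return logits
        logits[best] = FLOAT_MIN

# hypothetical constraint: only action 2 is valid
logits = torch.tensor([0.1, 2.0, 1.5, 3.0])
masked = mask_until_valid(logits, lambda a: a == 2)
# actions 3 and 1 get masked; action 0 stays at 0.1 even though it is invalid
```

This is exactly why plain EpsilonGreedy can still pick action 0 here: its value never dropped to FLOAT_MIN, so it survives the filter.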

My idea was to create a custom exploration class and apply the constraint check in the method get_exploration_action(). However, I need the state (input_dict) for my constraint check.

I have seen that get_exploration_action() is called from the _compute_action_helper() method of the TorchPolicy class. The input_dict is available in that method, so I could override it in combination with a custom exploration class. However, I don't know whether other options might be better; overriding _compute_action_helper() feels like reaching into RLlib's internal structure.
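One alternative that avoids touching _compute_action_helper(): since you already have a custom model, and RLlib exploration objects hold a reference to the model, the model's forward() could stash the latest input_dict on itself for the exploration to read. A framework-free sketch of the idea (the class names, dummy logits, and constraint below are all made up; the real classes would subclass RLlib's TorchModelV2 and EpsilonGreedy, which is worth verifying against your RLlib version):

```python
import torch

class StashModel:
    """Stands in for a custom TorchModelV2: forward() keeps a reference
    to the latest input_dict so other components can read the state."""
    def forward(self, input_dict):
        self.last_input_dict = input_dict          # stash the state
        return input_dict["obs"] @ torch.eye(4)    # dummy logits

class ConstrainedExploration:
    """Stands in for a custom exploration class; RLlib explorations
    receive the model at construction time."""
    def __init__(self, model):
        self.model = model

    def get_exploration_action(self, logits):
        obs = self.model.last_input_dict["obs"]    # state for the constraint check
        # hypothetical constraint: only actions with non-negative obs entries
        valid = [a for a in range(logits.shape[-1]) if obs[0, a] >= 0]
        return max(valid, key=lambda a: logits[0, a].item())

model = StashModel()
explo = ConstrainedExploration(model)
obs = torch.tensor([[0.5, -1.0, 2.0, -0.3]])
logits = model.forward({"obs": obs})
action = explo.get_exploration_action(logits)
```

The sketch only shows the greedy branch; the epsilon branch would sample among the valid actions instead. The point is that the state reaches the exploration class without changing any RLlib method signatures.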

How could I solve this problem?


Curious, do you want epsilon-greedy exploration or not?
It sounds to me like you just want to select the Nth action that conforms to the constraints.
So why not just argmax?

During training, I want exploration with a certain probability, say 5 percent. The action that goes into the environment should respect the constraints. So 95 percent of the actions come from argmax() and 5 percent from, for example, epsilon-greedy sampling. In both cases, the constraint should be respected. The problem is that epsilon-greedy might choose an invalid action, because not all invalid actions have been set to -inf; that is why I want to check the constraint in a custom exploration class. I need the state for the constraint check, and I wonder how I can access the state from a custom exploration class.

Why don't you do this in TorchPolicy._compute_action_helper()? Basically, get an action from the exploration, check the constraints, and if it is no good, call self.exploration.get_exploration_action() again.
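The resample-until-valid suggestion could look roughly like this inside an overridden _compute_action_helper(). Everything here is a schematic stand-in: sample_fn would be a closure around self.exploration.get_exploration_action(), and constraint_ok your constraint check:

```python
def sample_valid_action(sample_fn, constraint_ok, max_tries=100):
    """Draw actions from the exploration until one satisfies the
    constraint; bail out after max_tries to avoid an infinite loop
    when no valid action exists."""
    for _ in range(max_tries):
        action = sample_fn()
        if constraint_ok(action):
            return action
    raise RuntimeError("no valid action found within max_tries")

# toy demo: the 'exploration' proposes 3, then 1, then 2; only 2 is valid
proposals = iter([3, 1, 2])
action = sample_valid_action(lambda: next(proposals), lambda a: a == 2)
```

Note the rejection sampling slightly skews the 5% ratio, since exploratory draws that land on invalid actions get redrawn rather than discarded.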

I don't actually know how you can achieve this 5% exploration ratio with the way you are sampling the actions; normally, you compute the complete action mask beforehand.
Making your action constraint computation faster sounds like a better direction.

Calling self.exploration.get_exploration_action() until a (randomly) chosen action meets the constraints is indeed a solution, but this could add overhead if certain calculations have to be done each time it is called. I would prefer to pass the input_dict as an argument to self.exploration.get_exploration_action() as well. I could change that in the source code.
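If the overhead comes from redoing the same per-state computation on every retry, memoizing the constraint result per (state, action) pair is one way to blunt it without changing any signatures. A minimal sketch with functools.lru_cache; the constraint itself is hypothetical, and the cache key must be hashable (e.g. bytes from observation.numpy().tobytes()):

```python
from functools import lru_cache

CALLS = 0  # count how often the expensive check actually runs

@lru_cache(maxsize=4096)
def constraint_ok(state_key: bytes, action: int) -> bool:
    """Hypothetical expensive constraint check, memoized per (state, action)."""
    global CALLS
    CALLS += 1
    return action % 2 == 0  # made-up constraint

state_key = b"obs-42"                  # e.g. observation.numpy().tobytes()
first = constraint_ok(state_key, 3)    # computed
second = constraint_ok(state_key, 3)   # served from the cache, no recomputation
```

Within one environment step the state is fixed, so retries against the same state only ever pay for each candidate action once.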