Does beam search make sense in the context of RL?
At every step at inference time, we would expand several actions (assuming the environment is inexpensive to copy and step), follow k trajectories in parallel, and eventually keep the trajectory with the best reward?
I hacked a beam search algorithm for my env in RLlib, ranking each trajectory by the sum of the actions' logprobs (intermediate rewards are 0 in my problem), but the score is lower than with basic sampling.
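For reference, here is a minimal sketch of what I mean, outside of RLlib. `ToyEnv` and `action_score` are stand-ins (the real setup uses my env and the policy's logprobs); the key assumptions are that the env is deterministic, cheap to `deepcopy`, and gives reward only at the end, like my problem:

```python
import copy
import heapq

class ToyEnv:
    """Tiny deterministic env: integer state, 3 steps, sparse terminal reward."""
    def __init__(self):
        self.state = 0
        self.steps = 0

    def legal_actions(self):
        return [-1, 0, 1]

    def step(self, action):
        self.state += action
        self.steps += 1
        done = self.steps >= 3
        # Sparse reward: only at episode end, best when state == 2
        reward = -abs(self.state - 2) if done else 0.0
        return self.state, reward, done

def action_score(env, action):
    # Stand-in for the policy's logprob of `action` in the current state
    return float(action)

def beam_search(make_env, beam_width=2):
    # Each beam entry: (cumulative logprob-like score, cumulative reward, env copy)
    beams = [(0.0, 0.0, make_env())]
    finished = []
    while beams:
        candidates = []
        for score, ret, env in beams:
            for a in env.legal_actions():
                child = copy.deepcopy(env)  # branching requires a copyable env
                _, r, done = child.step(a)
                entry = (score + action_score(env, a), ret + r, child)
                (finished if done else candidates).append(entry)
        # Keep the top-k partial trajectories by cumulative score
        beams = heapq.nlargest(beam_width, candidates, key=lambda e: e[0])
    # Among completed trajectories, return the one with the best actual reward
    return max(finished, key=lambda e: e[1])

best_score, best_return, _ = beam_search(ToyEnv, beam_width=2)
print(best_return)
```

Note the mismatch this exposes: the beam is pruned by logprob-like scores during the rollout, but the final selection is by reward, so a high-likelihood beam can crowd out the high-reward trajectory — which may be why my reranking underperforms sampling.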