Beam search for RL?

Does beam search make sense in the context of RL?

The idea would be that, at each step at inference time, we take several actions (assuming the environment is inexpensive to simulate), follow k trajectories in parallel, and eventually keep the trajectory with the best reward.

I hacked together a beam search for my env in RLlib, ranking the candidate trajectories by the cumulative logprobs of their actions (the intermediate reward is 0 in my problem), but the score is lower than with plain sampling.
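
To make the procedure concrete, here is a minimal sketch of the kind of beam search I mean. It is not my actual RLlib code: `ToyEnv`, `policy_logprobs`, and `beam_search` are hypothetical names, the environment is a toy with a sparse terminal reward, and I assume the env can be cloned with `copy.deepcopy`. Beams are ranked by cumulative action logprob during the search, and the final pick is the surviving trajectory with the best return.

```python
import copy
import math
from dataclasses import dataclass


class ToyEnv:
    """Hypothetical toy env: choose 0 or 1 for `horizon` steps, reward only at the end."""

    def __init__(self, horizon=4):
        self.horizon = horizon
        self.actions = []

    def step(self, action):
        self.actions.append(action)
        done = len(self.actions) == self.horizon
        # Sparse reward: 1.0 only if the episode is over and every action was 1.
        reward = float(done and all(a == 1 for a in self.actions))
        return reward, done


def policy_logprobs(env):
    """Hypothetical fixed policy that slightly prefers action 0."""
    return {0: math.log(0.6), 1: math.log(0.4)}


@dataclass
class Beam:
    env: ToyEnv
    logprob: float = 0.0  # cumulative log-probability of the actions taken
    ret: float = 0.0      # cumulative reward collected so far
    done: bool = False


def beam_search(env, policy, k):
    beams = [Beam(copy.deepcopy(env))]
    while not all(b.done for b in beams):
        candidates = []
        for b in beams:
            if b.done:
                candidates.append(b)
                continue
            # Expand every action from a private copy of the environment.
            for action, lp in policy(b.env).items():
                child_env = copy.deepcopy(b.env)
                reward, done = child_env.step(action)
                candidates.append(Beam(child_env, b.logprob + lp, b.ret + reward, done))
        # Keep the k candidates with the highest cumulative logprob.
        candidates.sort(key=lambda b: b.logprob, reverse=True)
        beams = candidates[:k]
    # Among the surviving trajectories, keep the one with the best return.
    return max(beams, key=lambda b: b.ret)


narrow = beam_search(ToyEnv(4), policy_logprobs, k=2)
wide = beam_search(ToyEnv(4), policy_logprobs, k=16)
print(narrow.ret, wide.ret, wide.env.actions)
```

In this toy, the narrow beam (k=2) drops the only rewarded trajectory early because its logprob is the lowest, while a beam wide enough to hold every trajectory (k=16) finds it. Whether something similar happens in my env, I don't know.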

So, does beam search make sense in the context of RL?