My RLlib implementation seems to compute random actions

What I wrote is a stripped-down version of my problem; I am starting from there and plan to make it more complex later. I am aware that, in this form, reinforcement learning is overkill. But still, shouldn’t I be able to solve this with RL/RLlib?