I’ve been trying to use Tune for RL problems for the past 6 weeks, and after all sorts of setups repeatedly failing, I tried PPO on LunarLanderContinuous as a proof of concept. That still isn’t working, and I have absolutely no idea why, so I was hoping someone could tell me what I’m doing wrong here:
My code (it’s short):
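(The original script isn’t reproduced in this thread; the sketch below is a minimal reconstruction of the kind of setup being discussed, assuming RLlib’s PPO trainable driven by Ray Tune’s Optuna integration. The hyperparameter names and ranges are illustrative assumptions, not a copy of the actual script.)

```python
# Minimal sketch, NOT the original script: Ray Tune driving RLlib's PPO
# on LunarLanderContinuous-v2 with an Optuna search algorithm.
# Hyperparameter ranges here are illustrative assumptions.
import ray
from ray import tune
from ray.tune.suggest.optuna import OptunaSearch  # ray.tune.search.optuna in newer Ray

ray.init()

config = {
    "env": "LunarLanderContinuous-v2",
    "framework": "torch",
    "num_workers": 2,
    # Search space expressed with Tune's own distributions:
    "lr": tune.loguniform(1e-5, 1e-3),
    "gamma": tune.uniform(0.95, 0.999),
    "clip_param": tune.uniform(0.1, 0.3),
    "train_batch_size": tune.choice([2000, 4000, 8000]),
}

analysis = tune.run(
    "PPO",                                 # RLlib's PPO trainable
    config=config,
    search_alg=OptunaSearch(),             # metric/mode propagated from tune.run
    num_samples=20,                        # number of Optuna trials
    stop={"timesteps_total": 1_000_000},   # per-trial budget
    metric="episode_reward_mean",
    mode="max",
)
print("Best config:", analysis.best_config)
```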
Parameter/Reward outputs (nothing gets above 0 reward; the environment counts as solved at 200):
I started a run as soon as you mentioned that, and since then it’s gone through 16 sequential updates (it’s RL so it’s slow). The expected reward of the policy has also not improved in that time.
Hmm ok. I think the hyperparameter space that you’re working with is way too large. Could you try narrowing this to, say, 2 or 3 hyperparameters and trying again?
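Something like this, for example (illustrative only; leave everything else at PPO’s defaults):

```python
# Illustrative: shrink the search to just two hyperparameters,
# keeping all other PPO settings at their defaults.
from ray import tune

config = {
    "env": "LunarLanderContinuous-v2",
    "lr": tune.loguniform(1e-5, 1e-3),
    "gamma": tune.uniform(0.95, 0.999),
}
```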
I don’t think the hyperparameter search space I was originally working with is too large, either. Other projects using straight Optuna on this exact environment with similar search spaces have worked just fine:
Either you, I, and a bunch of my colleagues who have looked at my script are all missing something, or there’s some sort of weird bug in Tune’s support for Optuna.
It’s been so long that I no longer remember which file was the single-hyperparameter run (sorry about that), but its performance never improved or degraded from the initial random policy.
I switched to using straight Optuna with its own distribution API instead of Ray’s, with functionally the same hyperparameters, and it works fine, which strengthens my belief that the problem is in Tune.
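For reference, the straight-Optuna version looks roughly like this. This is a sketch, not my exact script: it assumes a Stable-Baselines3 PPO trainer as a stand-in for the real training loop, and the ranges are the same illustrative ones as above.

```python
# Minimal sketch of the straight-Optuna setup, assuming Stable-Baselines3's
# PPO as a stand-in for the actual training loop used in the real script.
import optuna
from stable_baselines3 import PPO
from stable_baselines3.common.evaluation import evaluate_policy

def objective(trial):
    # Same hyperparameters, but expressed with Optuna's own distributions
    # instead of Tune's (tune.loguniform -> trial.suggest_float(log=True)).
    lr = trial.suggest_float("lr", 1e-5, 1e-3, log=True)
    gamma = trial.suggest_float("gamma", 0.95, 0.999)

    model = PPO("MlpPolicy", "LunarLanderContinuous-v2",
                learning_rate=lr, gamma=gamma, verbose=0)
    model.learn(total_timesteps=200_000)

    # Score the trial by mean evaluation reward over 10 episodes.
    mean_reward, _ = evaluate_policy(model, model.get_env(),
                                     n_eval_episodes=10)
    return mean_reward  # Optuna maximizes this

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=20)
print("Best params:", study.best_params)
```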