[Tune] Optuna/Tune Hyperparameter Search for Lunar Lander Continuous Not Working

Hey,

I’ve been trying to use Tune for RL problems for the past six weeks, and after all sorts of things continually not working, I tried PPO on LunarLanderContinuous as a proof of concept. That still isn’t working and I have absolutely no idea why, so I was hoping someone could tell me what I’m doing wrong here:

My code (it’s short):

Parameter/reward outputs (nothing gets above a reward of 0; solved is 200):


Can you first try setting max_concurrent=1 to debug and see if it’s actually improving a relevant metric?
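For reference, with the Optuna searcher that usually means wrapping it in a ConcurrencyLimiter, roughly like the sketch below (import paths assume the older ray.tune.suggest layout; newer Ray versions expose the same classes under ray.tune.search):

```python
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.optuna import OptunaSearch

# One trial at a time, so each Optuna suggestion can condition on the
# result of the previous trial before proposing the next one.
search_alg = ConcurrencyLimiter(OptunaSearch(), max_concurrent=1)
# ...then pass search_alg to tune.run() as usual.
```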

I started a run as soon as you mentioned that, and since then it’s gone through 16 sequential updates (it’s RL so it’s slow). The expected reward of the policy has also not improved in that time.

Can you post the stdout and also relevant ray tune code snippets (tune.report, tune.run)?

Here are the Ray snippets and stdout for max_concurrent=10 from the original post:
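The snippets boil down to roughly the following shape (placeholder metric name, eval frequency, hyperparameter bounds, and timestep budget; not the exact script):

```python
import gym
from ray import tune
from ray.tune.suggest.optuna import OptunaSearch
from stable_baselines3 import PPO
from stable_baselines3.common.callbacks import BaseCallback
from stable_baselines3.common.evaluation import evaluate_policy


class TuneReportCallback(BaseCallback):
    """Periodically evaluate the current policy and report the mean reward to Tune."""

    def __init__(self, eval_env, eval_freq=10_000, n_eval_episodes=5):
        super().__init__()
        self.eval_env = eval_env
        self.eval_freq = eval_freq
        self.n_eval_episodes = n_eval_episodes

    def _on_step(self) -> bool:
        if self.n_calls % self.eval_freq == 0:
            mean_reward, _ = evaluate_policy(
                self.model, self.eval_env, n_eval_episodes=self.n_eval_episodes
            )
            tune.report(mean_reward=mean_reward)
        return True


def train_ppo(config):
    """Tune trainable: build a PPO agent from the sampled config and train it."""
    env = gym.make("LunarLanderContinuous-v2")
    eval_env = gym.make("LunarLanderContinuous-v2")
    model = PPO("MlpPolicy", env, **config)
    model.learn(total_timesteps=1_000_000, callback=TuneReportCallback(eval_env))


search_space = {
    "learning_rate": tune.loguniform(1e-5, 1e-3),
    "gamma": tune.uniform(0.9, 0.9999),
    "ent_coef": tune.loguniform(1e-8, 1e-1),
    "clip_range": tune.uniform(0.1, 0.4),
    "gae_lambda": tune.uniform(0.9, 1.0),
}

analysis = tune.run(
    train_ppo,
    config=search_space,
    search_alg=OptunaSearch(),
    metric="mean_reward",
    mode="max",
    num_samples=100,
)
```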

And here’s my current stdout for max_concurrent=1:

@justinkterry can you post the longer max_concurrent stdout?

Could you show maybe the full 100 samples?

Here you go, sorry:

Hmm ok. I think the hyperparameter space that you’re working with is way too large. Could you try narrowing this to say 2 or 3 hyperparameters and trying again?
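For example, trimming the config down to something like this (placeholder bounds):

```python
from ray import tune

search_space = {
    "learning_rate": tune.loguniform(1e-5, 1e-3),
    "gamma": tune.uniform(0.95, 0.999),
    "ent_coef": tune.loguniform(1e-8, 1e-2),
}
```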

I tried tuning just 3 hyperparameters, then just 1 hyperparameter; neither showed any improvement.

3-hyperparameter code: (Pastebin link)
stdout: I accidentally lost this file but literally nothing happened. It actually got slightly worse.
1-hyperparameter code: (Pastebin link)
stdout: (attached)

The hyperparameter search space I was originally working with also isn’t unreasonably large. Other projects using plain Optuna on this exact environment with similar search spaces have had it work just fine:

Either you, I, and a bunch of my colleagues who have looked at my script are all missing something, or there’s some sort of weird bug in Tune’s Optuna support.

Hey @justinkterry sorry for the slow reply - do you have the output for:

  • 1 hyperparameter
  • using max_concurrent=1

?

The max_concurrent=1 output is above, if you read the thread.

It’s been so long that I don’t know which file was for the single hyperparameter anymore (sorry about that), but performance never improved or decreased from the initial random policy.

I switched to using plain Optuna with its own distributed execution instead of Ray’s, with functionally the same hyperparameters, and it works fine, which strengthens my belief that the problem is in Tune.
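The plain-Optuna version is structured roughly like this (train_and_evaluate is a placeholder for the same SB3 PPO training loop, and the storage URL is whatever shared database the workers point at):

```python
import optuna


def objective(trial):
    # Same PPO training as before, but with Optuna's native distributions.
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True),
        "gamma": trial.suggest_float("gamma", 0.95, 0.999),
        "ent_coef": trial.suggest_float("ent_coef", 1e-8, 1e-2, log=True),
    }
    return train_and_evaluate(params)  # placeholder: returns final mean eval reward


# Each worker process runs this same script; they coordinate through the
# shared study in the storage backend rather than through Ray.
study = optuna.create_study(
    study_name="ppo_lunar_lander",
    storage="sqlite:///ppo_lunar_lander.db",
    direction="maximize",
    load_if_exists=True,
)
study.optimize(objective, n_trials=100)
```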

OK got it; let me know if you ever revisit this.

We do have convergence tests that sanity-check the implementation, so there’s evidence that the Tune/Optuna integration is hooked up properly.
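A quick self-contained check along those lines (not the actual test from the repo) is to optimize a toy function through the Optuna searcher and confirm the best loss approaches zero:

```python
from ray import tune
from ray.tune.suggest import ConcurrencyLimiter
from ray.tune.suggest.optuna import OptunaSearch


def quadratic(config):
    # Minimum of 0 at x == 3; a correctly wired searcher should drive this down.
    tune.report(loss=(config["x"] - 3.0) ** 2)


analysis = tune.run(
    quadratic,
    config={"x": tune.uniform(-10.0, 10.0)},
    search_alg=ConcurrencyLimiter(OptunaSearch(), max_concurrent=4),
    metric="loss",
    mode="min",
    num_samples=50,
)
print(analysis.best_config)
```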

I do suspect there’s something subtle, specific either to SB3 or to your code, that is causing the learning not to improve.