MLflow with RLlib training curves have incorrect step count

I’m using MLflow with RLlib + Tune, and the plot comparing the training curves (e.g. mean reward) across trials seems to be incorrect.

The step counts should all be lined up across trials, as they are in TensorBoard:
[screenshot: tb-training, TensorBoard curves aligned by timestep]

Also, it’s not clear what “step” refers to in MLflow, whereas in TB it clearly means the timesteps.

The terminal logs look like this:

+-------------------+----------+---------------------+--------+------------------+--------+-----------+----------------------+----------------------+--------------------+
| Trial name        | status   | loc                 |   iter |   total time (s) |     ts |    reward |   episode_reward_max |   episode_reward_min |   episode_len_mean |
|-------------------+----------+---------------------+--------+------------------+--------+-----------+----------------------+----------------------+--------------------|
| trial-2e3a6_00000 | RUNNING  | 192.168.1.74:395169 |      5 |          1241.54 | 300000 | -0.671691 |            -0.498597 |            -0.844548 |               2880 |
| trial-2e3a6_00001 | RUNNING  | 192.168.1.74:395170 |      5 |          1249.21 | 300000 | -0.678623 |            -0.552662 |            -0.798805 |               2880 |
| trial-2e3a6_00002 | RUNNING  | 192.168.1.74:395144 |      5 |          1248.3  | 300000 | -0.674146 |            -0.513186 |            -0.821598 |               2880 |
| trial-2e3a6_00003 | RUNNING  | 192.168.1.74:395171 |      6 |          1489.13 | 360000 | -0.637731 |            -0.452467 |            -0.802104 |               2880 |
+-------------------+----------+---------------------+--------+------------------+--------+-----------+----------------------+----------------------+--------------------+

Hey @vakker00, sorry, I’m not sure I can help with this. If TB displays the plots with the correct x-axis alignment, then I would assume that the Tune logging output is correct. Could this be an MLflow problem? Maybe you need to specify somewhere what columns (e.g. “timesteps_total”, “iter”) to use on the x-axis?

Thanks @sven1977 for looking into this.
I still think it is an issue with the logger callbacks.

For TB, in log_trial_result the step is taken from:

step = result.get(TIMESTEPS_TOTAL) or result[TRAINING_ITERATION]
...
self._trial_writer[trial].add_scalar(full_attr, value, global_step=step)

See here for more details.

The MLflow callback, on the other hand, uses the passed-in iteration, which differs between trials; it appears to be a global counter over all results rather than anything per-trial.

self.client.log_metric(run_id=run_id, key=key, value=value, step=iteration)

See here for more details.

If I add these print statements to the corresponding log_trial_result methods

print('tb', trial, iteration, step)
print('mlflow', trial, iteration)

then it prints the following, which highlights the issue:

mlflow PPO_RandomEnv_1bbf3_00001 14                                                                                                                                                                                
tb     PPO_RandomEnv_1bbf3_00001 14 4000 
mlflow PPO_RandomEnv_1bbf3_00002 15
tb     PPO_RandomEnv_1bbf3_00002 15 4000
mlflow PPO_RandomEnv_1bbf3_00000 16
tb     PPO_RandomEnv_1bbf3_00000 16 4000
mlflow PPO_RandomEnv_1bbf3_00003 17
tb     PPO_RandomEnv_1bbf3_00003 17 4000

mlflow PPO_RandomEnv_1bbf3_00001 18
tb     PPO_RandomEnv_1bbf3_00001 18 8000
mlflow PPO_RandomEnv_1bbf3_00000 19
tb     PPO_RandomEnv_1bbf3_00000 19 8000
mlflow PPO_RandomEnv_1bbf3_00002 20
tb     PPO_RandomEnv_1bbf3_00002 20 8000
mlflow PPO_RandomEnv_1bbf3_00003 21
tb     PPO_RandomEnv_1bbf3_00003 21 8000

mlflow PPO_RandomEnv_1bbf3_00001 22
tb     PPO_RandomEnv_1bbf3_00001 22 12000
mlflow PPO_RandomEnv_1bbf3_00002 23
tb     PPO_RandomEnv_1bbf3_00002 23 12000
mlflow PPO_RandomEnv_1bbf3_00003 24
tb     PPO_RandomEnv_1bbf3_00003 24 12000
mlflow PPO_RandomEnv_1bbf3_00000 25
tb     PPO_RandomEnv_1bbf3_00000 25 12000

Repro script:

import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.examples.env.random_env import RandomEnv
from ray.tune.integration.mlflow import MLflowLoggerCallback

if __name__ == '__main__':

    ray.init()

    config = {
        'framework': 'torch',
        'num_workers': 1,
        'num_envs_per_worker': 1,
        'num_gpus': 0,
        "lr": tune.grid_search([1e-2, 1e-3, 1e-4, 1e-5]),
        'env': RandomEnv,
    }

    stop = {
        'timesteps_total': 10000,
    }

    result = tune.run(PPOTrainer,
                      config=config,
                      stop=stop,
                      num_samples=1,
                      verbose=1,
                      callbacks=[
                          MLflowLoggerCallback(experiment_name="experiment1",
                                               save_artifact=True)
                      ])

@sven1977 do you think this is a bug or is this indeed the intended behaviour?

I opened a PR, see Fix step for MLflow log_trial_result by vakker · Pull Request #16840 · ray-project/ray · GitHub
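In the meantime, a user-side workaround along these lines should line the MLflow steps up with the TB ones. This is just a sketch, assuming the stock MLflowLoggerCallback forwards its iteration argument unchanged as the step to log_metric (as shown above), and that TIMESTEPS_TOTAL and TRAINING_ITERATION are importable from ray.tune.result as in the TB callback:

from ray.tune.integration.mlflow import MLflowLoggerCallback
from ray.tune.result import TIMESTEPS_TOTAL, TRAINING_ITERATION


class StepAlignedMLflowLoggerCallback(MLflowLoggerCallback):
    """Logs MLflow metrics against the trial's timestep count instead of
    the global callback iteration, mirroring the TB step derivation."""

    def log_trial_result(self, iteration, trial, result):
        # Same step derivation as the TB callback: prefer timesteps_total,
        # fall back to this trial's training iteration.
        step = result.get(TIMESTEPS_TOTAL) or result[TRAINING_ITERATION]
        # The parent passes this value straight through as the MLflow step.
        super().log_trial_result(step, trial, result)

Passing StepAlignedMLflowLoggerCallback(experiment_name="experiment1", save_artifact=True) to tune.run instead of the stock callback then gives per-trial timesteps on the x-axis in MLflow.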