Problems combining ray tune, mlflow and keras (tensorflow)

Hi,

I have been using Ray Tune for a while now and it is really good! But when combining it with MLflow and Keras callbacks, I have run into problems.

My Settings:

  • Windows
  • tensorflow==2.11.0
  • ray==2.3.0
  • mlflow==2.2.1

I am using it with a tune_trainable function and a trainable function (see below), with ReportCheckpointCallback for Keras and MLflowLoggerCallback for automated logging. My code works so far: the trials are created and run, my parameters and metrics are logged to MLflow, and the scheduler stops the trials as expected. The config is a YAML file created with Hydra.

But now I have two further requirements that I am not able to solve:

  • I am logging the mean_squared_error and the mean_absolute_error, so far so good (the metrics are logged in MLflow and the scheduler uses them to stop the trials). But I do not really know on which dataset (train or val) the metrics are calculated. I pass both the train and the val set, but only get one metric reported.
  • I want to log a further custom metric within the trainable with mlflow.log_metric(key, value). For example: passing the test set to model.evaluate(ds_test) and storing the metric in MLflow after training ends. I placed mlflow.log_metric() in the trainable, but then I get an error: mlflow.exceptions.MlflowException: Run ‘3ef7584943e440a08ee53c7a70a4de53’ not found.
    After that I tried a custom Keras callback to log a metric only for testing. But then, under Artifacts in MLflow, a new folder “mlflow” is created, which contains a new run with this metric (see image).


The metrics I pass to the callbacks:

  • mean_squared_error
  • mean_absolute_error

import mlflow
from tensorflow import keras


class CustomCallback(keras.callbacks.Callback):
    def __init__(self, ds_test):
        super().__init__()
        self.ds_test = ds_test

    def on_train_begin(self, logs=None):
        mlflow.log_metric("00_my_custom", 44)


def trainable(cfg: dict) -> None:

    data_preparer = data.ingestion.DataPreparer(cfg)
    ds_train, ds_val, ds_test = data_preparer.get_tf_train_val_test_datasets()

    pointnet = model_HybridPointNetMeta.HybridPointNetMeta(cfg)
    model = pointnet.build_model()

    compiler = compile_fit.CompileFitter(cfg)
    model = compiler.compile_fit_model(
        model,
        ds_train,
        ds_val,
        callbacks=[
            ReportCheckpointCallback(metrics=list(cfg.ml_trainer.METRICS)),
            CustomCallback(ds_test)
        ],
    )


def tune_trainable(cfg: DictConfig) -> None:
    
    dict_cfg = OmegaConf.to_container(cfg, resolve=True)

    sched = get_asha_scheduler(cfg)
    search_alg = None

    tuner = tune.Tuner(
        tune.with_resources(
            trainable,
            resources={
                "cpu": cfg.ml_tuner.RESSOURCES_PER_ITER.NUM_CPU,
                "gpu": cfg.ml_tuner.RESSOURCES_PER_ITER.NUM_GPU,
            },
        ),
        run_config=air.RunConfig(
            name=cfg.ml_tuner.RUN_CONFIG.NAME,
            stop=None,
            callbacks=[
                MLflowLoggerCallback(
                    tracking_uri="http://127.0.0.1:5000",
                    experiment_name="Test",
                    save_artifact=False,
                ),
            ],
            verbose=cfg.ml_tuner.RUN_CONFIG.VERBOSE,
        ),
        tune_config=tune.TuneConfig(
            search_alg=search_alg,
            scheduler=sched,
            metric=cfg.ml_trainer.METRICS[0],
            mode=cfg.ml_tuner.TUNE_CONFIG.MODE_METRICS,
            num_samples=cfg.ml_tuner.TUNE_CONFIG.NUM_SAMPLES,
        ),
        param_space=dict_cfg,
    )
    results = tuner.fit()

I also tried setup_mlflow() within the trainable, but then I get an error that the params are not allowed to be overwritten. The last thing I tried is the @mlflow_mixin decorator on the trainable function. This creates trials in MLflow and logs what I want to log, but then the metrics are not reported back to Ray Tune to control the scheduler.

Can anyone help? Thanks!
Patrick

Hi @machine,

Ray Tune’s ReportCheckpointCallback only passes through the metrics it receives from Keras. Validation metrics are usually prefixed with val_, so it looks like you are currently reporting the training metrics.

You can just leave out the metrics= key - Ray Tune will then forward all metrics it gets from Keras. You can then see if there are e.g. validation metrics you’d like to use instead.
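For example (a sketch; the val_-prefixed names are assumptions based on Keras’ usual naming for the metrics you compile with):

from ray.air.integrations.keras import ReportCheckpointCallback

# Forward everything Keras logs (training and val_-prefixed metrics) to Tune:
ReportCheckpointCallback()

# Or select the validation metrics explicitly:
ReportCheckpointCallback(
    metrics=["val_mean_squared_error", "val_mean_absolute_error"]
)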

Regarding your second question (logging custom metrics from within the trainable): unfortunately this doesn’t work out of the box with the MLflowLoggerCallback. The reason is that the callback is executed on the driver (the script that spawns the Tune trials) and not in the trainable (which runs the actual Keras fitting).

What you can do instead is subclass the ReportCheckpointCallback like this:

from typing import Dict

from ray.air.integrations.keras import ReportCheckpointCallback


class CustomCallback(ReportCheckpointCallback):
    def _get_reported_metrics(self, logs: Dict) -> Dict:
        metrics = super()._get_reported_metrics(logs)
        metrics["00_my_custom"] = 44
        return metrics

This will add your custom metric to Tune’s session.report, which will then be picked up by the MLflowLoggerCallback and passed to MLflow.
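In your trainable, that subclass can then replace the two callbacks you currently pass (a sketch based on your code above; the metric names still come from your config):

model = compiler.compile_fit_model(
    model,
    ds_train,
    ds_val,
    callbacks=[CustomCallback(metrics=list(cfg.ml_trainer.METRICS))],
)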

As an alternative, you can use setup_mlflow within the trainable. There was a bug with it recently, so you should update to the latest Ray version to use this. The mlflow_mixin is deprecated and shouldn’t be used. Note that if you use setup_mlflow, you shouldn’t use the MLflowLoggerCallback, and you will have to log metrics and checkpoints yourself (e.g. with a custom Keras callback).
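A minimal sketch of that approach (the tracking URI and experiment name mirror your MLflowLoggerCallback above; the test metric value is a placeholder for whatever model.evaluate(ds_test) returns):

from ray.air import session
from ray.air.integrations.mlflow import setup_mlflow


def trainable(config: dict) -> None:
    # Attach an MLflow run to this trial so that plain mlflow.* calls work here.
    mlflow = setup_mlflow(
        config,
        tracking_uri="http://127.0.0.1:5000",
        experiment_name="Test",
    )

    # ... build, compile and fit the model as in your trainable above ...

    # Log your custom test metric yourself after training:
    test_mse = 0.0  # e.g. taken from model.evaluate(ds_test)
    mlflow.log_metric("test_mean_squared_error", test_mse)

    # Still report the metric to Tune so the scheduler can act on it:
    session.report({"mean_squared_error": test_mse})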

Hi @kai ,

Thanks! I will try it in the next days. In the meantime I tried mlflow_mixin - that is working now. But since it is deprecated and not recommended, I will change the code and verify that it still works.

Patrick