Hi,
I have been using Ray Tune for a while now and it is really good! But when combining it with MLflow and Keras callbacks, I have run into problems.
My Settings:
- Windows
- tensorflow==2.11.0
- ray==2.3.0
- mlflow==2.2.1
I am using a tune_trainable function and a trainable function (see below), with ReportCheckpointCallback for Keras and MLflowLoggerCallback for automated logging. My code works so far: the trials are created and run, my parameters and metrics are logged to MLflow, and the scheduler stops trials as expected. The config is a YAML file created with Hydra.
But now I have two further requirements that I have not been able to solve:
- I am logging the mean_squared_error and the mean_absolute_error, so far so good (the metrics are logged in MLflow and the scheduler uses them to stop trials). But I do not know on which dataset (train or val) the metrics are calculated: I pass both the train and the val set, but only get one value per metric reported (see the sketch right after this list for how I would expect to select the validation values).
- I want to log further custom metrics from within the trainable via mlflow.log_metric(key, value). For example: passing the test set to model.evaluate(ds_test) and storing the metric in MLflow after training ends. I placed mlflow.log_metric() in the trainable, but then I get an error: mlflow.exceptions.MlflowException: Run '3ef7584943e440a08ee53c7a70a4de53' not found.
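For the first point: as far as I understand, Keras writes the training-set values under the plain metric names and the validation-set values under a "val_" prefix into its logs, so my guess is that I could request the validation values explicitly (a sketch, assuming the list entries are looked up in Keras' logs):

from ray.air.integrations.keras import ReportCheckpointCallback

# Sketch: explicitly ask for the validation-set metrics from Keras' logs.
# Keras names them "val_<metric>" when validation data is passed to fit().
report_cb = ReportCheckpointCallback(
    metrics=["val_mean_squared_error", "val_mean_absolute_error"]
)
# tune.TuneConfig(metric=...) would then have to use "val_mean_squared_error" too.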
After that I tried a custom Keras callback to log a metric just for testing. But then a new folder "mlflow" is created under artifacts in MLflow, which contains a new run with this metric (see image).
The metrics I pass to the callbacks:
- mean_squared_error
- mean_absolute_error
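For context, here is roughly what I assume happens inside CompileFitter.compile_fit_model (a sketch; the optimizer, loss, and the toy model are placeholders, not my actual setup):

from tensorflow import keras

# Placeholder model; my real model comes from HybridPointNetMeta.build_model()
model = keras.Sequential([keras.layers.Dense(1)])
model.compile(
    optimizer="adam",
    loss="mean_squared_error",
    metrics=["mean_squared_error", "mean_absolute_error"],
)
# With validation data passed to fit(), Keras' logs then contain both
# "mean_squared_error" (train) and "val_mean_squared_error" (validation).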
import mlflow
from tensorflow import keras

class CustomCallback(keras.callbacks.Callback):
    def __init__(self, ds_test):
        super().__init__()
        self.ds_test = ds_test

    def on_train_begin(self, logs=None):
        mlflow.log_metric("00_my_custom", 44)
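To understand why a second run shows up, I tried a debugging variant of the callback. My suspicion is that no MLflow run is active inside the Ray worker process, so mlflow.log_metric() implicitly starts a fresh one (this variant is my own debugging sketch, not part of my actual code):

import mlflow
from tensorflow import keras

class DebugCallback(keras.callbacks.Callback):
    def on_train_begin(self, logs=None):
        # On the Tune worker this prints None, which would explain why
        # log_metric() below opens a brand-new run instead of reusing
        # the one created by MLflowLoggerCallback on the driver.
        print("active MLflow run:", mlflow.active_run())
        mlflow.log_metric("00_my_custom", 44)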
from ray.air.integrations.keras import ReportCheckpointCallback

def trainable(cfg: dict) -> None:
    # Prepare train/val/test datasets from the Hydra config
    data_preparer = data.ingestion.DataPreparer(cfg)
    ds_train, ds_val, ds_test = data_preparer.get_tf_train_val_test_datasets()
    # Build, compile, and fit the model
    pointnet = model_HybridPointNetMeta.HybridPointNetMeta(cfg)
    model = pointnet.build_model()
    compiler = compile_fit.CompileFitter(cfg)
    model = compiler.compile_fit_model(
        model,
        ds_train,
        ds_val,
        callbacks=[
            ReportCheckpointCallback(metrics=list(cfg.ml_trainer.METRICS)),
            CustomCallback(ds_test),
        ],
    )
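Conceptually, this is what I want to achieve after training. The helper below is something I made up for illustration, and I am not sure whether an extra session.report() after ReportCheckpointCallback's per-epoch reports is even allowed:

from ray.air import session

def report_test_metrics(model, ds_test) -> None:
    # Hypothetical helper: evaluate on the held-out test set once training
    # is done and push the results back through Ray, hoping that
    # MLflowLoggerCallback on the driver picks them up like any other metric.
    results = model.evaluate(ds_test, return_dict=True)
    session.report({f"test_{name}": value for name, value in results.items()})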
from omegaconf import DictConfig, OmegaConf
from ray import air, tune
from ray.air.integrations.mlflow import MLflowLoggerCallback

def tune_trainable(cfg: DictConfig) -> None:
    # Resolve the Hydra/OmegaConf config into a plain dict for the param space
    dict_cfg = OmegaConf.to_container(cfg, resolve=True)
    sched = get_asha_scheduler(cfg)
    search_alg = None
    tuner = tune.Tuner(
        tune.with_resources(
            trainable,
            resources={
                "cpu": cfg.ml_tuner.RESSOURCES_PER_ITER.NUM_CPU,
                "gpu": cfg.ml_tuner.RESSOURCES_PER_ITER.NUM_GPU,
            },
        ),
        run_config=air.RunConfig(
            name=cfg.ml_tuner.RUN_CONFIG.NAME,
            stop=None,
            callbacks=[
                MLflowLoggerCallback(
                    tracking_uri="http://127.0.0.1:5000",
                    experiment_name="Test",
                    save_artifact=False,
                ),
            ],
            verbose=cfg.ml_tuner.RUN_CONFIG.VERBOSE,
        ),
        tune_config=tune.TuneConfig(
            search_alg=search_alg,
            scheduler=sched,
            metric=cfg.ml_trainer.METRICS[0],
            mode=cfg.ml_tuner.TUNE_CONFIG.MODE_METRICS,
            num_samples=cfg.ml_tuner.TUNE_CONFIG.NUM_SAMPLES,
        ),
        param_space=dict_cfg,
    )
    results = tuner.fit()
I also tried setup_mlflow() within the trainable, but then I get an error that the params are not allowed to be overwritten. The last thing I tried is the @mlflow_mixin decorator on the trainable function. This creates runs in MLflow and logs what I want to log, but then the metrics are not reported back to Ray Tune to drive the scheduler.
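For completeness, my setup_mlflow() attempt looked roughly like this (a sketch; the keyword arguments mirror my MLflowLoggerCallback settings):

from ray.air.integrations.mlflow import setup_mlflow

def trainable_with_setup_mlflow(cfg: dict) -> None:
    # This call is where the "params are not allowed to be overwritten"
    # error comes from; maybe because MLflowLoggerCallback has already
    # logged the config as params for the same run?
    setup_mlflow(
        cfg,
        tracking_uri="http://127.0.0.1:5000",
        experiment_name="Test",
    )
    # ... rest of the trainable as above ...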
Can anyone help? Thanks!
Patrick