`train.report()` ignored; custom training function's return value is used as the report instead

When I report results to the tuner using `train.report()` AND via the normal function return, the tuner seems to read from the return value instead of from `train.report()`, and keeps giving me an error saying it cannot find the metric specified in the `tune_config` of my Tuner.

How can I have the tuner read from the report instead of the function return?
(The error goes away if I comment out the line `return output`.)

################################################################################
##### Last few lines of my training function #####
################################################################################

        ## Create a Ray Tune session report
        ## Passes the checkpoint data to Ray Tune
        report = {
            "loss": np.mean(track_validation_loss[epoch, :]),
        }
        train.report(report, checkpoint=checkpoint_from_storage)

    
    ## Collect all the items into dictionary to return
    ## Update this into a 2D matrix to be able to track epoch and batch
    output = {
        "Training Loss": track_training_loss, 
        "Training TP": track_training_TP_count, 
        "Training FP": track_training_FP_count, 
        "Training TN": track_training_TN_count,
        "Training FN": track_training_FN_count,
        "Validation Loss": track_validation_loss, 
        "Validation TP": track_validation_TP_count, 
        "Validation FP": track_validation_FP_count, 
        "Validation TN": track_validation_TN_count,
        "Validation FN": track_validation_FN_count,
    }
        
    return output
################################################################################
##### Tuner definition #####
################################################################################

    ## Tuner
    tuner = tune.Tuner(
        tune.with_resources(
            tune.with_parameters(train_the_model),   # Tuner will use what is in param_space
            #resources = {"cpu": psutil.cpu_count(logical=True)},  # Logical CPU units - This would oversubscribe and cause low CPU utilization
            resources = {"cpu": psutil.cpu_count(logical=False), # Physical CPU units
                         "gpu": torch.cuda.device_count()},  
        ),
        tune_config = tune.TuneConfig(
            metric="loss",  # Can also put under scheduler
            mode="min",     # Can also put under scheduler
            scheduler=scheduler,
            num_samples=10,
        ),
        param_space=param_space["params"]
    )
    
    ## Fit the tuner
    results = tuner.fit()

@aoot The `return` statement here is the equivalent of doing one final `train.report` at the end of the training function, without a checkpoint.
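
In other words, here is a minimal sketch of the equivalence (illustrative placeholder values, not your exact code):

from ray import train


def train_fn_a(config):
    final_loss = 0.1  # placeholder value
    return {"loss": final_loss}  # implicit final report, no checkpoint


def train_fn_b(config):
    final_loss = 0.1  # placeholder value
    train.report({"loss": final_loss})  # explicit final report

Both endings produce one final result row for the trial; only train.report() can also attach a checkpoint.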

The error that’s being raised is due to specifying `metric="loss"` in the `TuneConfig`. By default, Tune will raise an error if you report a set of metrics that doesn’t include this tracked metric.

You can get around this either by adding a dummy loss (or the latest loss) to the returned dictionary, or by setting the environment variable that disables strict metric checking.
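
If you take the environment-variable route, a minimal sketch (the variable is `TUNE_DISABLE_STRICT_METRIC_CHECKING`; set it before the Tune session starts):

import os

# Relaxes the check that every reported result must contain the metric
# named in TuneConfig (here, "loss"). Set this before tuner.fit().
os.environ["TUNE_DISABLE_STRICT_METRIC_CHECKING"] = "1"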

Example:

from ray import train, tune


def train_fn(config):
    train.report({"loss": 1})

    return {"a": 2, "b": 3, "loss": 1}


tuner = tune.Tuner(train_fn, tune_config=tune.TuneConfig(metric="loss", mode="min"))
results = tuner.fit()


>>> results[0].metrics_dataframe[["a", "b"]]
     a    b
0  NaN  NaN
1  2.0  3.0

Hey Justin,
Thanks for taking the time to read through the code.

I was hoping that if I used `train.report()` to report the metric expected in `tune_config`, the Tuner would ignore what the function returns.

However, your solution is elegant, and I’ll use it to suppress the error message.


For some reason my environment is not set up properly, so I have not yet been able to check the following myself:

Does Tuner register the loss from train.report() or from the returned dictionary?

@aoot Ray Tune will log all reported metrics: see the `results[0].metrics_dataframe` output above, which contains two results (one from `train.report` and one from the returned dict).
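
Once your environment is working, here is a quick check on the toy example above (assuming a recent Ray version, where `Result.metrics` holds the last reported result):

df = results[0].metrics_dataframe

# Each train.report() call and the final return become one row each,
# so the tracked metric appears twice in this toy example.
print(df["loss"])

# The last reported result -- here, the returned dict.
print(results[0].metrics["loss"])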