`train.report()` ignored; custom training function's return value is used as the report instead

When I report results to the tuner using `train.report()` AND via the normal function return, the tuner seems to read from the return value instead of from `train.report()`, and keeps giving me an error saying it cannot find the metric specified in the `tune_config` of my Tuner.

How can I have the tuner read from the report instead of the function return?
(The error goes away if I comment out the line `return output`.)

################################################################################
##### Last few lines of my training function #####
################################################################################

        ## Create a Ray Tune session report
        ## Passes the checkpoint data to Ray Tune
        report = {
            "loss": np.mean(track_validation_loss[epoch, :]),
        }
        train.report(report, checkpoint=checkpoint_from_storage)

    
    ## Collect all the items into dictionary to return
    ## Update this into a 2D matrix to be able to track epoch and batch
    output = {
        "Training Loss": track_training_loss, 
        "Training TP": track_training_TP_count, 
        "Training FP": track_training_FP_count, 
        "Training TN": track_training_TN_count,
        "Training FN": track_training_FN_count,
        "Validation Loss": track_validation_loss, 
        "Validation TP": track_validation_TP_count, 
        "Validation FP": track_validation_FP_count, 
        "Validation TN": track_validation_TN_count,
        "Validation FN": track_validation_FN_count,
    }
        
    return output
################################################################################
##### Tuner definition #####
################################################################################

    ## Tuner
    tuner = tune.Tuner(
        tune.with_resources(
            tune.with_parameters(train_the_model),   # Tuner will use what is in param_space
            #resources = {"cpu": psutil.cpu_count(logical=True)},  # Logical CPU units - This would oversubscribe and cause low CPU utilization
            resources = {"cpu": psutil.cpu_count(logical=False), # Physical CPU units
                         "gpu": torch.cuda.device_count()},  
        ),
        tune_config = tune.TuneConfig(
            metric="loss",  # Can also put under scheduler
            mode="min",     # Can also put under scheduler
            scheduler=scheduler,
            num_samples=10,
        ),
        param_space=param_space["params"]
    )
    
    ## Fit the tuner
    results = tuner.fit()

@aoot The `return` statement here is the equivalent of doing one final `train.report` at the end of the training function, without a checkpoint.
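
In other words, here is a minimal sketch of the equivalence (illustrative placeholder values, not your exact code):

from ray import train


def train_fn_a(config):
    final_loss = 0.1  # placeholder value
    return {"loss": final_loss}  # implicit final report, no checkpoint


def train_fn_b(config):
    final_loss = 0.1  # placeholder value
    train.report({"loss": final_loss})  # explicit final report

Both endings produce one final result row for the trial; only train.report() can also attach a checkpoint.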

The error that’s being raised is due to specifying `metric="loss"` in the `TuneConfig`. By default, Tune will raise an error if you report a set of metrics that doesn’t include this tracked metric.

You can get around this either by adding a dummy loss (or the latest loss) to the returned dictionary, or by setting the environment variable that disables strict metric checking.
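
If you take the environment-variable route, a minimal sketch (the variable is `TUNE_DISABLE_STRICT_METRIC_CHECKING`; set it before the Tune session starts):

import os

# Relaxes the check that every reported result must contain the metric
# named in TuneConfig (here, "loss"). Set this before tuner.fit().
os.environ["TUNE_DISABLE_STRICT_METRIC_CHECKING"] = "1"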

Example:

from ray import train, tune


def train_fn(config):
    train.report({"loss": 1})

    return {"a": 2, "b": 3, "loss": 1}


tuner = tune.Tuner(train_fn, tune_config=tune.TuneConfig(metric="loss", mode="min"))
results = tuner.fit()


>>> results[0].metrics_dataframe[["a", "b"]]
     a    b
0  NaN  NaN
1  2.0  3.0

Hey Justin,
Thanks for taking the time to read through the code.

I was hoping that if I used `train.report()` to report the metric expected in `tune_config`, the Tuner would ignore what the function returns.

However, your solution is elegant, and I’ll use it to suppress the error message.


For some reason my environment is not set up properly, so I have not yet been able to check the following myself:

Does Tuner register the loss from train.report() or from the returned dictionary?

@aoot Ray Tune will log all reported metrics: see the `results[0].metrics_dataframe` output above, which contains two results (one from `train.report` and one from the returned dict).
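
Once your environment is working, here is a quick check on the toy example above (assuming a recent Ray version, where `Result.metrics` holds the last reported result):

df = results[0].metrics_dataframe

# Each train.report() call and the final return become one row each,
# so the tracked metric appears twice in this toy example.
print(df["loss"])

# The last reported result -- here, the returned dict.
print(results[0].metrics["loss"])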