Cannot stop Ray Tune from invoking botocore and failing

I am running Ray Tune in local mode (ray.init(local_mode=True)), but after the first trial runs, I get the following error:

botocore.exceptions.NoCredentialsError: Unable to locate credentials

It seems that botocore is a library for Amazon Web Services. However, I don't reference Amazon services anywhere, and I don't even have an account there.

How can I tell Ray Tune not to use botocore at all, so that it does not try to connect to Amazon Web Services? The strange thing is that the code worked earlier; this error only started appearing recently.

EDIT: This is probably related to Ray Tune's MLflow integration (@mlflow_mixin), or maybe it is just an MLflow issue. Either way, still unsolved.

Hey @nikujar,
Are you running on your laptop? Actually, I think local_mode=True is being deprecated. What happens if you just do ray.init()?


Not on a laptop, but on an on-premises computer, yes. It looks like that helped: when I removed local_mode=True, it no longer invokes botocore! Thank you very much!

Well, after a bit more testing I noticed that MLflow reporting stopped working. The code now runs without the local_mode=True setting, but then no reports reach the MLflow server. With local_mode=True, the MLflow reporting works, but the end of the first trial fails due to the botocore problem I described originally.

Could you share how you’re using MLflow reporting? Are you using the built-in MLflow callback?


The tune.run call looks like this:

    analysis = tune.run(
        train,  # our training function
        callbacks=[
            # this makes Ray Tune call the MLflow logger
            MLflowLoggerCallback(
                experiment_name=experiment_name,
                tags={"Framework": "Ray Tune"},
                save_artifact=True,
            ),
        ],
        num_samples=num_trials,
        resources_per_trial=resources_per_trial,
        config=tune_config,
        metric=tune_config["metric"],
    )

And then, the training code is as follows:

    @mlflow_mixin
    def train(
        config: TuneConfig  # config is provided by Ray Tune
    ):
        mlflow.autolog()
        mlflow_run = mlflow.start_run(run_name="local", nested=True)
        model.fit(  # model is built elsewhere
            x=training_generator,
            epochs=config["epochs"],
            validation_data=validation_generator,
            use_multiprocessing=False,
            workers=1,
            callbacks=[
                TuneReportCallback([config["metric"]])
            ],
        )
        mlflow.end_run()

Can you show me your config, especially the config["mlflow"] part of it?

What is your tracking_uri?

You also use both MLflowLoggerCallback and @mlflow_mixin. I wonder whether both are necessary.

Also note that by default each Tune trial runs with its working directory changed to a per-trial directory, different from your original working directory. Maybe that's causing the issue?
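The directory-change point can be seen with a small stdlib-only sketch (no Ray involved; the "mlruns" name just stands in for any relative path, such as a file-based MLflow store):

```python
import os
import pathlib
import tempfile

# A *relative* path resolves to a different location inside a trial's
# working directory than it does in the driver's working directory.
driver_dir = os.getcwd()
relative_store = pathlib.Path("mlruns")  # stands in for a relative MLflow store

with tempfile.TemporaryDirectory() as trial_dir:
    os.chdir(trial_dir)                  # roughly what Tune does per trial
    resolved_in_trial = relative_store.resolve()
    os.chdir(driver_dir)                 # back to the driver's directory

resolved_in_driver = relative_store.resolve()
print(resolved_in_trial != resolved_in_driver)  # True: same string, two places
```

So if anything in the MLflow setup is resolved relative to the current directory, the trials and the driver would silently write to different places.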

Are you running MLflow locally? How do you check the reports? (With what arguments do you run mlflow ui?)


Thanks for the good hints! I am tracking to a localhost-based MLflow server, so the URI is "http://127.0.0.1:5000", and config["mlflow"] contains:

    "mlflow": {
        "tracking_uri": mlflow.get_tracking_uri(),
        "experiment_name": experiment_name,
    }

But yes, I removed MLflowLoggerCallback, and that got rid of the botocore problem. Maybe that callback is somehow accidentally hard-coded for AWS (Amazon) services; I don't know.

However, what happens then is that after the second trial the program just hangs and never continues. I can see the results of the first trial in the MLflow UI, but for the second trial only the parameters are reported, not the metrics. Probably the trial hangs at exactly the point where it tries to report the metrics.

The hanging does not happen if I remove local_mode=True from the init call. I could live with that, too, but then there is a new problem: the experiment name is somehow lost, and all MLflow logging goes into the "Default" experiment instead of appearing under experiment_name. I set the experiment name before the tune.run() call with mlflow.set_experiment(experiment_name). I need to study more to find out where the experiment name gets lost inside Tune. Any ideas?
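One hedged guess as to why the experiment name gets lost: mlflow.set_experiment() sets in-process state, and without local mode the trials run in separate worker processes that never see the driver's setting. A stdlib-only sketch of that process isolation (no Ray or MLflow involved):

```python
import subprocess
import sys

# In the driver process, "setting the experiment" is just in-process state:
active_experiment = "my-experiment"

# A separate Python process (which is what a Ray worker is, outside local
# mode) shares none of that state; the name simply does not exist there:
probe = "print('active_experiment' in dir())"
out = subprocess.run(
    [sys.executable, "-c", probe],
    capture_output=True, text=True,
).stdout.strip()
print(out)  # "False": the fresh process never saw the driver's setting
```

If that is the cause, the experiment name would have to travel to the trials through the serialized config (as config["mlflow"] does) rather than through driver-side calls.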

My suggestion is to stay away from local_mode=True for now.

To debug the experiment_name issue, could you add some logging around setup_mlflow.py's setup_mlflow method? Basically, I wonder whether self._mlflow.get_experiment(experiment_id=experiment_id) gives you the correct experiment that you already created in the driver code.
