Cannot stop Ray Tune from invoking botocore and failing

I am running Ray Tune in local mode (ray.init(local_mode=True)), but after the first trial runs, I get the following error:

botocore.exceptions.NoCredentialsError: Unable to locate credentials

It seems that botocore is a library for Amazon Web Services. However, I don't reference Amazon services anywhere, and I don't even have an account there.

How can I tell Ray Tune not to use botocore at all, so that it does not try to connect to Amazon Web Services? The strange thing is that the code worked earlier; this error only started appearing recently.

EDIT: This is probably related to Ray Tune's MLflow integration (@mlflow_mixin), or maybe it is just an MLflow issue. Either way, still unsolved.

Hey @nikujar,
Are you running on your laptop? Actually, I think local_mode=True is being deprecated. What happens if you just do ray.init()?


Not on a laptop, but on an on-premises computer, yes. It looks like that helped: when I removed local_mode=True, it no longer invokes botocore! Thank you very much!

Well, after a bit more testing I noticed that MLflow reporting stopped working. The code now runs without the local_mode=True setting, but then no reports reach the MLflow server. With local_mode=True, the MLflow reporting works, but the end of the first trial fails due to the botocore problem I described originally.

Could you share how you’re using MLflow reporting? Are you using the built-in MLflow callback?


The tune.run call looks like this:

    analysis = tune.run(
        train,  # our training function
        callbacks=[
            # this makes Ray Tune call the MLflow logger
            MLflowLoggerCallback(
                experiment_name=experiment_name,
                tags={"Framework": "Ray Tune"},
                save_artifact=True,
            ),
        ],
        num_samples=num_trials,
        resources_per_trial=resources_per_trial,
        config=tune_config,
        metric=tune_config["metric"],
    )

And then, the training code is as follows:

    @mlflow_mixin
    def train(
        config: TuneConfig  # config is provided by Ray Tune
    ):
        mlflow.autolog()
        mlflow_run = mlflow.start_run(run_name="local", nested=True)
        model.fit(  # model is built elsewhere
            x=training_generator,
            epochs=config["epochs"],
            validation_data=validation_generator,
            use_multiprocessing=False,
            workers=1,
            callbacks=[
                TuneReportCallback([config["metric"]])
            ],
        )
        mlflow.end_run()

Can you show me your config, especially the config["mlflow"] part of it?

What is your tracking_uri?

You also use both MLflowLoggerCallback and @mlflow_mixin. I wonder whether both are necessary.

Also note that by default each Tune trial runs with its working directory changed to a per-trial directory, different from your original working directory. Maybe that's causing the issue?
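The directory-change point can be seen with a small stdlib-only sketch (no Ray involved; the "mlruns" name just stands in for any relative path, such as a file-based MLflow store):

```python
import os
import pathlib
import tempfile

# A *relative* path resolves to a different location inside a trial's
# working directory than it does in the driver's working directory.
driver_dir = os.getcwd()
relative_store = pathlib.Path("mlruns")  # stands in for a relative MLflow store

with tempfile.TemporaryDirectory() as trial_dir:
    os.chdir(trial_dir)                  # roughly what Tune does per trial
    resolved_in_trial = relative_store.resolve()
    os.chdir(driver_dir)                 # back to the driver's directory

resolved_in_driver = relative_store.resolve()
print(resolved_in_trial != resolved_in_driver)  # True: same string, two places
```

So if anything in the MLflow setup is resolved relative to the current directory, the trials and the driver would silently write to different places.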

Are you running MLflow locally? How do you check the reports? (With what arguments do you run mlflow ui?)


Thanks for the good hints! I am tracking to a localhost-based MLflow server, so the URI is "http://127.0.0.1:5000", and config["mlflow"] contains:

    "mlflow": {
        "tracking_uri": mlflow.get_tracking_uri(),
        "experiment_name": experiment_name,
    }

But yes, I removed MLflowLoggerCallback, and that got rid of the botocore problem. Maybe that callback is somehow accidentally hard-coded for AWS (Amazon) services; I don't know.

However, what happens then is that after the second trial the program just hangs and never continues. I can see the results of the first trial in the MLflow UI, but for the second trial only the parameters are reported, not the metrics. Probably the trial hangs at exactly the point where it tries to report the metrics.

The hanging does not happen if I remove local_mode=True from the init call. I could live with that, too, but then there is a new problem: the experiment name is somehow lost, and all MLflow logging goes into the "Default" experiment instead of appearing under experiment_name. I set the experiment name before the tune.run() call with mlflow.set_experiment(experiment_name). I need to study more to find out where the experiment name gets lost inside Tune. Any ideas?
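One hedged guess as to why the experiment name gets lost: mlflow.set_experiment() sets in-process state, and without local mode the trials run in separate worker processes that never see the driver's setting. A stdlib-only sketch of that process isolation (no Ray or MLflow involved):

```python
import subprocess
import sys

# In the driver process, "setting the experiment" is just in-process state:
active_experiment = "my-experiment"

# A separate Python process (which is what a Ray worker is, outside local
# mode) shares none of that state; the name simply does not exist there:
probe = "print('active_experiment' in dir())"
out = subprocess.run(
    [sys.executable, "-c", probe],
    capture_output=True, text=True,
).stdout.strip()
print(out)  # "False": the fresh process never saw the driver's setting
```

If that is the cause, the experiment name would have to travel to the trials through the serialized config (as config["mlflow"] does) rather than through driver-side calls.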

My suggestion is to stay away from local_mode=True for now.

To debug the experiment_name issue, could you add some logging around setup_mlflow.py's setup_mlflow method? Basically, I wonder whether self._mlflow.get_experiment(experiment_id=experiment_id) gives you the correct experiment that you already created in the driver code.
