Error while calling best_checkpoint on Ray Cluster on Kubernetes

I am running my .py script inside a pod on OpenShift Container Platform and it runs very well, but at the end, when I try to get the path to the best checkpoint, I am facing the following error:

File "test_hyperopt.py", line 127, in <module>
logger.info("Best Checkpoint directory: \n{}\n".format(analysis.get_best_checkpoint(best_trial, metric="score", mode="max")))
File "/opt/conda/lib/python3.8/site-packages/ray/tune/analysis/experiment_analysis.py", line 469, in get_best_checkpoint
return TrialCheckpoint(local_path=best_path, cloud_path=cloud_path)
File "/opt/conda/lib/python3.8/site-packages/ray/tune/cloud.py", line 86, in __init__
Checkpoint.__init__(self, uri=PLACEHOLDER)
File "/opt/conda/lib/python3.8/site-packages/ray/ml/checkpoint.py", line 131, in __init__
local_path = _get_local_path(uri)
File "/opt/conda/lib/python3.8/site-packages/ray/ml/checkpoint.py", line 457, in _get_local_path
if path is None or is_non_local_path_uri(path):
File "/opt/conda/lib/python3.8/site-packages/ray/ml/utils/remote_storage.py", line 74, in is_non_local_path_uri
if bool(get_fs_and_path(uri)[0]):
File "/opt/conda/lib/python3.8/site-packages/ray/ml/utils/remote_storage.py", line 104, in get_fs_and_path
fs, path = pyarrow.fs.FileSystem.from_uri(uri)
File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When resolving region for bucket 'placeholder': AWS Error [code 99]: curlCode: 7, Couldn't connect to server

I am only using a Persistent Volume, and I have no idea why I am getting this AWS error even though I am not using any cloud storage.

For context, my tune.run call is as follows:

analysis = tune.run(
    obj_fn,
    local_dir="./results",
    metric="score",
    mode="max",
    checkpoint_score_attr="score",
    sync_config=tune.SyncConfig(
        syncer=None
    ),
    config={...},
    search_alg=algo,
    num_samples=num_samples,
    verbose=Verbosity.V1_EXPERIMENT,
)
best_trial = analysis.get_best_trial(metric="score", mode="max")
logger.info("Best Checkpoint directory: \n{}\n".format(analysis.get_best_checkpoint(best_trial, metric="score", mode="max")))
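As a sanity check, the checkpoint directories written under local_dir can be listed directly to confirm they exist on the pod's volume. This is just a sketch (the trial directory name below is hypothetical, not from my actual run):

```python
import os
import tempfile

def list_checkpoint_dirs(local_dir):
    """Walk a Tune local_dir and collect directories named checkpoint_*."""
    found = []
    for root, dirs, _files in os.walk(local_dir):
        for d in dirs:
            if d.startswith("checkpoint_"):
                found.append(os.path.join(root, d))
    return sorted(found)

# Demo on a throwaway layout mimicking Tune's results tree
# (hypothetical trial name, for illustration only):
with tempfile.TemporaryDirectory() as results:
    ckpt = os.path.join(results, "obj_fn_abc123", "checkpoint_000010")
    os.makedirs(ckpt)
    print(list_checkpoint_dirs(results))
```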

On my personal laptop it gives the path of the best checkpoint, but when I run it on OpenShift, the error is raised. Can anyone please help me understand the cause of the error and a solution for it?

Hmm, it seems your access to S3 from the OpenShift container is having an issue.

Have you checked out this link? Working with Cloud Storage (S3, GCS) • Arrow R Package

Thank you for the reply, but I am not handling AWS anywhere in my code. I am using shared-memory NFS, not any cloud storage. All the trial results and checkpoints are saved within the pod, or on the head node if I use a Ray cluster. I do not understand why the local path is not considered and why Ray is trying to look at a URI path. Correct me if I am wrong.

Ah, got it. You are absolutely right; this should not be expected.
What's happening is that in TrialCheckpoint(local_path, cloud_path), local_path is probably not correct. This makes TrialCheckpoint fall back to trying to download from remote storage (which uses PLACEHOLDER as a placeholder URI).
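Roughly speaking (a simplified sketch, not the actual Ray source), the URI check that ends up in your traceback behaves like this: anything with a non-file scheme is treated as remote storage, so a placeholder URI like "s3://placeholder" sends pyarrow off to resolve an S3 bucket region, which fails inside a cluster with no AWS access:

```python
from urllib.parse import urlparse

def is_non_local_path_uri(uri: str) -> bool:
    """Simplified sketch: treat any URI with a non-file scheme as remote."""
    parsed = urlparse(uri)
    return bool(parsed.scheme) and parsed.scheme != "file"

# A plain local filesystem path is left alone:
print(is_non_local_path_uri("/home/ray/results/exp/checkpoint_000010"))  # False
# But a placeholder S3 URI triggers remote-storage resolution,
# which is where pyarrow tries to contact AWS:
print(is_non_local_path_uri("s3://placeholder"))  # True
```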

One question, what ray version are you using?

Can you update your ray version to ray-2.0.0 by any chance?

I am working with ray==1.13.0.

Yes, sure, I will try running my script with ray==2.0.0 and will let you know as soon as possible.
