Error while calling best_checkpoint on Ray Cluster on Kubernetes

I am running my .py script inside a pod on OpenShift Container Platform and it runs very well, but at the end, when I try to get the path to the best checkpoint, I am facing the following error:

File "test_hyperopt.py", line 127, in <module>
logger.info("Best Checkpoint directory: \n{}\n".format(analysis.get_best_checkpoint(best_trial, metric="score", mode="max")))
File "/opt/conda/lib/python3.8/site-packages/ray/tune/analysis/experiment_analysis.py", line 469, in get_best_checkpoint
return TrialCheckpoint(local_path=best_path, cloud_path=cloud_path)
File "/opt/conda/lib/python3.8/site-packages/ray/tune/cloud.py", line 86, in __init__
Checkpoint.__init__(self, uri=PLACEHOLDER)
File "/opt/conda/lib/python3.8/site-packages/ray/ml/checkpoint.py", line 131, in __init__
local_path = _get_local_path(uri)
File "/opt/conda/lib/python3.8/site-packages/ray/ml/checkpoint.py", line 457, in _get_local_path
if path is None or is_non_local_path_uri(path):
File "/opt/conda/lib/python3.8/site-packages/ray/ml/utils/remote_storage.py", line 74, in is_non_local_path_uri
if bool(get_fs_and_path(uri)[0]):
File "/opt/conda/lib/python3.8/site-packages/ray/ml/utils/remote_storage.py", line 104, in get_fs_and_path
fs, path = pyarrow.fs.FileSystem.from_uri(uri)
File "pyarrow/_fs.pyx", line 463, in pyarrow._fs.FileSystem.from_uri
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When resolving region for bucket 'placeholder': AWS Error [code 99]: curlCode: 7, Couldn't connect to server

I am only using a Persistent Volume, and I have no idea why I am getting this AWS error even though I am not using any cloud storage.

For context, my tune.run call is as follows:

analysis = tune.run(
    obj_fn,
    local_dir="./results",
    metric="score",
    mode="max",
    checkpoint_score_attr="score",
    sync_config=tune.SyncConfig(
        syncer=None
    ),
    config={...},
    search_alg=algo,
    num_samples=num_samples,
    verbose=Verbosity.V1_EXPERIMENT,
)
best_trial = analysis.get_best_trial(metric="score", mode="max")
logger.info("Best Checkpoint directory: \n{}\n".format(analysis.get_best_checkpoint(best_trial, metric="score", mode="max")))
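As a sanity check, the checkpoint directories written under local_dir can be listed directly to confirm they exist on the pod's volume. This is just a sketch (the trial directory name below is hypothetical, not from my actual run):

```python
import os
import tempfile

def list_checkpoint_dirs(local_dir):
    """Walk a Tune local_dir and collect directories named checkpoint_*."""
    found = []
    for root, dirs, _files in os.walk(local_dir):
        for d in dirs:
            if d.startswith("checkpoint_"):
                found.append(os.path.join(root, d))
    return sorted(found)

# Demo on a throwaway layout mimicking Tune's results tree
# (hypothetical trial name, for illustration only):
with tempfile.TemporaryDirectory() as results:
    ckpt = os.path.join(results, "obj_fn_abc123", "checkpoint_000010")
    os.makedirs(ckpt)
    print(list_checkpoint_dirs(results))
```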

On my personal laptop it gives the path of the best checkpoint, but when I run it on OpenShift, the error is raised. Can anyone please help me understand the cause of the error and a solution for it?

Hmm, it seems your access to S3 from the OpenShift container is having an issue.

Have you checked out this link? Working with Cloud Storage (S3, GCS) • Arrow R Package

Thank you for the reply, but I am not handling AWS anywhere in my code. I am using shared-memory NFS, not any cloud storage. All the trial results and checkpoints are saved within the pod, or on the head node if I use a Ray cluster. I do not understand why the local path is not considered and why Ray is trying to look at a URI path. Correct me if I am wrong.

Ah, got it. You are absolutely right; this should not be expected.
What's happening is that in TrialCheckpoint(local_path, cloud_path), local_path is probably not correct. This makes TrialCheckpoint fall back to trying to download from remote storage (which uses PLACEHOLDER as a placeholder URI).
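Roughly speaking (a simplified sketch, not the actual Ray source), the URI check that ends up in your traceback behaves like this: anything with a non-file scheme is treated as remote storage, so a placeholder URI like "s3://placeholder" sends pyarrow off to resolve an S3 bucket region, which fails inside a cluster with no AWS access:

```python
from urllib.parse import urlparse

def is_non_local_path_uri(uri: str) -> bool:
    """Simplified sketch: treat any URI with a non-file scheme as remote."""
    parsed = urlparse(uri)
    return bool(parsed.scheme) and parsed.scheme != "file"

# A plain local filesystem path is left alone:
print(is_non_local_path_uri("/home/ray/results/exp/checkpoint_000010"))  # False
# But a placeholder S3 URI triggers remote-storage resolution,
# which is where pyarrow tries to contact AWS:
print(is_non_local_path_uri("s3://placeholder"))  # True
```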

One question, what ray version are you using?

Can you update your ray version to ray-2.0.0 by any chance?

I am working with ray==1.13.0.

Yes, sure, I will try running my script with ray==2.0.0 and will let you know as soon as possible.
