Hi!
I am trying to read a dataset of images from an on-prem S3 solution that uses SSL with my corporation's internally issued CA certificate. I have a RayCluster running in Kubernetes and have extended the rayproject/ray image to include these CA certificates. I have also set the environment variable REQUESTS_CA_BUNDLE, and that made boto3 work. But when using ray.data.read_images with a pyarrow.fs.S3FileSystem I have no luck. If I enter a pod and look for certificate paths I get:
>>> import certifi
>>> certifi.where()
'/home/ray/anaconda3/lib/python3.8/site-packages/certifi/cacert.pem'
>>> import ssl
>>> ssl.get_default_verify_paths()
DefaultVerifyPaths(cafile='/etc/ssl/certs/ca-certificates.crt', capath=None, openssl_cafile_env='SSL_CERT_FILE', openssl_cafile='/home/ray/anaconda3/ssl/cert.pem', openssl_capath_env='SSL_CERT_DIR', openssl_capath='/home/ray/anaconda3/ssl/certs')
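To double-check that the corporate chain actually landed in that bundle, here is a quick sketch I can run in the pod ("Corp" below is a placeholder for whatever appears in our chain's subject):
>>> import ssl
>>> ctx = ssl.create_default_context(cafile="/etc/ssl/certs/ca-certificates.crt")
>>> len(ctx.get_ca_certs())  # total CAs loaded from the bundle
>>> [c["subject"] for c in ctx.get_ca_certs() if "Corp" in str(c["subject"])]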
This is my Dockerfile:
FROM rayproject/ray:2.4.0-py38
# Add the corporate CA chain and merge it into the system trust store.
COPY CorpCaChain.pem /usr/local/share/ca-certificates/CorpCaChain.crt
USER root
RUN update-ca-certificates
USER ray
# Point requests/botocore at the merged system bundle.
ENV REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
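Since the error below comes from curl rather than from Python's ssl module, I have also been experimenting with the OpenSSL variables that ssl.get_default_verify_paths() names above. A sketch of what I mean; I have no confirmation that the AWS SDK for C++ inside pyarrow honors these:
import os

# Unverified workaround: point OpenSSL at the system bundle before
# pyarrow's S3 subsystem initializes.
os.environ.setdefault("SSL_CERT_FILE", "/etc/ssl/certs/ca-certificates.crt")
os.environ.setdefault("SSL_CERT_DIR", "/etc/ssl/certs")

import ray  # imported only after the environment is set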
With boto3 there is no problem:
import boto3
import ray
import json

if __name__ == '__main__':
    ray.init()
    config = json.load(open('config.json'))
    new_bucket_name = config['bucket_name']
    b3_session = boto3.Session(
        aws_access_key_id=config['access_key'],
        aws_secret_access_key=config['secret_key'],
    )
    s3_resource = b3_session.resource(
        service_name='s3',
        use_ssl=(config["scheme"] == 'https'),
        endpoint_url=f"{config['scheme']}://{config['endpoint']}",
    )
    my_bucket = s3_resource.Bucket(new_bucket_name)
    # Listing objects succeeds, so the corporate CA is trusted here.
    for s3_file in my_bucket.objects.all():
        print(s3_file.key)
But not with this. My guess is that pyarrow's S3FileSystem goes through the AWS SDK for C++ and libcurl rather than Python's ssl module, so REQUESTS_CA_BUNDLE is never consulted (curl error 60 is CURLE_PEER_FAILED_VERIFICATION):
import ray
import json
from pyarrow.fs import S3FileSystem

if __name__ == '__main__':
    ray.init()
    config = json.load(open('config.json'))
    bucket_name = config['bucket_name']
    s3_path = f"s3://{bucket_name}"
    s3_filesystem = S3FileSystem(
        access_key=config['access_key'],
        secret_key=config['secret_key'],
        endpoint_override=f"{config['scheme']}://{config['endpoint']}",
        scheme=config['scheme'],
    )
    # Fails while resolving file metadata for the bucket (see traceback).
    ds = ray.data.read_images(
        s3_path,
        include_paths=True,
        filesystem=s3_filesystem,
    )
The error:
2023-06-19 08:08:53,237 INFO worker.py:1616 -- Connected to Ray cluster. View the dashboard at http://10.244.1.23:8265
Traceback (most recent call last):
File "ray_air_s3test.py", line 19, in <module>
ds = ray.data.read_images(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/read_api.py", line 663, in read_images
return read_datasource(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/read_api.py", line 334, in read_datasource
requested_parallelism, min_safe_parallelism, read_tasks = ray.get(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/client_mode_hook.py", line 105, in wrapper
return func(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/worker.py", line 2521, in get
raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(OSError): ray::_get_read_tasks() (pid=3853, ip=10.244.1.73)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/read_api.py", line 1873, in _get_read_tasks
reader = ds.create_reader(**kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/image_datasource.py", line 65, in create_reader
return _ImageDatasourceReader(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/image_datasource.py", line 144, in __init__
super().__init__(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/file_based_datasource.py", line 391, in __init__
zip(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/file_meta_provider.py", line 175, in expand_paths
yield from _expand_paths(paths, filesystem, partitioning, ignore_missing_paths)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/file_meta_provider.py", line 408, in _expand_paths
yield from _get_file_infos_serial(paths, filesystem, ignore_missing_paths)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/file_meta_provider.py", line 435, in _get_file_infos_serial
yield from _get_file_infos(path, filesystem, ignore_missing_paths)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/file_meta_provider.py", line 498, in _get_file_infos
_handle_read_os_error(e, path)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/file_meta_provider.py", line 378, in _handle_read_os_error
raise error
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/file_meta_provider.py", line 496, in _get_file_infos
file_info = filesystem.get_file_info(path)
File "pyarrow/_fs.pyx", line 571, in pyarrow._fs.FileSystem.get_file_info
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 115, in pyarrow.lib.check_status
OSError: When getting information for bucket 'raw512x256lab': AWS Error NETWORK_CONNECTION during HeadBucket operation: curlCode: 60, SSL peer certificate or SSH remote key was not OK
---------------------------------------
Job 'raysubmit_BwN6R6qfHufGCSXr' failed
---------------------------------------
Status message: Job failed due to an application error, last available logs (truncated to 20,000 chars):
(same traceback as above)
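To rule Ray out, I believe the same error reproduces with pyarrow alone, since the traceback bottoms out in pyarrow._fs.FileSystem.get_file_info. A stripped-down sketch (the endpoint below is a placeholder for my real one):
from pyarrow.fs import S3FileSystem

fs = S3FileSystem(
    access_key="...",  # same credentials as in config.json
    secret_key="...",
    endpoint_override="https://my-onprem-s3.example.com",  # placeholder
)
# HeadBucket on the bucket named in the traceback; I expect the same
# curlCode 60 certificate failure here.
print(fs.get_file_info("raw512x256lab"))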
Any advice would be appreciated!