Loading Geotiff Images Into Ray Dataset

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

I’m trying to migrate a geospatial ML inference pipeline to Ray.
The first step in our pipeline is loading a set of GeoTIFF files that represent satellite images. Where I’m stuck is figuring out the best way to get the GeoTIFFs into a Ray Dataset.

So far:
I tried using ray.data.read_images, but it relies on PIL/Pillow. Pillow says it supports TIFFs (Image file formats - Pillow (PIL Fork) 10.2.0 documentation), but it doesn’t seem to be very robust at opening GeoTIFFs (is it possible to open a geotiff file in python without using gdal? - Geographic Information Systems Stack Exchange). I tried opening 4 different GeoTIFFs; only 1 opened with PIL, and it wasn’t clear why the others failed.
I tried using GDAL to open the GeoTIFF, storing the result in a Python object, and loading it with Ray’s from_items, but Ray had trouble serializing the GDAL-opened image.

Do you have any tips on how to load geotiffs into a ray dataset?

One possible way would be to read the GeoTIFF files with ray.data.read_binary_files() (docs), then decode each file with a function passed to Dataset.map() (docs). This decode function would likely use PIL or GDAL, taking the raw bytes of one file and translating them into the object format you actually want in the dataset.

ds = ray.data.read_binary_files(file_paths).map(decode_fn)
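To make the shape of decode_fn more concrete, here is a minimal sketch. It assumes a recent Ray 2.x release, where each row from read_binary_files() is a dict with the file contents under "bytes" (plus "path" when include_paths=True); older versions may return rows in a different form. make_decode_fn, build_dataset, and the decoder argument are hypothetical names for illustration, and decoder stands for whatever bytes-to-array function you end up using (PIL, GDAL, or something else).

```python
from typing import Any, Callable, Dict, List


def make_decode_fn(
    decoder: Callable[[bytes], Any],
) -> Callable[[Dict[str, Any]], Dict[str, Any]]:
    """Wrap a bytes -> image decoder as a row-wise function for Dataset.map()."""

    def decode_fn(row: Dict[str, Any]) -> Dict[str, Any]:
        # Assumption: the raw file contents are stored under the "bytes" key.
        out = {"image": decoder(row["bytes"])}
        # Keep the source path when it is present, so results can be traced
        # back to the file they came from.
        if "path" in row:
            out["path"] = row["path"]
        return out

    return decode_fn


def build_dataset(file_paths: List[str], decoder: Callable[[bytes], Any]):
    """Read each file as raw bytes, then decode it row by row."""
    import ray  # imported lazily so make_decode_fn can be tried without Ray

    ds = ray.data.read_binary_files(file_paths, include_paths=True)
    return ds.map(make_decode_fn(decoder))
```

Keeping the decoder separate from the row handling means you can try the decode step on a single file's bytes locally before running it across the whole dataset.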

Thanks sjl!
This helped me. I used Ray’s read_binary_files API like you suggested, and then used GDAL’s virtual file system (see the GDAL documentation page on virtual file systems: /vsimem, /vsizip, /vsitar, /vsicurl, ...) so that the GeoTIFFs are opened from memory instead of from disk.
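For anyone who finds this later, a rough sketch of that /vsimem approach is below. It is not my exact code: vsimem_name and decode_geotiff are illustrative names, it assumes the GDAL Python bindings (osgeo) are installed on every Ray worker, and it assumes each row carries the file contents under "bytes". The function returns a NumPy array plus plain Python metadata rather than the GDAL dataset object itself, so Ray only has to serialize ordinary values.

```python
import uuid
from typing import Any, Dict


def vsimem_name(suffix: str = ".tif") -> str:
    """Return a unique path inside GDAL's in-memory file system."""
    return f"/vsimem/{uuid.uuid4().hex}{suffix}"


def decode_geotiff(row: Dict[str, Any]) -> Dict[str, Any]:
    """Decode one GeoTIFF held in memory into an array plus georeferencing."""
    # Imported here so the import happens on the worker that runs the task.
    from osgeo import gdal

    mem_path = vsimem_name()
    # Copy the raw bytes into /vsimem so GDAL can open them like a file.
    gdal.FileFromMemBuffer(mem_path, row["bytes"])
    try:
        dataset = gdal.Open(mem_path)
        if dataset is None:
            raise ValueError("GDAL could not open the in-memory GeoTIFF")
        result = {
            # (bands, rows, cols) for multi-band files, (rows, cols) otherwise.
            "image": dataset.ReadAsArray(),
            "geotransform": list(dataset.GetGeoTransform()),
            "projection": dataset.GetProjection(),
        }
        dataset = None  # release the GDAL handle before removing the file
        return result
    finally:
        # Free the in-memory copy so workers do not accumulate file data.
        gdal.Unlink(mem_path)
```

It would then be used as ds = ray.data.read_binary_files(file_paths).map(decode_geotiff). The unique name per call is there so that concurrent tasks in the same process do not overwrite each other's in-memory files.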