Ray Dataset Cannot Read Parquet File

Hi everyone,

I am trying to read an unpartitioned Parquet file with ray.data.read_parquet, but I get the following error:
RayTaskError(PickleError): ray::remote_read() (pid=221, ip=XXX.XXX.XXX.XXX)
  File "/opt/conda/lib/python3.8/site-packages/ray/data/read_api.py", line 166, in remote_read
  File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/datasource.py", line 120, in __call__
    result = self._read_fn()
  File "/opt/conda/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 145, in
  File "/opt/conda/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 78, in read_pieces
  File "/opt/conda/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 79, in
  File "stringsource", line 6, in pyarrow._dataset.__pyx_unpickle_ParquetReadOptions
_pickle.PickleError: Incompatible checksums (239187151 vs 0xaef2dd6 = (dictionary_columns))

I have reproduced the issue with several different files, so I suspect it is caused by the package versions. I am using:
ray==1.9.2
pyarrow==5.0.0
cloudpickle==1.6.0

Any insight on this issue would be greatly appreciated! :smiley:

Hi @dan, I agree this is probably version-related. Have you tried a newer version of Ray? The current release is 1.13.0 (and 2.0.0rc0 was just released). If you plan to stay on 1.9.2, could you share a script and a file that reproduce the issue?
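
In the meantime, one thing that may be worth checking: an "Incompatible checksums" error raised while unpickling a pyarrow object usually suggests that the process that pickled it and the process that unpickled it were running different pyarrow builds. The traceback also mentions two different site-packages roots (/opt/conda and /home/ray/anaconda3), which might point the same way. Below is a minimal sketch for comparing the versions seen by the driver and by a worker. The helper names (package_versions, find_mismatches, check_cluster) are made up for illustration and are not part of Ray, and check_cluster assumes it runs on a machine that can reach an already running cluster.

```python
from importlib import metadata  # standard library since Python 3.8

# Packages named in the original post.
PACKAGES = ("ray", "pyarrow", "cloudpickle")


def package_versions(names=PACKAGES):
    """Return {package: installed version, or None if it is not installed}."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(driver, worker):
    """Return {package: (driver_version, worker_version)} where the two differ."""
    names = sorted(set(driver) | set(worker))
    return {
        name: (driver.get(name), worker.get(name))
        for name in names
        if driver.get(name) != worker.get(name)
    }


def check_cluster():
    """Hypothetical helper: compare the driver with one Ray worker process."""
    import ray  # imported lazily so the helpers above also work without Ray

    ray.init(address="auto")  # assumption: a cluster is already running
    remote_versions = ray.remote(package_versions)
    worker = ray.get(remote_versions.remote())
    return find_mismatches(package_versions(), worker)
```

Calling check_cluster() from the driver script returns an empty dict when the sampled worker matches the driver. It only samples one worker, so on a multi-node cluster it may need to be run a few times, or scheduled on each node, before a mismatch shows up.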