Hi everyone,
I am trying to read an unpartitioned Parquet file using "ray.data.read_parquet", but I am getting the following error:
RayTaskError(PickleError): ray::remote_read() (pid=221, ip=XXX.XXX.XXX.XXX)
File "/opt/conda/lib/python3.8/site-packages/ray/data/read_api.py", line 166, in remote_read
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/data/datasource/datasource.py", line 120, in __call__
result = self._read_fn()
File "/opt/conda/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 145, in
File "/opt/conda/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 78, in read_pieces
File "/opt/conda/lib/python3.8/site-packages/ray/data/datasource/parquet_datasource.py", line 79, in
File "stringsource", line 6, in pyarrow._dataset.__pyx_unpickle_ParquetReadOptions
_pickle.PickleError: Incompatible checksums (239187151 vs 0xaef2dd6 = (dictionary_columns))
I have reproduced the issue with several different files, so I suspect it is caused by the package versions rather than by the data itself. I am using:
ray==1.9.2
pyarrow==5.0.0
cloudpickle==1.6.0
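Since the traceback above mentions two different site-packages roots (/opt/conda and /home/ray/anaconda3), it may be worth confirming that every Python environment involved reports the same versions. Below is a minimal sketch using only the standard library; the helper names installed_versions and environment_report are my own and not part of Ray or pyarrow, and I am assuming the script is run separately with each interpreter that takes part in the job.

```python
import sys
from importlib import metadata

# Packages listed in this post; adjust as needed.
PACKAGES = ("ray", "pyarrow", "cloudpickle")


def installed_versions(packages=PACKAGES):
    """Return {package: version} for the current interpreter.

    A package that is not installed maps to None.
    (Hypothetical helper, not a Ray or pyarrow API.)
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def environment_report(packages=PACKAGES):
    """Bundle the interpreter path with the package versions it sees."""
    return {
        "python": sys.executable,
        "versions": installed_versions(packages),
    }


if __name__ == "__main__":
    # Run this with each interpreter and compare the printed reports.
    print(environment_report())
```

If the reports differ between environments, in particular for pyarrow, that would be consistent with an object being pickled under one version and unpickled under another, though I have not confirmed that this is the cause here.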
Any insight on this issue would be greatly appreciated!