Segmentation fault when reading data from HDFS with Ray Data

I am reading data from HDFS with Ray Data (2.39.0) and get a segmentation fault, even though ray.shutdown() has already been executed.

My job is simple:

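import sys
import ray
from pyarrow.fs import HadoopFileSystem
import pyarrow

# Start (or connect to) the Ray runtime.
ray.init()

if __name__ == "__main__":
    # Connection arguments elided.
    hdfs = HadoopFileSystem(xxx)
    # Read a single Parquet file through the HDFS filesystem object.
    ds = ray.data.read_parquet("/test/xxx.parquet", filesystem=hdfs)
    print(ds.schema())
    ds.show(10)

print("before shutdown")
ray.shutdown()
# This line is printed; the crash (when it occurs) comes after it.
print("after shutdown")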

All of the code runs and “after shutdown” is printed, but the process then crashes with a segmentation fault. It does not happen on every run.

If I read the same data with pyarrow alone (no Ray), everything works fine.

What is happening here, and what should I do?


I ran into the same issue and found that read_parquet_bulk was more stable. Note that it is marked as deprecated, and it does not read directories for you, so the files have to be listed manually.
