Ray Data reads from HDFS slowly and processes slowly

Our dataset has 2 million rows, each with 4,000 columns, totaling about 80 GB. The data is stored on HDFS.
Ray cluster resources: 6 cores and 24 GB of memory per node, three nodes in total.

We use ray.data.read_csv() to read the HDFS data; all of it is loaded into memory and spills to disk. When we run ds.show(), ds.max("x0"), or other operators, they are all very slow, much slower than Spark under the same configuration. And when we run ds.max("x0") multiple times, each run still takes a long time. I can't understand why: if the data is already in memory, why is it still so slow after the first run?

Code example:

import ray
from pyarrow import fs

ray.init('auto')

hdfs_fs = fs.HadoopFileSystem.from_uri("hdfs://xx-hadoop/?user=root")
ds = ray.data.read_csv('/data/eps-files/', filesystem=hdfs_fs, parallelism=2000)
ds.max("x0")  # The first run places all data in the object store and spills to disk,
              # which takes a long time
ds.max("x0")  # The second run takes about as long as the first

Versions / Dependencies

ray 2.6.0
hdfs 3.2.2

cc: @chengsu Any idea or insights into this? Anything unique about HDFS URL here?

I have solved this problem by adding ds.materialize() before the max operation.

Excellent @yiwei00000. I’ll close this issue.