Ray Data reads from HDFS slowly and processes slowly

Our dataset has 2 million rows, each with 4,000 columns, totaling about 80 GB. The data is stored on HDFS.
Ray cluster resources: 6 cores and 24 GB of memory per node, three nodes in total.

We use ray.data.read_csv() to read the HDFS data; all of it is loaded into memory and spills to disk. When we run ds.show(), ds.max("x0"), or other operators, they are all very slow, much slower than Spark under the same configuration. And when we run ds.max("x0") multiple times, each run still takes a long time. I can't understand why: if the data is already in memory, why is it still so slow after the first run?

Code example:

import ray
from pyarrow import fs

ray.init('auto')

hdfs_fs = fs.HadoopFileSystem.from_uri("hdfs://xx-hadoop/?user=root")
ds = ray.data.read_csv('/data/eps-files/', filesystem=hdfs_fs, parallelism=2000)
ds.max("x0")  # The first run places all data in the object store and spills to disk,
              # which takes a long time
ds.max("x0")  # The second run takes about as long as the first

Versions / Dependencies

ray 2.6.0
hdfs 3.2.2

cc: @chengsu Any idea or insights into this? Anything unique about HDFS URL here?

I have solved this problem by adding ds.materialize() before the max operation.

Excellent @yiwei00000. I’ll close this issue.