Our dataset has 2 million rows, each with 4,000 columns, totaling about 80 GB. The data is stored on HDFS.
The Ray cluster has 6 cores and 24 GB of memory per node, with three nodes in total.
We use ray.data.read_csv() to read the HDFS data; all of the data is read into memory and spills over to disk. When we run ds.show(), ds.max("x0"), or other operators, they are all very slow, much slower than Spark with the same configuration. And when we run ds.max("x0") multiple times, each run still takes a long time. I can't understand why: the data is already in memory, so why does it take so long even after the first run?
Code example:
import ray
from pyarrow import fs

ray.init("auto")

hdfs_fs = fs.HadoopFileSystem.from_uri("hdfs://xx-hadoop/?user=root")
ds = ray.data.read_csv("/data/eps-files/", filesystem=hdfs_fs, parallelism=2000)

ds.max("x0")  # On the first run, all data is loaded into the object store and
              # spilled to disk, which takes a long time.
ds.max("x0")  # The second run takes about as long as the first.
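What I want to verify next is whether materializing the dataset once changes the second run. This is only a sketch, assuming the repeated cost comes from the dataset being lazily re-executed (re-read from HDFS) on every aggregation instead of being served from the object store:

import ray
from pyarrow import fs

ray.init("auto")

hdfs_fs = fs.HadoopFileSystem.from_uri("hdfs://xx-hadoop/?user=root")
ds = ray.data.read_csv("/data/eps-files/", filesystem=hdfs_fs, parallelism=2000)

# Execute the read once and keep the resulting blocks in the object store
# (spilling to disk if they do not fit), instead of re-running the read
# pipeline on every aggregation. materialize() is available in Ray 2.6;
# whether it removes the repeated cost here is exactly what is unclear to me.
ds = ds.materialize()

print(ds.max("x0"))  # pays the aggregation cost over the materialized blocks
print(ds.max("x0"))  # expected to be faster if the blocks are reused as assumed

If the second call is still slow after materialize(), then caching is probably not the issue, and I would look at spilling and CSV decoding overhead instead.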
Versions / Dependencies
ray 2.6.0
hdfs 3.2.2