What happened + What you expected to happen
Our dataset has 100,000 rows, each with 1,000 columns, roughly 1.5 GB in total. The data is stored on HDFS.
The Ray cluster has three nodes, each with 6 cores and 24 GB of memory.
Ray's MaxAbsScaler takes about 10 minutes to fit_transform this data, while Spark's MaxAbsScaler fits and transforms the same data in about 15 seconds. Why is Ray so slow? Spark and Ray use the same machine resources.
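For reference, the Spark side of the comparison looks roughly like this (a minimal sketch; the exact input URI, header/schema options, and column selection are assumptions, not the code I actually ran):

from pyspark.sql import SparkSession
from pyspark.ml.feature import MaxAbsScaler, VectorAssembler

spark = SparkSession.builder.appName("maxabs-compare").getOrCreate()

# Assumed path: the same CSV files read by the Ray script below.
df = spark.read.csv("hdfs://jd-hadoop/data/eps-files/", header=True, inferSchema=True)

# Spark's MaxAbsScaler operates on a single vector column, so the
# feature columns are assembled into one vector first.
feature_cols = df.columns[1:1001]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df)

scaler = MaxAbsScaler(inputCol="features", outputCol="scaled_features")
model = scaler.fit(assembled)        # computes per-column max(|x|)
scaled = model.transform(assembled)  # divides each value by its column's max(|x|)
scaled.show(10)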
Versions / Dependencies
ray 2.6.0
hdfs 3.2.2
Reproduction script
import ray
from pyarrow import fs

ray.init("auto")

# Connect to HDFS and list the input files.
hdfs_fs = fs.HadoopFileSystem.from_uri("hdfs://jd-hadoop/?user=root")
ds = ray.data.read_csv("/data/eps-files/", filesystem=hdfs_fs, parallelism=2000)
files = sorted(ds.input_files())

# Re-read only the first 100 files and materialize them into object store memory.
ds = ray.data.read_csv(files[0:100], filesystem=hdfs_fs, parallelism=100)
ds = ds.materialize()
ds.max("x1")  # sanity check: aggregate over a single column

# Keep the first 1001 columns; the first (an id-like column) is
# excluded from scaling below.
cols = ds.columns()[0:1001]
ds_1001 = ds.select_columns(cols)
ds_1001 = ds_1001.materialize()

from ray.data.preprocessors import MaxAbsScaler

preprocessor = MaxAbsScaler(columns=cols[1:])
ds_1001_trans = preprocessor.fit_transform(ds_1001)  # takes about 10 minutes
ds_1001_trans = ds_1001_trans.materialize()
ds_1001_trans.show(10)
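To narrow down where the time goes, one option is to time the fit and transform steps separately and inspect Ray Data's per-stage execution statistics. This is a sketch that continues from the script above; the stats output format varies by Ray version:

import time

# Fit computes max(|x|) per column via an aggregation over the dataset.
t0 = time.perf_counter()
preprocessor.fit(ds_1001)
print(f"fit took {time.perf_counter() - t0:.1f}s")

# Transform applies the per-column division in map_batches tasks.
t0 = time.perf_counter()
ds_scaled = preprocessor.transform(ds_1001).materialize()
print(f"transform took {time.perf_counter() - t0:.1f}s")

# Per-stage execution statistics (task counts, wall time, memory usage).
print(ds_scaled.stats())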
Issue Severity
High: It blocks me from completing my task.