What happened + What you expected to happen
Our dataset has 100,000 rows, each with 1,000 columns, roughly 1.5 GB in total. The data is stored on HDFS.
The Ray cluster has three nodes, each with 6 cores and 24 GB of memory.
Ray's MaxAbsScaler takes about 10 minutes to fit_transform this data, while Spark's MaxAbsScaler fits and transforms the same data in about 15 seconds. Why is Ray so slow? Spark and Ray use the same machine resources.
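For reference, the Spark side of the comparison looks roughly like this (a minimal sketch; the exact input URI, header/schema options, and column selection are assumptions, not the code I actually ran):

from pyspark.sql import SparkSession
from pyspark.ml.feature import MaxAbsScaler, VectorAssembler

spark = SparkSession.builder.appName("maxabs-compare").getOrCreate()

# Assumed path: the same CSV files read by the Ray script below.
df = spark.read.csv("hdfs://jd-hadoop/data/eps-files/", header=True, inferSchema=True)

# Spark's MaxAbsScaler operates on a single vector column, so the
# feature columns are assembled into one vector first.
feature_cols = df.columns[1:1001]
assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
assembled = assembler.transform(df)

scaler = MaxAbsScaler(inputCol="features", outputCol="scaled_features")
model = scaler.fit(assembled)        # computes per-column max(|x|)
scaled = model.transform(assembled)  # divides each value by its column's max(|x|)
scaled.show(10)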
Versions / Dependencies
ray 2.6.0
hdfs 3.2.2
Reproduction script
import ray
from pyarrow import fs

ray.init("auto")

# Connect to HDFS and list the input files.
hdfs_fs = fs.HadoopFileSystem.from_uri("hdfs://jd-hadoop/?user=root")
ds = ray.data.read_csv("/data/eps-files/", filesystem=hdfs_fs, parallelism=2000)
files = sorted(ds.input_files())

# Re-read only the first 100 files and materialize them into object store memory.
ds = ray.data.read_csv(files[0:100], filesystem=hdfs_fs, parallelism=100)
ds = ds.materialize()
ds.max("x1")  # sanity check: aggregate over a single column

# Keep the first 1001 columns; the first (an id-like column) is
# excluded from scaling below.
cols = ds.columns()[0:1001]
ds_1001 = ds.select_columns(cols)
ds_1001 = ds_1001.materialize()

from ray.data.preprocessors import MaxAbsScaler

preprocessor = MaxAbsScaler(columns=cols[1:])
ds_1001_trans = preprocessor.fit_transform(ds_1001)  # takes about 10 minutes
ds_1001_trans = ds_1001_trans.materialize()
ds_1001_trans.show(10)
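To narrow down where the time goes, one option is to time the fit and transform steps separately and inspect Ray Data's per-stage execution statistics. This is a sketch that continues from the script above; the stats output format varies by Ray version:

import time

# Fit computes max(|x|) per column via an aggregation over the dataset.
t0 = time.perf_counter()
preprocessor.fit(ds_1001)
print(f"fit took {time.perf_counter() - t0:.1f}s")

# Transform applies the per-column division in map_batches tasks.
t0 = time.perf_counter()
ds_scaled = preprocessor.transform(ds_1001).materialize()
print(f"transform took {time.perf_counter() - t0:.1f}s")

# Per-stage execution statistics (task counts, wall time, memory usage).
print(ds_scaled.stats())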
Issue Severity
High: It blocks me from completing my task.