Hi!
I am training a model using Ray’s distributed XGBoost implementation.
For the last couple of days I’ve been testing it on datasets of different sizes. What I observed is that following the official tutorial, which suggests loading the data as a Dataset
and then accessing and materializing its shards in the workers [link1] [link2], is much slower than directly loading chunks of data in the workers, and it consumes much more memory (because of the object store), leading to a need for more workers.
For instance, loading a Parquet dataset of 15M rows and ~1.5k columns from GCS takes around 1 minute when the reading is done in the workers, but crashes after 11 minutes when using a Dataset (due to out-of-disk-space errors, most likely caused by object spilling).
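For context, this is roughly what the docs-style setup looks like in my case; the bucket path, label column, worker count, and trainer parameters below are illustrative placeholders, not my real configuration:

```python
import ray
from ray.train import ScalingConfig
from ray.train.xgboost import XGBoostTrainer

# Docs-style approach: read the whole Parquet dataset as a Ray Dataset and let
# the trainer split and materialize shards on the workers.
train_ds = ray.data.read_parquet("gs://my-bucket/training-data/")  # placeholder path

trainer = XGBoostTrainer(
    scaling_config=ScalingConfig(num_workers=8),   # placeholder worker count
    label_column="label",                          # placeholder label column
    params={"objective": "binary:logistic"},       # placeholder params
    datasets={"train": train_ds},
)
result = trainer.fit()
```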
I understand that Ray Data does much more than just fetching and loading the data into memory, but I don’t understand why the difference is so huge.
Here is the worker function I used for debugging; it loads only the file paths from its Dataset shard, but does the actual reading locally:
```python
import pandas as pd
import ray
from ray import train


def worker_from_paths():
    def get_dataset_shard(dataset_key: str) -> pd.DataFrame:
        # The shard holds only file paths, one per row in the "item" column.
        ds_files_iter = ray.train.get_dataset_shard(dataset_key)
        ds_files = list(ds_files_iter.materialize().to_pandas()["item"])
        # Read the assigned Parquet files directly on this worker.
        df = pd.concat((pd.read_parquet(f) for f in ds_files))
        return df

    df = get_dataset_shard("train")
    train.report({"size": len(df)})
```
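For completeness, the file-path Dataset this function consumes can be built from a plain Python list of path strings; from_items turns each string into a row with a single "item" column, which is why the shard is read back via to_pandas()["item"]. The listing below (using gcsfs) is just an illustrative sketch:

```python
import gcsfs
import ray

# Build a tiny Dataset that holds only the Parquet file paths, not the data itself.
fs = gcsfs.GCSFileSystem()
paths = ["gs://" + p for p in fs.glob("gs://my-bucket/training-data/*.parquet")]  # placeholder path
paths_ds = ray.data.from_items(paths)  # rows look like {"item": "gs://..."}
```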
And here is the direct one, adapted from Ray's XGBoost implementation:
```python
def worker_from_direct():
    def get_dataset_shard(dataset_key: str) -> pd.DataFrame:
        # Materialize the full data shard from the object store into a local DataFrame.
        ds_files_iter = ray.train.get_dataset_shard(dataset_key)
        df = ds_files_iter.materialize().to_pandas()
        return df

    df = get_dataset_shard("train")
    train.report({"size": len(df)})
```
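In case it's relevant: to compare the two, I ran each debug function through a plain trainer loop. The sketch below uses DataParallelTrainer purely as an illustrative harness (the real training job uses the XGBoost trainer), and paths_ds / train_ds refer to the sketches above:

```python
from ray.train import ScalingConfig
from ray.train.data_parallel_trainer import DataParallelTrainer

# Illustrative harness: run one of the debug functions on N workers, passing either
# the file-path Dataset (paths_ds) or the full Parquet Dataset (train_ds) as "train".
trainer = DataParallelTrainer(
    train_loop_per_worker=worker_from_paths,       # or worker_from_direct
    scaling_config=ScalingConfig(num_workers=8),   # placeholder worker count
    datasets={"train": paths_ds},                  # or {"train": train_ds}
)
trainer.fit()
```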
I’d like to ask whether there is any practical difference between these two approaches, and why the approach I find more efficient is not suggested in the docs. In what use cases might StreamSplitDataIterator be better?
Thanks!