ray.data.read_csv keeps pausing

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

This is the read I am running:

    dataset = ray.data.read_csv(
        s3_bucket_name+'/'+s3_folder_path_prefix+'/',
        partition_filter=partition_filter,
        filesystem=s3,
        convert_options=convert_options
    )

When I materialize this, it works, but the read keeps pausing or proceeds very slowly, with occasional bursts of progress. Monitoring the process with `top`, I see that most of the worker processes are sleeping and only a few are running. Occasionally many of them start running at once, and that is when the read progress bar moves forward.

I am using a 64-CPU instance and all of its cores for the read, yet my data read is very slow because of this. Is this a known issue, or is there a parameter that should be changed? A simple multiprocessing read function seems to be much faster.
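For reference, the kind of multiprocessing baseline I am comparing against looks roughly like this (the file list and per-file parsing are simplified for illustration; `read_one` and `read_all` are not the actual function names):

```python
import csv
import multiprocessing as mp

def read_one(path):
    # Parse a single CSV file into a list of row dicts.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def read_all(paths, workers=64):
    # Fan the files out across a pool of worker processes
    # (64 workers to match the 64-CPU instance).
    with mp.Pool(workers) as pool:
        parts = pool.map(read_one, paths)
    # Flatten the per-file results into one list of rows.
    return [row for part in parts for row in part]
```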


This is how `top` looks.


And sometimes, it is like this.

Hi @Dhruva_Kartik, what is the intended use case for the dataset above (e.g. ingested by a Trainer)? Materializing the dataset reads the entire dataset into memory, which is often resource intensive and can cause failures (e.g. when materializing a dataset significantly larger than the available memory).

Since Ray Data provides streaming execution, it's often unnecessary to materialize the dataset. For example, if you want to iterate over the dataset, you can use iter_batches() without materializing it.