ray.data.read_csv keeps pausing

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

This is the read I am running:

    dataset = ray.data.read_csv(
        s3_bucket_name+'/'+s3_folder_path_prefix+'/',
        partition_filter=partition_filter,
        filesystem=s3,
        convert_options=convert_options
    )

When I materialize this, it works, but the read keeps pausing or proceeds very slowly, with occasional bursts of progress. Monitoring the process with `top`, I see that most of the worker processes are sleeping and only a few are running. Occasionally many of them start running at once, and that is when the read progress bar moves forward.

I am using a 64-CPU instance and all of its cores for the read, yet my data read is very slow because of this. Is this a known issue, or is there a parameter that should be changed? A simple multiprocessing read function seems to be much faster.
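For reference, the kind of multiprocessing baseline I am comparing against looks roughly like this (the file list and per-file parsing are simplified for illustration; `read_one` and `read_all` are not the actual function names):

```python
import csv
import multiprocessing as mp

def read_one(path):
    # Parse a single CSV file into a list of row dicts.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def read_all(paths, workers=64):
    # Fan the files out across a pool of worker processes
    # (64 workers to match the 64-CPU instance).
    with mp.Pool(workers) as pool:
        parts = pool.map(read_one, paths)
    # Flatten the per-file results into one list of rows.
    return [row for part in parts for row in part]
```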


This is how `top` looks.


And sometimes, it is like this.

Hi @Dhruva_Kartik, what is the intended use case for the dataset above (e.g. ingested by a Trainer)? Materializing the dataset reads the entire dataset into memory, which is often resource intensive and can cause failures (e.g. when materializing a dataset significantly larger than the available memory).

Since Ray Data provides streaming execution, it's often unnecessary to materialize the dataset. For example, if you want to iterate over the dataset, you can use iter_batches() without materializing it.