I’m streaming a dataset with read_parquet and have enabled the rich progress bar via DatasetContext.get_current().enable_rich_progress_bars = True.
I’m seeing logs that look like this:
The “85.2GB/9.3GB object store” in the Ray rich progress bar means that your dataset operators are currently using 85.2GB of object store memory, while only 9.3GB has been reserved for Ray Data’s use. This can happen when the cluster’s total object store is much larger but Ray Data is configured, by default, to use only a fraction of it (typically 50%), and your workload exceeds that reservation — possibly because other Ray components are also consuming object store memory, or because of misconfiguration. The log is a warning that you may be over-consuming object store memory relative to what Ray Data expects, which can lead to performance degradation or object spilling. For more, see the explanation in this GitHub issue.
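The relationship between the two numbers can be sketched with a small illustrative calculation. This is not a Ray API — the function name, the 50% default, and the 18.6GB total are all assumptions chosen so the result matches the 9.3GB reservation in the log above:

```python
def ray_data_reservation_gb(total_object_store_gb: float, fraction: float = 0.5) -> float:
    """Illustrative only: Ray Data's reserved share of the cluster object store,
    assuming the default memory-limit fraction of 0.5 mentioned above."""
    return total_object_store_gb * fraction

# A hypothetical cluster with an 18.6GB object store at the default 50% fraction
# would give Ray Data the 9.3GB reservation seen in the log:
print(ray_data_reservation_gb(18.6))  # 9.3
```

The 85.2GB usage figure, by contrast, is actual consumption, which can exceed the reservation (triggering the warning) without exceeding the physical object store if other components' memory or spilling is involved.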
If you want Ray Data to use more of the available object store memory, you can set DataContext.get_current().override_object_store_memory_limit_fraction = 1.0 or set the environment variable RAY_DATA_OBJECT_STORE_MEMORY_LIMIT_FRACTION=1.0 before starting Ray. Would you like a step-by-step breakdown of how these numbers are calculated or how to adjust the configuration?
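The two approaches described above can be sketched as follows. The attribute and environment-variable names are taken verbatim from the answer; verify them against the Ray version you are running before relying on them:

```python
import os

# Option 1: set the environment variable before Ray starts, so the new
# fraction is picked up at initialization. (Name as given in the answer above.)
os.environ["RAY_DATA_OBJECT_STORE_MEMORY_LIMIT_FRACTION"] = "1.0"

# Option 2: override the fraction programmatically on the current DataContext
# (sketch only — requires a running Ray installation, so left commented here):
# import ray
# from ray.data import DataContext
# DataContext.get_current().override_object_store_memory_limit_fraction = 1.0
```

Whichever route you choose, set it before any Ray Data execution begins, since the reservation is computed when the streaming executor starts.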