How to interpret the Ray Data progress bar

I’m streaming a dataset with read_parquet and enabled the rich progress bar with DatasetContext.get_current().enable_rich_progress_bars = True.
My logs look like this:

(pid=94348) Running Dataset: train_30_1. Active & requested resources: 24/90 CPU, 85.2GB/9.3GB object store: 6%|▌ | 132M/2.11G [02:00<44:56, 734k row/s] 

I’m not sure how to interpret the “85.2GB/9.3GB object store” part.

I can’t map the 9.3 GB to anything in my cluster’s resource configuration (I have a total of 167.81 GB allocated for object store memory).

1. Severity of the issue: (select one)

Low: Annoying but doesn’t hinder my work.

2. Environment:

  • Ray version: ray==2.51.1
  • Python version: 3.11
  • OS: Linux
  • Cloud/Infrastructure:
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected:
  • Actual:

The “85.2GB/9.3GB object store” in the Ray rich progress bar means that your dataset operators are currently using 85.2 GB of object store memory, while only 9.3 GB was reserved for Ray Data’s use. This can happen when the cluster’s total object store is much larger but Ray Data is configured (by default) to use only a fraction of it, typically 50%, and your workload exceeds that reservation, possibly due to other Ray components or misconfiguration. In other words, it is a warning that you are over-consuming object store memory relative to what Ray Data expects, which can lead to performance degradation or object spilling. For more, see the explanation in this GitHub issue.
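The over-budget reading can be checked directly from the numbers in the log line itself. A minimal sketch, assuming the fragment shown above (the regex is illustrative; the exact log format may differ between Ray versions):

```python
import re

# Example progress-bar fragment from the log above.
line = "Active & requested resources: 24/90 CPU, 85.2GB/9.3GB object store"

# Pull out "used/budget" for the object store.
m = re.search(r"([\d.]+)GB/([\d.]+)GB object store", line)
used_gb, budget_gb = float(m.group(1)), float(m.group(2))

print(f"object store usage: {used_gb} GB of a {budget_gb} GB budget "
      f"({used_gb / budget_gb:.1f}x over)")
```

With these numbers the operators are using roughly nine times the 9.3 GB budget, which is consistent with the reading above: the bar shows Ray Data’s reserved budget, not the cluster’s 167.81 GB total.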

If you want Ray Data to use more of the available object store memory, you can set DataContext.get_current().override_object_store_memory_limit_fraction = 1.0, or set the environment variable RAY_DATA_OBJECT_STORE_MEMORY_LIMIT_FRACTION=1.0 before starting Ray.
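A minimal sketch of the two knobs mentioned above; both names are taken from the answer itself and should be checked against the docs for your Ray version:

```python
import os

# Option 1: environment variable, set before Ray starts (i.e. before ray.init()).
os.environ["RAY_DATA_OBJECT_STORE_MEMORY_LIMIT_FRACTION"] = "1.0"

# Option 2: the DataContext attribute, set inside the driver process.
# Requires Ray, so it is only sketched here:
# from ray.data import DataContext
# DataContext.get_current().override_object_store_memory_limit_fraction = 1.0
```

Raising the fraction to 1.0 lets Ray Data budget the full object store; if other Ray workloads share the cluster, a smaller value may be safer.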
