How severe does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hey Team,
Would be great to get some feedback on our use case!
We are currently trying to use Ray Data streaming while training our ML models. We can definitely see the streaming behavior; however, it is still very slow and does not achieve high CPU or GPU utilization.
We are currently using Ray 2.4 and training with TensorFlow 2.7, converting the dataset for training with iter_tf_batches.
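Roughly, the consumption side looks like the sketch below. This is a toy stand-in dataset with placeholder column names and batch size, just to show the loop shape; the real features are the preprocessed images from our Parquet files.

```python
import numpy as np
import ray

# Toy stand-in dataset just to illustrate the loop shape; in the real job the
# dataset comes from read_parquet + map_batches (see further down).
train_ds = ray.data.from_items(
    [{"image": np.zeros((32, 32, 3), dtype=np.float32), "label": i % 2}
     for i in range(1024)]
)

for epoch in range(2):
    # iter_tf_batches yields one dict of column name -> tf.Tensor per batch
    for batch in train_ds.iter_tf_batches(batch_size=256):
        images, labels = batch["image"], batch["label"]
        # model.train_on_batch(images, labels) happens here in the real code
```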
We have experimented with setting DatasetContext.get_current().execution_options.resource_limits.object_store_memory to a range of values (1MB, 2GB, 4GB, 10GB, and 20GB), but no matter what, we are not getting the expected “out of the box” streaming behavior.
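For reference, this is how we set the limit (the 2GB value below is just one point from the sweep):

```python
from ray.data.context import DatasetContext

ctx = DatasetContext.get_current()
# One of the values we swept (1MB up to 20GB); here a 2GB cap on the
# streaming executor's object store usage.
ctx.execution_options.resource_limits.object_store_memory = 2 * 1024 ** 3
```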
Some things to note about our dataset: it is made up of Parquet files with JPEG images inside them. There are about 200,000 files with an average size of 60MB, each containing between 100 and 400 samples. The preprocessing is computationally heavy and significantly increases each sample's size.
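The pipeline itself is essentially the following sketch (the path, column name, and preprocessor body are placeholders; these correspond to the ReadParquet and MapBatches(preprocessor) stages in the stats below):

```python
import numpy as np
import ray

def preprocessor(batch):
    # Placeholder for the real preprocessing, which decodes the JPEG bytes,
    # augments them, and substantially inflates each sample's size.
    batch["image"] = np.stack(
        [np.zeros((512, 512, 3), dtype=np.float32) for _ in batch["image"]]
    )
    return batch

# Placeholder path and column name; the real dataset is ~200k Parquet files of
# ~60MB each, with 100-400 JPEG samples per file.
ds = ray.data.read_parquet("s3://our-bucket/images/")
ds = ds.map_batches(preprocessor, batch_format="numpy")
```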
The first batch of data that is fed into training has the following stats:
Stage 1 ReadParquet: 66/66 blocks executed in 534.99s
* Remote wall time: 1.62s min, 5.27s max, 2.95s mean, 195.01s total
* Remote cpu time: 1.02s min, 2.03s max, 1.41s mean, 93.19s total
* Peak heap memory usage (MiB): 5460.5 min, 39400.53 max, 22678 mean
* Output num rows: 2424 min, 4325 max, 3750 mean, 247521 total
* Output size bytes: 350239343 min, 607671869 max, 551557776 mean, 36402813262 total
* Tasks per node: 66 min, 66 max, 66 mean; 1 nodes used
* Extra metrics: {'obj_store_mem_alloc': 36402813262, 'obj_store_mem_freed': 245057435863, 'obj_store_mem_peak': 245057435863}
Stage 2 MapBatches(preprocessor): 14/14 blocks executed in 534.99s
* Remote wall time: 13.22s min, 26.6s max, 22.72s mean, 318.08s total
* Remote cpu time: 17.13s min, 40.11s max, 30.9s mean, 432.67s total
* Peak heap memory usage (MiB): 1221.54 min, 8867.44 max, 5274 mean
* Output num rows: 149 min, 290 max, 279 mean, 3919 total
* Output size bytes: 280362572 min, 545672120 max, 526721438 mean, 7374100132 total
* Tasks per node: 14 min, 14 max, 14 mean; 1 nodes used
* Extra metrics: {'obj_store_mem_alloc': 7374100132, 'obj_store_mem_freed': 577999233, 'obj_store_mem_peak': 7930077107}
Dataset iterator time breakdown:
* Total time user code is blocked: 939.52ms
* Num blocks local: 0
* Num blocks remote: 0
* Num blocks unknown location: 0
* Batch iteration time breakdown (summed across prefetch threads):
    * In ray.get(): 10.47ms min, 12.82ms max, 11.67ms avg, 46.66ms total
    * In batch creation: 870.04ms min, 870.04ms max, 870.04ms avg, 870.04ms total
    * In batch formatting: 15.55ms min, 15.55ms max, 15.55ms avg, 15.55ms total
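(The breakdown above is roughly what we get by printing the dataset's stats after the first batches have been consumed:)

```python
# Printed after iterating over the first batches of the epoch.
print(ds.stats())
```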
The CPU utilization has been very low:
Note that the green line is the GPU worker, which only reaches a maximum of 4% utilization.
As for the GPU utilization, it looks as follows:
Thanks!