How can I force map_batches to work harder?

I have a function that I want to apply with map_batches to create a weight vector:

import numpy as np

def class_weight(batch):
    # per-class weights: index 0 -> class 0, index 1 -> class 1
    classes = np.array([0.06159456, 4.55880019])
    labels = batch['label'].values.astype(np.int8)

    # fancy-index into the weight array to build a per-sample weight vector
    samples_weight = classes[labels]
    return {"weight": samples_weight}
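For context, the function itself is just a NumPy fancy-indexing lookup. A minimal standalone sketch of what it does to one batch (the toy DataFrame here is a hypothetical stand-in for a pandas-format batch that Ray Data would pass in):

```python
import numpy as np
import pandas as pd

def class_weight(batch):
    classes = np.array([0.06159456, 4.55880019])
    labels = batch['label'].values.astype(np.int8)
    # each label selects its class weight, yielding a per-sample weight vector
    return {"weight": classes[labels]}

# toy batch standing in for one batch from the dataset
batch = pd.DataFrame({"label": [0, 1, 1, 0]})
out = class_weight(batch)
print(out["weight"])  # [0.06159456 4.55880019 4.55880019 0.06159456]
```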

Initially, map_batches uses a fair share of the CPUs, but eventually CPU utilization drops to zero, with the progress bar reporting statistics like:

Running: 0.0/160.0 CPU, 0.0/32.0 GPU, 685.19 MiB/279.4 GiB object_store_memory:  30%|███       | 192/640 [12:24<13:58,  1.87s/it]
Running: 0.0/160.0 CPU, 0.0/32.0 GPU, 685.19 MiB/279.4 GiB object_store_memory:  30%|███       | 193/640 [12:26<13:55,  1.87s/it]
Running: 0.0/160.0 CPU, 0.0/32.0 GPU, 685.19 MiB/279.4 GiB object_store_memory:  30%|███       | 194/640 [12:28<13:47,  1.86s/it]
Running: 0.0/160.0 CPU, 0.0/32.0 GPU, 685.19 MiB/279.4 GiB object_store_memory:  30%|███       | 195/640 [12:33<19:39,  2.65s/it]
Running: 0.0/160.0 CPU, 0.0/32.0 GPU, 685.19 MiB/279.4 GiB object_store_memory:  31%|███       | 196/640 [12:37<23:55,  3.23s/it]
Running: 0.0/160.0 CPU, 0.0/32.0 GPU, 685.19 MiB/279.4 GiB object_store_memory:  31%|███       | 197/640 [12:42<26:41,  3.61s/it]
Running: 0.0/160.0 CPU, 0.0/32.0 GPU, 685.19 MiB/279.4 GiB object_store_memory:  31%|███       | 198/640 [12:46<28:32,  3.88s/it]
Running: 0.0/160.0 CPU, 0.0/32.0 GPU, 685.19 MiB/279.4 GiB object_store_memory: 100%|██████████| 198/198 [12:51<00:00,  3.88s/it]

My interpretation of the logs is that Ray considers the operation trivial, using only 685 MiB of the available 279 GiB of object-store memory, and therefore decides it doesn't need many CPUs. How can I convince Ray otherwise?

cc: @chengsu Any hints on how to get map_batches to use all the available resources (both CPU and memory)?