Unable to Index Batch Inference

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am using the PaddleOCR library to do text recognition on long videos, and I am trying to use Ray to speed up the processing.

ray_images = ray.data.from_items(rows)

Column     Type
------     ----
index      int64
timestamp  double
image      numpy.ndarray(shape=(720, 1280, 3), dtype=uint8)
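For reference, the rows passed to from_items look roughly like this (a minimal sketch; the actual video-decoding step that produces the frames is assumed and not shown):

```python
import numpy as np

# Hypothetical per-frame rows; in the real pipeline the images come from
# decoding the video (decoder not shown), here they are blank placeholders.
rows = [
    {"index": i,
     "timestamp": i / 30.0,  # assumed 30 fps, purely illustrative
     "image": np.zeros((720, 1280, 3), dtype=np.uint8)}
    for i in range(3)
]

print(rows[0]["image"].dtype, rows[0]["image"].shape)
# -> uint8 (720, 1280, 3)
```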


class TextDetector:
    def __init__(self):
        with tf.device("GPU:0"):
            self.ocr = PaddleOCR(use_angle_cls=False, lang="en", show_log=False)

    def __call__(self, row):
        with tf.device("GPU:0"):
            return {'prediction': self.ocr.ocr(row['image'][0], cls=False)}

outputs = ray_images.map_batches(TextDetector,
                                 compute=ray.data.ActorPoolStrategy(size=2),
                                 num_gpus=1,
                                 batch_size=1,
                                 batch_format='default',
                                 zero_copy_batch=True)

predictions = outputs.take_all()

predictions

[{'prediction': [[[229.0, 53.0], [724.0, 53.0], [724.0, 75.0], [229.0, 75.0]],
   ('UNITEDSTATESSENATE', 0.9988573789596558)]},
 {'prediction': [[[226.0, 87.0],
    [576.0, 91.0],
    [576.0, 136.0],
    [225.0, 132.0]],
   ('COMMITTEE', 0.996300995349884)]},
 {'prediction': [[[576.0, 93.0],
    [856.0, 93.0],
    [856.0, 130.0],
    [576.0, 130.0]],
   ('HEARING', 0.9904467463493347)]},
 {'prediction': [[[862.0, 90.0],
    [1184.0, 97.0],
    [1183.0, 138.0],
    [861.0, 131.0]],
   ('CHANNELS', 0.9988545179367065)]},
...

This results in a modest speedup of around 25%, but it seems I have to use batch_size=1 and row['image'][0], since PaddleOCR.ocr does not appear to accept batches of images.
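For context on the [0] indexing: with batch_size=1 each batch's image column still carries a batch dimension, so a single frame has to be pulled out before it is handed to the OCR call. A minimal numpy illustration (depending on the batch format, the column may instead be a pandas Series of arrays, but the indexing is the same):

```python
import numpy as np

# A batch of one frame, roughly as Ray Data would hand it to __call__:
batch = {"image": np.zeros((1, 720, 1280, 3), dtype=np.uint8)}

# PaddleOCR expects a single HxWxC image, so the batch dimension
# must be indexed away:
frame = batch["image"][0]
print(frame.shape)  # -> (720, 1280, 3)
```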

This is a problem I would like to resolve, but it is not the main issue.

The main issue is that frames with multiple text boxes are returned as separate dictionaries that are not associated with any frame. I would like the prediction results to be indexed by frame so I know which image each result came from.
In this test, I have 7190 rows (num_rows) and 25810 predictions, so on average over 3 text boxes per image.

At a bare minimum, I would like to pass the frame index through to the output, so I tried the following:

class TextDetector:
    def __init__(self):
        with tf.device("GPU:0"):
            self.ocr = PaddleOCR(use_angle_cls=False, lang="en", show_log=False)

    def __call__(self, row):
        with tf.device("GPU:0"):
            return {'prediction': self.ocr.ocr(row['image'][0], cls=False), 'index': row['index']}

and:

class TextDetector:
    def __init__(self):
        with tf.device("GPU:0"):
            self.ocr = PaddleOCR(use_angle_cls=False, lang="en", show_log=False)

    def __call__(self, row):
        with tf.device("GPU:0"):
            result = {'prediction': self.ocr.ocr(row['image'][0], cls=False)}
            
        result['index'] = row['index']
        return result

Both of these returned the following error, which I assume has to do with passing data to the GPU:
ValueError: All arrays must be of the same length

Unfortunately, I am not that familiar with Ray and do not know how to resolve this issue.

Most of all, I would appreciate a way to return the index in the result for each image.
It would also be helpful if there were a way to increase batch_size to speed up processing further.

For this line, I think you need to wrap the value in a list, so that every column in the returned dict has the same length:

result = {'prediction': [self.ocr.ocr(row['image'][0], cls=False)]}
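To see why the original version fails and why wrapping the value in a list fixes it, here is a small pandas sketch (the box contents are placeholders; the ValueError is the same one pandas raises when Ray Data assembles the output batch from unequal-length columns):

```python
import pandas as pd

# Stand-in for PaddleOCR output on one frame: one entry per detected
# text box, so its length varies from image to image.
prediction = [["box_0"], ["box_1"], ["box_2"]]

# Building a table requires every column to have the same length.
# Three boxes vs. one index value fails:
try:
    pd.DataFrame({"prediction": prediction, "index": [42]})
except ValueError as err:
    message = str(err)
print(message)  # -> All arrays must be of the same length

# Wrapping the whole per-image result in a list makes both columns
# length 1, so each output row maps back to exactly one frame:
df = pd.DataFrame({"prediction": [prediction], "index": [42]})
print(df.shape)  # -> (1, 2)
```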