Unable to Index Batch Inference

How severe does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

I am using the PaddleOCR library to do text recognition on long videos, and I am trying to use Ray to speed up the processing.

ray_images = ray.data.from_items(rows)

Column     Type
------     ----
index      int64
timestamp  double
image      numpy.ndarray(shape=(720, 1280, 3), dtype=uint8)
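For reference, the rows passed to from_items look roughly like this (a minimal sketch; the actual video-decoding step that produces the frames is assumed and not shown):

```python
import numpy as np

# Hypothetical per-frame rows; in the real pipeline the images come from
# decoding the video (decoder not shown), here they are blank placeholders.
rows = [
    {"index": i,
     "timestamp": i / 30.0,  # assumed 30 fps, purely illustrative
     "image": np.zeros((720, 1280, 3), dtype=np.uint8)}
    for i in range(3)
]

print(rows[0]["image"].dtype, rows[0]["image"].shape)
# -> uint8 (720, 1280, 3)
```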


class TextDetector:
    def __init__(self):
        with tf.device("GPU:0"):
            self.ocr = PaddleOCR(use_angle_cls=False, lang="en", show_log=False)

    def __call__(self, row):
        with tf.device("GPU:0"):
            return {'prediction': self.ocr.ocr(row['image'][0], cls=False)}

outputs = ray_images.map_batches(TextDetector,
                                 compute=ray.data.ActorPoolStrategy(size=2),
                                 num_gpus=1,
                                 batch_size=1,
                                 batch_format='default',
                                 zero_copy_batch=True)

predictions = outputs.take_all()

predictions

[{'prediction': [[[229.0, 53.0], [724.0, 53.0], [724.0, 75.0], [229.0, 75.0]],
   ('UNITEDSTATESSENATE', 0.9988573789596558)]},
 {'prediction': [[[226.0, 87.0],
    [576.0, 91.0],
    [576.0, 136.0],
    [225.0, 132.0]],
   ('COMMITTEE', 0.996300995349884)]},
 {'prediction': [[[576.0, 93.0],
    [856.0, 93.0],
    [856.0, 130.0],
    [576.0, 130.0]],
   ('HEARING', 0.9904467463493347)]},
 {'prediction': [[[862.0, 90.0],
    [1184.0, 97.0],
    [1183.0, 138.0],
    [861.0, 131.0]],
   ('CHANNELS', 0.9988545179367065)]},
...

This results in a modest speedup of around 25%, but it seems I have to use batch_size=1 and row['image'][0], since PaddleOCR.ocr does not appear to accept batches of images.
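For context on the [0] indexing: with batch_size=1 each batch's image column still carries a batch dimension, so a single frame has to be pulled out before it is handed to the OCR call. A minimal numpy illustration (depending on the batch format, the column may instead be a pandas Series of arrays, but the indexing is the same):

```python
import numpy as np

# A batch of one frame, roughly as Ray Data would hand it to __call__:
batch = {"image": np.zeros((1, 720, 1280, 3), dtype=np.uint8)}

# PaddleOCR expects a single HxWxC image, so the batch dimension
# must be indexed away:
frame = batch["image"][0]
print(frame.shape)  # -> (720, 1280, 3)
```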

This is a problem I would like to resolve, but it is not the main issue.

The main issue is that frames with multiple text boxes are returned as separate dictionaries that are not associated with any frame. I would like the prediction results to be indexed by frame so I know which image each result came from.
In this test, I have 7190 rows (num_rows) and 25810 predictions, so on average over 3 text boxes per image.

At a bare minimum, I would like to pass the frame index through to the output, so I tried the following:

class TextDetector:
    def __init__(self):
        with tf.device("GPU:0"):
            self.ocr = PaddleOCR(use_angle_cls=False, lang="en", show_log=False)

    def __call__(self, row):
        with tf.device("GPU:0"):
            return {'prediction': self.ocr.ocr(row['image'][0], cls=False), 'index': row['index']}

and:

class TextDetector:
    def __init__(self):
        with tf.device("GPU:0"):
            self.ocr = PaddleOCR(use_angle_cls=False, lang="en", show_log=False)

    def __call__(self, row):
        with tf.device("GPU:0"):
            result = {'prediction': self.ocr.ocr(row['image'][0], cls=False)}
            
        result['index'] = row['index']
        return result

Both of these returned the following error, which I assume has to do with passing data to the GPU:
ValueError: All arrays must be of the same length

Unfortunately, I am not that familiar with Ray and do not know how to resolve this issue.

Most of all, I would appreciate a way to return the index in the result for each image.
It would also be helpful if there were a way to increase batch_size to speed up processing further.

For this line, I think you need to wrap the value in a list, so that every column in the returned dict has the same length:

result = {'prediction': [self.ocr.ocr(row['image'][0], cls=False)]}
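To see why the original version fails and why wrapping the value in a list fixes it, here is a small pandas sketch (the box contents are placeholders; the ValueError is the same one pandas raises when Ray Data assembles the output batch from unequal-length columns):

```python
import pandas as pd

# Stand-in for PaddleOCR output on one frame: one entry per detected
# text box, so its length varies from image to image.
prediction = [["box_0"], ["box_1"], ["box_2"]]

# Building a table requires every column to have the same length.
# Three boxes vs. one index value fails:
try:
    pd.DataFrame({"prediction": prediction, "index": [42]})
except ValueError as err:
    message = str(err)
print(message)  # -> All arrays must be of the same length

# Wrapping the whole per-image result in a list makes both columns
# length 1, so each output row maps back to exactly one frame:
df = pd.DataFrame({"prediction": [prediction], "index": [42]})
print(df.shape)  # -> (1, 2)
```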