Is there a way to return a placeholder for unreadable data in ray.data?

Hello.

I am using ray.data to run batch inference on an image dataset.
Some of the images in the dataset cannot be read by PIL.
So I wrote a class that reads each image and preprocesses it:

class ImageReader:
    def __init__(self) -> None:
        self.logger = Logger()
        self.bli2_processor = Blip2ProcessorWrapper() 

    def __call__(self, datum):
        image_path = datum['image_path']
        try:
            img = Image.open(image_path)
            if img.mode != "RGB":
                img = img.convert("RGB")
            prepro = self.bli2_processor.processor(img, return_tensors="pt")['pixel_values']
        except Exception as e:
            self.logger.exception(f"read error: {str(e)}, origin_url: {image_path}")
            return {"batch_images": None, **datum}
        return {"batch_images": prepro, **datum}

In this class, when an image can be read, I return the output of the BLIP-2 processor. When an image cannot be read, I return None for batch_images. I then apply a filter to drop the rows where batch_images is None.

However, when I do this, I get a pyarrow exception during data serialization.

So, is there a way to return a placeholder for bad data in ray.data that I can then filter out in a follow-up filter operator?

@kyoka_gong Instead of returning None, can you create a fake image as a tensor, which you can then filter out? That way serialization won’t fail on None.

cc: @ericl @chengsu Any ideas here?
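
A minimal sketch of the fake-image idea, not a tested Ray pipeline. It assumes the preprocessor output can be converted to a NumPy array of a fixed shape; `PLACEHOLDER_SHAPE`, `make_placeholder`, `is_placeholder`, `read_or_placeholder` and `load_fn` are hypothetical names introduced for illustration, and the shape `(1, 3, 224, 224)` is an assumption to adjust for your model. NaN is used as the fill value because real preprocessed pixels should never be all NaN, so the placeholder is easy to detect.

```python
import numpy as np

# Assumed output shape of the preprocessor for one image; adjust to your model.
PLACEHOLDER_SHAPE = (1, 3, 224, 224)


def make_placeholder(shape=PLACEHOLDER_SHAPE):
    """Return a NaN-filled float32 array standing in for an unreadable image."""
    return np.full(shape, np.nan, dtype=np.float32)


def is_placeholder(pixels):
    """True when the array is the NaN sentinel rather than real pixel values."""
    return bool(np.isnan(pixels).all())


def read_or_placeholder(row, load_fn):
    """Apply load_fn to row['image_path']; fall back to the placeholder on error.

    load_fn is a hypothetical callable (path -> array-like) that wraps the
    PIL open / convert / processor steps from the question.
    """
    try:
        pixels = np.asarray(load_fn(row["image_path"]), dtype=np.float32)
    except Exception:
        # Every row keeps the same dtype and shape, so no None is serialized.
        pixels = make_placeholder()
    return {**row, "batch_images": pixels}
```

With something like this in the `__call__` of the reader class, the follow-up step could be `ds.filter(lambda row: not is_placeholder(row["batch_images"]))`. If the processor returns torch tensors, converting them with `.numpy()` or requesting `return_tensors="np"` should give an array that fits this sketch.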

Yup, I think that would be the way to go. You can also mark a record as errored with a separate “valid” or “error” column.
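
A rough sketch of the separate “valid” / “error” column variant, under the same caveats. `tag_row`, `is_valid`, `load_fn` and `placeholder` are hypothetical names; the error column uses an empty string instead of None on success so that every row has the same column types.

```python
def tag_row(row, load_fn, placeholder):
    """Return the row with batch_images plus 'valid' and 'error' columns.

    load_fn is a hypothetical callable (path -> preprocessed pixels);
    placeholder is a stand-in value with the same type and shape as a
    real result, so no column ever contains None.
    """
    try:
        pixels = load_fn(row["image_path"])
    except Exception as exc:
        # Keep the failure reason next to the record instead of only logging it.
        return {**row, "batch_images": placeholder, "valid": False, "error": str(exc)}
    return {**row, "batch_images": pixels, "valid": True, "error": ""}


def is_valid(row):
    """Predicate for the follow-up filter step."""
    return bool(row["valid"])
```

The follow-up step then becomes `ds.filter(is_valid)`, and the rows with `valid == False` can be inspected or written out separately to see which paths failed and why.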

Thanks for the suggestions, ericl and jules. :handshake: