Hello.
I am using ray.data to run batch inference on my image dataset.
In my case, some of the images in the dataset cannot be read by PIL.
So I created a class that reads each image and preprocesses it:
from PIL import Image

class ImageReader:
    def __init__(self) -> None:
        self.logger = Logger()
        self.bli2_processor = Blip2ProcessorWrapper()

    def __call__(self, datum):
        image_path = datum['image_path']
        try:
            img = Image.open(image_path)
            # Convert non-RGB images (e.g. grayscale, RGBA) to RGB.
            if img.mode != "RGB":
                img = img.convert("RGB")
            prepro = self.bli2_processor.processor(img, return_tensors="pt")['pixel_values']
        except Exception as e:
            self.logger.exception(f"read error: {str(e)}, origin_url: {image_path}")
            # Unreadable image: mark the row with None so it can be filtered later.
            return {"batch_images": None, **datum}
        return {"batch_images": prepro, **datum}
In this class, when an image can be read, I return the output of the BLIP-2 processor. When an image cannot be read, I return None for batch_images. I then use a filter to drop the rows where batch_images is None.
However, when I do this, I get a pyarrow exception during data serialization.
So, is there a way to return an "invalid" record in ray.data so that I can filter it out in a follow-up filter operator?
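
One workaround I am considering, sketched below in plain Python so it runs without Ray: keep the batch_images column the same type in every row and carry a separate boolean flag to filter on. This assumes (unconfirmed) that the pyarrow error comes from mixing None with tensor values in one column. The names safe_preprocess, is_valid, PLACEHOLDER_SHAPE and fake_preprocess are hypothetical, and the placeholder shape is made up for illustration, not the real BLIP-2 output shape.

```python
from typing import Any, Callable, Dict, List

# Hypothetical placeholder shape, chosen only for this illustration.
PLACEHOLDER_SHAPE = (3, 2, 2)


def make_placeholder(shape) -> List[Any]:
    """Build a zero-filled nested list with the given shape."""
    if len(shape) == 1:
        return [0.0] * shape[0]
    return [make_placeholder(shape[1:]) for _ in range(shape[0])]


def safe_preprocess(datum: Dict[str, Any],
                    preprocess: Callable[[str], Any]) -> Dict[str, Any]:
    """Always return the same columns; flag failures instead of returning None."""
    try:
        pixels = preprocess(datum["image_path"])
        ok = True
    except Exception:
        # Same structure as a real result, so every row has one schema.
        pixels = make_placeholder(PLACEHOLDER_SHAPE)
        ok = False
    return {**datum, "batch_images": pixels, "is_valid": ok}


def is_valid(row: Dict[str, Any]) -> bool:
    """Predicate for the follow-up filter step."""
    return row["is_valid"]


def fake_preprocess(path: str) -> List[Any]:
    """Stand-in for the real image reader; fails on paths ending in '.bad'."""
    if path.endswith(".bad"):
        raise OSError(f"cannot read {path}")
    return [[[1.0, 1.0], [1.0, 1.0]] for _ in range(3)]


data = [{"image_path": "a.jpg"}, {"image_path": "b.bad"}]
rows = [safe_preprocess(d, fake_preprocess) for d in data]
kept = [r for r in rows if is_valid(r)]
```

With ray.data, the list comprehensions would presumably become a map step followed by ds.filter(lambda row: row["is_valid"]), with the placeholder built as an array of the same shape and dtype as the real pixel_values. I have not verified that this avoids the exception.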