Can't pickle pyarrow.dataset.Expression

We are facing an issue with the latest release where cloudpickle_fast.CloudPickler.dump fails with AttributeError("module 'pickle' has no attribute 'PickleBuffer'") whenever the method being pickled takes a pyarrow.dataset.Expression as an arg/kwarg.

Has anyone else experienced this issue, and is there anything I can try to remedy it?

cc @suquark Can you address his question?

Hi @Baywatch, could you provide a reproducible example so I can investigate it? Thanks!

It turns out that pyarrow._dataset.Expression (usually used in pyarrow filters) in pyarrow==3.0.0 is not compatible with the cloudpickle serializer used in Ray, while pyarrow==0.17.1 works fine. One workaround is to wrap the filters with the standard pickle.dumps/pickle.loads functions. Another workaround, found by @Baywatch (thanks!), is to use the Arrow triple-style filters.
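For anyone who wants to try the first workaround above, here is a minimal sketch of wrapping a filter with pickle.dumps/pickle.loads. The helper names (pack_filter, unpack_filter, read_with_filter) are hypothetical, and a plain dict stands in for the pyarrow expression so the snippet runs without pyarrow or Ray installed. In real use the packed object would be the ds.field(...) expression, and the receiving function would be the one you submit through Ray.

```python
import pickle


def pack_filter(filter_obj):
    """Serialize the filter with the standard-library pickler.

    The result is plain bytes, so Ray's cloudpickle only has to
    ship a bytes object instead of the expression itself.
    """
    return pickle.dumps(filter_obj)


def unpack_filter(payload):
    """Rebuild the original filter object from the pickled bytes."""
    return pickle.loads(payload)


def read_with_filter(packed_filter):
    # Hypothetical worker-side function. In practice this would be the
    # function you run through Ray, and it would pass the unpacked
    # filter on to the pyarrow read call.
    return unpack_filter(packed_filter)


# Stand-in for a pyarrow.dataset.Expression (assumption: any object that
# the standard pickler can handle works the same way).
stand_in = {"field": "FieldName", "op": "isin", "values": [1, 2, 3]}

payload = pack_filter(stand_in)
restored = read_with_filter(payload)
```

The idea is simply that the caller sends bytes and the receiver unpickles them, so the expression never goes through cloudpickle directly.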

@sangcho FYI, I found that the serialization failure only happens with Ray Client, not with Ray remote functions or actors. This might indicate potential issues in Ray Client, and we should write more tests to try to trigger some of them.

cc @Ameer_Haj_Ali ^ Can you make sure to add more tests around the serialization?

Looping in @ijrsvt, who is leading the Ray Client efforts.

Sorry, but could you please describe the solution in more detail? I’m using the example code for fine-tuning RAG with Ray, and I’m experiencing the same problem, which seems to happen while the retrievers are being initialized.

The dataset expression syntax doesn’t seem to be picklable.
So if you are using filters like:
import pyarrow.dataset as ds
some_filter = ds.field("FieldName").isin(lst)

that won’t work with Ray.
Instead, try:
some_filter = [("FieldName", "in", lst)]
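To make the tuple-style suggestion a bit more concrete, here is a small sketch. It assumes the filter ends up in a Parquet read via pyarrow.parquet.read_table(path, filters=...), which may not match your setup. The helpers make_isin_filter and read_filtered are hypothetical names, and pyarrow is imported lazily so the filter-building part runs on its own.

```python
import pickle


def make_isin_filter(field_name, values):
    """Build a tuple-style filter: a list of (column, op, value) triples."""
    return [(field_name, "in", list(values))]


def read_filtered(path, filters):
    # Assumption: the data is a Parquet file or dataset read with
    # pyarrow. Imported here so the rest of the sketch does not need
    # pyarrow to be installed.
    import pyarrow.parquet as pq

    return pq.read_table(path, filters=filters)


some_filter = make_isin_filter("FieldName", ["a", "b"])

# The filter is made of plain lists, tuples and strings, so it pickles
# without involving pyarrow's Expression class at all.
round_tripped = pickle.loads(pickle.dumps(some_filter))
```

Because the filter is only built-in Python types, it should serialize the same way whichever pyarrow version is installed.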

Hi, I’m using the same example code for fine-tuning RAG with Ray. Did you manage to fix it?