Hello Ray Community,
I am evaluating Ray on a POC and I am having some performance issues when trying to work on a dask dataframe using a function with a large closure.
Specifically, I am trying to label encode a field
def encode_id(x: str) -> int:
# This contains a closure of a large object
return lbl_encoder.transform([x])[0]
df["id_enc"] = df["id"].apply(encode_id, meta=("encoded_label", "int"))
I was reading about distributing large objects using scatter, like suggested here Distributing heavy functions via closures · Issue #5503 · dask/dask · GitHub, but since I am using dask on ray, I am not sure how to get a reference to the dask client, since dask on ray docs states that you should not instantiate the dask client directly.
Any suggestion is welcome! Thank in advance.