I am using Ray Data's map_batches() to transform my dataset.
Every value in my dataset is a string, and I need to specify the column types before writing it to a Parquet file.
My problem is that I want to run this line inside the function passed to map_batches(): batch_df[col][batch_df[col] == ''] = None
After that, the dataset's schema becomes a PandasBlockSchema instead of the normal Dataset schema,
so I cannot use dataset.to_arrow_refs() to set the specific field types before writing to the Parquet file.
Please help.
Hi everyone,
I think you should use ray.data.map_batches() to transform your dataset of string values, and you need to replace empty strings with None before writing to a Parquet file. When you run batch_df[col][batch_df[col] == ''] = None, it converts your dataset to a PandasBlockSchema, preventing you from using dataset.to_arrow_refs() to specify field types before saving. Try this; it may be useful.
Thanks
Are you a bot? :)) Your answer does not make sense :))
I think @ussesjenny’s response is on the right track; to put it another way, you basically need to replace the empty strings with None before you write to Parquet, while it’s still a Dataset object.
Your explicit operation on batch_df implicitly converts the blocks to pandas (hence the PandasBlockSchema), which, if I’m understanding correctly, is what you are trying to avoid.