None value in ds

I am using Ray Data's map_batches() to transform my dataset.
All values in my dataset are strings, and I need to specify the column types before writing it to a Parquet file.

But I have a problem. I want to run this line in the function passed to map_batches(): batch_df[col][batch_df[col] == ''] = None

After that, the ds turns into a PandasBlockSchema instead of a normal Dataset,
so I cannot use dataset.to_arrow_refs() to set the specific field types before writing it to a Parquet file.

Please help me.

Hi everyone,

I think you should use ray.data.Dataset.map_batches() to transform your dataset of string values and replace the empty strings with None before writing to a Parquet file. When you do batch_df[col][batch_df[col] == ''] = None, it converts your dataset to a PandasBlockSchema, which prevents you from using dataset.to_arrow_refs() to specify field types before saving. Try this; it may be useful.

Thanks

Are you a bot? :)) Your answer does not make sense :))

I think @ussesjenny's response is on the right track. To put it another way, you basically need to replace the empty strings with None before you write to Parquet, while it’s still a Dataset object.

Your explicit call to batch_df implicitly converts it to PandasBlockSchema, which, if I’m understanding correctly, is what you are trying to avoid.