None value in ds

duong_phuc · August 1, 2024, 2:55am

I am using Ray Data's map_batches() to transform my dataset.
Every value in my dataset is a string, and I need to specify the column types before writing it to a Parquet file.

But I have a problem. I want to run this line in the fn passed to map_batches(): batch_df[col][batch_df[col] == ''] = None

After that, the ds turns into a PandasBlockSchema instead of a normal Dataset,
so I cannot use dataset.to_arrow_refs() to set the specific field types before writing it to a Parquet file.

Please help me.

ussesjenny · August 1, 2024, 4:42am

Hi everyone,

I think you should use Ray Data's map_batches() to transform your dataset of string values, and you need to replace empty strings with None before writing to a Parquet file. When you run batch_df[col][batch_df[col] == ''] = None, it converts your dataset to a PandasBlockSchema, which prevents you from using dataset.to_arrow_refs() to specify field types before saving. Try this; it may be useful.

Thanks

duong_phuc · August 1, 2024, 4:55am

Are you a bot? :)) Your answer does not make sense :))

Sam_Chan · August 6, 2024, 6:54am

I think @ussesjenny's response is on the right track. To put it another way, you basically need to replace the empty strings with None before you write to Parquet, while it is still a Dataset object.

Your explicit assignment on batch_df implicitly converts it to PandasBlockSchema, which, if I'm understanding correctly, is what you are trying to avoid.
