Method chaining on datasets

ckapoor · March 15, 2023, 3:00pm

Hi there, I am having a hard time chaining map / groupby methods on a Ray Dataset and would appreciate a hint on how to move forward.

Scenario: I am reading a parquet file that contains user data grouped by user. I call map_groups(fn_a), which works fine and returns a Dataset whose batch format is a pandas DataFrame. I then group the resulting dataset and call map_groups(fn_b), which throws an exception. I am not sure why.

ds = ck_read_parquet_dir(s3files, columns)
grouped_ds = ds.groupby('id')
result = grouped_ds.map_groups(lambda x: ck1_compute_fs_and_extract(x, True), batch_format="pandas")
print(f"CCAR schema 1- {result.schema()}")
print(f"CCAR result type 1 - {result}")
print(f"CCAR result type 1 - {result.take(3)}")
print(f"CCAR result 1 - {type(result)}, {result.count()}")
print(f"CCAR result default_batch_format - {result.default_batch_format()}")

Output of the print statements above:

MapBatches(group_fn): 100%|██████████| 1/1 [00:00<00:00, 9.58it/s]
CCAR schema 1- PandasBlockSchema(names=['time', 'id', ])
CCAR result type 1 - Dataset(num_blocks=1, num_rows=16412, schema={time: int64, id: int64})
CCAR result 1 - <class 'ray.data.dataset.Dataset'>, 16412
CCAR result default_batch_format - <class 'pandas.core.frame.DataFrame'>

However, when I run the following, I get an error:
result2 = result.groupby('id').map_groups(lambda x: do_nothing(x))
where do_nothing is defined as:

def do_nothing(df: pd.DataFrame) -> pd.DataFrame:
    print("CCAR: Do NOTHING")
    return df

Error message:
(_sample_block pid=916, ip=192.168.88.11) return self._table[[k[0] for k in key]].sample(n_samples, ignore_index=True)

TypeError: sample() got an unexpected keyword argument 'ignore_index'
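
The traceback points at a pandas call rather than at the chaining itself. As far as I know, the ignore_index argument of DataFrame.sample was added in pandas 1.3.0, so one possible cause (an assumption, not a confirmed diagnosis) is that the worker runs an older pandas than the Ray code expects. Below is a minimal sketch for checking this on a node. The helper names sample_supports_ignore_index and sample_keys are hypothetical and are not part of Ray or pandas.

```python
import inspect

import pandas as pd


def sample_supports_ignore_index() -> bool:
    """Return True if the installed pandas accepts DataFrame.sample(ignore_index=...)."""
    return "ignore_index" in inspect.signature(pd.DataFrame.sample).parameters


def sample_keys(df: pd.DataFrame, key: str, n_samples: int) -> pd.DataFrame:
    """Sample n_samples rows of the key column and return them with a fresh 0..n-1 index.

    Hypothetical helper: it imitates the kind of call shown in the traceback,
    with a fallback for pandas builds that lack the ignore_index argument.
    """
    key_column = df[[key]]
    if sample_supports_ignore_index():
        return key_column.sample(n_samples, ignore_index=True)
    # Older pandas: sample first, then rebuild the index by hand.
    return key_column.sample(n_samples).reset_index(drop=True)


if __name__ == "__main__":
    print("pandas version:", pd.__version__)
    print("sample() accepts ignore_index:", sample_supports_ignore_index())
    demo = pd.DataFrame({"time": [10, 20, 30, 40], "id": [1, 1, 2, 2]})
    print(sample_keys(demo, "id", 2))
```

If the check prints False on a worker but True on the driver, the nodes have mismatched pandas versions, and aligning them would be the first thing to try.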

As far as I understand, result is a Dataset, so it should be possible to group it and apply map_groups again. My original code simply chained the calls: map_groups(fn_a).groupby('id').map_groups(fn_b)

Any help is highly appreciated.

Thank you

Jules_Damji · March 15, 2023, 6:23pm

@ckapoor I'm asking the Ray Data team to chime in.
cc: @jianxiao @chengsu
