Ray Data: How to yield entire groups from a batch?

localh · January 24, 2024, 3:26am

I want to use Ray Data for a sequential data problem. When I take one batch of data, I want it to contain all the data associated with a single group. So if I take 32 batches, I should get all the data for 32 different groups.

Here is a rough example:


    import pandas as pd
    import ray

    def to_group(group: pd.DataFrame) -> pd.DataFrame:
        # Identity function: return each group unchanged.
        return group

    ds = (
        ray.data.read_parquet("s3://anonymous@ray-example-data/iris.parquet")
        .groupby("variety")
        .map_groups(to_group)
    )

    x = ds.take(1)  # by taking 1 batch, I want all Virginica; with the next batch, all Setosa, etc.
    x = ds.take_batch(1)

localh · January 24, 2024, 5:28pm

Looking for alternatives, particularly from Torch, I noticed that the documentation here seems to be misleading as it suggests it can support IterableDataset.

https://docs.ray.io/en/latest/data/api/doc/ray.data.from_torch.html

    import torch
    from torch.utils.data import IterableDataset
    import random
    import ray

    class FakeDataIterableDataset(IterableDataset):
        def __init__(self, num_samples):
            self.num_samples = num_samples

        def __iter__(self):
            for _ in range(self.num_samples):
                features = torch.tensor([random.random() for _ in range(3)])
                label = torch.tensor([random.randint(0, 1)])
                yield features, label

    num_samples = 1000
    dataset = FakeDataIterableDataset(num_samples)
    ds = ray.data.from_torch(dataset)
    ds.take(1)

TypeError: object of type 'FakeDataIterableDataset' has no len()
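The error message indicates that something calls len() on the dataset, which an iterable-style dataset does not define. A minimal sketch of that behavior in plain Python follows, with no Torch or Ray involved. `FakeStream` is a hypothetical stand-in for the class above, and materializing it into a list is only an assumption about one possible workaround that fits small datasets.

```python
import random


class FakeStream:
    """Hypothetical stand-in for an IterableDataset: has __iter__ but no __len__."""

    def __init__(self, num_samples):
        self.num_samples = num_samples

    def __iter__(self):
        for _ in range(self.num_samples):
            features = [random.random() for _ in range(3)]
            label = random.randint(0, 1)
            yield {"features": features, "label": label}


stream = FakeStream(1000)

# Calling len() on an object without __len__ raises TypeError,
# the same kind of error shown in the traceback above.
try:
    len(stream)
    has_len = True
except TypeError:
    has_len = False

# Materializing the stream yields a list, which does have a length.
# This loads every sample into memory, so it only suits small data.
items = list(stream)

# Assumption (not run here): a list of dicts like this could then be
# passed to ray.data.from_items(items) instead of ray.data.from_torch.
```

The trade-off is that the streaming property is lost, so this would not help when the iterable is too large to hold in memory.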

Jules_Damji · January 25, 2024, 7:13pm

cc: @bveeramani any insight here?

bveeramani · January 25, 2024, 7:51pm

Hey @localh, I don’t think there’s an easy way to do that right now. With the way map_groups is implemented, the data doesn’t really remain grouped after map_groups runs.
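One idea that is not from this thread, offered only as a sketch: since rows do not stay grouped, the function given to map_groups could collapse each group into a single row whose cells hold lists, so that one row corresponds to one whole group. The pandas part below runs on its own. `collapse_group` and the toy data are hypothetical, and whether Ray Data handles the resulting list-valued columns as desired is an assumption that is not tested here.

```python
import pandas as pd


def collapse_group(group: pd.DataFrame, key: str = "variety") -> pd.DataFrame:
    """Collapse all rows of one group into a single row of list-valued cells."""
    row = {key: [group[key].iloc[0]]}
    for column in group.columns:
        if column != key:
            # Wrap the column's values in a list so they occupy one cell.
            row[column] = [group[column].tolist()]
    return pd.DataFrame(row)


# Toy stand-in for the iris data (made-up values, for illustration only).
toy = pd.DataFrame(
    {
        "variety": ["Setosa", "Setosa", "Virginica"],
        "sepal.length": [5.1, 4.9, 6.3],
    }
)

# Emulate the per-group call locally with plain pandas.
collapsed = pd.concat(
    [collapse_group(g) for _, g in toy.groupby("variety")],
    ignore_index=True,
)

# Assumption (not run here): passing collapse_group to
# ds.groupby("variety").map_groups(...) would produce one row per group,
# so taking one row would return one entire group.
```

With this shape, each row of `collapsed` carries every value of its group, at the cost of variable-length cells that the consumer has to unpack.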

Jules_Damji · January 25, 2024, 7:59pm

thanks @bveeramani for a quick response. I did not know that. Cheers!

Jules_Damji · January 27, 2024, 12:01am

@localh I’ll close this issue, since the current implementation does not provide an easy and intuitive way to do this.

cc: @bveeramani
