Map Ray Dataset by Passing into a Python Class (Ray Actor)?

Hi, I am new to Ray. I want to create a new dataset column called embedding by passing each row to the GetEmbeddings class. Is this possible? I want to run the embedding step in parallel. Currently I get this error:

Standalone Python objects are not allowed in Ray 2.5. To return Python objects from map(), wrap them in a dict, e.g., return "{'item': item}" instead of just "item"

Assume ds is a Ray Dataset

import ray

@ray.remote
class GetEmbeddings:
    def __init__(self):
        # Do something
        pass

    def get_response(self, row):
        # Get some_embedding_list by running a model on row["text"]
        row["embedding"] = some_embedding_list
        return row

process_embedding = GetEmbeddings.remote()
# This is the call that raises the error above
ds = ds.map(process_embedding.get_response.remote)

Thanks!

The way to do this is with stateful transforms, as described in the docs:
https://docs.ray.io/en/latest/data/transforming-data.html#stateful-transforms

Define a class with __init__ and __call__ methods and pass the class itself (not a @ray.remote actor handle) to ds.map().
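
A minimal sketch of what that could look like for the question's GetEmbeddings class. The "model" here is a hypothetical placeholder that just returns two numbers derived from the text, and add_embeddings and num_actors are names made up for this example. It assumes the Ray 2.5-style compute=ActorPoolStrategy(...) argument; newer Ray releases document a concurrency= argument for the same purpose, so check the docs for your version.

```python
from typing import Any, Dict


class GetEmbeddings:
    """Callable class; Ray Data creates one instance per actor and reuses it."""

    def __init__(self) -> None:
        # Expensive setup (e.g. loading a model) goes here and runs once per actor.
        # Hypothetical stand-in model: maps text to [length, number of spaces].
        self.model = lambda text: [float(len(text)), float(text.count(" "))]

    def __call__(self, row: Dict[str, Any]) -> Dict[str, Any]:
        # map() passes one row as a dict and expects a dict back.
        row["embedding"] = self.model(row["text"])
        return row


def add_embeddings(ds, num_actors: int = 2):
    """Apply GetEmbeddings to every row of a Ray Dataset using a pool of actors."""
    # Imported here so the class above can be exercised without Ray installed.
    from ray.data import ActorPoolStrategy

    # Pass the class itself: no instance, no @ray.remote decorator.
    # Assumption: Ray 2.5-style API (compute=ActorPoolStrategy).
    return ds.map(GetEmbeddings, compute=ActorPoolStrategy(size=num_actors))


if __name__ == "__main__":
    import ray

    ds = ray.data.from_items([{"text": "hello world"}, {"text": "ray data"}])
    print(add_embeddings(ds).take_all())
```

Because __call__ returns the row dict rather than a bare object or an ObjectRef, this should avoid the "Standalone Python objects are not allowed" error, and the actor pool is what gives you the parallelism.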
