Actor Design: Storing object refs and retrieving the object

How severely does this issue affect your experience of using Ray?

  • None: Asking for design help.

I have the following Actor that holds a dataframe. I don’t want the actor to hold the dataframe in memory, so I put it in the object store and then get it when needed.

It feels like an anti-pattern because I explicitly put the df in the object store in load_dataset() and then manually dereference it with ray.get() when needed in shape().

What is the best way to design an actor that holds data and has methods that reference that data?

import pandas as pd
import ray
from sklearn.datasets import load_digits, load_iris, load_wine

@ray.remote
class DataSet:
    """This remote class wraps a Sklearn dataset."""

    dataset_dict = {
        'iris': load_iris,
        'wine': load_wine,
        'digits': load_digits
    }

    def __init__(self, dataset_choice):
        self.dataset_choice = dataset_choice
        self.sklearn_data_ref, self.dataset_ref = self.load_dataset(dataset_choice)

    def load_dataset(self, dataset_choice):
        # Look up the scikit-learn loader for the requested dataset.
        loader = self.dataset_dict[dataset_choice]
        sklearn_data = loader()
        dataset_df = pd.DataFrame(data=sklearn_data.data, columns=sklearn_data.feature_names)
        # Put both objects in the object store and keep only the refs.
        sk_ref = ray.put(sklearn_data)
        dataset_ref = ray.put(dataset_df)

        return sk_ref, dataset_ref

    def shape(self):
        dataset = ray.get(self.dataset_ref)
        return dataset.shape

Unless you need to pass the dataset ref to other workers, it is better to just keep a direct reference to the data within the actor.