When to use Ray actors vs Ray Datasets for text processing

I have successfully implemented key phrase extraction code using two approaches:

  1. Reading data into Python data structures and processing it with a Ray ActorPool.
  2. Reading data with Ray Datasets and processing it.

How should I decide between processing data held in Python lists (via Ray actors) and processing it with Ray Datasets, considering performance and other aspects?
I am working on keyphrase extraction, which uses a spaCy model.

My data lives in S3, and I can store it there either as multiple JSON files or as Parquet. In this experiment, the Ray actor implementation reads JSON files, while the Ray Datasets implementation reads Parquet. Datasets provides an API to read files directly from S3, which makes things easier; if I go with the Ray ActorPool, I would have to write another helper function to read the files from S3 myself.
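
To make the difference concrete, here is a minimal sketch of the two read paths (the bucket and prefix are placeholders, and read_json_from_s3 is a hypothetical helper built on boto3, not something Ray provides):

import json

import boto3
import ray

# Ray Datasets reads Parquet straight from S3 (placeholder path).
ds = ray.data.read_parquet("s3://my-bucket/keyphrase-input/")

# For the ActorPool route I would need a helper along these lines (hypothetical, using boto3).
def read_json_from_s3(bucket, key):
    obj = boto3.client("s3").get_object(Bucket=bucket, Key=key)
    return json.loads(obj["Body"].read())  # list of dicts, one per document (assumed format)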

I was hoping to learn some general guidelines for developing Ray applications, in this case choosing between Ray Datasets and plain Python data structures (processed by Ray actors) to accomplish my task.

Sharing snippets of code below.

Using Ray Actor Pool:

import ray
import spacy
from ray.util import ActorPool

@ray.remote
class Textprocessor:
    def __init__(self):
        # Set up the spaCy model and add the textrank pipe for keyphrase extraction.
        import pytextrank
        self.nlp = spacy.load("en_core_web_lg")
        self.nlp.max_length = 1080000
        self.nlp.add_pipe("textrank")

    def process(self, args):
        ## some processing ###
        return

    def process_func2(self, args):
        ## some processing ###
        return

    def mainprocess(self, doc):
        ## some processing ###
        return keyphrases

documents = read_json(path)  # read_json is a utility function; documents is a list of dicts.

# Create one actor per CPU in the cluster and pool them.
actors = []
for _ in range(int(ray.cluster_resources()['CPU'])):
    actors.append(Textprocessor.remote())
pool = ActorPool(actors)

for output in pool.map_unordered(lambda a, v: a.mainprocess.remote(v), documents):
    ## processing
    pass

Using Ray Datasets:

import copy

import pandas as pd
import ray
import spacy
from ray.data import ActorPoolStrategy

ds = ray.data.read_parquet("s3://PATH")

class TransformBatch:
    def __init__(self):
        # Set up the spaCy model and add the textrank pipe for keyphrase extraction.
        import pytextrank
        self.nlp = spacy.load("en_core_web_lg")
        self.nlp.max_length = 1080000
        self.nlp.add_pipe("textrank")

    def process(self, args):
        ## some processing ###
        return

    def process_func2(self, args):
        ## some processing ###
        return

    def __call__(self, batch):
        batch = copy.deepcopy(batch)  # avoid mutating the input batch
        batch['lower_text'] = batch['text'].map(str.lower)
        batch['spacy_docs'] = batch['lower_text'].map(self.nlp)
        batch['doc_phrases'] = batch['spacy_docs'].map(self.process)
        batch['key_phrases'] = batch['doc_phrases'].map(self.process_func2)
        return pd.DataFrame(batch['key_phrases'])  # only need the key phrases

ds = ds.map_batches(TransformBatch, compute=ActorPoolStrategy(6, 8), batch_format='pandas')
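
As a small usage sketch of where I could take the result (the output path below is a placeholder), the transformed dataset can then be inspected or written back to S3 with the standard Datasets APIs:

# Peek at a few rows of extracted keyphrases.
print(ds.take(5))

# Persist the keyphrases back to S3 as Parquet (placeholder path).
ds.write_parquet("s3://my-bucket/keyphrase-output/")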

Awesome! Good to see you got things working.

I would say that for you, Datasets seems to be the least-hassle option and generally the right tool here. You should use Datasets if you're working with large amounts of data AND you don't need the lower-level primitives that Ray provides (actors and tasks). I'll go ahead and open an issue on GitHub to track this as a documentation change as well.

Here’s a blog post that explains a bit more: Model Batch Inference in Ray: Actors, ActorPool, and Datasets | Anyscale (though this is perhaps more focused on model inference)

Thanks a lot, @rliaw. I was actually looking for something like this but couldn't find it earlier. This is really helpful.