Ray Datasets and Shell Tasks

How severe does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

Hi all,
I am working on a project to build a more robust version of a pipeline that is currently being run using bash and cron. Many of the features Ray has in abstracting worker scaling and communication are quite appealing. However, most of the tasks in our pipeline are run from the shell with arguments for input and output data. Running these commands through python is not an issue, but is the requirement to read and write data from the disk a problem for Ray? If so, do you have any recommendations or other thoughts?

Thanks!

Hi @Raghav_G, it should not be a problem for a task to read/write data from the disk. From what my understanding of your question, you basically want to have a distributed glue for a bunch tasks, which can handle launching tasks, coordinating the communication and scaling when needed. Ray users can actually leverage Ray at different levels, which we summarized here: Anyscale - Ray Distributed Library Patterns. In your case, you seem want to use level 2 (using task scheduling and communication), without a need for shared distributed memory layer, which is fine.

1 Like

Thank you for your response. After looking at library patterns you shared, I think our workflow may not entirely fit within the level 2 framework as we would need an inter-node data communication system (currently we are using rsync) to transfer intermediate files between nodes in different cluster types. Is there a way to connect the file system to the shared distributed memory layer, say through object serialization/deserialization? I am very new to Ray, so please let me know if I am misunderstanding something.

In that case, the Ray Datasets might be a way to go.

  • Creating dataset from your files: You can load those files into Datasets, which resides in the Ray’s distributed memory store. Here is the guide on how to create Dataset from file storage (local or distributed file system): Creating Datasets — Ray 3.0.0.dev0
  • Processing the data with Datasets APIs: you may express the computation you want to apply to the data with Datasets transformation APIs, more guide: Transforming Datasets — Ray 3.0.0.dev0
  • Consuming the processed data: from what I understand, you want to write out data to disk (which is one of the ways to consume the data), you may check how to save the Datasets here: Consuming Datasets — Ray 3.0.0.dev0
    Note if you are performing the above in a distributed cluster, you will need to take care of the file system (e.g. use S3 or HDFS etc).