Ray Datasets and Shell Tasks

Raghav_G · August 3, 2022, 12:33pm

How severe does this issue affect your experience of using Ray?

None: Just asking a question out of curiosity

Hi all,
I am working on a project to build a more robust version of a pipeline that is currently being run using bash and cron. Many of the features Ray has in abstracting worker scaling and communication are quite appealing. However, most of the tasks in our pipeline are run from the shell with arguments for input and output data. Running these commands through python is not an issue, but is the requirement to read and write data from the disk a problem for Ray? If so, do you have any recommendations or other thoughts?

Thanks!

jianxiao · August 3, 2022, 10:25pm

Hi @Raghav_G, it should not be a problem for a task to read/write data from the disk. From what my understanding of your question, you basically want to have a distributed glue for a bunch tasks, which can handle launching tasks, coordinating the communication and scaling when needed. Ray users can actually leverage Ray at different levels, which we summarized here: Anyscale - Ray Distributed Library Patterns. In your case, you seem want to use level 2 (using task scheduling and communication), without a need for shared distributed memory layer, which is fine.

Raghav_G · August 4, 2022, 1:43am

Thank you for your response. After looking at library patterns you shared, I think our workflow may not entirely fit within the level 2 framework as we would need an inter-node data communication system (currently we are using rsync) to transfer intermediate files between nodes in different cluster types. Is there a way to connect the file system to the shared distributed memory layer, say through object serialization/deserialization? I am very new to Ray, so please let me know if I am misunderstanding something.

jianxiao · August 4, 2022, 3:27am

In that case, the Ray Datasets might be a way to go.

Creating dataset from your files: You can load those files into Datasets, which resides in the Ray’s distributed memory store. Here is the guide on how to create Dataset from file storage (local or distributed file system): Creating Datasets — Ray 3.0.0.dev0
Processing the data with Datasets APIs: you may express the computation you want to apply to the data with Datasets transformation APIs, more guide: Transforming Datasets — Ray 3.0.0.dev0
Consuming the processed data: from what I understand, you want to write out data to disk (which is one of the ways to consume the data), you may check how to save the Datasets here: Consuming Datasets — Ray 3.0.0.dev0
Note if you are performing the above in a distributed cluster, you will need to take care of the file system (e.g. use S3 or HDFS etc).

Topic		Replies	Views
Passing large binary files and directories between tasks Ray Data	1	560	October 21, 2022
Data exchange between workers Ray Client	3	647	August 3, 2021
Avoiding cross node communication ray data pipeline	0	100	April 22, 2024
Is it possible to share objects between different driver processes? Ray Core	1	617	July 22, 2022
Ray suitability for unsplittable, non-numeric data processing	0	371	September 20, 2021

Ray Datasets and Shell Tasks

Related topics