Avoiding cross-node communication in a Ray Data pipeline

vikram_Gill · April 22, 2024, 7:52pm

Hi Folks! I have four questions:

1. I have a distributed setup of 100 machines, with 3 actors placed on every node using SPREAD scheduling. If I run a data pipeline that chains the 3 actors in sequence (actor A → actor B → actor C), will data be produced and consumed locally on each node, with no cross-node communication? I want to minimize cross-node communication. Or do I have to tune the locality_with_output and actor_locality_enabled options described on the "Execution Configurations" page of the Ray 2.43.0 docs?
2. Is it possible to have multiple write/save operations in one data pipeline? It looks like ray.data.write_XXXX does not return a dataset to continue from. Any recommendations for accomplishing this? The goal is to also save intermediate artifacts of the pipeline to S3 for later debugging/analysis on an as-needed basis.
3. Is SPREAD scheduling a soft constraint or a hard constraint when launching actors?
4. Are Ray tasks reused? If launching a task needs little initialization context but there is a lot of repeated, similar processing, should we favor tasks over actors? Which one gives better performance?
