Hi Folks! I have four questions:
- I have a distributed setup of 100 machines, with 3 actors installed on every node via SPREAD scheduling. If I execute a data pipeline that runs the 3 actors in sequence (actor A → actor B → actor C), will data be produced and consumed locally on every node, with no cross-node communication? I want to minimize cross-node communication. Or do I need to tune the `locality_with_output` and `actor_locality_enabled` options mentioned here: Execution Configurations — Ray 2.11.0?
- Is it possible to have multiple write/save operations in a data pipeline? It looks like the `ray.data.write_XXXX` methods do not return a dataset to continue from. Any recommendations on accomplishing this? The goal is to also save intermediate artifacts of the pipeline to S3 for later debugging/analysis on an as-needed basis.
- Is SPREAD scheduling a soft constraint or a hard constraint when launching actors?
- Are Ray tasks reused? Say there is not much initialization context required to launch a task, but there is a lot of repeated, similar processing: should we favor tasks over actors? Which one gives better performance?