Ray for HPC workloads, and how it compares to the Legion Programming System

How severely does this issue affect your experience of using Ray?

  • None: Just asking a question out of curiosity

I’m curious whether anyone has experience applying Ray to HPC workloads. For example, have you run into problem areas in Ray’s design, or cases where Ray’s performance falls short for HPC jobs compared with the ML/AI workloads it was designed for? I’d also like to hear thoughts on how Ray’s design compares to Legion (https://legion.stanford.edu/), which seems like a very similar system but is targeted specifically at the HPC domain.

As an example of a similar feature: Ray can cache scheduling decisions to amortize the scheduling RPC overhead across similar tasks. Legion has a comparable idea in index space tasks, which efficiently launch a large number of non-interfering tasks.
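To make the comparison a bit more concrete, here is a toy Python sketch of the two ideas as I understand them. It is not Ray's or Legion's actual API or implementation: `ToyScheduler`, `submit`, and `index_launch` are hypothetical names, and the "RPC" is just a counter. It only shows how one placement decision can be reused for tasks with the same shape, and how a single launch call can describe many point tasks.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical scheduler used only for illustration (not a real Ray/Legion class)."""
    nodes: list
    rpc_count: int = 0
    _cache: dict = field(default_factory=dict)

    def _request_placement(self, func_name, resources):
        # Stand-in for an expensive scheduling RPC: pick nodes round-robin.
        node = self.nodes[self.rpc_count % len(self.nodes)]
        self.rpc_count += 1
        return node

    def submit(self, func_name, resources):
        # Tasks with the same function and resource shape reuse one decision.
        key = (func_name, resources)
        if key not in self._cache:
            self._cache[key] = self._request_placement(func_name, resources)
        return self._cache[key]

    def index_launch(self, func_name, resources, index_space):
        # One call describes every point task in the index space,
        # so a single placement decision covers the whole launch.
        node = self.submit(func_name, resources)
        return {point: node for point in index_space}


sched = ToyScheduler(nodes=["node-0", "node-1"])
cpu = (("CPU", 1),)  # hashable resource shape, an assumption of this sketch

placements = [sched.submit("f", cpu) for _ in range(1000)]
launch = sched.index_launch("g", cpu, range(4))
print(sched.rpc_count)
```

In this toy model, the 1000 calls to `submit` trigger one simulated RPC and the index launch triggers one more. Real systems obviously have to deal with load changes, locality, and invalidating stale decisions, which the sketch ignores.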

Legion also has the idea of Dynamic Tracing (see the paper: https://legion.stanford.edu/pdfs/trace2018.pdf), which is essentially JIT compiling the task graph. Quoting the paper: “…dynamic tracing, a technique to efficiently and correctly memoize a dynamic dependence analysis and generate a task graph semantically equivalent to (but also often syntactically different from) the original.” I haven’t been able to find a similar idea applied to Ray. Of course, that might be because I haven’t fully understood how Ray dynamically generates its task graph, or whether this is even a problem for Ray that Dynamic Tracing would address; it’s just an example I came up with for comparison.
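For anyone who hasn't read the paper, this is roughly the kind of memoization I have in mind, as a toy Python sketch. It is my own simplification, not Legion's implementation: `analyze_dependences` and `Tracer` are hypothetical names, tasks are plain `(name, reads, writes)` tuples, and the trace is keyed on the whole task sequence rather than on an explicit trace ID.

```python
def analyze_dependences(tasks):
    """Return edges (i, j) meaning task j must run after task i.

    tasks is a list of (name, reads, writes), where reads and writes are sets
    of data names. Two tasks conflict on a read-after-write, write-after-read,
    or write-after-write hazard.
    """
    edges = set()
    for j, (_, reads_j, writes_j) in enumerate(tasks):
        for i in range(j):
            _, reads_i, writes_i = tasks[i]
            if (writes_i & reads_j) or (reads_i & writes_j) or (writes_i & writes_j):
                edges.add((i, j))
    return edges


class Tracer:
    """Memoizes the dependence analysis for a repeated task sequence."""

    def __init__(self):
        self.traces = {}
        self.analyses_run = 0

    def execute(self, tasks):
        # Hashable signature of the sequence; assumed to identify the trace.
        signature = tuple(
            (name, frozenset(reads), frozenset(writes))
            for name, reads, writes in tasks
        )
        if signature not in self.traces:
            # First time this sequence is seen: pay for the full analysis.
            self.analyses_run += 1
            self.traces[signature] = analyze_dependences(tasks)
        # Later iterations replay the recorded graph.
        return self.traces[signature]


loop_body = [
    ("init", set(), {"a"}),
    ("step", {"a"}, {"b"}),
    ("reduce", {"b"}, {"c"}),
]

tracer = Tracer()
for _ in range(10):
    graph = tracer.execute(loop_body)
print(tracer.analyses_run, sorted(graph))
```

Here the quadratic analysis runs once and the other nine iterations reuse the recorded edges. Building the signature is still linear in the number of tasks, so this only captures the flavor of the idea, not the paper's actual mechanism for validating and replaying traces.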

Would love to hear if anyone else has thoughts on these topics.