Hi all,
I’ve been using Ray 1.3 to distribute a number (~O(100) for now) long-running (~30-60 minutes each) tasks to nodes on GCP (preemptible instances w. autoscaling). Each task is configured to run on its own node – I’ve added a custom “node” resources and decorate the function with:
@ray.remote(resources={"node": 1})
def processPatch(preprocessed_observations, test_orbits, out_dir, config, cwd=None):
    ...
The tasks are all launched at once:
...
futures = []
for patch in patches:
    futures += [ processPatch.remote(preprocessed_observations, patch_orbits_fn, out_dir, config, cwd=cwd) ]
...
and then waited on with something along the lines of:
    while len(futures):
        ready, futures = ray.wait(futures, num_returns=1, timeout=None)
        results = ray.get(ready)
        ...
Occasionally the nodes get preempted (stopped) by Google. What I’d expect then is for Ray to retry and reschedule task execution to a different node. However, some fraction of the time this doesn’t happen – and typically I end up with ~95 tasks being finished, and ~5 that never do. This, I understand, shouldn’t normally happen?
The really problematic thing, though, is that there’s no indication at the driver that there’s anything wrong with these tasks. So ray.wait in the loop above, just hangs there forever.
Question: Does anyone have any ideas on what may be going on (or how to debug it)? I.e., is there a way to inspect a submitted task to see what Ray thinks it’s doing (e.g., where it’s scheduled, etc.)?