Ray tasks lost on node failure, how to debug?

Hi all,

I’ve been using Ray 1.3 to distribute a number of long-running tasks (~O(100) for now, ~30-60 minutes each) to nodes on GCP (preemptible instances w. autoscaling). Each task is configured to run on its own node – I’ve added a custom “node” resource and decorated the function with:

@ray.remote(resources={"node": 1})
def processPatch(preprocessed_observations, test_orbits, out_dir, config, cwd=None):
    ...

The tasks are all launched at once:

...
futures = []
for patch in patches:
    futures += [ processPatch.remote(preprocessed_observations, patch_orbits_fn, out_dir, config, cwd=cwd) ]
...

and then waited on with something along the lines of:

    while len(futures):
        ready, futures = ray.wait(futures, num_returns=1, timeout=None)
        results = ray.get(ready)
        ...

Occasionally the nodes get preempted (stopped) by Google. What I’d expect then is for Ray to retry and reschedule the task on a different node. However, some fraction of the time this doesn’t happen – typically I end up with ~95 tasks finished and ~5 that never complete. This, as I understand it, shouldn’t normally happen?
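
(My understanding from the docs is that tasks are retried automatically on worker/node failure, up to max_retries times, with a default of 3. I haven’t changed that default; purely for illustration, here’s a sketch of how it could be made explicit:)

import ray

# A sketch only: max_retries=3 is the documented default for tasks, so this
# should be a no-op. It just makes explicit the knob I'd expect to govern
# retries after a node is lost.
@ray.remote(resources={"node": 1}, max_retries=3)
def processPatch(preprocessed_observations, test_orbits, out_dir, config, cwd=None):
    ...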

The really problematic thing, though, is that there’s no indication at the driver that anything is wrong with these tasks – the ray.wait in the loop above just hangs there forever.

Question: Does anyone have any ideas on what may be going on (or how to debug it)? I.e., is there a way to inspect a submitted task to see what Ray thinks it’s doing (e.g., where it’s scheduled, etc.)?
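
(To illustrate what I mean by “inspect”: the closest I can currently do from the driver is to poll with a finite timeout and dump whatever is still outstanding, roughly as in the sketch below, which reuses the futures list from above. This still doesn’t tell me where Ray thinks a task is scheduled.)

import ray

# Sketch: poll with a finite timeout instead of blocking forever, and
# periodically report which ObjectRefs are still outstanding and which
# nodes Ray currently considers alive.
pending = list(futures)
while pending:
    ready, pending = ray.wait(pending, num_returns=1, timeout=60.0)
    if not ready:
        print(f"{len(pending)} tasks still pending:", [r.hex() for r in pending])
        print("alive nodes:", [n["NodeManagerAddress"] for n in ray.nodes() if n["Alive"]])
        continue
    for result in ray.get(ready):
        ...  # process the finished patch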

In case it gives anyone some insight, here’s what I noticed looking at autoscaler outputs after a node was preempted:

======== Autoscaler status: 2021-06-05 12:01:57.946482 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
 30 ray_worker_cpu8
Pending:
 (no pending nodes)
Recent failures:
 10.138.15.204: ray_worker_cpu8
 10.138.15.200: ray_worker_cpu8
 10.138.15.195: ray_worker_cpu8
 10.138.0.90: ray_worker_cpu8
 10.138.0.92: ray_worker_cpu8
 10.138.0.85: ray_worker_cpu8
 10.138.0.73: ray_worker_cpu8
 10.138.0.79: ray_worker_cpu8

Resources
---------------------------------------------------------------

Usage:
 30.0/962.0 CPU
 0.00/2634.564 GiB memory
 30.0/30.0 node
 0.00/1129.421 GiB object_store_memory

Demands:
 {'node': 1.0, 'CPU': 1.0}: 23+ pending tasks/actors
2021-06-05 12:02:03,919	INFO load_metrics.py:101 -- LoadMetrics: Removed mapping: 10.138.15.206 - 1622919716.5614395
2021-06-05 12:02:03,920	INFO load_metrics.py:107 -- LoadMetrics: Removed 1 stale ip mappings: {'10.138.15.206'} not in {'10.138.0.81', '10.138.0.95', '10.138.0.104', '10.138.0.123', '10.138.15.203', '10.138.0.106', '10.138.15.198', '10.138.0.23', '10.138.0.78', '10.138.15.194', '10.138.0.58', '10.138.15.197', '10.138.0.75', '10.138.15.201', '10.138.0.122', '10.138.0.86', '10.138.0.61', '10.138.0.87', '10.138.0.89', '10.138.15.196', '10.138.0.96', '10.138.0.124', '10.138.15.205', '10.138.0.83', '10.138.0.101', '10.138.0.93', '10.138.0.88', '10.138.15.193', '10.138.15.199', '10.138.15.202'}
2021-06-05 12:02:04,522	INFO autoscaler.py:309 --
======== Autoscaler status: 2021-06-05 12:02:04.522313 ========
Node status
---------------------------------------------------------------
Healthy:
 1 ray_head_default
 29 ray_worker_cpu8
Pending:
 (no pending nodes)
Recent failures:
 10.138.15.206: ray_worker_cpu8
 10.138.15.204: ray_worker_cpu8
 10.138.15.200: ray_worker_cpu8
 10.138.15.195: ray_worker_cpu8
 10.138.0.90: ray_worker_cpu8
 10.138.0.92: ray_worker_cpu8
 10.138.0.85: ray_worker_cpu8
 10.138.0.73: ray_worker_cpu8
 10.138.0.79: ray_worker_cpu8

Resources
---------------------------------------------------------------

Usage:
 29.0/930.0 CPU
 0.00/2546.900 GiB memory
 29.0/29.0 node
 0.00/1091.851 GiB object_store_memory

Demands:
 (no resource demands)
2021-06-05 12:02:11,079	INFO autoscaler.py:309 --

Note how the Demands seem to have disappeared after node removal (from 23+ pending to “no resource demands”).
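
(For what it’s worth, the driver-side view of the same numbers can be cross-checked with the resource queries below; this is just a sketch, not something I’ve captured for the hung state yet.)

import ray

ray.init(address="auto")  # attach to the running cluster

# Compare what the driver sees against the autoscaler's report, including
# the custom "node" resource.
print("cluster:  ", ray.cluster_resources())
print("available:", ray.available_resources())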

Hmm, if things are hanging without exceptions, I think that could be a bug. cc @Stephanie_Wang, do you have any guess as to what’s happening here?

If there are any hints on how to debug this (e.g., how to increase logging levels, or what to expect), I’d be extremely grateful. Right now this is a showstopper for us when it comes to using Ray, so I’d be happy to try out any suggestions (down to recompiling if needed).

In case it helps:

  • As I said, our tasks are long running (>= 30 minutes). Not sure if that has anything to do with the issue, but thought I’d mention it.
  • Also, the tasks are very simple: they just invoke an external (also Python) program via subprocess.run() (it’s a backward-compatibility stopgap), roughly as in the sketch below. Not sure if that’s something that could cause a bad interaction w. Ray, but also thought I’d mention it.
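
A sketch of what each task body looks like (the script name and arguments are placeholders for the real invocation):

import subprocess
import ray


@ray.remote(resources={"node": 1})
def processPatch(preprocessed_observations, test_orbits, out_dir, config, cwd=None):
    # Hand the work off to the legacy (external) Python program and block
    # until it exits; "run_patch.py" and its arguments are placeholders.
    subprocess.run(
        ["python", "run_patch.py", preprocessed_observations, test_orbits, out_dir],
        cwd=cwd,
        check=True,  # surface a non-zero exit code as an exception
    )
    return out_dir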

@mjuric Sorry for the delay in the response! We currently don’t have tools for task-level inspection, so in this case you should probably turn up the log level and dig through the logs to figure out what the problem is. A couple of suggestions:

  • Set debug-level logging when you run ray start or ray.init (see Debugging — Ray v2.0.0.dev0, and the sketch after this list).
  • My gut feeling is that it’s most likely a bug on Ray’s end. But you can print the list of object refs and do a global search through the Ray worker log files to see where each object ref is generated (also sketched below). I’d love to offer pair debugging here. The Ray logging directory structure is described here: Logging — Ray v2.0.0.dev0
  • What’s your scale? I’d recommend simulating the scenario (with 2 machines, for example, and stopping one of the nodes from the AWS console) and seeing whether it’s reproducible. If it is, we can try debugging the issue and make sure it works.
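
To make the first two suggestions concrete, something along these lines should work (a sketch; the logging_level argument to ray.init and the RAY_BACKEND_LOG_LEVEL environment variable are what the debugging docs describe, but please double-check against your Ray version):

# On each node, before `ray start`, raise the raylet (C++) log level:
#   RAY_BACKEND_LOG_LEVEL=debug ray start --address=<head>:6379 ...
import logging
import ray

# On the driver, raise the Python-level log level.
ray.init(address="auto", logging_level=logging.DEBUG)


@ray.remote(resources={"node": 1})
def work(i):
    return i  # stand-in for processPatch


futures = [work.remote(i) for i in range(100)]

# Record each object ref's hex ID so it can be grepped in the per-node logs,
# e.g. `grep -r <hex id> /tmp/ray/session_latest/logs`.
for ref in futures:
    print("submitted", ref.hex())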

Thanks @sangcho!

I’ll try this out, but it will probably take a few weeks. We had to move this project to Celery + a home-brewed autoscaler to meet a June deadline, so I’m out of resources to work on Ray at the moment.

The typical scale where this would show up was ~30 nodes (32 cores each) on Google Cloud (not AWS). Interestingly, I couldn’t trigger it by manually stopping machines (which according to GCP’s documentation should be equivalent to preemption), but it reliably shows up when GCP preempts on its own (making me think there’s something different between those two scenarios).

As soon as I have time I’ll try to construct a minimal case that reproduces the problem, w. all the logging turned on.
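
In case it’s a useful starting point, the minimal reproduction I have in mind looks roughly like this (a sketch: time.sleep stands in for the real subprocess call, and a worker node would be preempted/stopped from the GCP console while the tasks run):

import time
import ray

ray.init(address="auto")


@ray.remote(resources={"node": 1}, max_retries=3)
def long_task(i):
    # Stand-in for processPatch: sleep long enough that a node can be
    # preempted mid-task.
    time.sleep(30 * 60)
    return i


futures = [long_task.remote(i) for i in range(4)]

while futures:
    ready, futures = ray.wait(futures, num_returns=1, timeout=60.0)
    if ready:
        print("finished:", ray.get(ready))
    else:
        print(f"{len(futures)} still pending:", [f.hex() for f in futures])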
