Next steps for debugging a Ray process?

Hi there,

I’m currently running into an odd issue while using Ray in Python on AWS: runs will intermittently hang indefinitely, without crashing or producing any error messages that I can see.

Here’s a simplified representation of my project setup:

import ray  # assumed; ray.init / the cluster connection happens elsewhere in the real script

def main():
    print("starting run")
    task_running = True
    while task_running:
        print("starting main loop")

        output_1 = ray.get([process_a.remote(a) for a in range(N)])
        print("completed process a")
        print()
        print("spawning process b")
        output_2 = ray.get([process_b.remote(b) for b in output_1])
        print("completed process b")

        task_running = check_output(output_2)
        print("completed main loop")
    print("completed run")

@ray.remote(num_cpus=1)
def process_a(x):
    print("\tstarting process a #{}".format(x))
    process_running = True
    while process_running:
        print("\tstarting loop of process a #{}".format(x))
        output_3 = ray.get([process_c.remote(x, c) for c in range(M)])
        print("\tcompleted loop of process a #{}".format(x))
        process_running = check_output_3(output_3)
    print("completed process a #{}".format(x))
    return x  # simplified; the real code returns a result that feeds process_b

@ray.remote(num_cpus=1)
def process_c(x, y):
    print("\t\tstarting process c #{}-{}".format(x, y))
    # some process (runtime: 30-60s)
    return y

@ray.remote(num_cpus=1)
def process_b(z):
    print("\tstarting process b #{}".format(z))
    # some process (runtime: 30-300s)
    return z

The code in processes a, b, and c is a combination of Python and C code (compiled from Python using Cython). My Ray config file is basically the default Ray autoscaler config for AWS with the following changes:

min_workers: N
max_workers: N
target_utilization_fraction: 1.0
head_node:
    InstanceType: INSTANCE_TYPE
worker_nodes:
    InstanceType: INSTANCE_TYPE

Here, N corresponds to the number of instances of process_a/process_b that I’m running, and INSTANCE_TYPE is an Amazon EC2 instance type with M CPUs, so the cluster has NxM CPUs in total. This means that running process_a utilizes the entire cluster, while running process_b only uses a small fraction of it.
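
For reference, a quick way I can sanity-check the cluster size and load from the head node is something like the sketch below (ray.cluster_resources() / ray.available_resources() report registered and unclaimed resources; the comments about expected values are just my reasoning, not measured output):

    import ray

    ray.init(address="auto")  # connect to the running cluster from the head node

    # Total CPUs registered across all nodes; should come out to roughly N * M.
    print(ray.cluster_resources().get("CPU"))

    # CPUs not currently claimed by tasks. While process_a is running this should
    # be near 0; while process_b is running it should be close to N * (M - 1).
    print(ray.available_resources().get("CPU"))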

The issue I’m running into is that Ray intermittently stalls while attempting to complete process_b. Example output from a stalled run with N=2 and M=3 looks something like the following:

starting run
starting main loop
starting process a #0
starting loop of process a #0
starting process c #0-0
starting process c #0-1
starting process c #0-2
completed loop of process a #0
starting loop of process a #1
starting process c #1-0
starting process c #1-1
starting process c #1-2
completed loop of process a #1
completed process a #1
starting loop of process a #0
starting process c #0-0
starting process c #0-1
starting process c #0-2
completed loop of process a #0
completed process a #0
completed process a

spawning process b
starting process b #0
starting process b #1
completed process b
completed main loop
starting main loop
starting process a #0
starting loop of process a #0
starting process c #0-0
starting process c #0-1
starting process c #0-2
completed loop of process a #0
completed process a #0
starting loop of process a #1
starting process c #1-0
starting process c #1-1
starting process c #1-2
completed loop of process a #1
completed process a #1
completed process a

spawning process b

What’s odd is that none of the print statements in process_b are ever reached during a stall, even though the main process prints the line directly before it would spawn the process_b instances. I have left the code running for up to an hour in this state; nothing else gets printed and it never crashes.

Also odd is that process_b will run successfully all the way through several times before stalling. As far as I can tell, there is no pattern to the inputs that cause process_b to stall, though the actual meat of processes b and c is stochastic, so it’s difficult to tell whether some rare combination of inputs is the culprit.

To check whether something was crashing inside process_b, I ran the same code with the N process_b instances executed serially (a plain for loop instead of Ray), and that completes successfully. It’s only when I run the second process in parallel through Ray that I get this stalling behavior. I have never seen process_a stall.
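
For what it’s worth, that serial check looked roughly like the sketch below; run_b here is a hypothetical stand-in for the actual body of process_b, which is too long to post:

    # Hypothetical sketch of the serial fallback test (run_b stands in for the
    # real body of process_b; process_b is just a thin Ray wrapper around it).
    def run_b(z):
        print("\tstarting process b #{}".format(z))
        # some process (runtime 30-300s)
        return z

    process_b = ray.remote(num_cpus=1)(run_b)

    # Serial version (no Ray involved): completes successfully every time.
    output_2 = [run_b(b) for b in output_1]

    # Parallel version: this is the call that intermittently stalls.
    # output_2 = ray.get([process_b.remote(b) for b in output_1])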

I’m at a bit of an impasse as to where to even begin debugging this. I’m also fairly new to Ray, so although I did my best to search the existing discussions, documentation, and GitHub issues, it’s entirely possible that I missed an existing answer for situations like this; if so, my apologies for the duplication.

Basically, I’m trying to build an intuition for where to look next. Is this likely to be a memory issue with the Ray workers? Is it related to the variability in run times I see with process_b? Am I hitting some kind of timeout where Ray workers wait too long for new work? Is there something problematic about the alternating, nested-task design pattern I’m using? Are there other aspects of my code I should be considering?

The original code is overly complicated and too long to post in its entirety here, but I’m happy to provide any additional details that might be helpful.

Any help would be greatly appreciated! I’ve been pulling my hair out over this one for several weeks now.

One potential place to start would be with a debugger. Check out Ray Debugger — Ray v2.0.0.dev0 for information on how to use pdb with Ray. You should also be able to attach gdb to a Ray worker process to debug any C/C++ code.
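
To make that concrete, the flow described in those docs is roughly: drop a breakpoint() into the suspect task, then attach from the head node with the ray debug CLI. A minimal sketch (the pgrep pattern for gdb is an assumption based on Ray’s default worker process titles, which include the task name):

    import ray

    @ray.remote(num_cpus=1)
    def process_b(z):
        breakpoint()  # pauses this task and registers it with the Ray debugger
        # ... some process (runtime 30-300s) ...
        return z

    # Then, from a shell on the head node:
    #   ray debug                                  # list paused breakpoints and open a pdb session
    #
    # For the C/Cython side, attach gdb to the worker running the task, e.g.:
    #   sudo gdb -p $(pgrep -f "ray::process_b")   # assumes default worker process titles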

If you do end up suspecting this is a Ray bug, then hopefully this can help you generate a simpler reproduction.