Unexpected Subprocess behaviour inside a loop in a ray method

Hello Guys,

I am using Python's subprocess module inside a for loop that calls a Ray remote function, as shown below:

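import subprocess
import ray

ray.init()

@ray.remote
def get_info(val):
  # Placeholder for the real command; val is substituted into it
  cmd = "<SOME_COMMAND_TO_EXECUTE_ON_TERMINAL>".format(val)
  out = subprocess.check_output(cmd, shell=True)
  return out

for i in range(0, 1000):
  future = get_info.remote(i)
  # ray.get blocks until this task has finished
  print(ray.get(future))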
  

The code above behaves as follows:

  1. Without the loop, it works fine in the distributed environment.
  2. With the loop, it does not complete all 1000 iterations. It stops after a few iterations (fewer than 10) without giving any proper error message.

As I understand it, subprocess (or Popen) in Python creates a child process. When this is used inside Ray's distributed process environment, it seems to lead to some unexpected behaviour.

Has anyone encountered this problem before? Please help me resolve this issue.

Thanks
Pooja Ayanile

Hi Pooja,

For me, the following code works fine:

import subprocess
import ray

ray.init()

@ray.remote
def get_info(val):
  cmd = f"echo {val}"
  out=subprocess.check_output(cmd, shell=True)
  return out

for i in range(0,1000):
  future = get_info.remote(i)
  print(ray.get(future))

Are you using a homogeneous cluster for processing? What kind of machines? Are you sure each machine supports the command you call? If it is a custom script, you should probably use absolute paths to call it.
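
One way to check whether a machine supports the command is to ask it where the executable lives. The following is only a minimal sketch: check_command is a hypothetical helper name, and "echo" is a stand-in for the real command, which is not given in this thread.

```python
import shutil
import socket


def check_command(name):
    """Report whether `name` resolves to an executable on this machine."""
    # shutil.which returns the full path of the executable,
    # or None if it cannot be found on PATH.
    return {
        "host": socket.gethostname(),
        "command": name,
        "path": shutil.which(name),
    }


if __name__ == "__main__":
    # "echo" is a stand-in; replace it with the command you actually call.
    print(check_command("echo"))
```

To run this on the cluster you could wrap the same function with ray.remote and submit it several times. Ray decides where each task runs, so compare the "host" field of the results to see which machines actually answered.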

By the way, as written your code is not parallelized: each ray.get call blocks until that task has finished, so the next task is only submitted afterwards. Is that what you want? If you want parallel execution, you can do something like this:

futures = [get_info.remote(i) for i in range(1000)]
print(ray.get(futures))

Hi kai,
Let me reframe my query: inside the for loop I want to call get_info.remote() three times, as get_info.remote(i), get_info.remote(i+1) and get_info.remote(i+2). For now I am running the code on a single server with the following configuration:
RAM - 32 GB
CPU - 4
Storage - 4 TB

hey @Pooja_Ayanile, concretely, let’s say cmd = "echo {val}".

What output are you getting, and what output do you expect?

Hi @Pooja_Ayanile can you post the stack trace for your issue?
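
If there is no stack trace to post, the command's own error output may help. A minimal sketch, assuming the command is run through the shell as in the original snippet (run_and_report is a hypothetical helper name): it uses subprocess.run with capture_output=True so that a failing command returns its exit code and stderr instead of raising.

```python
import subprocess


def run_and_report(cmd):
    """Run a shell command and return its exit code, stdout and stderr."""
    # check is left at its default (False), so a non-zero exit code
    # does not raise and the caller can inspect stderr instead.
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return {
        "cmd": cmd,
        "returncode": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


if __name__ == "__main__":
    # "echo hello" is a stand-in for the real command.
    print(run_and_report("echo hello"))
```

Returning this dictionary from the remote function, instead of the raw output of check_output, would let the driver print the stderr of whichever call failed.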

Hello @kai , @rliaw ,

The issue is resolved. I noticed that the command I was using was not supported on one of the nodes in the cluster. That affected my entire program and caused the failures.
I also learned that the subprocess module works fine in Ray's distributed environment.

Really appreciate your time guys.

Thanks,
Pooja Ayanile
