Race conditions between subprocess file creation and file read

  • High: It blocks me from completing my task.

Hello,

I am completely new to Ray and distributed computing, so I apologize if this is a stupid question with an obvious answer. I have set up a cluster where the head node loops through a large JSON dataset containing building data. The head node sends each worker 1…n buildings at a time to preprocess before the preprocessed data is passed to an executable via subprocess.Popen(). The executable's run time is O(e^n), so with larger n values there is plenty of time for the subprocess to finish and the output file to be created. The problem arises when I use an n value of 30 or less. After the cluster has run for a few minutes, I start to see FileNotFound errors. They are raised by the code that accesses the output file after the subprocess has run. The failing worker varies, so the problem is not isolated to a specific machine.

  • Each machine is using Ubuntu 20.04
  • Python version 3.9.16
  • Ray version 2.3.1
  • Each node has been manually set up with all the same dependencies, folders, and code files.
  • If I run the code on a single computer without using Ray, I do not experience this problem.
  • All files are created locally on the worker using unique file names

The gist of what the code looks like:

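import subprocess
preprocess(buildings, preprocess_output_file)
p = subprocess.Popen([executable, preprocess_output_file_path])
p.communicate(timeout=600)  # wait up to 600 s for the executable to exit
# Accessing the output file here is what intermittently fails
postprocess(executable_output_file_path)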
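
A minimal sketch of how the subprocess step could be checked before postprocessing runs. `run_and_check` is a hypothetical helper (not from Ray or from the code above), and it assumes the executable reports failure through a nonzero exit code and writes its output to a path known in advance. The point is only to separate "the executable failed" from "the executable exited cleanly but the file is missing", since both currently show up as the same FileNotFound error.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_and_check(cmd, output_path, timeout=600):
    """Run cmd, then confirm it exited cleanly and created output_path."""
    # subprocess.run blocks until the child exits or the timeout expires.
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    if result.returncode != 0:
        # The executable itself failed; surface its stderr instead of
        # letting a later file read fail with a less specific error.
        raise RuntimeError(
            f"{cmd[0]} exited with code {result.returncode}: {result.stderr.strip()}"
        )
    output_path = Path(output_path)
    if not output_path.is_file():
        # Clean exit but no file: the output went somewhere else or was never written.
        raise FileNotFoundError(
            f"{cmd[0]} exited with code 0 but did not create {output_path}"
        )
    return output_path


if __name__ == "__main__":
    # Stand-in for the real executable: a tiny Python child that writes a file.
    with tempfile.TemporaryDirectory() as tmp:
        out = Path(tmp) / "result.txt"
        child = [
            sys.executable,
            "-c",
            "import sys; open(sys.argv[1], 'w').write('done')",
            str(out),
        ]
        print(run_and_check(child, out).read_text())
```

In the flow above, the Popen/communicate pair would be replaced by a call such as `run_and_check([executable, preprocess_output_file_path], executable_output_file_path)` before `postprocess` is called.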

Please let me know if I can provide any extra information to help in troubleshooting this problem.

Thank you very much :)