- High: It blocks me from completing my task.
Hello,
I am completely new to Ray and distributed computing, so I apologize if this is a stupid question with an obvious answer. I have set up a cluster where the head node loops through a large JSON dataset containing building data. The head node sends each worker 1…n buildings at a time to preprocess, and the worker then passes the preprocessed data to an executable via subprocess.Popen(). The executable's run time is O(e^n), so with larger n values there is plenty of time for the subprocess to finish and the output file to be created. The problem arises when I use an n value of 30 or less. After the cluster has run for a few minutes, I start to see FileNotFound errors. They come from the code that accesses the output file after the subprocess has run. The worker that fails varies, so the problem is not isolated to a specific computer.
- Each machine is using Ubuntu 20.04
- Python version 3.9.16
- ray version 2.3.1
- Each node has been manually set up with all the same dependencies, folders, and code files.
- If I run the code on a single computer without using Ray, I do not experience this problem.
- All files are created locally on the worker using unique file names
A gist of what the code looks like:
import subprocess

preprocess(buildings, preprocess_output_file_path)
p = subprocess.Popen([executable, preprocess_output_file_path])
p.communicate(timeout=600)
postprocess(executable_output_file_path)
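
To make the middle of that flow concrete, here is a minimal, self-contained sketch of the subprocess step with the exit code and the output file checked explicitly before postprocessing. It is only an illustration under assumptions: `run_step` and `STAND_IN` are hypothetical names, the stand-in command is a tiny Python one-liner rather than the real executable, and it assumes the executable writes its output to a path known in advance.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Hypothetical stand-in for the real executable: writes "<input path>.out".
STAND_IN = [
    sys.executable,
    "-c",
    "import sys; open(sys.argv[1] + '.out', 'w').write('done')",
]


def run_step(executable_cmd, input_path, output_path, timeout=600):
    """Run the executable, then confirm it succeeded and produced its output file."""
    p = subprocess.Popen(
        [*executable_cmd, str(input_path)],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        text=True,
    )
    try:
        stdout, stderr = p.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Kill the child and reap it so it does not linger, then re-raise.
        p.kill()
        p.communicate()
        raise
    if p.returncode != 0:
        raise RuntimeError(f"executable exited with code {p.returncode}: {stderr}")
    output_path = Path(output_path)
    if not output_path.exists():
        raise FileNotFoundError(f"{output_path} was not created; stderr: {stderr}")
    return output_path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as work_dir:
        input_path = Path(work_dir) / "buildings.json"
        input_path.write_text("{}")
        result = run_step(STAND_IN, input_path, str(input_path) + ".out")
        print(result.read_text())
```

Capturing stderr and the return code this way should at least show whether the executable itself failed on the runs where the output file is missing.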
Please let me know if I can provide any extra information to help in troubleshooting this problem.
Thank you very much