Spawning a web server in a subprocess of a worker

In a Ray worker, I am spawning a subprocess (subprocess.Popen), and in that subprocess I am starting a web server bound to 0.0.0.0:{port}.
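For context, a minimal stand-in for this setup looks roughly like the following. The port and the use of Python's built-in http.server in place of the real server are my own placeholders, not the actual code:

```python
import subprocess
import sys
import time
import urllib.error
import urllib.request

PORT = 8123  # placeholder; the real port comes from the training config

# Spawn a child process that binds a server on 0.0.0.0:{port}, mirroring
# the subprocess.Popen setup described above (http.server is a stand-in).
proc = subprocess.Popen(
    [sys.executable, "-m", "http.server", str(PORT), "--bind", "0.0.0.0"],
    stdout=subprocess.DEVNULL,
    stderr=subprocess.DEVNULL,
)
status = None
try:
    # Retry briefly so the child has time to bind its socket before we hit it.
    for _ in range(50):
        try:
            with urllib.request.urlopen(f"http://127.0.0.1:{PORT}/", timeout=1) as resp:
                status = resp.status
            break
        except (urllib.error.URLError, OSError):
            time.sleep(0.2)
    print(status)
finally:
    proc.terminate()
    proc.wait()
```

Run outside of Ray, this succeeds; the question is why the same pattern gets connection refused inside a Ray worker.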

Later on, in the same Ray worker, I try to POST to that server; however, I get a connection refused error.

Outside of the Ray cluster, while the worker and subprocess are running, I am able to POST to this web server. From within the Ray worker, I am also able to POST to external web servers, just not to the one in the child process.

This is within an RLlib PPO training session.

Hi H-Park and welcome back to the Ray community :slight_smile:
There could be a few things happening here. First, are you sure the server you spun up in the subprocess is ready before the Ray worker tries to send a POST to it? Make sure the server is fully ready if possible.
Also, instead of 0.0.0.0, does the code work if you use 127.0.0.1 or localhost? And nothing else is running on that port, right?
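One way to rule out a startup race is to poll the port until a TCP connect succeeds before sending the POST. A rough sketch (the host, port, and timeout values are placeholders):

```python
import socket
import time

def wait_for_server(host, port, timeout=10.0):
    """Poll until a TCP connect to (host, port) succeeds, or give up
    after `timeout` seconds. Returns True once the server accepts."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.1)
    return False

# In the worker, before POSTing (placeholder values):
# if not wait_for_server("127.0.0.1", port, timeout=15.0):
#     raise RuntimeError("server in subprocess never came up")
```

If this still returns False inside the worker while the server is reachable from outside, the problem is not a readiness race.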

Ray also has a RAY_kill_child_processes_on_worker_exit variable, which turns automatic killing of child processes on or off, so make sure you're not trying to POST to a subprocess that may already be dead. You can read about it here: Lifetimes of a User-Spawn Process — Ray 2.44.1
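If the child does need to outlive the worker, that flag can be flipped off. Note that it generally has to be present in the environment of the process that starts the Ray workers (e.g. the shell running `ray start`, or the driver before `ray.init()`); setting it from inside a task is too late. A sketch:

```python
import os

# RAY_kill_child_processes_on_worker_exit controls whether Ray reaps
# user-spawned child processes when the worker exits (see the doc linked
# above). Set it before the worker processes start; doing it here in the
# driver before ray.init() is one option, exporting it in the shell that
# runs `ray start` is another.
os.environ["RAY_kill_child_processes_on_worker_exit"] = "false"
```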

Some relevant docs that might help:

If you could send over the code so I can reproduce the bug that would be super helpful!

Thanks for your response!

are you sure the server that you spun up in the subprocess is ready before the Ray worker tries to send a post to it?

Yes. I verified this by adding a sleep after the subprocess.Popen call for the server, and during that sleep I manually POSTed to it from a separate terminal.

I was not able to figure this out, unfortunately. Oddly enough, it works when I convert the subprocess into a Docker container that houses the server. What's the difference, you ask, when both are simple POST requests to a local URL? I have no idea. But the Docker approach works beautifully, while the subprocess approach is dead on arrival.

The code is tightly entangled with my company's codebase, especially since this runs inside RLlib, so I am unable to share it. I wonder if it has something to do with RLlib's RolloutWorker.

We use Ray RLlib 2.12 due to bugs, which I have since reported, involving extra resource allocation. We have also occasionally come across policy divergence deep into training, which crashes the learner worker and brings down the whole training job.

We get the feeling that Anyscale is no longer interested in maintaining RLlib and is chasing the LLM cash cow, so we picked a version that has been stable for us and are sticking with it.

Hi H-Park! Glad y'all were able to figure it out. :slight_smile: Let me know if there's anything else I can help you with.
Also, thank you for filing an issue on GitHub, by the way; we are actively working through those too! I'm glad you found 2.12 stable and working well for your use cases.

Thanks for posting, @H-Park. We are certainly investing in RLlib: we launched the new v2 stack earlier this year, which included a complete re-architecture of all the core components, and we are making further improvements there. Have you had a chance to check out and move to the new stack?