Running a Ray Tune example from within a Singularity container

I am new to both Ray and Singularity, but I have managed to build a Singularity container (in fact, an Apptainer one), in which I am trying to run the PyTorch Lightning example from here: ray.train.lightning.LightningTrainer — Ray 2.4.0

However, I get the following error:

singularity exec --bind /tmp/:/tmp lumi_rasmus_ray.sif python raytest1.py
Traceback (most recent call last):
  File "/opt/conda/envs/conda_container_env/lib/python3.9/site-packages/ray/_private/node.py", line 292, in __init__
    ray._private.services.wait_for_node(
  File "/opt/conda/envs/conda_container_env/lib/python3.9/site-packages/ray/_private/services.py", line 460, in wait_for_node
    raise TimeoutError(
TimeoutError: Timed out after 30 seconds while waiting for node to startup. Did not find socket name /tmp/ray/session_2023-05-04_23-52-37_148997_237020/sockets/plasma_store in the list of object store socket names.

How should I change the singularity call so that Ray can run inside the container?
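The timeout above means Ray never saw the plasma_store socket appear under /tmp/ray. One cheap thing to rule out is whether the /tmp directory bound into the container can host a Unix domain socket at all. This is only a guess at a possible cause, not a confirmed diagnosis. The sketch below uses only the Python standard library. The function name `check_unix_socket_dir` and the constant `MAX_UNIX_SOCKET_PATH` are illustrative, and the 107-byte limit is the usual Linux value, assumed here rather than taken from this thread.

```python
import os
import socket
import tempfile

# Usual Linux limit for an AF_UNIX socket path (assumption; other systems differ).
MAX_UNIX_SOCKET_PATH = 107


def check_unix_socket_dir(base_dir):
    """Try to bind a throwaway Unix domain socket under base_dir.

    Returns (ok, message), where ok is True if the bind succeeded.
    """
    try:
        # The temporary directory and the socket file are removed on exit.
        with tempfile.TemporaryDirectory(dir=base_dir) as tmp:
            path = os.path.join(tmp, "probe.sock")
            path_len = len(os.fsencode(path))
            if path_len > MAX_UNIX_SOCKET_PATH:
                return False, f"socket path too long ({path_len} bytes): {path}"
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
                sock.bind(path)
            return True, f"bound a test socket at {path}"
    except OSError as exc:
        # Covers a missing directory, missing permissions, or a filesystem
        # that does not support sockets.
        return False, f"{type(exc).__name__}: {exc}"
```

Calling `print(check_unix_socket_dir("/tmp"))` from a script run with the same `singularity exec --bind /tmp/:/tmp ...` command shows whether the bound directory is usable for sockets from inside the container.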

I would like to bump this. We have a new on-prem HPC cluster that uses Singularity, and it would be cost-effective for me to migrate my code to it.

However, I am not confident that a Ray implementation will work in general, so…

Actually, I have good news then :slight_smile: Sort of. It turned out that my first problems were caused by testing on the frontend rather than on a compute node. When I run simple scripts with Ray on the nodes, including inside a Singularity container, things seem to work. Unfortunately, I have not yet had time to run the full Lightning example.
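A simple script of the kind mentioned above might look like the following sketch. It is not the script actually used in this thread. The helper names `ray_init_kwargs` and `run_smoke_test` are hypothetical, while `num_cpus`, `_temp_dir` and `include_dashboard` are existing `ray.init` arguments. It assumes Ray is installed in the container's Python environment.

```python
def ray_init_kwargs(temp_dir=None, num_cpus=2):
    """Build keyword arguments for ray.init (hypothetical helper)."""
    kwargs = {"num_cpus": num_cpus, "include_dashboard": False}
    if temp_dir is not None:
        # _temp_dir moves Ray's session directory (sockets, logs)
        # away from the default location under /tmp/ray.
        kwargs["_temp_dir"] = temp_dir
    return kwargs


def run_smoke_test(temp_dir=None, num_cpus=2):
    """Start a local Ray instance, run a few trivial tasks, return the results."""
    # Imported here so that this file can still be loaded where Ray is missing.
    import ray

    ray.init(**ray_init_kwargs(temp_dir, num_cpus))
    try:
        @ray.remote
        def square(x):
            return x * x

        # Submit four tasks and wait for all of them to finish.
        return ray.get([square.remote(i) for i in range(4)])
    finally:
        ray.shutdown()
```

Adding `print(run_smoke_test())` at the end of the file and running it on a node with the same `singularity exec ... python` call gives a quick check that Ray starts inside the container. If the default /tmp is a problem, passing a short writable path as `temp_dir` is one thing to try.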

That’s great to hear!

I’ll cross my fingers that your full project is able to deploy properly.

We were just given the OK to deploy our project, so it will be interesting over here.