Running a Ray Tune example from within a Singularity container

I am new to both Ray and Singularity, but I have managed to build a Singularity container (in fact, an Apptainer one), within which I am trying to run the PyTorch Lightning example from here:

However, I get the following error:

singularity exec --bind /tmp/:/tmp lumi_rasmus_ray.sif python
Traceback (most recent call last):
  File "/opt/conda/envs/conda_container_env/lib/python3.9/site-packages/ray/_private/", line 292, in init
  File "/opt/conda/envs/conda_container_env/lib/python3.9/site-packages/ray/_private/", line 460, in wait_for_node
    raise TimeoutError(
TimeoutError: Timed out after 30 seconds while waiting for node to startup. Did not find socket name /tmp/ray/session_2023-05-04_23-52-37_148997_237020/sockets/plasma_store in the list of object store socket names.

How do I need to change the singularity call so that Ray can run inside the container?

I would like to bump this. We have a new on-prem HPC cluster that uses Singularity, and it would be cost-effective for me to migrate my code to it.

However, I am not confident that a Ray implementation will work in general, so…

Actually, I have good news then :slight_smile: sort of. It turned out that my first problems were caused by testing on the frontend (login) node rather than on a compute node. When I run simple scripts with Ray on the compute nodes, including inside a Singularity container, things seem to work. Unfortunately, I have not had time to run the full Lightning example yet.
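
For anyone who wants to compare the frontend with a compute node, a small pre-flight check along these lines might help. It is only a sketch using the Python standard library: it does not start Ray or reproduce Ray's own startup logic. It just tests whether the current user can create a Unix domain socket under the bound /tmp, which is the kind of file the timeout message above says was missing. The function name check_socket_dir is made up for this example and is not part of Ray or Singularity.

```python
import os
import socket
import tempfile


def check_socket_dir(base_dir="/tmp"):
    """Try to create a Unix domain socket under base_dir.

    Hypothetical helper for illustration only; returns (ok, message).
    """
    if not os.path.isdir(base_dir):
        return False, f"{base_dir} does not exist or is not a directory"
    if not os.access(base_dir, os.W_OK | os.X_OK):
        return False, f"{base_dir} is not writable by the current user"
    try:
        # The scratch directory (and the socket file in it) is removed
        # automatically when the with-block exits.
        with tempfile.TemporaryDirectory(dir=base_dir) as scratch:
            path = os.path.join(scratch, "probe.sock")
            with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
                sock.bind(path)  # creates the socket file on disk
    except OSError as exc:
        return False, f"could not create a Unix socket under {base_dir}: {exc}"
    return True, f"created and removed a Unix socket under {base_dir}"


if __name__ == "__main__":
    ok, message = check_socket_dir("/tmp")
    print(socket.gethostname(), "OK" if ok else "FAILED", message)
```

Saved under a name of your choosing and run with the same singularity exec --bind /tmp/:/tmp call, once on the frontend and once on a compute node, it would show whether the two environments differ in this one respect. A passing check does not guarantee that Ray itself will start.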

That’s great to hear!

I’ll cross my fingers that your full project is able to deploy properly.

We were just given the OK to deploy our project, so it will be interesting over here.

Would you be willing to share your solution once you are comfortable with it? I have a similar need for my current project using Slurm and Apptainer to run a Ray container and train a model. Any help would be greatly appreciated.

Hi Ryan
Unfortunately, I am now stuck on a new issue:

I can let you know if I manage to solve that too :slight_smile: