I am new to both Ray and Singularity, but I have managed to build a Singularity container (in fact, an Apptainer), within which I am trying to run the PyTorch Lightning example from here: ray.train.lightning.LightningTrainer — Ray 2.4.0
However, I get the following error:
```
$ singularity exec --bind /tmp/:/tmp lumi_rasmus_ray.sif python raytest1.py
Traceback (most recent call last):
  File "/opt/conda/envs/conda_container_env/lib/python3.9/site-packages/ray/_private/node.py", line 292, in __init__
    ray._private.services.wait_for_node(
  File "/opt/conda/envs/conda_container_env/lib/python3.9/site-packages/ray/_private/services.py", line 460, in wait_for_node
    raise TimeoutError(
TimeoutError: Timed out after 30 seconds while waiting for node to startup. Did not find socket name /tmp/ray/session_2023-05-04_23-52-37_148997_237020/sockets/plasma_store in the list of object store socket names.
```
How do I need to change the Singularity call so that Ray can run inside the container?
I would like to bump this: we have a new on-prem HPC system that uses Singularity, and it would be cost-effective for me to migrate my code to it.
However, I am not confident that a Ray implementation will work in general, so…
opened 07:43PM - 14 Apr 21 UTC
closed 07:20PM - 21 Apr 21 UTC
Question
cannot reproduce
```
$ singularity --version
singularity version 3.5.3
```
### Expected behavior
After building a SIF off of a Docker NGC base, the CUDA driver files should remain intact, as they were in the Docker image.
### Actual behavior
The CUDA driver files are 0 bytes in the Singularity image:
```
# Docker
$ srun --pty docker run -it --rm native_horovod_ray2 bash -c 'ls -al /usr/lib/x86_64-linux-gnu/libcuda.so*'
lrwxrwxrwx 1 root root 12 Apr 14 17:37 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 20 Apr 14 17:37 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.455.45.01
-rw-r--r-- 1 root root 16331696 Dec 27 19:02 /usr/lib/x86_64-linux-gnu/libcuda.so.418.181.07
-rwxr-xr-x 1 root root 21070200 Nov 5 23:13 /usr/lib/x86_64-linux-gnu/libcuda.so.455.45.01
# Singularity
$ singularity run native_horray2.sif bash -c 'ls -alL /usr/lib/x86_64-linux-gnu/libcuda.so*'
-rwxr-xr-x 1 root root 0 Apr 14 17:37 /usr/lib/x86_64-linux-gnu/libcuda.so
-rwxr-xr-x 1 root root 0 Apr 14 17:37 /usr/lib/x86_64-linux-gnu/libcuda.so.1
-rw-r--r-- 1 root root 0 Apr 14 17:37 /usr/lib/x86_64-linux-gnu/libcuda.so.418.181.07
-rwxr-xr-x 1 root root 0 Apr 14 17:37 /usr/lib/x86_64-linux-gnu/libcuda.so.455.45.01
```
### Steps to reproduce this behavior
```
cat Dockerfile
FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
CMD nvidia-smi
# TensorFlow version is tightly coupled to CUDA and cuDNN so it should be selected carefully
ENV TENSORFLOW_VERSION=2.3.2
ENV PYTORCH_VERSION=1.6.0
ENV TORCHVISION_VERSION=0.7.0
ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
ENV NCCL_VERSION=2.7.8-1+cuda10.1
ENV MXNET_VERSION=1.6.0.post0
ENV PYSPARK_PACKAGE=pyspark==2.4.7
ENV SPARK_PACKAGE=spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
# Python 3.7 is supported by Ubuntu Bionic out of the box
ARG python=3.7
ENV PYTHON_VERSION=${python}
# Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"]
RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-7 \
git \
curl \
vim \
wget \
ca-certificates \
libcudnn7=${CUDNN_VERSION} \
libnccl2=${NCCL_VERSION} \
libnccl-dev=${NCCL_VERSION} \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers
RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
# Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing packaging
RUN pip install tensorflow==${TENSORFLOW_VERSION} \
keras \
h5py
RUN PYTAGS=$(python -c "from packaging import tags; tag = list(tags.sys_tags())[0]; print(f'{tag.interpreter}-{tag.abi}')") && \
pip install https://download.pytorch.org/whl/cu101/torch-${PYTORCH_VERSION}%2Bcu101-${PYTAGS}-linux_x86_64.whl \
https://download.pytorch.org/whl/cu101/torchvision-${TORCHVISION_VERSION}%2Bcu101-${PYTAGS}-linux_x86_64.whl
RUN pip install mxnet-cu101==${MXNET_VERSION}
# Install Spark stand-alone cluster.
RUN wget --progress=dot:giga https://archive.apache.org/dist/spark/${SPARK_PACKAGE} -O - | tar -xzC /tmp; \
archive=$(basename "${SPARK_PACKAGE}") bash -c "mv -v /tmp/\${archive/%.tgz/} /spark"
# Install PySpark.
RUN apt-get update -qq && apt install -y openjdk-8-jdk-headless
RUN pip install ${PYSPARK_PACKAGE}
# Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi
RUN pip install ray
# Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod[all-frameworks] && \
ldconfig
# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd
# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
# Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn
WORKDIR "/examples"
```
```
docker build . --rm=false --tag native_horovod_ray2
docker save -o native_horray2.tar native_horovod_ray2
singularity build --sandbox native_horray2 docker-archive://native_horray2.tar
singularity build native_horray2.sif native_horray2/
docker run -it --rm native_horovod_ray2 bash -c 'ls -al /usr/lib/x86_64-linux-gnu/libcuda.so*'
singularity run native_horray2.sif bash -c 'ls -alL /usr/lib/x86_64-linux-gnu/libcuda.so*'
```
### What OS/distro are you running
RHEL 7.9
### How did you install Singularity
From source using ansible, https://github.com/abims-sbr/ansible-singularity/blob/master/tasks/main.yml
I set up a SLURM job to do distributed training using Ray running inside Singularity containers. The SLURM command launches a process that runs the usual Ray commands to set up RayExecutors. That launcher process sees the GPUs without issue; however, the RayExecutor worker processes cannot see the GPUs. It is detailed extensively in this issue opened against Singularity:
I closed that issue because it does not look like it's a Singularity problem. It appears that the worker p…
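For anyone debugging the same symptom, a small check of what each process actually sees can narrow it down. This is a sketch under assumptions: Ray is installed in the container, and the helper name `visible_gpus` is mine. The driver and a GPU-reserving remote task should report consistent, non-empty device lists:

```python
# Sketch: report which GPUs a process sees via CUDA_VISIBLE_DEVICES.
import os

def visible_gpus():
    """Return the CUDA_VISIBLE_DEVICES entries seen by this process."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [g for g in raw.split(",") if g]

# Inside the container (assumes Ray is installed):
#   import ray
#   ray.init()
#   print("driver sees:", visible_gpus())
#   check = ray.remote(num_gpus=1)(visible_gpus)  # Ray sets the env var for the task
#   print("worker sees:", ray.get(check.remote()))
```

If the driver prints devices but the worker prints an empty list, the problem is in how the environment is propagated to the worker processes, not in the container's GPU setup itself.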
Actually, I have good news then, sort of. It turned out that my first problems were caused by my testing on the front end rather than on a compute node. When I run simple scripts with Ray on the nodes, including inside a Singularity container, things seem to work. I have not had time to run the full Lightning example yet, unfortunately.
That’s great to hear!
I’ll cross my fingers that your full project is able to deploy properly.
We were just okayed to deploy our project, so things will be interesting over here too.