I am new to both Ray and Singularity, but I have managed to build a Singularity container (in fact, an Apptainer), within which I am trying to run the PyTorch Lightning example from here: ray.train.lightning.LightningTrainer — Ray 2.4.0
However, I get the following error:
```
$ singularity exec --bind /tmp/:/tmp lumi_rasmus_ray.sif python raytest1.py
Traceback (most recent call last):
  File "/opt/conda/envs/conda_container_env/lib/python3.9/site-packages/ray/_private/node.py", line 292, in __init__
    ray._private.services.wait_for_node(
  File "/opt/conda/envs/conda_container_env/lib/python3.9/site-packages/ray/_private/services.py", line 460, in wait_for_node
    raise TimeoutError(
TimeoutError: Timed out after 30 seconds while waiting for node to startup. Did not find socket name /tmp/ray/session_2023-05-04_23-52-37_148997_237020/sockets/plasma_store in the list of object store socket names.
```
How do I need to change the Singularity call so that Ray can run inside the container?
I would like to bump this: we have a new on-prem HPC system that uses Singularity, and it would be cost-effective for me to migrate my code to it.
However, I am not confident that a Ray implementation will work in general, so…
opened 07:43PM - 14 Apr 21 UTC
closed 07:20PM - 21 Apr 21 UTC
Question
cannot reproduce
```
$ singularity --version
singularity version 3.5.3
```
### Expected behavior
After building a SIF off of a Docker NGC base, the CUDA driver files should remain intact, as they were in the Docker image.
### Actual behavior
The CUDA driver files are 0 bytes in the Singularity image:
```
# Docker
$ srun --pty docker run -it --rm native_horovod_ray2 bash -c 'ls -al /usr/lib/x86_64-linux-gnu/libcuda.so*'
lrwxrwxrwx 1 root root 12 Apr 14 17:37 /usr/lib/x86_64-linux-gnu/libcuda.so -> libcuda.so.1
lrwxrwxrwx 1 root root 20 Apr 14 17:37 /usr/lib/x86_64-linux-gnu/libcuda.so.1 -> libcuda.so.455.45.01
-rw-r--r-- 1 root root 16331696 Dec 27 19:02 /usr/lib/x86_64-linux-gnu/libcuda.so.418.181.07
-rwxr-xr-x 1 root root 21070200 Nov 5 23:13 /usr/lib/x86_64-linux-gnu/libcuda.so.455.45.01
# Singularity
$ singularity run native_horray2.sif bash -c 'ls -alL /usr/lib/x86_64-linux-gnu/libcuda.so*'
-rwxr-xr-x 1 root root 0 Apr 14 17:37 /usr/lib/x86_64-linux-gnu/libcuda.so
-rwxr-xr-x 1 root root 0 Apr 14 17:37 /usr/lib/x86_64-linux-gnu/libcuda.so.1
-rw-r--r-- 1 root root 0 Apr 14 17:37 /usr/lib/x86_64-linux-gnu/libcuda.so.418.181.07
-rwxr-xr-x 1 root root 0 Apr 14 17:37 /usr/lib/x86_64-linux-gnu/libcuda.so.455.45.01
```
### Steps to reproduce this behavior
```
cat Dockerfile
FROM nvidia/cuda:10.1-cudnn7-devel-ubuntu18.04
CMD nvidia-smi
# TensorFlow version is tightly coupled to CUDA and cuDNN so it should be selected carefully
ENV TENSORFLOW_VERSION=2.3.2
ENV PYTORCH_VERSION=1.6.0
ENV TORCHVISION_VERSION=0.7.0
ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
ENV NCCL_VERSION=2.7.8-1+cuda10.1
ENV MXNET_VERSION=1.6.0.post0
ENV PYSPARK_PACKAGE=pyspark==2.4.7
ENV SPARK_PACKAGE=spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
# Python 3.7 is supported by Ubuntu Bionic out of the box
ARG python=3.7
ENV PYTHON_VERSION=${python}
# Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"]
RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-7 \
git \
curl \
vim \
wget \
ca-certificates \
libcudnn7=${CUDNN_VERSION} \
libnccl2=${NCCL_VERSION} \
libnccl-dev=${NCCL_VERSION} \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers
RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
# Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing packaging
RUN pip install tensorflow==${TENSORFLOW_VERSION} \
keras \
h5py
RUN PYTAGS=$(python -c "from packaging import tags; tag = list(tags.sys_tags())[0]; print(f'{tag.interpreter}-{tag.abi}')") && \
pip install https://download.pytorch.org/whl/cu101/torch-${PYTORCH_VERSION}%2Bcu101-${PYTAGS}-linux_x86_64.whl \
https://download.pytorch.org/whl/cu101/torchvision-${TORCHVISION_VERSION}%2Bcu101-${PYTAGS}-linux_x86_64.whl
RUN pip install mxnet-cu101==${MXNET_VERSION}
# Install Spark stand-alone cluster.
RUN wget --progress=dot:giga https://archive.apache.org/dist/spark/${SPARK_PACKAGE} -O - | tar -xzC /tmp; \
archive=$(basename "${SPARK_PACKAGE}") bash -c "mv -v /tmp/\${archive/%.tgz/} /spark"
# Install PySpark.
RUN apt-get update -qq && apt install -y openjdk-8-jdk-headless
RUN pip install ${PYSPARK_PACKAGE}
# Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi
RUN pip install ray
# Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod[all-frameworks] && \
ldconfig
# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd
# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
# Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn
WORKDIR "/examples"
```
```
docker build . --rm=false --tag native_horovod_ray2
docker save -o native_horray2.tar native_horovod_ray2
singularity build --sandbox native_horray2 docker-archive://native_horray2.tar
singularity build native_horray2.sif native_horray2/
docker run -it --rm native_horovod_ray2 bash -c 'ls -al /usr/lib/x86_64-linux-gnu/libcuda.so*'
singularity run native_horray2.sif bash -c 'ls -alL /usr/lib/x86_64-linux-gnu/libcuda.so*'
```
### What OS/distro are you running
RHEL 7.9
### How did you install Singularity
From source using ansible, https://github.com/abims-sbr/ansible-singularity/blob/master/tasks/main.yml
I set up a SLURM job to do distributed training using Ray running inside Singularity containers. The SLURM command launches a process that runs the usual Ray commands to set up RayExecutors. That launcher process sees the GPUs without issue; however, the RayExecutor worker processes cannot see the GPUs. It is detailed extensively in this issue opened against Singularity:
I closed that issue because it does not look like it's a Singularity problem. It appears that the worker p…
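For anyone debugging the same symptom, a small check of what each process actually sees can narrow it down. This is a sketch under assumptions: Ray is installed in the container, and the helper name `visible_gpus` is mine. The driver and a GPU-reserving remote task should report consistent, non-empty device lists:

```python
# Sketch: report which GPUs a process sees via CUDA_VISIBLE_DEVICES.
import os

def visible_gpus():
    """Return the CUDA_VISIBLE_DEVICES entries seen by this process."""
    raw = os.environ.get("CUDA_VISIBLE_DEVICES", "")
    return [g for g in raw.split(",") if g]

# Inside the container (assumes Ray is installed):
#   import ray
#   ray.init()
#   print("driver sees:", visible_gpus())
#   check = ray.remote(num_gpus=1)(visible_gpus)  # Ray sets the env var for the task
#   print("worker sees:", ray.get(check.remote()))
```

If the driver prints devices but the worker prints an empty list, the problem is in how the environment is propagated to the worker processes, not in the container's GPU setup itself.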
Actually, I have good news then, sort of. It turned out that my first problems were caused by my testing on the front end rather than on a compute node. When I run simple scripts with Ray on the nodes, including inside a Singularity container, things seem to work. I have not had time to run the full Lightning example yet, unfortunately.
That’s great to hear!
I’ll cross my fingers that your full project is able to deploy properly.
We were just okayed to deploy our project, so things will be interesting over here too.