I set up a SLURM job to do distributed training with Ray running inside Singularity containers. The SLURM command launches a process that runs the typical Ray commands to set up RayExecutors. This process sees the GPUs without issue; however, the RayExecutor worker processes cannot see the GPUs. The problem is detailed extensively in this issue opened against Singularity:
I closed that issue because it does not look like a Singularity problem. It appears that the worker processes created from the “main” process have different environments (environment variables and mounts), and this affects the drivers that TensorFlow needs in order to see the GPUs.
So is there any way for Ray to launch the RayExecutors with the proper environment so that the GPUs work? And if not, should I file this as a bug or a feature request?
Hi Richard. The SLURM commands look like this (I was using srun for debugging purposes):
# HEAD
srun --nodes=1 --ntasks=1 -w server1 --cpus-per-task=5 singularity run ~/horovodDocker/native_horray.sif ray start --head --node-ip-address=30.30.30.30 --port=6379 --redis-password=supersecret --num-cpus 5 --num-gpus 0 --include-dashboard False --block &
# Training script
srun --nodes=1 --ntasks=1 singularity run ~/horovodDocker/native_horray.sif python horray_mnist.py --address 30.30.30.30:6379 --redis_password supersecret
My startup code for the RayExecutor is below. You can see I tried to pass through what I thought were the most-needed environment variables, but no luck. There are quite a few differences in the environment variables between the main process and the launched worker. And note that, per Singularity Built From NGC Base Yields "unable to find libcuda.so.1" · Issue #5935 · hpcng/singularity · GitHub, for some reason the worker mounts tmpfs at /.singularity.d/actions instead of /.singularity.d/libs, which might be the real problem. I have no idea why that would happen.
Anything that could make the env vars similar and provide the same mounts might solve it. Unfortunately I don’t know Linux well enough to know what’s possible.
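To see the differences concretely, I have been diffing the two environments with a small diagnostic like the one below. It is just a minimal sketch (dump_worker_env is a name I picked), and it assumes ray.init has already connected to the cluster:

import os
import ray

@ray.remote
def dump_worker_env():
    # Capture the worker's environment variables and mount table
    # so they can be compared against the driver process.
    with open('/proc/mounts') as f:
        mounts = f.read()
    return dict(os.environ), mounts

worker_env, worker_mounts = ray.get(dump_worker_env.remote())
driver_env = dict(os.environ)
# Print every variable that differs between driver and worker.
for key in sorted(set(driver_env) | set(worker_env)):
    if driver_env.get(key) != worker_env.get(key):
        print(key, '->', driver_env.get(key), 'vs', worker_env.get(key))
print(worker_mounts)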
import os
import ray
from horovod.ray import RayExecutor

ray.init(address=args.address, _redis_password=args.redis_password)

settings = RayExecutor.create_settings(timeout_s=30)
executor = RayExecutor(
    settings, num_hosts=1, num_slots=1, use_gpu=True, cpus_per_slot=4)
print("executor.start")

# Environment variables to pass through to the worker processes.
env_pass_throughs = [
    'CUDA_PKG_VERSION',
    'CUDA_VERSION',
    'CUDA_VISIBLE_DEVICES',
    'CUDNN_VERSION',
    'GPU_DEVICE_ORDINAL',
    'LD_LIBRARY_PATH',
]
# Guard against variables that are unset in the driver's environment.
exec_env = {k: os.environ[k] for k in env_pass_throughs if k in os.environ}

executor.start(extra_env_vars=exec_env)
# train is the Horovod training function defined elsewhere in horray_mnist.py.
executor.run(train, kwargs=dict(num_epochs=5))
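For what it’s worth, if the extra_env_vars route doesn’t pan out, a possible variant (assuming a Ray version new enough to support runtime_env with env_vars; older releases may not have it) would be to set the variables at ray.init time so every worker the cluster launches inherits them, though whether that also fixes the mount difference is unclear:

import os
import ray

# Sketch only: runtime_env with "env_vars" exists in newer Ray releases.
# It propagates these variables to every worker process, but it would not
# change the Singularity mounts themselves. args is the same argparse
# namespace used above.
passthrough = {
    k: os.environ[k]
    for k in ('CUDA_VISIBLE_DEVICES', 'LD_LIBRARY_PATH')
    if k in os.environ
}
ray.init(
    address=args.address,
    _redis_password=args.redis_password,
    runtime_env={'env_vars': passthrough},
)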
Thanks, Clark