RayExecutors Launched with Singularity Don't See GPUs

I set up a SLURM job to do distributed training with Ray running inside Singularity containers. The SLURM command launches a process that runs the usual Ray calls to set up RayExecutors. That process sees the GPUs without issue; however, the RayExecutor worker processes cannot see the GPUs. It is detailed extensively in this issue opened against Singularity:

I closed that issue because it does not look like it’s a Singularity problem. It appears that the worker processes created from the “main” process have different environments (environment variables and mounts), and this affects the drivers that TensorFlow needs in order to see the GPUs.
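
To make that concrete, here is roughly the kind of comparison that shows the problem. This is a stripped-down sketch rather than my actual script (the address and password are just the ones for my cluster); it diffs the env vars and mount table between the driver process and a plain Ray task:

    import os
    import ray

    def snapshot():
        # Capture the environment variables and mount table of the current process.
        with open("/proc/mounts") as f:
            mounts = set(f.read().splitlines())
        return dict(os.environ), mounts

    ray.init(address="30.30.30.30:6379", _redis_password="supersecret")

    driver_env, driver_mounts = snapshot()
    worker_env, worker_mounts = ray.get(ray.remote(snapshot).remote())

    # Env vars that are missing or different in the Ray worker process.
    for k in sorted(driver_env):
        if worker_env.get(k) != driver_env[k]:
            print("env differs:", k)

    # Mount entries that only one of the two processes has (this is where the
    # Singularity bind mounts show up differently for me).
    for line in sorted(driver_mounts ^ worker_mounts):
        print("mount differs:", line)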

So is there any way for Ray to launch the RayExecutors with the proper environment so that the GPUs work? And if not, should I file this as a bug or a feature request?

Thanks, Clark

Hey Clark,

Thanks for following up. Do you know specifically what the worker env vars are lacking?

Also, can you remind me how you are starting your Ray nodes?

Hi Richard. The SLURM commands look like this (I was using srun for debugging purposes):

# HEAD
srun --nodes=1 --ntasks=1 -w server1 --cpus-per-task=5 singularity run ~/horovodDocker/native_horray.sif ray start --head --node-ip-address=30.30.30.30 --port=6379 --redis-password=supersecret --num-cpus 5 --num-gpus 0 --include-dashboard False --block &

# Training script
srun --nodes=1 --ntasks=1 singularity run ~/horovodDocker/native_horray.sif python horray_mnist.py --address 30.30.30.30:6379 --redis_password supersecret
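
For completeness, the --address and --redis_password flags are just read with argparse at the top of horray_mnist.py; this is a rough sketch of that part of the script (the exact help text is not important, it only shows where the args.* values used below come from):

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument("--address", help="host:port of the Ray head node")
    parser.add_argument("--redis_password", help="password passed to `ray start`")
    args = parser.parse_args()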

My startup code for the RayExecutor is below. You can see I tried to pass through what I thought were the most needed environment vars, but no luck. There are quite a few differences in the environment vars between the main process and the launched worker. And note that, per Singularity Built From NGC Base Yields "unable to find libcuda.so.1" · Issue #5935 · hpcng/singularity · GitHub, for some reason the worker mounts tmpfs at /.singularity.d/actions instead of /.singularity.d/libs, which might be the real problem. I have no idea why that would happen.

Anything that could make the env vars similar and provide the same mounts might solve it. Unfortunately, I don’t know Linux well enough to say what’s possible.


    import os

    import ray
    from horovod.ray import RayExecutor

    # `args` comes from the argparse flags shown above; `train` is the
    # Horovod training function defined elsewhere in the script.
    ray.init(address=args.address, _redis_password=args.redis_password)

    settings = RayExecutor.create_settings(timeout_s=30)
    executor = RayExecutor(
        settings, num_hosts=1, num_slots=1, use_gpu=True, cpus_per_slot=4)

    print("executor.start")

    # Environment variables I thought the workers would need to see the GPUs.
    envPassThroughs = [
        'CUDA_PKG_VERSION',
        'CUDA_VERSION',
        'CUDA_VISIBLE_DEVICES',
        'CUDNN_VERSION',
        'GPU_DEVICE_ORDINAL',
        'LD_LIBRARY_PATH',
    ]

    # Copy them from the driver's environment into the worker processes.
    exec_env = {k: os.environ[k] for k in envPassThroughs}
    executor.start(extra_env_vars=exec_env)

    executor.run(train, kwargs=dict(num_epochs=5))
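
For what it’s worth, one way to see what the Horovod workers actually end up with is to run a small diagnostic through the same executor. This is just a sketch (it assumes the executor has been started as above) and not something that’s in my script:

    def report_worker_env():
        # Runs inside each Horovod/Ray worker: print the env vars that matter
        # and check whether the CUDA driver library can actually be loaded.
        import ctypes
        import os
        for k in ("CUDA_VISIBLE_DEVICES", "LD_LIBRARY_PATH"):
            print(k, "=", os.environ.get(k))
        try:
            ctypes.CDLL("libcuda.so.1")
            print("libcuda.so.1: loaded OK")
        except OSError as err:
            print("libcuda.so.1: NOT found:", err)

    executor.run(report_worker_env)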

Thanks, Clark

Yeah, that’s unfortunate. In Ray, workers are spawned using this command.

Unfortunately, I’m not sure why subprocesses don’t get captured correctly with Singularity.