I set up a SLURM job to do distributed training with Ray running inside Singularity containers. The SLURM command launches a process that runs the typical Ray commands to set up RayExecutors. This process sees the GPUs without issue; however, the RayExecutor worker processes cannot see the GPUs. The problem is detailed extensively in this issue opened against Singularity:
I closed that issue because it does not look like a Singularity problem. It appears that the worker processes created from the “main” process have different environments (environment variables and mounts), and this affects the drivers that TensorFlow needs in order to see the GPUs.
So is there any way for Ray to launch the RayExecutors with the proper environment so that the GPUs work? And if not, should I file this as a bug or a feature request?
Hi Richard. The SLURM commands look like this (I was using srun for debugging purposes):
# HEAD
srun --nodes=1 --ntasks=1 -w server1 --cpus-per-task=5 singularity run ~/horovodDocker/native_horray.sif ray start --head --node-ip-address=30.30.30.30 --port=6379 --redis-password=supersecret --num-cpus 5 --num-gpus 0 --include-dashboard False --block &
# Training script
srun --nodes=1 --ntasks=1 singularity run ~/horovodDocker/native_horray.sif python horray_mnist.py --address 30.30.30.30:6379 --redis_password supersecret
My startup code for the RayExecutor is below. You can see I tried to pass through what I thought were the most-needed environment variables, but no luck. There are quite a few differences in the environment variables between the main process and the launched worker. And note that, per Singularity Built From NGC Base Yields "unable to find libcuda.so.1" · Issue #5935 · hpcng/singularity · GitHub, for some reason the worker mounts tmpfs at /.singularity.d/actions instead of /.singularity.d/libs, which might be the real problem. I have no idea why that would happen.
Anything that could make the env vars similar and provide the same mounts might solve it. Unfortunately I don’t know Linux well enough to know what’s possible.
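To see the differences concretely, I have been diffing the two environments with a small diagnostic like the one below. It is just a minimal sketch (dump_worker_env is a name I picked), and it assumes ray.init has already connected to the cluster:

import os
import ray

@ray.remote
def dump_worker_env():
    # Capture the worker's environment variables and mount table
    # so they can be compared against the driver process.
    with open('/proc/mounts') as f:
        mounts = f.read()
    return dict(os.environ), mounts

worker_env, worker_mounts = ray.get(dump_worker_env.remote())
driver_env = dict(os.environ)
# Print every variable that differs between driver and worker.
for key in sorted(set(driver_env) | set(worker_env)):
    if driver_env.get(key) != worker_env.get(key):
        print(key, '->', driver_env.get(key), 'vs', worker_env.get(key))
print(worker_mounts)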
import os
import ray
from horovod.ray import RayExecutor

ray.init(address=args.address, _redis_password=args.redis_password)

settings = RayExecutor.create_settings(timeout_s=30)
executor = RayExecutor(
    settings, num_hosts=1, num_slots=1, use_gpu=True, cpus_per_slot=4)
print("executor.start")

# Environment variables to pass through to the worker processes.
env_pass_throughs = [
    'CUDA_PKG_VERSION',
    'CUDA_VERSION',
    'CUDA_VISIBLE_DEVICES',
    'CUDNN_VERSION',
    'GPU_DEVICE_ORDINAL',
    'LD_LIBRARY_PATH',
]
# Guard against variables that are unset in the driver's environment.
exec_env = {k: os.environ[k] for k in env_pass_throughs if k in os.environ}

executor.start(extra_env_vars=exec_env)
# train is the Horovod training function defined elsewhere in horray_mnist.py.
executor.run(train, kwargs=dict(num_epochs=5))
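For what it’s worth, if the extra_env_vars route doesn’t pan out, a possible variant (assuming a Ray version new enough to support runtime_env with env_vars; older releases may not have it) would be to set the variables at ray.init time so every worker the cluster launches inherits them, though whether that also fixes the mount difference is unclear:

import os
import ray

# Sketch only: runtime_env with "env_vars" exists in newer Ray releases.
# It propagates these variables to every worker process, but it would not
# change the Singularity mounts themselves. args is the same argparse
# namespace used above.
passthrough = {
    k: os.environ[k]
    for k in ('CUDA_VISIBLE_DEVICES', 'LD_LIBRARY_PATH')
    if k in os.environ
}
ray.init(
    address=args.address,
    _redis_password=args.redis_password,
    runtime_env={'env_vars': passthrough},
)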
Thanks, Clark