Ray Tune trials stuck in PENDING in a Docker container when using a GPU on the MNIST example

In the Ray repo, when running:

python ray/python/ray/tune/examples/mnist_pytorch.py --cuda

the trials get stuck in PENDING. I am running inside a Docker container launched via:

docker run -it --network=host -v /tmp:/tmp --gpus=all rayproject/ray-ml:latest-gpu

However, there is no error indicating that a GPU cannot be found. Note that I am running under WSL 2 (with GPU support fully working), which I verified by running the MNIST example from the pytorch/examples repository in the same container.
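For a quick standalone check inside the same container, something like the following (a minimal sketch, assuming PyTorch is installed in the rayproject/ray-ml:latest-gpu image) should print True and the GPU name:

import torch

# Sanity check: confirms that CUDA and the GPU are visible to PyTorch inside the container.
print(torch.cuda.is_available())
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))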

Any thoughts on this? I've run into similar issues with Docker, Ray, and GPUs in both Tune and RLlib use cases.

Hey @Christopher_Fusting thanks a bunch for posting this!

What’s the output where it gets stuck on pending?

Just continual updates that the trials are pending. To reproduce, navigate to the following directory within the Ray repo:

python/ray/tune/examples

Then run the example with CUDA:

docker run --gpus=all --rm -w /sheep -v `pwd`:/sheep rayproject/ray-ml:latest-gpu python mnist_pytorch.py --cuda
2021-06-24 14:25:52,870 INFO services.py:1274 -- View the Ray dashboard at http://127.0.0.1:8265
2021-06-24 14:25:52,872 WARNING services.py:1740 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This will harm performance! You may be able to free up space by deleting files in /dev/shm. If you are inside a Docker container, you can increase /dev/shm size by passing '--shm-size=2.46gb' to 'docker run' (or add it to the run_options list in a Ray cluster config). Make sure to set this to more than 30% of available RAM.
2021-06-24 14:25:54,907 WARNING function_runner.py:545 -- Function checkpointing is disabled. This may result in unexpected behavior when using checkpointing features or certain schedulers. To enable, set the train function arguments to be `func(config, checkpoint_dir=None)`.
== Status ==
Memory usage on this node: 2.9/7.8 GiB
Using AsyncHyperBand: num_stopped=0
Bracket: Iter 64.000: None | Iter 16.000: None | Iter 4.000: None | Iter 1.000: None
Resources requested: 0/8 CPUs, 0/0 GPUs, 0.0/4.46 GiB heap, 0.0/2.23 GiB objects
Result logdir: /home/ray/ray_results/exp
Number of trials: 16/50 (16 PENDING)
+-------------------------+----------+-------+-------------+------------+
| Trial name              | status   | loc   |          lr |   momentum |
|-------------------------+----------+-------+-------------+------------|
| train_mnist_c2440_00000 | PENDING  |       | 0.000437896 |   0.226685 |
| train_mnist_c2440_00001 | PENDING  |       | 0.00175204  |   0.522983 |
| train_mnist_c2440_00002 | PENDING  |       | 0.00748296  |   0.599304 |
| train_mnist_c2440_00003 | PENDING  |       | 0.000658517 |   0.250076 |
| train_mnist_c2440_00004 | PENDING  |       | 0.00161277  |   0.783046 |
| train_mnist_c2440_00005 | PENDING  |       | 0.000329033 |   0.674434 |
| train_mnist_c2440_00006 | PENDING  |       | 0.000101433 |   0.263296 |
| train_mnist_c2440_00007 | PENDING  |       | 0.00271429  |   0.463791 |
| train_mnist_c2440_00008 | PENDING  |       | 0.00214337  |   0.772585 |
| train_mnist_c2440_00009 | PENDING  |       | 0.00108123  |   0.734409 |
| train_mnist_c2440_00010 | PENDING  |       | 0.000201707 |   0.446361 |
| train_mnist_c2440_00011 | PENDING  |       | 0.000110724 |   0.884472 |
| train_mnist_c2440_00012 | PENDING  |       | 0.00892235  |   0.621163 |
| train_mnist_c2440_00013 | PENDING  |       | 0.00318506  |   0.357504 |
| train_mnist_c2440_00014 | PENDING  |       | 0.000174868 |   0.345737 |
| train_mnist_c2440_00015 | PENDING  |       | 0.00733102  |   0.160049 |
+-------------------------+----------+-------+-------------+------------+
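Note the "0/0 GPUs" in the "Resources requested" line above: Ray apparently isn't detecting any GPUs at all. One quick way to confirm what Ray autodetects from inside the same container (a minimal check, run in a fresh Python session) is:

import ray

ray.init()
# If autodetection worked, the printed dict should contain a "GPU" entry with a count of at least 1.
print(ray.cluster_resources())
ray.shutdown()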


To verify that the GPU is working, run the following from the pytorch/examples repo (CUDA is enabled there by default):

docker run --gpus=all --rm -w /sheep/mnist -v `pwd`:/sheep rayproject/ray-ml:latest-gpu python main.py

Thanks for the help on this!

This morning I dug into this a bit and found that by setting:

import os
import ray
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # make GPU 0 explicitly visible
ray.init(num_gpus=1)  # declare the GPU to Ray instead of relying on autodetection

the issue was resolved (in my use case, although I suspect the behavior would be the same in the example whose output I posted above). This was after confirming, via the usual checks, that CUDA was available and the device ID was 0. It feels like there is an issue with the GPU being detected automatically, so by the time we get to tune.run() the GPU resources requested for the trials have no effect.
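For context, here is a minimal sketch of how the workaround slots into a Tune script; train_mnist below is just a stand-in for the real trainable in mnist_pytorch.py, and the GPU is requested via resources_per_trial as in the example:

import os
import ray
from ray import tune

os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # pin the visible device explicitly
ray.init(num_gpus=1)  # declare the GPU instead of relying on autodetection


def train_mnist(config):
    # Stand-in for the real training loop in mnist_pytorch.py.
    tune.report(mean_accuracy=0.0)


tune.run(
    train_mnist,
    config={"lr": tune.loguniform(1e-4, 1e-2), "momentum": tune.uniform(0.1, 0.9)},
    resources_per_trial={"cpu": 2, "gpu": 1},  # with the GPU declared, trials actually get scheduled
)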

Not sure if this is WSL 2 or Docker specific.

OK, got it. I think this might be related to ray-project/ray#16614 ("[Core] GPU assignment via CUDA_VISIBLE_DEVICES is broken when using placement groups"). Thanks for following up @Christopher_Fusting, this totally fell off my radar…