How severe does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.
Hello Ray community!
A year ago I began experimenting with QMIX on RLlib to control the MATSim traffic simulator. Since then, I have purchased a 2021 MacBook 14" which has a 10-core M1 CPU and 10 GPUs. Prototyping my multiagent scenario could benefit greatly from the speedup the GPU cores offer. However, I can't seem to get Ray to recognize that the GPUs are available. I understand that, while M1 support currently exists for both Ray and PyTorch, it is experimental.
Below, the first section shows my env setup, and the second section shows the hello-world-flavored tests I ran to confirm PyTorch, RLlib, and finally, GPU utilization.
My environment
Based on reviewing these installation instructions:
# miniforge
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
zsh Miniforge3-MacOSX-arm64.sh
rm Miniforge3-MacOSX-arm64.sh
# pytorch
conda install pytorch -c pytorch-nightly
# ray
pip uninstall grpcio
conda install grpcio=1.43.0
pip install ray "ray[rllib]"
1. Confirm PyTorch sees GPUs (OK)
$ python
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 17:00:33)
[Clang 13.0.1 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.backends.mps.is_available()
True
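A slightly fuller version of the check above, as a minimal sketch: it reports both whether this PyTorch build includes the MPS backend and whether the backend is usable on this machine. The helper name `mps_report` is hypothetical, and the guarded import is only there so the snippet also runs where PyTorch is missing.

```python
import importlib.util


def mps_report():
    """Return a small dict describing PyTorch's view of the Apple MPS backend.

    `mps_report` is a hypothetical helper name, not part of PyTorch.
    """
    if importlib.util.find_spec("torch") is None:
        # PyTorch is not installed in this interpreter.
        return {"torch_installed": False, "mps_built": False, "mps_available": False}

    import torch

    # Older PyTorch releases have no torch.backends.mps module at all.
    mps = getattr(torch.backends, "mps", None)
    built = bool(mps is not None and mps.is_built())
    available = bool(mps is not None and mps.is_available())
    return {"torch_installed": True, "mps_built": built, "mps_available": available}


report = mps_report()
print(report)
```

This only describes what PyTorch sees. It says nothing about which resources Ray detects, which is a separate question.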
2. Run CartPole with RLlib on PyTorch using CPUs (OK)
Next I confirm I can run the cartpole example with torch ("--framework torch") and otherwise default arguments. This terminates normally after 26 seconds with a reward of 156.79 after 11 iterations/44k time steps:
== Status ==
Current time: 2022-07-08 10:03:22 (running for 00:00:30.28)
Memory usage on this node: 10.8/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/10 CPUs, 0/0 GPUs, 0.0/5.99 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/rjf/ray_results/PPO
Number of trials: 1/1 (1 TERMINATED)
3. Run CartPole with RLlib on PyTorch using GPUs (FAILED)
Command as launched from VS Code, where my launch.json has the added env entry "RLLIB_NUM_GPUS": "1":
$ cd /Users/rjf/dev/external/ray ; /usr/bin/env /Users/rjf/miniforge3/bin/python /Users/rjf/.vscode/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/launcher 54469 -- /Users/rjf/dev/external/ray/rllib/examples/cartpole_lstm.py --framework torch
(scheduler +8s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(scheduler +8s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
...
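For context on where the {'GPU': 1.0, 'CPU': 1.0} request comes from, here is a minimal sketch of how an environment variable such as RLLIB_NUM_GPUS can end up as a `num_gpus` config entry. It assumes the example script reads the variable roughly this way, which I have not verified against this exact Ray version. The helper `gpu_request_from_env` and the trimmed `config` dict are illustrative only.

```python
import os


def gpu_request_from_env(environ=None):
    """Read the GPU count from RLLIB_NUM_GPUS, defaulting to 0.

    Hypothetical helper: it assumes the script converts the variable with
    int() and treats a missing variable as "no GPU requested".
    """
    environ = os.environ if environ is None else environ
    return int(environ.get("RLLIB_NUM_GPUS", "0"))


# Simulate the launch.json env entry from this post.
launch_env = {"RLLIB_NUM_GPUS": "1"}

# Trimmed, illustrative config; a real RLlib config has many more keys.
config = {
    "framework": "torch",
    "num_gpus": gpu_request_from_env(launch_env),
}
print(config)
```

With the variable set to "1", the trainer asks the scheduler for one GPU, which matches the resource request in the error message.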
Ray status notifications in the console repeatedly say “PENDING” after that:
== Status ==
Current time: 2022-07-08 10:03:54 (running for 00:00:05.15)
Memory usage on this node: 10.6/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/10 CPUs, 0/0 GPUs, 0.0/6.89 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/rjf/ray_results/PPO
Number of trials: 1/1 (1 PENDING)
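The status output and the scheduler error can be read together: the node reports a total of 0 GPUs, so a request for 1 GPU can never be placed and the trial stays PENDING. The sketch below uses only the standard library to parse the "Resources requested" line and compare it with the request from the error message. The helper names `parse_totals` and `unmet` are hypothetical, and the regular expression assumes the line format shown in this post.

```python
import re


def parse_totals(status_line):
    """Extract total CPUs and GPUs from a 'Resources requested' status line."""
    totals = {}
    # Matches pairs such as "0/10 CPUs" or "0/0 GPUs"; heap/object sizes are skipped.
    for _used, total, kind in re.findall(r"([\d.]+)/([\d.]+) (CPU|GPU)s", status_line):
        totals[kind] = float(total)
    return totals


def unmet(request, totals):
    """Return the part of the request that exceeds what the node offers."""
    return {k: v for k, v in request.items() if v > totals.get(k, 0.0)}


status = "Resources requested: 0/10 CPUs, 0/0 GPUs, 0.0/6.89 GiB heap, 0.0/2.0 GiB objects"
request = {"GPU": 1.0, "CPU": 1.0}  # taken from the scheduler error message

totals = parse_totals(status)
shortfall = unmet(request, totals)
print(totals)     # {'CPU': 10.0, 'GPU': 0.0}
print(shortfall)  # {'GPU': 1.0}
```

The CPU part of the request fits (1 of 10), so the only unmet resource is the GPU that Ray does not detect.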
Thanks in advance for any help.