RLlib, PyTorch and Mac M1 GPUs: No available node types can fulfill resource request

How severe does this issue affect your experience of using Ray?

  • Medium: It contributes to significant difficulty to complete my task, but I can work around it.

Hello Ray community!

A year ago I began experimenting with QMIX on RLlib to control the MATSim traffic simulator. Since then, I have purchased a 2021 MacBook 14" which has a 10-core M1 CPU and 10 GPU cores. Prototyping my multi-agent scenario could benefit greatly from the speedup the GPU cores offer. However, I can’t seem to get Ray to recognize that the GPUs are available. I understand that, while M1 support currently exists for both Ray and PyTorch, it is experimental.

Below, the first section shows my env setup, and the second section shows the hello-world-flavored tests I ran to confirm PyTorch, RLlib, and finally, GPU utilization.

My environment

Based on reviewing these installation instructions:

# miniforge
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-MacOSX-arm64.sh
zsh Miniforge3-MacOSX-arm64.sh
rm Miniforge3-MacOSX-arm64.sh

# pytorch
conda install pytorch -c pytorch-nightly

# ray
pip uninstall grpcio
conda install grpcio=1.43.0
pip install ray "ray[rllib]"
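
After installation, a quick sanity check of the resolved versions can be done with something like the following (a sketch; the exact versions depend on when the nightly/conda packages were pulled):

# Sanity check (sketch): confirm the packages import and print their versions.
import grpc   # provided by the grpcio package
import ray
import torch

print("ray:", ray.__version__)
print("torch:", torch.__version__)
print("grpcio:", grpc.__version__)
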
1. Confirm PyTorch sees GPUs (OK)
$ python
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 17:00:33) 
[Clang 13.0.1 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.backends.mps.is_available()
True
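
As an extra check that the backend does more than report availability, a tiny computation can be pushed to the MPS device (a sketch; "mps" is the PyTorch device name for the Apple GPU):

import torch

# Sketch: allocate a tensor on the Apple GPU via the MPS backend and run a
# small matmul to confirm the device is actually usable.
device = torch.device("mps")
x = torch.rand(1024, 1024, device=device)
y = x @ x
print(y.device)  # expected: mps:0
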
2. Run CartPole with RLlib on PyTorch using CPUs (OK)

Next I confirm I can run the cartpole example with torch ("--framework torch") and otherwise default arguments. It terminates normally after 26 seconds with a reward of 156.79 after 11 iterations / 44k timesteps:

== Status ==
Current time: 2022-07-08 10:03:22 (running for 00:00:30.28)
Memory usage on this node: 10.8/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/10 CPUs, 0/0 GPUs, 0.0/5.99 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/rjf/ray_results/PPO
Number of trials: 1/1 (1 TERMINATED)
3. Run CartPole with RLlib on PyTorch using GPUs (FAILED)

The command as launched from VS Code, where my launch.json has the added env entry "RLLIB_NUM_GPUS": "1":

$ cd /Users/rjf/dev/external/ray ; /usr/bin/env /Users/rjf/miniforge3/bin/python /Users/rjf/.vscode/extensions/ms-python.python-2022.8.1/pythonFiles/lib/python/debugpy/launcher 54469 -- /Users/rjf/dev/external/ray/rllib/examples/cartpole_lstm.py --framework torch
(scheduler +8s) Tip: use `ray status` to view detailed cluster status. To disable these messages, set RAY_SCHEDULER_EVENTS=0.
(scheduler +8s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
...

After that, the status output in the console repeatedly shows the trial as PENDING:

== Status ==
Current time: 2022-07-08 10:03:54 (running for 00:00:05.15)
Memory usage on this node: 10.6/16.0 GiB
Using FIFO scheduling algorithm.
Resources requested: 0/10 CPUs, 0/0 GPUs, 0.0/6.89 GiB heap, 0.0/2.0 GiB objects
Result logdir: /Users/rjf/ray_results/PPO
Number of trials: 1/1 (1 PENDING)
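
A way to check what Ray auto-detects on this machine would be something like the following (a sketch; I have not pasted actual output, but given the "0/0 GPUs" line above I would expect no "GPU" key):

import ray

# Sketch: inspect the resources Ray auto-detects on this node. If the M1 GPU
# is not recognized, the "GPU" key should be missing from these dicts.
ray.init()
print(ray.cluster_resources())
print(ray.available_resources())
ray.shutdown()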

Thanks in advance for any help.

Hi @robfitzgerald,

Hard to say without the whole config at hand. My guess is that you run with "num_workers" > 0 and each worker requests {"GPU": 1, "CPU": 1}. This cannot be fulfilled, and therefore the status stays "PENDING".

If this is the case, try setting, for example, num_workers=3 and num_gpus_per_worker=0.25.
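
To make the arithmetic behind that suggestion concrete (a sketch of my understanding of how the request is composed, not taken from RLlib internals):

# Sketch: rough resource arithmetic for the suggested settings.
num_workers = 3
num_gpus_per_worker = 0.25
num_gpus = 0  # GPUs for the driver/learner process

# Each rollout worker asks for roughly {"CPU": 1, "GPU": num_gpus_per_worker},
# so a worker bundle can only be placed on a node that advertises a GPU resource.
worker_request = {"CPU": 1, "GPU": num_gpus_per_worker}
total_gpus_requested = num_gpus + num_workers * num_gpus_per_worker  # 0.75
print(worker_request, total_gpus_requested)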

Lars,

Thank you for the lead.

Hard to say without the whole config at hand

I am running the cartpole example, passing --framework torch as an argument. The config section begins here.

Per your suggestion, I modified the general config section at line 63 (and tested both with and without num_gpus, in case there was some kind of collision between those config keys):

            "env": StatelessCartPole,
            # Use GPUs iff `RLLIB_NUM_GPUS` env var set to > 0.
            # "num_gpus": int(os.environ.get("RLLIB_NUM_GPUS", "0")),
            "num_workers": 3,
            "num_gpus_per_worker": 0.25,
            "model": {
                "use_lstm": True,
                "lstm_cell_size": 256,
                "lstm_use_prev_action": args.use_prev_action,
                "lstm_use_prev_reward": args.use_prev_reward,
            },
            "framework": args.framework,
            # Run with tracing enabled for tfe/tf2?
            "eager_tracing": args.eager_tracing,

with similar results in both cases.

Without the num_gpus key:

(scheduler +8s) Error: No available node types can fulfill resource request {'GPU': 0.25, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.

With num_gpus=1:

(scheduler +8s) Error: No available node types can fulfill resource request {'GPU': 1.0, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.
(scheduler +8s) Error: No available node types can fulfill resource request {'GPU': 0.25, 'CPU': 1.0}. Add suitable node types to this cluster to resolve this issue.

It’s not obvious to me where in the documentation to look for a solution; the list of common parameters didn’t clear this up for me. Any other ideas?

@robfitzgerald It appears to me as if Ray might not have recognized the GPU internally.
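
One way to test that theory would be to declare the GPU explicitly when starting Ray, so that the scheduler at least has a "GPU" resource to place the bundles on (a sketch, untested on M1; whether torch can actually use the device via MPS inside RLlib is a separate question):

import ray

# Sketch (untested on Apple silicon): declare one GPU explicitly instead of
# relying on auto-detection, so requests like {"GPU": 0.25, "CPU": 1.0} become
# satisfiable. This only affects scheduling; it does not make torch use MPS.
ray.init(num_gpus=1)
print(ray.cluster_resources())  # should now include 'GPU': 1.0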