1. Severity of the issue: (select one)
High: Completely blocks me.
2. Environment:
- Ray version: 2.49.2
- Python version: 3.11.6
- OS: Linux
- Cloud/Infrastructure: Azure AKS
- Other libs/tools (if relevant):
3. What happened vs. what you expected:
- Expected: Ability to deploy Ray Serve with the head pod running on a CPU-only node
- Actual: Ray crashes on startup with `Permission denied: 'nvidia-smi'`
Hi, I am trying to deploy a Ray Serve cluster. My goal is to have all GPU actors scale to zero when there are no active requests. To do this I want the head pod on a CPU-only node, so that when the cluster is idle only the head is up and I am not paying for GPU nodes. When I do this I get the nvidia-smi permission error. It seems to be unavoidable, as nvidia-smi will not be installed on any node that does not have a GPU. Is there a way to change this?
After further investigation, this issue only occurs on later versions of Ray; it does not occur with version 2.38. It looks like in newer versions the dashboard tries to access nvidia-smi regardless of the type of node the head pod runs on. Is this a bug, or is my configuration incorrect?
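For illustration, here is a minimal sketch of the failure mode suggested by the traceback below. This mimics, but is not, Ray's actual `node_has_gpus` check; the exception handling shown is an assumption. A probe that treats a missing nvidia-smi as "no GPUs" still crashes when the binary exists but lacks the execute bit, which is what the PermissionError (rather than FileNotFoundError) indicates:

```python
import os
import stat
import subprocess
import tempfile

def node_has_gpus_fragile(env):
    """Sketch of a GPU probe (hypothetical, modeled on the traceback):
    an absent or failing nvidia-smi is handled, but a present,
    non-executable one is not."""
    try:
        subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL, env=env)
        return True
    except (FileNotFoundError, subprocess.CalledProcessError):
        return False  # binary absent or failed: treated as "no GPUs"
    # A PermissionError (errno 13) escapes this handler and crashes the caller.

# Reproduce: a readable but non-executable nvidia-smi stub on PATH.
with tempfile.TemporaryDirectory() as d:
    stub = os.path.join(d, "nvidia-smi")
    with open(stub, "w") as f:
        f.write("#!/bin/sh\n")
    os.chmod(stub, stat.S_IREAD)  # read-only: no execute bit set

    try:
        node_has_gpus_fragile({"PATH": d})
        outcome = "handled"
    except PermissionError:
        outcome = "PermissionError"

print(outcome)
```

On a true CPU-only image with no nvidia-smi at all, the same probe would return False cleanly, which may explain why the crash only appears when a stub binary is present.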
```yaml
rayClusterConfig:
  rayVersion: 2.49.2
  enableInTreeAutoscaling: true
  headGroupSpec:
    rayStartParams:
      dashboard-host: "0.0.0.0"
      num-cpus: "0"
      num-gpus: "0"
      include-dashboard: "false"
```
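For completeness, the scale-to-zero goal would typically be expressed with a worker group whose `minReplicas` is 0, alongside the head group above. This fragment is a hedged sketch: the group name, replica counts, and GPU count are illustrative, not taken from the reporter's config:

```yaml
  workerGroupSpecs:
    - groupName: gpu-workers   # illustrative name
      replicas: 0
      minReplicas: 0           # lets the autoscaler scale GPU nodes to zero
      maxReplicas: 4           # illustrative upper bound
      rayStartParams:
        num-gpus: "1"
```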
```
2025-09-23 14:56:21,222 ERROR agent.py:459 -- Agent is working abnormally. It will exit immediately.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/agent.py", line 457, in <module>
    loop.run_until_complete(agent.run())
  File "/home/ray/anaconda3/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
    return future.result()
           ^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/agent.py", line 172, in run
    modules = self._load_modules()
              ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/agent.py", line 152, in _load_modules
    c = cls(self)
        ^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/modules/reporter/reporter_agent.py", line 478, in __init__
    self._gpu_profiling_manager = GpuProfilingManager(
                                  ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/modules/reporter/gpu_profile_manager.py", line 77, in __init__
    if not self.node_has_gpus():
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/site-packages/ray/dashboard/modules/reporter/gpu_profile_manager.py", line 107, in node_has_gpus
    subprocess.check_output(["nvidia-smi"], stderr=subprocess.DEVNULL)
  File "/home/ray/anaconda3/lib/python3.11/subprocess.py", line 466, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.11/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/home/ray/anaconda3/lib/python3.11/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
PermissionError: [Errno 13] Permission denied: 'nvidia-smi'
```
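If the root cause turns out to be a non-executable nvidia-smi stub baked into the CPU-only image (an assumption based on the PermissionError rather than a FileNotFoundError above), one possible workaround is to delete the stub before `ray start`, e.g. in an init container or entrypoint; the probe should then see a missing binary, which version 2.38 evidently tolerated. The real stub path in your image is unknown here, so the snippet below demonstrates the logic against a stub in a temp directory:

```shell
#!/bin/sh
# Demo setup: create a stub like the one assumed to exist in the image.
# In a real pod, point NVSMI_PATH at the actual location instead
# (hypothetical example: /usr/bin/nvidia-smi -- verify in your image).
demo_dir="$(mktemp -d)"
NVSMI_PATH="$demo_dir/nvidia-smi"
: > "$NVSMI_PATH"
chmod 444 "$NVSMI_PATH"   # present but not executable, like the reported stub

# Remove the file only when it exists and is not executable, so a
# working nvidia-smi on a GPU node would be left untouched.
if [ -e "$NVSMI_PATH" ] && [ ! -x "$NVSMI_PATH" ]; then
    rm -f "$NVSMI_PATH"
fi

[ -e "$NVSMI_PATH" ] && echo "stub still present" || echo "stub removed"
```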