Restoring a checkpoint for an inference test in RLlib

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.44.1
  • Python version: 3.11
  • OS: Linux & Windows
  • Cloud/Infrastructure: Local Ray instance on colab
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: I saved a checkpoint for a DQN config after training the algorithm on an environment with a GPU available. I want to test it visually (using the SUMO simulator), so I needed to restore the model locally. I expected this to work out of the box.
  • Actual: I am getting this error:

    File "C:\Users\ilai\AppData\Local\pypoetry\Cache\virtualenvs\rl-tsc-2025-WnPmI87f-py3.11\Lib\site-packages\ray\rllib\utils\framework.py", line 88, in get_device
      assert config.local_gpu_idx < torch.cuda.device_count(), (
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    AssertionError: local_gpu_idx 0 is not a valid GPU ID or is not available

    so it expects a GPU to be available wherever this config is run (I had a different error message on other Ray versions, but it was the same problem).

The code is basic:

algo = Algorithm.from_checkpoint(checkpoint_path)

where checkpoint_path refers to a local checkpoint, e.g. "/path/to/checkpoints/checkpoint_xxx".

On earlier Ray versions I could bypass this by editing the pkl of the algorithm_state, but on this version it fails. (I hit a different error after working around this on the older version, so I tried updating to the latest version; the model is of course trained and restored on the same Ray version.)
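Roughly, the pkl workaround on the older version looked like this (a sketch from memory; the exact file name and field layout differ between Ray versions, so treat the keys below as assumptions):

import pickle

state_file = "/path/to/checkpoints/checkpoint_xxx/algorithm_state.pkl"

# Load the pickled algorithm state, drop the GPU requirement, and write it back.
# The "config" / "num_gpus" keys are assumptions and vary between Ray versions.
with open(state_file, "rb") as f:
    state = pickle.load(f)

state["config"]["num_gpus"] = 0

with open(state_file, "wb") as f:
    pickle.dump(state, f)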

Hey @Ilai_Dabush, welcome to the forums!

I have a few questions for you:

  • You mentioned you trained the policy with an available GPU; does your current setup (venv) have a GPU?
  • Can you try the below?
import torch
print(torch.cuda.is_available(), torch.cuda.device_count())
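On a machine where torch can't see a GPU, that snippet prints False 0; on a single-GPU machine it prints True 1.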

This will show whether torch is detecting your GPU(s); depending on the answer, it will help with troubleshooting. Essentially, from reading this assertion on the RLlib side, it is checking whether you have a GPU available or not (see below):

assert config.local_gpu_idx < torch.cuda.device_count(), (
    f"local_gpu_idx {config.local_gpu_idx} is not a valid GPU ID "
    "or is not available."
)
# This is an index into the available CUDA devices. For example, if
# `os.environ["CUDA_VISIBLE_DEVICES"] = "1"` then
# `torch.cuda.device_count() = 1` and torch.device(0) maps to that GPU
# with ID=1 on the node.
return torch.device(config.local_gpu_idx)
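To illustrate the mapping described in that comment, you can set CUDA_VISIBLE_DEVICES before torch initializes CUDA and see what it reports (just an illustration, nothing RLlib-specific):

import os
# Must be set before torch touches CUDA for it to take effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
# On a node with 2+ GPUs this prints 1, and torch.device(0) then maps to the
# physical GPU with ID=1. On a node with no GPU it prints 0, so any
# local_gpu_idx fails the assertion above.
print(torch.cuda.device_count())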

Thanks!

Tyler

Hey @tlaurie99,
That's exactly the problem: my local machine has no GPU. I wanted to run some inference locally, and for that I don't really need a GPU. For now my workaround is to not use a GPU for training either, since I train very small networks rather than large ones (see the sketch below). It would be nice if running inference on a machine without a GPU were supported, even when a GPU was configured and used for training.
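For reference, the CPU-only training setup I'm falling back to looks roughly like this (a sketch; the exact resource/learner options differ between the old and new API stacks, and "my_sumo_env" is a placeholder for my registered SUMO environment):

from ray.rllib.algorithms.dqn import DQNConfig

config = (
    DQNConfig()
    .environment("my_sumo_env")          # placeholder env name
    .resources(num_gpus=0)               # no GPU for the main process (old API stack)
    .learners(num_gpus_per_learner=0)    # no GPU for learners (new API stack)
)
algo = config.build()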

Thanks,

Ilai

Hey @Ilai_Dabush,

Okay, great, that is good to know. Sorry, I think I misunderstood at first!

So, I believe you should be able to do the following. This just takes your config, changes num_gpus to 0, and then builds the algorithm with what you trained with.

config = DQNConfig.from_checkpoint(checkpoint_path)
config = config.resources(num_gpus=0)
algo = config.build()
algo.restore(checkpoint_path)

# now you have access to algo.evaluate()
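If that builds, a quick sanity check could look like this (a sketch: `env` is assumed to be a locally constructed instance of the same environment you trained on, and depending on whether you are on the old or new API stack, compute_single_action may be deprecated in favor of pulling the RLModule via algo.get_module() and calling forward_inference on it directly):

# Run the built-in evaluation (needs evaluation workers configured in the config).
results = algo.evaluate()
print(results)

# Or step the env manually for the visual SUMO test.
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = algo.compute_single_action(obs, explore=False)  # greedy actions
    obs, reward, terminated, truncated, info = env.step(action)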

Let me know if this works for you – I am not at a place where I can run this to check, but I will later today when I get back.

Tyler

Hey @tlaurie99,

It is indeed what you would expect to work. The problem is that it won't let you load the DQN config on a machine with no GPU if the checkpoint was trained on a GPU. It simply won't get past that line and throws an error about the GPU not being available. (I've also tried the workaround of editing the pkl files that hold the configuration, but to no avail.)

Thanks,

Ilai