Restoring a checkpoint for an inference test in RLlib

1. Severity of the issue: (select one)
High: Completely blocks me.

2. Environment:

  • Ray version: 2.44.1
  • Python version: 3.11
  • OS: Linux & Windows
  • Cloud/Infrastructure: Local Ray instance on colab
  • Other libs/tools (if relevant):

3. What happened vs. what you expected:

  • Expected: I saved a checkpoint for a DQN config after training the algorithm on an environment with a GPU available. I want to test it visually (using the SUMO simulator), so I needed to restore the model locally. I expected this to work out of the box.
  • Actual: I am getting this error:

    File "C:\Users\ilai\AppData\Local\pypoetry\Cache\virtualenvs\rl-tsc-2025-WnPmI87f-py3.11\Lib\site-packages\ray\rllib\utils\framework.py", line 88, in get_device
      assert config.local_gpu_idx < torch.cuda.device_count(), (
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    AssertionError: local_gpu_idx 0 is not a valid GPU ID or is not available

    so it expects a GPU to be available wherever this config is run (I had a different error message on other Ray versions, but it was the same problem).

The code is basic:

algo = Algorithm.from_checkpoint(checkpoint_path)

where checkpoint_path refers to a local checkpoint, e.g. "/path/to/checkpoints/checkpoint_xxx".

On earlier Ray versions I could bypass this by editing the pkl of the algorithm_state, but on this version it fails. (I hit a different error after working around this on the older version, so I tried updating to the latest version; the model is of course trained and restored on the same Ray version.)
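Roughly, the pkl workaround on the older version looked like this (a sketch from memory; the exact file name and field layout differ between Ray versions, so treat the keys below as assumptions):

import pickle

state_file = "/path/to/checkpoints/checkpoint_xxx/algorithm_state.pkl"

# Load the pickled algorithm state, drop the GPU requirement, and write it back.
# The "config" / "num_gpus" keys are assumptions and vary between Ray versions.
with open(state_file, "rb") as f:
    state = pickle.load(f)

state["config"]["num_gpus"] = 0

with open(state_file, "wb") as f:
    pickle.dump(state, f)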

Hey @Ilai_Dabush, welcome to the forums!

I have a few questions for you:

  • You mentioned you trained the policy with an available GPU; does your current setup (venv) have a GPU?
  • Can you try the below?
import torch
print(torch.cuda.is_available(), torch.cuda.device_count())
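On a machine where torch can't see a GPU, that snippet prints False 0; on a single-GPU machine it prints True 1.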

This will show whether torch is detecting your GPU(s); depending on the answer, it will help with troubleshooting. Essentially, from reading this assertion on the RLlib side, it is checking whether you have a GPU available or not (see below):

assert config.local_gpu_idx < torch.cuda.device_count(), (
    f"local_gpu_idx {config.local_gpu_idx} is not a valid GPU ID "
    "or is not available."
)
# This is an index into the available CUDA devices. For example, if
# `os.environ["CUDA_VISIBLE_DEVICES"] = "1"` then
# `torch.cuda.device_count() = 1` and torch.device(0) maps to that GPU
# with ID=1 on the node.
return torch.device(config.local_gpu_idx)
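To illustrate the mapping described in that comment, you can set CUDA_VISIBLE_DEVICES before torch initializes CUDA and see what it reports (just an illustration, nothing RLlib-specific):

import os
# Must be set before torch touches CUDA for it to take effect.
os.environ["CUDA_VISIBLE_DEVICES"] = "1"

import torch
# On a node with 2+ GPUs this prints 1, and torch.device(0) then maps to the
# physical GPU with ID=1. On a node with no GPU it prints 0, so any
# local_gpu_idx fails the assertion above.
print(torch.cuda.device_count())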

Thanks!

Tyler

Hey @tlaurie99,
That's exactly the problem: my local machine has no GPU. I wanted to run some inference locally, and for that I don't really need a GPU. For now my workaround is to not use a GPU for training either, since I train very small networks rather than large ones (see the sketch below). It would be nice if running inference on a machine without a GPU were supported, even when a GPU was configured and used for training.
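For reference, the CPU-only training setup I'm falling back to looks roughly like this (a sketch; the exact resource/learner options differ between the old and new API stacks, and "my_sumo_env" is a placeholder for my registered SUMO environment):

from ray.rllib.algorithms.dqn import DQNConfig

config = (
    DQNConfig()
    .environment("my_sumo_env")          # placeholder env name
    .resources(num_gpus=0)               # no GPU for the main process (old API stack)
    .learners(num_gpus_per_learner=0)    # no GPU for learners (new API stack)
)
algo = config.build()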

Thanks,

Ilai

Hey @Ilai_Dabush,

Okay, great, that is good to know. Sorry, I think I misunderstood at first!

So, I believe you should be able to do the following. This just takes your config, changes num_gpus to 0, and then builds the algorithm with what you trained with.

config = DQNConfig.from_checkpoint(checkpoint_path)
config = config.resources(num_gpus=0)
algo = config.build()
algo.restore(checkpoint_path)

# now you have access to algo.evaluate()
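If that builds, a quick sanity check could look like this (a sketch: `env` is assumed to be a locally constructed instance of the same environment you trained on, and depending on whether you are on the old or new API stack, compute_single_action may be deprecated in favor of pulling the RLModule via algo.get_module() and calling forward_inference on it directly):

# Run the built-in evaluation (needs evaluation workers configured in the config).
results = algo.evaluate()
print(results)

# Or step the env manually for the visual SUMO test.
obs, info = env.reset()
terminated = truncated = False
while not (terminated or truncated):
    action = algo.compute_single_action(obs, explore=False)  # greedy actions
    obs, reward, terminated, truncated, info = env.step(action)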

Let me know if this works for you – I am not at a place where I can run this to check, but I will later today when I get back.

Tyler

Hey @tlaurie99,

It is indeed what you would expect to work. The problem is that it won't let you load the DQN config on a machine with no GPU if the checkpoint was trained on a GPU. It simply won't get past that line and throws an error about the GPU not being available. (I've also tried the workaround of editing the pkl files that hold the configuration, but to no avail.)

Thanks,

Ilai