Saving checkpoints during training causes a GPU/CUDA error

tuner = tune.Tuner(
    "PPO",
    param_space=ppo_config,
    run_config=tune.RunConfig(
        storage_path=storage_dir,
        name="p0",
        stop={"training_iteration": 2000},
        verbose=3,
        checkpoint_config=tune.CheckpointConfig(
            checkpoint_at_end=True,
            checkpoint_frequency=5,
        ),
    ),
)
results = tuner.fit()

error:
Trial PPO_multi_env_08ebf_00000 finished iteration 5 at 2025-07-16 10:55:01. Total running time: 1min 39s
╭──────────────────────────────────────────────────╮
│ Trial PPO_multi_env_08ebf_00000 result │
├──────────────────────────────────────────────────┤
│ env_runners/episode_len_mean 150 │
│ env_runners/episode_return_mean -172 │
│ num_env_steps_sampled_lifetime 20480 │
╰──────────────────────────────────────────────────╯
2025-07-16 10:55:01,180 ERROR tune_controller.py:1331 -- Trial task failed for trial PPO_multi_env_08ebf_00000
Traceback (most recent call last):
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/air/execution/_internal/event_manager.py", line 110, in resolve_future
    result = ray.get(future)
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/_private/client_mode_hook.py", line 104, in wrapper
    return func(*args, **kwargs)
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/_private/worker.py", line 2849, in get
    values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/_private/worker.py", line 937, in get_objects
    raise value.as_instanceof_cause()
ray.exceptions.RayTaskError(RaySystemError): ray::PPO.save() (pid=26518, ip=10.68.4.39, actor_id=4c6af5385d92900280559f5e01000000, repr=PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True))
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/tune/trainable/trainable.py", line 486, in save
    checkpoint_dict_or_path = self.save_checkpoint(checkpoint_dir)
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 2690, in save_checkpoint
    self.save_to_path(
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/rllib/utils/checkpoints.py", line 300, in save_to_path
    comp_state = self.get_state(components=comp_name)[comp_name]
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/rllib/algorithms/algorithm.py", line 2834, in get_state
    state[COMPONENT_LEARNER_GROUP] = self.learner_group.get_state(
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/rllib/core/learner/learner_group.py", line 521, in get_state
    state[COMPONENT_LEARNER] = self._get_results(results)[0]
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/rllib/core/learner/learner_group.py", line 672, in _get_results
    raise result_or_error
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/rllib/utils/actor_manager.py", line 861, in _fetch_result
    result = ray.get(ready)
ray.exceptions.RaySystemError: System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
traceback: Traceback (most recent call last):
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/storage.py", line 533, in _load_from_bytes
    return torch.load(io.BytesIO(b), weights_only=False)
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 1495, in load
    return _legacy_load(
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 1754, in _legacy_load
    result = unpickler.load()
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 1682, in persistent_load
    obj = restore_location(obj, location)
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 693, in default_restore_location
    result = fn(storage, location)
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 631, in _deserialize
    device = _validate_device(location, backend_name)
  File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 600, in _validate_device
    raise RuntimeError(
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Trial PPO_multi_env_08ebf_00000 errored after 5 iterations at 2025-07-16 10:55:01. Total running time: 1min 39s
Error file: /tmp/ray/session_2025-07-16_10-53-20_879632_24234/artifacts/2025-07-16_10-53-21/p0/driver_artifacts/PPO_multi_env_08ebf_00000_0_2025-07-16_10-53-21/error.txt
2025-07-16 10:55:01,202 INFO tune.py:1009 -- Wrote the latest version of all result files and experiment state to '/home/zanhao/Project_shuttle/results/p0' in 0.0206s.

Trial status: 1 ERROR
Current time: 2025-07-16 10:55:01. Total running time: 1min 39s
Logical resource usage: 13.0/32 CPUs, 0.99/1 GPUs (0.0/1.0 accelerator_type:G)
╭────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name status iter total time (s) …lls_per_iteration …_sampled_lifetime │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PPO_multi_env_08ebf_00000 ERROR 5 94.4173 1 20480 │
╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

Number of errored trials: 1
╭──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ Trial name # failures error file │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ PPO_multi_env_08ebf_00000 1 /tmp/ray/session_2025-07-16_10-53-20_879632_24234/artifacts/2025-07-16_10-53-21/p0/driver_artifacts/PPO_multi_env_08ebf_00000_0_2025-07-16_10-53-21/error.txt │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) Traceback (most recent call last):
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/_private/serialization.py", line 458, in deserialize_objects
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) obj = self._deserialize_object(data, metadata, object_ref)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/_private/serialization.py", line 315, in _deserialize_object
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) return self._deserialize_msgpack_data(data, metadata_fields)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/_private/serialization.py", line 270, in _deserialize_msgpack_data
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) python_objects = self._deserialize_pickle5_data(pickle5_data)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/_private/serialization.py", line 258, in _deserialize_pickle5_data
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) obj = pickle.loads(in_band, buffers=buffers)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/storage.py", line 533, in _load_from_bytes
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) return torch.load(io.BytesIO(b), weights_only=False)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 1495, in load
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) return _legacy_load(
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 1754, in _legacy_load
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) result = unpickler.load()
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 1682, in persistent_load
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) obj = restore_location(obj, location)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 693, in default_restore_location
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) result = fn(storage, location)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 631, in _deserialize
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) device = _validate_device(location, backend_name)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 600, in _validate_device
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) raise RuntimeError(
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) 2025-07-16 10:55:01,178 ERROR actor_manager.py:873 -- Ray error (System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) traceback: Traceback (most recent call last):
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/_private/serialization.py", line 458, in deserialize_objects
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) obj = self._deserialize_object(data, metadata, object_ref)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/_private/serialization.py", line 315, in _deserialize_object
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) return self._deserialize_msgpack_data(data, metadata_fields)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/_private/serialization.py", line 270, in _deserialize_msgpack_data
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) python_objects = self._deserialize_pickle5_data(pickle5_data)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/ray/_private/serialization.py", line 258, in _deserialize_pickle5_data
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) obj = pickle.loads(in_band, buffers=buffers)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/storage.py", line 533, in _load_from_bytes
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) return torch.load(io.BytesIO(b), weights_only=False)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 1495, in load
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) return _legacy_load(
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 1754, in _legacy_load
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) result = unpickler.load()
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 1682, in persistent_load
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) obj = restore_location(obj, location)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 693, in default_restore_location
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) result = fn(storage, location)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 631, in _deserialize
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) device = _validate_device(location, backend_name)
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) File "/home/zanhao/anaconda3/envs/torch/lib/python3.9/site-packages/torch/serialization.py", line 600, in _validate_device
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) raise RuntimeError(
(PPO(env=multi_env; env-runners=4; learners=1; multi-agent=True) pid=26518) RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.
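
For reference, the map_location hint the traceback keeps repeating corresponds to a plain torch.load call like the sketch below (the path is made up, purely to illustrate). As far as I can tell it does not apply here, because the failing torch.load happens inside Ray's own object deserialization (the ray/_private/serialization.py frames above), not in any torch.load call in my code:

import torch

# Illustration of the map_location hint from the error message only;
# the checkpoint path here is hypothetical.
state = torch.load(
    "/path/to/checkpoint/module_state.pt",  # hypothetical file
    map_location=torch.device("cpu"),       # remap CUDA tensors onto the CPU
)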

It seems like CUDA is not available on your machine, likely because your machine doesn’t have a GPU. Do you know if your computer has a GPU or is it CPU only?

Yes, of course my computer has a GPU, and it works normally during training. The GPU-related crash only happens when a checkpoint is being saved.

The error being thrown indicates that torch.cuda.is_available() is returning False. Try running that method in a Python console on the machine you're using to deserialize the model, and see if it works in isolation - that might provide more information on what's going wrong.

The issue, as best I can tell, isn’t within Ray or RLlib - it might be that something’s wrong with your PyTorch install.
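
Something like the following quick sketch (assuming torch is installed and a GPU is free on the box) would also show whether CUDA is visible both from the driver process and from inside Ray worker processes - Ray sets CUDA_VISIBLE_DEVICES per worker based on the GPUs it assigns, so the two can differ:

import ray
import torch

ray.init(ignore_reinit_error=True)

@ray.remote(num_gpus=0)
def check_without_gpu():
    # This worker requests no GPUs, so Ray should hide them from it.
    import torch
    return torch.cuda.is_available(), torch.cuda.device_count()

@ray.remote(num_gpus=1)
def check_with_gpu():
    # This worker is assigned one GPU.
    import torch
    return torch.cuda.is_available(), torch.cuda.device_count()

print("driver:        ", torch.cuda.is_available(), torch.cuda.device_count())
print("worker, 0 GPUs:", ray.get(check_without_gpu.remote()))
print("worker, 1 GPU: ", ray.get(check_with_gpu.remote()))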

But I installed PyTorch from the official website, and my CUDA version and the other libraries match it.

If you run torch.cuda.is_available() in isolation, outside of another script, what does it output?

Of course, it returns True.

That is interesting. If you have the full script/repo on hand, I can try pulling it onto a different machine and see if I get the same error. That should at least narrow things down.

Hi, @MCW_Lad

I encountered the same issue; maybe you can reproduce it with the following code:

from ray import train, tune
from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("Pendulum-v1")
    .training(
        lr=tune.grid_search([0.001, 0.0001]),
    )
    .env_runners(
        num_env_runners=2,
        batch_mode="complete_episodes",
    )
    .learners(
        num_learners=1,
        num_gpus_per_learner=1,  # GPU config
    )
)

tuner = tune.Tuner(
    config.algo_class,
    param_space=config,
    run_config=train.RunConfig(
        stop={
            "training_iteration": 5,
        },
        checkpoint_config=tune.CheckpointConfig(
            checkpoint_at_end=True,  # Problem: this is what triggers the error
        ),
    ),
)

results = tuner.fit()

And I get the following error:

ERROR actor_manager.py:873 -- Ray error (System error: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. 
If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

Just ran that code on my own machine. It seems to work perfectly fine, making me think that something with your CUDA setup is the problem. Full outputs uploaded to pastebin for reference, here.
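
If you want to narrow it down further, one comparison that might be worth running (an untested sketch on my end, identical to your script otherwise) is the learner pinned to the CPU. If that variant checkpoints cleanly on your machine, the failure is specific to the GPU learner path:

from ray import tune
from ray.rllib.algorithms.ppo import PPOConfig

# Control experiment only, not a fix: same repro, but the learner runs on CPU.
config = (
    PPOConfig()
    .environment("Pendulum-v1")
    .training(
        lr=tune.grid_search([0.001, 0.0001]),
    )
    .env_runners(
        num_env_runners=2,
        batch_mode="complete_episodes",
    )
    .learners(
        num_learners=1,
        num_gpus_per_learner=0,  # CPU-only learner
    )
)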

Your output is precisely what I had hoped to see. I ran my original test code again; the complete output is here.

After that, I manually tested the CUDA installation in my environment:

(rl) ~ % nvidia-smi
Thu Jul 31 09:05:51 2025
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:C1:00.0 Off |                  Off |
| 32%   30C    P8              15W / 450W |     14MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 4090        Off | 00000000:E1:00.0 Off |                  Off |
| 31%   29C    P8              18W / 450W |     14MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      4520      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      4520      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+
(rl) ~ %
(rl) ~ % nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0
(rl) ~ % python -c "import torch; print(torch.__version__); print(torch.version.cuda); print(torch.cuda.is_available())"
2.7.1+cu118
11.8
True

and… well, everything looks fine, which makes this all the stranger. :face_with_spiral_eyes: