How severely does this issue affect your experience of using Ray?
High: It blocks me from completing my task.
I’ve finished training a bunch of algorithms using the Tuner() API and the AIR library, and they all have their appropriate checkpoint folders and files. However, I can’t seem to restore those checkpoints. I tried using Tuner.restore() and run(restore=); neither worked.
When using Tuner.restore() I got this error:
(ApexDQN pid=476180) 2022-11-14 14:39:07,333 INFO trainable.py:715 -- Checkpoint path was not available, trying to recover from latest available checkpoint instead. Unavailable checkpoint path: G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX\APEX_my_env_5ed3c_00000_0_2022-09-27_20-07-42\checkpoint_004000\checkpoint-4000
And for run(restore=) I got this error:
RuntimeError: Could not find Tuner state in restore directory. Did you passthe correct path (including experiment directory?) Got: G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX\APEX_my_env_5ed3c_00000_0_2022-09-27_20-07-42
The training code:
I’ve also tried pointing to the folders above the checkpoint file; every attempt resulted in the same error output.
Hi, I think you need to restore from: G:\Repos\ML_CIV6\models(3w2s)-d(2w1s)_default\APEX\APEX_my_env_5ed3c_00000_0_2022-09-27_20-07-42\ if you are using Tuner()
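In case it helps to double-check the path: the RuntimeError above asks for the experiment directory, which as far as I know is the folder that holds Tuner's saved state rather than an individual trial folder. Here is a minimal sketch for locating it, assuming that state is stored in a file called tuner.pkl (this was the case in the Ray 2.x releases I'm aware of, but please verify for your version). The helper name find_experiment_dir is made up for illustration.

```python
from pathlib import Path
from typing import Optional

# Assumption: Tuner writes its state to a file with this name inside the
# experiment directory. Verify this against your installed Ray version.
TUNER_STATE_FILE = "tuner.pkl"


def find_experiment_dir(path: str) -> Optional[Path]:
    """Walk upward from `path` and return the first directory that
    contains the Tuner state file, or None if no ancestor has one."""
    current = Path(path).resolve()
    # Check the given path itself first, then each parent in turn.
    for candidate in [current, *current.parents]:
        if (candidate / TUNER_STATE_FILE).is_file():
            return candidate
    return None
```

If this returns a directory, that is probably the path to hand to Tuner.restore(); if it returns None, the Tuner state file may never have been written or may have been moved.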
Algorithm.load_checkpoint() is deprecated.
The “correct” way to load checkpoints is with the Algorithm.from_checkpoint() API.
You can find multiple examples in the examples folder and our documentation.
You can, for example, retrieve the best checkpoint (and later load it) with:
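A minimal sketch of what that could look like, assuming `results` is the ResultGrid returned by `tuner.fit()` and that the metric to rank by is `episode_reward_mean` (an assumption; use whichever metric your trials report). The two helper names are made up for illustration; the Ray calls used are `get_best_result()`, the result's `checkpoint` attribute, and `Algorithm.from_checkpoint()`.

```python
def best_checkpoint(results, metric="episode_reward_mean", mode="max"):
    """Return the checkpoint of the best trial in a Tune ResultGrid.

    `metric` and `mode` are assumptions; pick the ones that fit your run.
    """
    best_result = results.get_best_result(metric=metric, mode=mode)
    return best_result.checkpoint


def load_algorithm(checkpoint):
    """Rebuild an RLlib Algorithm from a checkpoint (path or Checkpoint)."""
    # Imported lazily so that defining this helper does not require
    # RLlib to be importable at module load time.
    from ray.rllib.algorithms.algorithm import Algorithm

    return Algorithm.from_checkpoint(checkpoint)
```

With a finished run this would be used roughly as `algo = load_algorithm(best_checkpoint(tuner.fit()))`.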
  File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 278, in from_checkpoint
    return Algorithm.from_state(state)
  File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 306, in from_state
    new_algo = algorithm_class(config=config)
  File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 368, in __init__
    config.validate()
  File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/ppo/ppo.py", line 222, in validate
    super().validate()
  File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/pg/pg.py", line 91, in validate
    super().validate()
  File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm_config.py", line 556, in validate
    self._resolve_tf_settings(_tf1, _tfv)
  File "/usr/local/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm_config.py", line 2490, in _resolve_tf_settings
    _tf1.enable_eager_execution()
  File "/usr/local/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 6155, in enable_eager_execution
    return enable_eager_execution_internal(
  File "/usr/local/lib/python3.10/site-packages/tensorflow/python/framework/ops.py", line 6223, in enable_eager_execution_internal
    raise ValueError(
ValueError: tf.enable_eager_execution must be called at program startup.
@arturn This error comes from _resolve_tf_settings() in AlgorithmConfig. It checks whether eager execution is enabled for tf1; if it is not, it calls tf1's enable_eager_execution(), which raises this error.
Good morning. Just to confirm, I am still getting this error on 2.4.0.
Code (in .py):
checkpoint = algo.save()
restored_algo = Algorithm.from_checkpoint(checkpoint)
Error:
2023-06-05 09:46:28,769 WARNING checkpoints.py:109 -- No rllib_checkpoint.json file found in checkpoint directory /home/marc/ray_results/DQN_GymEnvironment_2023-06-05_09-39-47ykvo2ont/checkpoint_000010! Trying to extract checkpoint info from other files found in that dir.
Just checking: is a fix due imminently, please?
Many thanks.