When I try to restore an incomplete tune job using this script:
"""Resume experiment script."""
# %% Imports
import os
import ray
from ray.rllib.models import ModelCatalog
from ray.tune import Tuner
from ray.tune.registry import register_env
from punchclock.nets.lstm_mask import MaskedLSTM
from punchclock.ray.build_env import buildEnv
# %% Register model and Env
ModelCatalog.register_custom_model("MaskedLSTM", MaskedLSTM)
register_env("ssa_env", buildEnv)
checkpoint_dir = "/home/user/ray_results/exp_name"
num_cpus = 20
num_workers = num_cpus - 1
ray.init(num_cpus=num_cpus, num_gpus=0)
os.environ["TUNE_MAX_PENDING_TRIALS_PG"] = str(num_workers)
tuner = Tuner.restore(
    trainable="PPO",
    path=checkpoint_dir,
    resume_errored=True,
    restart_errored=True,
)
tuner.fit()
I get the following error, which indicates that the environment is not recognized. But in the script above I register the environment with ray.tune.registry.register_env, so I don’t understand why this error occurs. There is also a second failure that I’m not sure is related to the environment error.
Failure # 1 (occurred at 2023-11-20_20-39-26)
The actor died because of an error raised in its creation task, ray::PPO.__init__() (pid=130077, ip=10.128.8.91, actor_id=1969b2ae33b28f9c1b1fa87701000000, repr=PPO)
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/rllib/evaluation/worker_set.py", line 242, in _setup
self.add_workers(
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/rllib/evaluation/worker_set.py", line 635, in add_workers
raise result.get()
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/rllib/utils/actor_manager.py", line 488, in __fetch_result
result = ray.get(r)
ray.exceptions.RayActorError: The actor died because of an error raised in its creation task, ray::RolloutWorker.__init__() (pid=131032, ip=10.128.8.91, actor_id=73949f41d65591a253439f3e01000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x2abf619836d0>)
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/gymnasium/envs/registration.py", line 569, in make
_check_version_exists(ns, name, version)
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/gymnasium/envs/registration.py", line 219, in _check_version_exists
_check_name_exists(ns, name)
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/gymnasium/envs/registration.py", line 197, in _check_name_exists
raise error.NameNotFound(
gymnasium.error.NameNotFound: Environment ssa_env doesn't exist.
During handling of the above exception, another exception occurred:
ray::RolloutWorker.__init__() (pid=131032, ip=10.128.8.91, actor_id=73949f41d65591a253439f3e01000000, repr=<ray.rllib.evaluation.rollout_worker.RolloutWorker object at 0x2abf619836d0>)
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/rllib/evaluation/rollout_worker.py", line 609, in __init__
self.env = env_creator(copy.deepcopy(self.env_context))
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/rllib/env/utils.py", line 178, in _gym_env_creator
raise EnvError(ERR_MSG_INVALID_ENV_DESCRIPTOR.format(env_descriptor))
ray.rllib.utils.error.EnvError: The env string you provided ('ssa_env') is:
a) Not a supported/installed environment.
b) Not a tune-registered environment creator.
c) Not a valid env class string.
Try one of the following:
a) For Atari support: `pip install gym[atari] autorom[accept-rom-license]`.
For VizDoom support: Install VizDoom
(https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md) and
`pip install vizdoomgym`.
For PyBullet support: `pip install pybullet`.
b) To register your custom env, do `from ray import tune;
tune.register('[name]', lambda cfg: [return env obj from here using cfg])`.
Then in your config, do `config['env'] = [name]`.
c) Make sure you provide a fully qualified classpath, e.g.:
`ray.rllib.examples.env.repeat_after_me_env.RepeatAfterMeEnv`
During handling of the above exception, another exception occurred:
ray::PPO.__init__() (pid=130077, ip=10.128.8.91, actor_id=1969b2ae33b28f9c1b1fa87701000000, repr=PPO)
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 475, in __init__
super().__init__(
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 170, in __init__
self.setup(copy.deepcopy(self.config))
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 601, in setup
self.workers = WorkerSet(
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/rllib/evaluation/worker_set.py", line 194, in __init__
raise e.args[0].args[2]
ray.rllib.utils.error.EnvError: The env string you provided ('ssa_env') is:
a) Not a supported/installed environment.
b) Not a tune-registered environment creator.
c) Not a valid env class string.
Try one of the following:
a) For Atari support: `pip install gym[atari] autorom[accept-rom-license]`.
For VizDoom support: Install VizDoom
(https://github.com/mwydmuch/ViZDoom/blob/master/doc/Building.md) and
`pip install vizdoomgym`.
For PyBullet support: `pip install pybullet`.
b) To register your custom env, do `from ray import tune;
tune.register('[name]', lambda cfg: [return env obj from here using cfg])`.
Then in your config, do `config['env'] = [name]`.
c) Make sure you provide a fully qualified classpath, e.g.:
`ray.rllib.examples.env.repeat_after_me_env.RepeatAfterMeEnv`
Failure # 2 (occurred at 2023-11-25_19-33-43)
e[36mray::PPO.train()e[39m (pid=217206, ip=10.128.8.112, actor_id=67791a70455479e668d7bad601000000, repr=PPO)
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 389, in train
raise skipped from exception_cause(skipped)
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/tune/trainable/trainable.py", line 386, in train
result = self.step()
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 832, in step
results = self._compile_iteration_results(
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/rllib/algorithms/algorithm.py", line 3046, in _compile_iteration_results
results["sampler_results"] = summarize_episodes(
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/rllib/evaluation/metrics.py", line 221, in summarize_episodes
filt = [v for v in v_list if not np.any(np.isnan(v))]
File "/home/user/.conda/envs/punch/lib/python3.10/site-packages/ray/rllib/evaluation/metrics.py", line 221, in <listcomp>
filt = [v for v in v_list if not np.any(np.isnan(v))]
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
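That TypeError is what numpy raises when np.isnan meets a value it can’t coerce to a float, so it looks like something non-numeric ended up in the episode metrics. A minimal reproduction (the v_list contents are illustrative, not taken from my run):

```python
import numpy as np

# A non-numeric entry mixed into an otherwise numeric metrics list.
v_list = [3.5, float("nan"), "not-a-number"]

try:
    # Mirrors the failing list comprehension in rllib's metrics.py.
    filt = [v for v in v_list if not np.any(np.isnan(v))]
except TypeError as exc:
    print(type(exc).__name__)  # same "ufunc 'isnan' not supported" TypeError
```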
The custom environment works normally when I start a new tuning run; I only see this problem when restoring an experiment from a checkpoint.
If anyone has ideas on where to look, I’d much appreciate it.