Migration to Ray 2.2.0

  • Medium: It adds significant difficulty to completing my task, but I can work around it.

It prevents me from migrating from Ray 2.1.0 to 2.2.0.

I have a custom environment and wanted to migrate to Ray 2.2.0. However, when I started training an IMPALA agent in RLlib, I noticed that result["episode_reward_mean"] kept coming back as "nan".

After replacing my custom model and environment with the default FC model and "CartPole-v1" / "MountainCar-v0", it was clear that the cause of this issue had to lie with my custom environment. Hence, I tried to replicate the custom-environment example from the documentation/examples found here.

However, this would not run. To debug, I tried to reduce it to a minimal RLlib setup with a default model (tf), but kept getting these errors:

from /ray/rllib/utils/pre_checks/env.py", line 68, in check_env:
ValueError: Env must be of one of the following supported types: BaseEnv, gym.Env, MultiAgentEnv, VectorEnv, RemoteBaseEnv, ExternalMultiAgentEnv, ExternalEnv, but instead is of type <class '__main__.SimpleCorridor'>.

or if setting disable_env_checking=True:

from /ray/rllib/algorithms/algorithm_config.py", line 2182, in get_multi_agent_setup:
ValueError: observation_space not provided in PolicySpec for default_policy and env does not have an observation space OR no spaces received from other workers’ env(s) OR no observation_space specified in config!
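
(For reference, the last alternative in that error, specifying the spaces directly in the config, would presumably look roughly like the sketch below. I have not verified that this is the intended fix; it reuses the ImpalaConfig and SimpleCorridor from the script further down, and the space values simply mirror what SimpleCorridor defines.)

import numpy as np
from gymnasium.spaces import Box, Discrete
import ray.rllib.algorithms.impala as impala

# Sketch only: pass the spaces explicitly via AlgorithmConfig.environment(),
# mirroring the spaces defined inside SimpleCorridor below.
config = (
    impala.ImpalaConfig()
    .rollouts(num_rollout_workers=0)
    .environment(
        env=SimpleCorridor,
        env_config={"corridor_length": 5},
        observation_space=Box(0.0, 5.0, shape=(1,), dtype=np.float32),
        action_space=Discrete(2),
    )
)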

Versions:
Python: 3.9.16
Ray: 2.2.0
Tensorflow: 2.11.0
Gymnasium: 0.27.1
Gym: 0.23.1 (had to install this as well, since some RLlib files caused import errors without it)

The code below should replicate the problem (comment lines in and out to get the different errors).

Can anybody tell me what I’m doing wrong?

BR

Jorgen

import gymnasium as gym
from gymnasium.spaces import Discrete, Box
import numpy as np
import random
# from ray.rllib.env.env_context import EnvContext
# from ray.tune.registry import register_env
import ray.rllib.algorithms.impala as impala
from ray.tune.logger import pretty_print

class SimpleCorridor(gym.Env):
    """Example of a custom env in which you have to walk down a corridor.
    You can configure the length of the corridor via the env config."""

    # def __init__(self, config: EnvContext):
    def __init__(self, config):
        
        # super(SimpleCorridor, self).__init__()
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = Discrete(2)
        self.observation_space = Box(0.0, self.end_pos, shape=(1,), dtype=np.float32)
        # Set the seed. This is only used for the final (reach goal) reward.
        self.reset(seed=config.worker_index * config.num_workers)

    def reset(self, *, seed=None, options=None):
        random.seed(seed)
        self.cur_pos = 0
        return [self.cur_pos], {}

    def step(self, action):
        assert action in [0, 1], action
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        elif action == 1:
            self.cur_pos += 1
        done = truncated = self.cur_pos >= self.end_pos
        # Produce a random reward when we reach the goal.
        return (
            [self.cur_pos],
            random.random() * 2 if done else -0.1,
            done,
            truncated,
            {},
        )


# def env_creator(env_config):
#     return SimpleCorridor(env_config)


# register_env("simple_corridor", env_creator)

algo = (
    impala.ImpalaConfig()
    .rollouts(num_rollout_workers=0)
    .resources(num_gpus=0)
    .environment(env=SimpleCorridor, env_config={"corridor_length": 5})
    # .environment(env="simple_corridor",env_config={"corridor_length": 5},
    #              #disable_env_checking=True 
    #              )
    .build()
)

for i in range(10):
    result = algo.train()
    print(pretty_print(result))

    if i % 5 == 0:
        checkpoint_dir = algo.save("./ray_22_test")
        print(f"Checkpoint saved in directory {checkpoint_dir}")

Hey @Jorgen_Svane, thanks for raising this issue. I do see that there is actually a bug in RLlib w.r.t. running IMPALA without any rollout workers. This is not usually something that users would do, but it is a totally valid scenario for debugging. We'll provide a fix for this. In the meantime, you can use the following workaround:

In your config, make sure your local worker has an environment (by default, the local worker doesn’t):

        .rollouts(num_rollout_workers=0, create_env_on_local_worker=True)

After at most 2 iterations, you should then see reward stats under the sampler_results key in your results dict.

e.g.

sampler_results:
  connector_metrics:
    ObsPreprocessorConnector_ms: 0.004864154377050101
    StateBufferConnector_ms: 0.0070486334558148
    ViewRequirementAgentConnector_ms: 0.095882689911314
  custom_metrics: {}
  episode_len_mean: 8.492160278745645
  episode_media: {}
  episode_reward_max: 1.5947446366008506
  episode_reward_mean: 0.241397651458262
  episode_reward_min: -10.821602077679723
  episodes_this_iter: 1148
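
(As a quick check, the stats can then be read straight from the results dict; a minimal sketch, assuming sampler_results is populated as in the output above:)

# Minimal sketch: read the reward stats once `sampler_results` is populated.
result = algo.train()
print(result["sampler_results"]["episode_reward_mean"])
print(result["sampler_results"]["episode_len_mean"])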

I can confirm that the following repro script now runs w/o any problems on the latest master (all lines in question uncommented back into the code):

import gymnasium as gym
from gymnasium.spaces import Discrete, Box
import numpy as np
import random
from ray.rllib.env.env_context import EnvContext
from ray.tune.registry import register_env
import ray
import ray.rllib.algorithms.impala as impala
from ray.tune.logger import pretty_print


class SimpleCorridor(gym.Env):
    """Example of a custom env in which you have to walk down a corridor.
    You can configure the length of the corridor via the env config."""

    # def __init__(self, config: EnvContext):
    def __init__(self, config):
        super(SimpleCorridor, self).__init__()
        self.end_pos = config["corridor_length"]
        self.cur_pos = 0
        self.action_space = Discrete(2)
        self.observation_space = Box(0.0, self.end_pos, shape=(1,), dtype=np.float32)
        # Set the seed. This is only used for the final (reach goal) reward.
        self.reset(seed=config.worker_index * config.num_workers)

    def reset(self, *, seed=None, options=None):
        random.seed(seed)
        self.cur_pos = 0
        return [self.cur_pos], {}

    def step(self, action):
        assert action in [0, 1], action
        if action == 0 and self.cur_pos > 0:
            self.cur_pos -= 1
        elif action == 1:
            self.cur_pos += 1
        done = truncated = self.cur_pos >= self.end_pos
        # Produce a random reward when we reach the goal.
        return (
            [self.cur_pos],
            random.random() * 2 if done else -0.1,
            done,
            truncated,
            {},
        )


def env_creator(env_config):
    return SimpleCorridor(env_config)


register_env("simple_corridor", env_creator)

algo = (
    impala.ImpalaConfig()
        .rollouts(num_rollout_workers=0, create_env_on_local_worker=True)
        .resources(num_gpus=0)
        .environment(env=SimpleCorridor, env_config={"corridor_length": 5})
        .build()
)

for i in range(10):
    result = algo.train()
    print(pretty_print(result))

    if i % 5 == 0:
        checkpoint_dir = algo.save("./ray_22_test")
        print(f"Checkpoint saved in directory {checkpoint_dir}")

Here is the PR that fixes the NaN issue:

Hi @sven1977

Many thanks. I failed to get it running even after adding create_env_on_local_worker=True. This was also the case when switching to PPO.

However, when running on the latest master, everything worked like a charm!

One final thing though: I could not get it running when using legacy "gym" as opposed to "gymnasium". Can you confirm that backward compatibility has been discontinued?

BR

Jorgen

Hi @Jorgen_Svane ,

Yes, we don't provide backward compatibility, but instead advise on how to migrate.
When using an old gym env, you should get an error message that tells you what to do.
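
For reference, a minimal sketch of what the migration typically involves (the class name here is made up, and this is not the exact wording of RLlib's error message): reset() now returns (obs, info) and step() returns a 5-tuple with separate terminated/truncated flags instead of a single done.

import gymnasium as gym
import numpy as np
from gymnasium.spaces import Box, Discrete

class MyMigratedEnv(gym.Env):
    """Sketch only: illustrates the gymnasium API that Ray 2.2+ expects."""

    def __init__(self, config=None):
        self.observation_space = Box(-1.0, 1.0, shape=(1,), dtype=np.float32)
        self.action_space = Discrete(2)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Old gym returned only `obs`; gymnasium returns (obs, info).
        return np.zeros(1, dtype=np.float32), {}

    def step(self, action):
        obs = np.zeros(1, dtype=np.float32)
        reward = 0.0
        terminated = False  # episode ended "naturally" (old gym's single `done`)
        truncated = False   # episode cut short, e.g. by a time limit
        # Old gym returned (obs, reward, done, info); gymnasium returns five values.
        return obs, reward, terminated, truncated, {}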

Cheers

Hi @arturn & @sven1977

Many thanks for your quick answers and solutions.

BR

Jorgen


Hi @arturn & @sven1977

As I'm trying to migrate from gym to gymnasium as well as from Ray 2.1.0 to 2.2.0, I decided to run one of the standard envs, more specifically LunarLander-v2 in continuous mode.

However, the new syntax requires it to be specified like this:

env = gym.make("LunarLander-v2", continuous=True)

as there are no longer dedicated continuous env variants.

So my general question is: how do I pass arguments like the above to the default gymnasium envs?

BR

Jorgen

Hi @arturn & @sven1977, and anyone else interested in this issue and possibly #12940 or similar.

In the meantime I upgraded to Ray 2.3.0. The solution below also provides, at least in part, a workaround for the ValueError ('observation ( ) outside given space ( )') issue, by setting preprocessor_pref=None, as described in #12940 and possibly others.

import gymnasium as gym
from ray.tune.registry import register_env
# import ray.rllib.algorithms.ppo as ppo
import ray.rllib.algorithms.impala as impala
from ray.tune.logger import pretty_print

def env_creator(env_config):
    env = gym.make("LunarLander-v2",continuous=True)
    return env

register_env("lunar", env_creator)

algo = (
    impala.ImpalaConfig()
    # .rollouts(num_rollout_workers=2)  # raises: ValueError ('observation ( ) outside given space ( )')!
    .rollouts(num_rollout_workers=2, preprocessor_pref=None)
    .resources(num_gpus=1)
    .environment(env="lunar")
    # .environment(env="lunar",disable_env_checking=True)
    .build()
)

for i in range(10):
    result = algo.train()
    print(pretty_print(result))

    if i % 5 == 0:
        checkpoint_dir = algo.save("./ray_23_test")
        print(f"Checkpoint saved in directory {checkpoint_dir}")
        
import pprint
env = gym.make("LunarLander-v2", continuous=True)
state, _ = env.reset()
actions = algo.compute_single_action(state, full_fetch=True)
pprint.pprint(actions)
# prints the below:
# (array([-1.,  1.], dtype=float32),
#  [],
#  {'action_dist_inputs': array([ 9.965262 , 13.640695 ,  6.174242 ,  1.7361563], dtype=float32),
#   'action_logp': -10.218893,
#   'action_prob': 3.6474652e-05})

So please don't discontinue register_env - thanks. It makes it a bit easier to work with gymnasium-registered environments, not just the built-in ones, since you can pass kwargs etc. (see the sketch below).
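
A minimal sketch of what I mean by passing kwargs through (the "lunar_kwargs" name and the pass-through creator are just illustrations): env_config can carry arbitrary gym.make keyword arguments.

import gymnasium as gym
from ray.tune.registry import register_env

def env_creator(env_config):
    # Sketch: forward whatever keyword arguments were put into env_config to gym.make.
    return gym.make("LunarLander-v2", **env_config)

register_env("lunar_kwargs", env_creator)

# ...and then in the algorithm config:
#     .environment(env="lunar_kwargs", env_config={"continuous": True})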

BR

Jorgen
