How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Dear all,
I’m trying out Ray and RLlib. I think it is a very promising and powerful library. I like it very much!
I created a custom environment, a custom model and even a custom exploration class.
I also have a callback which prints something on every on_sample_end() call.
They seem to work fine: my program runs without errors and I can see it learning, reaching the maximum score given enough training iterations.
A rough, simplified version of the program is shown below, including the exact config I use (it works with that config).
The next step is trying to distribute the compute over multiple workers (my reason for getting familiar with Ray). For this I add the following ‘distributed compute config’ to the script:
# === Settings for Rollout Worker processes ===
config_simple["num_workers"] = 2
config_simple["num_envs_per_worker"] = 1
config_simple["create_env_on_driver"] = False
config_simple["rollout_fragment_length"] = 200
config_simple["batch_mode"] = "complete_episodes"
config_simple["train_batch_size"] = 400
# === Resource Settings ===
config_simple["num_gpus"] = 0 # For driver, later I will use GPU.
config_simple["num_cpus_per_worker"] = 1
config_simple["num_gpus_per_worker"] = 0
I cannot get this to work. I have two problems which block me from proceeding further.
1. I can see that my model trains, from the print() statement in the training iteration loop. However, when I evaluate the model, it always produces the exact same (bad) results; it looks like the output of evaluating a completely untrained agent. If I do many more training iterations, the results start to change a bit, but the performance is still much worse than after the same number of training iterations without the distributed compute config above. In the same amount of time it never learns the maximum score, even though many more CPU clock cycles have been spent. Clearly I'm doing or understanding something wrong. Can someone help me figure out what?
2. The print from my callback's on_sample_end() always appears about 5 times. In my config I have config_simple["rollout_fragment_length"] = 200 and config_simple["train_batch_size"] = 400, so I would expect 2 rollouts to be enough. Why are there more than 2? When I run without the distributed compute config above, I get many, many prints per iteration (200 or 400); I expected 1, since 200 should be the default value for both of these settings. The on_learn_on_batch() callback happens once per iteration, as expected, in both cases. So again, I clearly misunderstand something, can someone enlighten me? Thank you! (This is of less importance than the first point, but maybe these observations are related somehow.)
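For completeness, my callback is essentially just the following (a simplified sketch; the prints shown here are illustrative, my real callback does a bit more):
from ray.rllib.agents.callbacks import DefaultCallbacks

class MyCallback(DefaultCallbacks):
    def on_sample_end(self, *, worker, samples, **kwargs):
        # Called at the end of each RolloutWorker.sample() call.
        print("on_sample_end, batch size:", samples.count)

    def on_learn_on_batch(self, *, policy, train_batch, result, **kwargs):
        # Called each time the policy is about to learn on a train batch.
        print("on_learn_on_batch, batch size:", train_batch.count)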
My program follows the lines below. It runs fine and produces the expected results until the distributed compute config above is added; then the results become unexpected (I expected not much change).
(I hope this is enough information and I don't have to post all of my custom model/environment/exploration classes. They seem to work fine when the config above is not used.)
# IMPORTS
... # All the imports
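# For completeness, these are (roughly) the imports the snippet below relies on;
# my custom classes (MyCustomEnvironment, MyCustomModelV0, MyCustomExploration,
# MyCallback) come from my own modules, which I omit for brevity.
import psutil
import ray
from ray.rllib.agents.dqn.simple_q import DEFAULT_CONFIG, SimpleQTrainer
from ray.rllib.models import ModelCatalog
from ray.tune.registry import register_env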
# EXPERIMENT CONFIG
n_iter = 5 # Number of training iterations.
# RAY CONFIG
config_simple = DEFAULT_CONFIG.copy() # from ray.rllib.agents.dqn.simple_q import DEFAULT_CONFIG
config_simple['seed'] = 1
config_simple["model"] = {"custom_model": "my_custom_model_v0",
"custom_model_config": {"config_arg_1": some_value,
"config_arg_2": some_value}}
# === Environment Settings ===
config_simple["env"] = "stack-v0"
# === Debug Settings ===
config_simple["log_level"] = "WARN"
# === Deep Learning Framework Settings ===
config_simple["framework"] = "torch"
# === Exploration Settings ===
config_simple["explore"] = True
config_simple["exploration_config"] = { # The Exploration class to use
"type": MyCustomExploration,
# Config for the Exploration class' constructor:
"initial_epsilon": 1.0,
"final_epsilon": 0.02,
"epsilon_timesteps": 5000,
}
# === API deprecations/simplifications/changes ===
config_simple["_disable_preprocessor_api"] = True
# === Evaluation settings ===
config_simple["evaluation_interval"] = n_iter
config_simple["evaluation_duration"] = 1
# === Callback settings ===
config_simple["callbacks"] = MyCallback
# START RAY
num_cpus = psutil.cpu_count(logical=False)
ray.init(num_cpus=num_cpus, ignore_reinit_error=True)
# REGISTER STUFF
# environment
register_env("my_custom_env-v0", lambda config: MyCustomEnvironment())
# model
ModelCatalog.register_custom_model("my_custom_model_v0", MyCustomModelV0)
# CREATE AGENT
agent = SimpleQTrainer(config=config_simple,
                       env="my_custom_env-v0")
# TRAIN
status = "{:2d} reward {:6.2f} ; {:6.2f} ; {:6.2f} len {:4.2f}"
for n in range(n_iter):
    result = agent.train()
    print(status.format(n + 1, result["episode_reward_min"], result["episode_reward_mean"],
                        result["episode_reward_max"], result["episode_len_mean"]))
agent.stop()
# EVALUATE
agent.evaluate()
env = MyCustomEnvironment()
sum_reward = 0
# Run episode
state = env.reset()
explore = False  # Assumed here: evaluate with greedy (non-exploratory) actions.
while True:
    action = agent.compute_single_action(state, explore=explore)
    state, reward, done, info = env.step(action)
    sum_reward += reward
    if done:
        break
# SHOW OUTPUT
print(sum_reward)