How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi all,
I am currently experiencing an issue where increasing the number of rollout workers from 1-2 to 7 significantly affects the performance of the algorithm: the 1- and 2-worker runs reach a higher reward and complete the episode correctly, while the 7-worker run does not.
Below are images of the mean reward and of a custom callback that reports the number of successful episodes (sketched further below). The 2-worker model converges to a high success rate, drops back down, but then recovers to a success rate of 1, i.e. it completes essentially every episode successfully.
The 7-worker model, on the other hand, fails to complete even a single episode successfully.
- 7 Workers: Success Rate (fraction of episodes completed successfully)
- 7 Workers: Reward Mean
- 2 Workers: Reward Mean
- 2 Workers: Success Rate (fraction of episodes completed successfully)
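The success-rate callback is roughly the minimal sketch below (old callbacks API; the "success" info key is just a placeholder for however the env flags a completed episode):

from ray.rllib.algorithms.callbacks import DefaultCallbacks

class SuccessRateCallback(DefaultCallbacks):
    """Logs 1.0 for episodes the env marks as successful, else 0.0."""

    def on_episode_end(self, *, worker, base_env, policies, episode, **kwargs):
        # Assumes the env writes a boolean "success" flag into the last step's info dict.
        info = episode.last_info_for() or {}
        episode.custom_metrics["success"] = float(info.get("success", False))

RLlib then reports this as custom_metrics/success_mean, which is what the success-rate curves above show.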
I am using the SAC algorithm with the following configuration options:
# Works for both torch and tf.
num_workers: 7
num_gpus: 1
num_cpus_per_worker: 2
framework: torch
gamma: 1
twin_q: True
# These probably do nothing.
q_model_config:
  fcnet_hiddens: [512, 512, 1024]
  fcnet_activation: relu
policy_model_config:
  fcnet_hiddens: [512, 512, 1024]
  fcnet_activation: relu
#model:
#  fcnet_hiddens: [256, 512]
#  fcnet_activation: tanh
#batch_mode: complete_episodes
# Temporary change because CARLA crashed for some reason.
recreate_failed_workers: True
# Do hard syncs.
# Soft syncs seem to work less reliably for discrete action spaces.
tau: 1
#lr: 0.001
target_network_update_freq: 8000
#initial_alpha: 0.2
# auto = 0.98 * -log(1/|A|)
target_entropy: auto
clip_rewards: False
n_step: 1
rollout_fragment_length: 1
replay_buffer_config:
  type: MultiAgentPrioritizedReplayBuffer
  capacity: 400000
  prioritized_replay_alpha: 0.6
  prioritized_replay_beta: 0.4
  prioritized_replay_eps: 0.000001
store_buffer_in_checkpoints: False
# How many steps to sample before learning starts.
# (The SAC paper uses 20k random timesteps, which is not exactly the same, but
# seems to work nevertheless; DQN style: filling up the buffer a bit before learning.)
num_steps_sampled_before_learning_starts: 10000
train_batch_size: 256
min_sample_timesteps_per_iteration: 4
optimization:
  actor_learning_rate: 0.00005
  critic_learning_rate: 0.00005
  entropy_learning_rate: 0.00005
exploration_config:
  type: EpsilonGreedy
  initial_epsilon: 1.0
  final_epsilon: 0.01
  epsilon_timesteps: 500000
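In case it helps with reproducing the setup, here is roughly the same configuration through the Python config API on Ray 2.x (a sketch only; the environment, exploration_config, callbacks, model configs, and a few other keys are left out, and some argument names differ between Ray versions):

from ray.rllib.algorithms.sac import SACConfig

config = (
    SACConfig()
    .framework("torch")
    .rollouts(num_rollout_workers=7, rollout_fragment_length=1)
    .resources(num_gpus=1, num_cpus_per_worker=2)
    .training(
        gamma=1.0,
        twin_q=True,
        tau=1.0,  # hard target syncs
        n_step=1,
        target_entropy="auto",
        target_network_update_freq=8000,
        train_batch_size=256,
        replay_buffer_config={
            "type": "MultiAgentPrioritizedReplayBuffer",
            "capacity": 400000,
            "prioritized_replay_alpha": 0.6,
            "prioritized_replay_beta": 0.4,
            "prioritized_replay_eps": 1e-6,
        },
        optimization_config={
            "actor_learning_rate": 5e-5,
            "critic_learning_rate": 5e-5,
            "entropy_learning_rate": 5e-5,
        },
    )
    # .environment(...) with the registered CARLA env is set in the actual run.
)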
I have tried changing target_network_update_freq, but it does not seem to make much difference apart from producing a smoother reward curve that still contains no successful episodes.
I am leaning towards rollout_fragment_length being set incorrectly, but I am not sure what values to try. Is there a way to determine what rollout_fragment_length should be based on the other settings, or a reference value I should compare it against? Could train_batch_size be affecting this in any way?
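One relationship I came across that might be relevant (based on my reading of the training_intensity description in the SAC/DQN configs, so please correct me if I am misreading it): with rollout_fragment_length fixed at 1, the natural ratio of trained-on timesteps to sampled timesteps shrinks as workers are added:

# RLlib documents the "natural" training intensity for SAC/DQN-style algos as
#   train_batch_size / (rollout_fragment_length * num_workers * num_envs_per_worker).
# Back-of-the-envelope numbers for my two runs (num_envs_per_worker at its default of 1):
def natural_training_intensity(train_batch_size, rollout_fragment_length,
                               num_workers, num_envs_per_worker=1):
    return train_batch_size / (rollout_fragment_length * num_workers * num_envs_per_worker)

print(natural_training_intensity(256, 1, 2))  # 128.0 for the 2-worker run
print(natural_training_intensity(256, 1, 7))  # ~36.6 for the 7-worker run

If that is right, the 7-worker run does far fewer gradient updates per sampled env step than the 2-worker run unless training_intensity (or rollout_fragment_length / train_batch_size) is adjusted, so that is another knob I am unsure about.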
Have you noticed any other behavior that might point to the cause?
Thank you in advance for any help; any leads would be highly appreciated.