RL problem
I have a single agent with a custom reward function, custom actions, a custom environment, and custom observations; it uses a generic SAC policy with a neural network. The agent iterates through 1e6 episodes serially to update the SAC policy, which is time-consuming.
Want to parallelize
To speed up learning, I want multiple agents (say, 10) stepping in parallel. At the end of each episode, the pooled experience batch-updates the same single SAC policy, and all agents use the updated policy in the next episode. This should reduce the learning time. Alternatively, any other Ray methods to speed up the RL training problem stated above are appreciated.
I am new to both reinforcement learning and Ray. Could you provide a simple, step-by-step, complete working Ray code example with an explanation of how to achieve the above? I have already scoured the Ray documentation but am still unsure how to do this exactly, so any help specific to my problem is much appreciated.
Thank you.