For my application, the model interacts with two environments simultaneously. The model starts with a shared encoder and then branches into two actor heads. Each environment produces its own reward, and each reward is used to train the corresponding branch of the network. For every timestep in environment 1, environment 2 finishes one full episode (T >= 1). What's the most Ray-ish way of handling this? Thank you.
My current ideas:
- I could write a wrapper around the two environments and alternate between them, recording which environment each transition came from and sorting transitions into two separate batches at training time. Is this a good idea? Are there any pitfalls with this implementation?
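For reference, here is a minimal, dependency-free sketch of the wrapper idea: two toy stand-in environments (hypothetical placeholders for your real ones), strict alternation between them, an `env_id` tag on every transition, and a helper that splits a mixed rollout into one batch per actor branch. All class and function names here are made up for illustration.

```python
import random

class ToyEnv:
    """Hypothetical stand-in with a Gym-like reset/step API."""
    def __init__(self, episode_len):
        self.episode_len = episode_len
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.t >= self.episode_len
        return float(self.t), random.random(), done, {}

class AlternatingWrapper:
    """Alternates between two environments, tagging each transition with
    the index of the environment it came from so transitions can later be
    sorted into two training batches (one per actor branch)."""
    def __init__(self, env_a, env_b):
        self.envs = [env_a, env_b]
        self.obs = [e.reset() for e in self.envs]
        self.turn = 0  # which env acts next

    def step(self, action):
        i = self.turn
        obs, reward, done, info = self.envs[i].step(action)
        transition = {"env_id": i, "obs": self.obs[i], "action": action,
                      "reward": reward, "done": done}
        # auto-reset the fast env so env 2 can finish an episode per env-1 step
        self.obs[i] = self.envs[i].reset() if done else obs
        self.turn = 1 - self.turn  # alternate environments
        return transition

def sort_into_batches(transitions):
    """Split a mixed rollout into one batch per environment."""
    batches = {0: [], 1: []}
    for tr in transitions:
        batches[tr["env_id"]].append(tr)
    return batches

# Collect a short mixed rollout: env 1 has long episodes, env 2 has
# single-step episodes, matching the "one episode per timestep" setup.
wrapper = AlternatingWrapper(ToyEnv(episode_len=5), ToyEnv(episode_len=1))
rollout = [wrapper.step(action=0) for _ in range(10)]
batches = sort_into_batches(rollout)
print(len(batches[0]), len(batches[1]))  # strict alternation gives equal counts
```

One caveat with this design: strict alternation couples the two sample streams, so if the episode-length ratio T varies, one batch fills faster than the other. In Ray terms, this pattern also resembles RLlib's `MultiAgentEnv` with one policy per branch and a shared encoder, which may be worth comparing against a hand-rolled wrapper.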