I am not sure I followed the part about gradients flowing through the internal states. There are no gradients during the sampling phase of the execution plan, only during the learning portion.
I do not know a nice clean way to do this in RLlib. Maybe someone else will have a better approach.
I would probably try to implement this using a custom callback. The on_episode_step method has access to the worker and the environments; the worker has access to the policies, and they in turn have the model.
What I would do is create a custom model that stores whatever values I wanted to communicate as a member variable on the model. Then, in the callback, I would take that info from the model and put it on the environment.
Keep in mind that on_episode_step is called after an action is taken, so you would be using the callback to add info to the environment for step t+1. Something like the rough sketch below.
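This is untested and written against the 1.x-style API, so treat it as a sketch: import paths and the BaseEnv method names shift a bit between RLlib versions, and set_model_info is a made-up method you would add to your own env.

```python
import torch.nn as nn

from ray.rllib.agents.callbacks import DefaultCallbacks  # ray.rllib.algorithms.callbacks in newer versions
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_modelv2 import TorchModelV2
from ray.rllib.models.torch.fcnet import FullyConnectedNetwork


class InfoSharingModel(TorchModelV2, nn.Module):
    """Wraps the default FC net and stashes a value to hand to the env."""

    def __init__(self, obs_space, action_space, num_outputs, model_config, name):
        TorchModelV2.__init__(self, obs_space, action_space, num_outputs,
                              model_config, name)
        nn.Module.__init__(self)
        self.fcnet = FullyConnectedNetwork(
            obs_space, action_space, num_outputs, model_config, name + "_fc")
        self.last_info = None  # whatever you want the env to see

    def forward(self, input_dict, state, seq_lens):
        out, state = self.fcnet(input_dict, state, seq_lens)
        # Detach so nothing here stays in the graph; sampling has no gradients anyway.
        self.last_info = out.detach().cpu().numpy()
        return out, state

    def value_function(self):
        return self.fcnet.value_function()


class ModelToEnvCallback(DefaultCallbacks):
    def on_episode_step(self, *, worker, base_env, episode, env_index=None, **kwargs):
        # Worker -> policy -> model, as described above.
        model = worker.get_policy().model
        # get_sub_environments() is get_unwrapped() in older RLlib versions.
        env = base_env.get_sub_environments()[env_index or 0]
        env.set_model_info(model.last_info)  # hypothetical method on your env


ModelCatalog.register_custom_model("info_sharing_model", InfoSharingModel)
```

Then you would point the trainer config at both pieces, something like {"callbacks": ModelToEnvCallback, "model": {"custom_model": "info_sharing_model"}}.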