Hi @rob65, welcome to the community.
It is not used during training, only during the initial setup of the policy. There it is used to try to determine automatically which items, and how many timesteps, are required in the sample_batch during rollouts and training. This process uses dummy data that is not obtained from the environment.
When it is actually used during training, the entire sample_batch is used. Does this help, or are you still seeing an issue?
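
In case it helps to picture the dummy-data step, here is a toy sketch of the general idea. It is not the library's actual implementation: `TrackingBatch`, `toy_loss`, and `infer_required_keys` are hypothetical names made up for illustration. A batch filled with zeros is passed through the loss once, and the keys the loss reads are recorded.

```python
class TrackingBatch(dict):
    """Toy stand-in for a sample batch that records which keys are read."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.accessed_keys = set()

    def __getitem__(self, key):
        # Remember every key that the loss function looks up.
        self.accessed_keys.add(key)
        return super().__getitem__(key)


def toy_loss(batch):
    # Hypothetical loss that only reads "obs" and "rewards".
    return sum(batch["obs"]) - sum(batch["rewards"])


def infer_required_keys(loss_fn, candidate_keys, batch_size=4):
    # Build a batch of zeros; nothing here comes from an environment.
    dummy = TrackingBatch({key: [0.0] * batch_size for key in candidate_keys})
    loss_fn(dummy)  # one dry run, only to see which keys get touched
    return dummy.accessed_keys


required = infer_required_keys(toy_loss, ["obs", "actions", "rewards", "dones"])
print(sorted(required))  # ['obs', 'rewards']
```

The dry run only decides what needs to be collected. Once real training starts, the loss is called on the real batch, with no dummy data involved.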