I am working on a banking-transactions project that uses a custom Gym environment with the following MDP components: the states are the account balances, and the actions are the different combinations of source and target accounts together with discrete levels of transaction amounts.

Initially, I used the Tune library for hyperparameter tuning followed by `rllib train`, sampling around 5 configurations from the hyperparameter space. The results were not consistent across repeated experiments. I then switched to the default config of the algorithms and, after tweaking the reward function, achieved reproducible convergence.

However, when I changed the number of accounts and the number of allowed transaction-amount levels, the state and action spaces of the MDP changed, and I am no longer achieving reproducible convergence.

What should I do to make the code adapt to changes in the scale of the problem while keeping results reproducible? Should I resort to a manual lookup for each hyperparameter, indexed by the changes in the state and action spaces?
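For concreteness, here is a minimal sketch of the kind of environment I mean. The class name, reward values, and transfer rules are simplified placeholders I made up for this question, not my actual code; the point is that the action count grows multiplicatively with the number of accounts and amount levels:

```python
# Hypothetical sketch of a banking-transactions environment whose state and
# action spaces scale with its configuration (all names are illustrative).
import itertools


class BankingEnv:
    """Gym-style env: state = account balances; action = (source, target, amount level)."""

    def __init__(self, n_accounts=4, n_amount_levels=3, initial_balance=100.0):
        self.n_accounts = n_accounts
        self.initial_balance = initial_balance
        # Enumerate the discrete action space: every (source, target, level)
        # combination with source != target.
        self.actions = [
            (s, t, lvl)
            for s, t in itertools.permutations(range(n_accounts), 2)
            for lvl in range(n_amount_levels)
        ]
        # Grows as n_accounts * (n_accounts - 1) * n_amount_levels.
        self.action_space_n = len(self.actions)
        self.reset()

    def reset(self):
        self.balances = [self.initial_balance] * self.n_accounts
        return list(self.balances)

    def step(self, action_idx):
        source, target, level = self.actions[action_idx]
        amount = (level + 1) * 10.0          # map amount level -> transaction size
        if self.balances[source] >= amount:  # transfer only if funds suffice
            self.balances[source] -= amount
            self.balances[target] += amount
            reward = 1.0                     # placeholder reward
        else:
            reward = -1.0                    # penalize invalid transfers
        done = False
        return list(self.balances), reward, done, {}


env = BankingEnv(n_accounts=4, n_amount_levels=3)
print(env.action_space_n)  # 4 * 3 * 3 = 36
```

Changing `n_accounts` from 4 to 8 (with 3 levels) already takes the action count from 36 to 168, which is the kind of scale change after which my previously working defaults stop converging reproducibly.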
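To make the last question concrete: rather than a hard-coded lookup table, I was considering a parametric rule along these lines. The 4-account/3-level baseline and the logarithmic scaling are placeholder assumptions for illustration, not tuned values:

```python
# Hypothetical rule deriving hyperparameters from the MDP's scale instead of
# maintaining a manual lookup per configuration. The baseline (36 actions =
# 4 accounts * 3 amount levels) and the log scaling are illustrative guesses.
import math

BASELINE_ACTIONS = 4 * 3 * 3  # config that converged reproducibly (assumed)


def scaled_hyperparams(n_accounts, n_amount_levels, base_lr=5e-4, base_batch=4000):
    n_actions = n_accounts * (n_accounts - 1) * n_amount_levels
    # Heuristic: larger action spaces get more data per update and a gentler
    # learning rate, scaled by the log-ratio of action counts.
    ratio = math.log2(max(n_actions, 2)) / math.log2(BASELINE_ACTIONS)
    return {
        "train_batch_size": int(base_batch * ratio),
        "lr": base_lr / ratio,
        "seed": 0,  # fix the seed so repeated runs are comparable
    }


print(scaled_hyperparams(4, 3))
print(scaled_hyperparams(8, 5))
```

Is something like this a reasonable direction, or is there a standard way in Tune/RLlib to get hyperparameters that transfer across problem scales?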