Setting for Infinite Horizon MDPs

Hi all,

I am using DQN to solve a simple economic problem with infinite horizon. Right now, I am setting it as

soft_horizon = True
no_done_at_end = True

Is this correct? The results I am getting make me doubt that the algorithm is truly discounting all the future rewards (it behaves very myopically).


I believe that’s the correct setting.

What’s going wrong in your scenario? Do you have an equivalent/similar episodic version of your problem that works better?

With those settings (soft_horizon=100 and no_done_at_end = True), the algorithms struggled to solve even the simplest dynamic problems. I wasn’t sure whether, with those settings, the algorithms were truly maximizing the infinite discounted sum of rewards or just the sum of episodic rewards. I tried with no horizon and with horizon = float("inf"), but then I get NaN in the mean episode rewards, so it was hard to get feedback, and I was also unsure whether that works. Since then, I’ve found an episodic version of the model with a large horizon that manages to learn well. Overall, my doubt is whether RLlib assumes an episodic problem under the hood.
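For intuition on why a large-horizon episodic version can work well: with a discount factor below 1, the truncated episodic sum converges quickly to the infinite discounted sum. A quick check, using a constant reward of 1 per step (numbers are illustrative only):

```python
# Compare the infinite-horizon discounted return with a truncated
# episodic sum, for a constant reward of 1 per step.
gamma = 0.99

# Closed form for the infinite geometric series: 1 / (1 - gamma)
infinite_return = 1.0 / (1.0 - gamma)

def truncated_return(horizon, gamma=gamma):
    # Discounted sum of the first `horizon` rewards
    return sum(gamma ** t for t in range(horizon))

for h in (100, 500, 1000):
    gap = infinite_return - truncated_return(h)
    print(f"horizon={h}: truncated={truncated_return(h):.2f}, gap={gap:.4f}")
```

At horizon 1000 with gamma = 0.99, the gap to the infinite sum is already well below 0.01, so truncating there changes the objective negligibly.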

I appreciate your help! I am a PhD student in economics at NYU, where one of the strong suits of the PhD is programming big macro models. I am going on the job market and am trying to convince economists that RL can be used to solve high-dimensional economic models. I am building an open-source economy simulator in Python.

RLlib is very powerful, but it’s so opaque that I am on the fence about whether it is suitable for academic research. I’ve been working with it for months and have tried to get under the hood, but the depth and interdependencies are overwhelming.

I appreciate your help!


I understand; I’m also still learning about all the different features, options, etc.

As you already figured out, the three relevant config options are horizon, soft_horizon, and no_done_at_end, as described here: RLlib Training APIs — Ray v1.4.0

I think soft_horizon needs to be a boolean, so soft_horizon = 100 doesn’t make sense. You probably mean horizon=100 and soft_horizon=True as your current setting?

If you don’t want your environment to terminate, did you try keeping horizon: None?
Do you have a custom environment, and can you simply ensure in the environment implementation that it runs forever without returning done=True?
Of course, with infinite “episodes”, you won’t get any episode-related metrics, but maybe you could log a custom metric (via the on_episode_step callback) instead?
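To illustrate the second option, here is a minimal sketch of a gym-style environment that simply never signals termination. The class, dynamics, and reward are all hypothetical placeholders; a real RLlib env would subclass gym.Env and define observation_space / action_space:

```python
import random

class InfiniteEnv:
    """Minimal gym-style environment sketch that never terminates.

    Illustrative only: a real RLlib env would subclass gym.Env and
    declare observation_space / action_space.
    """

    def reset(self):
        self.state = 0.0
        return self.state

    def step(self, action):
        # Toy dynamics: a random walk nudged by the action.
        self.state += action + random.uniform(-0.1, 0.1)
        reward = -abs(self.state)  # e.g. reward keeping the state near zero
        done = False               # never signal episode termination
        return self.state, reward, done, {}

env = InfiniteEnv()
obs = env.reset()
for _ in range(5):
    obs, reward, done, info = env.step(0.0)
    assert done is False  # the env never ends an episode on its own
```

With an env like this, any episode boundaries come only from the horizon setting in the trainer config, not from the environment itself.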

From the description here and here:

soft_horizon (bool): If True, calculate bootstrapped values as if
episode had ended, but don’t physically reset the environment
when the horizon is hit.

It seems to me that, if you don’t set a horizon, you’d want to keep soft_horizon = False so that the bootstrapped value estimates are calculated identically at each step.
But if you do have a horizon configured, you’d also need soft_horizon=True to prevent the environment from actually being reset.

I tried setting a horizon together with soft_horizon=True and no_done_at_end=True in one of my environments, which did mean the episode ran forever. But I guess the rewards were still calculated based on “episodes” of length horizon. Still, the result was OK.
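Putting the three options together, a sketch of what such a trainer config might look like (keys as documented for Ray ~1.4; the env name and numbers are placeholders):

```python
# Sketch of an RLlib (Ray ~1.4) config for a non-terminating environment:
# truncate rollouts every `horizon` steps for bootstrapping, but never
# reset the env or emit a terminal done signal.
config = {
    "env": "my_infinite_env",  # hypothetical registered env name
    "gamma": 0.99,
    "horizon": 1000,           # artificial truncation point for rollouts
    "soft_horizon": True,      # bootstrap at the horizon, don't reset the env
    "no_done_at_end": True,    # don't mark the horizon step as terminal
}
```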

Overall, I do think most RL frameworks, including RLlib, focus more on episodic scenarios. See also this related issue: [rllib] Continuous instead of episodic problem · Issue #9756 · ray-project/ray · GitHub

For my use case, I decided to just use sufficiently long episodes instead of a continuous problem, which went well. Is that an option for you?

Yes, I have found an episodic version that works well. Thanks for your help!