Why is APEX logging metrics for only 1/3 of the workers?

Hi all,

I noticed after using APEX for a while that its execution plan only reports metrics for a third of the remote workers. This isn’t documented, and it affects how some of the reported metrics should be interpreted. For example, the number of episodes per iteration is wrong (while the numbers of steps sampled and trained are correct), and all custom metrics are computed on this subset of workers only.

Is there a particular reason for reporting only 1/3 of the workers? Should this remain the default behavior or at least be configurable?

Thanks,
Thomas

Hey @thomaslecat, great question!
The answer is in rllib/agents/dqn/apex.py (APEX’s execution plan), see below.
I’m guessing we did this to avoid collecting metrics from those workers that have a high epsilon (these are mostly acting randomly anyway). APEX uses the PerWorkerEpsilonGreedy exploration component, which splits up (and pins) epsilon values per worker.

    # Only report metrics from the workers with the lowest 1/3 of epsilons.
    selected_workers = workers.remote_workers()[
        -len(workers.remote_workers()) // 3:]

We should probably make this configurable. Also, I agree that it’s not ideal to have, e.g., the number of episodes counted based on this cutoff.
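
To make the slice above concrete, here is a minimal, standalone sketch (plain Python, no Ray needed) of which workers end up being reported. The helper names `per_worker_epsilon` and `select_reporting_workers`, and the `denominator` parameter, are hypothetical and only for illustration. The epsilon formula is an assumption based on the Ape-X paper's per-actor schedule (base 0.4, exponent factor 7), not copied from RLlib. Note that because the unary minus is applied before `//`, the slice rounds up, so 10 workers give 4 reporting workers rather than 3.

```python
def per_worker_epsilon(worker_index, num_workers, base=0.4, alpha=7.0):
    """Fixed epsilon for a 1-based worker index.

    ASSUMPTION: follows the Ape-X paper's schedule,
    epsilon_i = base ** (1 + i / (N - 1) * alpha); not copied from RLlib.
    """
    if num_workers <= 1:
        return base
    return base ** (1 + worker_index / (num_workers - 1) * alpha)


def select_reporting_workers(remote_workers, denominator=3):
    """Same slice as in the apex.py snippet, applied to a plain list.

    `denominator` is a HYPOTHETICAL knob (not an RLlib option): 3 reproduces
    the current behavior, 1 would report metrics from all workers.
    -len(x) // d is floor(-len(x) / d), so the selected count rounds up.
    """
    return remote_workers[-len(remote_workers) // denominator:]


num_workers = 9
worker_indices = list(range(1, num_workers + 1))  # stand-ins for remote workers

selected = select_reporting_workers(worker_indices)
print(selected)  # [7, 8, 9]

# Under the assumed schedule, epsilon shrinks as the worker index grows,
# so the tail of the list holds the most greedy (lowest-epsilon) workers.
for idx in worker_indices:
    marker = "reported" if idx in selected else "ignored"
    print(idx, round(per_worker_epsilon(idx, num_workers), 5), marker)
```

This also shows why episode counts look off: only the episodes finished by the selected tail of the worker list are aggregated, while the step counters are tracked separately.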