Scaling combinatorial optimization with RLlib

1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.

2. Environment:

  • Ray version: 2.44.1
  • Python version: 3.10

3. What happened vs. what you expected:

  • Expected: An agent that learns to reduce the total picking cost in a warehouse by finding effective swap sequences between bins.
  • Actual: The agent quickly converges to repetitive or suboptimal actions and often gets stuck in a loop. As the number of bins increases, performance degrades due to the exploding action space.

Hi all,
I am working on a combinatorial bin reordering problem using RLlib. The core idea is to optimize the arrangement of bins in a warehouse by performing pairwise swaps to minimize costs.

I am currently trying to find a conceptual RL setup to understand how to deal with the large combinatorial action space, so for now I just want to order simple numbers. At this stage, I'm not yet tackling the full real-world problem; I first want to understand how RLlib can handle this kind of swap-based combinatorial task, especially as the number of elements increases.

  • I define an environment where the agent sees a sequence of values (priorities and demands).
  • The agent can only perform swap actions (i.e., swap two demands at a time).
    self.swap_actions = list(itertools.combinations(range(len(bins[0])), 2))
    self.action_space = spaces.Discrete(len(self.swap_actions))
  • The goal is to sort or reorder the elements to minimize a total cost function (based on the sum of priority × demand).
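For concreteness, here is a minimal, simplified sketch of the kind of environment I mean. The class name `BinSwapEnv` and the config keys `num_bins` / `max_steps` are just placeholders, and the real environment has more structure, but the swap action space and the cost function are as described above:

```python
import itertools

import gymnasium as gym
import numpy as np
from gymnasium import spaces


class BinSwapEnv(gym.Env):
    """Toy bin-reordering environment: each discrete action swaps two positions."""

    def __init__(self, config=None):
        config = config or {}
        self.num_bins = config.get("num_bins", 6)
        self.max_steps = config.get("max_steps", 50)
        # One discrete action per unordered pair of positions.
        self.swap_actions = list(itertools.combinations(range(self.num_bins), 2))
        self.action_space = spaces.Discrete(len(self.swap_actions))
        # Observation: fixed priorities followed by the current demand ordering.
        self.observation_space = spaces.Box(
            low=0.0, high=1.0, shape=(2 * self.num_bins,), dtype=np.float32
        )

    def _cost(self):
        # Total cost = sum over bins of priority * demand.
        return float(np.dot(self.priorities, self.demands))

    def _obs(self):
        return np.concatenate([self.priorities, self.demands]).astype(np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.priorities = np.sort(self.np_random.random(self.num_bins)).astype(np.float32)
        self.demands = self.np_random.random(self.num_bins).astype(np.float32)
        self.steps = 0
        self.prev_cost = self._cost()
        return self._obs(), {}

    def step(self, action):
        i, j = self.swap_actions[action]
        self.demands[i], self.demands[j] = self.demands[j], self.demands[i]
        cost = self._cost()
        reward = self.prev_cost - cost  # dense reward: improvement in cost
        self.prev_cost = cost
        self.steps += 1
        truncated = self.steps >= self.max_steps
        return self._obs(), reward, False, truncated, {}
```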

I am currently trying to solve it with a simple PPO setup, but the agent gets stuck repeating the same actions as I increase the number of elements.
I have tried many different approaches, such as other algorithms, a MultiDiscrete action space, and different reward designs.
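The PPO setup is roughly the following (a simplified sketch; the hyperparameter values are placeholders, and I mainly experimented with raising the entropy coefficient to delay the collapse onto a few repeated swaps):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Rough sketch of the training setup; hyperparameter values are placeholders.
config = (
    PPOConfig()
    .environment(env=BinSwapEnv, env_config={"num_bins": 8, "max_steps": 50})
    .env_runners(num_env_runners=2)
    .training(
        train_batch_size=4000,
        lr=3e-4,
        entropy_coeff=0.01,  # more entropy to discourage collapsing onto a few swaps
    )
)

algo = config.build()
for _ in range(50):
    result = algo.train()
```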

Now I am looking for best practices on how to deal with problems that have such large action spaces.

Any ideas and suggestions are welcome!
Thanks!

Combinatorial optimization with PPO has been investigated by many researchers; off the top of my head, I would refer to the following:

Hubbs et al. (2020), OR-Gym: A Reinforcement Learning Library for Operations Research Problems. They used PPO with an early Ray version; the code is also available on GitHub.

While their conceptual approach is a bit different (i.e., their action space selects among heuristics), you may also have a look at
Kallestad et al. (2023), A general deep reinforcement learning hyperheuristic framework for solving combinatorial optimization problems.

Training in multiple stages, as done in curriculum learning approaches, can help reduce sparsity in the reward function, which can be related to, or a side effect of, a large action space.
Li et al. (2023), A two-stage RNN-based deep reinforcement learning approach for solving the parallel machine scheduling problem with due dates and family setups.
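To sketch what such a staged setup could look like for the swap problem above (a generic illustration, not taken from the paper; all names and values are made up): keep the swap action space fixed and only control how scrambled the initial arrangement is, then train the same policy through stages of increasing difficulty.

```python
import numpy as np

def scrambled_start(num_bins, num_shuffles, rng):
    """Curriculum knob: start from a near-optimal ordering and apply only
    `num_shuffles` random swaps, so early stages need only short swap
    sequences and the improvement reward is much less sparse."""
    demands = np.sort(rng.random(num_bins))[::-1]  # start near a low-cost ordering
    for _ in range(num_shuffles):
        i, j = rng.integers(0, num_bins, size=2)
        demands[i], demands[j] = demands[j], demands[i]
    return demands

# Possible stage schedule (placeholder values): pass "num_shuffles" through
# env_config and continue training the same policy from stage to stage.
stage_schedule = [1, 3, 8, 15]
```

The key point is that the observation and action spaces stay identical across stages; only the start-state distribution gets harder, so the policy keeps seeing useful reward signal while the effective search depth grows.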