Scaling combinatorial optimization with RLlib

1. Severity of the issue: (select one)
Medium: Significantly affects my productivity but can find a workaround.

2. Environment:

  • Ray version: 2.44.1
  • Python version: 3.10

3. What happened vs. what you expected:

  • Expected: An agent that learns to reduce the total picking cost in a warehouse by finding effective swap sequences between bins.
  • Actual: The agent quickly converges to repetitive or suboptimal actions and often gets stuck in a loop. As the number of bins increases, performance degrades due to the exploding action space.

Hi all,
I am working on a combinatorial bin reordering problem using RLlib. The core idea is to optimize the arrangement of bins in a warehouse by performing pairwise swaps to minimize costs.

I am currently trying to find a conceptual RL setup to understand how to deal with the large combinatorial action space, so for now I just want to order simple numbers. At this stage, I'm not yet tackling the full real-world problem; I first want to understand how RLlib can handle this kind of swap-based combinatorial task, especially as the number of elements increases.

  • I define an environment where the agent sees a sequence of values (priorities and demands).
  • The agent can only perform swap actions (i.e., swap two demands at a time).
    self.swap_actions = list(itertools.combinations(range(len(bins[0])), 2))
    self.action_space = spaces.Discrete(len(self.swap_actions))
  • The goal is to sort or reorder the elements to minimize a total cost function (based on the sum of priority × demand).
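For concreteness, here is a minimal, simplified sketch of the kind of environment I mean. The class name `BinSwapEnv` and the config keys `num_bins` / `max_steps` are just placeholders, and the real environment has more structure, but the swap action space and the cost function are as described above:

```python
import itertools

import gymnasium as gym
import numpy as np
from gymnasium import spaces


class BinSwapEnv(gym.Env):
    """Toy bin-reordering environment: each discrete action swaps two positions."""

    def __init__(self, config=None):
        config = config or {}
        self.num_bins = config.get("num_bins", 6)
        self.max_steps = config.get("max_steps", 50)
        # One discrete action per unordered pair of positions.
        self.swap_actions = list(itertools.combinations(range(self.num_bins), 2))
        self.action_space = spaces.Discrete(len(self.swap_actions))
        # Observation: fixed priorities followed by the current demand ordering.
        self.observation_space = spaces.Box(
            low=0.0, high=1.0, shape=(2 * self.num_bins,), dtype=np.float32
        )

    def _cost(self):
        # Total cost = sum over bins of priority * demand.
        return float(np.dot(self.priorities, self.demands))

    def _obs(self):
        return np.concatenate([self.priorities, self.demands]).astype(np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.priorities = np.sort(self.np_random.random(self.num_bins)).astype(np.float32)
        self.demands = self.np_random.random(self.num_bins).astype(np.float32)
        self.steps = 0
        self.prev_cost = self._cost()
        return self._obs(), {}

    def step(self, action):
        i, j = self.swap_actions[action]
        self.demands[i], self.demands[j] = self.demands[j], self.demands[i]
        cost = self._cost()
        reward = self.prev_cost - cost  # dense reward: improvement in cost
        self.prev_cost = cost
        self.steps += 1
        truncated = self.steps >= self.max_steps
        return self._obs(), reward, False, truncated, {}
```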

I am currently trying to solve it with a simple PPO setup, but the agent gets stuck repeating the same actions as I increase the number of elements.
I have tried many different approaches, such as other algorithms, a MultiDiscrete action space, and different reward designs.
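The PPO setup is roughly the following (a simplified sketch; the hyperparameter values are placeholders, and I mainly experimented with raising the entropy coefficient to delay the collapse onto a few repeated swaps):

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Rough sketch of the training setup; hyperparameter values are placeholders.
config = (
    PPOConfig()
    .environment(env=BinSwapEnv, env_config={"num_bins": 8, "max_steps": 50})
    .env_runners(num_env_runners=2)
    .training(
        train_batch_size=4000,
        lr=3e-4,
        entropy_coeff=0.01,  # more entropy to discourage collapsing onto a few swaps
    )
)

algo = config.build()
for _ in range(50):
    result = algo.train()
```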

Now I am looking for best practices on how to deal with problems that have such large action spaces.

Any ideas and suggestions are welcome!
Thanks!

Combinatorial optimization with PPO has been investigated by many researchers; off the top of my head, I would refer to the following:

Hubbs et al. (2020), OR-Gym: A Reinforcement Learning Library for Operations Research Problems. They used PPO with an early Ray version; the code is also available on GitHub.

While their conceptual approach is a bit different (i.e., their action space selects among heuristics), you may also have a look at
Kallestad et al. (2023), A general deep reinforcement learning hyperheuristic framework for solving combinatorial optimization problems.

Training in multiple stages, as done in curriculum learning approaches, can help reduce sparsity in the reward function, which can be related to, or a side effect of, a large action space.
Li et al. (2023), A two-stage RNN-based deep reinforcement learning approach for solving the parallel machine scheduling problem with due dates and family setups.
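To sketch what such a staged setup could look like for the swap problem above (a generic illustration, not taken from the paper; all names and values are made up): keep the swap action space fixed and only control how scrambled the initial arrangement is, then train the same policy through stages of increasing difficulty.

```python
import numpy as np

def scrambled_start(num_bins, num_shuffles, rng):
    """Curriculum knob: start from a near-optimal ordering and apply only
    `num_shuffles` random swaps, so early stages need only short swap
    sequences and the improvement reward is much less sparse."""
    demands = np.sort(rng.random(num_bins))[::-1]  # start near a low-cost ordering
    for _ in range(num_shuffles):
        i, j = rng.integers(0, num_bins, size=2)
        demands[i], demands[j] = demands[j], demands[i]
    return demands

# Possible stage schedule (placeholder values): pass "num_shuffles" through
# env_config and continue training the same policy from stage to stage.
stage_schedule = [1, 3, 8, 15]
```

The key point is that the observation and action spaces stay identical across stages; only the start-state distribution gets harder, so the policy keeps seeing useful reward signal while the effective search depth grows.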