1. Severity of the issue: (select one)
Medium: Significantly affects my productivity, but I can find a workaround.
2. Environment:
- Ray version: 2.44.1
- Python version: 3.10
3. What happened vs. what you expected:
- Expected: An agent that learns to reduce the total picking cost in a warehouse by finding effective swap sequences between bins.
- Actual: The agent quickly converges to repetitive or suboptimal actions and often gets stuck in a loop. As the number of bins increases, performance degrades because the action space explodes.
Hi all,
I am working on a combinatorial bin reordering problem using RLlib. The core idea is to optimize the arrangement of bins in a warehouse by performing pairwise swaps to minimize costs.
I am currently trying to find a conceptual RL setup to understand how to deal with the large combinatorial action space, so for now I just want to order simple numbers. At this stage I am not yet tackling the full real-world problem; I first want to understand how RLlib can handle this kind of swap-based combinatorial task, especially as the number of elements increases.
- I define an environment where the agent sees a sequence of values (priorities and demands).
- The agent can only perform swap actions (i.e., swap two demands at a time).
self.swap_actions = list(itertools.combinations(range(len(bins[0])), 2))
self.action_space = spaces.Discrete(len(self.swap_actions))
- The goal is to sort or reorder the elements to minimize a total cost function (based on the sum of priority × demand). A minimal, self-contained sketch of this toy environment follows below this list.
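For reference, here is a rough sketch of the toy environment I am experimenting with. The class name, the observation encoding, and the cost-delta reward are simplified placeholders for this post, not my exact code:

```python
import itertools

import gymnasium as gym
import numpy as np
from gymnasium import spaces


class BinSwapEnv(gym.Env):
    """Toy bin-reordering task: swap two demands per step to reduce
    sum(priority * demand). Names and reward shaping are simplified."""

    def __init__(self, config=None):
        config = config or {}
        self.n = config.get("num_bins", 8)
        self.max_steps = config.get("max_steps", 4 * self.n)
        self.priorities = np.arange(1, self.n + 1, dtype=np.float32)
        # One discrete action per unordered index pair (i, j).
        self.swap_actions = list(itertools.combinations(range(self.n), 2))
        self.action_space = spaces.Discrete(len(self.swap_actions))
        # Observation: normalized priorities and demands, concatenated.
        self.observation_space = spaces.Box(
            0.0, 1.0, shape=(2 * self.n,), dtype=np.float32
        )

    def _cost(self):
        return float(np.dot(self.priorities, self.demands))

    def _obs(self):
        return np.concatenate(
            [self.priorities / self.priorities.max(),
             self.demands / self.demands.max()]
        ).astype(np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        # Start from a random permutation of the demand values.
        self.demands = self.np_random.permutation(self.priorities).astype(np.float32)
        self.steps = 0
        return self._obs(), {}

    def step(self, action):
        i, j = self.swap_actions[int(action)]
        cost_before = self._cost()
        self.demands[[i, j]] = self.demands[[j, i]]
        # Reward is the cost reduction achieved by this swap (can be negative).
        reward = cost_before - self._cost()
        self.steps += 1
        truncated = self.steps >= self.max_steps
        return self._obs(), reward, False, truncated, {}
```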
I am currently trying to solve it with a simple PPO setup (roughly the config sketched below), but the agent gets stuck repeating the same actions as I increase the number of bins.
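The PPO setup looks roughly like this; the hyperparameters and env_config values are placeholders, not a tuned configuration:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Uses the BinSwapEnv sketch from above.
config = (
    PPOConfig()
    .environment(env=BinSwapEnv, env_config={"num_bins": 8, "max_steps": 32})
    .training(lr=3e-4, train_batch_size=4000)
)

algo = config.build()
for _ in range(20):
    results = algo.train()  # monitor episode return to see it plateau early
```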
I have tried many different approaches, such as other algorithms, a MultiDiscrete action space (roughly as sketched below), and different reward designs.
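For context, the MultiDiscrete variant I tried looked roughly like this (the agent picks the two bin indices independently, with equal indices treated as a no-op; names are again placeholders):

```python
from gymnasium import spaces

n = 8  # number of bins
# Two independent heads: the agent picks index i and index j separately.
action_space = spaces.MultiDiscrete([n, n])


def apply_swap(demands, action):
    i, j = int(action[0]), int(action[1])
    if i != j:  # identical indices are treated as a no-op
        demands[i], demands[j] = demands[j], demands[i]
    return demands
```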
Now I am looking for best practices on how to deal with problems that have such large action spaces.
Any ideas and suggestions are welcome!
Thanks!