Multi-agent Supply Chain Optimization with RLlib

Dear Community,
I am reaching out to seek insights and assistance on a challenging problem I am currently facing in my project.

Warehouse X is facing the complex task of optimizing its supply chain for a diverse range of 1,000 products sourced from over 200 suppliers. The goal is to avoid overstocking, prevent stockouts, and consistently meet customer demand. The products are classified as either perishable or non-perishable.

Our proposed solution:
A multi-agent environment in which each agent learns a policy for one product line. We used the PPO algorithm as implemented in Ray RLlib. Our RL environment is designed around the following actions:

  1. When to raise a purchase order
  2. Which products to order
  3. The quantity to order for each product
  4. Which supplier to consider
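For concreteness, the composite action can be thought of as a (supplier ID, order quantity) pair per product agent. A minimal sketch of decoding raw policy outputs into these two components (all names here, such as `decode_action` and `max_purchase_quantity`, are illustrative placeholders, not our actual code):

```python
# Sketch: turning raw policy outputs into the two action components.
# Names and the greedy decoding choice are illustrative assumptions.

def decode_action(supplier_logits, raw_quantity, max_purchase_quantity):
    """Map raw network outputs to (supplier_id, order_quantity)."""
    # Greedy supplier choice over the logits (sampling is also possible).
    supplier_id = max(range(len(supplier_logits)),
                      key=lambda i: supplier_logits[i])
    # Clamp the quantity into the allowed range [0, max_purchase_quantity].
    order_quantity = max(0, min(int(raw_quantity), max_purchase_quantity))
    return supplier_id, order_quantity

print(decode_action([0.1, 2.3, -0.5], 120.7, 100))  # -> (1, 100)
```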

Each agent is trained on historical sales from the past year. The environment exposes N observations, representing stock levels and trends, and 2 actions (quantity to order and supplier ID). Agents are trained in a multi-agent setting with a shared policy, and each episode consists of 365 time steps. The reward is based on the net profit made in one episode after a series of actions. Moreover, reward shaping is applied to guide the agents toward good decisions: we apply penalties at each time step for missed demand, late orders, and poor supplier choices (higher price or longer lead time). Note that each decision made by an agent at time step t has a positive or negative impact at future time steps t+M.
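To illustrate the shaping terms described above, here is a hedged sketch of a per-step reward; the penalty weights, helper names, and the exact functional form are assumptions for illustration, not our exact implementation:

```python
def step_reward(profit, missed_demand_units, order_was_late, supplier_penalty,
                w_missed=0.5, w_late=1.0):
    """Per-step shaped reward: net profit minus penalty terms.

    missed_demand_units: units of demand not served at this step
    order_was_late:      whether a purchase order arrived after it was needed
    supplier_penalty:    precomputed cost of picking a pricier/slower supplier
    The weights w_missed and w_late are illustrative, not tuned values.
    """
    reward = profit
    reward -= w_missed * missed_demand_units  # stockout penalty
    if order_was_late:
        reward -= w_late                      # late-order penalty
    reward -= supplier_penalty                # wrong-supplier penalty
    return reward

print(step_reward(profit=10.0, missed_demand_units=4,
                  order_was_late=True, supplier_penalty=0.5))  # -> 6.5
```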

At each time step, the demand from the historical data is used to update the observations. The agent is thus expected to match or exceed last year's net profit.
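The per-step observation update driven by historical demand can be sketched as a simple inventory transition (a toy model; our real environment tracks more state, and the function name is a placeholder):

```python
def inventory_step(stock, arriving_qty, demand):
    """One transition: receive arrivals, then serve the historical demand.

    Returns the new stock level and the unmet demand, which feeds both the
    stockout penalty and the trend features in the observation.
    """
    stock += arriving_qty          # purchase orders arriving this step
    served = min(stock, demand)    # cannot serve more than is in stock
    unmet = demand - served        # missed demand -> stockout penalty
    stock -= served
    return stock, unmet

print(inventory_step(stock=20, arriving_qty=5, demand=30))  # -> (0, 5)
```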

For the neural network, we used a custom model consisting of linear layers with ReLU activations, branching into a softmax head for supplier selection and a custom activation for the quantity head that bounds its output to the range [0, max_purchase_quantity]. Batch normalization was applied to improve convergence and reduce overfitting, and attention mechanisms were incorporated to focus on critical values in the observation space.
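The two output branches can be sketched in NumPy as a softmax over supplier logits plus a sigmoid scaled into [0, max_purchase_quantity] for the quantity; the scaled sigmoid is one plausible choice for the custom activation mentioned above, not necessarily the one we used:

```python
import numpy as np

def branching_head(features, w_supplier, w_quantity, max_purchase_quantity):
    """Toy two-branch output head; weights and shapes are illustrative."""
    # Supplier branch: softmax over one logit per supplier.
    logits = features @ w_supplier
    exp = np.exp(logits - logits.max())   # subtract max for stability
    supplier_probs = exp / exp.sum()
    # Quantity branch: sigmoid squashed into [0, max_purchase_quantity].
    q_raw = features @ w_quantity
    quantity = max_purchase_quantity / (1.0 + np.exp(-q_raw))
    return supplier_probs, quantity

rng = np.random.default_rng(0)
feats = rng.normal(size=8)
probs, qty = branching_head(feats, rng.normal(size=(8, 3)),
                            rng.normal(size=8), 100)
print(round(probs.sum(), 6), 0.0 <= qty <= 100.0)  # -> 1.0 True
```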

Issues Encountered:
Despite these efforts, I am struggling to achieve good convergence and performance. Training is notably slow (on an NVIDIA GeForce RTX 3080, 10 GB). After 15,000 iterations on 50 products, spanning 28 hours, the agents have not reached the expected reward level at which they can reliably set the order quantity and supplier ID. We have tried adding penalties to the reward function for incorrect decisions, experimented with different exploration strategies, used curriculum learning (applying penalties progressively), and applied reward normalization. However, the results were not as expected.
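For reference, this is roughly the shape of our training setup, written here as a hedged RLlib configuration sketch; the environment name is a placeholder and the hyperparameter values are illustrative starting points we have been sweeping, not recommendations:

```python
from ray.rllib.algorithms.ppo import PPOConfig

# Illustrative PPO knobs for a shared-policy multi-agent setup.
# "my_supply_chain_env" is a placeholder; values are untuned starting points.
config = (
    PPOConfig()
    .environment("my_supply_chain_env")
    .training(
        gamma=0.995,            # long horizon: decisions pay off at t+M
        lr=3e-4,
        train_batch_size=8192,
        entropy_coeff=0.01,     # encourage exploration over suppliers
    )
    .multi_agent(
        policies={"shared_policy"},
        policy_mapping_fn=lambda agent_id, *args, **kwargs: "shared_policy",
    )
)
```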

How You Can Help:
I am reaching out for assistance and fresh perspectives. If you have experience with reinforcement learning in supply chain optimization, your suggestions on improving convergence speed, handling complex action spaces, or effective exploration strategies would be immensely valuable.

Moreover, if you have encountered similar challenges or have successfully implemented RL agents in supply chain contexts, your guidance on tuning hyperparameters, designing effective reward functions, or any other relevant advice would be highly appreciated.

Thank you in advance for considering this topic.
I eagerly await your responses and insights.