Distributed multi-agent training

I have a multi-agent environment (1 env ≈ 10 agents) that is wall-time-expensive and CPU-only, and it's tricky to run multiple instances of the environment on one machine.
I want to run about 10 small machines, with 1 environment on each, which gives me ~100 agents to step through at a time. I want all of them to train a single shared policy (the agents are independent and do not interact with each other at all).

My intuition is that I would need to do both inference and training on one big machine with a GPU (e.g. on the head node), but I'm open to other setups, such as doing inference locally on each worker while training centrally, and syncing the policy from the head node to the worker nodes every time it changes.
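To make the second option concrete, here is a minimal plain-Python sketch of version-stamped policy syncing, the pattern I have in mind. This is not Ray code: the names `PolicyStore` and `Worker` are made up for illustration, and in a real cluster `PolicyStore` would presumably be a Ray actor living on the head node that workers call via `.remote()`.

```python
import threading

class PolicyStore:
    """Toy stand-in for a head-node parameter server.

    Workers pull weights only when the stored version is newer than
    the one they already hold, so an unchanged policy costs nothing
    to "sync"."""

    def __init__(self, weights):
        self._lock = threading.Lock()
        self._weights = weights
        self._version = 0

    def push(self, weights):
        # Called by the trainer after each policy update.
        with self._lock:
            self._weights = weights
            self._version += 1

    def pull(self, have_version):
        # Called by workers. Returns (version, weights), or
        # (version, None) when the caller is already up to date.
        with self._lock:
            if have_version == self._version:
                return self._version, None
            return self._version, self._weights


class Worker:
    """Env-stepping worker that refreshes its local policy copy lazily."""

    def __init__(self, store):
        self.store = store
        self.version = -1   # force a fetch on first sync
        self.weights = None

    def maybe_sync(self):
        self.version, fresh = self.store.pull(self.version)
        if fresh is not None:
            self.weights = fresh
        return self.weights


store = PolicyStore({"w": 0.0})
worker = Worker(store)
print(worker.maybe_sync())   # first pull fetches the initial weights
store.push({"w": 1.5})
print(worker.maybe_sync())   # picks up the new version
```

The design point is that workers poll cheaply (compare one integer) and only pay the weight-transfer cost when the trainer has actually pushed something new.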

What would be the best way to do this with a Ray cluster?