Distributed multi-agent training

I have a multi-agent environment (1 env ≈ 10 agents) that is wall-time-expensive and CPU-only, and it's tricky to run multiple instances of the environment on one machine.
I want to run about 10 small machines, with 1 environment on each, which gives me ~100 agents to step through at a time. I want all of them to train a single shared policy (the agents are independent and do not interact with each other at all).

My intuition is that I would need to do both inference and training on one big machine with a GPU (e.g. on the head node), but I'm open to other setups, such as doing inference locally on each worker while training centrally, and syncing the policy from the head node to the worker nodes every time it changes.
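To make the second option concrete, here is a minimal plain-Python sketch of version-stamped policy syncing, the pattern I have in mind. This is not Ray code: the names `PolicyStore` and `Worker` are made up for illustration, and in a real cluster `PolicyStore` would presumably be a Ray actor living on the head node that workers call via `.remote()`.

```python
import threading

class PolicyStore:
    """Toy stand-in for a head-node parameter server.

    Workers pull weights only when the stored version is newer than
    the one they already hold, so an unchanged policy costs nothing
    to "sync"."""

    def __init__(self, weights):
        self._lock = threading.Lock()
        self._weights = weights
        self._version = 0

    def push(self, weights):
        # Called by the trainer after each policy update.
        with self._lock:
            self._weights = weights
            self._version += 1

    def pull(self, have_version):
        # Called by workers. Returns (version, weights), or
        # (version, None) when the caller is already up to date.
        with self._lock:
            if have_version == self._version:
                return self._version, None
            return self._version, self._weights


class Worker:
    """Env-stepping worker that refreshes its local policy copy lazily."""

    def __init__(self, store):
        self.store = store
        self.version = -1   # force a fetch on first sync
        self.weights = None

    def maybe_sync(self):
        self.version, fresh = self.store.pull(self.version)
        if fresh is not None:
            self.weights = fresh
        return self.weights


store = PolicyStore({"w": 0.0})
worker = Worker(store)
print(worker.maybe_sync())   # first pull fetches the initial weights
store.push({"w": 1.5})
print(worker.maybe_sync())   # picks up the new version
```

The design point is that workers poll cheaply (compare one integer) and only pay the weight-transfer cost when the trainer has actually pushed something new.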

What would be the best way to do this with a Ray cluster?