How to use Custom Action Distributions for this?

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

I have a custom network that’s outputting an N x M matrix of probabilities. These are essentially stacked Categorical distributions. So, if I have a 4 x 5 matrix, it’s 4 independent distributions over the same 5 options. The 5 options are something like up/down/left/right/noop.

I want to sample from each distribution a variable number of times per step. Let’s say my 4 rows correspond to cars/trucks/buses/bikes. In my simulation I might have 2 cars, 1 truck, no buses, and 6 bikes. So my vehicles are an array like [2, 1, 0, 6], and I want to draw that many times from the correspondingly indexed distribution. Post-sample, I expect a matrix that might look like:

[[1, 0, 0, 1, 0],   ← cars
 [0, 0, 1, 0, 0],   ← trucks
 [0, 0, 0, 0, 0],   ← buses
 [3, 0, 2, 1, 0]]   ← bikes
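
For reference, here’s a minimal torch sketch of the sampling I mean, completely outside RLlib (the logits and counts below are just made-up example inputs):

```python
import torch

# Made-up example inputs: N=4 vehicle types x M=5 actions, plus per-type counts.
logits = torch.randn(4, 5)             # the N x M network output
vehicles = torch.tensor([2, 1, 0, 6])  # draws per row: cars, trucks, buses, bikes

probs = torch.softmax(logits, dim=-1)
counts = torch.zeros_like(probs, dtype=torch.long)

for i, n in enumerate(vehicles.tolist()):
    if n == 0:
        continue  # no vehicles of this type -> row stays all zeros
    draws = torch.distributions.Categorical(probs=probs[i]).sample((n,))
    counts[i] = torch.bincount(draws, minlength=probs.shape[-1])

# counts now looks like the matrix above, e.g. the bikes row sums to 6.
```

Each row is effectively one multinomial draw with vehicles[i] trials over that row’s categorical probabilities.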

I’m struggling to figure out how to implement this sampling appropriately through RLlib. In particular, I can’t figure out how to stick this logic into a Custom Action Distribution.

I’ve got the torch logic to DO the sampling in hand; I think my question is essentially:

  • Is a custom action distribution the right place to do this sampling?
  • If so, how should I pass the list of available vehicles to the action_dist each step?
  • Should I instead pass the network output, unsampled, to the env, and have this sampling occur in the env?
  • Should I be trying to recreate this sampling approach with the built-in action spaces?

I guess another way to ask this is: how do I sample a variable number of times from a MultiDiscrete, where the number of draws changes per step?
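
For context, this is roughly the scaffolding I understand for a custom action distribution on the old (ModelV2) API stack; everything except RLlib’s own classes and the registration call is a made-up name, and the sampling body is only a placeholder:

```python
import torch
from ray.rllib.models import ModelCatalog
from ray.rllib.models.torch.torch_action_dist import TorchDistributionWrapper


class VehiclePlacementDist(TorchDistributionWrapper):
    """Sketch only: N stacked Categoricals over M options (names made up)."""

    def __init__(self, inputs, model):
        super().__init__(inputs, model)
        # Flat model output reshaped to (batch, N=4, M=5) logits.
        self.logits = self.inputs.reshape(self.inputs.shape[0], 4, 5)

    def sample(self):
        # Placeholder: the variable-count sampling would go here.
        self.last_sample = torch.distributions.Categorical(logits=self.logits).sample()
        return self.last_sample

    def deterministic_sample(self):
        self.last_sample = self.logits.argmax(dim=-1)
        return self.last_sample

    def logp(self, actions):
        return torch.distributions.Categorical(logits=self.logits).log_prob(actions).sum(-1)

    def entropy(self):
        return torch.distributions.Categorical(logits=self.logits).entropy().sum(-1)

    @staticmethod
    def required_model_output_shape(action_space, model_config):
        return 4 * 5  # flat logits for the N x M matrix


ModelCatalog.register_custom_action_dist("vehicle_placement_dist", VehiclePlacementDist)
# Then in the algorithm config: config["model"]["custom_action_dist"] = "vehicle_placement_dist"
```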

@drj great question! From what I read, it sounds to me like this problem could be better formulated as a multi-agent problem with multiple policies and multiple agents (N x M): the same policy could be shared by all cars and trained on their experiences, and at rollout time each car would act on its own sample from the distribution, so they would all act differently but similarly. That way you do not have to hack these distributions into a single one for a single policy that steers all agents in the environment.

You can take a look at our MultiAgentEnv and at the multi-agent examples.
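
If it helps, a multi-agent env skeleton looks roughly like this (a minimal sketch assuming a recent, gymnasium-based RLlib; the agent ids, spaces, and dynamics are just placeholders, and older versions use the 4-tuple step API):

```python
import gymnasium as gym
from ray.rllib.env.multi_agent_env import MultiAgentEnv


class ToyVehicleEnv(MultiAgentEnv):
    """Minimal sketch: one agent per vehicle, all sharing the same 5 actions."""

    def __init__(self, config=None):
        super().__init__()
        self._vehicles = ["car_0", "car_1", "truck_0"]   # placeholder agent ids
        self.observation_space = gym.spaces.Box(-1.0, 1.0, (3,))
        self.action_space = gym.spaces.Discrete(5)        # up/down/left/right/noop

    def reset(self, *, seed=None, options=None):
        obs = {a: self.observation_space.sample() for a in self._vehicles}
        return obs, {}

    def step(self, action_dict):
        # Dummy dynamics: every acting agent gets a fresh random obs and zero reward.
        obs = {a: self.observation_space.sample() for a in action_dict}
        rewards = {a: 0.0 for a in action_dict}
        terminateds = {"__all__": False}
        truncateds = {"__all__": False}
        return obs, rewards, terminateds, truncateds, {}
```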

@Lars_Simon_Zehnder thanks so much! I probably actually oversimplified my problem when I was explaining it.

I actually already have this formulated as a hierarchical, multi-agent problem. My lower level is exactly what you described, with a policy for each vehicle type that makes individual choices.

My high-level agent is choosing the starting point for these vehicles – which “cell” they initialize in. For problem-specific reasons, I think this kind of learned initialization will help the overall system performance a lot. And on any given episode, I have different distributions of lower-level agents, so I need to be able to “place” different distributions of vehicles.

Does that make sense?

@drj, alright got you. Thanks for the explanation.

Would it work if the high-level agent uses our TorchMultiCategorical distribution? That can basically choose, for each of N different agents, one of M different cells.
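
In plain torch, the multi-categorical idea boils down to roughly this (just a sketch; the RLlib class itself builds the per-slot Categoricals from the flat model output via its input_lens argument):

```python
import torch

# Placeholder: flat logits for N=4 vehicle types, each choosing one of M=5 cells.
flat_logits = torch.randn(1, 4 * 5)   # (batch, N * M), as a model head would emit
input_lens = [5, 5, 5, 5]             # one Categorical per vehicle type

split_logits = torch.split(flat_logits, input_lens, dim=1)
cats = [torch.distributions.Categorical(logits=l) for l in split_logits]

# One cell choice per vehicle type, plus the summed log-prob.
actions = torch.stack([c.sample() for c in cats], dim=1)                # (batch, 4)
logp = sum(c.log_prob(a) for c, a in zip(cats, actions.unbind(dim=1)))  # (batch,)
```

Each Categorical here is sampled exactly once, i.e. one cell per vehicle type.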

Btw, see here for a hierarchical example.

Ahh, I see – so basically my Distribution just becomes a list of Categorical distributions.

I guess the tricky thing is that I want to sample an uneven number of times from each Categorical distribution. So I can’t just do [cat.sample() for cat in self._cats], since I may want to sample twice from the cars distribution and zero times from the bikes distribution.

Can I pass additional inputs to sample, logp, etc.? I’ll have a 1-D tensor called something like vehicles in my obs that gives the number of each vehicle type available in that step, like [2, 1, 0, 6]. So I’d want to do something like def sample(self, vehicles):. I assume I’d need to further modify the code that calls the ActionDistribution.
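
Something like this hypothetical sketch is what I have in mind, assuming the custom model stashes the vehicles part of the obs on itself during forward() (the class and attribute names are made up):

```python
import torch
from ray.rllib.models.torch.torch_action_dist import TorchDistributionWrapper


class VehicleCountDist(TorchDistributionWrapper):
    """Hypothetical sketch: read per-type vehicle counts off the model."""

    def sample(self):
        logits = self.inputs.reshape(self.inputs.shape[0], 4, 5)
        probs = torch.softmax(logits, dim=-1)
        # Hypothetical attribute the custom model would set in forward(), e.g.
        #   self.last_vehicle_counts = input_dict["obs"]["vehicles"]
        vehicles = self.model.last_vehicle_counts        # shape (batch, 4)
        counts = torch.zeros_like(probs, dtype=torch.long)
        for b in range(probs.shape[0]):
            for i, n in enumerate(vehicles[b].tolist()):
                if n > 0:
                    draws = torch.distributions.Categorical(probs=probs[b, i]).sample((int(n),))
                    counts[b, i] = torch.bincount(draws, minlength=probs.shape[-1])
        self.last_sample = counts
        return counts
```

logp() would then need the matching count-based log-probability, and the action space has to be able to represent the whole count matrix, which is its own question.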

Or is the better option to sample N times, where N is max(vehicles) (6 in this case), and then have my env throw away the samples that don’t have a corresponding vehicle available? In my head that would be sort of like action masking, but after sampling, and I don’t know if that works algorithmically.
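
Concretely, the padding version would be something like this (a sketch assuming the env, or a wrapper around it, still sees vehicles and just drops the extra draws):

```python
import torch

# One step's N x M output and the per-type availability for that step.
logits = torch.randn(4, 5)
vehicles = torch.tensor([2, 1, 0, 6])
n_max = int(vehicles.max())            # always sample 6 times per row

# Fixed-size sampling, so the action stays a regular (4, 6) block of indices.
samples = torch.distributions.Categorical(logits=logits).sample((n_max,)).T   # (4, 6)

# In the env: keep only the first vehicles[i] draws of row i, drop the rest.
keep = torch.arange(n_max).unsqueeze(0) < vehicles.unsqueeze(1)               # (4, 6) mask
used = [row[mask].tolist() for row, mask in zip(samples, keep)]
```

The part I’m unsure about is that the trainer would still compute logp/entropy over all max(vehicles) draws, including the discarded ones.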