3D point cloud as input array for an RL algorithm

I would like to use RL to solve a permutation problem where my input is a sparse 3D array of zeros and ones.
This array defines a cloud of points in 3D space. The array is large (1000x500x50) and sparse: the number of ones is below 1% of the array.

My problem is to prepare a school timetable where:

  • x-axis is the timeslot
  • y-axis is the room
  • z-axis is the subject

So at each step the RL algorithm takes the current timetable and places a 1 at the chosen timeslot-room-subject index; this position defines the time, room, and subject of one lesson in the timetable.
This gives a very large observation array and a very large action space of 1000x500x50 actions, where each possible action is the position of a single lesson in the school timetable (some of the actions are masked).
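For illustration, a minimal sketch of how such a masked environment can be structured for RLlib (toy dimensions instead of 1000x500x50, Gymnasium API; the mask logic is only a placeholder for the real timetable constraints):

```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces

# Toy dimensions; the real problem is 1000 x 500 x 50.
T, R, S = 10, 5, 3
NUM_ACTIONS = T * R * S

class TimetableEnv(gym.Env):
    """One episode = placing all lessons, one lesson per step."""

    def __init__(self, num_lessons=20):
        self.num_lessons = num_lessons
        self.action_space = spaces.Discrete(NUM_ACTIONS)
        self.observation_space = spaces.Dict({
            "action_mask": spaces.Box(0.0, 1.0, shape=(NUM_ACTIONS,), dtype=np.float32),
            "timetable": spaces.Box(0.0, 1.0, shape=(T, R, S), dtype=np.float32),
        })

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.timetable = np.zeros((T, R, S), dtype=np.float32)
        self.placed = 0
        return self._obs(), {}

    def step(self, action):
        # Decode the flat action index into a (timeslot, room, subject) cell.
        t, r, s = np.unravel_index(action, (T, R, S))
        self.timetable[t, r, s] = 1.0
        self.placed += 1
        reward = 0.0  # placeholder; the shaped reward would go here
        terminated = self.placed >= self.num_lessons
        return self._obs(), reward, terminated, False, {}

    def _obs(self):
        # Placeholder mask: any still-empty cell is legal. The real constraints
        # (teacher availability, room capacity, ...) would be encoded here.
        mask = (self.timetable.reshape(-1) == 0).astype(np.float32)
        return {"action_mask": mask, "timetable": self.timetable.copy()}
```

On the model side, the usual pattern is a custom model that adds a large negative value to the logits of masked actions before sampling, as in RLlib's action-masking examples.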

To solve it I use a typical RLlib PPO model with action masking; of course the process is very slow and the model consumes a lot of memory. I wonder if I could use 3D convolutions or attention, but:

  • 3D convolutions need a lot of CPU
  • attention: I don’t know how to properly calculate similarity between 3D objects, and I don’t know a proper metric for this purpose (of course I could flatten the 3D array and use cosine or dot-product similarity, but I don’t think that is the right direction; see the point-set sketch after this list)
  • feature extraction from the 3D cloud to reduce the size of the input array: I don’t know a suitable method for that; typical feature extractors are designed for images
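To make the question concrete, here is roughly what I imagine by an attention-based encoder over the point set: convert the sparse array to the list of (timeslot, room, subject) coordinates of its ones, embed each point, and let self-attention learn the pairwise interactions instead of a hand-picked metric. Is something like this a reasonable direction? (All layer sizes, max_points, and helper names below are arbitrary placeholders.)

```python
import torch
import torch.nn as nn

class PointSetEncoder(nn.Module):
    """Encode a variable-size set of 3D points into one fixed-size vector."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        # Each point is just its (t, r, s) coordinates, normalized to [0, 1].
        self.embed = nn.Sequential(nn.Linear(3, d_model), nn.ReLU(),
                                   nn.Linear(d_model, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, coords, pad_mask):
        # coords: (batch, max_points, 3); pad_mask: (batch, max_points), True = padding.
        x = self.embed(coords)
        x, _ = self.attn(x, x, x, key_padding_mask=pad_mask)
        x = x.masked_fill(pad_mask.unsqueeze(-1), 0.0)
        # Mean-pool over the real (non-padded) points -> fixed-size feature.
        denom = (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)
        return self.out(x.sum(dim=1) / denom)

def array_to_points(arr, max_points, dims=(1000, 500, 50)):
    """Turn the sparse 0/1 3D array into padded, normalized coordinates."""
    idx = torch.nonzero(arr, as_tuple=False).float()        # (n, 3)
    idx = idx / torch.tensor(dims, dtype=torch.float32)     # scale to [0, 1]
    n = min(idx.shape[0], max_points)
    coords = torch.zeros(max_points, 3)
    coords[:n] = idx[:n]
    pad = torch.ones(max_points, dtype=torch.bool)
    pad[:max(n, 1)] = False  # keep >= 1 slot so attention never sees all-padding
    return coords, pad
```

The pooled vector would then feed the PPO policy and value heads instead of the raw 25-million-element array, which is what I hope would cut the memory use.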

I will be grateful for any suggestions or ideas.

Just out of curiosity: How do you plan to model this as a sequential decision-making process? How sparse would the reward signal be?

RL does not sound like a good fit for this problem compared to traditional optimization. But if I were going to tackle it with RL anyway, I would start with very small problem instances.

@arturn @robin Thank you for your answers. I see that making the timetable can be a sequential process in the following way: every step places a single lesson in the timetable, so:

step 1 - place the 1st lesson of the 1st subject
step 2 - place the 2nd lesson of the 1st subject
…
step i - place the n-th lesson of the 1st subject
…
step j - place the 1st lesson of the m-th subject
…
The last step places the last lesson of the last subject.
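In code, the episode structure would look roughly like this (random legal actions stand in for the trained policy; TimetableEnv is the sketch from the first post):

```python
import numpy as np

env = TimetableEnv(num_lessons=20)
obs, _ = env.reset()
terminated = False
while not terminated:
    # Sample uniformly among the legal (unmasked) actions; a trained
    # PPO policy would pick the action here instead.
    legal = np.flatnonzero(obs["action_mask"])
    action = int(np.random.choice(legal))
    obs, reward, terminated, truncated, info = env.step(action)
```

One thing I notice with this decomposition: since the lessons are placed in a fixed order, the subject of the current lesson is already implied by the step index, so in principle the action would only need to pick a (timeslot, room) pair, shrinking the action space from 1000·500·50 to 1000·500.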

Why do I try to use RL? I would like to optimize by shaping the reward based on events:

  • cumulate a subject’s lessons in one period of time, or spread them over the whole period
  • make the reward depend on the day of the week or on the preferred classroom for the subject

The most important part is that different subjects are taught in different buildings in different parts of the city, so I would like to optimize the travel time between places. The travel time is calculated using the bus timetable.
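For the travel-time term, I imagine something like the sketch below; bus_minutes is a hypothetical lookup table precomputed from the bus timetable, not an existing API.

```python
def travel_penalty(prev_building, next_building, departure_slot, bus_minutes):
    """Negative reward proportional to the bus travel time between the
    buildings of two consecutive lessons of the same class."""
    if prev_building is None or prev_building == next_building:
        return 0.0
    # bus_minutes[(a, b, slot)] = minutes from building a to b when
    # departing in the given timeslot, taken from the bus timetable.
    return -float(bus_minutes[(prev_building, next_building, departure_slot)])
```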
Of course there are methods other than RL, but I would like to test RL on this task.