GTrXL multimodal tokenizer

Hi, I would like to train an actor for a custom environment using a transformer-like policy network. I came across the GTrXL network and would like to know whether it supports multimodal input. By this I mean a tokenizer for visual features (images) and perceptual features (joint states, etc.).
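
To make the question concrete, here is a minimal PyTorch sketch of the kind of tokenizer I have in mind. It is not taken from any GTrXL implementation: the class name `MultimodalTokenizer`, the dimensions, the patch-based image embedding, and the modality embeddings are all my own assumptions for illustration.

```python
import torch
import torch.nn as nn


class MultimodalTokenizer(nn.Module):
    """Hypothetical tokenizer: maps an image and a joint-state vector
    to a single sequence of embedding tokens."""

    def __init__(self, embed_dim=64, patch_size=8, in_channels=3, state_dim=7):
        super().__init__()
        # Non-overlapping image patches, one token per patch (assumed design).
        self.patch_embed = nn.Conv2d(
            in_channels, embed_dim, kernel_size=patch_size, stride=patch_size
        )
        # The whole joint-state vector is projected to a single token.
        self.state_embed = nn.Linear(state_dim, embed_dim)
        # Learned per-modality offsets so the network can tell token types apart.
        self.modality_embed = nn.Embedding(2, embed_dim)

    def forward(self, image, state):
        # image: (B, C, H, W), state: (B, state_dim)
        img_tokens = self.patch_embed(image).flatten(2).transpose(1, 2)  # (B, N, D)
        state_token = self.state_embed(state).unsqueeze(1)               # (B, 1, D)
        img_tokens = img_tokens + self.modality_embed.weight[0]
        state_token = state_token + self.modality_embed.weight[1]
        # Concatenate along the token axis: (B, N + 1, D)
        return torch.cat([img_tokens, state_token], dim=1)


tokenizer = MultimodalTokenizer()
image = torch.randn(2, 3, 32, 32)   # batch of 2 RGB images, 32x32
state = torch.randn(2, 7)           # batch of 2 joint-state vectors
tokens = tokenizer(image, state)
print(tokens.shape)                 # torch.Size([2, 17, 64])
```

With a 32x32 image and 8x8 patches this gives 16 image tokens plus 1 state token per observation. How such tokens would be passed to GTrXL (pooled into one embedding per time step, or kept as a longer sequence) is part of what I am asking.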