The 2 GB limit seems at odds with the Ray project's philosophy and goals. It's not uncommon in NLP to serve models with very large weights, and 2 GB isn't really that much.
Maybe I'm missing something? But the project docs state that serve/actor state "can have a very large neural network weight."
One way to work around this is to start an actor on the cluster that holds the model (plus another process that continuously trains the weights) and then pass the actor's handle to the Serve deployment. It's not exactly the "right" way, but it works, and it can be organised into separate steps so the large weights never have to be saved to external storage. Roughly, something like the sketch below.
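A minimal sketch of what I mean, assuming the Ray Serve 2.x `.bind()` / `serve.run()` API; the `ModelHolder` actor and its toy "weights" are just placeholders for a real model and training loop:

```python
import ray
from ray import serve


@ray.remote
class ModelHolder:
    """Actor that keeps the large model weights in cluster memory."""

    def __init__(self):
        # Stand-in for loading large NLP weights; in practice this could be
        # many GB held inside the actor process. A separate training process
        # could keep updating these weights in place.
        self.weights = {"bias": 0.5}

    def predict(self, text: str) -> float:
        # Toy inference against the in-actor weights.
        return len(text) * self.weights["bias"]


@serve.deployment
class Predictor:
    def __init__(self, model_holder):
        # The deployment stores only the (tiny) actor handle, not the weights,
        # so nothing close to the 2 GB limit crosses this boundary.
        self.model_holder = model_holder

    async def __call__(self, request):
        text = (await request.json())["text"]
        # Forward the request to the actor that owns the weights.
        return await self.model_holder.predict.remote(text)


ray.init()
model_holder = ModelHolder.remote()
serve.run(Predictor.bind(model_holder))
```

The point is that the deployment never serialises the weights itself; it only holds the actor handle and delegates each request to the actor that owns them.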