Well, actually it has nothing to do with serving models for now. I have a small GraphQL endpoint made available through Serve. Ray is running on Kubernetes, and the cluster node resources are very limited, 2 CPUs / 8 GB RAM, but I am able to spawn as many nodes as needed. This means the Ray head lives on one of these nodes and its CPU count is 2. Just deploying the endpoint takes up both of those CPUs: 1 for the proxy and another for the backend.
Since the endpoint's usage is light and not very regular, shrinking its resource requirements seemed like a good option. I managed to reduce the backend's CPU requirement, but not the proxy's, hence my question here.
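For context, reducing the backend's reservation was roughly along these lines — a sketch of a Serve config assuming the Ray 2.x config file schema (the app and deployment names here are placeholders, not my actual code):

```yaml
# Serve config sketch: request a fractional CPU for the deployment
# so it doesn't consume a whole core on the 2-CPU head node.
# Assumption: Ray 2.x Serve config schema; names are illustrative.
applications:
  - name: graphql_app
    import_path: my_module:graphql_endpoint   # placeholder import path
    deployments:
      - name: GraphQLEndpoint
        ray_actor_options:
          num_cpus: 0.25   # fractional reservation instead of the default 1
```

The equivalent via the decorator API would be `@serve.deployment(ray_actor_options={"num_cpus": 0.25})`. I couldn't find a corresponding knob for the HTTP proxy's CPU reservation, which is what I'm asking about.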