Serve huggingface transformer on GPU with batching

How severely does this issue affect your experience of using Ray?

  • High: It blocks me from completing my task.

Hi, thanks for the excellent software.

I'm using Ray 2.0 and have been following the example at https://docs.ray.io/en/master/serve/end_to_end_tutorial.html

I got it to work on CPU, but when I use the @ray.remote decorator it throws an error:

TypeError: Remote functions cannot be called directly. Instead of running 'main.init()', try 'main.init.remote()'.

My full script, along with the shell script for executing it on Ubuntu 21, is here:

ray stop

ray start --head --port=8016 --num-gpus=2

python3 torchserverv2.py

import ray
from ray import serve
#from fastapi import FastAPI
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

#app = FastAPI()
ray.init(address="auto", namespace="serve")
serve.start()

@serve.deployment
class M2M:
    @ray.remote(num_gpus=2)
    def init(self):
        self.model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
        self.tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
        self.src_lang = 'zh'
        self.trg_lang = 'en'

    @ray.remote(num_gpus=2)
    def __call__(self, request):
        self.tokenizer.src_lang = self.src_lang
        txt = request.query_params['txt']
        encoded_text = self.tokenizer(txt, return_tensors="pt", padding=True, truncation=True)
        generated_tokens = self.model.generate(**encoded_text, forced_bos_token_id=self.tokenizer.get_lang_id(self.trg_lang))
        return self.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

M2M.deploy()

Question 2: My next step is to get Ray Serve to do batching, so that it only executes M2M.__call__ once every 12 requests (a mini-batch of 12), or once 0.1 seconds have passed since the last request was made.

I checked the tutorials and they are based on v1.12 (not the v2 I am trying to use). Any help on implementing this? Many thanks

RT

Hi, sorry for the confusion here! You don't need to use @ray.remote inside Serve deployments; the @serve.deployment wrapper handles that for you.

Check here for how to specify GPUs: Core API: Deployments — Ray 1.12.0

@serve.deployment(name="deployment1", ray_actor_options={"num_gpus": 0.5})
def func(*args):
    return do_something_with_my_gpu()

And does the following batching tutorial help for the batching question? Batching Tutorial — Ray 1.12.0
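For your concrete numbers (a mini-batch of 12, flushed after 0.1 seconds), the key piece from that tutorial is the serve.batch decorator, which takes max_batch_size and batch_wait_timeout_s arguments. Here is a minimal sketch of the pattern; the class name, method name, and placeholder body are just illustrations, not something I've run against your model:

from typing import List

from ray import serve
from starlette.requests import Request


@serve.deployment
class BatchedTranslator:
    @serve.batch(max_batch_size=12, batch_wait_timeout_s=0.1)
    async def handle_batch(self, txts: List[str]) -> List[str]:
        # Serve calls this with up to 12 texts at once, or fewer once
        # 0.1 s has passed since the first queued request.
        # It must return one result per input, in the same order.
        return ["translated: " + t for t in txts]  # placeholder for the real model call

    async def __call__(self, request: Request) -> str:
        # Each HTTP request hands over a single item; Serve gathers
        # concurrent calls into batches behind the scenes.
        return await self.handle_batch(request.query_params["txt"])

Each caller still awaits a single result, so nothing changes on the client side; only the code inside handle_batch sees the whole batch.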


For clarity, here is a modified version of your program that I have confirmed works with Ray 2.0.0:

import ray
from ray import serve
#from fastapi import FastAPI
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

#app = FastAPI()
ray.init(address="auto", namespace="serve")
serve.start(detached=True)

@serve.deployment(
    ray_actor_options={
        "num_gpus": 2
    }
)
class M2M:
    def __init__(self):
        self.model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
        self.tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
        self.src_lang = 'zh'
        self.trg_lang = 'en'

    def eval(self, txt: str):
        encoded_text = self.tokenizer(txt, return_tensors="pt", padding=True, truncation=True)
        generated_tokens = self.model.generate(**encoded_text, forced_bos_token_id=self.tokenizer.get_lang_id(self.trg_lang))
        return self.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

    def __call__(self, request):
        self.tokenizer.src_lang = self.src_lang
        txt = request.query_params['txt']
        return self.eval(txt)

M2M.deploy()
(anyscale) alinaqvi@Syeds-MBP product % curl "http://127.0.0.1:8000/M2M?txt=hello"
[
  "Hello"
]%
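If you also want the mini-batching from your second question, a sketch (untested, so treat it as a starting point rather than confirmed working) would be to replace eval and __call__ in the class above with a serve.batch method; eval_batch is just a name I picked, and since the tokenizer already pads a list of texts, the whole mini-batch can go through model.generate in one call:

from typing import List

    @serve.batch(max_batch_size=12, batch_wait_timeout_s=0.1)
    async def eval_batch(self, txts: List[str]) -> List[str]:
        # Translate up to 12 texts in a single forward pass.
        self.tokenizer.src_lang = self.src_lang
        encoded_text = self.tokenizer(txts, return_tensors="pt", padding=True, truncation=True)
        generated_tokens = self.model.generate(
            **encoded_text, forced_bos_token_id=self.tokenizer.get_lang_id(self.trg_lang)
        )
        # batch_decode returns one string per input, which is exactly the
        # one-result-per-request shape that serve.batch expects back.
        return self.tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)

    async def __call__(self, request):
        # Serve collects concurrent calls into batches behind the scenes.
        return await self.eval_batch(request.query_params['txt'])

Note that each response then becomes a single string rather than the one-element list returned by the unbatched version above.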