How severely does this issue affect your experience of using Ray?
- High: It blocks me from completing my task.
Hi Team,
I have a problem when using Ray Serve for my API deployment.
My computer has 8 physical CPU cores and 16 logical processors.
When I assigned 4 replicas with
@serve.deployment(num_replicas=4)
the dashboard shows that 6 processes are running (2 of which are used by Ray itself).
But when I tested with Postman, I found that Ray could only process 2 requests in parallel. For example, when I send 3 requests at the same time, only 2 replicas are working and the other 2 are idle.
When I updated num_replicas to 6, only 3 replicas were working and the other 3 were idle. It seems I can only use half of the CPUs that I configure.
Is there any mistake in my config or code?
P.S. I set OMP_NUM_THREADS to match the desired parallelism, but it didn't help.
@serve.deployment(num_replicas=4, ray_actor_options={"num_cpus": 1, "num_gpus": 0})
# @serve.deployment
@serve.ingress(app)
class CutOptimize:
    def __init__(self):
        os.environ["OMP_NUM_THREADS"] = "4"
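A possible variant of this config, sketched here with assumptions and not taken from the original post: setting OMP_NUM_THREADS inside __init__ may be too late if the numerical libraries read it at import time. Depending on your Ray version, the variable can instead be passed through a runtime_env in ray_actor_options so it is set before the replica process starts.

# Hedged sketch: pass OMP_NUM_THREADS to each replica via runtime_env
# (assumes a Ray version whose ray_actor_options accept "runtime_env").
@serve.deployment(
    num_replicas=4,
    ray_actor_options={
        "num_cpus": 1,
        "num_gpus": 0,
        # env_vars are applied before the replica's worker process starts.
        "runtime_env": {"env_vars": {"OMP_NUM_THREADS": "4"}},
    },
)
@serve.ingress(app)
class CutOptimize:
    def __init__(self):
        pass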
After more testing, it seems the behavior is caused by FastAPI.
When the deployment is written with FastAPI, it takes twice the CPUs to complete one task.
But with a raw __call__() handler it works as expected:
def __call__(self, request: Request) -> Dict:
    return {"result": self._msg}
Hi @liu_meteorfall, glad you fixed the issue on your own! But do you mind sharing a simple script to reproduce it? (The team can help you diagnose it further.)
1. Simple case using FastAPI (@serve.ingress) with Ray Serve:
import ray
from ray import serve
from fastapi import FastAPI, Form
import os
import time

app = FastAPI()
origins = [
    "*"
]

ray.init(address="auto", _node_ip_address='192.168.11.5', namespace="serve")
serve.start(http_options={"host": "0.0.0.0"}, detached=True)


@serve.deployment(num_replicas=2)
# @serve.deployment
@serve.ingress(app)
class TestFastApi:
    def __init__(self):
        pass

    @app.get("/")
    async def test(self):
        startTime = time.time()
        receiveTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(startTime))
        pid = os.getpid()
        try:
            # Simulate 5 seconds of work (blocking on purpose).
            time.sleep(5)
            # for i in range(150000):
            #     print(i)
            return {
                'receiveTime': receiveTime,
                'runTime': time.time() - startTime,
                'infoTrace': "success",
                'pid': pid
            }
        except Exception as e:
            return {
                'receiveTime': receiveTime,
                'runTime': time.time() - startTime,
                'errorTrace': str(e),
                'pid': pid
            }


TestFastApi.deploy()
2. Simple case using a raw __call__ handler with Ray Serve:
import ray
from ray import serve
import os
import time
from starlette.requests import Request

ray.init(address="auto", _node_ip_address='192.168.11.5', namespace="serve")
serve.start(http_options={"host": "0.0.0.0"}, detached=True)


@serve.deployment(num_replicas=2)
class TestApi:
    def __init__(self):
        pass

    async def __call__(self, request: Request) -> dict:
        try:
            # params: str = await request.json()
            startTime = time.time()
            receiveTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(startTime))
            pid = os.getpid()
            # API parameters
            # Simulate 5 seconds of work (blocking on purpose).
            time.sleep(5)
            # for i in range(150000):
            #     print(i)
            return {
                'receiveTime': receiveTime,
                'runTime': time.time() - startTime,
                'infoTrace': "success",
                'pid': pid,
            }
        except Exception as e:
            return {
                'receiveTime': receiveTime,
                'runTime': time.time() - startTime,
                'errorTrace': str(e),
                'pid': pid,
            }


TestApi.deploy()
- We use time.sleep(5) to simulate long-running work and assign 2 replicas (2 CPUs) for parallelism.
- We deploy the two scripts above to the server and compare the timing results.
- We use Postman to send two requests to the raw __call__ API within a very short window; the result is:
The 2 CPUs work in parallel as expected.
- We use Postman to send two requests to the FastAPI case within a very short window (a Python equivalent of this test is sketched after this list); the result is:
- Is there any mistake in my FastAPI case code?
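As a rough Python equivalent of the Postman test above (a sketch under assumptions: Serve's default port 8000 and default route prefixes /TestApi and /TestFastApi, which may differ in your setup), the following fires two requests at once and prints the pid and runTime fields from each response, which is enough to see whether the two replicas ran in parallel.

# Hedged client-side reproduction of the Postman test (URLs are assumptions).
import time
from concurrent.futures import ThreadPoolExecutor

import requests

URL = "http://127.0.0.1:8000/TestApi"  # or ".../TestFastApi/" for the ingress case


def hit(url):
    start = time.time()
    resp = requests.get(url)
    return resp.json(), time.time() - start


if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(hit, URL) for _ in range(2)]
        for f in futures:
            body, elapsed = f.result()
            # If the replicas run in parallel, both requests finish in ~5 s
            # and the two 'pid' values differ.
            print(body.get("pid"), body.get("runTime"), round(elapsed, 2))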
By the way, when I traced the Ray Serve logs, it looks like with FastAPI the server uses one process to handle the request but responds from another process, so it may occupy 2 CPUs for one request.
I am not sure about my guess.
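One way to probe this guess from the outside is sketched below; it is not from the thread, just a rough psutil-based check under the assumption that the replica is doing CPU-bound work (for the repro scripts, uncomment the counting loop instead of time.sleep(5), since a sleeping process shows no CPU usage). While a single request is in flight, it counts how many Ray worker processes are busy; two busy workers per FastAPI request versus one for the raw __call__ case would support the theory.

# Hedged sketch: count busy Ray worker processes while one request is running.
# Requires `pip install psutil`; Ray worker process names can differ by version.
import time

import psutil


def busy_ray_workers(threshold=50.0, window=1.0):
    workers = [
        p for p in psutil.process_iter(["name", "cmdline"])
        if "ray::" in (p.info["name"] or "")
        or any("default_worker" in c for c in (p.info["cmdline"] or []))
    ]
    # Prime cpu_percent, then sample over the window.
    for p in workers:
        try:
            p.cpu_percent(None)
        except psutil.NoSuchProcess:
            pass
    time.sleep(window)
    busy = []
    for p in workers:
        try:
            if p.cpu_percent(None) > threshold:
                busy.append(p.pid)
        except psutil.NoSuchProcess:
            pass
    return busy


if __name__ == "__main__":
    # Send one request from another terminal, then run this while it is working.
    print("busy ray workers:", busy_ray_workers())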