What is the best approach for long running IO tasks (pollers)?

We have currently employed long running IO tasks as ray remote functions. That Keep polling a cloud based queue to listen to events.

Resources:

  • Ray runs inside a container process where the infra resources allocated are as vcpu:1024 (meaning only 1 core), RAM (7GB).
  • no attributes like num_cpus are set for the remote tasks or functions. we just use @ray.remote without any arguments

Few observations:

  • Increasing number of worker (number of tasks or remote functions) for polling seems to be a magic. Few events are lost but we could see they are being processed by the remote actors. to be precise intial steps seem to complete but the next steps are lost in that worker process. Is there a known issue in raylets?
  • With 1 raylet process (1 remote function) it works fine. we see the above issue when the number of workers are more than 1. In our environment it is 3.

Additional Questions:

  • Is it a correct approach to run a ray task that keeps polling the queue to receive events? (there is no other alternative for to fetch events/data from the queue)
  • Should we use actors?
  • Any other recommendations for long running pollers?
  • I also read in one of the posts Is there any method to make ray task sharing with 1 cpu core? Like multithreading. Should I use num_cpus with value < 1 to have actual parallel running workers? But we do see 3 pids in the logs that the workers have started and are listening to the queue.
  • Is it a correct approach to run a ray task that keeps polling the queue to receive events? (there is no other alternative for to fetch events/data from the queue)
  • Should we use actors?

It seems like using an async actor https://docs.ray.io/en/master/async_api.html is the right way to go for this use case. Btw, for this specific use case (lots of IO), async actors is more recommended than tasks because task is doing the process-level parallelism, and that can cause some overhead if your workload is IO heavy.

I believe there’s a 3rd party library written on top of Ray to support this use case; GitHub - project-codeflare/rayvens: Rayvens makes it possible for data scientists to access hundreds of data services within Ray with little effort..

  • Increasing number of worker (number of tasks or remote functions) for polling seems to be a magic. Few events are lost but we could see they are being processed by the remote actors. to be precise intial steps seem to complete but the next steps are lost in that worker process. Is there a known issue in raylets?

There’s no known issue. But it seems like this setup Ray runs inside a container process where the infra resources allocated are as vcpu:1024 (meaning only 1 core), RAM (7GB). is probably not a good setup to have a ray cluster. Ray itself already starts several processes (5~6) and use multi threads at each component, so having 1 cpu is usually not enough and prone to more bugs. Is it possible to at least have 8 cpus on each container?

  • no attributes like num_cpus are set for the remote tasks or functions. we just use @ray.remote without any arguments

In this case, the default is num_cpus=1