Stream processing of events (feature pre-processing) with "at least once" guarantee & auto-scaling

I’m looking into using Ray for a project and I would need some feedback or guidance on whether this is a good fit, and how to best implement it. The workflow is the following:

  • Clients upload videos on an S3 buckets.
  • Adding a video on bucket sends an event on an SQS queue announcing the new video file.
  • I need to process videos with an expensive GPU model, in a distributed streaming fashion. The longer the video, the longer this process will take. I cannot split a single video into fixed-sized chunks to parallellize processing a single video.
  • I need the processing to scale up or down according to queue length (to provide “low-latency” when many videos are in the queue, but to avoid paying for idle GPUs when the queue is empty). I would want the system to scale down to 0 GPUs if there is not video in the queue.
  • I need to make sure I process all videos at least once. Ideally exactly-once but at-least-once is fine as long as re-processing happens only under specific circumstances (e.g. redeploying).
  • I need to persist the output on another S3 bucket, or a feature store

I could do it with Spark streaming: it provides the at-least once guarantee, the ability to scale up and down BUT the mini-batch synchronisation is problematic as a long video would block the batch for a long time, and other videos would need to wait until this video is processed to be considered.

I could also theoretically do it in Flink, which provides at-least-once guarantees and does not suffer from Spark’s micro-batch limitation, BUT auto-scaling seems to be experimental and has strong limitations.

Because none of these options are great, I was thinking of doing this as a simple K8s deployment, where each pod reads from the queue and processes the events as they come in. I would scale the deployment up and down according to the message queue length.

Would there be a simple and clean solution to do it with Ray?

  1. I was looking into ray workflows, where I could have a SQS listener create a new workflow for each incoming message. But I am not sure about the folloiwng:
    a. Would Ray be able to scale up or down according to how many workflows are in the queue?
    b. Would Ray workflow be able to handle millions of events?
    c. Would Ray workflow be able to resume processing events after a redeployment?
  2. I was also looking at implementing it with a streaming Ray dataset, but as far as I can tell:
    a. this would not support scaling up and down based on load as parallellism would be fixed?
    b. There is no support for checkpointing/at-least-once guarantees?
  3. I was otherwise looking at implementing it with a Queue + Actors, but:
    a. I don’t think there would be any out-the-box at-least-once support, and I would need to reimplement it
    b. I am also not sure the actor pool could scale up and down automtically?
  4. Finally I saw people do it with Ray Serve:
    a. auto-scaling could work there, BUT it would not scale down to 0 when no video are in the queue (minimum is 1)?
    b. I would still need some code to poll from SQS and emit calls to serve, and support at-least once myself on that side?

I would greatly appreciate your feedback & insight on this topic, there might be something obvious that I am missing :sweat_smile: