I’m exploring whether Ray is a good fit for the following large-scale GPU inference setup, and would appreciate guidance or validation.
High-Level Requirements:
- I have billions of small files stored in S3.
- I want to run inference using a model that fits on any single GPU.
- Assume the Ray cluster runs on Kubernetes with heterogeneous GPUs (e.g., A100, T4, V100), including spot instances — so pods may disappear anytime.
I want to maximize GPU utilization by (rough sketch of the pattern I have in mind after this list):
- Dynamically assigning file batches to any available GPU.
- Using larger batches for faster GPUs, and smaller ones for slower GPUs.
- Automatically feeding more data as soon as a GPU finishes a batch.
- Ensuring fault tolerance in case a pod dies midway (no duplicates or lost batches).
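For context, this is roughly the dispatch pattern I'm picturing: keep a bounded number of batches in flight and hand the next batch to whichever GPU finishes first, leaning on Ray's task retries when a pod is preempted. `run_model` and `all_batches` are placeholders for my own inference code and S3 batch listing, and the in-flight limit is illustrative, so treat this as a sketch of intent rather than a working implementation.

```python
# Minimal sketch of dynamic batch dispatch with plain Ray tasks.
import ray

ray.init(address="auto")


@ray.remote(num_gpus=1, max_retries=3)  # re-run the task if its pod is preempted
def infer_batch(file_keys):
    return run_model(file_keys)  # hypothetical inference function (placeholder)


def run(all_batches, max_in_flight=64):
    in_flight, results = [], []
    for batch in all_batches:
        if len(in_flight) >= max_in_flight:
            # Block until *any* task finishes, i.e. until some GPU frees up,
            # then immediately submit the next batch.
            done, in_flight = ray.wait(in_flight, num_returns=1)
            results.extend(ray.get(done))
        in_flight.append(infer_batch.remote(batch))
    results.extend(ray.get(in_flight))
    return results
```

With billions of files I'd obviously stream the batch listing rather than materialize it up front, but the shape of this loop is what I'm really asking about.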
My Question:
Let’s say I’ve done benchmarking in advance and know approximately how many files or batches each GPU type (e.g., A100 vs T4) can handle efficiently. Given this:
Can Ray:
- Dynamically push work to whichever pod/GPU becomes available — without pre-assigning static partitions?
- Use the available resource metadata (e.g., GPU type or speed) to adjust batch sizes or workload dynamically? (See the sketch after this list.)
- Handle fault-tolerant task re-execution if a pod (e.g., spot instance) is interrupted mid-processing?
- Integrate with an orchestrator (like Airflow or Argo) to manage this whole setup in a multi-stage pipeline?
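For the batch-size question, this is the kind of thing I'm hoping is possible: pinning tasks to a GPU type via Ray's `accelerator_type` option and choosing the batch size from my benchmark table. The batch-size numbers are made up and the exact accelerator constant names may differ by Ray version, so this is only a sketch of what I'd like to do.

```python
# Sketch of per-GPU-type batch sizing (benchmark numbers below are made up).
import ray
from ray.util.accelerators import NVIDIA_TESLA_A100, NVIDIA_TESLA_T4

BATCH_SIZES = {NVIDIA_TESLA_A100: 512, NVIDIA_TESLA_T4: 128}  # from my benchmarks


@ray.remote(num_gpus=1, max_retries=3)
def infer_batch(file_keys):
    ...  # placeholder; actual inference elided


def submit_for(gpu_type, file_keys):
    # Pin the tasks to one GPU type and size each batch from the benchmark table.
    size = BATCH_SIZES[gpu_type]
    return [
        infer_batch.options(accelerator_type=gpu_type).remote(file_keys[i:i + size])
        for i in range(0, len(file_keys), size)
    ]
```

For the orchestrator question, my rough idea is to have Airflow or Argo submit each pipeline stage as a Ray job through the Ray Jobs API, something like the following (the address, entrypoint, and working directory are placeholders):

```python
from ray.job_submission import JobSubmissionClient

# Address of the Ray head / dashboard is a placeholder.
client = JobSubmissionClient("http://<ray-head>:8265")
job_id = client.submit_job(
    entrypoint="python run_inference.py",        # hypothetical stage script
    runtime_env={"working_dir": "./inference"},  # hypothetical project dir
)
print(client.get_job_status(job_id))
```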
Thanks in advance; I'm looking to validate this approach before investing further in the implementation!