What is the best practice for launching jobs with really large dependencies, e.g. large binaries with many dynamic libs that may be larger than 500 MB? How would you compare the three options below?
- Submit via runtime_env.
- Let Ray actors/workers individually retrieve those binaries/.so files from shared storage, e.g. S3.
- Build those binaries/.so files together with Ray into an image that runs on each pod.
I think runtime_env doesn’t support data bigger than 100 MB right now (cc @architkulkarni for confirmation), so I feel like option 1 is not viable. I think options 2 and 3 should both be good? The third one might be heavyweight, but you won’t have the additional overhead of waiting to download and load the shared objects.
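For reference, option 2 could look roughly like this (an untested sketch; the bucket, key, and exported symbol are placeholders, and it assumes boto3 is available on every node):

```python
import ctypes
import os

import boto3
import ray


@ray.remote
class LibUser:
    """Each actor pulls the shared object from S3 once, then dlopen()s it."""

    def __init__(self, bucket: str, key: str):
        local_path = os.path.join("/tmp", os.path.basename(key))
        # Skip the download if another worker on this node already fetched it
        # (not concurrency-safe; a real version would lock around this).
        if not os.path.exists(local_path):
            boto3.client("s3").download_file(bucket, key, local_path)
        self.lib = ctypes.CDLL(local_path)

    def call(self) -> int:
        # Placeholder symbol; replace with your library's real entry point.
        return self.lib.my_function()


ray.init()
actor = LibUser.remote("my-bucket", "libs/libbig.so")  # placeholder bucket/key
```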
That’s right: for uploading directly from your local machine to the Ray cluster, the limit is 100 MB. But I agree that 2 and 3 are good options, and the way Sang compared them makes sense. Here’s some documentation for option 2: Handling Dependencies — Ray v1.10.0
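One note: since the 100 MB limit applies to uploads from your local machine, you can also host the packaged dependencies as a zip at a remote URI and have each node download it directly, which keeps runtime_env workable for larger artifacts. A minimal sketch (the S3 URI is a placeholder, and S3 URIs assume smart_open and boto3 are installed on the cluster):

```python
import ray

# The zip is downloaded by each node from S3 rather than uploaded from the
# driver, so the 100 MB local-upload limit doesn't apply here.
ray.init(
    runtime_env={
        "working_dir": "s3://my-bucket/my-deps.zip",  # placeholder URI
    }
)
```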