Hi @sangcho , just see this thread.
As a BI company, we build many ETL to serve the data warehouse.
I had a chance to experience most libraries out there, let me list some:
Airflow: The first to think of, but it’s too heavy… and it doesn’t have a native executor. The best executor choice here is Celery which I don’t like, I’ll tell you later why.
Celery: The most popular worker I think. It’s superb to execute parallel tasks, but it’s bad when it comes to concurrent tasks (which are the major of ETL jobs).
Celery does have greenlet workers, but still not my choice anyway. First, those greenlet workers are buggy, you can’t call external commands (such as shutdown, restart…).
Second, greenlet doesn’t return the result of the job in an async style (first done first return)
Third, greenlet doesn’t work with async functions. So for calling apis, you can’t use aiohttp but requests instead. The greenlet will try to monkeypatch the normal function to async one… But you can’t control, you are not aware whether a function is able to be patched or not… so not a solid choice…
Faust: this one is great, use kafka, it’s perfect for async tasks… again it uses single core…
In case of Celery, I can create a prefork worker and a greenlet worker and pass tasks around… sound a little bit stupid but still work haha… In case of Faust I can’t, there is no parallel worker option…
What if I have a computer with 16 cores, should I start 16 instances of concurrent worker? How can I manage their health? A Cluster Manager is important which is not available.
Kafka is a bit heavy too( Kafka+zookeeper eat ~1gb ram on average, while rabbitmq ~100mb and Redis ~ 10mb)
Arq: very promising but not so many stars… I’m not confident enough to use it in our production work.
Back to RAY. What I really like:
- Super lightweight and easy to config. A don’t need another Container for the message queue. I believe you guys do use queues(Redis?) to pass the message around actors, but the end-user is care-free of that => good point.
- The flexibility to swap between async/normal tasks. I shared my pain above… with Ray, you just need to add the async function in the actor. The result can be either ray.get or await => so this feature is really a dream to me.
- I’m a fan of reactive programming, so I love actors, I love Akka… Although Ray actor is not as fully featured as Akka, using Python is a strong advantage (in term of integration and HR)
- One of the most complete RPC I’ve ever seen in Python. With all other frameworks above, you need to define the functions in advance on the host… Ray helps me to separate the development process and the worker server.
- … some more that I don’t remember lol