Hi @sangcho , I just saw this thread.
As a BI company, we build a lot of ETL pipelines to feed our data warehouse.
I've had the chance to try most of the libraries out there, so let me list a few:
- Airflow: The first one that comes to mind, but it's too heavy… and it doesn't have a native executor. The best executor choice there is Celery, which I don't like; I'll explain why below.
- Celery: Probably the most popular worker. It's superb at executing parallel tasks, but it's bad when it comes to concurrent tasks (which make up the majority of ETL jobs).
Celery does have greenlet workers, but they're still not my choice. First, those greenlet workers are buggy: you can't send them external commands (such as shutdown, restart…).
Second, greenlet doesn't return job results in an async style (first done, first returned).
Third, greenlet doesn't work with async functions, so for calling APIs you can't use aiohttp and have to use requests instead. Greenlet tries to monkeypatch normal functions into async ones… but you have no control, and you can't tell whether a given function can actually be patched… so not a solid choice…
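For context, "first done, first returned" is exactly what asyncio gives you out of the box; here is a stdlib-only sketch (asyncio.sleep stands in for real aiohttp calls, and the function names are my own):

```python
import asyncio

async def fake_request(name, delay):
    # stand-in for an aiohttp request with a given latency
    await asyncio.sleep(delay)
    return name

async def main():
    tasks = [asyncio.create_task(fake_request("slow", 0.2)),
             asyncio.create_task(fake_request("fast", 0.1))]
    # as_completed yields results in completion order, not submission order
    return [await t for t in asyncio.as_completed(tasks)]

print(asyncio.run(main()))  # ['fast', 'slow']
```

The "fast" task was submitted second but comes back first, which is the behaviour greenlet pools don't give you.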
- Faust: this one is great, it uses Kafka, and it's perfect for async tasks… but again, it runs on a single core…
With Celery, I can create a prefork worker and a greenlet worker and pass tasks between them… sounds a bit stupid, but it still works haha… With Faust I can't; there is no parallel-worker option…
What if I have a machine with 16 cores, should I start 16 instances of the concurrent worker? How would I manage their health? A cluster manager is important here, and there isn't one.
Kafka is a bit heavy too (Kafka + ZooKeeper eat ~1 GB of RAM on average, while RabbitMQ uses ~100 MB and Redis ~10 MB).
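The prefork-plus-greenlet trick can at least be approximated in a single Python process with the stdlib: an event loop for the concurrent side and a process pool for the parallel side. A rough sketch (not a cluster manager; the job functions are my own toy examples):

```python
import asyncio
from concurrent.futures import ProcessPoolExecutor

def cpu_job(n):
    # CPU-bound work: runs in a separate process, so it can use another core
    return sum(i * i for i in range(n))

async def io_job():
    # I/O-bound work: stays on the event loop
    await asyncio.sleep(0.05)
    return "io done"

async def main():
    loop = asyncio.get_running_loop()
    with ProcessPoolExecutor(max_workers=2) as pool:
        cpu = loop.run_in_executor(pool, cpu_job, 10_000)
        io = asyncio.create_task(io_job())
        # both kinds of work overlap; gather preserves argument order
        return await asyncio.gather(cpu, io)

if __name__ == "__main__":
    print(asyncio.run(main()))  # [333283335000, 'io done']
```

This still leaves the health-management problem the author raises: nothing here restarts a dead pool worker or spreads load across machines.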
- Arq: very promising, but it doesn't have many stars… I'm not confident enough to use it in our production work.
Back to Ray. What I really like:
- Super lightweight and easy to configure. I don't need another container for the message queue. I believe you do use queues (Redis?) to pass messages between actors, but the end user doesn't have to care about that => good point.
- The flexibility to swap between async and normal tasks. I shared my pain above… with Ray, you just add the async function to the actor. The result can be fetched with either ray.get or await => this feature is really a dream to me.
- I'm a fan of reactive programming, so I love actors, and I love Akka… Although Ray's actors aren't as fully featured as Akka's, being in Python is a strong advantage (in terms of integration and hiring).
- One of the most complete RPC systems I've ever seen in Python. With all the other frameworks above, you need to define the functions in advance on the host… Ray lets me separate the development process from the worker server.
- … some more that I don't remember lol