[Workflows] S3 storage results in extremely slow workflow scheduling

I’m executing a small DAG (about 30 tasks) using workflows with S3 as storage. The problem is that workflow initialization is extremely slow (see the log timestamps):

(Scheduler pid=5463) 2023-04-16 08:52:20,301    INFO workflow_access.py:356 -- Initializing workflow manager...
(Scheduler pid=5463) 2023-04-16 08:53:41,355    INFO api.py:203 -- Workflow job created. [id="workflow_0736d415-212d-4d02-bb07-8094740f7f54.1681624333.095297098_0"].
(WorkflowManagementActor pid=5466) 2023-04-16 08:58:28,482      INFO workflow_executor.py:86 -- Workflow job [id=workflow_0736d415-212d-4d02-bb07-8094740f7f54.1681624333.095297098_0] started.
(_workflow_task_executor_remote pid=5465) 2023-04-16 08:58:32,393       INFO task_executor.py:78 -- Task status [RUNNING]       [workflow_0736d415-212d-4d02-bb07-8094740f7f54.1681624333.095297098_0@workflow_0736d415-212d-4d02-bb07-8094740f7f54.1681624333.095297098_0_catalog_df_0_0]

It takes about 5-10 minutes for the workflow to start executing tasks. With local storage everything is instant.

Ray 2.3.1; tried with pyarrow 8.0.0 and 10.0.1, same results. Ran both locally and on Kubernetes with minikube, same results.
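
For reference, the workflow is launched roughly like this; the bucket/prefix and the task body below are placeholders, not the real code:

```python
import ray
from ray import workflow

# Workflow storage comes from the Ray storage URI (placeholder bucket/prefix).
ray.init(storage="s3://my-bucket/ray-workflows")

@ray.remote
def step(x):
    # The real tasks transform dataframes; the body is irrelevant to the issue.
    return x + 1

# Chain ~30 tasks into a DAG and run it as a workflow.
node = step.bind(0)
for _ in range(29):
    node = step.bind(node)

result = workflow.run(node, workflow_id="catalog-dag")
```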

I tracked the S3 contents while Ray hangs, and it looks like it populates a duplicate_name_counter/ folder (what is it for?) in S3 for each task in the DAG. Does that happen serially, with a separate call per task? If so, this could be one of the problems: a DAG with hundreds of tasks would mean hundreds of sequential calls to S3.
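
As a rough sanity check on the per-call cost, a single small write through pyarrow's S3 filesystem (which, as far as I understand, is what the storage layer goes through) can be timed like this; the bucket/prefix is a placeholder:

```python
import time
from pyarrow import fs

# Placeholder URI; roughly the kind of small object the workflow metadata writes.
s3, base = fs.FileSystem.from_uri("s3://my-bucket/ray-workflows")

start = time.perf_counter()
with s3.open_output_stream(f"{base}/latency_probe") as f:
    f.write(b"0")
print(f"one small PUT took {time.perf_counter() - start:.2f}s")
```

Even if each such round trip takes a few hundred milliseconds, 30 of them done serially would only add on the order of 10 seconds, so either there are several metadata writes per task or something else accounts for the remaining minutes.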

The above happens at the INFO workflow_access.py:356 -- Initializing workflow manager... line and takes about 2 minutes. What happens after the INFO api.py:203 -- Workflow job created. line? Why the additional ~5 minutes?
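
Happy to collect more detail if it helps; I was going to re-run with debug logging enabled, roughly like this (the logger name is a guess based on the file paths in the log lines above):

```python
import logging
import ray

# Bump the workflow loggers to DEBUG; "ray.workflow" is a guess from the
# ray/workflow/*.py paths shown in the log output above.
logging.getLogger("ray.workflow").setLevel(logging.DEBUG)

# Placeholder bucket/prefix, same setup as in the repro sketch.
ray.init(storage="s3://my-bucket/ray-workflows", logging_level=logging.DEBUG)
```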