Ray workflow storage cleanup after completion of the pipeline

How severely does this issue affect your experience of using Ray?

  • Medium: It contributes significant difficulty to completing my task, but I can work around it.

I want to discuss Ray workflow storage. As I understand it, workflow storage stores the result of each DAG in the pipeline. But after the pipeline completes, those results are completely useless and have no value for further tasks (they are effectively temp files).

In cluster mode, shouldn’t it be Ray’s responsibility to clean up the storage after the pipeline completes?

In my case, a workflow DAG downloads and pre-processes a huge file, and the workflow stores the path of that file as its result. Because of its size, the downloaded file is deleted after processing. A few days later, when a new task is pipelined and the same file needs to be downloaded again under different pre-processing, Ray workflows returns the stored path of that file, which no longer exists.

Either we need functionality to restart a completed workflow, or we need Ray to delete the workflow's stored results after the pipeline completes successfully.
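Until something like that lands, one workaround is to wrap the pipeline run and delete its stored results in a `finally` block. This is a minimal sketch: `run_and_cleanup` and its `run_fn`/`delete_fn` parameters are hypothetical names, and the stand-in callables below only simulate what calls into `ray.workflow` would do.

```python
# Sketch of a manual-cleanup pattern (hypothetical helper, NOT a Ray API).
# run_fn would run the pipeline; delete_fn would remove its stored results.

def run_and_cleanup(workflow_id, run_fn, delete_fn):
    """Run a workflow, return its result, and always delete stored results."""
    try:
        return run_fn(workflow_id)
    finally:
        delete_fn(workflow_id)

# Usage with stand-in callables simulating the workflow storage:
store = {"wf-1": "/data/huge_file"}  # simulated stored result (a file path)
result = run_and_cleanup(
    "wf-1",
    run_fn=lambda wid: store[wid],         # simulate running / fetching result
    delete_fn=lambda wid: store.pop(wid),  # simulate deleting stored results
)
# After the run, the stale path is no longer kept in storage.
```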

what are your thoughts?

Thanks in advance

@Harshal_Mittal
I think this is a very useful feature!

I think we can probably add an option for this, or we can provide something like workflow.delete_dag(workflow_id=…) for it.
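For illustration only, the two shapes mentioned above could look like the sketch below. Both `cleanup_on_success` and `delete_dag` are hypothetical names from this thread, not existing Ray APIs, and the in-memory store is a stand-in for real workflow storage.

```python
# Hypothetical API shapes for the two proposals; names are illustrative only.

class WorkflowStore:
    """Stand-in for workflow result storage."""

    def __init__(self):
        self.results = {}

    def run(self, workflow_id, fn, cleanup_on_success=False):
        # Proposal 1: an opt-in flag that deletes stored results on success.
        result = fn()
        self.results[workflow_id] = result
        if cleanup_on_success:
            self.delete_dag(workflow_id)
        return result

    def delete_dag(self, workflow_id):
        # Proposal 2: an explicit deletion call, usable any time afterwards.
        self.results.pop(workflow_id, None)

store = WorkflowStore()
out = store.run("wf-1", lambda: 42, cleanup_on_success=True)
# The result is returned to the caller, but nothing lingers in storage.
```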

I don’t think it’s a complicated feature, but because we are prioritizing stability/scalability issues of the Ray cluster, we might not have enough bandwidth to implement it.

Would you mind contributing this? I can help you with it.