In a guide I found with some tips on how to use the compiled DAG workflow, there is a suggestion to avoid repeated messaging on the driver node, with a graphic like this
Is there any way to visualize the execution of my workload in such a manner? Or, more generally, are there any good performance debugging tips for large jobs?
My application works pretty well for small and medium-sized workloads, but it seems to scale poorly as I get closer to the cluster capacity. It almost seems like there are deadlocks of some kind.