Here all the workers have different IP addresses. Network reachability is not an issue, as a smaller-scale cluster worked fine.
The tasks have been stuck for hours (both the dashboard and the CLI show that the worker is idle / a new worker has been added), so the issue is likely not related to synchronization.
A ray.nodes() call correctly reports dead workers and alive (idle) workers, the same as in the dashboard.
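
For anyone who wants to reproduce the check, here is a rough sketch of how the ray.nodes() output could be summarized. It assumes each record carries the "Alive" and "NodeManagerAddress" keys; the helper name summarize_nodes is made up for illustration and is not part of Ray.

```python
from collections import Counter


def summarize_nodes(nodes):
    """Split ray.nodes()-style records into alive and dead node addresses.

    `nodes` is assumed to be a list of dicts with "Alive" and
    "NodeManagerAddress" keys, as returned by ray.nodes().
    This helper is hypothetical and not part of the Ray API.
    """
    alive = [n.get("NodeManagerAddress") for n in nodes if n.get("Alive")]
    dead = [n.get("NodeManagerAddress") for n in nodes if not n.get("Alive")]
    # Flag any IP that shows up on more than one alive node.
    duplicates = [ip for ip, count in Counter(alive).items() if count > 1]
    return {"alive": alive, "dead": dead, "duplicate_alive_ips": duplicates}
```

On a live cluster this would be called as summarize_nodes(ray.nodes()) after ray.init(address="auto"). An empty duplicate_alive_ips list matches the observation that all workers have different IPs.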