Hello,
I am seeing unusually slow task execution in my Ray cluster and I’m struggling to pinpoint the root cause. Each task does moderate data processing (around 100MB per task), and although the cluster has sufficient resources (8 nodes, each with 16 cores and 64GB RAM), tasks take much longer than expected.
I have verified that the data is loaded correctly, and no significant network bottlenecks are apparent. However, the slowdown persists, especially when scaling to a higher number of tasks.
I’ve tried enabling Ray’s logging and monitoring features, but the logs aren’t showing any clear errors or warnings. Are there specific debugging tools or configurations within Ray that can help identify the issue?
Could this be related to task scheduling, resource allocation, or some subtle misconfiguration? I have checked the Ray Distributed Debugger documentation (Ray 2.40.0), but the issue remains unresolved.
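One way I could try to narrow this down is to time the work inside each task and compare it with the end-to-end wall time. Below is a minimal sketch of that idea, not my actual code: `process` is a hypothetical stand-in for the real processing, the payloads are small 1MB dummies rather than the real 100MB data, and `timed_process`, `summarize`, and `run_on_ray` are names made up for illustration.

```python
import time


def process(chunk: bytes) -> int:
    """Hypothetical stand-in for the real per-task processing."""
    return len(chunk)


def timed_process(chunk: bytes):
    """Run process() and also return how long the work itself took."""
    start = time.perf_counter()
    result = process(chunk)
    return result, time.perf_counter() - start


def summarize(compute_times, wall_time, parallelism):
    """Compare in-task compute time against end-to-end wall time.

    ideal_wall_s assumes perfect parallelism, so overhead_s is only a
    rough estimate of time spent outside the task bodies.
    """
    total_compute = sum(compute_times)
    ideal_wall = total_compute / parallelism
    return {
        "total_compute_s": total_compute,
        "ideal_wall_s": ideal_wall,
        "overhead_s": wall_time - ideal_wall,
    }


def run_on_ray(num_tasks: int = 32, chunk_size: int = 1_000_000):
    """Submit dummy tasks to Ray and print the timing report."""
    import ray  # imported here so the helpers above work without Ray

    ray.init(address="auto")  # assumes an already running cluster
    remote_process = ray.remote(timed_process)

    chunks = [bytes(chunk_size) for _ in range(num_tasks)]
    start = time.perf_counter()
    outputs = ray.get([remote_process.remote(c) for c in chunks])
    wall_time = time.perf_counter() - start

    cpus = int(ray.cluster_resources().get("CPU", 1))
    report = summarize([t for _, t in outputs], wall_time, min(cpus, num_tasks))
    print(report)
    return report
```

If `overhead_s` turns out large compared with `ideal_wall_s`, that would suggest the time is going into scheduling, serialization, or data transfer rather than the processing itself, but I am not sure this is the right way to measure it.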
Any guidance on how to systematically debug such performance problems in a Ray cluster would be greatly appreciated.
Thank you!