Hey everyone,
I started using Ray for distributed computing in my app not long ago, but I’m running into some problems with scaling. Right now, I’m trying to run machine learning models across multiple nodes. The speed boost is clear, but I’m struggling to manage resources well.
Here’s what I’m dealing with:
Resource Allocation: Ray doesn’t seem to use the resources available across nodes efficiently. I’ve played around with the cluster settings, but I’m not seeing the performance improvements I expected.
Task Scheduling: As I scale up, some tasks get stuck or take longer than they should, even though other tasks run without issues.
Error Handling: I’m not sure how to deal with task failures across distributed workers. Sometimes tasks just hang without giving me a clear error message.
Also, I’ve been looking to learn more about cybersecurity while working with Ray, since keeping distributed systems safe is a top concern for me. Any ideas on how to apply cybersecurity best practices to a Ray cluster would help!
Has anyone else run into similar problems? Any tips or tried-and-true methods would be great. Thanks in advance!
Can’t wait to hear what you think!
Best,
joseph