Challenges with Scaling My Application Using Ray – Need Some Guidance

Hey everyone,

I started using Ray for distributed computing in my app not long ago, but I’m running into some problems with scaling. Right now, I’m trying to run machine learning models across multiple nodes. The speed boost is clear, but I’m struggling to manage resources well.

Here’s what I’m dealing with:

Resource Allocation: Ray doesn’t seem to use all the available resources across nodes in the best way. I’ve played around with the cluster settings, but I’m not seeing the performance improvements I expected.

Task Scheduling: As I scale up, some tasks get stuck or take longer than they should, even though other tasks run without issues.

Error Handling: I’m not sure how to deal with task failures across distributed workers. Sometimes tasks just hang without giving me a clear error message.

Also, I’d like to learn more about cybersecurity while working with Ray, since keeping distributed systems secure is a top concern for me. Any ideas on how to apply cybersecurity best practices when using Ray would help!

Has anyone else run into similar problems? Any tips or tried-and-true methods would be great. Thanks in advance!

Can’t wait to hear what you think!

Best,
joseph

Hi azizi,
Can you share logs or screenshots showing what you expect versus what you’re actually seeing, along with how you’ve set up your cluster and resource settings? :slight_smile: The same for the task scheduling issue, if possible. Also, what error messages are you seeing in the cluster?
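
If it’s easier than screenshots, something like the rough sketch below could capture the resource numbers I’m asking about. It assumes you run it on a node of an already running cluster (hence `address="auto"`), and the helper names `utilization` and `snapshot` are just made up for this example. It only reads what Ray reports through `ray.cluster_resources()`, `ray.available_resources()`, and `ray.nodes()`, so treat it as a starting point rather than a proper diagnostic tool.

```python
def utilization(total, available):
    """Return the fraction of each resource currently in use (0.0 to 1.0)."""
    used = {}
    for name, capacity in total.items():
        if capacity <= 0:
            continue
        # A resource missing from `available` is treated as fully in use.
        free = available.get(name, 0.0)
        used[name] = round((capacity - free) / capacity, 3)
    return used


def snapshot():
    """Collect cluster-wide totals, utilization, and per-node state."""
    import ray  # imported lazily so utilization() works without Ray installed

    # Assumption: a cluster is already running and this node is part of it.
    ray.init(address="auto")
    total = ray.cluster_resources()
    available = ray.available_resources()
    nodes = [
        {"alive": node["Alive"], "resources": node["Resources"]}
        for node in ray.nodes()
    ]
    return {
        "total": total,
        "utilization": utilization(total, available),
        "nodes": nodes,
    }


if __name__ == "__main__":
    import pprint

    pprint.pprint(snapshot())
```

Running it once while the cluster is idle and once while your workload is running, then pasting both outputs here, would show whether some nodes or resource types are sitting unused.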

As for cybersecurity, I found a few resources that might be helpful, but idk if you’ve read them already: