Challenges with Scaling My Application Using Ray – Need Some Guidance

azizi · April 16, 2025, 6:46am

Hey everyone,

I started using Ray for distributed computing in my app not long ago, but I’m running into some problems with scaling. Right now, I’m trying to run machine learning models across multiple nodes. The speed boost is clear, but I’m struggling to manage resources well.

Here’s what I’m dealing with:

Resource Allocation: Ray doesn’t seem to use all the available resources across nodes in the best way. I’ve played around with the cluster settings, but I’m not seeing the performance improvements I expected.

Task Scheduling: As I scale up, some tasks get stuck or take longer than they should, even though other tasks run without issues.

Error Handling: I’m not sure how to deal with task failures across distributed workers. Sometimes tasks just hang without giving me a clear error message.

Also, I’ve been thinking about how to learn cybersecurity while working with Ray, as keeping distributed systems safe is a top concern for me. Any ideas on how to bring in cybersecurity best practices when using Ray would help!

Has anyone else run into similar problems? Any tips or tried-and-true methods would be great. Thanks in advance!

Can’t wait to hear what you think!

Best,
joseph

christina · April 16, 2025, 11:10pm

Hi azizi,
Could you share logs or screenshots showing what you expect versus what you’re actually seeing, along with how you’ve set up your cluster and resource settings? If possible, include the same for the task scheduling issue. Also, what error messages are you seeing in the cluster?

As for cybersecurity, I found a few resources that might be helpful but idk if you’ve read them before already:

Security — Ray 2.44.1
https://cloud.google.com/blog/products/containers-kubernetes/securing-ray-to-run-on-google-kubernetes-engine (I know this is for Kubernetes specifically, but it might still be useful since it touches on security in Ray)
