I am trying to use Ray on a high-performance computing (HPC) cluster managed by Slurm.
Here is my code, which is really simple…
import ray
ray.init()
@ray.remote
def f(x):
    return x * x
futures = [f.remote(i) for i in range(4)]
print(ray.get(futures)) # [0, 1, 4, 9]
I tried to run python test.py on the allocated node, but it failed… Can anyone kindly help?
It just gets stuck at:
2025-01-26 02:10:34,453 INFO worker.py:1841 -- Started a local Ray instance.
E0126 02:10:37.663225099 1169908 thd.cc:157] pthread_create failed: Resource temporarily unavailable
Thanks for any help in advance!
Hi there!
It seems like you might be running into some resource allocation issues in Slurm. Here are a few things you might want to try out.
- Set OMP_NUM_THREADS: If you are using libraries that rely on OpenMP (such as NumPy or SciPy), set the environment variable OMP_NUM_THREADS=1 before running your script. This limits the number of threads these libraries use. Read more about it here: Install RLlib for Development — Ray 2.41.0
- Increase Slurm Resources: Make sure your Slurm job requests enough resources (CPUs, memory) to handle the workload. You might need to adjust your Slurm script to request more. What resources do you currently have in Slurm? (Here are some docs you can read too: Debugging Memory Issues — Ray 2.41.0)
- Try running some of the Ray debugging tools, like ray memory or ray stack, to see if there are any other errors.
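To make the first two points concrete, here is a rough sketch of how test.py could size Ray to what Slurm actually granted instead of whatever Ray detects on the node. This is an assumption about your setup, not a confirmed fix: the helper names slurm_cpu_count and start_ray are made up for illustration, and the fallback to 1 CPU is just a conservative default I picked. SLURM_CPUS_PER_TASK is only set if your job requested --cpus-per-task, so the sketch falls back to SLURM_CPUS_ON_NODE.

```python
import os


def slurm_cpu_count(env=os.environ, default=1):
    """Return the CPU count Slurm granted this task, or `default` if unknown."""
    # SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested;
    # SLURM_CPUS_ON_NODE is used as a fallback.
    for name in ("SLURM_CPUS_PER_TASK", "SLURM_CPUS_ON_NODE"):
        value = env.get(name, "")
        if value.isdigit() and int(value) > 0:
            return int(value)
    return default


def start_ray():
    """Start a local Ray instance sized to the Slurm allocation (hypothetical helper)."""
    # Limit OpenMP threads before NumPy/SciPy get imported.
    os.environ.setdefault("OMP_NUM_THREADS", "1")
    import ray  # imported lazily so slurm_cpu_count works without Ray installed

    # Pass the CPU count explicitly rather than letting Ray auto-detect it.
    ray.init(num_cpus=slurm_cpu_count())
    return ray
```

In your script you would call start_ray() in place of the bare ray.init(). If the hang goes away with a small num_cpus, that points toward Ray starting more workers and threads than the allocation allows.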
Just from what you described, though, my guess is that there might not be enough resources allocated, so let me know what you find out after adjusting some settings!
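One more thing that may help narrow it down: "pthread_create failed: Resource temporarily unavailable" is the error you get when a new thread cannot be created, which on Linux can happen when the per-user process/thread limit is reached. Below is a small sketch, assuming a Linux compute node, that prints that limit using Python's standard resource module. The function names are just illustrative.

```python
import resource


def format_limit(value):
    """Render an rlimit value, showing 'unlimited' for RLIM_INFINITY."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def nproc_limits():
    """Return the (soft, hard) per-user process/thread limit as strings."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return format_limit(soft), format_limit(hard)


if __name__ == "__main__":
    soft, hard = nproc_limits()
    print(f"max user processes/threads: soft={soft}, hard={hard}")
```

Run it inside the Slurm allocation, since the limits there can differ from the login node. A low soft limit would be worth mentioning to your cluster admins.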
Christina