ValueError: Can't find a `node_ip_address.json` file from /scratch/session_2024-04-02_16-32-35_009347_3527526. for 60 seconds
I know the default is /tmp/ray on the running node, but it’s not convenient for me to access that location to look into the details of the running jobs. I’m using Slurm for job submission.
Any help is appreciated.
Thanks @yic. I’m not quite following. _temp_dir is an argument of ray.init(). I’ve actually been using it with ray.train, and it does redirect the output from /tmp to a specified directory. The issue only occurs when using ray.tune.
Hi @wxie2013, everything that starts with _ is a private API. We use it for testing or other purposes. Even though you’ve been using it, that doesn’t make it a public API, and you are not supposed to use it. In the worst case, it can be removed in the future.
To change the tmp dir, just provide --temp-dir= when you start the head node. This is the right way to set up the tmp dir.
Right now, all Ray nodes are supposed to have the same tmp dir. If this doesn’t work for you, you can create a ticket and let me know. I’ll make sure it’s being tracked.
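For readers who drive everything from a Python script, the suggestion above could be sketched roughly as follows. This is only an illustration, not an official recipe: the helper names head_start_command and start_head_and_connect are hypothetical, and the sketch assumes the `ray` CLI is on PATH and that the script runs on the same machine as the head node.

```python
import shlex
import subprocess
from typing import List


def head_start_command(temp_dir: str) -> List[str]:
    """Build the `ray start` command that places session files under temp_dir.

    Hypothetical helper; only the --head and --temp-dir flags are used.
    """
    return ["ray", "start", "--head", f"--temp-dir={temp_dir}"]


def start_head_and_connect(temp_dir: str):
    """Start a head node with a custom temp dir, then attach this script to it."""
    import ray  # imported lazily so the command builder works without Ray installed

    subprocess.run(head_start_command(temp_dir), check=True)
    # address="auto" asks Ray to attach to an already running cluster and
    # raises an error if none is found, rather than starting a new local instance.
    return ray.init(address="auto")


if __name__ == "__main__":
    # Show the shell command that would be run; nothing is started here.
    print(shlex.join(head_start_command("/scratch")))
```

Calling start_head_and_connect("/scratch") from the driver script would then take the place of a bare ray.init(). Whether this also yields the artifacts directory is a separate question that this sketch does not answer.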
Thanks again, @yic. I don’t use the “ray start” command directly. I call ray.init() in a Python script to start the Ray process. For example, a Python script named “example.py”:
import ray
ray.init()
Then I run it via
python example.py
Is there a way to specify the directory as an init argument?
Hi @yic, I tried ray start --temp-dir=/scratch. The issue is that the artifacts directory is missing, i.e. /scratch/ray/session_latest/artifacts does not exist. When using ray.init(_temp_dir='/scratch'), the artifacts directory does exist. Is there another argument that keeps the artifacts in the redirected output when using ray start --temp-dir?
Hi @yic. Yes, I did use only the head node. The issue is that the artifacts directory is missing, and I need the content of that directory for some detailed checks. With ray.init(_temp_dir), the artifacts directory is there.
Using the command “ray start --head --temp-dir=/scratch” creates a session in “/scratch” with the “artifacts” directory missing. When I run my model, a different session is created at the default “/tmp/ray”, and that one contains the “artifacts” directory. This session is different from the one in /scratch, i.e. two different sessions are created, one before running the model and the other after running the model.
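To confirm the two-session behaviour described above, a small stdlib-only check along these lines may help. It is a sketch under assumptions taken from this thread: session directories are named session_* and sit directly under the temp root, and the function name sessions_with_artifacts is hypothetical.

```python
from pathlib import Path
from typing import Dict


def sessions_with_artifacts(temp_root: str) -> Dict[str, bool]:
    """Map each session_* directory under temp_root to whether it has `artifacts`.

    Assumes the layout seen in this thread: <temp_root>/session_*/artifacts.
    A session_latest symlink, if present, is listed like any other entry.
    """
    root = Path(temp_root)
    if not root.is_dir():
        return {}
    return {
        p.name: (p / "artifacts").is_dir()
        for p in sorted(root.glob("session_*"))
        if p.is_dir()
    }


if __name__ == "__main__":
    # Compare the redirected temp dir with the default one mentioned above.
    for temp_root in ("/scratch", "/tmp/ray"):
        print(temp_root, sessions_with_artifacts(temp_root))
```

Running it once after “ray start” and again after the model has started should show whether a second session appears under /tmp/ray and which of the two contains artifacts.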