I think I get why it’s like this. If `cache_stopped_nodes` is set to `True` in the cluster config YAML (the default), then, indeed, you’d have a good chance of running out of space when running multiple consecutive jobs on the same cluster unless the cache is cleared. However, it’d still be nice to have an option to keep the logs on restart.
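For reference, this is roughly where that flag lives in the cluster config (a sketch; the provider `type`/`region` values are placeholders for whatever your actual setup uses):

```yaml
# Fragment of a Ray cluster config YAML (sketch).
provider:
  type: aws
  region: us-east-1
  # Default is True: `ray down` stops nodes instead of terminating them,
  # so the next `ray up` reuses the instances together with their disks
  # (logs included). Setting it to False terminates nodes, discarding
  # the disk contents.
  cache_stopped_nodes: True
```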
By the way, is there a way to automatically stop all the nodes in the cluster when the tuning finishes its time budget?
I tried again, adding the `--stop` option to `ray exec`, and I can see it appends a bunch of commands to the one I’m passing. They are:

```
ray stop
ray teardown ~/ray_bootstrap_config.yaml --yes --workers-only
sudo shutdown -h now
```
However, the command fails and the `tune` session ends up not being run at all, probably due to quoting issues. That’s why I gave up on using it before. I’m most surprised by `ray teardown`, because it appears to do exactly what I need, but isn’t documented here
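As a sanity check on the quoting theory, one way to see what survives the extra shell layer is to pre-escape the inner command before handing it to `ray exec` (a sketch; `tune_script.py` and `cluster.yaml` are placeholder names):

```shell
#!/usr/bin/env bash
# Hypothetical inner command; the script name and flags are placeholders.
inner='python tune_script.py --time-budget 3600'

# printf %q escapes spaces and quotes so the command reaches the head
# node as a single argument, even after `ray exec --stop` appends
# `ray stop; ray teardown ...; sudo shutdown -h now` behind it.
escaped=$(printf '%q' "$inner")
echo "ray exec cluster.yaml $escaped --stop"
```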
I also tried using `idle_timeout_minutes` within the cluster config YAML, but it doesn’t seem to stop the VMs after the timeout I specified.
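This is the setting I mean (a sketch); as far as I can tell it only lets the autoscaler remove *idle worker* nodes, so it would never shut down the head node, and workers still pinned by a running Tune session presumably don’t count as idle:

```yaml
# Top-level key in the cluster config YAML (sketch; value is arbitrary).
idle_timeout_minutes: 5
```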
I also tried appending the above `ray stop`, `ray teardown` and `shutdown` manually to my command. Now, the optimization runs, but one of `stop` and `teardown` (most likely `teardown`) fails because of a lack of AWS rights to call `GetInstanceProfile`. Is this step crucial to the process, or could it perhaps be skipped when getting the workers to shut down?
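In case the failure really is just the missing permission, I assume the IAM policy attached to the head node’s role would need a statement along these lines (a sketch; you’d likely want to scope `Resource` down to your own instance profile ARNs rather than `*`):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRayTeardownProfileLookup",
      "Effect": "Allow",
      "Action": ["iam:GetInstanceProfile"],
      "Resource": "*"
    }
  ]
}
```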
@bbudescu
Can we separate the different issues into different posts or GitHub issues? Having a lengthy discussion covering several topics in one thread makes it really hard for us to help…
Can we focus on why the worker nodes don’t get initialized correctly? Thanks!
Yes, sure. I was just trying to keep a record of the things I tried so I can structure them later into proper GitHub issues.