Ray log location

I think I get why it’s like this. If cache_stopped_nodes is set to True in the cluster config yaml (the default), then, indeed, there’s a good chance of running out of disk space when running multiple consecutive jobs on the same cluster unless the cache is cleared. However, it would still be nice to have an option to keep the logs on restart.
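For reference, this is roughly where that option lives in the config (the other provider fields below are placeholders, not my actual values):

```yaml
# Excerpt of the autoscaler cluster config; only cache_stopped_nodes matters here,
# the rest of the provider section is placeholder values
provider:
  type: aws
  region: us-east-1           # placeholder
  cache_stopped_nodes: True   # default; stopped nodes (and their disks/logs) get reused
```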

By the way, is there a way to automatically stop all the nodes in the cluster when the tuning finishes its time budget?

I tried again, adding the --stop option to ray exec, and I can see it appends a few extra commands to the one I’m passing. They are:

  • ray stop
  • ray teardown ~/ray_bootstrap_config.yaml --yes --workers-only
  • sudo shutdown -h now

However, the combined command fails and the tune session ends up not running at all, probably due to quoting issues. That’s why I gave up on using it before. I’m most surprised by ray teardown, because it appears to do exactly what I need, yet it isn’t documented here.
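For context, the invocation looked roughly like this (the cluster file and script name are placeholders for my actual ones); my suspicion is that the commands appended by --stop interact badly with the quoting of the inner command:

```bash
# Placeholder names; with --stop, ray exec appends
#   ray stop; ray teardown ~/ray_bootstrap_config.yaml --yes --workers-only; sudo shutdown -h now
# to the command I pass, which is where I suspect the quoting breaks
ray exec cluster.yaml "python tune_script.py" --stop
```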

I also tried setting idle_timeout_minutes in the cluster config yaml, but it doesn’t seem to stop the VMs after the timeout I specified.
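This is the shape of what I set (the value here is a placeholder, not necessarily the one I used):

```yaml
# Top-level field in the cluster config yaml; the autoscaler is supposed to
# tear down idle worker nodes after this many minutes
idle_timeout_minutes: 5   # placeholder value
```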

I also tried appending the above ray stop, ray teardown and shutdown commands manually to my own command. Now the optimization runs, and one of stop and teardown (most likely teardown) fails because of missing AWS permissions to call GetInstanceProfile. Is this step crucial to the process, or could it perhaps be skipped when getting the workers to shut down?
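Concretely, the manually chained version looked roughly like this (the script name is a placeholder):

```bash
# Placeholder script name; the chained commands mirror what --stop appends.
# The teardown step is the one that hits the GetInstanceProfile permission error.
ray exec cluster.yaml "python tune_script.py; ray stop; ray teardown ~/ray_bootstrap_config.yaml --yes --workers-only; sudo shutdown -h now"
```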

@bbudescu
Can we separate the different issues into different posts or GitHub issues? Having a lengthy discussion here that spans several topics makes it really hard for us to help…

Can we focus on why the worker nodes don’t get initialized correctly? Thanks!

Yes, sure. I was just trying to keep track of the things I tried so that I can structure them later into proper GitHub issues.