I think I get why it’s like this. If `cache_stopped_nodes` is set to `True` in the cluster config YAML (the default), then, indeed, you’d have a good chance of running out of space when running multiple consecutive jobs on the same cluster unless the cache is cleared. However, it’d still be nice to have an option to keep the logs on restart.
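For reference, this is roughly where that flag lives in the cluster config (a sketch; the provider `type`/`region` values are placeholders for whatever your actual setup uses):

```yaml
# Fragment of a Ray cluster config YAML (sketch).
provider:
  type: aws
  region: us-east-1
  # Default is True: `ray down` stops nodes instead of terminating them,
  # so the next `ray up` reuses the instances together with their disks
  # (logs included). Setting it to False terminates nodes, discarding
  # the disk contents.
  cache_stopped_nodes: True
```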
By the way, is there a way to automatically stop all the nodes in the cluster when the tuning finishes its time budget?
I tried again, adding the `--stop` option to `ray exec`, and I can see it appends a bunch of commands to the one I’m passing. They are:

```
ray stop
ray teardown ~/ray_bootstrap_config.yaml --yes --workers-only
sudo shutdown -h now
```
However, the command fails and the `tune` session ends up not being run at all, probably due to quoting issues. That’s why I gave up on using it before. I’m most surprised by `ray teardown`, because it appears to do exactly what I need, but isn’t documented here
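As a sanity check on the quoting theory, one way to see what survives the extra shell layer is to pre-escape the inner command before handing it to `ray exec` (a sketch; `tune_script.py` and `cluster.yaml` are placeholder names):

```shell
#!/usr/bin/env bash
# Hypothetical inner command; the script name and flags are placeholders.
inner='python tune_script.py --time-budget 3600'

# printf %q escapes spaces and quotes so the command reaches the head
# node as a single argument, even after `ray exec --stop` appends
# `ray stop; ray teardown ...; sudo shutdown -h now` behind it.
escaped=$(printf '%q' "$inner")
echo "ray exec cluster.yaml $escaped --stop"
```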
I also tried using `idle_timeout_minutes` within the cluster config YAML, but it doesn’t seem to stop the VMs after the timeout I specified.
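This is the setting I mean (a sketch); as far as I can tell it only lets the autoscaler remove *idle worker* nodes, so it would never shut down the head node, and workers still pinned by a running Tune session presumably don’t count as idle:

```yaml
# Top-level key in the cluster config YAML (sketch; value is arbitrary).
idle_timeout_minutes: 5
```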
I also tried appending the above `ray stop`, `ray teardown` and `shutdown` manually to my command. Now, the optimization runs, but one of `stop` and `teardown` (most likely `teardown`) fails because of a lack of AWS rights to call `GetInstanceProfile`. Is this step crucial to the process, or could it perhaps be skipped when getting the workers to shut down?
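In case the failure really is just the missing permission, I assume the IAM policy attached to the head node’s role would need a statement along these lines (a sketch; you’d likely want to scope `Resource` down to your own instance profile ARNs rather than `*`):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowRayTeardownProfileLookup",
      "Effect": "Allow",
      "Action": ["iam:GetInstanceProfile"],
      "Resource": "*"
    }
  ]
}
```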
@bbudescu
Can we separate the different issues into different posts or GitHub issues? Having a lengthy discussion covering several topics in one thread makes it really hard for us to help…
Can we focus on why the worker nodes don’t get initialized correctly? Thanks!
Yes, sure. I was just trying to keep a record of the things I tried so I can structure them later into proper GitHub issues.