We are using PPO to train in a custom environment on a local cluster consisting of 4 machines managed by Kubernetes. After updating to Ray 1.3 we observed peculiar behavior in which the cluster scales up and down most of the time (throughout the whole training, not only in the start-up phase). In our setup, PPO is scaled to 180 workers, where each worker consumes 1 CPU. The scaling up and down looks like this:
(…)
(autoscaler +1m0s) Resized to 125 CPUs, 3 GPUs.
(autoscaler +1m6s) Resized to 63 CPUs, 2 GPUs.
(autoscaler +1m12s) Resized to 125 CPUs, 3 GPUs.
(autoscaler +1m18s) Resized to 63 CPUs, 2 GPUs.
(autoscaler +1m23s) Resized to 125 CPUs, 3 GPUs.
(autoscaler +1m29s) Resized to 63 CPUs, 2 GPUs.
(autoscaler +1m34s) Resized to 187 CPUs, 4 GPUs.
(autoscaler +1m45s) Resized to 63 CPUs, 2 GPUs.
(…)
Do you have any clue where this behavior may come from? Is it a misconfiguration of our training run, a cluster/hardware issue, or maybe an issue in the code? If any additional information is required, I will be more than happy to provide it.
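For context, here is a minimal sketch of how the rollout workers are configured, written as a Tune/RLlib YAML spec. Only num_workers and num_cpus_per_worker reflect the setup described above; the experiment name, environment name, and everything else are placeholders:

```yaml
# Sketch of the training spec -- only the worker resource settings are real,
# the experiment and environment names are placeholders.
ppo_custom_env:
  run: PPO
  config:
    env: CustomEnv          # placeholder for our custom environment
    num_workers: 180        # 180 rollout workers in total
    num_cpus_per_worker: 1  # each worker consumes a single CPU
```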
Hello, please check the idle timeout parameter in your cluster config, which tells the autoscaler when to take down idle nodes. I'm not sure what your previous Ray version was, but if I understand correctly the autoscaler algorithm changed; here is a link describing the new algorithm: A Glimpse into the Ray Autoscaler by Ameer Haj Ali - YouTube
I have checked that: idle_timeout_minutes is set to 5 minutes, so as you can see from the logs this is not the expected behavior. Regarding the previous version, we were using release 1.2 before.
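For reference, the relevant fragment of our cluster YAML (all other fields omitted here):

```yaml
# Autoscaler idle timeout in minutes, as set in our cluster config.
idle_timeout_minutes: 5
```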
Thanks for sharing the config!
I’m curious whether the problem persists when the head node’s Ray version is upgraded to the latest (as of writing) nightly version.
If upgrading the Ray version doesn’t cause problems for your workflow, that can be achieved by prepending the following to head setup commands: pip uninstall ray -y && pip install https://s3-us-west-2.amazonaws.com/ray-wheels/master/052d2acaee84b5bee8fd772d9d98dd56677d1533/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
That’s assuming you’re using a Python 3.7 image. Substitute the appropriate Python version if using another image.
(Installing Ray — Ray v2.0.0.dev0)
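For example, assuming a standard cluster launcher YAML, prepending those commands would look roughly like this:

```yaml
# Sketch: prepend the nightly wheel install to the head node's setup commands.
head_setup_commands:
  - pip uninstall ray -y
  - pip install https://s3-us-west-2.amazonaws.com/ray-wheels/master/052d2acaee84b5bee8fd772d9d98dd56677d1533/ray-2.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl
  # ... any existing head setup commands follow here
```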
It could also be helpful to see the autoscaling monitor logs – these are /tmp/ray/session_latest/logs/monitor* in the head pod.