I need to run hundreds of different model training scripts. Training each model requires a data pull from Snowflake and a tune.run() with Ax Bayesian optimization for 15-20 samples.
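For concreteness, each script is roughly shaped like the sketch below. The query, credentials, search space, and loss computation are placeholders standing in for the proprietary code:

```python
import ray
from ray import tune
from ray.tune.suggest.ax import AxSearch
import snowflake.connector


def pull_training_data(query):
    # Placeholder connection details; real values come from our secrets config.
    conn = snowflake.connector.connect(
        user="...", password="...", account="...", warehouse="..."
    )
    try:
        return conn.cursor().execute(query).fetch_pandas_all()
    finally:
        conn.close()


def train_model(config):
    # Each trial currently pulls its own copy of the data from Snowflake,
    # which is where the costly mid-run DB calls come from.
    df = pull_training_data("SELECT ...")  # placeholder query
    # Placeholder for the proprietary model fit; the real script reports a
    # validation loss computed from df and config.
    loss = config["lr"] * len(df)
    tune.report(loss=loss)


ray.init(address="auto")  # attach to the cluster started from the YAML config
tune.run(
    train_model,
    config={"lr": tune.loguniform(1e-4, 1e-1)},  # placeholder search space
    search_alg=AxSearch(metric="loss", mode="min"),
    num_samples=20,                              # 15-20 Ax samples per model
    resources_per_trial={"gpu": 1},
)
```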
My original thought was to ray exec config.yaml train_model.py --arg1 … --arg against the same cluster. However, each time I run ray exec and the cluster autoscales, any existing tune.run() starts sending trials to the new GPU workers (see image).
This is causing two issues:
1. Costly calls to the DB mid-run significantly slow everything down.
2. I believe it is filling up worker memory, causing our nodes to be marked dead.
The code running this is proprietary and I haven't had time to create a reproducible example yet. I am wondering if anyone has a similar use case and any recommendations. Are there any other best practices I should be aware of?
Right now I am considering using the --start flag and the Ray object store to create a mini-cluster for each of the different model training scripts, along the lines of the sketch below.
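Something like this per model (hypothetical names and args), so each script gets its own disposable cluster, pulls the data once, and shares it with its 15-20 trials via the object store instead of hitting Snowflake repeatedly:

```bash
# One throwaway mini-cluster per model training script (hypothetical names/args).
# --start brings the cluster up if needed, --stop tears it down when the command
# finishes, and --cluster-name keeps each run from autoscaling another run's cluster.
ray exec --start --stop --cluster-name model-a config.yaml \
    "python train_model.py --model-id model-a"
```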
Both head & worker nodes are running the rayproject/ray-ml:1.1.0 container on the AWS Deep Learning Base AMI (Ubuntu 18.04).