Hi all. I have quite a perplexing problem: when num_samples=1 in the Ray TuneConfig, the HPO runs as expected and terminates after 1 trial. But when num_samples=x with x>1, the HPO runs indefinitely: it behaves as expected for the first x trials, then keeps launching additional runs that reuse the first trial's params. This only happens when I try to set the resources (CPUs/GPUs). Any ideas?
I’m not running on a cluster.
I have Ray 2.0.0 installed.
It seems to only occur when trying to use a GPU.
Example: the runs in the red box are the ones that are supposed to run. All other trials should not run, yet they do, with the first trial’s param values:
@kai Unfortunately, I can’t seem to reproduce this with a simple minimal example (as opposed to my admittedly complicated objective), so I will close this for now.
This is two years down the line, but since this is the only thread I could find and it doesn’t have an actual solution, I thought I’d reply so people can find it easily.
I ran into the same issue trying to run my script in a container with a docker exec ... command: the tuner.fit call would hang before exiting.
I discovered that running the command with a pseudo-TTY allocated (i.e. docker exec -t ...) makes the issue go away. I’m not sure what the underlying cause is, but it fixed the issue for me.
One thing I’m still trying to figure out, though: it doesn’t work when I call it from my Airflow container. Is that because there’s no TTY session when Airflow invokes the script?