Python ray tune unable to stop trial or experiment

wfskmoney · February 16, 2022, 5:15pm

Hello everyone,

I am trying to make Ray Tune with wandb stop the experiment under certain conditions. Ray Tune just runs indefinitely, not honoring any of my stopping conditions.

I am using Ray 1.10.0.

I want to:

stop the whole experiment if any trial raises an Exception (so I can fix the code and resume)
stop if my score gets to -999
stop if the variable varcannotbezero gets to 0

I tried the following, and none of it produced the desired behavior:

stop={"score":-999,"varcannotbezero":0}
max_failures=0
Defining a Stopper class also did not work:

class RayStopper(Stopper):
    def __init__(self):
        self._start = time.time()
        #self._deadline = 300
    def __call__(self, trial_id, result):
        self.score=result["score"]
        self.varcannotbezero=result["varcannotbezero"]
        return False
    def stop_all(self):
        if self.score==-999 or self.varcannotbezero==0:
            return True
        else:
            return False

Ray tune just continues to run

    wandb_project="ABC"
    wandb_api_key="KEY"
    ray.init(configure_logging=False)

    if current_best_params is None:
        algo = HyperOptSearch()
    else:
        algo = HyperOptSearch(points_to_evaluate=current_best_params,n_initial_points=n_initial_points)
    algo = ConcurrencyLimiter(algo, max_concurrent=1)

    scheduler = AsyncHyperBandScheduler()
    analysis = tune.run(
        tune_obj,
        name="Name",
        resources_per_trial={"cpu": 1},
        search_alg=algo,
        scheduler=scheduler,
        metric="score",
        mode="max",
        num_samples=10,
        stop={"score":-999,"varcannotbezero":0},
        max_failures=0,
        config=config,
        callbacks=[WandbLoggerCallback(project=wandb_project,entity="mycompany",api_key=wandb_api_key,log_config=True)],
        local_dir=local_dir,
        resume="AUTO",
        verbose=0
    )
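For reference, a minimal sketch of the per-trial stopping conditions above written as a function instead of a dictionary. As far as I know, the dictionary form of `stop` is a threshold check (a trial stops once the reported value is greater than or equal to the given value), so it does not express "equals -999" or "equals 0". `tune.run` also accepts a callable taking `(trial_id, result)`; the name `should_stop_trial` is made up for this example, and returning `True` only stops the trial that reported the result, not the whole experiment.

```python
def should_stop_trial(trial_id, result):
    """Return True when a trial reports one of the sentinel values."""
    # Exact-match checks on the metrics reported via tune.report(...)
    return result["score"] == -999 or result["varcannotbezero"] == 0
```

It would be passed as `stop=should_stop_trial` in the `tune.run` call in place of the dictionary.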

amogkam · February 16, 2022, 5:25pm

Hey @wfskmoney, the args you have set for tune.run look good to me.

Which stopping criteria are being met but not being honored? Do you also see the same behavior if you don’t have resume set?

Do you also mind sharing your training function, and the most recent stdout?

wfskmoney · February 16, 2022, 5:48pm

My optimization function calls a script in the background; I am wrapping Tune around it.

@amogkam: I tried raising errors inside myscript() and inside tune_obj, but neither stops the experiment.

def evaluation_fn(config):
    paramDict={k:v for k,v in config.items() if k.startswith("HP_")}
    Trial=str(uuid.uuid4().int>>64)[0:16]
    # run script
    Result=myscript(Project,Trial,local_dir,window)
    return Result, Trial

def tune_obj(config,checkpoint_dir=checkpoint_dir):
    Result, Trial = evaluation_fn(config)
    if len(Result)==0:
        tune.report(score=-999.0,Backtest=Backtest,varcannotbezero=0)
        raise TuneError("ERROR: Result empty")
        # raise Exception("ERROR: Result empty")
    else:
        Result={f"Lower{k}":v[0] for k,v in Result.items()}
        tune.report(score=Result['score'],varcannotbezero=1,**Result)

wfskmoney · February 16, 2022, 6:51pm

@amogkam I got it to work with a Stopper class.

However:

After the stop I cannot resume, even after changing/fixing my code.
Ray Tune still ignores Exceptions and TuneErrors and just continues.

class RayStopper(Stopper):
    def __init__(self):
        self._start = time.time()
        self._qccookie=1
        self._trial_id=""
    def __call__(self, trial_id, result):
        self._trial_id=trial_id
        self._qccookie=result["qccookie"]
        return False
    def stop_all(self):
        secs=int(time.time())
        runtime=secs - self._start
        if secs % 10 == 0:
            print(f"-----------------RayStopper--------------")
            print(f"trial_id={self._trial_id}")
            print(f"qccookie={self._qccookie}")
        if self._qccookie==0:
            return True
        else:
            return False
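A possible variant of the Stopper above, as a sketch only: it latches the stop flag, so a later result with `qccookie=1` from another trial cannot overwrite an earlier `qccookie=0` before `stop_all` is checked. The class name `LatchingStopper` and the import fallback are assumptions made for this example so that it can be run without Ray installed.

```python
try:
    from ray.tune import Stopper
except ImportError:
    # Stand-in base class so the sketch runs without Ray; not Ray's class.
    class Stopper:
        pass


class LatchingStopper(Stopper):
    """Stop the whole experiment once any trial reports qccookie == 0."""

    def __init__(self):
        self._should_stop = False

    def __call__(self, trial_id, result):
        # Latch the flag: once set, later healthy results do not clear it.
        if result.get("qccookie") == 0:
            self._should_stop = True
        return False  # never stop an individual trial from here

    def stop_all(self):
        return self._should_stop
```

It would be passed to `tune.run` as `stop=LatchingStopper()`.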

wfskmoney · February 17, 2022, 1:34pm

Hello @amogkam, just wondering if you know how I can resume an experiment after it has met the Stopper conditions.

Ideally I want it to stop so that I can inspect my experiment, and then continue from the stopping point, so that the HyperOpt optimizer doesn't start from 0.

I tried manually deleting the tune_obj_e65e1e75 folder with the stopped trial, but when I resume the experiment, it just says “finished”.

amogkam · February 19, 2022, 3:07am

Glad you got the Stopper working!

Can you try doing this:

1. Inside your training function, if any of the stopping criteria is met, then raise an error.
2. Specify fail_fast=True in tune.run. This will stop the entire experiment when any trial raises an error (and therefore when any trial reaches the stopping condition).
3. Then you can resume the experiment since it would be in an error state and not have finished yet.
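The steps above could look roughly like the following sketch. The helper names `check_result`, `run_script` and `run_experiment` are hypothetical; `run_script` is only a stand-in for the real script call, and the Ray imports are done inside the functions so that the check itself can be tried without Ray installed.

```python
def check_result(result):
    """Raise if the script output is unusable, so the trial ends in an error state."""
    if len(result) == 0:
        raise RuntimeError("Result empty: stopping condition met")
    return result


def run_script(config):
    # Hypothetical stand-in for the real background script.
    return {"score": float(config.get("x", 0.0))}


def tune_obj(config):
    from ray import tune  # imported lazily; requires Ray to be installed

    result = check_result(run_script(config))
    tune.report(score=result["score"])


def run_experiment(config):
    from ray import tune

    return tune.run(
        tune_obj,
        config=config,
        metric="score",
        mode="max",
        fail_fast=True,  # abort the whole experiment on the first trial error
    )
```

With this shape, the error raised by `check_result` is what marks the trial as failed, and `fail_fast=True` is what turns that single failure into an experiment-wide stop.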

Would this work for your use case?
