Python ray tune unable to stop trial or experiment

Hello everyone,

I am trying to make Ray Tune (with wandb) stop the experiment under certain conditions. Ray Tune just runs indefinitely, not honoring any of my stopping conditions.

I am using Ray 1.10.0 and want Tune to:

  • stop the whole experiment if any trial raises an Exception (so I can fix the code and resume)
  • stop if my score reaches -999
  • stop if the variable varcannotbezero reaches 0

Everything I tried failed to produce the desired behavior:

  • stop={"score":-999,"varcannotbezero":0}
  • max_failures=0
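  • defining a Stopper class did not work either:
import time
from ray.tune import Stopper

class RayStopper(Stopper):
    def __init__(self):
        self._start = time.time()
        #self._deadline = 300
        # Initialize so stop_all() is safe before the first result arrives.
        self.score = None
        self.varcannotbezero = None
    def __call__(self, trial_id, result):
        # Remember the latest reported values; never stop a single trial here.
        self.score = result["score"]
        self.varcannotbezero = result["varcannotbezero"]
        return False
    def stop_all(self):
        # Stop the whole experiment once a sentinel value has been reported.
        return self.score == -999 or self.varcannotbezero == 0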

Ray Tune just continues to run. This is how I call it:

    wandb_project="ABC"
    wandb_api_key="KEY"
    ray.init(configure_logging=False)

    if current_best_params is None:
        algo = HyperOptSearch()
    else:
        algo = HyperOptSearch(points_to_evaluate=current_best_params,n_initial_points=n_initial_points)
    algo = ConcurrencyLimiter(algo, max_concurrent=1)

    scheduler = AsyncHyperBandScheduler()
    analysis = tune.run(
        tune_obj,
        name="Name",
        resources_per_trial={"cpu": 1},
        search_alg=algo,
        scheduler=scheduler,
        metric="score",
        mode="max",
        num_samples=10,
        stop={"score":-999,"varcannotbezero":0},
        max_failures=0,
        config=config,
        callbacks=[WandbLoggerCallback(project=wandb_project,entity="mycompany",api_key=wandb_api_key,log_config=True)],
        local_dir=local_dir,
        resume="AUTO",
        verbose=0
    )
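
One thing that may explain why the `stop` dict above does not behave as intended: as far as I understand Ray 1.x, a dict criterion stops a trial once the reported metric is greater than or equal to the given value, not when it equals it. The following is a minimal pure-Python sketch of that comparison. It does not call Ray, and the function name `dict_stop_reached` is made up for illustration:

```python
def dict_stop_reached(stop, result):
    """Mimic the assumed Ray 1.x rule: stop once any metric >= its threshold."""
    return any(result[key] >= threshold for key, threshold in stop.items())


stop = {"score": -999, "varcannotbezero": 0}

# A healthy result already satisfies both thresholds ...
healthy = {"score": 0.42, "varcannotbezero": 1}
# ... and so does the sentinel "failure" result.
failed = {"score": -999.0, "varcannotbezero": 0}

print(dict_stop_reached(stop, healthy))  # True
print(dict_stop_reached(stop, failed))   # True
```

Under that assumption the dict cannot distinguish a good result from a bad one, and it would only end the individual trial rather than the whole experiment.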

Hey @wfskmoney, the args you have set for tune.run look good to me.

Which stopping criterion is being met but not honored? Do you also see the same behavior if you don’t have resume set?

Would you also mind sharing your training function and the most recent stdout?

My optimization function calls a script in the background; I am wrapping Tune around it.

@amogkam: I tried raising errors inside myscript() and inside tune_obj, but neither stops the experiment.

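import uuid
from ray import tune
from ray.tune.error import TuneError

# myscript, Project, local_dir, window, checkpoint_dir and Backtest are not shown in this post.
def evaluation_fn(config):
    # Hyperparameters are the config entries prefixed with "HP_" (unused in this excerpt).
    paramDict={k:v for k,v in config.items() if k.startswith("HP_")}
    # Pseudo-random trial identifier of up to 16 digits.
    Trial=str(uuid.uuid4().int>>64)[0:16]
    # run script
    Result=myscript(Project,Trial,local_dir,window)
    return Result, Trial

def tune_obj(config,checkpoint_dir=checkpoint_dir):
    Result, Trial = evaluation_fn(config)
    if len(Result)==0:
        # Report the sentinel values first, then fail the trial.
        tune.report(score=-999.0,Backtest=Backtest,varcannotbezero=0)
        raise TuneError("ERROR: Result empty")
        # raise Exception("ERROR: Result empty")
    else:
        Result={f"Lower{k}":v[0] for k,v in Result.items()}
        tune.report(score=Result['score'],varcannotbezero=1,**Result)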

@amogkam I got it to work with a Stopper class.

However:

  • after the stop I cannot resume once I have changed/fixed my code
  • Ray Tune still ignores Exceptions or TuneErrors and just continues
import time
from ray.tune import Stopper

class RayStopper(Stopper):
    def __init__(self):
        self._start = time.time()
        self._qccookie=1
        self._trial_id=""
    def __call__(self, trial_id, result):
        # Record the most recent trial and its qccookie flag; never stop a single trial.
        self._trial_id=trial_id
        self._qccookie=result["qccookie"]
        return False
    def stop_all(self):
        secs=int(time.time())
        # Debug output, printed only when the current second is a multiple of 10.
        if secs % 10 == 0:
            print("-----------------RayStopper--------------")
            print(f"trial_id={self._trial_id}")
            print(f"qccookie={self._qccookie}")
        # Stop the whole experiment once a trial has reported qccookie == 0.
        return self._qccookie==0
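
For readers following along, here is a hedged sketch of how such a stopper could be wired into tune.run through the `stop` argument. The class name `FlagStopper`, its `flag_key` parameter, and the helper `run_with_stopper` are hypothetical. Unlike the class above, this variant latches the flag, so a later healthy result from another trial cannot reset it. The fallback base class exists only so the sketch can be read and tried without Ray installed:

```python
try:
    from ray.tune import Stopper
except ImportError:
    class Stopper:  # stand-in with the same two methods, used only when Ray is absent
        def __call__(self, trial_id, result):
            raise NotImplementedError

        def stop_all(self):
            raise NotImplementedError


class FlagStopper(Stopper):
    """Hypothetical stopper: end the experiment once any trial reports flag == 0."""

    def __init__(self, flag_key="qccookie"):
        self._flag_key = flag_key
        self._triggered = False

    def __call__(self, trial_id, result):
        # Latch the failure so later results cannot overwrite it.
        if result.get(self._flag_key) == 0:
            self._triggered = True
        return self._triggered

    def stop_all(self):
        return self._triggered


def run_with_stopper(trainable, config):
    """Pass the stopper via `stop=`; Ray is imported lazily and is required here."""
    from ray import tune
    return tune.run(trainable, config=config, stop=FlagStopper())
```

Note that an experiment ended this way counts as finished, which matches the resume problem described next.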

Hello @amogkam, just wondering if you know how I can resume an experiment after it has met the Stopper conditions.

Ideally I want it to stop, then inspect my experiment, and then be able to continue from the stopping point, so that the HyperOpt optimizer doesn't start from scratch.

I tried to manually delete the tune_obj_e65e1e75 folder with the stopped trial, but when I resume the experiment, it just says “finished”.

Glad you got the Stopper working!

Can you try doing this:

  • Inside your training function, if any of the stopping criteria are met, raise an error.
  • Specify fail_fast=True in tune.run. This will stop the entire experiment when any trial raises an error (and therefore when any trial reaches the stopping condition).
  • Then you can resume the experiment since it would be in an error state and not have finished yet.
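
The steps above can be sketched roughly as follows. This is a minimal illustration, not a tested setup: `run_script`, `check_result`, `trainable`, and `main` are placeholder names, the search space is invented, and whether the errored trial is retried on resume depends on the resume mode, which is not verified here.

```python
def run_script(config):
    """Placeholder for the real background script; returns a dict of metrics."""
    return {"score": config.get("x", 0.0)}


def check_result(result):
    """Raise if the stopping condition (empty result) is met, else return the result."""
    if len(result) == 0:
        raise RuntimeError("Result empty: stopping the experiment")
    return result


def trainable(config):
    from ray import tune  # imported lazily so the helpers above run without Ray
    result = check_result(run_script(config))
    tune.report(score=result["score"])


def main():
    from ray import tune
    tune.run(
        trainable,
        name="Name",
        config={"x": tune.uniform(0, 1)},
        num_samples=10,
        fail_fast=True,  # abort the whole experiment on the first trial error
        resume="AUTO",   # as in the original call: continue an existing experiment if present
    )
```

Calling `main()` again after fixing the code would then attempt to continue the same experiment instead of starting a new one.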

Would this work for your use case?