PBT problem about trial restart

Hello,

I have a problem with tune.run and PBT. I am using PBT to search for good hyperparameters such as lr and gamma. In addition, I use tune.grid_search([0, 1]) for a port parameter, which each trial uses to start its server and which also controls the number of trials, like this:

+--------------------------+---------+---------------------+------+------+------------+
| Trial name               | status  | loc                 | port | iter | avg_reward |
|--------------------------+---------+---------------------+------+------+------------|
| train_config_fb25f_00000 | RUNNING | 10.28.232.224:96868 |    0 |   19 |   -17.7667 |
| train_config_fb25f_00001 | RUNNING | 10.28.232.224:96870 |    1 |   18 |   -18.4333 |
+--------------------------+---------+---------------------+------+------+------------+

But once I use PBT, whenever it perturbs a config, the trial seems to restart and pick a new value for the port parameter. This can leave two trials using the same port, which causes an "Address already in use" error:

+--------------------------+---------+---------------------+------+------+------------+
| Trial name               | status  | loc                 | port | iter | avg_reward |
|--------------------------+---------+---------------------+------+------+------------|
| train_config_fb25f_00000 | RUNNING | 10.28.232.224:96868 |    0 |   19 |   -17.7667 |
| train_config_fb25f_00001 | RUNNING | 10.28.232.224:96870 |    0 |   18 |   -18.4333 |
+--------------------------+---------+---------------------+------+------+------------+

How can I fix it?

Thanks for your help.

Hey @Qian_Zhao, can you share your code if possible? In particular, it would be great to see how you are constructing the PBT scheduler as well as how you are calling tune.run.

Hi, I solved this problem by using socket to get a free port. And actually, I had not implemented saving and loading checkpoints correctly (I think that is why PBT always restarted the trial from scratch instead of from a checkpoint?).
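
For anyone hitting the same error, here is a minimal sketch of the socket approach. It is not my exact code: the helper name find_free_port and the default host are assumptions for illustration. It asks the operating system for an unused port by binding to port 0, so the port no longer has to come from the search space.

```python
import socket


def find_free_port(host: str = "127.0.0.1") -> int:
    """Return a TCP port that was free at the moment of the call.

    Hypothetical helper, not part of Ray Tune. Binding to port 0 asks
    the OS to pick any unused port; we read it back and release it.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))            # port 0 = let the OS choose
        return s.getsockname()[1]    # the port the OS actually assigned


if __name__ == "__main__":
    # Each trial would call this when it starts (or restarts) its server.
    print(find_free_port())
```

Calling this inside the trial when it starts its server means a restarted trial gets a fresh port instead of reusing one from the config. Note that another process could still grab the port between this call and the server binding to it, so it only makes a collision unlikely rather than impossible.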

Anyway, thanks for your reply!