I am trying to train 1000 different networks, each with a different configuration that I have labelled with an integer.
At the current stage, I simply want to perform an initial training (with the same number of epochs) on all of them.
To do so, I run an analysis with the following parameters:
analysis = ray.tune.run(
    TrainerWrapper,
    name="initial_training",
    stop=stopper,
    config={
        "configuration": ray.tune.grid_search(models_to_be_trained),
        "lr": LR,
        "gradient_clip": GRADIENT_CLIP,
    },
    num_samples=1,
    local_dir=RAY_CHECKPOINTS_DIR,
    reuse_actors=True,
    resume=True,
    checkpoint_at_end=True,
)
where LR and GRADIENT_CLIP are constants, models_to_be_trained is a list of integers, and stopper is a simple function that stops a trial once it has run enough epochs or once it starts to overfit:
def stopper(_, result):
    training_iteration = result["training_iteration"]
    test_linfty_error = result["test_average_linfty_error"]
    train_linfty_error = result["train_average_linfty_error"]
    if training_iteration > MAX_EPOCHS // EPOCHS_PER_STEP:
        return True
    if test_linfty_error > 1.5 * train_linfty_error:
        return True
    return False
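For reference, the constants used above are defined along these lines. The values below are placeholders for illustration (my real numbers differ), except that MAX_EPOCHS // EPOCHS_PER_STEP is indeed 100:

# Placeholder definitions for illustration only; the actual values differ,
# but MAX_EPOCHS // EPOCHS_PER_STEP == 100 in my setup.
LR = 1e-3                            # learning rate passed to every trial
GRADIENT_CLIP = 1.0                  # gradient-clipping threshold
EPOCHS_PER_STEP = 20                 # epochs run per ray.tune training iteration
MAX_EPOCHS = 100 * EPOCHS_PER_STEP   # cap on the total epochs per trial
RAY_CHECKPOINTS_DIR = "/home/spiani/ray_results"  # local_dir for the experiment
models_to_be_trained = list(range(1000))  # placeholder; really a fixed list of 1000 integer labels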
I would expect this code to train each of my configurations exactly once (because I set num_samples=1 in the analysis) and to run at most 100 iterations per trial (because that is the value of MAX_EPOCHS // EPOCHS_PER_STEP). But what I actually see is that Ray restarts my trials when they reach around 20 iterations. Indeed, this is what I see at some point in the logs:
(run pid=82760) == Status ==
(run pid=82760) Current time: 2022-06-19 18:25:27 (running for 01:12:07.44)
(run pid=82760) Memory usage on this node: 234.1/377.6 GiB
(run pid=82760) Using FIFO scheduling algorithm.
(run pid=82760) Resources requested: 256.0/512 CPUs, 15.360000000000039/16 GPUs, 0.0/1026.51 GiB heap, 0.0/443.92 GiB objects (0.0/4.0 accelerator_type:V100)
(run pid=82760) Result logdir: /home/spiani/ray_results/resnet_train_small_models
(run pid=82760) Number of trials: 336/1000 (80 PENDING, 256 RUNNING)
(run pid=82760) +----------------------------+----------+-----------------+-----------------+--------+------------------+-----------------------------+
(run pid=82760) | Trial name | status | loc | configuration | iter | total time (s) | test_average_linfty_error |
(run pid=82760) |----------------------------+----------+-----------------+-----------------+--------+------------------+-----------------------------|
(run pid=82760) | TrainerWrapper_ba6d9_00000 | RUNNING | 10.1.14.1:37692 | 386 | 18 | 82982.5 | 0.407178 |
(run pid=82760) | TrainerWrapper_ba6d9_00001 | RUNNING | 10.1.14.1:37875 | 389 | 18 | 83199.6 | 0.417423 |
(run pid=82760) | TrainerWrapper_ba6d9_00002 | RUNNING | 10.1.14.1:37958 | 392 | 18 | 83383.8 | 0.435176 |
(run pid=82760) | TrainerWrapper_ba6d9_00003 | RUNNING | 10.1.14.1:37960 | 395 | 18 | 83757.9 | 0.494883 |
(run pid=82760) | TrainerWrapper_ba6d9_00004 | RUNNING | 10.1.14.1:37962 | 242 | 19 | 85124.9 | 0.401442 |
(run pid=82760) | TrainerWrapper_ba6d9_00005 | RUNNING | 10.1.14.1:37964 | 245 | 19 | 85538.3 | 0.371104 |
(run pid=82760) | TrainerWrapper_ba6d9_00006 | RUNNING | 10.1.14.1:37966 | 248 | 19 | 85764.5 | 0.373599 |
(run pid=82760) | TrainerWrapper_ba6d9_00007 | RUNNING | 10.1.14.1:37968 | 251 | 19 | 85316.9 | 0.396243 |
(run pid=82760) | TrainerWrapper_ba6d9_00008 | RUNNING | 10.1.14.1:37973 | 385 | 18 | 82935.5 | 0.407906 |
(run pid=82760) | TrainerWrapper_ba6d9_00009 | RUNNING | 10.1.14.1:37982 | 388 | 18 | 83003.1 | 0.464052 |
(run pid=82760) | TrainerWrapper_ba6d9_00256 | PENDING | | 2118 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00257 | PENDING | | 2982 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00258 | PENDING | | 2121 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00259 | PENDING | | 2985 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00260 | PENDING | | 3002 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00261 | PENDING | | 3005 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00262 | PENDING | | 1718 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00263 | PENDING | | 3008 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00264 | PENDING | | 1721 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00265 | PENDING | | 3011 | | | |
(run pid=82760) +----------------------------+----------+-----------------+-----------------+--------+------------------+-----------------------------+
(run pid=82760) ... 316 more trials not shown (246 RUNNING, 70 PENDING)
(run pid=82760)
(run pid=82760)
(run pid=82760) Result for TrainerWrapper_ba6d9_00006:
(run pid=82760) date: 2022-06-19_18-25-27
(run pid=82760) done: false
(run pid=82760) experiment_id: fdbd45174bd443d6954dc3727c87ae3e
(run pid=82760) hostname: pn14-04
(run pid=82760) iterations_since_restore: 1
(run pid=82760) node_ip: 10.1.14.4
(run pid=82760) pid: 103245
(run pid=82760) test_average_l2_error: 7.358240948486328
(run pid=82760) test_average_l2_relative_error: 0.20228974905014038
(run pid=82760) test_average_linfty_error: 0.5165811861673991
(run pid=82760) test_average_linfty_relative_error: 0.24805314655303956
(run pid=82760) test_max_l2_error: 56.37067794799805
(run pid=82760) test_max_l2_relative_error: 1.1429545879364014
(run pid=82760) test_max_linfty_error: 4.586096286773682
(run pid=82760) test_max_linfty_relative_error: 1.3744010925292969
(run pid=82760) time_since_restore: 4241.323424339294
(run pid=82760) time_this_iter_s: 4241.323424339294
(run pid=82760) time_total_s: 4241.323424339294
(run pid=82760) timestamp: 1655655927
(run pid=82760) timesteps_since_restore: 0
(run pid=82760) train_average_l2_error: 7.390049274359809
(run pid=82760) train_average_l2_relative_error: 0.20256807630327012
(run pid=82760) train_average_linfty_error: 0.5167996492173936
(run pid=82760) train_average_linfty_relative_error: 0.24799219224717883
(run pid=82760) train_max_l2_error: 54.83803939819336
(run pid=82760) train_max_l2_relative_error: 1.1142292022705078
(run pid=82760) train_max_linfty_error: 4.486876487731934
(run pid=82760) train_max_linfty_relative_error: 1.2893394231796265
(run pid=82760) training_iteration: 1
(run pid=82760) trial_id: ba6d9_00006
(run pid=82760)
(run pid=82760) == Status ==
(run pid=82760) Current time: 2022-06-19 18:25:32 (running for 01:12:12.71)
(run pid=82760) Memory usage on this node: 234.2/377.6 GiB
(run pid=82760) Using FIFO scheduling algorithm.
(run pid=82760) Resources requested: 256.0/512 CPUs, 15.360000000000039/16 GPUs, 0.0/1026.51 GiB heap, 0.0/443.92 GiB objects (0.0/4.0 accelerator_type:V100)
(run pid=82760) Result logdir: /home/spiani/ray_results/resnet_train_small_models
(run pid=82760) Number of trials: 336/1000 (80 PENDING, 256 RUNNING)
(run pid=82760) +----------------------------+----------+------------------+-----------------+--------+------------------+-----------------------------+
(run pid=82760) | Trial name | status | loc | configuration | iter | total time (s) | test_average_linfty_error |
(run pid=82760) |----------------------------+----------+------------------+-----------------+--------+------------------+-----------------------------|
(run pid=82760) | TrainerWrapper_ba6d9_00000 | RUNNING | 10.1.14.1:37692 | 386 | 18 | 82982.5 | 0.407178 |
(run pid=82760) | TrainerWrapper_ba6d9_00001 | RUNNING | 10.1.14.1:37875 | 389 | 18 | 83199.6 | 0.417423 |
(run pid=82760) | TrainerWrapper_ba6d9_00002 | RUNNING | 10.1.14.1:37958 | 392 | 18 | 83383.8 | 0.435176 |
(run pid=82760) | TrainerWrapper_ba6d9_00003 | RUNNING | 10.1.14.1:37960 | 395 | 18 | 83757.9 | 0.494883 |
(run pid=82760) | TrainerWrapper_ba6d9_00004 | RUNNING | 10.1.14.1:37962 | 242 | 19 | 85124.9 | 0.401442 |
(run pid=82760) | TrainerWrapper_ba6d9_00005 | RUNNING | 10.1.14.1:37964 | 245 | 19 | 85538.3 | 0.371104 |
(run pid=82760) | TrainerWrapper_ba6d9_00006 | RUNNING | 10.1.14.4:103245 | 248 | 1 | 4241.32 | 0.516581 |
(run pid=82760) | TrainerWrapper_ba6d9_00007 | RUNNING | 10.1.14.1:37968 | 251 | 19 | 85316.9 | 0.396243 |
(run pid=82760) | TrainerWrapper_ba6d9_00008 | RUNNING | 10.1.14.1:37973 | 385 | 18 | 82935.5 | 0.407906 |
(run pid=82760) | TrainerWrapper_ba6d9_00009 | RUNNING | 10.1.14.1:37982 | 388 | 18 | 83003.1 | 0.464052 |
(run pid=82760) | TrainerWrapper_ba6d9_00256 | PENDING | | 2118 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00257 | PENDING | | 2982 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00258 | PENDING | | 2121 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00259 | PENDING | | 2985 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00260 | PENDING | | 3002 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00261 | PENDING | | 3005 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00262 | PENDING | | 1718 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00263 | PENDING | | 3008 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00264 | PENDING | | 1721 | | | |
(run pid=82760) | TrainerWrapper_ba6d9_00265 | PENDING | | 3011 | | | |
(run pid=82760) +----------------------------+----------+------------------+-----------------+--------+------------------+-----------------------------+
(run pid=82760) ... 316 more trials not shown (246 RUNNING, 70 PENDING)
(run pid=82760)
If you look at trial number 6 (configuration 248 in my code), you can see that at some point its iteration count drops from 19 back to 1 and the trial is moved to another machine (from 10.1.14.1 to 10.1.14.4). Why does this happen? How can I prevent it? I am running Ray 1.11, but I have tried several other versions and I always see the same behaviour.
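For completeness, TrainerWrapper is a subclass of tune.Trainable, structured roughly like the sketch below. This is a heavily simplified version, not my exact code: build_network and run_epochs stand in for my real model construction and training loop, and the checkpointing shown here assumes PyTorch.

import os

import torch
from ray import tune


class TrainerWrapper(tune.Trainable):
    """Heavily simplified sketch; the real class also computes the
    *_l2_* and *_relative_* metrics that appear in the logs above."""

    def setup(self, config):
        # build_network is a hypothetical stand-in for my model construction.
        self.model = build_network(
            config["configuration"],
            lr=config["lr"],
            gradient_clip=config["gradient_clip"],
        )

    def step(self):
        # One tune iteration = EPOCHS_PER_STEP training epochs; the returned
        # dict contains the metrics that stopper() reads.
        train_err, test_err = run_epochs(self.model, EPOCHS_PER_STEP)  # hypothetical
        return {
            "train_average_linfty_error": train_err,
            "test_average_linfty_error": test_err,
        }

    def save_checkpoint(self, tmp_checkpoint_dir):
        # Invoked because of checkpoint_at_end=True.
        torch.save(self.model.state_dict(),
                   os.path.join(tmp_checkpoint_dir, "model.pt"))
        return tmp_checkpoint_dir

    def load_checkpoint(self, checkpoint):
        self.model.load_state_dict(
            torch.load(os.path.join(checkpoint, "model.pt")))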
Thank you!
How severely does this issue affect your experience of using Ray?
- Medium: It contributes to significant difficulty to complete my task, but I can work around it.