All Ray Tune trials pending when increasing training set size

I’m running a Ray Tune parameter search for a PyTorch Lightning model locally on an Ubuntu machine with 2 GPUs and 16 CPUs.
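
For context, the search is launched roughly like this (heavily simplified sketch; the actual Lightning model, data loading, and metric names are omitted or replaced with placeholders):

    from ray import tune
    from ray.tune.schedulers import ASHAScheduler

    def run(config):
        # Placeholder: builds the PyTorch Lightning model and calls trainer.fit(),
        # reporting a validation metric back to Tune after each epoch.
        ...

    analysis = tune.run(
        run,
        config={
            "LEARNING_RATE": tune.loguniform(1e-5, 1e-1),
            "BATCH_SIZE": tune.choice([32]),
        },
        metric="val_loss",   # placeholder metric name
        mode="min",
        num_samples=1000,
        scheduler=ASHAScheduler(max_t=4, grace_period=1, reduction_factor=2),
        resources_per_trial={"cpu": 1, "gpu": 1},  # each trial asks for 1 CPU + 1 GPU
        local_dir="/home/femkegb/Documents/snowscooter_detector/ray_logs",
    )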

When I run the search with my debug data (just a small subset of the full dataset, for quicker loading), everything works fine.

When I then switch to the full dataset, all Ray Tune trials go to PENDING and stay there:

(grid_search pid=18285) == Status ==
(grid_search pid=18285) Current time: 2022-10-27 10:26:48 (running for 00:04:27.50)
(grid_search pid=18285) Memory usage on this node: 37.7/62.8 GiB
(grid_search pid=18285) Using AsyncHyperBand: num_stopped=0
(grid_search pid=18285) Bracket: Iter 4.000: None | Iter 2.000: None | Iter 1.000: None
(grid_search pid=18285) Resources requested: 0/16 CPUs, 0/2 GPUs, 0.0/20.91 GiB heap, 0.0/10.45 GiB objects (0.0/1.0 accelerator_type:GTX)
(grid_search pid=18285) Result logdir: /home/femkegb/Documents/snowscooter_detector/ray_logs
(grid_search pid=18285) Number of trials: 17/1000 (17 PENDING)
(grid_search pid=18285) +-----------------+----------+-------+-----------------+--------------+
(grid_search pid=18285) | Trial name      | status   | loc   |   LEARNING_RATE |   BATCH_SIZE |
(grid_search pid=18285) |-----------------+----------+-------+-----------------+--------------|
(grid_search pid=18285) | run_79ff2_00000 | PENDING  |       |     0.00013431  |           32 |
(grid_search pid=18285) | run_79ff2_00001 | PENDING  |       |     0.000179495 |           32 |
(grid_search pid=18285) | run_79ff2_00002 | PENDING  |       |     0.0015748   |           32 |
(grid_search pid=18285) | run_79ff2_00003 | PENDING  |       |     0.0998674   |           32 |
(grid_search pid=18285) | run_79ff2_00004 | PENDING  |       |     0.0184158   |           32 |
(grid_search pid=18285) | run_79ff2_00005 | PENDING  |       |     0.000116037 |           32 |
(grid_search pid=18285) | run_79ff2_00006 | PENDING  |       |     0.0103559   |           32 |
(grid_search pid=18285) | run_79ff2_00007 | PENDING  |       |     0.0115985   |           32 |
(grid_search pid=18285) | run_79ff2_00008 | PENDING  |       |     2.93363e-05 |           32 |
(grid_search pid=18285) | run_79ff2_00009 | PENDING  |       |     0.000400672 |           32 |
(grid_search pid=18285) | run_79ff2_00010 | PENDING  |       |     0.000981144 |           32 |
(grid_search pid=18285) | run_79ff2_00011 | PENDING  |       |     0.0182533   |           32 |
(grid_search pid=18285) | run_79ff2_00012 | PENDING  |       |     0.0544841   |           32 |
(grid_search pid=18285) | run_79ff2_00013 | PENDING  |       |     0.0019499   |           32 |
(grid_search pid=18285) | run_79ff2_00014 | PENDING  |       |     0.000247789 |           32 |
(grid_search pid=18285) | run_79ff2_00015 | PENDING  |       |     0.0671835   |           32 |
(grid_search pid=18285) | run_79ff2_00016 | PENDING  |       |     0.00531574  |           32 |
(grid_search pid=18285) +-----------------+----------+-------+-----------------+--------------+

Every few status updates I get the following warning:

    (scheduler +1h3m21s) Warning: The following resource request cannot be scheduled right now: {'CPU': 1.0, 'GPU': 1.0}. This is likely due to all cluster resources being claimed by actors. Consider creating fewer actors or adding more nodes to this Ray cluster.
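
If it helps to narrow this down, Ray reports both the node’s total resources and what is currently unclaimed, so something like the following (run while the search is stuck) should show what is actually holding the CPUs/GPUs:

    import ray

    ray.init(address="auto")  # attach to the Ray instance the search is running on

    total = ray.cluster_resources()    # everything registered on the node
    free = ray.available_resources()   # whatever is not claimed by tasks/actors

    print({k: (free.get(k, 0.0), total.get(k, 0.0)) for k in ("CPU", "GPU")})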

Directly after preprocessing the training set, at the start of the search, I also get warnings about object spilling:

    Spilled 3938 MiB, 1 objects, write throughput 619 MiB/s. Set RAY_verbose_spill_logs=0 to disable this message.

There are no errors, and the PENDING status doesn’t change even if the script is left running through an entire night.

I’m guessing the object spilling is important, especially since the issue only occurs with the larger dataset. However, training the model without Ray Tune works fine and doesn’t run into any memory issues.
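
One thing I could try, if the object store simply runs out of room for the full dataset, is to give it more memory when starting Ray (the value below is just illustrative; per ray.cluster_resources() further down, the object store currently gets ~18.7 GB):

    import ray

    # Illustrative only: start Ray with a larger object store (in bytes) so the
    # preprocessed dataset is less likely to be spilled to disk.
    ray.init(object_store_memory=30 * 1024**3)

But I’m not sure whether that would address the cause or just the symptom.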

So where should I start looking for issues?


Extra info:

    torch.cuda.is_available()
    True
    ray.cluster_resources()
    {'object_store_memory': 18746084966.0, 'CPU': 16.0, 'memory': 37492169934.0, 'accelerator_type:GTX': 1.0, 'node:10.218.212.16': 1.0, 'GPU': 2.0}

Hey @5ke, thanks for posting to the forum.

I’m not sure what’s going on here. It could be related to the object spilling, as you guessed.

Could you share a reproducible example with me so I can help you debug this?

Hi @5ke, can you point us to the locations in the code where the preprocessing and training are kicked off? It’s hard to navigate the code base without any context.

Are you using Ray Data for preprocessing?

A possible explanation is that the preprocessing tasks take very long or never finish (perhaps because object spilling slows them down heavily) and keep holding the resources, so Ray Tune can’t schedule any trials. The thing to fix would then be to make the preprocessing feasible, since the Ray Tune trials presumably depend on the preprocessed data being available anyway.
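
If the full dataset is currently loaded or preprocessed inside every trial, one pattern that often helps with both the spilling and the resource contention is to preprocess once on the driver and hand the result to the trainable via tune.with_parameters, which puts the large object into the object store once and passes a reference to each trial. A rough sketch (function names below are placeholders, not from your code):

    from ray import tune

    def preprocess_full_dataset():
        # Placeholder: your preprocessing, executed once on the driver.
        ...

    def run(config, data=None):
        # `data` is the already-preprocessed dataset, delivered via the object
        # store; build the DataLoaders from it here instead of reloading files.
        ...

    preprocessed = preprocess_full_dataset()

    tune.run(
        tune.with_parameters(run, data=preprocessed),
        config={"LEARNING_RATE": tune.loguniform(1e-5, 1e-1)},
        resources_per_trial={"cpu": 1, "gpu": 1},
    )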

Let’s have a look at the full pipeline and the preprocessing step.