Seeking recommendation for training Detectron2 with Ray Tune

Hi everyone, I have uploaded my first attempt at training Detectron2 models with Ray Train, and I pushed the demo code to GitHub. It is a naive integration approach that meets the goal of our Phase 1 development: leveraging Ray Tune to train Detectron2 models. The plumbing for running Detectron2's SimpleTrainer inside Ray's TorchTrainer works: I can train Detectron2 models, get Ray checkpoints, and log training progress to TensorBoard. However, I also noticed that I cannot fully leverage Tune's scaling and tuning capabilities with my current naive implementation. For example, I don't think I can scale to more than 1 worker in ScalingConfig.
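In outline, the naive integration looks roughly like this (a simplified sketch, not the exact demo code; the config file path, learning rate, and reporting interval are illustrative, and the APIs assumed are Ray 2.x AIR and Detectron2's standard engine/solver/data helpers):

```python
# Simplified sketch of the naive integration (not the exact demo code).
# Assumes Ray 2.x AIR APIs and Detectron2's standard engine helpers.
from detectron2.config import get_cfg
from detectron2.data import build_detection_train_loader
from detectron2.engine import SimpleTrainer
from detectron2.modeling import build_model
from detectron2.solver import build_optimizer

from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Build a Detectron2 config from a base yaml plus Tune-supplied values.
    cfg = get_cfg()
    cfg.merge_from_file(config["config_file"])  # illustrative path below
    cfg.SOLVER.BASE_LR = config["lr"]

    model = build_model(cfg)
    optimizer = build_optimizer(cfg, model)
    data_loader = build_detection_train_loader(cfg)

    # SimpleTrainer drives the inner loop; metrics are reported back to
    # Ray periodically so Tune and TensorBoard can see progress.
    trainer = SimpleTrainer(model, data_loader, optimizer)
    trainer.iter = 0
    for it in range(cfg.SOLVER.MAX_ITER):
        trainer.run_step()
        trainer.iter += 1
        if it % 20 == 0:
            session.report({"iter": it})


trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"config_file": "faster_rcnn_R_50_FPN_3x.yaml",
                       "lr": 0.00025},
    # Raising num_workers above 1 triggers the assertion shown below,
    # since Detectron2 expects its own launch() to set up the local
    # process group.
    scaling_config=ScalingConfig(num_workers=1, use_gpu=True),
)
result = trainer.fit()
```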

I am seeing the following error when I try to run with 2 workers:

        _LOCAL_PROCESS_GROUP is not None
    AssertionError: Local process group is not created! Please use launch() to spawn processes!
    2023-01-10 11:23:45,881 ERROR tune.py:758 -- Trials did not complete: [TorchTrainer_08b5d_00000]
    Result(metrics={'trial_id': '08b5d_00000'}, error=RayTaskError(AssertionError)(AssertionError('Local process group is not created! Please use launch() to spawn processes!')), log_dir=PosixPath('/heng/output/RayDetectron2/ray_results/Detector_Training_Demo/TorchTrainer_08b5d_00000_0_2023-01-10_11-22-53'))

With the demo code shared above, I hope to get some pointers from the Ray team on how to move forward with incorporating Ray Tune to auto-scale and tune parameters for training Detectron2 models in Ray.