Seeking recommendation for training Detectron2 with Ray Tune

Hi everyone, I have uploaded my first attempt at training Detectron2 models with Ray Train, and I pushed the demo code to GitHub. It is a naive integration approach that meets the goal of our Phase 1 development: leveraging Ray Tune to train Detectron2 models. The plumbing for running Detectron2's SimpleTrainer inside Ray's TorchTrainer works: I can train Detectron2 models, get Ray checkpoints, and log training progress to TensorBoard. However, I also noticed that I cannot fully leverage Tune's scaling and tuning capabilities with my current naive implementation. For example, I don't think I can scale to more than 1 worker in ScalingConfig.
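In outline, the naive integration looks roughly like this (a simplified sketch, not the exact demo code; the config file path, learning rate, and reporting interval are illustrative, and the APIs assumed are Ray 2.x AIR and Detectron2's standard engine/solver/data helpers):

```python
# Simplified sketch of the naive integration (not the exact demo code).
# Assumes Ray 2.x AIR APIs and Detectron2's standard engine helpers.
from detectron2.config import get_cfg
from detectron2.data import build_detection_train_loader
from detectron2.engine import SimpleTrainer
from detectron2.modeling import build_model
from detectron2.solver import build_optimizer

from ray.air import session
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker(config):
    # Build a Detectron2 config from a base yaml plus Tune-supplied values.
    cfg = get_cfg()
    cfg.merge_from_file(config["config_file"])  # illustrative path below
    cfg.SOLVER.BASE_LR = config["lr"]

    model = build_model(cfg)
    optimizer = build_optimizer(cfg, model)
    data_loader = build_detection_train_loader(cfg)

    # SimpleTrainer drives the inner loop; metrics are reported back to
    # Ray periodically so Tune and TensorBoard can see progress.
    trainer = SimpleTrainer(model, data_loader, optimizer)
    trainer.iter = 0
    for it in range(cfg.SOLVER.MAX_ITER):
        trainer.run_step()
        trainer.iter += 1
        if it % 20 == 0:
            session.report({"iter": it})


trainer = TorchTrainer(
    train_loop_per_worker=train_loop_per_worker,
    train_loop_config={"config_file": "faster_rcnn_R_50_FPN_3x.yaml",
                       "lr": 0.00025},
    # Raising num_workers above 1 triggers the assertion shown below,
    # since Detectron2 expects its own launch() to set up the local
    # process group.
    scaling_config=ScalingConfig(num_workers=1, use_gpu=True),
)
result = trainer.fit()
```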

I am seeing the following error when I try to run with 2 workers:

        _LOCAL_PROCESS_GROUP is not None
    AssertionError: Local process group is not created! Please use launch() to spawn processes!
    2023-01-10 11:23:45,881 ERROR tune.py:758 -- Trials did not complete: [TorchTrainer_08b5d_00000]
    Result(metrics={'trial_id': '08b5d_00000'}, error=RayTaskError(AssertionError)(AssertionError('Local process group is not created! Please use launch() to spawn processes!')), log_dir=PosixPath('/heng/output/RayDetectron2/ray_results/Detector_Training_Demo/TorchTrainer_08b5d_00000_0_2023-01-10_11-22-53'))

With the demo code shared above, I hope to get some pointers from the Ray team on how to move forward with incorporating Ray Tune to auto-scale and tune parameters for training Detectron2 models in Ray.