Hi Team,
I am currently migrating our legacy Detectron2-based object detection training pipeline into a new unified training pipeline built on Ray Tune and Ray Train. In Phase 1, we want to integrate Detectron2 with Ray Tune, so that we have a unified interface that uses Ray to train with multiple frameworks. In Phase 2, we plan to use Ray Tune to auto-scale and tune the parameters in Detectron2's global shared config object. We are still in Phase 1.
We came across related GitHub issues from both the Ray project and Detectron2 (referenced at the end of this post) which indicate that Detectron2's design choice of a single shared global config object encapsulating all training properties may be incompatible with Ray's distributed execution model. For example:
# import some common detectron2 utilities
import os

from detectron2.config import get_cfg
from detectron2.engine import DefaultTrainer
from detectron2 import model_zoo
from detectron2.utils.visualizer import Visualizer
from detectron2.data import MetadataCatalog, DatasetCatalog
cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
cfg.DATASETS.TRAIN = ("balloon_train",)
cfg.DATASETS.TEST = ()
cfg.DATALOADER.NUM_WORKERS = 2
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml") # Let training initialize from model zoo
cfg.SOLVER.IMS_PER_BATCH = 2 # This is the real "batch size" commonly known to deep learning people
cfg.SOLVER.BASE_LR = 0.00025 # pick a good LR
cfg.SOLVER.MAX_ITER = 300 # 300 iterations seems good enough for this toy dataset; you will need to train longer for a practical dataset
cfg.SOLVER.STEPS = [] # do not decay learning rate
cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128 # The "RoIHead batch size". 128 is faster, and good enough for this toy dataset (default: 512)
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1 # only has one class (balloon). (see https://detectron2.readthedocs.io/tutorials/datasets.html#update-the-config-for-new-datasets)
# NOTE: this config means the number of classes, but a few popular unofficial tutorials incorrectly use num_classes+1 here.
os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
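One idea we are considering to work around the global cfg is to rebuild the config from scratch inside each trial, so nothing is shared across Ray worker processes. A minimal sketch, assuming a hypothetical helper build_balloon_cfg that the training function would call per trial (the name and parameters are our own, not an established API):

def build_balloon_cfg(base_lr, ims_per_batch):
    # Hypothetical helper: construct a fresh CfgNode per trial instead of
    # mutating one process-wide global; Ray Tune would supply base_lr and
    # ims_per_batch per trial in Phase 2.
    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
    cfg.DATASETS.TRAIN = ("balloon_train",)
    cfg.DATASETS.TEST = ()
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
    cfg.SOLVER.BASE_LR = base_lr
    cfg.SOLVER.IMS_PER_BATCH = ims_per_batch
    cfg.SOLVER.MAX_ITER = 300
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1
    return cfg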
Things get even more complicated with shared data, since we need to register the dataset globally:
# Code from official Detectron2 Tutorial: https://colab.research.google.com/drive/16jcaJoc6bCFAQ96jDe2HwtXj7BMD_-m5#scrollTo=PIbAM2pv-urF
for d in ["train", "val"]:
    DatasetCatalog.register("balloon_" + d, lambda d=d: get_balloon_dicts("balloon/" + d))
    MetadataCatalog.get("balloon_" + d).set(thing_classes=["balloon"])
balloon_metadata = MetadataCatalog.get("balloon_train")
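Our current understanding (unverified) is that DatasetCatalog and MetadataCatalog are per-process globals, so the registration above would have to be repeated inside every Ray worker process rather than once in the driver. A sketch of a hypothetical helper we would call at the top of the training function:

def register_balloon_datasets():
    # Hypothetical helper: populate this worker process's catalogs; guarded so
    # repeated calls (e.g. across trials in the same worker) do not raise.
    for d in ["train", "val"]:
        name = "balloon_" + d
        if name not in DatasetCatalog.list():
            DatasetCatalog.register(name, lambda d=d: get_balloon_dicts("balloon/" + d))
            MetadataCatalog.get(name).set(thing_classes=["balloon"])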
Any recommendations on how to construct a Ray Train training function with Detectron2? We want to reuse the legacy code in which we build a Detectron2 trainer with DefaultTrainer. For now, in Phase 1, we just want to integrate Detectron2 with Ray Tune so that we can use Ray Checkpoints across the end-to-end, multi-stage training pipeline.
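To make the question concrete, here is the overall shape we are imagining: a function trainable that registers the datasets per worker, builds a fresh cfg, wraps DefaultTrainer, and reports metrics back to Ray via a hook. This is only a sketch under our assumptions: it relies on Ray 2.x's ray.train.report API and the hypothetical helpers above, and we have not yet worked out the Ray Checkpoint part:

import os
from ray import train, tune
from detectron2.engine import DefaultTrainer, HookBase

class RayReportHook(HookBase):
    # Hypothetical hook: after each iteration, forward Detectron2's latest
    # scalar metrics to Ray (assumes EventStorage.latest() maps
    # name -> (value, iteration), as in recent Detectron2 versions).
    def after_step(self):
        train.report({k: v[0] for k, v in self.trainer.storage.latest().items()})

def train_detectron2(params):
    register_balloon_datasets()  # per-worker registration (sketch above)
    cfg = build_balloon_cfg(params["base_lr"], params["ims_per_batch"])
    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
    trainer = DefaultTrainer(cfg)
    trainer.register_hooks([RayReportHook()])
    trainer.resume_or_load(resume=False)
    trainer.train()

tuner = tune.Tuner(
    train_detectron2,
    param_space={"base_lr": tune.loguniform(1e-4, 1e-2), "ims_per_batch": 2},
)
tuner.fit()

Does this pattern look reasonable, or is there a recommended integration we should follow instead?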
GitHub issue references:
Ray project GitHub issue: [core] modifications to global variable has no effect (#15946)
Detectron2 GitHub issue: Dataset is not registered when performing hyperparameter tuning (#3057)
I realize this is more of a brainstorming request; any input from the team is much appreciated.
Thank you,
Heng