Learning semantic object segmentation for video post-production
Video post-production pipelines will increasingly benefit from artificial intelligence tools. For instance, the automatic extraction of specific objects can streamline the post-production workflow: boom mic removal could be accelerated, and color chart detection could lead to a more efficient color pipeline. Today, the segmentation of these objects is usually done via rotoscoping and therefore requires considerable manual work. Semantic segmentation has made great progress since the advent of convolutional networks. Existing, publicly available frameworks such as Detectron2 (\url{https://github.com/facebookresearch/detectron2}) and PointRend \cite{kirillov2019pointrend} already perform high-quality detection and segmentation of $80$ generic classes. However, the performance of these frameworks is strongly bound to the quantity and quality of the training data and, unfortunately, fetching relevant video footage and manually extracting the objects of interest (e.g., boom mics and color charts) is out of reach. To alleviate this problem, we propose in this paper a lightweight training strategy: training data is generated synthetically by inserting the desired objects into images of an existing dataset, combined with data augmentation, and a pretrained network is fine-tuned on this new dataset. Despite its simplicity, we show that the resulting system achieves good performance for an automatic video post-production pipeline.
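The synthetic data generation step can be sketched as follows. This is a minimal illustration rather than the exact implementation: we assume object cut-outs stored as RGBA images with transparent backgrounds and background frames taken from an existing dataset, and all paths, parameter ranges, and the helper name \texttt{composite} are hypothetical.

\begin{verbatim}
# Minimal sketch of the synthetic data generation step.
# Assumptions (illustrative, not from the paper): object cut-outs are
# RGBA PNGs with transparent backgrounds; backgrounds come from an
# existing image dataset.
import random
from PIL import Image

def composite(background_path, cutout_path, out_path):
    bg = Image.open(background_path).convert("RGB")
    obj = Image.open(cutout_path).convert("RGBA")

    # Data augmentation: random scale and rotation of the inserted object.
    scale = random.uniform(0.2, 0.8)
    w, h = obj.size
    obj = obj.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    obj = obj.rotate(random.uniform(-30, 30), expand=True)

    # Paste at a random position; the alpha channel acts both as the
    # compositing mask and as the ground-truth segmentation mask.
    x = random.randint(0, max(0, bg.width - obj.width))
    y = random.randint(0, max(0, bg.height - obj.height))
    bg.paste(obj, (x, y), obj)
    bg.save(out_path)
    return (x, y, obj.width, obj.height)  # bounding box for the annotation
\end{verbatim}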
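The fine-tuning step could then look as follows with the Detectron2 API. The dataset names, file paths, class count, and solver settings are illustrative assumptions; for brevity the sketch starts from a plain COCO-pretrained Mask R-CNN configuration from the model zoo rather than the PointRend head, which ships in Detectron2's projects directory.

\begin{verbatim}
# Minimal fine-tuning sketch with Detectron2. Dataset names, paths,
# class count, and solver settings are illustrative assumptions.
import os
from detectron2 import model_zoo
from detectron2.config import get_cfg
from detectron2.data.datasets import register_coco_instances
from detectron2.engine import DefaultTrainer

# Register the synthetic dataset (COCO-style annotations assumed).
register_coco_instances("synth_train", {},
                        "synth/annotations.json", "synth/images")

cfg = get_cfg()
cfg.merge_from_file(model_zoo.get_config_file(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml"))
# Start from COCO-pretrained weights and fine-tune on the synthetic set.
cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
    "COCO-InstanceSegmentation/mask_rcnn_R_50_FPN_3x.yaml")
cfg.DATASETS.TRAIN = ("synth_train",)
cfg.DATASETS.TEST = ()
cfg.MODEL.ROI_HEADS.NUM_CLASSES = 2   # e.g. boom mic, color chart
cfg.SOLVER.IMS_PER_BATCH = 2          # single-GPU setting
cfg.SOLVER.BASE_LR = 2.5e-4           # small learning rate for fine-tuning
cfg.SOLVER.MAX_ITER = 3000

os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)
trainer = DefaultTrainer(cfg)
trainer.resume_or_load(resume=False)
trainer.train()
\end{verbatim}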