Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle

Visual Recognition Group
Czech Technical University in Prague

TL;DR

The BBox-Mask-Pose (BMP) method integrates detection, pose estimation, and segmentation into a self-improving loop by conditioning these tasks on each other. This approach enhances all three tasks simultaneously. Using segmentation masks instead of bounding boxes improves performance in crowded scenarios, making top-down methods competitive with bottom-up approaches.

Abstract

Human pose estimation methods work well on isolated people but struggle with multi-body scenarios. Recent work has addressed this problem by conditioning pose estimation on detected bounding boxes or bottom-up-estimated poses. Unfortunately, all of these approaches overlook segmentation masks and their connection to estimated keypoints. We condition the pose estimation model on segmentation masks instead of bounding boxes to improve instance separation. This improves top-down pose estimation in multi-body scenarios but does not fix detection errors. Consequently, we develop BBox-Mask-Pose (BMP), which integrates detection, segmentation and pose estimation into a self-improving feedback loop. We adapt the detector and the pose estimation model for conditioning on instance masks and use Segment Anything as the pose-to-mask model to close the circle. With only small models, BMP is superior to top-down methods on the OCHuman dataset and to detector-free methods on the COCO dataset, combining the best of both approaches and matching state-of-the-art performance in both settings. Code and data will be available for research purposes.
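
The feedback loop described in the abstract can be illustrated with a minimal sketch. This is not the released BMP code: the function `bmp_loop`, the three callables (`detect`, `estimate_pose`, `pose_to_mask`), the dictionary layout, and the stopping rule are all hypothetical names and assumptions chosen for illustration. The sketch only shows the control flow: a mask-conditioned detector proposes people, a mask-conditioned pose model estimates keypoints, and a pose-to-mask model (Segment Anything in the paper) refines the mask that is fed back to the detector.

```python
from typing import Any, Callable, Dict, List

Instance = Dict[str, Any]  # hypothetical record with "bbox", "mask", "keypoints"


def bmp_loop(
    image: Any,
    detect: Callable[[Any, List[Any]], List[Instance]],
    estimate_pose: Callable[[Any, Any], Any],
    pose_to_mask: Callable[[Any, Any, Any], Any],
    iterations: int = 2,
) -> List[Instance]:
    """Illustrative detection -> pose -> mask loop (assumed control flow)."""
    instances: List[Instance] = []
    for _ in range(iterations):
        # Detector is conditioned on the masks found so far, so it can
        # concentrate on people that earlier passes missed (assumption).
        found_masks = [inst["mask"] for inst in instances]
        new_detections = detect(image, found_masks)
        if not new_detections:
            break  # nothing new found, stop early
        for det in new_detections:
            # Pose model is conditioned on the instance mask, not the box.
            keypoints = estimate_pose(image, det["mask"])
            # Pose-to-mask model refines the mask using the keypoints.
            refined_mask = pose_to_mask(image, keypoints, det["mask"])
            instances.append(
                {"bbox": det["bbox"], "mask": refined_mask, "keypoints": keypoints}
            )
    return instances
```

Passing the three models as callables keeps the sketch independent of any particular detector, pose estimator, or segmentation library; real models would be wrapped to match these hypothetical signatures.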

Contributions

  • MaskPose: a pose estimation model conditioned on segmentation masks instead of bounding boxes, boosting performance in dense scenes without adding parameters
  • BBox-Mask-Pose: a method linking bounding boxes, segmentation masks, and poses to simultaneously address multi-body detection, segmentation and pose estimation

Results

Comparison of RTMDet detection and segmentation (left) and BBox-Mask-Pose (right). BMP improves the segmentation masks of a given detector, especially for disconnected body parts such as limbs. BBox-Mask-Pose also detects the correct number of people even in scenes with extreme bounding box overlap.
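
To see why masks can separate instances that bounding boxes cannot, consider a toy example, assuming masks are stored as sets of (x, y) pixel coordinates and boxes as (x1, y1, x2, y2) tuples. The helper names (`box_iou`, `mask_iou`, `bbox_of`) and the two synthetic "people" are hypothetical and not taken from the paper. Two thin shapes crossing diagonally share exactly the same bounding box, yet their masks do not overlap at all.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def mask_iou(a, b):
    """Intersection over union of two masks given as sets of (x, y) pixels."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def bbox_of(mask):
    """Tight bounding box around a pixel-set mask (x2, y2 are exclusive)."""
    xs = [x for x, _ in mask]
    ys = [y for _, y in mask]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)


# Two synthetic instances on a 10x10 grid: one along the main diagonal,
# one along the anti-diagonal. They cross but never share a pixel.
person_a = {(i, i) for i in range(10)}
person_b = {(i, 9 - i) for i in range(10)}

print(box_iou(bbox_of(person_a), bbox_of(person_b)))  # 1.0
print(mask_iou(person_a, person_b))                   # 0.0
```

In this toy case a box-conditioned model receives identical input for both instances, while a mask-conditioned model receives two disjoint regions.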

BibTeX

@misc{purkrabek2024BBoxMaskPose,
        title={Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle}, 
        author={Miroslav Purkrabek and Jiri Matas},
        year={2024},
        eprint={2412.01562},
        archivePrefix={arXiv},
        primaryClass={cs.CV},
        url={https://arxiv.org/abs/2412.01562}, 
  }