Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle
ICCV 2025
SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds
CVWW 2026
BBoxMaskPose v2:
Expanding Mutual Conditioning to 3D
Bounding boxes, instance masks, and poses capture complementary aspects of the human body; enforcing their mutual consistency resolves ambiguities that dominate crowded scenes. The BBox–Mask–Pose framework links detection, pose estimation, and segmentation in an iterative loop, where each prediction is used to refine the others. ProbPose adds calibrated uncertainty, visibility, and presence modeling, stabilizing keypoints under occlusion and cropping. PMPose combines probabilistic modeling with mask conditioning, enabling robust top-down pose estimation in dense interactions. SAM-pose2seg specializes SAM for pose-guided human segmentation, simplifying prompting and improving mask quality in crowds. Together, these components form BBoxMaskPose v2, which delivers clear improvements in separating interacting people and sets a new state of the art on COCO and OCHuman and in downstream 3D pose estimation. It is the first method to exceed 50 AP on OCHuman. This work shows that structured mutual conditioning of small, task-specific models can be more effective than scaling up large, shared-feature human-centered foundation models.
All components are part of the GitHub codebase.
BBox-Mask-Pose
Iterative loop of detection, pose estimation, and instance segmentation, where each prediction is explicitly conditioned on the others. By enforcing consistency between representations, the loop progressively corrects errors, separates interacting people, and recovers missed instances.
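The loop described above can be sketched roughly as follows. This is a minimal illustration, not the released implementation: `detect`, `estimate_pose`, and `segment` are hypothetical callables standing in for the detector, the pose estimator, and the pose-prompted segmenter, and blanking out already segmented pixels before re-detection is an assumption about how missed instances might be recovered.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

import numpy as np

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Instance:
    bbox: BBox
    keypoints: Optional[np.ndarray] = None  # (K, 3): x, y, confidence
    mask: Optional[np.ndarray] = None       # (H, W) boolean mask


def run_loop(
    image: np.ndarray,
    detect: Callable[[np.ndarray], List[BBox]],
    estimate_pose: Callable[[np.ndarray, BBox], np.ndarray],
    segment: Callable[[np.ndarray, np.ndarray], np.ndarray],
    num_iters: int = 3,
) -> List[Instance]:
    """Hypothetical detect -> pose -> segment loop (all callables are stand-ins)."""
    instances: List[Instance] = []
    for _ in range(num_iters):
        # Assumption: pixels already explained by a mask are blanked out,
        # so the detector only sees people that have not been found yet.
        explained = np.zeros(image.shape[:2], dtype=bool)
        for inst in instances:
            explained |= inst.mask
        masked_image = image.copy()
        masked_image[explained] = 0

        new_boxes = detect(masked_image)
        if not new_boxes:
            break  # nothing new was found, so the loop has converged

        for bbox in new_boxes:
            # The pose is conditioned on the box; the mask is conditioned on the pose.
            keypoints = estimate_pose(image, bbox)
            mask = segment(image, keypoints)
            instances.append(Instance(bbox=bbox, keypoints=keypoints, mask=mask))
    return instances
```

In this sketch each pass adds only newly detected people; a fuller version could also re-estimate the poses and masks of existing instances on every pass.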
PMPose
Top-down 2D pose estimator that combines mask conditioning with a probabilistic keypoint representation, establishing state-of-the-art performance among top-down methods, especially in crowded scenes.
SAM-pose2seg
Pose-guided human instance segmentation model that adapts SAM to segment people from 2D pose keypoints. By aligning prompting and decoder with human pose cues, it produces cleaner, more stable masks in crowded scenes.
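As a rough illustration of pose-based prompting, the sketch below turns 2D keypoints into positive point prompts in the generic SAM convention (an N×2 array of coordinates plus a label array where 1 means foreground). The function name, the confidence threshold, and the cap on the number of points are assumptions made for this example; SAM-pose2seg's own prompt selection may differ.

```python
import numpy as np


def pose_to_point_prompts(keypoints, min_conf=0.5, max_points=6):
    """Select confident keypoints and format them as positive point prompts.

    keypoints: array-like of shape (K, 3) with rows (x, y, confidence).
    Returns (coords, labels): coords has shape (N, 2), labels has shape (N,).
    Hypothetical helper; the threshold and point cap are illustrative only.
    """
    keypoints = np.asarray(keypoints, dtype=np.float32).reshape(-1, 3)

    # Keep only keypoints the pose estimator is reasonably sure about.
    visible = keypoints[keypoints[:, 2] >= min_conf]

    # Prefer the most confident keypoints when there are more than max_points.
    order = np.argsort(-visible[:, 2], kind="stable")[:max_points]
    coords = visible[order, :2]

    # Label 1 marks a foreground (positive) point in the SAM prompt convention.
    labels = np.ones(len(coords), dtype=np.int32)
    return coords, labels
```

The returned arrays have the shapes that point-prompted segmenters in the SAM family typically expect; keypoints of neighbouring people could be added with label 0 as negative prompts, which is a design choice this sketch leaves out.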
OCHuman-Pose dataset
New multi-person data for challenging crowded scenarios. An extension of the OCHuman dataset with 2D pose annotations for all visible people, including previously ignored instances. It enables more accurate evaluation of detection and pose estimation than the original OCHuman.
The OCHuman-Pose dataset is hosted on Hugging Face. Download the files from the link below.
Comparison of RTMDet detection and segmentation (left) and BBox-Mask-Pose (right). BMP improves the segmentation masks of a given detector, especially for disconnected body parts such as limbs. BBox-Mask-Pose also detects the correct number of people even in scenes with extreme bounding box overlap.
@InProceedings{BMPv2,
author = {Purkrabek, Miroslav and Kolomiiets, Constantin and Matas, Jiri},
title = {BBoxMaskPose v2: Expanding Mutual Conditioning to 3D},
booktitle = {arXiv preprint arXiv:to be added},
year = {2026}
}
@InProceedings{Purkrabek2025ICCV,
author = {Purkrabek, Miroslav and Matas, Jiri},
title = {Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2025}
}
@InProceedings{Kolomiiets2026CVWW,
author = {Kolomiiets, Constantin and Purkrabek, Miroslav and Matas, Jiri},
title = {SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds},
booktitle = {Computer Vision Winter Workshop (CVWW)},
year = {2026}
}