BBox-Mask-Pose Project

ICCV 2025 CVPR 2025
Miroslav Purkrabek, Constantin Kolomiiets, Jiri Matas
Visual Recognition Group
Czech Technical University in Prague
BBox-Mask-Pose loop GIF

Papers

Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle

ICCV 2025

Visual Recognition Group
Czech Technical University in Prague

SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds

CVWW 2026

Constantin Kolomiiets, Miroslav Purkrabek, Jiri Matas
Visual Recognition Group
Czech Technical University in Prague

BBoxMaskPose v2: Expanding Mutual Conditioning to 3D


Miroslav Purkrabek, Constantin Kolomiiets, Jiri Matas
Visual Recognition Group
Czech Technical University in Prague

Project Overview

Bounding boxes, instance masks, and poses capture complementary aspects of the human body; enforcing their mutual consistency resolves ambiguities that dominate crowded scenes. The BBox-Mask-Pose framework links detection, pose estimation, and segmentation in an iterative loop, where each prediction is used to refine the others. ProbPose adds calibrated uncertainty, visibility, and presence modeling, stabilizing keypoints under occlusion and cropping. PMPose combines probabilistic modeling with mask conditioning, enabling robust top-down pose estimation in dense interactions. SAM-pose2seg specializes SAM for pose-guided human segmentation, simplifying prompting and improving mask quality in crowds. Together, these components form BBoxMaskPose v2, which delivers clear improvements in separating interacting people and sets a new state of the art on COCO, OCHuman, and downstream 3D pose estimation. It is the first method to exceed 50 AP on OCHuman. This work shows that structured mutual conditioning of small, task-specific models can be more effective than scaling up large, shared-feature human-centered foundation models.

Contributions

All components are part of the GitHub codebase.

BBox-Mask-Pose

Iterative loop of detection, pose estimation, and instance segmentation, where each prediction is explicitly conditioned on the others. By enforcing consistency between representations, the loop progressively corrects errors, separates interacting people, and recovers missed instances.
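The control flow of such a loop can be sketched in a few lines. This is a minimal illustration, not the project's implementation: `detect`, `estimate_pose`, and `segment` are hypothetical callables standing in for the three models, masks are toy sets of pixels, and passing already-covered pixels to the detector is an assumed way of conditioning detection on existing masks.

```python
from typing import Any, Callable, Dict, List, Optional, Set, Tuple

Pixel = Tuple[int, int]             # (x, y)
BBox = Tuple[int, int, int, int]    # (x_min, y_min, x_max, y_max)
Mask = Set[Pixel]                   # toy mask: set of foreground pixels
Pose = List[Tuple[float, float]]    # toy pose: list of (x, y) keypoints


def bbox_from_mask(mask: Mask) -> BBox:
    """Return the tightest box around a non-empty mask."""
    xs = [x for x, _ in mask]
    ys = [y for _, y in mask]
    return (min(xs), min(ys), max(xs), max(ys))


def bbox_mask_pose_loop(
    image: Any,
    detect: Callable[[Any, Mask], List[BBox]],
    estimate_pose: Callable[[Any, BBox, Optional[Mask]], Pose],
    segment: Callable[[Any, Pose, BBox], Mask],
    num_iters: int = 2,
) -> List[Dict[str, Any]]:
    """Hypothetical detection -> pose -> mask loop with mutual conditioning."""
    instances: List[Dict[str, Any]] = []
    for _ in range(num_iters):
        # Pixels already explained by known people (assumed conditioning signal).
        covered: Mask = set()
        for inst in instances:
            covered |= inst["mask"]

        # Detection conditioned on masks: look only for people not yet covered.
        new_boxes = detect(image, covered)
        if not new_boxes:
            break  # nothing new was found, so the loop has converged
        instances += [{"bbox": b, "pose": None, "mask": None} for b in new_boxes]

        for inst in instances:
            # Pose conditioned on the box and, once available, the mask.
            inst["pose"] = estimate_pose(image, inst["bbox"], inst["mask"])
            # Mask conditioned on the pose.
            inst["mask"] = segment(image, inst["pose"], inst["bbox"])
            # Box refined from the mask, closing the circle.
            if inst["mask"]:
                inst["bbox"] = bbox_from_mask(inst["mask"])
    return instances
```

With real models plugged in, each pass gives the detector a chance to find people that were hidden behind instances found earlier, which is one way the loop can recover missed instances.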

PMPose

Top-down 2D pose estimator that combines mask conditioning with a probabilistic keypoint representation, establishing state-of-the-art performance among top-down methods, especially in crowded scenes.

SAM-pose2seg

Pose-guided human instance segmentation model that adapts SAM to segment people from 2D pose keypoints. By aligning prompting and decoder with human pose cues, it produces cleaner, more stable masks in crowded scenes.
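As a rough illustration of pose-guided prompting, the sketch below turns 2D keypoints into point prompts in the coordinate and label layout that SAM-style predictors accept (label 1 marks a foreground point). The helper `pose_to_point_prompts`, the confidence threshold, and the cap on the number of points are assumptions made for this example; they do not describe how SAM-pose2seg selects its prompts.

```python
from typing import List, Sequence, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence)


def pose_to_point_prompts(
    keypoints: Sequence[Keypoint],
    min_conf: float = 0.5,   # assumed threshold, not taken from the paper
    max_points: int = 6,     # assumed cap on the number of prompts
) -> Tuple[List[List[float]], List[int]]:
    """Convert confident keypoints into positive point prompts."""
    # Drop keypoints that are likely occluded or mislocalized.
    confident = [kp for kp in keypoints if kp[2] >= min_conf]
    # Prefer the most reliable keypoints when there are too many.
    selected = sorted(confident, key=lambda kp: kp[2], reverse=True)[:max_points]
    coords = [[x, y] for x, y, _ in selected]
    labels = [1] * len(coords)  # 1 = foreground point
    return coords, labels


# Example: one low-confidence keypoint is filtered out.
coords, labels = pose_to_point_prompts(
    [(10.0, 20.0, 0.9), (30.0, 40.0, 0.2), (50.0, 60.0, 0.7)]
)
```

The resulting lists can be converted to arrays and passed as point prompts to a promptable segmentation model.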

OCHuman-Pose dataset

A new multi-person dataset for challenging crowded scenarios. It extends the OCHuman dataset with 2D pose annotations for all visible people, including previously ignored instances, and enables more accurate evaluation of detection and pose estimation than the original OCHuman.

OCHuman-Pose Dataset

The OCHuman-Pose dataset is hosted on Hugging Face. Download the files from the link below.

Hugging Face Dataset (coming soon)

Video Explanation (2 min)

Results

Comparison of RTMDet detection and segmentation (left) with BBox-Mask-Pose (right). BMP improves the segmentation masks of a given detector, especially for disconnected body parts such as limbs. BBox-Mask-Pose also detects the correct number of people, even in scenes with extreme bounding box overlap.

BibTeX


        @InProceedings{BMPv2,
            author    = {Purkrabek, Miroslav and Kolomiiets, Constantin and Matas, Jiri},
            title     = {BBoxMaskPose v2: Expanding Mutual Conditioning to 3D},
            booktitle = {arXiv preprint arXiv:to be added},
            year      = {2026}
        }

        @InProceedings{Purkrabek2025ICCV,
            author    = {Purkrabek, Miroslav and Matas, Jiri},
            title     = {Detection, Pose Estimation and Segmentation for Multiple Bodies: Closing the Virtuous Circle},
            booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
            month     = {October},
            year      = {2025}
        }

        @InProceedings{Kolomiiets2026CVWW,
            author    = {Kolomiiets, Constantin and Purkrabek, Miroslav and Matas, Jiri},
            title     = {SAM-pose2seg: Pose-Guided Human Instance Segmentation in Crowds},
            booktitle = {Computer Vision Winter Workshop (CVWW)},
            year      = {2026}
        }