Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning


Buzhen Huang1,2   Chen Li4,5   Chongyang Xu3   Dongyue Lu2
Jinnan Chen2   Yangang Wang1   Gim Hee Lee2

Southeast University1 National University of Singapore2
Sichuan University3
IHPC, Agency for Science, Technology and Research, Singapore4
CFAR, Agency for Science, Technology and Research, Singapore5
CVPR, 2025

[Paper]
[Supp.]
[GitHub]
[Dataset]



Due to visual ambiguities and inter-person occlusions, existing human pose estimation methods cannot recover plausible close interactions from in-the-wild videos. Even state-of-the-art large foundation models (e.g., SAM) cannot accurately distinguish human semantics in such challenging scenarios. In this work, we find that human appearance provides a straightforward cue to address these obstacles. Based on this observation, we propose a dual-branch optimization framework to reconstruct accurate interactive motions with plausible body contacts constrained by human appearance, social proxemics, and physical laws. Specifically, we first train a diffusion model to learn human proxemic behavior and pose prior knowledge. The trained network and two optimizable tensors are then incorporated into a dual-branch optimization framework to reconstruct human motions and appearances. Several constraints based on 3D Gaussians, 2D keypoints, and mesh penetrations are also designed to assist the optimization. With the proxemics prior and these diverse constraints, our method can estimate accurate interactions from in-the-wild videos captured in complex environments. We further build a dataset with pseudo ground-truth interaction annotations, which may promote future research on pose estimation and human behavior understanding. Experimental results on several benchmarks demonstrate that our method outperforms existing approaches. The code and data will be publicly available for research purposes.

Visual Appearance

With the predicted UV Gaussian maps, we map the Gaussians to 3D space using a UV coordinate map and splat them onto the image plane. We can then reason about the depth ordering and image-model alignment by comparing the rendered and original images.
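The mapping and depth-ordering step above can be caricatured as follows. This is a minimal NumPy sketch under assumed conventions (Gaussian centers only, a simple pinhole camera, made-up array names), not the paper's renderer:

```python
import numpy as np

# Hypothetical inputs: a UV coordinate map giving each texel's base 3D position
# on the body surface, and a predicted UV Gaussian map storing 3D offsets.
H, W = 4, 4
rng = np.random.default_rng(0)
uv_coord_map = rng.uniform(-0.5, 0.5, (H, W, 3))          # base positions (m)
uv_gaussian_map = 0.01 * rng.standard_normal((H, W, 3))   # predicted offsets

# Map the Gaussians to 3D space: base position plus predicted offset.
points_3d = (uv_coord_map + uv_gaussian_map).reshape(-1, 3)
points_3d[:, 2] += 3.0  # place the body in front of the camera

# Splat (here: project the centers only) with an assumed pinhole camera.
f, cx, cy = 500.0, 128.0, 128.0
z = points_3d[:, 2]
u = f * points_3d[:, 0] / z + cx
v = f * points_3d[:, 1] / z + cy

# Depth ordering: when splats from two people land on the same pixel,
# the one with smaller depth z should be the visible one.
order = np.argsort(z)  # front-to-back ordering of the splats
```

A full implementation would rasterize each anisotropic Gaussian with alpha blending rather than projecting point centers, but the depth-sorting logic is the same.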

Method


Overview of our framework. We propose a dual-branch optimization framework to reconstruct close human interactions from a monocular in-the-wild video. By optimizing the proxemics prior, the U-Net backbone, and two optimizable tensors, the framework simultaneously predicts interactive motions and coarse appearances. With constraints from 2D observations, physics, and prior knowledge, the framework outputs 3D interactions with plausible body poses, natural proxemic relationships, and accurate physical contacts.
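The three constraint families named above (2D observations, physics, and prior knowledge) are typically combined into one weighted objective. The toy sketch below illustrates that structure only; the term definitions, weights, and function names are assumptions, not the paper's actual losses:

```python
import numpy as np

def keypoint_loss(pred_2d, obs_2d):
    # 2D observation term: reprojection error of detected body keypoints.
    return np.mean(np.sum((pred_2d - obs_2d) ** 2, axis=-1))

def penetration_loss(signed_dists):
    # Physics term: penalize negative signed distances between the two
    # body meshes, i.e. interpenetration.
    return np.mean(np.maximum(-signed_dists, 0.0) ** 2)

def prior_loss(pose_residual):
    # Prior term: deviation from the learned proxemics/pose prior,
    # caricatured here as an L2 penalty on a residual.
    return np.mean(pose_residual ** 2)

def total_objective(pred_2d, obs_2d, signed_dists, pose_residual,
                    w_kp=1.0, w_pen=10.0, w_prior=0.1):
    # Weighted sum minimized over poses, appearances, and the two tensors.
    return (w_kp * keypoint_loss(pred_2d, obs_2d)
            + w_pen * penetration_loss(signed_dists)
            + w_prior * prior_loss(pose_residual))
```

In practice such an objective is minimized with a gradient-based optimizer over the pose parameters and appearance tensors, with the penetration weight set high so contacts stay physically plausible.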

Qualitative Comparisons

Our method leverages human appearance, proxemics, and physics to reduce visual ambiguity, yielding more accurate reconstructions than prior approaches.

Citation

@inproceedings{closeapp,
    title     = {Reconstructing Close Human Interaction with Appearance and Proxemics Reasoning},
    author    = {Huang, Buzhen and Li, Chen and Xu, Chongyang and Lu, Dongyue and Chen, Jinnan and 
                Wang, Yangang and Lee, Gim Hee},
    booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year      = {2025}}
                    

Acknowledgments


This research is supported by China Scholarship Council under Grant Number 202306090192. This project page template is based on this page.