HRM2Avatar: High-Fidelity Real-Time Mobile Avatars from Monocular Phone Scans

¹Alibaba Group, ²Shanghai Jiao Tong University
SIGGRAPH Asia 2025

*Equal Contribution, Corresponding Author, Project Leader
[Teaser figure]

Our method creates high-fidelity avatars with realistic clothing dynamics from monocular smartphone scans, and achieves 2048×945@120FPS on iPhone 15 Pro Max and 1920×1824×2@90FPS on Apple Vision Pro with 533,695 splats. Each subject's data is captured with a single iPhone in 5 minutes.

Abstract

We present HRM2Avatar, a novel framework for creating high-fidelity avatars from monocular phone scans, which can be rendered and animated in real time on mobile devices. Monocular capture with commodity smartphones provides a low-cost, pervasive alternative to studio-grade multi-camera rigs, making avatar digitization accessible to non-expert users. Reconstructing high-fidelity avatars from single-view video sequences poses significant challenges due to the limited visual and geometric data available relative to multi-camera setups. To address these limitations, at the data level, our method leverages two types of data captured with smartphones: static pose sequences for detailed texture reconstruction and dynamic motion sequences for learning pose-dependent deformations and lighting changes. At the representation level, we employ a lightweight yet expressive representation to reconstruct high-fidelity digital humans from sparse monocular data. First, we extract explicit garment meshes from monocular data to model clothing deformations more effectively. Second, we attach illumination-aware Gaussians to the mesh surface, enabling high-fidelity rendering and capturing pose-dependent lighting changes. This representation efficiently learns high-resolution and dynamic information from our tailored monocular data, enabling the creation of detailed avatars. At the rendering level, real-time performance is critical for rendering and animating high-fidelity avatars in AR/VR, social gaming, and on-device creation, demanding sub-frame responsiveness. Our fully GPU-driven rendering pipeline delivers 120 FPS on mobile devices and 90 FPS on standalone VR devices at 2K resolution, over 2.7× faster than representative mobile-engine baselines. Experiments show that HRM2Avatar delivers superior visual realism and real-time interactivity at high resolutions, outperforming state-of-the-art monocular methods.
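The abstract's key representational idea is that Gaussian splats are attached to an explicit garment mesh, so that splats follow the mesh as it deforms. The paper does not publish code here, but a minimal sketch of one common way to implement such mesh-anchored splats (barycentric attachment plus a learned normal offset; all names and shapes below are assumptions for illustration) looks like this:

```python
# Hypothetical sketch (not the authors' code): anchoring Gaussian splats to a
# driving mesh so splat centers follow mesh deformation.
import torch

def splat_positions(vertices, faces, face_idx, bary, normal_offset):
    """Compute world-space splat centers from a deformed mesh.

    vertices:      (V, 3) deformed mesh vertex positions
    faces:         (F, 3) triangle vertex indices
    face_idx:      (N,)   triangle each Gaussian is attached to
    bary:          (N, 3) barycentric coordinates inside that triangle
    normal_offset: (N, 1) learned offset along the face normal
    """
    tri = vertices[faces[face_idx]]                  # (N, 3, 3) triangle corners
    base = (bary.unsqueeze(-1) * tri).sum(dim=1)     # (N, 3) point on the triangle
    e1 = tri[:, 1] - tri[:, 0]
    e2 = tri[:, 2] - tri[:, 0]
    normal = torch.nn.functional.normalize(torch.cross(e1, e2, dim=-1), dim=-1)
    return base + normal_offset * normal             # (N, 3) splat centers

# Usage with random data, just to show the shapes involved.
V, F, N = 5000, 9000, 20000
vertices = torch.randn(V, 3)
faces = torch.randint(0, V, (F, 3))
face_idx = torch.randint(0, F, (N,))
bary = torch.rand(N, 3)
bary = bary / bary.sum(dim=1, keepdim=True)
offset = 0.01 * torch.randn(N, 1)
centers = splat_positions(vertices, faces, face_idx, bary, offset)  # (N, 3)
```

Because the splats inherit motion from the mesh, only low-dimensional mesh deformations need to be predicted per frame, which is what makes real-time mobile rendering plausible.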

Method Overview

[Method overview figure]

Given the two-stage phone scans of a subject, we construct a clothed, mesh-driven Gaussian avatar. Static, texture-rich images provide strict supervision of the Gaussian attributes 𝗴, while dynamic, motion-intensive sequences prioritize optimization of the deformation ΔVd and illumination L. Through a deformation MLP, an illumination MLP, and a GPU-driven Gaussian rendering pipeline, realistic avatars can be rendered and animated in real time on mobile devices.
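To make the roles of ΔVd and L concrete, here is a minimal, hedged sketch of how a pose-conditioned deformation MLP and illumination MLP could modulate a mesh-driven Gaussian avatar. The network sizes, pose dimension, and the scalar form of the lighting factor are assumptions for illustration, not the released architecture:

```python
# Hypothetical sketch (not the release code): pose-conditioned deformation and
# illumination networks driving a mesh-attached Gaussian avatar.
import torch
import torch.nn as nn

class DeformationMLP(nn.Module):
    """Predicts per-vertex offsets (ΔVd) from a pose vector."""
    def __init__(self, pose_dim, num_vertices, hidden=256):
        super().__init__()
        self.num_vertices = num_vertices
        self.net = nn.Sequential(
            nn.Linear(pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_vertices * 3),
        )

    def forward(self, pose):                        # pose: (B, pose_dim)
        return self.net(pose).view(-1, self.num_vertices, 3)

class IlluminationMLP(nn.Module):
    """Predicts a per-Gaussian shading factor (L) from the same pose vector."""
    def __init__(self, pose_dim, num_gaussians, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_gaussians),
        )

    def forward(self, pose):                        # returns (B, num_gaussians)
        return torch.sigmoid(self.net(pose)) * 2.0  # factors around 1: dim or brighten

# Minimal forward pass with made-up sizes.
pose_dim, V, N = 72, 5000, 20000
deform_mlp = DeformationMLP(pose_dim, V)
illum_mlp = IlluminationMLP(pose_dim, N)
pose = torch.randn(1, pose_dim)
template_vertices = torch.randn(V, 3)
delta_v = deform_mlp(pose)[0]                       # ΔVd, shape (V, 3)
deformed_vertices = template_vertices + delta_v     # drives the attached splats
light = illum_mlp(pose)[0]                          # per-Gaussian lighting factor
```

In this reading, the static scan fixes the splats' appearance once, while the two small MLPs supply only the pose-dependent residuals (geometry and shading), keeping the per-frame cost low enough for the GPU-driven mobile pipeline.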

Comparisons

Video Presentation


BibTeX

@misc{shi2025hrm2avatarhighfidelityrealtimemobile,
      title={HRM^2Avatar: High-Fidelity Real-Time Mobile Avatars from Monocular Phone Scans}, 
      author={Chao Shi and Shenghao Jia and Jinhui Liu and Yong Zhang and Liangchao Zhu and Zhonglei Yang and Jinze Ma and Chaoyue Niu and Chengfei Lv},
      year={2025},
      eprint={2510.13587},
      archivePrefix={arXiv},
      primaryClass={cs.GR},
      doi={10.1145/3757377.3763894},
      url={https://arxiv.org/abs/2510.13587}, 
}