WildAni4D: Towards 4D Animal Mesh Reconstruction

Abstract

Recovering animal motion in 4D.

WildAni4D tackles the data scarcity and temporal instability that make animal video reconstruction difficult.

Recovering 4D animal motion, including 3D geometry and global trajectory, is essential for quantitative biomechanics and behavioral analysis. Existing methods are limited by scarce annotated video data and, because they reconstruct each frame independently, suffer from temporal instability. WildAni4D combines a synthetic animal video generation pipeline with the Animal Video Transformer, a reconstruction model that estimates temporally coherent motion by predicting a single sequence-level shape alongside per-frame poses. The resulting system reduces temporal pose flicker and shape drift, enabling large-scale 4D animal reconstruction and downstream applications including motion annotation, animatable reconstruction, and text-to-motion generation.

Input: Monocular animal video

Output: Mesh, pose, shape, trajectory

Training: Synthetic videos with 3D labels

Stability: One shape per sequence

Comparative Demo

WildAni4D produces cleaner 4D recovery on Veo3 videos.

Three Veo3-generated animal samples are reconstructed with AniMer + DROID-SLAM, GenZoo + DROID-SLAM, and WildAni4D. The demo highlights temporal stability, shape consistency, and world-grounded motion recovery.

Baseline 1: AniMer + DROID-SLAM

Baseline 2: GenZoo + DROID-SLAM

Ours: WildAni4D

Across the three samples, our method preserves a consistent animal shape and predicts stable poses without jitter. In contrast, the baseline reconstructions show noticeable noise, including unstable tail motion and frame-to-frame shape fluctuations.

Contributions

A complete data-and-model pipeline.

Data

WildAni4D-Gen

Scalable synthetic video generation combining dynamic textured SMAL animals, diverse 3D scenes, and realistic camera motion.

Model

Animal Video Transformer

A video reconstruction model that uses temporal features and camera trajectory estimation to recover world-grounded animal motion.

Stability

Sequence-level shape

Predicting one shape for the full sequence suppresses frame-wise drift while preserving per-frame articulated motion.
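
To make "world-grounded" concrete, the following is a minimal sketch of how a per-frame root position predicted in camera coordinates could be moved into a fixed world frame once a camera trajectory is available. It is an illustration under stated assumptions, not the WildAni4D implementation: the function names `camera_to_world` and `world_trajectory` are hypothetical, and the camera pose is assumed to be given as a camera-to-world rotation matrix plus the camera center in world coordinates.

```python
from typing import List, Sequence

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]


def camera_to_world(point_cam: Vec3, rot_c2w: Mat3, cam_center_world: Vec3) -> List[float]:
    """Map a camera-frame point into the world frame: p_world = R * p_cam + c.

    Assumption: rot_c2w is a 3x3 camera-to-world rotation and
    cam_center_world is the camera center expressed in world coordinates.
    """
    return [
        sum(rot_c2w[i][j] * point_cam[j] for j in range(3)) + cam_center_world[i]
        for i in range(3)
    ]


def world_trajectory(
    roots_cam: Sequence[Vec3],
    rotations_c2w: Sequence[Mat3],
    cam_centers_world: Sequence[Vec3],
) -> List[List[float]]:
    """Apply each frame's camera pose to that frame's root position."""
    return [
        camera_to_world(p, r, c)
        for p, r, c in zip(roots_cam, rotations_c2w, cam_centers_world)
    ]


# Hypothetical two-frame example: the animal stands still while the
# camera slides one unit along the x axis without rotating.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
roots_cam = [[0.0, 0.0, 5.0], [-1.0, 0.0, 5.0]]
cam_centers = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
trajectory = world_trajectory(roots_cam, [IDENTITY, IDENTITY], cam_centers)
```

In this toy case the animal appears to move in camera coordinates only because the camera moves; after the transform, both frames map to the same world position, which is the behavior a world-grounded trajectory should have.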

Method

Synthetic videos meet video-native reconstruction.

WildAni4D first creates fully annotated animal videos, then trains a temporal reconstruction model for stable 4D mesh recovery.

WildAni4D synthetic animal video generation pipeline

WildAni4D-Gen. Dynamic animals, textured SMAL shapes, diverse 3D scenes, and camera trajectories are rendered into annotated training videos.

Animal Video Transformer. Temporal modeling after the ViT backbone predicts per-frame pose and translation together with a sequence-level shape parameter.
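
The output structure described in the caption above (one shape for the whole clip, pose and translation for every frame) can be sketched in a few lines. This is a simplified illustration, not the Animal Video Transformer itself: the mean pooling over time, the plain linear heads, and the names `SequencePrediction`, `linear`, and `predict_sequence` are all assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# A head is (weight rows, bias); purely illustrative, not the model's layers.
Head = Tuple[Sequence[Sequence[float]], Sequence[float]]


@dataclass
class SequencePrediction:
    shape: List[float]               # one vector shared by the whole clip
    poses: List[List[float]]         # one vector per frame
    translations: List[List[float]]  # one vector per frame


def linear(x: Sequence[float], weight: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """Plain affine map y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def predict_sequence(
    frame_features: Sequence[Sequence[float]],
    shape_head: Head,
    pose_head: Head,
    trans_head: Head,
) -> SequencePrediction:
    """Predict a single shape from pooled features and per-frame pose/translation."""
    n = len(frame_features)
    dim = len(frame_features[0])
    # Assumption: average the per-frame features over time before the shape head.
    pooled = [sum(f[d] for f in frame_features) / n for d in range(dim)]
    return SequencePrediction(
        shape=linear(pooled, *shape_head),
        poses=[linear(f, *pose_head) for f in frame_features],
        translations=[linear(f, *trans_head) for f in frame_features],
    )


# Hypothetical two-frame clip with 2-D features and 1-D outputs.
features = [[1.0, 0.0], [3.0, 2.0]]
pred = predict_sequence(
    features,
    shape_head=([[1.0, 0.0]], [0.0]),
    pose_head=([[0.0, 1.0]], [0.0]),
    trans_head=([[1.0, 1.0]], [1.0]),
)
```

Because the shape is computed once from the pooled features, it cannot vary between frames, while `poses` and `translations` keep one entry per frame.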

Why it matters

Frame-wise methods often produce plausible single frames but flicker across time. WildAni4D makes the reconstruction video-native: the animal's shape stays consistent across the sequence while pose and global motion evolve frame by frame.
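
Flicker and shape drift can be quantified in several ways. The sketch below shows two simple, generic measures: the mean acceleration of a 3D point track (a common proxy for jitter) and the per-dimension standard deviation of per-frame shape vectors. These are illustrative choices and not necessarily the evaluation metrics used for WildAni4D; the function names `mean_acceleration` and `shape_drift` are hypothetical.

```python
import math
from typing import Sequence


def mean_acceleration(track: Sequence[Sequence[float]]) -> float:
    """Mean norm of the second finite difference of a point track.

    A smooth, constant-velocity track scores 0; frame-to-frame jitter
    raises the score.
    """
    if len(track) < 3:
        return 0.0
    total = 0.0
    for prev, cur, nxt in zip(track, track[1:], track[2:]):
        acc = [n - 2.0 * c + p for p, c, n in zip(prev, cur, nxt)]
        total += math.sqrt(sum(a * a for a in acc))
    return total / (len(track) - 2)


def shape_drift(shapes: Sequence[Sequence[float]]) -> float:
    """Mean per-dimension (population) standard deviation across frames."""
    n = len(shapes)
    dim = len(shapes[0])
    total = 0.0
    for d in range(dim):
        values = [s[d] for s in shapes]
        mean = sum(values) / n
        total += math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return total / dim


# Hypothetical tracks: one moving at constant velocity, one oscillating.
smooth = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0]]
jittery = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
smooth_score = mean_acceleration(smooth)
jitter_score = mean_acceleration(jittery)
```

A sequence-level shape repeats the same vector in every frame, so `shape_drift` is zero for it by construction, whereas independent per-frame shape estimates generally give a positive value.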

Results

Stable reconstruction across challenging sequences.

Qualitative reconstruction results on challenging animal videos.

Additional animal reconstruction comparison results

Additional comparison visualizations across frames.

Applications

From reconstruction to reusable animal motion.

Temporally coherent 4D outputs can support annotation, animation, and generation pipelines.

Downstream applications include animal motion data annotation, animatable animal reconstruction, and text-to-motion generation.

Citation

Cite WildAni4D.

@inproceedings{cho2026wildani4d,
  title     = {WildAni4D: Towards 4D Animal Mesh Reconstruction},
  author    = {Cho, Gyeongsu and Hu, Hezhen and Soon, Donghyeon and Kang, Changwoo and Joo, Kyungdon},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
  year      = {2026}
}