Controllable Text-to-Motion Generation via Modular Body-Part Phase Control

Minyue Dai1 · Ke Fan2 · Anyi Rao3 · Jingbo Wang4 · Bo Dai5

1 Fudan University

2 Shanghai Jiao Tong University

3 HKUST

4 Shanghai AI Laboratory

5 The University of Hong Kong

Overview

Abstract

Text-to-motion (T2M) generation is becoming a practical tool for animation and interactive avatars. However, modifying specific body parts while maintaining overall motion coherence remains challenging. Existing methods typically rely on cumbersome, high-dimensional joint constraints (e.g., trajectories), which hinder user-friendly, iterative refinement. To address this, we propose Modular Body-Part Phase Control, a plug-and-play framework enabling structured, localized editing via a compact, scalar-based phase interface. By modeling body-part latent motion channels as sinusoidal phase signals—characterized by amplitude, frequency, phase shift, and offset—we extract interpretable codes that capture part-specific dynamics. A modular Phase ControlNet branch then injects this signal via residual feature modulation, seamlessly decoupling control from the generative backbone. Experiments on both diffusion- and flow-based models demonstrate that our approach provides predictable and fine-grained control over motion magnitude, speed, and timing. It preserves global motion coherence and offers a practical paradigm for controllable T2M generation.
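The abstract describes each body-part latent channel as a sinusoidal phase signal governed by four scalars: amplitude, frequency, phase shift, and offset. A minimal sketch of that parameterization (illustrative only; the function name, sampling rate, and values are assumptions, not the paper's implementation):

```python
import numpy as np

def phase_signal(t, amplitude, frequency, phase_shift, offset):
    """One latent channel modeled as a sinusoid with four scalar controls:
    amplitude (magnitude), frequency (pace), phase shift (timing), offset (bias)."""
    return amplitude * np.sin(2.0 * np.pi * frequency * t + phase_shift) + offset

# Hypothetical example: one body-part channel over a 2-second clip at 30 fps.
t = np.linspace(0.0, 2.0, 60)
channel = phase_signal(t, amplitude=0.5, frequency=1.5, phase_shift=0.0, offset=0.1)
```

Because each control is a single scalar, editing a part's dynamics reduces to changing one number rather than specifying a full joint trajectory.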

Method

Pipeline / Method Overview

Pipeline overview figure (assets/pipeline.jpg)
Our framework augments a pretrained latent-space text-to-motion generator with a modular body-part phase control branch. A frozen body-part phase extractor predicts, from a reference motion, per-part periodic parameters: amplitude, frequency, phase shift, and offset. These structured controls are encoded into a phase manifold and injected into the backbone through a Phase ControlNet, enabling localized edits while preserving overall motion coherence.
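The residual injection described above can be sketched as a ControlNet-style branch whose output projection starts at zero, so the frozen backbone is initially untouched. All names, shapes, and the tanh encoder below are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_init_linear(d_in, d_out):
    # Zero-initialized output projection: at the start of training the
    # control branch contributes no residual, so the frozen backbone's
    # features pass through unchanged (a common ControlNet-style trick).
    return np.zeros((d_in, d_out))

def control_residual(phase_codes, W_embed, W_out):
    """Map per-part phase scalars to a feature residual for the backbone."""
    h = np.tanh(phase_codes @ W_embed)  # small control encoder (assumed form)
    return h @ W_out                    # zero-initialized projection

num_parts, d_feat = 5, 16
phase_codes = rng.normal(size=(num_parts, 4))   # amp, freq, shift, offset per part
W_embed = rng.normal(size=(4, d_feat))
W_out = zero_init_linear(d_feat, d_feat)

backbone_feat = rng.normal(size=(num_parts, d_feat))
modulated = backbone_feat + control_residual(phase_codes, W_embed, W_out)
```

With `W_out` at zero, `modulated` equals `backbone_feat` exactly; control influence grows only as the branch trains, which is what decouples control from the generative backbone.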

Experiments

Video Results

We visualize localized body-part control results produced by our modular phase interface. The three groups below highlight how scalar edits to amplitude, frequency, and phase shift yield predictable changes in motion magnitude, execution pace, and temporal alignment while keeping the remaining motion coherent.
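The predictability claimed above follows directly from the sinusoidal parameterization: each scalar edit has a closed-form effect on the signal. A small self-contained demo under the same assumed parameterization (not the paper's code):

```python
import numpy as np

def phase_signal(t, amplitude, frequency, phase_shift, offset):
    return amplitude * np.sin(2.0 * np.pi * frequency * t + phase_shift) + offset

t = np.linspace(0.0, 1.0, 200)
base = phase_signal(t, 0.5, 1.0, 0.0, 0.0)

# Doubling the amplitude scalar doubles motion magnitude; doubling the
# frequency scalar doubles execution pace without changing magnitude;
# a nonzero phase shift delays or advances the motion in time.
bigger = phase_signal(t, 1.0, 1.0, 0.0, 0.0)
faster = phase_signal(t, 0.5, 2.0, 0.0, 0.0)
```

Scalar edits therefore translate into the predictable magnitude, pace, and timing changes shown in the video groups.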

Reference

BibTeX / Citation

@article{dai2026bodypartphase,
  title     = {Controllable Text-to-Motion Generation via Modular Body-Part Phase Control},
  author    = {Minyue Dai and Ke Fan and Anyi Rao and Jingbo Wang and Bo Dai},
  journal   = {arXiv preprint},
  year      = {2026}
}