Controllable Text-to-Motion Generation via Modular Body-Part Phase Control

Minyue Dai1 · Ke Fan2 · Anyi Rao3 · Jingbo Wang4 · Bo Dai5

1 Fudan University

2 Shanghai Jiao Tong University

3 HKUST

4 Shanghai AI Laboratory

5 The University of Hong Kong

Overview

Abstract

Text-to-motion (T2M) generation is becoming a practical tool for animation and interactive avatars. However, modifying specific body parts while maintaining overall motion coherence remains challenging. Existing methods typically rely on cumbersome, high-dimensional joint constraints (e.g., trajectories), which hinder user-friendly, iterative refinement. To address this, we propose Modular Body-Part Phase Control, a plug-and-play framework enabling structured, localized editing via a compact, scalar-based phase interface. By modeling body-part latent motion channels as sinusoidal phase signals—characterized by amplitude, frequency, phase shift, and offset—we extract interpretable codes that capture part-specific dynamics. A modular Phase ControlNet branch then injects this signal via residual feature modulation, seamlessly decoupling control from the generative backbone. Experiments on both diffusion- and flow-based models demonstrate that our approach provides predictable and fine-grained control over motion magnitude, speed, and timing. It preserves global motion coherence and offers a practical paradigm for controllable T2M generation.
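The abstract describes each body-part latent channel as a sinusoidal phase signal governed by four scalars: amplitude, frequency, phase shift, and offset. A minimal sketch of that parameterization (illustrative only; the function name, sampling rate, and values are assumptions, not the paper's implementation):

```python
import numpy as np

def phase_signal(t, amplitude, frequency, phase_shift, offset):
    """One latent channel modeled as a sinusoid with four scalar controls:
    amplitude (magnitude), frequency (pace), phase shift (timing), offset (bias)."""
    return amplitude * np.sin(2.0 * np.pi * frequency * t + phase_shift) + offset

# Hypothetical example: one body-part channel over a 2-second clip at 30 fps.
t = np.linspace(0.0, 2.0, 60)
channel = phase_signal(t, amplitude=0.5, frequency=1.5, phase_shift=0.0, offset=0.1)
```

Because each control is a single scalar, editing a part's dynamics reduces to changing one number rather than specifying a full joint trajectory.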

Method

Pipeline / Method Overview

Pipeline overview figure (assets/pipeline.jpg)
Our framework augments a pretrained latent-space text-to-motion generator with a modular body-part phase control branch. A frozen body-part phase extractor predicts, from a reference motion, per-part periodic parameters: amplitude, frequency, phase shift, and offset. These structured controls are encoded into a phase manifold and injected into the backbone through a Phase ControlNet, enabling localized edits while preserving overall motion coherence.
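The residual injection described above can be sketched as a ControlNet-style branch whose output projection starts at zero, so the frozen backbone is initially untouched. All names, shapes, and the tanh encoder below are assumptions for illustration, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def zero_init_linear(d_in, d_out):
    # Zero-initialized output projection: at the start of training the
    # control branch contributes no residual, so the frozen backbone's
    # features pass through unchanged (a common ControlNet-style trick).
    return np.zeros((d_in, d_out))

def control_residual(phase_codes, W_embed, W_out):
    """Map per-part phase scalars to a feature residual for the backbone."""
    h = np.tanh(phase_codes @ W_embed)  # small control encoder (assumed form)
    return h @ W_out                    # zero-initialized projection

num_parts, d_feat = 5, 16
phase_codes = rng.normal(size=(num_parts, 4))   # amp, freq, shift, offset per part
W_embed = rng.normal(size=(4, d_feat))
W_out = zero_init_linear(d_feat, d_feat)

backbone_feat = rng.normal(size=(num_parts, d_feat))
modulated = backbone_feat + control_residual(phase_codes, W_embed, W_out)
```

With `W_out` at zero, `modulated` equals `backbone_feat` exactly; control influence grows only as the branch trains, which is what decouples control from the generative backbone.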

Experiments

Video Results

We visualize localized body-part control results produced by our modular phase interface. The three groups below highlight how scalar edits to amplitude, frequency, and phase shift yield predictable changes in motion magnitude, execution pace, and temporal alignment while keeping the remaining motion coherent.
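The predictability claimed above follows directly from the sinusoidal parameterization: each scalar edit has a closed-form effect on the signal. A small self-contained demo under the same assumed parameterization (not the paper's code):

```python
import numpy as np

def phase_signal(t, amplitude, frequency, phase_shift, offset):
    return amplitude * np.sin(2.0 * np.pi * frequency * t + phase_shift) + offset

t = np.linspace(0.0, 1.0, 200)
base = phase_signal(t, 0.5, 1.0, 0.0, 0.0)

# Doubling the amplitude scalar doubles motion magnitude; doubling the
# frequency scalar doubles execution pace without changing magnitude;
# a nonzero phase shift delays or advances the motion in time.
bigger = phase_signal(t, 1.0, 1.0, 0.0, 0.0)
faster = phase_signal(t, 0.5, 2.0, 0.0, 0.0)
```

Scalar edits therefore translate into the predictable magnitude, pace, and timing changes shown in the video groups.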

Reference

BibTeX / Citation

@article{dai2026bodypartphase,
  title     = {Controllable Text-to-Motion Generation via Modular Body-Part Phase Control},
  author    = {Minyue Dai and Ke Fan and Anyi Rao and Jingbo Wang and Bo Dai},
  journal   = {arXiv preprint},
  year      = {2026}
}