ACMo: Attribute Controllable Motion Generation

Xidian University
arXiv 2025
arXiv | Code (Coming)

Attribute Controllable Motion (ACMo) is a controllable, stylized text-to-motion generation architecture.

Abstract

Attributes such as style, fine-grained text, and trajectory are specific conditions for describing motion. However, existing methods often lack precise user control over motion attributes and suffer from limited generalizability to unseen motions. This work introduces an Attribute Controllable Motion generation architecture that addresses these challenges by decoupling arbitrary conditions and controlling them separately. First, we explore the Attribute Diffusion Model, which improves text-to-motion performance by decoupling text and motion learning, since controllable models rely heavily on the pre-trained model. Then, we introduce the Motion Adapter to quickly fine-tune on previously unseen motion patterns. Its motion prompt inputs enable multimodal text-to-motion generation that captures user-specified styles. Finally, we propose an LLM Planner to bridge the gap between unseen attributes and dataset-specific texts via local knowledge, enabling user-friendly interaction. Our approach introduces motion prompts for stylized generation, enabling fine-grained and user-friendly attribute control while delivering performance comparable to state-of-the-art methods.

Introduction

ACMo handles motions beyond the dataset's representation, using motion prompts for stylized multimodal generation and multi-attribute control, with the LLM Planner mapping zero-shot unseen attributes to dataset texts. Rapid fine-tuning enables the model to recognize new motion patterns. Bracketed text in the prompt enhances the stability of style and trajectory control. Control your motions as you wish!
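As an illustration of the planner idea, here is a minimal sketch of how unseen user text could be rewritten into a dataset-style caption plus a bracketed style tag; the prompt template, function name, and style list are our own assumptions for illustration, not the paper's code.

```python
# Hypothetical LLM Planner sketch: rewrite free-form user text into a
# dataset-style caption plus a bracketed style tag. The filled prompt
# would be sent to a locally deployed LLM; only prompt construction is
# shown here.
PLANNER_PROMPT = """You translate motion requests into training-set style.
Rewrite the user's request as:
1. a HumanML3D-style caption, e.g. "a person walks forward slowly."
2. a bracketed style tag chosen from the known style list, e.g. "[Zombie]".

Known styles: {styles}
User request: {request}
"""

def build_planner_prompt(request: str, styles: list[str]) -> str:
    """Fill the template; the LLM's reply is then fed to the motion model."""
    return PLANNER_PROMPT.format(styles=", ".join(styles), request=request)

print(build_planner_prompt("stagger around like a movie monster",
                           ["Zombie", "Proud", "Old"]))
```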

Main contributions:

1) To the best of our knowledge, this is the first work to propose decoupling arbitrary conditions and controlling each separately for multi-attribute text-to-motion generation.

2) The Motion Adapter is a lightweight, efficient fine-tuning method for multimodal motion generation, achieving state-of-the-art fine-tuning results on 100STYLE (see the sketch after this list).

3) We leverage LLMs to enable user-friendly text-to-motion generation, capable of handling previously unseen text inputs.

4) The Attribute Diffusion Model achieves performance on par with state-of-the-art latent diffusion models on HumanML3D.
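To make the fine-tuning idea concrete, below is a minimal PyTorch sketch of a residual bottleneck adapter of the kind such a Motion Adapter could use; the class name, dimensions, and wiring are our illustration under standard adapter-tuning assumptions, not the released ACMo code.

```python
import torch
import torch.nn as nn

class MotionAdapter(nn.Module):
    """Hypothetical residual bottleneck adapter: only these small
    projections are trained on new motion patterns, so the frozen
    backbone's original knowledge is preserved."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# Usage: freeze the pre-trained backbone, train only the adapter.
backbone = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False

adapter = MotionAdapter(dim=512)
x = torch.randn(2, 60, 512)      # (batch, frames, latent dim)
y = adapter(backbone(x))         # only adapter parameters receive gradients
```

Because the up-projection starts at zero, the adapted model is exactly the pre-trained model at initialization, which is what makes this kind of fine-tuning fast and stable on small style datasets.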

Method

The ACMo network architecture. Stage 1: the Attribute Diffusion Model is trained by decoupling text and motion in a more powerful latent space. Stage 2: the Motion Adapter fine-tunes on new motion patterns while preserving the original knowledge. Stage 3: trajectory control is added through ControlNet. Finally, the LLM Planner module handles text processing at inference time.
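A minimal sketch of what decoupled attribute conditioning could look like in a latent denoising loop, assuming PyTorch; the toy denoiser, dimensions, and crude Euler-style update are placeholders for illustration, not the ACMo implementation.

```python
import torch
import torch.nn as nn

class ToyLatentDenoiser(nn.Module):
    """Stand-in for the latent-space denoiser: text, style, and
    trajectory enter as separate embeddings, so each attribute can
    be swapped or zeroed out independently."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 4, dim), nn.GELU(),
                                 nn.Linear(dim, dim))

    def forward(self, z, text, style, traj):
        return self.net(torch.cat([z, text, style, traj], dim=-1))

def generate(denoiser, text, style, traj, steps: int = 50, dim: int = 256):
    """Toy reverse loop (illustration only); the resulting latent would
    be decoded to a motion sequence by the pre-trained motion VAE."""
    z = torch.randn(1, dim)
    for _ in range(steps):
        z = z - denoiser(z, text, style, traj) / steps
    return z

d = 256
latent = generate(ToyLatentDenoiser(d), torch.randn(1, d),
                  torch.randn(1, d), torch.zeros(1, d))  # zeros = no trajectory control
```

Keeping each condition as its own input is what allows one attribute to be dropped, replaced, or fine-tuned (e.g., by the Motion Adapter or ControlNet) without retraining the others.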

Results

BibTeX

Coming soon.