DisCo

DisCo is an open-source AI model that generates realistic human dance images and videos with disentangled control over subject appearance, background, and pose — enabling flexible, faithful, and compositional dance synthesis.

About

DisCo is a research-grade generative AI system designed to tackle the challenge of realistic human dance video synthesis in real-world scenarios. Unlike prior methods, DisCo introduces a novel disentangled control architecture that treats the human foreground subject, background, and target pose as fully independent conditioning signals. The subject's appearance is encoded via CLIP image embeddings fed into U-Net cross-attention modules, while separate ControlNet branches handle background and pose guidance, so any combination of these three elements can be drawn from different sources.

A key component is Human Attribute Pre-training, a proxy task in which the model reconstructs complete images from disentangled foreground and background regions. This allows DisCo to learn rich representations of human faces, clothing, and body characteristics from large-scale unpaired image collections, greatly improving generalization to entirely unseen subjects and environments without requiring paired training data for pose control.

DisCo supports a wide range of tasks: retargeting poses to new subjects, generating dance sequences from unseen poses or human identities, performing full unseen composition (new subject + new background + new pose), and human-specific fine-tuning for personalized results. Extensive qualitative and quantitative benchmarks demonstrate state-of-the-art fidelity and flexibility. The system is primarily aimed at computer vision researchers, AI developers, and creative technologists exploring controllable human video generation, motion retargeting, and generative video synthesis.
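
The conditioning layout described above can be pictured with a minimal PyTorch sketch. Everything below (module names, tensor shapes, and the toy denoiser itself) is an illustrative assumption rather than the official DisCo implementation; it only shows how subject appearance enters through cross-attention while pose and background enter as additive, ControlNet-style residuals.

```python
# Toy sketch of disentangled conditioning (illustrative assumption, not the DisCo code).
# Timestep embedding and the full U-Net structure are omitted for brevity.
import torch
import torch.nn as nn

class ControlBranch(nn.Module):
    """ControlNet-style branch: encodes one condition into an additive feature residual."""
    def __init__(self, cond_channels, feat_channels):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(cond_channels, feat_channels, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_channels, feat_channels, 3, padding=1),
        )

    def forward(self, cond):
        return self.encode(cond)

class ToyDisentangledDenoiser(nn.Module):
    def __init__(self, latent_channels=4, feat_channels=64, clip_dim=768):
        super().__init__()
        self.in_conv = nn.Conv2d(latent_channels, feat_channels, 3, padding=1)
        self.pose_branch = ControlBranch(3, feat_channels)              # pose skeleton map
        self.bg_branch = ControlBranch(latent_channels, feat_channels)  # background latents
        # Subject appearance enters only through cross-attention on CLIP image tokens.
        self.cross_attn = nn.MultiheadAttention(feat_channels, num_heads=4,
                                                kdim=clip_dim, vdim=clip_dim,
                                                batch_first=True)
        self.out_conv = nn.Conv2d(feat_channels, latent_channels, 3, padding=1)

    def forward(self, noisy_latents, subject_tokens, pose_map, bg_latents):
        h = self.in_conv(noisy_latents)
        # Pose and background are injected as additive residuals, so each condition
        # can be swapped independently of the others.
        h = h + self.pose_branch(pose_map) + self.bg_branch(bg_latents)
        b, c, ht, wd = h.shape
        q = h.flatten(2).transpose(1, 2)                        # (B, H*W, C)
        attn, _ = self.cross_attn(q, subject_tokens, subject_tokens)
        h = (q + attn).transpose(1, 2).reshape(b, c, ht, wd)
        return self.out_conv(h)                                 # predicted noise

# Toy usage: 64x64 latents, a 3-channel pose map, 50 CLIP image tokens of dim 768.
model = ToyDisentangledDenoiser()
eps = model(torch.randn(1, 4, 64, 64), torch.randn(1, 50, 768),
            torch.randn(1, 3, 64, 64), torch.randn(1, 4, 64, 64))
print(eps.shape)  # torch.Size([1, 4, 64, 64])
```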

Key Features

  • Disentangled Control Architecture: Separates human subject, background, and pose into independent conditioning signals using CLIP cross-attention and dual ControlNet branches, enabling precise and flexible control over each element.
  • Human Attribute Pre-training: A proxy reconstruction task trains the model on large-scale unpaired human image collections, allowing it to generalize to unseen faces, clothing, and body types without paired pose data.
  • Arbitrary Compositional Generation: Freely combine human subjects, backgrounds, and poses from entirely different sources — including unseen combinations — to create novel and diverse dance content.
  • Dance Video Sequence Synthesis: Generates temporally coherent dance video sequences with diverse motion patterns from a single reference image and a target pose sequence.
  • Human-Specific Fine-Tuning: Supports personalized fine-tuning on a specific individual to produce highly faithful dance videos tailored to that subject's appearance.

Use Cases

  • Retargeting dance poses from a reference person to a different human subject while preserving the target subject's appearance.
  • Generating dance video sequences for entirely unseen individuals using only a reference image and target pose sequence.
  • Creating creative dance videos by composing subjects, backgrounds, and poses sourced from completely different images.
  • Research into controllable, disentangled human video generation using diffusion models and ControlNet.
  • Personalized dance video production through human-specific fine-tuning on a target individual's appearance.

Pros

  • Strong Generalizability: Pre-training on large-scale unpaired human data enables DisCo to handle subjects, backgrounds, and poses that were never seen during fine-tuning.
  • Flexible Compositionality: The disentangled architecture allows any mix-and-match combination of subject, background, and pose, giving users fine-grained creative control.
  • Research-Backed Quality: Developed by NTU and Microsoft Azure AI with rigorous qualitative and quantitative evaluation, ensuring reliable and high-fidelity output.

Cons

  • High Technical Barrier: Requires expertise in deep learning, diffusion models, and ControlNet to set up and run — not suitable for non-technical users.
  • Narrow Domain Focus: Specialized exclusively for human dance synthesis; not applicable to general-purpose video generation or non-human subjects.
  • Research Prototype Maturity: As an academic project, it lacks production-grade tooling, support infrastructure, and user-friendly interfaces.

Frequently Asked Questions

What is DisCo?

DisCo (Disentangled Control) is an open-source AI research model for generating realistic human dance images and videos. It disentangles control over the human subject, background, and pose to enable faithful and compositional dance synthesis.

What inputs does DisCo require?

DisCo takes a reference image (containing the human subject and background) and a target pose as inputs. It can also accept subjects, backgrounds, and poses from entirely separate sources for compositional generation.
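
In practice, the reference image is usually decomposed into these conditions with off-the-shelf tools. Below is a minimal preprocessing sketch assuming OpenPose skeleton extraction via controlnet_aux and foreground matting via rembg; both tools, the model ID, and the file names are assumptions for illustration and may differ from the preprocessing pipeline shipped with the DisCo repository.

```python
# Illustrative input preparation: pose map, foreground subject, and background
# are produced from two source images (assumed tools, not DisCo's official pipeline).
from PIL import Image
from controlnet_aux import OpenposeDetector
from rembg import remove

# Reference image supplies subject appearance and background.
reference = Image.open("reference.png").convert("RGB")
# A separate image (or video frame) supplies the target pose.
pose_source = Image.open("pose_frame.png").convert("RGB")

# 1) Target pose: an OpenPose-style skeleton map drives the pose branch.
openpose = OpenposeDetector.from_pretrained("lllyasviel/Annotators")
pose_map = openpose(pose_source)

# 2) Foreground subject and background: matting splits the reference image.
subject_rgba = remove(reference)                    # subject with alpha mask
alpha = subject_rgba.split()[-1]                    # foreground mask channel
background = Image.composite(Image.new("RGB", reference.size), reference, alpha)

pose_map.save("pose_map.png")
subject_rgba.save("subject.png")
background.save("background.png")
```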

Can DisCo generalize to people it has never seen before?

Yes. Through human attribute pre-training on large-scale unpaired human image datasets, DisCo learns rich representations of diverse faces and clothing, enabling generalization to unseen human subjects.
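The proxy task can be sketched as a standard diffusion training step in which the clean image is reconstructed from its disentangled foreground and background. The snippet below reuses the ToyDisentangledDenoiser from the sketch in the About section purely for shape compatibility; the masking strategy, the zeroed pose input, the noise schedule, and the loss weighting are all assumptions, not the exact recipe used by DisCo.

```python
# Sketch of a Human Attribute Pre-training step: reconstruct the full image from the
# disentangled foreground (via CLIP-style tokens) and the masked-out background.
import torch
import torch.nn.functional as F

def hap_pretraining_step(model, clip_encoder, image_latents, fg_mask, timesteps, alpha_bars):
    """One denoising step of the reconstruction proxy task (simplified DDPM loss)."""
    # Disentangle: appearance tokens from the foreground, latents from the background.
    subject_tokens = clip_encoder(image_latents * fg_mask)       # (B, T, 768)
    bg_latents = image_latents * (1.0 - fg_mask)                 # subject blanked out

    # Standard diffusion training: noise the clean latents and predict that noise.
    noise = torch.randn_like(image_latents)
    a_bar = alpha_bars[timesteps].view(-1, 1, 1, 1)
    noisy = a_bar.sqrt() * image_latents + (1 - a_bar).sqrt() * noise

    # Pose input is left empty here: pre-training targets appearance attributes only.
    empty_pose = torch.zeros(image_latents.size(0), 3, *image_latents.shape[-2:])
    pred = model(noisy, subject_tokens, empty_pose, bg_latents)
    return F.mse_loss(pred, noise)

# Toy usage with random tensors and a stand-in CLIP encoder.
model = ToyDisentangledDenoiser()
fake_clip = lambda x: torch.randn(x.size(0), 50, 768)
alpha_bars = torch.linspace(0.999, 0.01, 1000)                   # toy alpha-bar schedule
loss = hap_pretraining_step(model, fake_clip,
                            torch.randn(2, 4, 64, 64),
                            (torch.rand(2, 1, 64, 64) > 0.5).float(),
                            torch.randint(0, 1000, (2,)), alpha_bars)
loss.backward()
```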

Is DisCo open source?

Yes. DisCo is an academic research project with code publicly available. It was developed by researchers at Nanyang Technological University and Microsoft Azure AI.

What are the main use cases for DisCo?

DisCo is suited for pose retargeting, unseen pose generation, cross-identity dance transfer, full unseen compositional video generation, and human-specific fine-tuning for personalized dance video synthesis.
