Pixel Motion Diffusion is What We Need for Robot Control

E-Ro Nguyen*, Yichi Zhang*, Kanchana Ranasinghe, Xiang Li, Michael S Ryoo
Stony Brook University
*Equal contribution

DAWN Framework (overview figure)

Abstract

We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via a structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and only limited real-world data, we demonstrate reliable real-world transfer with minimal fine-tuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results establish the combination of diffusion modeling and motion-centric representations as a strong baseline for scalable and robust robot learning.
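
To make the two-stage design concrete, the sketch below (not the authors' released code) illustrates how a high-level diffusion model might denoise a pixel motion field from image and language features, and how a low-level diffusion model might then denoise a robot action chunk conditioned on that motion. All module names, dimensions, and sampler settings here are illustrative assumptions, not the paper's architecture.

# Minimal sketch of a two-stage pixel-motion-to-action diffusion pipeline.
# Hypothetical networks and dimensions; a generic DDPM sampler stands in for
# whatever sampler the actual system uses.
import torch
import torch.nn as nn

class MotionDenoiser(nn.Module):
    """High-level controller: denoises a pixel motion field given image + language features."""
    def __init__(self, cond_dim=512, motion_dim=2 * 32 * 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, 1024), nn.ReLU(),
            nn.Linear(1024, motion_dim),
        )

    def forward(self, noisy_motion, cond, t):
        # Predict the noise added to the flattened pixel motion field.
        return self.net(torch.cat([noisy_motion, cond, t], dim=-1))

class ActionDenoiser(nn.Module):
    """Low-level controller: denoises a robot action chunk given the pixel motion."""
    def __init__(self, motion_dim=2 * 32 * 32, action_dim=7, horizon=8):
        super().__init__()
        self.out_dim = action_dim * horizon
        self.net = nn.Sequential(
            nn.Linear(self.out_dim + motion_dim + 1, 512), nn.ReLU(),
            nn.Linear(512, self.out_dim),
        )

    def forward(self, noisy_actions, motion, t):
        return self.net(torch.cat([noisy_actions, motion, t], dim=-1))

@torch.no_grad()
def ddpm_sample(denoiser, cond, shape, steps=50):
    """Generic DDPM ancestral sampling with a linear beta schedule."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)
    for i in reversed(range(steps)):
        t = torch.full((shape[0], 1), i / steps)
        eps = denoiser(x, cond, t)
        mean = (x - betas[i] / torch.sqrt(1 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        x = mean + torch.sqrt(betas[i]) * torch.randn_like(x) if i > 0 else mean
    return x

# Inference: encoded observation + instruction -> pixel motion -> action chunk.
obs_lang_features = torch.randn(1, 512)   # placeholder for the real image/language encoder
motion_model, action_model = MotionDenoiser(), ActionDenoiser()
pixel_motion = ddpm_sample(motion_model, obs_lang_features, shape=(1, 2 * 32 * 32))
actions = ddpm_sample(action_model, pixel_motion, shape=(1, 7 * 8))

The toy MLPs above only stand in for the real motion and action networks; the point of the sketch is the interface, in which the sampled pixel motion is the interpretable intermediate that conditions the low-level action diffusion.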

Demonstrations

Bimanual Manipulation

Task Goal: clean the cutting board

Task Goal: fold clothes

Real World Demonstrations

See DAWN in action on real-world robotic manipulation tasks.

Task Goal: Lift a grape from the table

Video | Predicted Pixel Motion

Task Goal: Lift a kiwi from the table

Video | Predicted Pixel Motion

Task Goal: Lift an orange from the table

Video | Predicted Pixel Motion

CALVIN Demonstrations

See DAWN in action on the CALVIN benchmark.

Task Goal: lift blue block slider -> place in slider -> turn on lightbulb -> open drawer -> push pink block left

Video | Predicted Pixel Motion

Task Goal: rotate blue block left -> open drawer -> lift pink block table -> place in drawer -> turn on led

Video | Predicted Pixel Motion