Diffusion Agents
TL;DR: Fine-tuning image and video generation models to 'draw actions' as visual patterns could lead us to powerful agents for robotics and GUI apps.
Just draw actions?
Large Language Models (LLMs) are currently experiencing their “agent” moment, evolving beyond next-token prediction to take actions in interactive environments. However, another class of models is poised to become powerful agents: image and video generation diffusion models. We don't typically think of diffusion models like Stable Diffusion as agents, but what if we fine-tune them to "draw actions"? If actions can be represented as visual patterns, then internet-pretrained diffusion models might be applicable to a wide range of decision-making tasks.
We explore one such task in our recent paper GENIMA. GENIMA fine-tunes Stable Diffusion with ControlNet to draw joint-actions for robot arms. Joints such as elbows and wrists are represented with uniquely colored spheres that provide targets for the robot's next movements. A low-level controller then guides the robot to reach these joint positions. Check out the paper and website for details about the method, experiments, and code to reproduce the results.
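To make the representation concrete, here is a minimal sketch (not GENIMA's actual rendering code) of drawing target joint positions as uniquely colored spheres on top of an RGB observation. The joint names, colors, radii, and the assumption that joints are already projected to pixel coordinates are all illustrative choices.

```python
# Minimal sketch: overlay target joint positions as uniquely colored discs
# on an RGB observation. Assumes 3D joints are already projected to 2D pixel
# coordinates; joint set, colors, and radius are illustrative, not GENIMA's
# exact rendering pipeline.
from PIL import Image, ImageDraw

# One fixed color per joint so the pattern stays consistent across the dataset.
JOINT_COLORS = {
    "shoulder": (255, 0, 0),
    "elbow":    (0, 255, 0),
    "wrist":    (0, 0, 255),
    "gripper":  (255, 255, 0),
}

def draw_joint_targets(rgb: Image.Image, joints_px: dict, radius: int = 6) -> Image.Image:
    """Draw one colored sphere (disc) per joint onto a copy of the observation."""
    canvas = rgb.copy()
    draw = ImageDraw.Draw(canvas)
    for name, (u, v) in joints_px.items():
        color = JOINT_COLORS[name]
        draw.ellipse([u - radius, v - radius, u + radius, v + radius], fill=color)
    return canvas

# Usage: obs is a 256x256 RGB observation, targets come from a demonstration.
obs = Image.new("RGB", (256, 256))
targets = {"shoulder": (60, 90), "elbow": (110, 120), "wrist": (150, 140), "gripper": (170, 150)}
draw_joint_targets(obs, targets).save("joint_action_target.png")
```

The only thing that really matters is that the color-to-joint mapping stays consistent across the dataset, so the fine-tuned diffusion model can learn to reproduce the pattern.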
Disclaimer: The rest of this blogpost presents my broader intuitions about the core idea. These thoughts are speculative and represent my personal opinions.
LLMs are not the only hammer in our toolbox
If LLMs are your hammer, every problem starts looking like a discrete sequence-modeling task. Nobody expected Transformer-based LLMs to become ubiquitous, but it turns out that as long as you express inputs and outputs as tokens, LLMs are quite effective at image classification, protein folding, and even planning robot actions. Some folks like Andrej Karpathy have suggested that maybe the “language part” is a distraction, because LLMs are really about modeling arbitrary token streams.
I think something similar is true of image and video generation diffusion models. The “image” and “video” parts are maybe a distraction from the general-purpose pattern-modeling capabilities of diffusion models. As long as inputs and outputs are pixels, diffusion models can generate arbitrary patterns.
Sander Dieleman made a similar observation about Riffusion, which fine-tuned Stable Diffusion to generate audio spectrograms. While spectrograms are not natural images, the fact that you can just “draw audio” demonstrates that diffusion models can generate arbitrary patterns in pixel-space. This capability is particularly useful in continuous domains like vision, audio, and even actions.
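As a rough illustration of the “draw audio” idea (not Riffusion's actual preprocessing), here is a sketch that turns a waveform into a mel-spectrogram image a vision model can treat as just another picture. The input file, sample rate, mel-band count, and normalization are arbitrary choices.

```python
# Minimal sketch: render audio as a mel-spectrogram image so a vision
# diffusion model can treat it as an ordinary picture. Scaling and
# normalization here are illustrative, not Riffusion's exact pipeline.
import numpy as np
import librosa
from PIL import Image

y, sr = librosa.load("clip.wav", sr=22050)                  # mono waveform (hypothetical file)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=256)
mel_db = librosa.power_to_db(mel, ref=np.max)               # log-scale the power spectrogram

# Normalize to 0..255 and save as a grayscale image.
img = (255 * (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())).astype(np.uint8)
Image.fromarray(img).save("spectrogram.png")
```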
You can train domain-specific models for audio like Stable Audio, or actions like Diffusion Policies, but you don’t have to be restricted to a single domain! Images can serve as a common format across domains. By rendering audio as spectrograms (Riffusion), or actions as joint-spheres (GENIMA), we can leverage vision-based diffusion models like Stable Diffusion to generate patterns beyond the visual domain. These patterns don't need to be grounded in physical reality—they simply need to be consistent across the dataset. This cross-domain mixing opens up possibilities for exploring pre-training techniques that could bootstrap learning from data-rich domains like vision to data-poor domains like robotics.
Actions as visual patterns
Agents can learn to imitate actions from datasets of expert demonstrations. Instead of training them from scratch, you can format the demonstrations as image-to-image or video-to-video tasks and exploit the general pattern-generation capabilities of pre-trained diffusion models. For example, given RGB observations as input, you could:
(1) Draw trajectories for Autonomous Ground Vehicles (AGVs), with physical parameters like velocity encoded as color gradients
(2) Draw target-joint positions for quadrupeds or humanoids, and reach them with sim2real-RL controllers
(3) Draw hand trajectories like RT-Sketch or heatmaps like VoxPoser for robot arms
(4) Draw virtual actions like clicks and scrolls for Graphical User Interface (GUI)-heavy apps such as Blender
You’d still need a controller that translates these drawn actions into steering commands or clicks, but that can be done by generating a synthetic dataset of random actions and learning a one-to-one mapping from image-actions to executable-actions.
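Here is a toy sketch of that last step: synthesize images of randomly drawn actions (a single “click” marker in this case) and train a small CNN to regress the underlying executable action. The marker format, the `random_action_batch` helper, and the network are hypothetical stand-ins, not GENIMA's actual low-level controller.

```python
# Toy sketch of the controller idea: synthesize images of randomly drawn
# actions and learn to decode them back into executable actions. Marker
# format and architecture are illustrative assumptions.
import torch
import torch.nn as nn

def random_action_batch(batch_size: int = 64, size: int = 64):
    """Render a red marker at a random (x, y) and return (image, action) pairs."""
    imgs = torch.zeros(batch_size, 3, size, size)
    actions = torch.randint(2, size - 2, (batch_size, 2)).float()
    for i, (x, y) in enumerate(actions.long().tolist()):
        imgs[i, 0, y - 2:y + 2, x - 2:x + 2] = 1.0    # red square as the "click" marker
    return imgs, actions / size                        # normalize targets to [0, 1]

decoder = nn.Sequential(                               # tiny CNN that decodes image-actions
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 2), nn.Sigmoid(),          # predict normalized (x, y)
)
opt = torch.optim.Adam(decoder.parameters(), lr=1e-3)

for step in range(1000):
    imgs, actions = random_action_batch()
    loss = nn.functional.mse_loss(decoder(imgs), actions)
    opt.zero_grad(); loss.backward(); opt.step()
```

Because the rendering is deterministic and one-to-one, this mapping is far easier to learn than the original control problem, and the synthetic dataset is essentially free to generate.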
Why draw actions instead of goal-images?
If drawing actions with diffusion models is simple and widely applicable, then why haven’t we seen more of it? A more popular approach is to use image and video generation models to create goal-images showing desired outcomes. Several works like SuSIE, UniPi, Video-Language-Planning, and Dreamitate already do this. However, I don't think agents need to model exactly how the world will look after each action. Consider scrolling down to find the last sentence of this blog post: you might expect more lines of black text, but you probably can't predict the exact webpage as you scroll down. Similarly, if you drop an object, you know it will fall, but you can't predict its exact landing spot. We navigate a chaotic world full of small, unpredictable details, yet we can still take effective actions. Stable Diffusion already has a visual world model that generates plausible images, but it's not precise enough to predict the specific consequences of an action. Predicting actions directly is an easier task (assuming you can somehow supervise actions).
A marketplace for Diffusion Agents
While working on GENIMA, I was constantly amazed by the speed at which the diffusion community iterates. Each new model release (Flux, SD3, SD-Turbo) quickly spawns LoRAs for specific styles, real-time demos, distilled models, faster inference techniques, and improved schedulers. We might see a similar marketplace for Diffusion Agents. Imagine an agent pre-trained on a large dataset of GUI interactions, which is then specialized with LoRAs for tasks like 3D asset editing in Blender. I am making a huge leap from robot arms to GUI agents, but the fundamental principle remains the same: leveraging pre-trained diffusion models to generate pixel actions.
What next?
Whenever I am working on a project, I keep a list of unexplored ideas. Obviously, I never have the time or resources to explore everything. So here is my list of follow-ups (heavily biased towards robotics):
Draw joint-targets for humanoids and reach them with RL-based controllers like HumanoidPlus, H2O, and MaskedMimic.
LoRA adapters for each robot embodiment.
Use video-generation diffusion models instead of image-generation models to draw targets, for better temporal consistency.
World models: input joint-spheres, output RGB observations. Basically flip GENIMA’s training objective. Use this for model-based RL/planning or as an auxiliary loss.
DPO or RLHF preference optimization for different trajectories.
Merge different GENIMAs to combine skills? Just the diffusion part is merged.
Pre-train GENIMA on 3D view-synthesis (like Zero-123) for better 3D understanding.
Offer a GENIMA API as a service for robots, in the style of Gen-3 Alpha or DALL.E-3.
Fine-tune Stable Diffusion for GUI-heavy interactions in web and local apps.
Train a single diffusion model for images, audio spectrograms, and rendered actions. Is this better than training on individual domains?
Most of these ideas are probably dumb and won’t work, but they might eventually lead you to things that do work. If so, let me know :)
Acknowledgements: Special thanks to Lili Chen, Jiafei Duan, Richie Lo, and Jafar Uruç for providing feedback on early drafts of this blogpost.