✨ D2F: Faster-than-AR Inference for Diffusion LLMs

This demo showcases Discrete Diffusion Forcing (D2F), a novel framework that, for the first time, enables diffusion language models (dLLMs) to surpass autoregressive models in inference speed. D2F introduces an AR-diffusion hybrid paradigm that combines the efficiency of KV caching with inter-block parallel decoding.

The model powering this demo is LLaDA-Instruct-8B, fine-tuned with our D2F method. Watch its unique block-wise generation in real time, then replay the process in slow motion to see how it works!
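The inter-block pipelining idea can be illustrated with a toy simulation: while one block is still being denoised, the next block enters the pipeline as soon as the newest block crosses an "add" threshold, instead of waiting for it to finish. The block size, threshold value, and one-token-per-step denoising rate below are assumptions chosen for illustration, not the actual D2F schedule.

```python
BLOCK_SIZE = 4        # tokens per block (illustrative)
ADD_THRESHOLD = 0.5   # open the next block once the newest block is 50% unmasked

def parallel_decode_steps(num_blocks):
    """Simulated steps with D2F-style pipelining: blocks are denoised
    in parallel, and a new block joins once the newest block has
    crossed ADD_THRESHOLD completion."""
    blocks = [BLOCK_SIZE]  # remaining masked tokens per active block
    steps = 0
    while any(m > 0 for m in blocks) or len(blocks) < num_blocks:
        steps += 1
        # denoise every active block in parallel (one token per step here)
        blocks = [max(0, m - 1) for m in blocks]
        progress = 1 - blocks[-1] / BLOCK_SIZE
        if len(blocks) < num_blocks and progress >= ADD_THRESHOLD:
            blocks.append(BLOCK_SIZE)  # pipeline in the next block early
    return steps

def sequential_decode_steps(num_blocks):
    """Baseline: each block must fully finish before the next starts."""
    return num_blocks * BLOCK_SIZE

print(parallel_decode_steps(3), sequential_decode_steps(3))  # 8 12
```

With three blocks, pipelining finishes in 8 simulated steps versus 12 for strictly sequential block decoding; the gap widens as more blocks overlap.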

⚙️ Generation Settings

- Max Tokens: 64–2048
- Block Size: 16–128
- Block Add Threshold: 0–1
- Completion Threshold: 0–1
- Skip Threshold: 0–1
- Playback Speed: 0–1

๐Ÿ“ Generated Text (Real-time)

๐Ÿ’ก Try these examples
🤔 Enter your question