✨ D2F: Faster-than-AR Inference for Diffusion LLMs

This demo showcases Discrete Diffusion Forcing (D2F), a novel framework that, for the first time, enables diffusion language models (dLLMs) to surpass autoregressive models in inference speed. D2F introduces an AR-diffusion hybrid paradigm that combines the efficiency of KV caching with inter-block parallel decoding.

The model powering this demo is LLaDA-Instruct-8B, fine-tuned with our D2F method. Watch its unique block-wise generation in real time, then replay the process in slow motion to see how it works!
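The inter-block pipelining idea can be illustrated with a toy simulation: while one block is still being denoised, the next block enters the pipeline as soon as the newest block crosses an "add" threshold, instead of waiting for it to finish. The block size, threshold value, and one-token-per-step denoising rate below are assumptions chosen for illustration, not the actual D2F schedule.

```python
BLOCK_SIZE = 4        # tokens per block (illustrative)
ADD_THRESHOLD = 0.5   # open the next block once the newest block is 50% unmasked

def parallel_decode_steps(num_blocks):
    """Simulated steps with D2F-style pipelining: blocks are denoised
    in parallel, and a new block joins once the newest block has
    crossed ADD_THRESHOLD completion."""
    blocks = [BLOCK_SIZE]  # remaining masked tokens per active block
    steps = 0
    while any(m > 0 for m in blocks) or len(blocks) < num_blocks:
        steps += 1
        # denoise every active block in parallel (one token per step here)
        blocks = [max(0, m - 1) for m in blocks]
        progress = 1 - blocks[-1] / BLOCK_SIZE
        if len(blocks) < num_blocks and progress >= ADD_THRESHOLD:
            blocks.append(BLOCK_SIZE)  # pipeline in the next block early
    return steps

def sequential_decode_steps(num_blocks):
    """Baseline: each block must fully finish before the next starts."""
    return num_blocks * BLOCK_SIZE

print(parallel_decode_steps(3), sequential_decode_steps(3))  # 8 12
```

With three blocks, pipelining finishes in 8 simulated steps versus 12 for strictly sequential block decoding; the gap widens as more blocks overlap.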

⚙️ Generation Settings

- Max Tokens: 64–2048
- Block Size: 16–128
- Block Add Threshold: 0–1
- Completion Threshold: 0–1
- Skip Threshold: 0–1
- Playback Speed: 0–1

๐Ÿ“ Generated Text (Real-time)

๐Ÿ’ก Try these examples
🤔 Enter your question