✨ D2F: Faster-than-AR Inference for Diffusion LLMs
This demo showcases Discrete Diffusion Forcing (D2F), a novel framework that enables Diffusion Language Models (dLLMs) to achieve faster-than-autoregressive inference speeds for the first time. D2F creates an AR-Diffusion hybrid paradigm that combines the efficiency of KV Caching with inter-block parallel decoding.
The model powering this demo is LLaDA-Instruct-8B, fine-tuned with our D2F method. Watch its unique block-wise generation in real-time, then replay the process in slow motion to see how it works!
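To make the block-wise pipeline concrete, here is a minimal toy simulation of inter-block parallel decoding. It is a sketch of the general idea only, not the D2F implementation: the function name `d2f_style_decode`, the step loop, and the use of a single `block_add_threshold` (new blocks enter the pipeline once the newest block is partially unmasked, so several blocks denoise in parallel) are all illustrative assumptions, and random numbers stand in for model confidences.

```python
import random

def d2f_style_decode(max_tokens=64, block_size=16, block_add_threshold=0.5, rng=None):
    """Toy simulation of pipelined block-wise decoding (illustrative only).

    Blocks of `block_size` masked tokens are appended to the sequence; a new
    block may start decoding before the previous one finishes, which is what
    makes the schedule faster than strictly sequential decoding.
    """
    rng = rng or random.Random(0)
    blocks = []   # each block: list of "confidences"; None = still masked
    decoded = 0
    steps = 0
    while decoded < max_tokens:
        steps += 1
        # Admit a new block once the newest block is sufficiently complete
        # (this gating role for block_add_threshold is an assumption).
        if (not blocks or frac_done(blocks[-1]) >= block_add_threshold) \
                and len(blocks) * block_size < max_tokens:
            blocks.append([None] * block_size)
        # One denoising step: every active block unmasks one more token,
        # in parallel across blocks.
        for blk in blocks:
            for i, conf in enumerate(blk):
                if conf is None:
                    blk[i] = rng.random()  # stand-in for model confidence
                    decoded += 1
                    break
    return steps, decoded

def frac_done(blk):
    """Fraction of tokens in a block that are already unmasked."""
    return sum(c is not None for c in blk) / len(blk)
```

With the defaults, 64 tokens finish in fewer steps than the 64 a one-token-at-a-time schedule would need, because later blocks overlap with earlier ones.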
| Parameter | Range |
|---|---|
| Max Tokens | 64–2048 |
| Block Size | 16–128 |
| Block Add Threshold | 0–1 |
| Completion Threshold | 0–1 |
| Skip Threshold | 0–1 |
| Playback Speed | 0–1 |
Generated Text (Real-time)
💡 Try these examples
🤔 Enter your question