Spotlight Poster
Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions
Jaeyeon Kim · Kulin Shah · Vasilis Kontonis · Sham Kakade · Sitan Chen
East Exhibition Hall A-B #E-3009
Outstanding Paper
Tue 15 Jul 3:30 p.m. PDT — 4:30 p.m. PDT
Standard language models generate text strictly left to right, while newer “masked diffusion” models can fill in blanks in any order, yet so far they have lagged behind. We pinpoint the bottleneck: during training they must solve exponentially many fill-in-the-mask subproblems, many of which are computationally intractable, so learning is fundamentally hard. The flaw, we find, is not in the model itself but in how we let it answer. At test time we can choose which blank to reveal first, so we use a simple rule: pick the position where the model is most confident. This one-line tweak catapults Sudoku accuracy from 7% to nearly 90% and brings similar gains on Zebra puzzles and text-generation quality. Bottom line: training these models is tough, but smart decoding turns them into powerful, order-agnostic reasoners, with no extra training required.
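To make the decoding rule concrete, here is a minimal sketch of confidence-based adaptive unmasking: at each step, score every masked position by the probability of its most likely token, reveal the most confident one, and repeat. The `toy_model`, the `MASK` sentinel, and all names below are illustrative assumptions, not the authors' implementation; a real system would call a trained masked-diffusion denoiser in place of the stand-in.

```python
import numpy as np

MASK = -1  # sentinel id for a masked position (assumption, not the paper's code)

def toy_model(tokens, vocab_size, rng):
    """Stand-in for a masked-diffusion denoiser: returns per-position
    probabilities over the vocabulary. A trained network would go here."""
    logits = rng.normal(size=(len(tokens), vocab_size))
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return probs / probs.sum(axis=-1, keepdims=True)

def decode_top_confidence(tokens, vocab_size, rng):
    """Greedy adaptive decoding: at each step, unmask the position where the
    model's most likely token has the highest probability."""
    tokens = list(tokens)
    while MASK in tokens:
        probs = toy_model(tokens, vocab_size, rng)            # (seq_len, vocab)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Confidence at a masked position = probability of its argmax token.
        conf = {i: probs[i].max() for i in masked}
        pos = max(conf, key=conf.get)                         # most confident blank
        tokens[pos] = int(probs[pos].argmax())                # reveal it greedily
    return tokens

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq = [3, MASK, 7, MASK, MASK, 1]
    print(decode_top_confidence(seq, vocab_size=10, rng=rng))
```

Because the rule only changes the order in which blanks are revealed at inference time, it requires no retraining; the same pretrained model is queried once per step, and easier positions (e.g., forced cells in a Sudoku grid) get resolved first, constraining the harder ones.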