Poster
WikiBigEdit: Understanding the Limits of Lifelong Knowledge Editing in LLMs
Lukas Thede · Karsten Roth · Matthias Bethge · Zeynep Akata · Thomas Hartvigsen
East Exhibition Hall A-B #E-2405
Keeping large language models factually up-to-date is crucial for deployment, yet the cost of retraining makes this challenging. Knowledge editing offers a promising alternative, but existing methods have only been tested on small-scale or synthetic edit benchmarks. In this work, we aim to bridge research on lifelong knowledge editing with real-world edits at practically relevant scale. We first introduce \texttt{WikiBigEdit}, a large-scale benchmark of real-world Wikidata edits, built to extend automatically over time for future-proof benchmarking. In its first instance, it includes over 500K question-answer pairs for knowledge editing alongside a comprehensive evaluation pipeline. Finally, we use \texttt{WikiBigEdit} to study how well existing knowledge editing techniques incorporate large volumes of real-world facts, and we contrast them with generic modification techniques such as retrieval augmentation and continual finetuning to obtain a complete picture of the practical extent of current lifelong knowledge editing.
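To give a flavor of how a benchmark of real-world knowledge edits can be derived from Wikidata, the sketch below diffs two toy knowledge-graph snapshots and verbalizes the changed triples into question-answer pairs. The function names, the template-based verbalization, and the toy data are illustrative assumptions, not the actual WikiBigEdit construction pipeline.

```python
# Illustrative sketch (not the authors' pipeline): extract changed facts by diffing
# two snapshots of a knowledge graph, then turn each changed triple into a QA pair.
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def diff_snapshots(
    old: Dict[Tuple[str, str], str],
    new: Dict[Tuple[str, str], str],
) -> List[Triple]:
    """Return triples whose object changed, or that were newly added, between snapshots."""
    changed = []
    for (subj, rel), obj in new.items():
        if old.get((subj, rel)) != obj:
            changed.append((subj, rel, obj))
    return changed

def to_qa_pair(triple: Triple) -> Dict[str, str]:
    """Verbalize a changed triple into a question-answer pair via a simple template."""
    subj, rel, obj = triple
    return {"question": f"What is the {rel} of {subj}?", "answer": obj}

# Toy usage with clearly hypothetical values:
old_snapshot = {("ACME Corp", "chief executive officer"): "Alice Example"}
new_snapshot = {("ACME Corp", "chief executive officer"): "Bob Example"}
qa_pairs = [to_qa_pair(t) for t in diff_snapshots(old_snapshot, new_snapshot)]
```

Because the diff is computed between snapshots rather than curated by hand, the same procedure can be rerun on future Wikidata dumps, which is what allows the benchmark to keep extending over time.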
Large language models like ChatGPT are powerful, but their knowledge can quickly become outdated because they’re trained on static snapshots of the internet. Updating them regularly is important, especially in fields like medicine, law, or education, but retraining these massive models is expensive and slow.

Researchers have been exploring faster ways to update facts in a model without retraining from scratch. One idea is “knowledge editing,” where specific facts are directly inserted or changed inside the model. However, until now, these methods have only been tested on small or artificial datasets that don’t accurately reflect how knowledge changes in the real world.

In this work, we introduce WikiBigEdit, a new large-scale benchmark that tracks real changes in the Wikidata knowledge graph. It includes over half a million fact-based questions and can automatically grow over time to reflect ongoing updates. Using this benchmark, we evaluate how well existing knowledge editing techniques can handle large volumes of real-world updates. We also compare these methods with other ways of keeping models current, like attaching external memory systems or gradually fine-tuning them.

The findings show that many editing methods struggle to scale up effectively, and in some cases, simpler alternatives actually work better. This benchmark and analysis provide a clearer picture of what is needed to build language models that can reliably and efficiently stay up to date over time.
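As a rough illustration of what "lifelong" evaluation at scale entails, the sketch below applies edits to a model sequentially and periodically measures how many previously edited facts are retained, whether they generalize to paraphrased questions, and whether unrelated facts stay intact. All names (FactEdit, edit_fn, answer_fn) and the exact metrics are assumptions for illustration, not the paper's evaluation pipeline.

```python
# Minimal sketch of a sequential (lifelong) knowledge-editing evaluation loop.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class FactEdit:
    question: str         # question probing the updated fact
    new_answer: str       # the updated fact to be inserted
    rephrased: str        # paraphrased question to test generalization
    locality_probe: str   # unrelated question whose answer must not change
    locality_answer: str  # expected answer to the locality probe

def evaluate_lifelong_editing(
    model,
    edit_fn: Callable,    # a knowledge editor, retrieval store update, or finetuning step
    answer_fn: Callable,  # queries the (possibly edited) model
    edits: List[FactEdit],
    eval_every: int = 1000,
) -> List[Dict[str, float]]:
    """Apply edits one by one and track accuracy over all edits seen so far."""
    history = []
    for i, edit in enumerate(edits, start=1):
        edit_fn(model, edit.question, edit.new_answer)  # insert the new fact
        if i % eval_every == 0:
            seen = edits[:i]
            history.append({
                "edits_applied": float(i),
                # retention: does the model still answer all edited questions correctly?
                "edit_acc": sum(answer_fn(model, e.question) == e.new_answer for e in seen) / i,
                # generalization: does it also answer paraphrases correctly?
                "gen_acc": sum(answer_fn(model, e.rephrased) == e.new_answer for e in seen) / i,
                # locality: are unrelated facts left unchanged?
                "loc_acc": sum(answer_fn(model, e.locality_probe) == e.locality_answer for e in seen) / i,
            })
    return history
```

Evaluating over the full sequence of edits, rather than one edit at a time, is what exposes the scaling behavior the study is concerned with: a method that handles a single edit well may still degrade after hundreds of thousands of them, and the same loop can wrap a knowledge editor, a retrieval-augmented memory, or a continual-finetuning baseline for a like-for-like comparison.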