Invited Talk in Workshop: DIG-BUGS: Data in Generative Models (The Bad, the Ugly, and the Greats)
Data-centric LM research on an academic budget
Tatsunori Hashimoto
Abstract:
It is widely agreed that pre‑training datasets have an enormous impact on LM behaviors. However, systematic understanding of these datasets remains limited because studying the interaction between datasets and LM training is both complex and costly. Can academics perform serious data‑centric LLM research without an equally serious budget? We discuss several works—including scaling studies and meta‑analyses—that show promise for enabling low‑cost, data‑centric experimentation.