Poster
Low-Dimension-to-High-Dimension Generalization and Its Implications for Length Generalization
Yang Chen · Long Yang · Yitao Liang · Zhouchen Lin
West Exhibition Hall B2-B3 #W-901
Low-Dimension-to-High-Dimension (LDHD) generalization, a subset of Out-of-Distribution (OOD) generalization, involves training on a low-dimensional subspace and testing in a high-dimensional space. Assuming instances are generated from latent variables reflecting problem scale, LDHD generalization captures the inherent scaling challenge of length generalization. We theoretically show that LDHD generalization is unattainable without appropriate inductive bias. Focusing on Boolean functions, we demonstrate that different architectures trained with (S)GD converge to min-degree interpolators w.r.t. different linearly independent sets, achieving LDHD generalization only when the target function aligns with the corresponding architecture-specific bias. From the perspective of LDHD generalization for length generalization, we explain the success of Chain-of-Thought (CoT) in restructuring the latent space for improved LDHD generalization. We further propose a principle for designing position embeddings that addresses LDHD generalization and data-format nuisances separately. Following this principle, we introduce RPE-Square, a novel embedding that enhances RPE to better handle data formats.
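The failure mode the abstract describes can be illustrated with a toy Boolean example. Below is a minimal sketch (not from the paper; the function names and the specific dimensions are illustrative): training inputs lie in a low-dimensional subspace of {0,1}^5 where the last two coordinates are fixed to 0, the target is the parity of all five bits, and a lower-degree interpolator (parity of the first three bits, w.r.t. the standard basis) fits every training point yet fails outside the subspace.

```python
from itertools import product

def target(x):
    # Parity of all five coordinates: the "high-dimensional" ground truth.
    return sum(x) % 2

def low_degree_interpolator(x):
    # Parity of the first three coordinates only. Relative to the standard
    # basis this has lower degree, so a min-degree bias prefers it whenever
    # it interpolates the training data.
    return sum(x[:3]) % 2

# Training set: the low-dimensional subspace with the last two bits zeroed.
train = [bits + (0, 0) for bits in product((0, 1), repeat=3)]

# The low-degree interpolator fits every training point...
assert all(low_degree_interpolator(x) == target(x) for x in train)

# ...but disagrees with the target on points outside the subspace.
test_point = (1, 0, 1, 1, 0)
print(low_degree_interpolator(test_point), target(test_point))  # prints: 0 1
```

Since the learner never sees the last two coordinates vary, no training signal distinguishes the two interpolators; only an inductive bias aligned with the target (here, full parity) would pick the right one.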
Why is it hard for machines to learn to solve large and complex problems by first practicing on smaller and simpler ones? In this paper, we show that there are two main reasons. First, problems often become fundamentally more complex as their size grows. Second, the way these problems are presented to the machine can make learning even harder. We prove that, in general, no single learning method can guarantee success in scaling up from small to large problems across all tasks. However, when we do have some prior knowledge about a particular problem, we can design machines that are better suited to learn this way. Our work helps researchers better understand the challenges machines face when generalizing from small to large problem instances. It also offers practical guidance on how to adapt machine designs when additional information about the task is available.