

Poster Session

Poster Session 1 West

West Exhibition Hall B2-B3
Tue 15 Jul 11 a.m. PDT — 1:30 p.m. PDT


Poster
#W-100
Autoformulation of Mathematical Optimization Models Using LLMs

Nicolás Astorga · Tennison Liu · Yuanzhang Xiao · Mihaela van der Schaar

Mathematical optimization is fundamental to decision-making across diverse domains, from operations research to healthcare. Yet, translating real-world problems into optimization models remains a difficult task, often demanding specialized expertise. This paper approaches the problem of $\textit{autoformulation}$: the automated creation of solver-ready optimization models from natural language problem descriptions. We identify three core challenges of autoformulation: $\textit{(1)}$ the vast, problem-dependent hypothesis space, $\textit{(2)}$ efficient and diverse exploration of this space under uncertainty, and $\textit{(3)}$ evaluation of formulation correctness against the problem description. To address these challenges, we present a novel method leveraging $\textit{Large Language Models}$ (LLMs) with $\textit{Monte-Carlo Tree Search}$, exploiting the hierarchical nature of optimization modeling to generate and systematically explore possible formulations. To enhance search efficiency, we introduce symbolic pruning to eliminate trivially equivalent search paths (branches), and employ LLM-based evaluation of partial formulations to guide the search. Empirical analysis on linear and mixed-integer programming benchmarks demonstrates our method's effectiveness, with significant performance gains from both LLM-based value estimation and symbolic pruning techniques.


Poster
#W-1000
Ladder-Residual: Parallelism-Aware Architecture for Accelerating Large Model Inference with Communication Overlapping

Muru Zhang · Mayank Mishra · Zhongzhu Zhou · William Brandon · Jue Wang · Yoon Kim · Jonathan Ragan-Kelley · Shuaiwen Song · Ben Athiwaratkun · Tri Dao

Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to scale efficiently. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping, effectively hiding the latency of communication. Our insight is that, in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve a 29% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting Transformer model as the Ladder Transformer. We train 1B and 3B Ladder Transformers from scratch and observe performance comparable to a standard dense Transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by retraining on only 3B tokens.
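A rough schematic of the idea, not the paper's architecture or code: the block below consumes the already-communicated output of the previous block, so its own (stand-in) all-reduce could be launched asynchronously and overlapped with the next block's computation. All module and function names are illustrative placeholders.

```python
# Illustrative sketch only: rerouting the residual stream so that block i's
# communication can overlap with block i+1's computation.
import torch
import torch.nn as nn

def all_reduce_stub(x):
    # placeholder for an asynchronous tensor-parallel all-reduce
    return x

class LadderStyleBlock(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.attn = nn.Linear(d, d)  # stand-in for sharded attention
        self.mlp = nn.Linear(d, d)   # stand-in for sharded MLP

    def forward(self, x, prev_comm):
        h = x + prev_comm                    # use the previous block's communicated output
        local = self.attn(h) + self.mlp(h)   # compute on local shards
        return h, all_reduce_stub(local)     # this block's communication joins one block later

x = torch.randn(2, 16)
comm = torch.zeros_like(x)
for blk in [LadderStyleBlock(16) for _ in range(4)]:
    x, comm = blk(x, comm)
out = x + comm                               # fold in the final block's communication
```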


Poster
#W-1001
Delta Decompression for MoE-based LLMs Compression

Hao Gu · Wei Li · Lujun Li · Qiyuan Zhu · Mark Lee · Shengjie Sun · Wei Xue · Yike Guo

Mixture-of-Experts (MoE) architectures in large language models (LLMs) achieve exceptional performance, but face prohibitive storage and memory requirements. To address these challenges, we present $D^2$-MoE, a new delta decompression compressor for reducing the parameters of MoE LLMs. Based on observations of expert diversity, we decompose the experts' weights into a shared base weight and unique delta weights. Specifically, our method first merges each expert's weight into the base weight using the Fisher information matrix to capture shared components. Then, we compress the delta weights through Singular Value Decomposition (SVD) by exploiting their low-rank properties. Finally, we introduce a semi-dynamical structured pruning strategy for the base weights, combining static and dynamic redundancy analysis to achieve further parameter reduction while maintaining input adaptivity. In this way, our $D^2$-MoE successfully compacts MoE LLMs to high compression ratios without additional training. Extensive experiments highlight the superiority of our approach, with over 13\% performance gains over other compressors on Mixtral, Phi-3.5, DeepSeek, and Qwen2 MoE LLMs at 40$\sim$60\% compression rates. Codes are available at https://github.com/lliai/D2MoE.
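A minimal sketch of the base-plus-low-rank-delta idea on toy weights; plain weighted averaging stands in for the Fisher-weighted merge described in the abstract, and all shapes and names are hypothetical.

```python
# Toy illustration: share one base weight across experts and keep only a
# truncated SVD of each expert's delta from that base.
import torch

experts = [torch.randn(256, 256) for _ in range(8)]   # toy expert weight matrices
weights = torch.softmax(torch.randn(8), dim=0)        # stand-in for Fisher-derived weights
base = sum(w * W for w, W in zip(weights, experts))   # shared base weight

rank = 16
deltas = []
for W in experts:
    U, S, Vh = torch.linalg.svd(W - base, full_matrices=False)
    deltas.append((U[:, :rank] * S[:rank], Vh[:rank]))  # low-rank factors of the delta

def expert_forward(x, idx):
    A, B = deltas[idx]
    return x @ (base + A @ B).T                         # reconstruct base + delta on the fly
```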


Poster
#W-1002
Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression

Peijie Dong · Zhenheng Tang · Xiang Liu · Lujun Li · Xiaowen Chu · Bo Li

Post-training compression reduces the computational and memory costs of large language models (LLMs), enabling resource-efficient deployment. However, existing compression benchmarks focus narrowly on language modeling (e.g., perplexity) and natural language understanding tasks (e.g., GLUE accuracy), ignoring agentic capabilities: workflow generation, tool use/function calling, long-context understanding, and real-world application. We introduce the Agent Compression Benchmark (ACBench), the first comprehensive benchmark for evaluating how compression impacts LLMs' agentic abilities. ACBench spans (1) 12 tasks across 4 capabilities (e.g., WorfBench for workflow generation, Needle-in-Haystack for long-context retrieval), (2) 4-bit quantization (GPTQ, AWQ) and 50% pruning (Wanda, SparseGPT), and (3) 15 models, including small (Gemma-2B), standard (Qwen2.5-7B), and distilled reasoning LLMs (DeepSeek-R1-Distill). Our experiments reveal compression tradeoffs: 4-bit quantization preserves workflow generation and tool use (1%--3% drop) but degrades real-world application accuracy by 10%--15%. We introduce ERank, Top-k Ranking Correlation, and Energy to systematize the analysis. ACBench provides actionable insights for optimizing LLM compression in agentic scenarios, bridging the gap between algorithmic efficiency and real-world applicability.


Poster
#W-1003
MoE-SVD: Structured Mixture-of-Experts LLMs Compression via Singular Value Decomposition

Wei Li · Lujun Li · Hao Gu · Youliang Huang · Mark Lee · Shengjie Sun · Wei Xue · Yike Guo

The Mixture-of-Experts (MoE) architecture improves Large Language Models (LLMs) with better scaling, but its higher parameter counts and memory demands create challenges for deployment. In this paper, we present MoE-SVD, a new decomposition-based compression framework tailored for MoE LLMs without any extra training. By harnessing the power of Singular Value Decomposition (SVD), MoE-SVD addresses the critical issues of decomposition collapse and matrix redundancy in MoE architectures. Specifically, we first decompose experts into compact low-rank matrices, resulting in accelerated inference and memory optimization. In particular, we propose a selective decomposition strategy by measuring sensitivity metrics based on weight singular values and activation statistics to automatically identify decomposable expert layers. Then, we share a single V-matrix across all experts and employ a top-k selection for U-matrices. This low-rank matrix sharing and trimming scheme allows for significant parameter reduction while preserving diversity among experts. Comprehensive experiments on Mixtral, Phi-3.5, DeepSeek, and Qwen2 MoE LLMs show MoE-SVD outperforms other compression methods, achieving a 60\% compression ratio and 1.5× faster inference with minimal performance loss. Codes are available at: https://github.com/lliai/MoE-SVD.


Spotlight Poster
#W-1004
When Every Millisecond Counts: Real-Time Anomaly Detection via the Multimodal Asynchronous Hybrid Network

Dong Xiao · Guangyao Chen · Peixi Peng · Yangru Huang · Yifan Zhao · Yongxing Dai · Yonghong Tian

Anomaly detection is essential for the safety and reliability of autonomous driving systems. Current methods often focus on detection accuracy but neglect response time, which is critical in time-sensitive driving scenarios. In this paper, we introduce real-time anomaly detection for autonomous driving, prioritizing both minimal response time and high accuracy. We propose a novel multimodal asynchronous hybrid network that combines event streams from event cameras with image data from RGB cameras. Our network utilizes the high temporal resolution of event cameras through an asynchronous Graph Neural Network and integrates it with spatial features extracted by a CNN from RGB images. This combination effectively captures both the temporal dynamics and spatial details of the driving environment, enabling swift and precise anomaly detection. Extensive experiments on benchmark datasets show that our approach outperforms existing methods in both accuracy and response time, achieving millisecond-level real-time performance.


Poster
#W-1005
Compressed and distributed least-squares regression: convergence rates with applications to federated learning

Constantin Philippenko · Aymeric Dieuleveut

In this paper, we investigate the impact of compression on stochastic gradient algorithms for machine learning, a technique widely used in distributed and federated learning. We underline differences in terms of convergence rates between several unbiased compression operators that all satisfy the same condition on their variance, thus going beyond the classical worst-case analysis. To do so, we focus on the case of least-squares regression (LSR) and analyze a general stochastic approximation algorithm for minimizing quadratic functions relying on a random field. We consider weak assumptions on the random field, tailored to the analysis (specifically, expected Hölder regularity), and on the noise covariance, enabling the analysis of various randomizing mechanisms, including compression. We then extend our results to the case of federated learning. More formally, we highlight the impact on the convergence of the covariance $\mathfrak{C}_{\mathrm{ania}}$ of the additive noise induced by the algorithm. We demonstrate that, despite the non-regularity of the stochastic field, the limit variance term scales with $\mathrm{Tr}(\mathfrak{C}_{\mathrm{ania}} H^{-1})/K$ (where $H$ is the Hessian of the optimization problem and $K$ is the number of iterations), generalizing the rate for the vanilla LSR case, where it is $\sigma^2 \mathrm{Tr}(H H^{-1}) / K = \sigma^2 d / K$ (Bach and Moulines, 2013). Then, we analyze the dependency of $\mathfrak{C}_{\mathrm{ania}}$ on the compression strategy and ultimately its impact on convergence, first in the centralized case, then in two heterogeneous FL frameworks.


Poster
#W-1006
AKORN: Adaptive Knots generated Online for RegressioN splines

Sunil Madhow · Dheeraj Baby · Yu-Xiang Wang

In order to attain optimal rates, state-of-the-art algorithms for non-parametric regression require that a hyperparameter be tuned according to the smoothness of the ground truth (Tibshirani, 2014). This amounts to an assumption of oracle access to certain features of the data-generating process. We present a parameter-free algorithm for offline non-parametric regression over $TV_1$-bounded functions. By feeding offline data into an optimal online denoising algorithm styled after (Baby et al., 2021), we are able to use change-points to adaptively select knots that respect the geometry of the underlying ground truth. We call this procedure AKORN (Adaptive Knots generated Online for RegressioN splines). By combining forward and backward passes over the data, we obtain an estimator whose empirical performance is close to Trend Filtering (Kim et al., 2009; Tibshirani, 2014), even when we provide the latter with oracle knowledge of the ground truth’s smoothness.


Poster
#W-1007
Pareto-Optimality, Smoothness, and Stochasticity in Learning-Augmented One-Max-Search

Ziyad Benomar · Lorenzo Croissant · Vianney Perchet · Spyros Angelopoulos

One-max search is a classic problem in online decision-making, in which a trader acts on a sequence of revealed prices and accepts one of them irrevocably to maximise its profit. The problem has been studied both in probabilistic and in worst-case settings, notably through competitive analysis, and more recently in learning-augmented settings in which the trader has access to a prediction on the sequence. However, existing approaches either lack smoothness, or do not achieve optimal worst-case guarantees: they do not attain the best possible trade-off between the consistency and the robustness of the algorithm. We close this gap by presenting the first algorithm that simultaneously achieves both of these important objectives. Furthermore, we show how to leverage the obtained smoothness to provide an analysis of one-max search in stochastic learning-augmented settings which capture randomness in both the observed prices and the prediction.


Poster
#W-1008
GaussMark: A Practical Approach for Structural Watermarking of Language Models

Adam Block · Alexander Rakhlin · Ayush Sekhari

Watermarking, the process by which Large Language Model (LLM) servers embed an imperceptible signal at inference time in order to detect text generated by their own models, has grown in importance due to the significant improvements in natural language processing tasks by modern LLMs. Current approaches are often impractical due to generation latency, detection time, degradation in text quality, or lack of robustness; such problems often arise from the focus on token-level watermarking, which ignores the inherent structure of text. In this work, we introduce a new scheme, GaussMark, that is simple and efficient to implement, has formal statistical guarantees, comes at no cost in generation latency, and embeds the watermark into the weights of the model itself, providing a structural watermark. Our approach is based on Gaussian independence testing and is motivated by recent empirical observations that minor additive corruptions to LLM weights can result in models of identical (or even improved) quality. We provide formal statistical bounds on the validity and power of our procedure and, through an extensive suite of experiments, demonstrate that GaussMark is reliable, efficient, relatively robust to corruption, and can be instantiated with essentially no loss in model quality.
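A toy illustration of the two ingredients the abstract names, a key-seeded Gaussian perturbation of a weight matrix and a simple correlation (z-) test against a text-derived score; this is not the paper's exact statistic, and the `score` below is a hypothetical stand-in for a gradient-like quantity computed from observed text.

```python
# Toy watermark insertion and detection via a Gaussian correlation test.
import numpy as np

rng_key = np.random.default_rng(1234)              # secret watermark key
W = np.random.randn(64, 64)                        # some model weight matrix
eps = 1e-3 * rng_key.standard_normal(W.shape)      # keyed Gaussian perturbation
W_marked = W + eps                                 # structurally watermarked weight

def detect(score, key_noise):
    """z-statistic of <score, key_noise> under the null that they are independent."""
    s, k = score.ravel(), key_noise.ravel()
    sigma_hat = np.linalg.norm(k) / np.sqrt(k.size)
    return (s @ k) / (sigma_hat * np.linalg.norm(s))

# hypothetical text-derived score; a large |z| suggests the watermarked model produced the text
score = eps + 5e-4 * np.random.randn(*eps.shape)
print(detect(score, eps))
```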


Poster
#W-1009
Identifying Metric Structures of Deep Latent Variable Models

Stas Syrota · Yevgen Zainchkovskyy · Johnny Xi · Benjamin Bloem-Reddy · Søren Hauberg

Deep latent variable models learn condensed representations of data that, hopefully, reflect the inner workings of the studied phenomena. Unfortunately, these latent representations are not statistically identifiable, meaning they cannot be uniquely determined. Domain experts, therefore, need to tread carefully when interpreting these. Current solutions limit the lack of identifiability through additional constraints on the latent variable model, e.g. by requiring labeled training data, or by restricting the expressivity of the model. We change the goal: instead of identifying the latent variables, we identify relationships between them such as meaningful distances, angles, and volumes. We prove this is feasible under very mild model conditions and without additional labeled data. We empirically demonstrate that our theory results in more reliable latent distances, offering a principled path forward in extracting trustworthy conclusions from deep latent variable models.


Poster
#W-101
TAROT: Targeted Data Selection via Optimal Transport

Lan Feng · Fan Nie · Yuejiang Liu · Alexandre Alahi

We propose TAROT, a targeted data selection framework grounded in Optimal Transport theory. Previous targeted data selection methods primarily rely on influence-based greedy heuristics to enhance domain-specific performance. While effective on limited, unimodal data (i.e., data following a single pattern), these methods struggle as target data complexity increases. Specifically, in multimodal distributions, such heuristics fail to account for multiple inherent patterns, leading to suboptimal data selection. This work identifies two primary limitations: (i) the disproportionate impact of dominant feature components in high-dimensional influence estimation, and (ii) the restrictive linear additive assumptions in greedy selection strategies. To address these challenges, TAROT incorporates whitened feature distance to mitigate dominant feature bias, offering a more reliable measure of data influence. Building on this, TAROT leverages whitened feature distance to quantify and minimize the optimal transport distance between selected data and target domains. Notably, this minimization also facilitates the estimation of optimal selection ratios. We evaluate TAROT across multiple tasks, including semantic segmentation, motion prediction, and instruction tuning. Results consistently show that TAROT outperforms state-of-the-art methods, demonstrating its versatility across various deep learning tasks. Code is available at: https://github.com/vita-epfl/TAROT.
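A small sketch of the two ingredients mentioned in the abstract, whitened feature distances and an optimal transport cost, on synthetic features; for equal-size point clouds with uniform weights, exact OT reduces to an assignment problem, which stands in here for the framework's actual OT machinery.

```python
# Whiten candidate/target features, then compute an exact OT cost via assignment.
import numpy as np
from scipy.optimize import linear_sum_assignment

def whiten(X, ref):
    """ZCA-whiten rows of X using the covariance of the reference set."""
    mu = ref.mean(axis=0)
    cov = np.cov(ref - mu, rowvar=False) + 1e-6 * np.eye(ref.shape[1])
    vals, vecs = np.linalg.eigh(cov)
    return (X - mu) @ (vecs * vals ** -0.5) @ vecs.T

target = np.random.randn(50, 8)            # toy target-domain features
candidates = np.random.randn(50, 8) + 0.5  # toy candidate training features
Xt, Xc = whiten(target, target), whiten(candidates, target)

cost = np.linalg.norm(Xt[:, None] - Xc[None], axis=-1) ** 2
rows, cols = linear_sum_assignment(cost)   # exact OT plan for uniform marginals
print("OT cost between candidates and target:", cost[rows, cols].mean())
```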


Poster
#W-1010
Core Knowledge Deficits in Multi-Modal Language Models

Yijiang Li · Qingying Gao · Tianwei Zhao · Bingyang Wang · Haoran Sun · Haiyun Lyu · Robert Hawkins · Nuno Vasconcelos · Tal Golan · Dezhi Luo · Hokin Deng

While Multi-modal Large Language Models (MLLMs) demonstrate impressive abilities over high-level perception and reasoning, their robustness in the wild remains limited, often falling short on tasks that are intuitive and effortless for humans. We examine the hypothesis that these deficiencies stem from the absence of core knowledge—rudimentary cognitive abilities innate to humans from early childhood. To explore the core knowledge representation in MLLMs, we introduce CoreCognition, a large-scale benchmark encompassing 12 core knowledge concepts grounded in developmental cognitive science. We evaluate 230 models with 11 different prompts, leading to a total of 2,530 data points for analysis. Our experiments uncover four key findings, collectively demonstrating core knowledge deficits in MLLMs: they consistently underperform and show reduced, or even absent, scalability on low-level abilities relative to high-level ones. Finally, we propose Concept Hacking, a novel controlled evaluation method that reveals MLLMs fail to progress toward genuine core knowledge understanding, but instead rely on shortcut learning as they scale.


Poster
#W-1011
A Sharper Global Convergence Analysis for Average Reward Reinforcement Learning via an Actor-Critic Approach

Swetha Ganesh · Washim Mondal · Vaneet Aggarwal

This work examines average-reward reinforcement learning with general policy parametrization. Existing state-of-the-art (SOTA) guarantees for this problem are either suboptimal or hindered by several challenges, including poor scalability with respect to the size of the state-action space, high iteration complexity, and a significant dependence on knowledge of mixing times and hitting times. To address these limitations, we propose a Multi-level Monte Carlo-based Natural Actor-Critic (MLMC-NAC) algorithm. Our work is the first to achieve a global convergence rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$ for average-reward Markov Decision Processes (MDPs) (where $T$ is the horizon length), using an Actor-Critic approach. Moreover, the convergence rate does not scale with the size of the state space and is therefore applicable even to infinite state spaces.


Poster
#W-1012
The Sample Complexity of Online Strategic Decision Making with Information Asymmetry and Knowledge Transportability

Jiachen Hu · Rui Ai · Han Zhong · Xiaoyu Chen · Liwei Wang · Zhaoran Wang · Zhuoran Yang

Information asymmetry is a pervasive feature of multi-agent systems, especially evident in economics and social sciences. In these settings, agents tailor their actions based on private information to maximize their rewards. These strategic behaviors often introduce complexities due to confounding variables. Simultaneously, knowledge transportability poses another significant challenge, arising from the difficulties of conducting experiments in target environments. It requires transferring knowledge from environments where empirical data is more readily available. Against these backdrops, this paper explores a fundamental question in online learning: Can we employ non-i.i.d. actions to learn about confounders even when requiring knowledge transfer? We present a sample-efficient algorithm designed to accurately identify system dynamics under information asymmetry and to navigate the challenges of knowledge transfer effectively in reinforcement learning, framed within an online strategic interaction model. Our method provably achieves learning of an $\epsilon$-optimal policy with a tight sample complexity of $\tilde{O}(1/\epsilon^2)$.


Poster
#W-1013
Overcoming the Curse of Dimensionality in Reinforcement Learning Through Approximate Factorization

Chenbei Lu · Laixi Shi · Zaiwei Chen · Chenye Wu · Adam Wierman

Factored Markov Decision Processes (FMDPs) offer a promising framework for overcoming the curse of dimensionality in reinforcement learning (RL) by decomposing high-dimensional MDPs into smaller and independently evolving components. Despite their potential, existing studies on FMDPs face three key limitations: reliance on perfectly factorizable models, suboptimal sample complexity guarantees for model-based algorithms, and the absence of model-free algorithms. To address these challenges, we introduce approximate factorization, which extends FMDPs to handle imperfectly factored models. Moreover, we develop a model-based algorithm and a model-free algorithm (in the form of variance-reduced Q-learning), both achieving the first near-minimax sample complexity guarantees for FMDPs. A key novelty in the design of these two algorithms is the development of a graph-coloring-based optimal synchronous sampling strategy. Numerical simulations based on the wind farm storage control problem corroborate our theoretical findings.


Poster
#W-1014
Can RLHF be More Efficient with Imperfect Reward Models? A Policy Coverage Perspective

Jiawei Huang · Bingcong Li · Christoph Dann · Niao He

Sample efficiency is critical for online Reinforcement Learning from Human Feedback (RLHF). While existing works investigate sample-efficient online exploration strategies, the potential of utilizing misspecified yet relevant reward models to accelerate learning remains underexplored. This paper studies how to transfer knowledge from those imperfect reward models in online RLHF. We start by identifying a novel property due to KL-regularization in the RLHF objective: \emph{a policy's coverability of the optimal policy is captured by its sub-optimality}. Building on this insight, we propose novel transfer learning principles and a theoretical algorithm---Transfer Policy Optimization (TPO)---with provable benefits compared to standard online learning. Empirically, inspired by our theoretical findings, we develop a win-rate-based transfer policy selection strategy with improved computational efficiency. Moreover, our empirical transfer learning technique is modular and can be integrated with various policy optimization methods, such as DPO, IPO and XPO, to further enhance their performance. We validate the effectiveness of our method through experiments on summarization tasks.


Poster
#W-1015
Diffusion models for Gaussian distributions: Exact solutions and Wasserstein errors

Emile Pierret · Bruno Galerne

Diffusion or score-based models have recently shown high performance in image generation. They rely on a forward and a backward stochastic differential equation (SDE). The sampling of a data distribution is achieved by numerically solving the backward SDE or its associated flow ODE. Studying the convergence of these models requires controlling four different types of error: the initialization error, the truncation error, the discretization error, and the score approximation error. In this paper, we theoretically study the behavior of diffusion models and their numerical implementation when the data distribution is Gaussian. Our first contribution is to derive the analytical solutions of the backward SDE and the probability flow ODE and to prove that these solutions and their discretizations are all Gaussian processes. Our second contribution is to compute the exact Wasserstein errors between the target and the numerically sampled distributions for any numerical scheme. This allows us to monitor convergence directly in the data space, while experimental works limit their empirical analysis to Inception features. An implementation of our code is available online.
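For context, the 2-Wasserstein distance between two Gaussians has a closed form, which is what makes exact error monitoring possible in this setting; a small sketch with toy parameters (not taken from the paper) follows.

```python
# Closed-form squared 2-Wasserstein distance between two Gaussians.
import numpy as np
from scipy.linalg import sqrtm

def w2_squared(m1, C1, m2, C2):
    """W2^2(N(m1,C1), N(m2,C2)) = ||m1-m2||^2 + Tr(C1 + C2 - 2(C2^{1/2} C1 C2^{1/2})^{1/2})."""
    root = sqrtm(sqrtm(C2) @ C1 @ sqrtm(C2))
    return float(np.sum((m1 - m2) ** 2) + np.trace(C1 + C2 - 2 * np.real(root)))

d = 3
m_target, C_target = np.zeros(d), np.eye(d)                # toy target Gaussian
m_sampled, C_sampled = 0.1 * np.ones(d), 1.2 * np.eye(d)   # toy sampled Gaussian
print(w2_squared(m_target, C_target, m_sampled, C_sampled))
```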


Poster
#W-1016
Average Sensitivity of Hierarchical $k$-Median Clustering

Shijie Li · Weiqiang He · Ruobing Bai · Pan Peng

Hierarchical clustering is a widely used method for unsupervised learning with numerous applications. However, in the application of modern algorithms, the datasets studied are usually large and dynamic. If the hierarchical clustering is sensitive to small perturbations of the dataset, the usability of the algorithm will be greatly reduced. In this paper, we focus on the hierarchical $k$-median clustering problem, which bridges hierarchical and centroid-based clustering while offering theoretical appeal, practical utility, and improved interpretability. We analyze the average sensitivity of algorithms for this problem by measuring the expected change in the output when a random data point is deleted. We propose an efficient algorithm for hierarchical $k$-median clustering and theoretically prove its low average sensitivity and high clustering quality. Additionally, we show that single linkage clustering and a deterministic variant of the CLNSS algorithm exhibit high average sensitivity, making them less stable. Finally, we validate the robustness and effectiveness of our algorithm through experiments.


Poster
#W-1017
Learning multivariate Gaussians with imperfect advice

Arnab Bhattacharyya · Davin Choo · Philips George John · Themistoklis Gouleakis

We revisit the problem of distribution learning within the framework of learning-augmented algorithms. In this setting, we explore the scenario where a probability distribution is provided as potentially inaccurate advice on the true, unknown distribution. Our objective is to develop learning algorithms whose sample complexity decreases as the quality of the advice improves, thereby surpassing standard learning lower bounds when the advice is sufficiently accurate. Specifically, we demonstrate that this outcome is achievable for the problem of learning a multivariate Gaussian distribution $N(\mu, \Sigma)$ in the PAC learning setting. Classically, in the advice-free setting, $\widetilde{\Theta}(d^2/\varepsilon^2)$ samples are sufficient and worst-case necessary to learn $d$-dimensional Gaussians up to TV distance $\varepsilon$ with constant probability. When we are additionally given a parameter $\widetilde{\Sigma}$ as advice, we show that $\widetilde{\mathcal{O}}(d^{2-\beta}/\varepsilon^2)$ samples suffice whenever $|| \widetilde{\Sigma}^{-1/2} \Sigma \widetilde{\Sigma}^{-1/2} - I_d ||_1 \leq \varepsilon d^{1-\beta}$ (where $||\cdot||_1$ denotes the entrywise $\ell_1$ norm) for any $\beta > 0$, yielding a polynomial improvement over the advice-free setting.


Poster
#W-1018
On The Concurrence of Layer-wise Preconditioning Methods and Provable Feature Learning

Thomas T. Zhang · Behrad Moniri · Ansh Nagwekar · Faraz Rahman · Anton Xue · Hamed Hassani · Nikolai Matni

Layer-wise preconditioning methods are a family of memory-efficient optimization algorithms that introduce preconditioners per axis of each layer's weight tensors. These methods have seen a recent resurgence, demonstrating impressive performance relative to entry-wise ("diagonal") preconditioning methods such as Adam(W) on a wide range of neural network optimization tasks. Complementary to their practical performance, we demonstrate that layer-wise preconditioning methods are provably necessary from a statistical perspective. To showcase this, we consider two prototypical models, *linear representation learning* and *single-index learning*, which are widely used to study how typical algorithms efficiently learn useful *features* to enable generalization. In these problems, we show SGD is a suboptimal feature learner when extending beyond ideal isotropic inputs $\mathbf{x} \sim \mathsf{N}(\mathbf{0}, \mathbf{I})$ and well-conditioned settings typically assumed in prior work. We demonstrate theoretically and numerically that this suboptimality is fundamental, and that layer-wise preconditioning emerges naturally as the solution. We further show that standard tools like Adam preconditioning and batch-norm only mildly mitigate these issues, supporting the unique benefits of layer-wise preconditioning.
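To make the family of methods concrete, here is a Shampoo-style per-axis preconditioned update on a toy weight matrix; this illustrates layer-wise preconditioning in general and is not the paper's specific algorithm or analysis.

```python
# One layer-wise (per-axis) preconditioned gradient step, Shampoo-style.
import numpy as np

def matrix_power(M, p, eps=1e-6):
    """Power of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return (vecs * np.maximum(vals, eps) ** p) @ vecs.T

def preconditioned_step(W, G, L, R, lr=1e-2):
    L += G @ G.T   # row-axis second-moment statistics
    R += G.T @ G   # column-axis second-moment statistics
    W -= lr * matrix_power(L, -0.25) @ G @ matrix_power(R, -0.25)
    return W, L, R

d_out, d_in = 32, 16
W = np.random.randn(d_out, d_in)
L, R = np.eye(d_out), np.eye(d_in)   # preconditioner accumulators
G = np.random.randn(d_out, d_in)     # toy gradient for one step
W, L, R = preconditioned_step(W, G, L, R)
```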


Spotlight Poster
#W-1019
Sparse-pivot: Dynamic correlation clustering for node insertions

Mina Dalirrooyfard · Konstantin Makarychev · Slobodan Mitrovic

We present a new Correlation Clustering algorithm for a dynamic setting where nodes are added one at a time. In this model, proposed by Cohen-Addad, Lattanzi, Maggiori, and Parotsidis (ICML 2024), the algorithm uses database queries to access the input graph and updates the clustering as each new node is added. Our algorithm has an amortized update time of $\log^{O(1)}(n)$. Its approximation factor is $20+\varepsilon$, which is a substantial improvement over the approximation factor of the algorithm by Cohen-Addad et al. We complement our theoretical findings by empirically evaluating the approximation guarantee of our algorithm. The results show that it outperforms the algorithm by Cohen-Addad et al. in practice.


Poster
#W-102
Calibrating Video Watch-time Predictions with Credible Prototype Alignment

Chao · Shisong Tang · Fan Li · Jiechao Gao · Hechang Chen

Accurately predicting user watch-time is crucial for enhancing user stickiness and retention in video recommendation systems. Existing watch-time prediction approaches typically involve transformations of watch-time labels for prediction and subsequent reversal, ignoring both the natural distribution properties of the label and the \textit{instance representation confusion} that results in inaccurate predictions. In this paper, we propose ProWTP, a two-stage method combining prototype learning and optimal transport for watch-time regression prediction, suitable for any deep recommendation model. Specifically, we observe that the watch-ratio (the ratio of watch-time to video duration) within the same duration bucket exhibits a multimodal distribution. To facilitate incorporation into models, we use a hierarchical vector quantised variational autoencoder (HVQ-VAE) to convert the continuous label distribution into a high-dimensional discrete distribution, serving as credible prototypes for calibration. Based on this, ProWTP views the alignment between prototypes and instance representations as a Semi-relaxed Unbalanced Optimal Transport (SUOT) problem, where the marginal constraints of prototypes are relaxed; the corresponding optimization problem is reformulated as a weighted Lasso problem for solution. Moreover, ProWTP introduces assignment and compactness losses to encourage instances to cluster closely around their respective prototypes, thereby enhancing prototype-level distinguishability. Finally, we conducted extensive offline experiments on two industrial datasets, demonstrating our consistent superiority in real-world applications.


Poster
#W-1020
Incremental Gradient Descent with Small Epoch Counts is Surprisingly Slow on Ill-Conditioned Problems

Yujun Kim · Jaeyoung Cha · Chulhee Yun

Recent theoretical results demonstrate that the convergence rates of permutation-based SGD (e.g., random reshuffling SGD) are faster than uniform-sampling SGD; however, these studies focus mainly on the large epoch regime, where the number of epochs $K$ exceeds the condition number $\kappa$. In contrast, little is known when $K$ is smaller than $\kappa$, and it is still a challenging open question whether permutation-based SGD can converge faster in this small epoch regime (Safran and Shamir, 2021). As a step toward understanding this gap, we study the naive deterministic variant, Incremental Gradient Descent (IGD), on smooth and strongly convex functions. Our lower bounds reveal that for the small epoch regime, IGD can exhibit surprisingly slow convergence even when all component functions are strongly convex. Furthermore, when some component functions are allowed to be nonconvex, we prove that the optimality gap of IGD can be significantly worse throughout the small epoch regime. Our analyses reveal that the convergence properties of permutation-based SGD in the small epoch regime may vary drastically depending on the assumptions on component functions. Lastly, we supplement the paper with tight upper and lower bounds for IGD in the large epoch regime.
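A minimal numeric sketch of Incremental Gradient Descent itself, cycling through components in a fixed order on a toy ill-conditioned quadratic sum; the values are illustrative and not taken from the paper's constructions.

```python
# Incremental Gradient Descent on f(x) = (1/n) * sum_i 0.5 * a_i * (x - b_i)^2.
import numpy as np

a = np.array([100.0, 1.0, 1.0, 1.0])   # component curvatures (ill-conditioned: 100 vs 1)
b = np.array([1.0, -1.0, 2.0, 0.0])
x, lr, epochs = 0.0, 1e-3, 5           # small epoch count

for _ in range(epochs):
    for ai, bi in zip(a, b):           # fixed deterministic order, no reshuffling
        x -= lr * ai * (x - bi)        # gradient of a single component

x_star = np.sum(a * b) / np.sum(a)     # minimizer of the full objective
print(f"IGD iterate {x:.4f} vs optimum {x_star:.4f}")
```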


Poster
#W-103
Adaptive Self-improvement LLM Agentic System for ML Library Development

Genghan Zhang · Weixin Liang · Olivia Hsu · Kunle Olukotun

ML libraries, often written in architecture-specific programming languages (ASPLs) that target domain-specific architectures, are key to efficient ML systems. However, writing these high-performance ML libraries is challenging because it requires expert knowledge of both ML algorithms and the ASPL. Large language models (LLMs), on the other hand, have shown general coding capabilities. However, challenges remain when using LLMs for generating ML libraries using ASPLs because 1) this task is complicated even for human experts and 2) there are limited code examples due to the esoteric and evolving nature of ASPLs. We present an adaptive self-improvement agentic system that enables LLMs to perform such complex reasoning under limited data by iteratively improving their capability through self-generated experience. In order to evaluate the effectiveness of our system, we construct a benchmark of a typical ML library and generate ASPL code with both open and closed-source LLMs on this benchmark. Our results show improvements of up to $3.9\times$ over a baseline single LLM.


Poster
#W-104
PEINR: A Physics-enhanced Implicit Neural Representation for High-Fidelity Flow Field Reconstruction

Liming Shen · Liang Deng · Chongke Bi · Yu Wang · Xinhai Chen · Yueqing Wang · Jie Liu

Implicit neural representation (INR) has now been thrust into the limelight with its flexibility in high-fidelity flow field reconstruction tasks. However, the lack of standard benchmarking datasets and the grid independence assumption for INR-based methods hinder progress and adoption in real-world simulation scenarios. Moreover, naive adoptions of existing INR frameworks suffer from limited accuracy in capturing fine-scale structures and spatiotemporal dynamics. Tackling these issues, we first introduce HFR-Beach, a 5.4 TB public large-scale CFD dataset with 33,600 unsteady 2D and 3D vector fields for reconstructing high-fidelity flow fields. We further present PEINR, a physics-enhanced INR framework, to enrich the flow fields by concurrently enhancing numerical precision and grid resolution. Specifically, PEINR is mainly composed of physical encoding and a transformer-based spatiotemporal fuser (TransSTF). Physical encoding decouples temporal and spatial components, employing Gaussian coordinate encoding and localized encoding techniques to capture the nonlinear characteristics of spatiotemporal dynamics and the stencil discretization of spatial dimensions, respectively. TransSTF fuses spatial and temporal information via transformer for capturing long-range temporal dependencies. Qualitative and quantitative experiments demonstrate that PEINR outperforms state-of-the-art INR-based methods in reconstruction quality.


Poster
#W-105
PDE-Transformer: Efficient and Versatile Transformers for Physics Simulations

Benjamin Holzschuh · Qiang Liu · Georg Kohl · Nils Thuerey

We introduce PDE-Transformer, an improved transformer-based architecture for surrogate modeling of physics simulations on regular grids. We combine recent architectural improvements of diffusion transformers with adjustments specific for large-scale simulations to yield a more scalable and versatile general-purpose transformer architecture, which can be used as the backbone for building large-scale foundation models in physical sciences. We demonstrate that our proposed architecture outperforms state-of-the-art transformer architectures for computer vision on a large dataset of 16 different types of PDEs. We propose to embed different physical channels individually as spatio-temporal tokens, which interact via channel-wise self-attention. This helps to maintain a consistent information density of tokens when learning multiple types of PDEs simultaneously. We demonstrate that our pre-trained models achieve improved performance on several challenging downstream tasks compared to training from scratch and also beat other foundation model architectures for physics simulations. Our source code is available at https://github.com/tum-pbs/pde-transformer.


Poster
#W-106
Topology-aware Neural Flux Prediction Guided by Physics

Haoyang Jiang · Jindong Wang · Xingquan Zhu · Yi He

Graph Neural Networks (GNNs) often struggle to preserve high-frequency components of nodal signals when dealing with directed graphs. Such components are crucial for modeling flow dynamics, without which a traditional GNN tends to treat graphs with forward and reverse topologies as equal. To make GNNs sensitive to those high-frequency components, and thereby capable of capturing detailed topological differences, this paper proposes a novel framework that combines 1) explicit difference matrices that model directional gradients and 2) implicit physical constraints that enforce message passing within GNNs to be consistent with natural laws. Evaluations on two real-world directed graph datasets, namely a water flux network and an urban traffic flow network, demonstrate the effectiveness of our proposal.


Poster
#W-107
OneForecast: A Universal Framework for Global and Regional Weather Forecasting

Yuan Gao · Hao Wu · Ruiqi Shu · huanshuo dong · Fan Xu · Rui Chen · Yibo Yan · Qingsong Wen · Xuming Hu · Kun Wang · Jiahao Wu · Qing Li · Hui Xiong · Xiaomeng Huang

Accurate weather forecasts are important for disaster prevention, agricultural planning, etc. Traditional numerical weather prediction (NWP) methods offer physically interpretable high-accuracy predictions but are computationally expensive and fail to fully leverage rapidly growing historical data. In recent years, deep learning models have made significant progress in weather forecasting, but challenges remain, such as balancing global and regional high-resolution forecasts, excessive smoothing in extreme event predictions, and insufficient dynamic system modeling. To address these issues, this paper proposes a global-regional nested weather forecasting framework (OneForecast) based on graph neural networks. By combining a dynamic system perspective with multi-grid theory, we construct a multi-scale graph structure and densify the target region to capture local high-frequency features. We introduce an adaptive messaging mechanism, using dynamic gating units to deeply integrate node and edge features for more accurate extreme event forecasting. For high-resolution regional forecasts, we propose a neural nested grid method to mitigate boundary information loss. Experimental results show that OneForecast performs excellently across global to regional scales and short-term to long-term forecasts, especially in extreme event predictions. Codes link: \url{https://github.com/YuanGao-YG/OneForecast}.


Poster
#W-108
Efficient and Scalable Density Functional Theory Hamiltonian Prediction through Adaptive Sparsity

Erpai Luo · Xinran Wei · Lin Huang · Yunyang Li · Han Yang · Zaishuo Xia · Zun Wang · Chang Liu · Bin Shao · Jia Zhang

Hamiltonian matrix prediction is pivotal in computational chemistry, serving as the foundation for determining a wide range of molecular properties. While SE(3) equivariant graph neural networks have achieved remarkable success in this domain, their substantial computational cost—driven by high-order tensor product (TP) operations—restricts their scalability to large molecular systems with extensive basis sets. To address this challenge, we introduce SPHNet, an efficient and scalable equivariant network, that incorporates adaptive SParsity into Hamiltonian prediction. SPHNet employs two innovative sparse gates to selectively constrain non-critical interaction combinations, significantly reducing tensor product computations while maintaining accuracy. To optimize the sparse representation, we develop a Three-phase Sparsity Scheduler, ensuring stable convergence and achieving high performance at sparsity rates of up to 70\%. Extensive evaluations on QH9 and PubchemQH datasets demonstrate that SPHNet achieves state-of-the-art accuracy while providing up to a 7x speedup over existing models. Beyond Hamiltonian prediction, the proposed sparsification techniques also hold significant potential for improving the efficiency and scalability of other SE(3) equivariant networks, further broadening their applicability and impact.


Poster
#W-109
WyckoffDiff -- A Generative Diffusion Model for Crystal Symmetry

Filip Ekström Kelvinius · Oskar Andersson · Abhijith Parackal · Dong Qian · Rickard Armiento · Fredrik Lindsten

Crystalline materials often exhibit a high level of symmetry. However, most generative models do not account for symmetry, but rather model each atom without any constraints on its position or element. We propose a generative model, Wyckoff Diffusion (WyckoffDiff), which generates symmetry-based descriptions of crystals. This is enabled by considering a crystal structure representation that encodes all symmetry, and we design a novel neural network architecture which enables using this representation inside a discrete generative model framework. In addition to respecting symmetry by construction, the discrete nature of our model enables fast generation. We additionally present a new metric, Fréchet Wrenformer Distance, which captures the symmetry aspects of the materials generated, and we benchmark WyckoffDiff against recently proposed generative models for crystal generation. As a proof-of-concept study, we use WyckoffDiff to find new materials below the convex hull of thermodynamical stability.


Poster
#W-110
Zero-Shot Cyclic Peptide Design via Composable Geometric Constraints

Dapeng Jiang · Xiangzhe Kong · Jiaqi Han · Mingyu Li · Rui Jiao · Wenbing Huang · Stefano Ermon · Jianzhu Ma · Yang Liu

Cyclic peptides, characterized by geometric constraints absent in linear peptides, offer enhanced biochemical properties, presenting new opportunities to address unmet medical needs. However, designing target-specific cyclic peptides remains underexplored due to limited training data. To bridge the gap, we propose CP-Composer, a novel generative framework that enables zero-shot cyclic peptide generation via composable geometric constraints. Our approach decomposes complex cyclization patterns into unit constraints, which are incorporated into a diffusion model through geometric conditioning on nodes and edges. During training, the model learns from unit constraints and their random combinations in linear peptides, while at inference, novel constraint combinations required for cyclization are imposed as input. Experiments show that our model, despite trained with linear peptides, is capable of generating diverse target-binding cyclic peptides, reaching success rates from 38\% to 84\% on different cyclization strategies.


Poster
#W-1100
Annealing Flow Generative Models Towards Sampling High-Dimensional and Multi-Modal Distributions

Dongze Wu · Yao Xie

Sampling from high-dimensional, multi-modal distributions remains a fundamental challenge across domains such as statistical Bayesian inference and physics-based machine learning. In this paper, we propose Annealing Flow (AF), a method built on Continuous Normalizing Flows (CNFs) for sampling from high-dimensional and multi-modal distributions. AF is trained with a dynamic Optimal Transport (OT) objective incorporating Wasserstein regularization, and guided by annealing procedures, facilitating effective exploration of modes in high-dimensional spaces. Compared to recent NF methods, AF significantly improves training efficiency and stability, with minimal reliance on MC assistance. We demonstrate the superior performance of AF compared to state-of-the-art methods through extensive experiments on various challenging distributions and real-world datasets, particularly in high-dimensional and multi-modal settings. We also highlight AF’s potential for sampling the least favorable distributions.


Poster
#W-1101
Sounding that Object: Interactive Object-Aware Image to Audio Generation

Tingle Li · Baihe Huang · Xiaobin Zhuang · Dongya Jia · Jiawei Chen · Yuping Wang · Zhuo Chen · Gopala Anumanchipalli · Yuxuan Wang

Generating accurate sounds for complex audio-visual scenes is challenging, especially in the presence of multiple objects and sound sources. In this paper, we propose an interactive object-aware audio generation model that grounds sound generation in user-selected visual objects within images. Our method integrates object-centric learning into a conditional latent diffusion model, which learns to associate image regions with their corresponding sounds through multi-modal attention. At test time, our model employs image segmentation to allow users to interactively generate sounds at the object level. We theoretically validate that our attention mechanism functionally approximates test-time segmentation masks, ensuring the generated audio aligns with selected objects. Quantitative and qualitative evaluations show that our model outperforms baselines, achieving better alignment between objects and their associated sounds.


Poster
#W-111
PINNsAgent: Automated PDE Surrogation with Large Language Models

Qingpo Wuwu · Chonghan Gao · Tianyu Chen · Yihang Huang · Yuekai Zhang · Jianing Wang · Jianxin Li · Haoyi Zhou · Shanghang Zhang

Solving partial differential equations (PDEs) using neural methods has been a long-standing scientific and engineering research pursuit. Physics-Informed Neural Networks (PINNs) have emerged as a promising alternative to traditional numerical methods for solving PDEs. However, the gap between domain-specific knowledge and deep learning expertise often limits the practical application of PINNs. Previous works typically involve manually conducting extensive PINNs experiments and summarizing heuristic rules for hyperparameter tuning. In this work, we introduce PINNsAgent, a novel surrogation framework that leverages large language models (LLMs) to bridge the gap between domain-specific knowledge and deep learning. PINNsAgent integrates Physics-Guided Knowledge Replay (PGKR) for efficient knowledge transfer from solved PDEs to similar problems, and Memory Tree Reasoning for exploring the search space of optimal PINNs architectures. We evaluate PINNsAgent on 14 benchmark PDEs, demonstrating its effectiveness in automating the surrogation process and significantly improving the accuracy of PINNs-based solutions.


Poster
#W-112
Temperature-Annealed Boltzmann Generators

Henrik Schopmans · Pascal Friederich

Efficient sampling of unnormalized probability densities such as the Boltzmann distribution of molecular systems is a longstanding challenge. Next to conventional approaches like molecular dynamics or Markov chain Monte Carlo, variational approaches, such as training normalizing flows with the reverse Kullback-Leibler divergence, have been introduced. However, such methods are prone to mode collapse and often do not learn to sample the full configurational space. Here, we present temperature-annealed Boltzmann generators (TA-BG) to address this challenge. First, we demonstrate that training a normalizing flow with the reverse Kullback-Leibler divergence at high temperatures is possible without mode collapse. Furthermore, we introduce a reweighting-based training objective to anneal the distribution to lower target temperatures. We apply this methodology to three molecular systems of increasing complexity and, compared to the baseline, achieve better results in almost all metrics while requiring up to three times fewer target energy evaluations. For the largest system, our approach is the only method that accurately resolves the metastable states of the system.
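The basic identity behind reweighting between temperatures is standard Boltzmann importance reweighting; the toy example below (a quadratic energy, not the paper's objective or systems) reweights exact high-temperature samples to estimate a low-temperature expectation.

```python
# Boltzmann reweighting of samples from T_high to a lower target temperature T_low.
import numpy as np

def energy(x):
    return 0.5 * np.sum(x ** 2, axis=-1)           # toy quadratic energy

T_high, T_low = 2.0, 1.0
x = np.sqrt(T_high) * np.random.randn(100_000, 2)  # exact Boltzmann samples at T_high
log_w = -(1.0 / T_low - 1.0 / T_high) * energy(x)  # log importance weights
w = np.exp(log_w - log_w.max())
w /= w.sum()

# mean energy under the T_low Boltzmann distribution (exact value: d * T_low / 2 = 1.0)
print(np.sum(w * energy(x)))
```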


Poster
#W-113
Arbitrarily-Conditioned Multi-Functional Diffusion for Multi-Physics Emulation

Da Long · Zhitong Xu · Guang Yang · Akil Narayan · Shandian Zhe

Modern physics simulation often involves multiple functions of interests, and traditional numerical approaches are known to be complex and computationally costly. While machine learning-based surrogate models can offer significant cost reductions, most focus on a single task, such as forward prediction, and typically lack uncertainty quantification --- an essential component in many applications. To overcome these limitations, we propose Arbitrarily-Conditioned Multi-Functional Diffusion (ACM-FD), a versatile probabilistic surrogate model for multi-physics emulation. ACM-FD can perform a wide range of tasks within a single framework, including forward prediction, various inverse problems, and simulating data for entire systems or subsets of quantities conditioned on others. Specifically, we extend the standard Denoising Diffusion Probabilistic Model (DDPM) for multi-functional generation by modeling noise as Gaussian processes (GP). We propose a random-mask based, zero-regularized denoising loss to achieve flexible and robust conditional generation. We induce a Kronecker product structure in the GP covariance matrix, substantially reducing the computational cost and enabling efficient training and sampling. We demonstrate the effectiveness of ACM-FD across several fundamental multi-physics systems.
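As a small illustration of why the Kronecker structure helps, the sketch below draws GP noise over a space-time grid with covariance $K_t \otimes K_x$ using the matrix-normal identity, so only the two small factor kernels are ever factorized; the kernels and grid sizes are toy choices, not the paper's.

```python
# Drawing GP noise with Kronecker covariance K_t (x) K_x via the matrix-normal trick.
import numpy as np

def rbf(a, b, ls):
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

t = np.linspace(0, 1, 20)
x = np.linspace(0, 1, 30)
K_t = rbf(t, t, 0.2) + 1e-6 * np.eye(t.size)
K_x = rbf(x, x, 0.1) + 1e-6 * np.eye(x.size)

L_t, L_x = np.linalg.cholesky(K_t), np.linalg.cholesky(K_x)
G = np.random.randn(x.size, t.size)
Z = L_x @ G @ L_t.T   # Cov(vec(Z)) = K_t (x) K_x, without forming the full covariance
```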


Poster
#W-114
A Machine Learning Approach to Duality in Statistical Physics

Prateek Gupta · Andrea Ferrari · Nabil Iqbal

The notion of duality -- that a given physical system can have two different mathematical descriptions -- is a key idea in modern theoretical physics. Establishing a duality in lattice statistical mechanics models requires the construction of a dual Hamiltonian and a map from the original to the dual observables. By using neural networks to parameterize these maps and introducing a loss function that penalises the difference between correlation functions in original and dual models, we formulate the process of duality discovery as an optimization problem. We numerically solve this problem and show that our framework can rediscover the celebrated Kramers-Wannier duality for the 2d Ising model, numerically reconstructing the known mapping of temperatures. We further investigate the 2d Ising model deformed by a plaquette coupling and find families of ``approximate duals''. We discuss future directions and prospects for discovering new dualities within this framework.


Poster
#W-115
Tensor-Var: Efficient Four-Dimensional Variational Data Assimilation

Yiming Yang · Xiaoyuan Cheng · Daniel Giles · Sibo Cheng · Yi He · Xiao Xue · Boli Chen · Yukun Hu

Variational data assimilation estimates the dynamical system states by minimizing a cost function that fits the numerical models to the observational data. Although four-dimensional variational assimilation (4D-Var) is widely used, it faces high computational costs in complex nonlinear systems and depends on imperfect state-observation mappings. Deep learning (DL) offers more expressive approximators, but integrating DL models into 4D-Var is challenging due to their nonlinearities and lack of theoretical guarantees on assimilation results. In this paper, we propose \textit{Tensor-Var}, a novel framework that integrates kernel conditional mean embedding (CME) with 4D-Var to linearize nonlinear dynamics, achieving convex optimization in a learned feature space. Moreover, our method provides a new perspective for solving 4D-Var in a linear way, offering theoretical guarantees of consistent assimilation results between the original and feature spaces. To handle large-scale problems, we propose a method to learn deep features (DFs) using neural networks within the Tensor-Var framework. Experiments on chaotic systems and global weather prediction with real-time observations show that Tensor-Var outperforms conventional and DL hybrid 4D-Var baselines in accuracy while achieving a 10- to 20-fold speed improvement.
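A finite-dimensional sketch of the linearization idea, learning a linear one-step operator in a random Fourier feature space as a stand-in for the kernel conditional mean embedding; the dynamics, feature dimension, and regularization below are toy assumptions.

```python
# Learn a linear dynamics operator in a random Fourier feature space (CME-style).
import numpy as np

rng = np.random.default_rng(0)

def rff(X, Wf, bf):
    return np.sqrt(2.0 / Wf.shape[1]) * np.cos(X @ Wf + bf)

# toy nonlinear dynamics: x_{t+1} = sin(x_t) + noise
X = rng.standard_normal((2000, 2))
Y = np.sin(X) + 0.01 * rng.standard_normal(X.shape)

D = 300
Wf = rng.standard_normal((2, D))
bf = rng.uniform(0.0, 2.0 * np.pi, D)
PhiX, PhiY = rff(X, Wf, bf), rff(Y, Wf, bf)

lam = 1e-3
A = PhiY.T @ PhiX @ np.linalg.inv(PhiX.T @ PhiX + lam * np.eye(D))  # ridge estimate

x0 = np.array([[0.3, -0.7]])
phi_next = rff(x0, Wf, bf) @ A.T   # one dynamics step is now a linear map in feature space
```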


Poster
#W-116
Multi-Timescale Dynamics Model Bayesian Optimization for Plasma Stabilization in Tokamaks

Rohit Sonker · Alexandre Capone · Andrew Rothstein · Hiro Kaga · Egemen Kolemen · Jeff Schneider

Machine learning algorithms often struggle to control complex real-world systems. In the case of nuclear fusion, these challenges are exacerbated, as the dynamics are notoriously complex, data is poor, hardware is subject to failures, and experiments often affect dynamics beyond the experiment's duration. Existing tools like reinforcement learning, supervised learning, and Bayesian optimization address some of these challenges but fail to provide a comprehensive solution. To overcome these limitations, we present a multi-scale Bayesian optimization approach that integrates a high-frequency data-driven dynamics model with a low-frequency Gaussian process. By updating the Gaussian process between experiments, the method rapidly adapts to new data, refining the predictions of the less reliable dynamical model. We validate our approach by controlling tearing instabilities in the DIII-D nuclear fusion plant. Offline testing on historical data shows that our method significantly outperforms several baselines. Results from live experiments on the DIII-D tokamak, conducted under high-performance plasma scenarios prone to instabilities, show a 50\% success rate, marking a 117\% improvement over historical outcomes.


Poster
#W-117
MIPT: Multilevel Informed Prompt Tuning for Robust Molecular Property Prediction

Yeyun Chen · Jiangming Shi

The progress in materials science and drug discovery is impeded by the availability of labeled data and the high costs of manual annotation, driving the need for efficient strategies to capture molecular representations and enable accurate predictions. Pretrained Graph Neural Networks have shown promise in capturing universal molecular representations, but adapting them to task-specific applications remains challenging. In this paper, we propose Multilevel Informed Prompt-Tuning (MIPT), a novel framework for effectively tailoring pretrained models to molecule-related tasks. MIPT utilizes a lightweight, multi-level prompt learning module to capture node-level and graph-level task-specific knowledge, ensuring adaptable and efficient tuning. Additionally, a noise penalty mechanism is introduced to address mismatches between pretrained representations and downstream tasks, reducing irrelevant or noisy information. Experimental results show that MIPT surpasses all baselines, aligning graph space and task space while achieving significant improvements in molecule-related tasks, demonstrating its scalability and versatility for molecular tasks.


Poster
#W-118
Open Materials Generation with Stochastic Interpolants

Philipp Höllmer · Thomas Egg · Maya Martirossyan · Eric Fuemmeler · Zeren Shui · Amit Gupta · Pawan Prakash · Adrian Roitberg · Mingjie Liu · George Karypis · Mark Transtrum · Richard Hennig · Ellad Tadmor · Stefano Martiniani

The discovery of new materials is essential for enabling technological advancements. Computational approaches for predicting novel materials must effectively learn the manifold of stable crystal structures within an infinite design space. We introduce Open Materials Generation (OMatG), a unifying framework for the generative design and discovery of inorganic crystalline materials. OMatG employs stochastic interpolants (SI) to bridge an arbitrary base distribution to the target distribution of inorganic crystals via a broad class of tunable stochastic processes, encompassing both diffusion models and flow matching as special cases. In this work, we adapt the SI framework by integrating an equivariant graph representation of crystal structures and extending it to account for periodic boundary conditions in unit cell representations. Additionally, we couple the SI flow over spatial coordinates and lattice vectors with discrete flow matching for atomic species. We benchmark OMatG's performance on two tasks: Crystal Structure Prediction (CSP) for specified compositions, and de novo generation (DNG) aimed at discovering stable, novel, and unique structures. In our ground-up implementation of OMatG, we refine and extend both CSP and DNG metrics compared to previous works. OMatG establishes a new state of the art in generative modeling for materials discovery, outperforming purely flow-based and diffusion-based implementations. These results underscore the importance of designing flexible deep learning frameworks to accelerate progress in materials science. The OMatG code is available at https://github.com/FERMat-ML/OMatG.


Poster
#W-119
On Explaining Equivariant Graph Networks via Improved Relevance Propagation

Hongyi Ling · Haiyang Yu · Zhimeng Jiang · Na Zou · Shuiwang Ji

We consider explainability in equivariant graph neural networks for 3D geometric graphs. While many XAI methods have been developed for analyzing graph neural networks, they predominantly target 2D graph structures. The complex nature of 3D data and the sophisticated architectures of equivariant GNNs present unique challenges. Current XAI techniques either struggle to adapt to equivariant GNNs or fail to effectively handle positional data and evaluate the significance of geometric features adequately. To address these challenges, we introduce a novel method, known as EquiGX, which uses the Deep Taylor decomposition framework to extend the layer-wise relevance propagation rules tailored for spherical equivariant GNNs. Our approach decomposes prediction scores and back-propagates the relevance scores through each layer to the input space. Our decomposition rules provide a detailed explanation of each layer’s contribution to the network’s predictions, thereby enhancing our understanding of how geometric and positional data influence the model’s outputs. Through experiments on both synthetic and real-world datasets, our method demonstrates its capability to identify critical geometric structures and outperform alternative baselines. These results indicate that our method provides significantly enhanced explanations for equivariant GNNs. Our code has been released as part of the AIRS library (https://github.com/divelab/AIRS/).
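
For readers unfamiliar with relevance propagation, the snippet below shows the standard epsilon-rule for a single linear layer, which is the kind of rule EquiGX extends (with specialized decompositions) to spherical equivariant layers; this generic version is only an illustration.

```python
import numpy as np

def lrp_linear(a, W, b, relevance_out, eps=1e-6):
    """Redistribute the relevance of y = a @ W + b back onto the inputs a."""
    z = a @ W + b                                 # forward pre-activations
    s = relevance_out / (z + eps * np.sign(z))    # stabilized relevance ratio
    c = s @ W.T                                   # propagate back through the weights
    return a * c                                  # input relevance (approx. conserved)

a = np.random.rand(1, 16)           # activations entering the layer
W = np.random.randn(16, 4)
b = np.zeros(4)
R_out = np.random.rand(1, 4)        # relevance arriving from the layer above
R_in = lrp_linear(a, W, b, R_out)
```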


Spotlight Poster
#W-120
Relational Invariant Learning for Robust Solvation Free Energy Prediction

Yeyun Chen

Predicting the solvation free energy of molecules using graph neural networks holds significant potential for advancing drug discovery and the design of novel materials. While previous methods have demonstrated success on independent and identically distributed (IID) datasets, their performance in out-of-distribution (OOD) scenarios remains largely unexplored. We propose a novel Relational Invariant Learning framework (RILOOD) to enhance OOD generalization in solvation free energy prediction. RILOOD comprises three key components: (i) a mixup-based conditional modeling module that integrates diverse environments, (ii) a novel multi-granularity refinement strategy that extends beyond core substructures to enable context-aware representation learning for capturing multi-level interactions, and (iii) an invariant learning mechanism that identifies robust patterns generalizable to unseen environments. Extensive experiments demonstrate that RILOOD significantly outperforms state-of-the-art methods across various distribution shifts, highlighting its effectiveness in improving solvation free energy prediction under diverse conditions.


Poster
#W-121
Interpolating Neural Network-Tensor Decomposition (INN-TD): a scalable and interpretable approach for large-scale physics-based problems

Jiachen Guo · Xiaoyu Xie · Chanwook Park · Hantao Zhang · Matthew Politis · Gino Domel · Jiachen Guo

Deep learning has been extensively employed as a powerful function approximator for modeling physics-based problems described by partial differential equations (PDEs). Despite their popularity, standard deep learning models often demand prohibitively large computational resources and yield limited accuracy when scaling to large-scale, high-dimensional physical problems. Their black-box nature further hinders their application in industrial problems where interpretability and high precision are critical. To overcome these challenges, this paper introduces Interpolating Neural Network-Tensor Decomposition (INN-TD), a scalable and interpretable framework that combines the merits of machine learning and finite element methods for modeling large-scale physical systems. By integrating locally supported interpolation functions from the finite element method into the network architecture, INN-TD achieves a sparse learning structure with enhanced accuracy, faster training/solving speed, and reduced memory footprint. This makes it particularly effective for tackling large-scale, high-dimensional parametric PDEs in training, solving, and inverse optimization tasks in physical problems where high precision is required.
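
A toy sketch of the core structure, under illustrative names and shapes: the solution field is represented as a sum of products of one-dimensional piecewise-linear (finite-element-style) interpolants with learnable nodal values, i.e. a CP-style tensor decomposition with locally supported basis functions.

```python
import numpy as np

n_modes, n_nodes = 4, 11
grid = np.linspace(0.0, 1.0, n_nodes)              # shared 1D finite-element mesh
nodal_x = np.random.randn(n_modes, n_nodes)        # learnable nodal values along x
nodal_y = np.random.randn(n_modes, n_nodes)        # learnable nodal values along y

def u(x, y):
    """u(x, y) ~ sum_m f_m(x) g_m(y), each factor a piecewise-linear interpolant."""
    fx = np.array([np.interp(x, grid, nodal_x[m]) for m in range(n_modes)])
    gy = np.array([np.interp(y, grid, nodal_y[m]) for m in range(n_modes)])
    return np.sum(fx * gy, axis=0)

print(u(np.array([0.25, 0.5]), np.array([0.75, 0.5])))
```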


Poster
#W-122
HybridGS: High-Efficiency Gaussian Splatting Data Compression using Dual-Channel Sparse Representation and Point Cloud Encoder

Qi Yang · Le Yang · Geert Van der Auwera · Zhu Li

Most existing 3D Gaussian Splatting (3DGS) compression schemes focus on producing a compact 3DGS representation via implicit data embedding. They have long encoding and decoding times and highly customized data formats, making widespread deployment difficult. This paper presents a new 3DGS compression framework called HybridGS, which takes advantage of both compact generation and standardized point cloud data encoding. HybridGS first generates compact and explicit 3DGS data. A dual-channel sparse representation is introduced to supervise the primitive position and feature bit depth. It then utilizes a canonical point cloud encoder to carry out further data compression and form standard output bitstreams. A simple and effective rate control scheme is proposed to pivot the interpretable data compression scheme. HybridGS does not include any modules aimed at improving 3DGS quality during generation, yet experimental results show that it still provides reconstruction performance comparable to state-of-the-art methods, with evidently faster encoding and decoding speed. The code is publicly available at https://github.com/Qi-Yangsjtu/HybridGS.


Poster
#W-123
CostFilter-AD: Enhancing Anomaly Detection through Matching Cost Filtering

Zhe Zhang · Mingxiu Cai · Hanxiao Wang · Gaochang Wu · Tianyou Chai · Xiatian Zhu

Unsupervised anomaly detection (UAD) seeks to localize the anomaly mask of an input image with respect to normal samples. Either by reconstructing normal counterparts (reconstruction-based) or by learning an image feature embedding space (embedding-based), existing approaches fundamentally rely on image-level or feature-level matching to derive anomaly scores. Often, such a matching process is inaccurate yet overlooked, leading to sub-optimal detection. To address this issue, we introduce the concept of cost filtering, borrowed from classical matching tasks, such as depth and flow estimation, into the UAD problem. We call this approach CostFilter-AD. Specifically, we first construct a matching cost volume between the input and normal samples, comprising two spatial dimensions and one matching dimension that encodes potential matches. To refine this, we propose a cost volume filtering network, guided by the input observation as an attention query across multiple feature layers, which effectively suppresses matching noise while preserving edge structures and capturing subtle anomalies. Designed as a generic post-processing plug-in, CostFilter-AD can be integrated with either reconstruction-based or embedding-based methods. Extensive experiments on MVTec-AD and VisA benchmarks validate the generic benefits of CostFilter-AD for both single- and multi-class UAD tasks. Code and models will be released at https://github.com/ZHE-SAPI/CostFilter-AD.
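
The sketch below shows one plausible way to build the matching cost volume the abstract refers to (cosine distance between input features and pooled features from normal samples); the attention-guided filtering network that refines it is omitted, and the shapes are illustrative.

```python
import torch
import torch.nn.functional as F

feat_in = torch.randn(1, 64, 32, 32)       # input image features  [B, C, H, W]
feat_ref = torch.randn(100, 64)            # pooled normal-sample features [K, C]

q = F.normalize(feat_in.flatten(2), dim=1)             # [B, C, H*W]
r = F.normalize(feat_ref, dim=1)                        # [K, C]
similarity = torch.einsum('bcn,kc->bkn', q, r)          # [B, K, H*W]
cost_volume = (1.0 - similarity).view(1, 100, 32, 32)   # 2 spatial dims + matching dim

# Before any filtering, a naive anomaly map is the smallest cost over the matching dim.
anomaly_map = cost_volume.min(dim=1).values             # [B, H, W]
```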


Poster
#W-124
HetSSNet: Spatial-Spectral Heterogeneous Graph Learning Network for Panchromatic and Multispectral Images Fusion

Mengting Ma · Yizhen Jiang · Mengjiao Zhao · Jiaxin Li · Wei Zhang

Remote sensing pansharpening aims to reconstruct spatial-spectral properties during the fusion of panchromatic (PAN) images and low-resolution multi-spectral (LR-MS) images, finally generating high-resolution multi-spectral (HR-MS) images. In the mainstream modeling strategies, i.e., CNNs and Transformers, the input images are treated as an equal-sized grid of pixels in Euclidean space, which limits their ability to handle remote sensing images with irregular ground objects. Graphs are a more flexible structure; however, there are two major challenges when modeling spatial-spectral properties with graphs: 1) constructing a customized graph structure for spatial-spectral relationship priors; 2) learning a unified spatial-spectral representation through the graph. To address these challenges, we propose the spatial-spectral heterogeneous graph learning network, named HetSSNet. Specifically, HetSSNet initially constructs the heterogeneous graph structure for pansharpening, which explicitly describes pansharpening-specific relationships. Subsequently, the basic relationship pattern generation module is designed to extract multiple relationship patterns from the heterogeneous graph. Finally, the relationship pattern aggregation module is exploited to collaboratively learn a unified spatial-spectral representation across different relationships among nodes, with adaptive importance learning from local and global perspectives. Extensive experiments demonstrate the significant superiority and generalization of HetSSNet.


Poster
#W-200
MIRROR: Make Your Object-Level Multi-View Generation More Consistent with Training-Free Rectification

TianChi Xing · Bonan Li · Congying Han · XINMIN QIU · Zicheng Zhang · Tiande Guo

Multi-view Diffusion has greatly advanced the development of 3D content creation by generating multiple images from distinct views, achieving remarkable photorealistic results. However, existing works are still vulnerable to inconsistent 3D geometric structures (commonly known as the Janus Problem) and severe artifacts. In this paper, we introduce MIRROR, a versatile plug-and-play method that rectifies such inconsistencies in a training-free manner, enabling the acquisition of high-fidelity, realistic structures without compromising diversity. Our key idea focuses on tracing the motion trajectory of physical points across adjacent viewpoints, enabling rectifications based on neighboring observations of the same region. Technically, MIRROR comprises two core modules: a Trajectory Tracking Module (TTM) for pixel-wise trajectory tracking that labels identical points across views, and a Feature Rectification Module (FRM) for explicit adjustment of each pixel embedding on noisy synthesized images by minimizing the distance to corresponding block features in neighboring views, thereby achieving consistent outputs. Extensive evaluations demonstrate that MIRROR can seamlessly integrate with a diverse range of off-the-shelf object-level multi-view diffusion models, significantly enhancing both the consistency and the fidelity in an efficient way.


Poster
#W-201
Generative Point Cloud Registration

Haobo Jiang · Jin Xie · jian Yang · Liang Yu · Jianmin Zheng

In this paper, we propose a novel 3D registration paradigm, Generative Point Cloud Registration, which bridges advanced 2D generative models with 3D matching tasks to enhance registration performance. Our key idea is to generate cross-view consistent image pairs that are well-aligned with the source and target point clouds, enabling geometric-color feature fusion to facilitate robust matching. To ensure high-quality matching, the generated image pair should feature both 2D-3D geometric consistency and cross-view texture consistency. To achieve this, we introduce Match-ControlNet, a matching-specific, controllable 2D generative model. Specifically, it leverages the depth-conditioned generation capability of ControlNet to produce images that are geometrically aligned with depth maps derived from point clouds, ensuring 2D-3D geometric consistency. Additionally, by incorporating a coupled conditional denoising scheme and coupled prompt guidance, Match-ControlNet further promotes cross-view feature interaction, guiding texture consistency generation. Our generative 3D registration paradigm is general and could be seamlessly integrated into various registration methods to enhance their performance. Extensive experiments on 3DMatch and ScanNet datasets verify the effectiveness of our approach.


Poster
#W-202
Feature out! Let Raw Image as Your Condition for Blind Face Restoration

XINMIN QIU · Gege Chen · Bonan Li · Congying Han · Tiande Guo · Zicheng Zhang

Blind face restoration (BFR), which involves converting low-quality (LQ) images into high-quality (HQ) images, remains challenging due to complex and unknown degradations. While previous diffusion-based methods utilize feature extractors from LQ images as guidance, using raw LQ images directly as the starting point for the reverse diffusion process offers a theoretically optimal solution. In this work, we propose Pseudo-Hashing Image-to-image Schrödinger Bridge (P-I2SB), a novel framework inspired by optimal mass transport problems, which enhances the restoration potential of Schrödinger Bridge (SB) by correcting data distributions and effectively learning the optimal transport path between any two data distributions. Notably, we theoretically explore and identify that existing methods are limited by the optimality and reversibility of solutions in SB, leading to suboptimal performance. Our approach involves preprocessing HQ images during training by hashing them into pseudo-samples according to a rule related to LQ images, ensuring structural similarity in distribution. This guarantees optimal and reversible solutions in SB, enabling the inference process to learn effectively and allowing P-I2SB to achieve state-of-the-art results in BFR, with more natural textures and retained inference speed compared to previous methods.


Poster
#W-203
VTGaussian-SLAM: RGBD SLAM for Large Scale Scenes with Splatting View-Tied 3D Gaussians

Pengchong Hu · Zhizhong Han

Jointly estimating camera poses and mapping scenes from RGBD images is a fundamental task in simultaneous localization and mapping (SLAM). State-of-the-art methods employ 3D Gaussians to represent a scene and render these Gaussians through splatting for higher efficiency and better rendering. However, these methods cannot scale up to extremely large scenes, due to inefficient tracking and mapping strategies that need to optimize all 3D Gaussians in limited GPU memory throughout training to maintain geometry and color consistency with previous RGBD observations. To resolve this issue, we propose novel tracking and mapping strategies that work with a novel 3D representation, dubbed view-tied 3D Gaussians, for RGBD SLAM systems. View-tied 3D Gaussians are simplified Gaussians tied to depth pixels, without the need to learn locations, rotations, and multi-dimensional variances. Tying Gaussians to views not only significantly saves storage but also allows us to employ many more Gaussians to represent local details within limited GPU memory. Moreover, our strategies remove the need to keep all Gaussians learnable throughout training, while improving rendering quality and tracking accuracy. We justify the effectiveness of these designs and report better performance over the latest methods on widely used benchmarks in terms of rendering quality, tracking accuracy, and scalability. Please see our project page for code and videos at https://machineperceptionlab.github.io/VTGaussian-SLAM-Project.


Spotlight Poster
#W-204
VideoRoPE: What Makes for Good Video Rotary Position Embedding?

Xilin Wei · Xiaoran Liu · Yuhang Zang · Xiaoyi Dong · Pan Zhang · Yuhang Cao · Jian Tong · Haodong Duan · Qipeng Guo · Jiaqi Wang · Xipeng Qiu · Dahua Lin

While Rotary Position Embedding (RoPE) and its variants are widely adopted for their long-context capabilities, the extension of the 1D RoPE to video, with its complex spatio-temporal structure, remains an open challenge. This work first introduces a comprehensive analysis that identifies four key characteristics essential for the effective adaptation of RoPE to video, which have not been fully considered in prior work. As part of our analysis, we introduce a challenging V-NIAH-D (Visual Needle-In-A-Haystack with Distractors) task, which adds periodic distractors into V-NIAH. The V-NIAH-D task demonstrates that previous RoPE variants, lacking appropriate temporal dimension allocation, are easily misled by distractors. Based on our analysis, we introduce VideoRoPE, with a 3D structure designed to preserve spatio-temporal relationships. VideoRoPE features low-frequency temporal allocation to mitigate periodic oscillations, a diagonal layout to maintain spatial symmetry, and adjustable temporal spacing to decouple temporal and spatial indexing. VideoRoPE consistently surpasses previous RoPE variants across diverse downstream tasks such as long video retrieval, video understanding, and video hallucination. Our code and model weights will be publicly released.
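
To illustrate the general shape of a 3D rotary embedding (not the exact VideoRoPE allocation), the sketch below splits the head dimension into temporal, height, and width groups and rotates each with its own position index, using a larger base (hence lower frequencies) for time; the diagonal layout and adjustable temporal spacing from the paper are not modeled here.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding over the last (even-sized) dimension."""
    d = x.shape[-1]
    freqs = 1.0 / (base ** (torch.arange(0, d, 2, dtype=torch.float32) / d))
    angles = pos[:, None] * freqs[None, :]                  # [N, d/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    return torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1).flatten(-2)

def rope_3d(x, t_idx, h_idx, w_idx):
    """Apply RoPE separately to three equal chunks of the head dimension."""
    d = x.shape[-1] // 3
    xt, xh, xw = x[..., :d], x[..., d:2 * d], x[..., 2 * d:]
    return torch.cat([rope_1d(xt, t_idx, base=1e6),   # low temporal frequencies
                      rope_1d(xh, h_idx),
                      rope_1d(xw, w_idx)], dim=-1)

tokens = torch.randn(8, 48)                  # 8 video tokens, head dim 48
t = torch.arange(8, dtype=torch.float32)     # temporal positions
h = torch.zeros(8)                           # spatial positions (toy values)
w = torch.zeros(8)
rotated = rope_3d(tokens, t, h, w)
```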


Spotlight Poster
#W-205
ReferSplat: Referring Segmentation in 3D Gaussian Splatting

Shuting He · Guangquan Jie · Changshuo Wang · Yun Zhou · Shuming Hu · Guanbin Li · Henghui Ding

We introduce Referring 3D Gaussian Splatting Segmentation (R3DGS), a new task that aims to segment target objects in a 3D Gaussian scene based on natural language descriptions, which often contain spatial relationships or object attributes. This task requires the model to identify newly described objects that may be occluded or not directly visible in a novel view, posing a significant challenge for 3D multi-modal understanding. Developing this capability is crucial for advancing embodied AI. To support research in this area, we construct the first R3DGS dataset, Ref-LERF. Our analysis reveals that 3D multi-modal understanding and spatial relationship modeling are key challenges for R3DGS. To address these challenges, we propose ReferSplat, a framework that explicitly models 3D Gaussian points with natural language expressions in a spatially aware paradigm. ReferSplat achieves state-of-the-art performance on both the newly proposed R3DGS task and 3D open-vocabulary segmentation benchmarks. Dataset and code are available at https://github.com/heshuting555/ReferSplat.


Poster
#W-206
Privacy-Shielded Image Compression: Defending Against Exploitation from Vision-Language Pretrained Models

Xuelin Shen · Jiayin Xu · Kangsheng Yin · Wenhan Yang

The improved semantic understanding of vision-language pretrained (VLP) models has made it increasingly difficult to protect publicly posted images from being exploited by search engines and other similar tools. In this context, this paper seeks to protect users' privacy by implementing defenses at the image compression stage to prevent exploitation. Specifically, we propose a flexible coding method, termed Privacy-Shielded Image Compression (PSIC), that can produce bitstreams with multiple decoding options. By default, the bitstream is decoded to preserve satisfactory perceptual quality while preventing interpretation by VLP models. Our method also retains the original image compression functionality. With a customizable input condition, the proposed scheme can reconstruct the image that preserves its full semantic information. A Conditional Latent Trigger Generation (CLTG) module is proposed to produce bias information based on customizable conditions to guide the decoding process into different reconstructed versions, and an Uncertainty-Aware Encryption-Oriented (UAEO) optimization function is designed to leverage the soft labels inferred from the target VLP model's uncertainty on the training data. This paper further incorporates an adaptive multi-objective optimization strategy to obtain improved encrypting performance and perceptual quality simultaneously within a unified training process. The proposed scheme is plug-and-play and can be seamlessly integrated into most existing Learned Image Compression (LIC) models. Extensive experiments across multiple downstream tasks have demonstrated the effectiveness of our design.


Poster
#W-207
Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

Dejia Xu · Yifan Jiang · Chen Huang · Liangchen Song · Thorsten Gernoth · Liangliang Cao · Zhangyang “Atlas” Wang · Hao Tang

In recent years there have been remarkable breakthroughs in image-to-video generation. However, the 3D consistency and camera controllability of generated frames have remained unsolved. Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene. To address these limitations, we introduce Cavia, a novel framework for camera-controllable, multi-view video generation, capable of converting an input image into multiple spatiotemporally consistent videos. Our framework extends the spatial and temporal attention modules into view-integrated attention modules, improving both viewpoint and temporal consistency. This flexible design allows for joint training with diverse curated data sources, including scene-level static videos, object-level synthetic multi-view dynamic videos, and real-world monocular dynamic videos. To the best of our knowledge, Cavia is the first framework that enables users to generate multiple videos of the same scene with precise control over camera motion, while simultaneously preserving object motion. Extensive experiments demonstrate that Cavia surpasses state-of-the-art methods in terms of geometric consistency and perceptual quality.


Poster
#W-208
Asymmetric Decision-Making in Online Knowledge Distillation: Unifying Consensus and Divergence

zhaowei chen · Borui Zhao · Yuchen Ge · Yuhao Chen · Renjie Song · Jiajun Liang

Online Knowledge Distillation (OKD) methods represent a streamlined, one-stage distillation training process that obviates the necessity of transferring knowledge from a pretrained teacher network to a more compact student network. In contrast to existing logits-based OKD methods, this paper presents an innovative approach to leverage intermediate spatial representations. Our analysis of the intermediate features from both teacher and student models reveals two pivotal insights: (1) the similar features between students and teachers are predominantly focused on the foreground objects. (2) teacher models emphasize foreground objects more than students. Building on these findings, we propose Asymmetric Decision-Making (ADM) to enhance feature consensus learning for student models while continuously promoting feature diversity in teacher models. Specifically, Consensus Learning for student models prioritizes spatial features with high consensus relative to teacher models. Conversely, Divergence Learning for teacher models highlights spatial features with lower similarity compared to student models, indicating superior performance by teacher models in these regions. Consequently, ADM facilitates the student models to catch up with the feature learning process of the teacher models. Extensive experiments demonstrate that ADM consistently surpasses existing OKD methods across various online knowledge distillation settings and also achieves superior results when transferred to offline knowledge distillation, semantic segmentation and diffusion distillation tasks.


Poster
#W-209
Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model

SHEN FEI · Cong Wang · Junyao Gao · Qin Guo · Jisheng Dang · Jinhui Tang · Tat-Seng Chua

Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these, we introduce the \textbf{M}otion-priors \textbf{C}onditional \textbf{D}iffusion \textbf{M}odel (\textbf{MCDM}), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key elements: (1) an archived-clip motion-prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate predictions of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features. We also introduce the {TalkingFace-Wild} dataset, a multilingual collection of over 200 hours of footage across 10 languages. Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation.


Poster
#W-210
Efficient Motion Prompt Learning for Robust Visual Tracking

Jie Zhao · Xin Chen · Yongsheng Yuan · Michael Felsberg · Dong Wang · Huchuan Lu

Due to the challenges of processing temporal information, most trackers depend solely on visual discriminability and overlook the unique temporal coherence of video data. In this paper, we propose a lightweight and plug-and-play motion prompt tracking method. It can be easily integrated into existing vision-based trackers to build a joint tracking framework leveraging both motion and vision cues, thereby achieving robust tracking through efficient prompt learning. A motion encoder with three different positional encodings is proposed to encode the long-term motion trajectory into the visual embedding space, while a fusion decoder and an adaptive weight mechanism are designed to dynamically fuse visual and motion features. We integrate our motion module into three different trackers with five models in total. Experiments on seven challenging tracking benchmarks demonstrate that the proposed motion module significantly improves the robustness of vision-based trackers, with minimal training costs and negligible speed sacrifice. Code is available at https://github.com/zj5559/Motion-Prompt-Tracking.


Spotlight Poster
#W-211
PhySpec: Physically Consistent Spectral Reconstruction via Orthogonal Subspace Decomposition and Self-Supervised Meta-Auxiliary Learning

Xingxing Yang · Jie Chen · Zaifeng Yang

This paper presents a novel approach to hyperspectral image (HSI) reconstruction from RGB images, addressing fundamental limitations in existing learning-based methods from a physical perspective. We discuss and aim to address the ``colorimetric dilemma'': failure to consistently reproduce ground-truth RGB from predicted HSI, thereby compromising physical integrity and reliability in practical applications. To tackle this issue, we propose PhySpec, a physically consistent framework for robust HSI reconstruction. Our approach fundamentally exploits the intrinsic physical relationship between HSIs and corresponding RGBs by employing orthogonal subspace decomposition, which enables explicit estimation of camera spectral sensitivity (CSS). This ensures that our reconstructed spectra align with well-established physical principles, enhancing their reliability and fidelity. Moreover, to efficiently use internal information from test samples, we propose a self-supervised meta-auxiliary learning (MAXL) strategy that rapidly adapts the trained parameters to unseen samples using only a few gradient descent steps at test time, while simultaneously constraining the generated HSIs to accurately recover ground-truth RGB values. Thus, MAXL reinforces the physical integrity of the reconstruction process. Extensive qualitative and quantitative evaluations validate the efficacy of our proposed framework, showing superior performance compared to SOTA methods.
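
The physical relationship at the heart of the colorimetric dilemma is that RGB values are (approximately) a linear projection of the hyperspectral signal through the camera spectral sensitivity (CSS). The snippet below shows only that linear model and a least-squares CSS estimate from paired data; the orthogonal subspace decomposition and MAXL test-time adaptation in the paper are not reproduced here.

```python
import numpy as np

n_pixels, n_bands = 5000, 31
hsi = np.random.rand(n_pixels, n_bands)          # hyperspectral radiance per pixel
S_true = np.random.rand(n_bands, 3)              # unknown CSS (bands -> R, G, B)
rgb = hsi @ S_true + 0.01 * np.random.randn(n_pixels, 3)   # observed RGB with noise

S_est, *_ = np.linalg.lstsq(hsi, rgb, rcond=None)   # least-squares CSS estimate
rgb_reproduced = hsi @ S_est                         # consistency check against the RGB
print(np.abs(rgb_reproduced - rgb).mean())
```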


Poster
#W-212
PISA Experiments: Exploring Physics Post-Training for Video Diffusion Models by Watching Stuff Drop

Chenyu Li · Oscar Michel · Xichen Pan · Sainan Liu · Mike Roberts · Saining Xie

Large-scale pre-trained video generation models excel in content creation but are not reliable as physically accurate world simulators out of the box. This work studies the process of post-training these models for accurate world modeling through the lens of the simple, yet fundamental, physics task of modeling object freefall. We show state-of-the-art video generation models struggle with this basic task, despite their visually impressive outputs. To remedy this problem, we find that fine-tuning on a relatively small amount of simulated videos is effective in inducing the dropping behavior in the model, and we can further improve results through a novel reward modeling procedure we introduce. Our study also reveals key limitations of post-training in generalization and distribution modeling. Additionally, we release a benchmark for this task that may serve as a useful diagnostic tool for tracking physical accuracy in large-scale video generative model development. Code is available at this repository: https://github.com/vision-x-nyu/pisa-experiments.


Poster
#W-213
SafeMap: Robust HD Map Construction from Incomplete Observations

Xiaoshuai Hao · Lingdong Kong · Rong Yin · Pengwei Wang · Jing Zhang · Yunfeng Diao · Shu Zhao

Robust high-definition (HD) map construction is vital for autonomous driving, yet existing methods often struggle with incomplete multi-view camera data. This paper presents SafeMap, a novel framework specifically designed to ensure accuracy even when certain camera views are missing. SafeMap integrates two key components: the Gaussian-based Perspective View Reconstruction (G-PVR) module and the Distillation-based Bird’s-Eye-View (BEV) Correction (D-BEVC) module. G-PVR leverages prior knowledge of view importance to dynamically prioritize the most informative regions based on the relationships among available camera views. Furthermore, D-BEVC utilizes panoramic BEV features to correct the BEV representations derived from incomplete observations. Together, these components facilitate comprehensive data reconstruction and robust HD map generation. SafeMap is easy to implement and integrates seamlessly into existing systems, offering a plug-and-play solution for enhanced robustness. Experimental results demonstrate that SafeMap significantly outperforms previous methods in both complete and incomplete scenarios, highlighting its superior performance and resilience.


Poster
#W-214
When Model Knowledge meets Diffusion Model: Diffusion-assisted Data-free Image Synthesis with Alignment of Domain and Class

Yujin Kim · Hyunsoo Kim · Hyunwoo Kim · Suhyun Kim

Open-source pre-trained models hold great potential for diverse applications, but their utility declines when their training data is unavailable. Data-Free Image Synthesis (DFIS) aims to generate images that approximate the learned data distribution of a pre-trained model without accessing the original data. However, existing DFIS methods produce samples that deviate from the training data distribution due to the lack of prior knowledge about natural images. To overcome this limitation, we propose DDIS, the first Diffusion-assisted Data-free Image Synthesis method that leverages a text-to-image diffusion model as a powerful image prior, improving synthetic image quality. DDIS extracts knowledge about the learned distribution from the given model and uses it to guide the diffusion model, enabling the generation of images that accurately align with the training data distribution. To achieve this, we introduce Domain Alignment Guidance (DAG) that aligns the synthetic data domain with the training data domain during the diffusion sampling process. Furthermore, we optimize a single Class Alignment Token (CAT) embedding to effectively capture class-specific attributes in the training dataset. Experiments on PACS and ImageNet demonstrate that DDIS outperforms prior DFIS methods by generating samples that better reflect the training data distribution, achieving SOTA performance in data-free applications.


Poster
#W-215
Learning Event Completeness for Weakly Supervised Video Anomaly Detection

Yu Wang · Shiwei Chen

Weakly supervised video anomaly detection (WS-VAD) is tasked with pinpointing temporal intervals containing anomalous events within untrimmed videos, utilizing only video-level annotations. However, a significant challenge arises due to the absence of dense frame-level annotations, often leading to incomplete localization in existing WS-VAD methods. To address this issue, we present a novel LEC-VAD, Learning Event Completeness for Weakly Supervised Video Anomaly Detection, which features a dual structure designed to encode both category-aware and category-agnostic semantics between vision and language. Within LEC-VAD, we devise semantic regularities that leverage an anomaly-aware Gaussian mixture to learn precise event boundaries, thereby yielding more complete event instances. Besides, we develop a novel memory bank-based prototype learning mechanism to enrich concise text descriptions associated with anomaly-event categories. This innovation bolsters the text's expressiveness, which is crucial for advancing WS-VAD. Our LEC-VAD demonstrates remarkable advancements over the current state-of-the-art methods on two benchmark datasets XD-Violence and UCF-Crime.


Poster
#W-216
Visual Autoregressive Modeling for Image Super-Resolution

Yunpeng Qu · Kun Yuan · Jinhua Hao · Kai Zhao · Qizhi Xie · Ming Sun · Chao Zhou

Image Super-Resolution (ISR) has seen significant progress with the introduction of remarkable generative models. However, challenges such as the trade-off between fidelity and realism, as well as computational complexity, have also posed limitations on their application. Building upon the tremendous success of autoregressive models in the language domain, we propose \textbf{VARSR}, a novel visual autoregressive modeling framework for ISR in the form of next-scale prediction. To effectively integrate and preserve semantic information in low-resolution images, we propose using prefix tokens to incorporate the condition. Scale-aligned Rotary Positional Encodings are introduced to capture spatial structures, and a diffusion refiner is utilized to model the quantization residual loss and achieve pixel-level fidelity. Image-based Classifier-free Guidance is proposed to guide the generation of more realistic images. Furthermore, we collect large-scale data and design a training process to obtain robust generative priors. Quantitative and qualitative results show that VARSR is capable of generating high-fidelity and high-realism images with more efficiency than diffusion-based methods. Our codes are released at \url{https://github.com/quyp2000/VARSR}.


Poster
#W-217
Beyond Confidence: Exploiting Homogeneous Pattern for Semi-Supervised Semantic Segmentation

Rui Sun · Huayu Mai · Wangkai Li · Yujia Chen · Naisong Luo · Yuan Wang · Tianzhu Zhang

The critical challenge of semi-supervised semantic segmentation lies in how to fully exploit a large volume of unlabeled data to improve the model's generalization performance for robust segmentation. Existing methods mainly rely on confidence-based scoring functions in the prediction space to filter pseudo labels, which suffer from the inherent trade-off between true and false positive rates. In this paper, we carefully design an agent construction strategy to build clean sets of correct (positive) and incorrect (negative) pseudo labels, and propose the Agent Score function (AgScore) to measure the consensus between candidate pixels and these sets. In this way, AgScore takes a step further to capture homogeneous patterns in the embedding space, conditioned on clean positive/negative agents stemming from the prediction space, without sacrificing the merits of the confidence score, yielding a better trade-off. We provide theoretical analysis to understand the mechanism of AgScore, and demonstrate its effectiveness by integrating it into three semi-supervised segmentation frameworks on the Pascal VOC, Cityscapes, and COCO datasets, showing consistent improvements across all data partitions.
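
A schematic of the scoring idea, with hypothetical shapes: compare a candidate pixel embedding against clean positive and negative agent sets and keep the pseudo label only when consensus with the positive agents dominates. The exact AgScore definition in the paper may differ.

```python
import torch
import torch.nn.functional as F

emb = F.normalize(torch.randn(512), dim=0)             # candidate pixel embedding
pos_agents = F.normalize(torch.randn(10, 512), dim=1)  # agents from correct pseudo labels
neg_agents = F.normalize(torch.randn(10, 512), dim=1)  # agents from incorrect pseudo labels

pos_sim = (pos_agents @ emb).max()     # consensus with the closest positive agent
neg_sim = (neg_agents @ emb).max()     # consensus with the closest negative agent
agent_score = pos_sim - neg_sim        # high: trust the pseudo label; low: filter it out
```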


Poster
#W-218
UniMC: Taming Diffusion Transformer for Unified Keypoint-Guided Multi-Class Image Generation

Qin Guo · Ailing Zeng · Dongxu Yue · Ceyuan Yang · Yang Cao · Hanzhong Guo · SHEN FEI · Wei Liu · Xihui Liu · Dan Xu

Although significant advancements have been achieved in the progress of keypoint-guided Text-to-Image diffusion models, existing mainstream keypoint-guided models encounter challenges in controlling the generation of more general non-rigid objects beyond humans (e.g., animals). Moreover, it is difficult to generate multiple overlapping humans and animals based on keypoint controls solely. These challenges arise from two main aspects: the inherent limitations of existing controllable methods and the lack of suitable datasets. First, we design a DiT-based framework, named UniMC, to explore unifying controllable multi-class image generation. UniMC integrates instance- and keypoint-level conditions into compact tokens, incorporating attributes such as class, bounding box, and keypoint coordinates. This approach overcomes the limitations of previous methods that struggled to distinguish instances and classes due to their reliance on skeleton images as conditions. Second, we propose HAIG-2.9M, a large-scale, high-quality, and diverse dataset designed for keypoint-guided human and animal image generation. HAIG-2.9M includes 786K images with 2.9M instances. This dataset features extensive annotations such as keypoints, bounding boxes, and fine-grained captions for both humans and animals, along with rigorous manual inspection to ensure annotation accuracy. Extensive experiments demonstrate the high quality of HAIG-2.9M and the effectiveness of UniMC, particularly in heavy occlusions and multi-class scenarios.


Spotlight Poster
#W-219
Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection

Zhiyuan Yan · Jiangming Wang · Peng Jin · Ke-Yue Zhang · Chengchun Liu · Shen Chen · Taiping Yao · Shouhong Ding · Baoyuan Wu · Li Yuan

Detecting AI-generated images (AIGIs), such as natural images or face images, has become increasingly important yet challenging. In this paper, we start from a new perspective to excavate the reason behind the generalization failure in AIGI detection, named the asymmetry phenomenon: a naively trained detector tends to overfit to the limited and monotonous fake patterns, causing the feature space to become highly constrained and low-ranked, which is shown to seriously limit expressivity and generalization. One potential remedy is incorporating the pre-trained knowledge within vision foundation models (higher-ranked) to expand the feature space, alleviating the model's overfitting to fake patterns. To this end, we employ Singular Value Decomposition (SVD) to decompose the original feature space into two orthogonal subspaces. By freezing the principal components and adapting only the remaining components, we preserve the pre-trained knowledge while learning fake patterns. Compared to existing full-parameter and LoRA-based tuning methods, we explicitly ensure orthogonality, enabling a higher rank of the whole feature space, effectively minimizing overfitting and enhancing generalization. We finally identify a crucial insight: our method implicitly learns a vital prior that fakes are actually derived from the real, indicating a hierarchical relationship rather than independence. Modeling this prior, we believe, is essential for achieving superior generalization. Our codes are publicly available at https://github.com/YZY-stack/Effort-AIGI-Detection.
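
The SVD-based tuning scheme can be sketched in a few lines: decompose a pretrained weight, keep the top singular directions frozen, and expose only the residual factors as trainable parameters. The rank and the layers to which this is applied are assumptions here, not the paper's settings.

```python
import torch

W = torch.randn(768, 768)                         # a pretrained weight matrix
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

r = 64                                            # number of frozen principal components
W_principal = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r]       # frozen pretrained knowledge
U_res = torch.nn.Parameter(U[:, r:].clone())              # trainable residual subspace
S_res = torch.nn.Parameter(S[r:].clone())
V_res = torch.nn.Parameter(Vh[r:].clone())

def effective_weight():
    # The principal part stays fixed; only the orthogonal residual adapts to fake patterns.
    return W_principal + U_res @ torch.diag(S_res) @ V_res
```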


Poster
#W-220
Unifying 2D and 3D Vision-Language Understanding

Ayush Jain · Alexander Swerdlow · Yuzhou Wang · Sergio Arnaud · Ada Martin · Alexander Sax · Franziska Meier · Katerina Fragkiadaki

Progress in 3D vision-language learning has been hindered by the scarcity of large-scale 3D datasets. We introduce UniVLG, a unified architecture for 2D and 3D vision-language understanding that bridges the gap between existing 2D-centric models and the rich 3D sensory data available in embodied systems. Our approach initializes most model weights from pre-trained 2D models and trains on both 2D and 3D vision-language data. We propose a novel language-conditioned mask decoder shared across 2D and 3D modalities to ground objects effectively in both RGB and RGB-D images, outperforming box-based approaches. To further reduce the domain gap between 2D and 3D, we incorporate 2D-to-3D lifting strategies, enabling UniVLG to utilize 2D data to enhance 3D performance. With these innovations, our model achieves state-of-the-art performance across multiple 3D vision-language grounding tasks, demonstrating the potential of transferring advances from 2D vision-language learning to the data-constrained 3D domain. Furthermore, co-training on both 2D and 3D data enhances performance across modalities without sacrificing 2D capabilities. By removing the reliance on 3D mesh reconstruction and ground-truth object proposals, UniVLG sets a new standard for realistic, embodied-aligned evaluation. Code and additional visualizations are available at https://univlg.github.io.


Poster
#W-221
DyPolySeg: Taylor Series-Inspired Dynamic Polynomial Fitting Network for Few-shot Point Cloud Semantic Segmentation

Changshuo Wang · Xiang Fang · Prayag Tiwari

Few-shot point cloud semantic segmentation effectively addresses data scarcity by identifying unlabeled query samples through semantic prototypes generated from a small set of labeled support samples. However, pre-training-based methods suffer from domain shifts and increased training time. Additionally, existing methods using DGCNN as the backbone have limited geometric structure modeling capabilities and struggle to bridge the categorical information gap between query and support sets. To address these challenges, we propose DyPolySeg, a pre-training-free Dynamic Polynomial fitting network for few-shot point cloud semantic segmentation. Specifically, we design a unified Dynamic Polynomial Convolution (DyPolyConv) that extracts flat and detailed features of local geometry through Low-order Convolution (LoConv) and Dynamic High-order Convolution (DyHoConv), complemented by Mamba Block for capturing global context information. Furthermore, we propose a lightweight Prototype Completion Module (PCM) that reduces structural differences through self-enhancement and interactive enhancement between query and support sets. Experiments demonstrate that DyPolySeg achieves state-of-the-art performance on S3DIS and ScanNet datasets.


Poster
#W-300
sciLaMA: A Single-Cell Representation Learning Framework to Leverage Prior Knowledge from Large Language Models

Hongru Hu · Shuwen Zhang · Yongin Choi · Venkat Malladi · Gerald Quon

Single-cell RNA sequencing (scRNA-seq) enables high-resolution exploration of cellular diversity and gene regulation, yet analyzing such data remains challenging due to technical and methodological limitations. Existing task-specific deep generative models like Variational Auto-Encoder (VAE) and its variants struggle to incorporate external biological knowledge, while transformer-based foundational large Language Models (LLMs or large LaMs) face limitations in computational cost and applicability to tabular gene expression data. Here, we introduce sciLaMA (single-cell interpretable Language Model Adapter), a novel representation learning framework that bridges these gaps by integrating static gene embeddings from multimodal LaMs with scRNA-seq tabular data through a paired-VAE architecture. Our approach generates context-aware representations for both cells and genes and outperforms state-of-the-art methods in key single-cell downstream tasks, including batch effect correction, cell clustering, and cell-state-specific gene marker and module identification, while maintaining computational efficiency. sciLaMA offers a computationally efficient, unified framework for comprehensive single-cell data analysis and biologically interpretable gene module discovery.


Poster
#W-301
CellFlux: Simulating Cellular Morphology Changes via Flow Matching

Yuhui Zhang · Yuchang Su · Chenyu Wang · Tianhong Li · Zoe Wefers · Jeffrey J. Nirschl · James Burgess · Daisy Yi Ding · Alejandro Lozano · Emma Lundberg · Serena Yeung

Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology. We introduce CellFlux, an image-generative model that simulates cellular morphology changes induced by chemical and genetic perturbations using flow matching. Unlike prior methods, CellFlux models distribution-wise transformations from unperturbed to perturbed cell states, effectively distinguishing actual perturbation effects from experimental artifacts such as batch effects—a major challenge in biological data. Evaluated on chemical (BBBC021), genetic (RxRx1), and combined perturbation (JUMP) datasets, CellFlux generates biologically meaningful cell images that faithfully capture perturbation-specific morphological changes, achieving a 35% improvement in FID scores and a 12% increase in mode-of-action prediction accuracy over existing methods. Additionally, CellFlux enables continuous interpolation between cellular states, providing a potential tool for studying perturbation dynamics. These capabilities mark a significant step toward realizing virtual cell modeling for biomedical research. Project page: https://yuhui-zh15.github.io/CellFlux/.
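
The flow-matching objective underlying this family of models is compact enough to state directly; the generic sketch below (placeholder network and feature dimensions, no perturbation conditioning) regresses a velocity field onto the displacement between unperturbed and perturbed samples along a linear path.

```python
import torch
import torch.nn as nn

velocity = nn.Sequential(nn.Linear(65, 128), nn.SiLU(), nn.Linear(128, 64))

def flow_matching_loss(x0, x1):
    t = torch.rand(x0.size(0), 1)
    x_t = (1 - t) * x0 + t * x1                  # point on the interpolation path
    target = x1 - x0                             # constant velocity along the path
    pred = velocity(torch.cat([x_t, t], dim=1))  # velocity prediction conditioned on t
    return ((pred - target) ** 2).mean()

x0 = torch.randn(32, 64)   # features of unperturbed cell images (placeholder)
x1 = torch.randn(32, 64)   # features of perturbed cell images (placeholder)
loss = flow_matching_loss(x0, x1)
loss.backward()
```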


Spotlight Poster
#W-302
Raptor: Scalable Train-Free Embeddings for 3D Medical Volumes Leveraging Pretrained 2D Foundation Models

Ulzee An · Moonseong Jeong · Simon Lee · Aditya Gorla · Yuzhe Yang · Sriram Sankararaman

Current challenges in developing foundational models for volumetric imaging data, such as magnetic resonance imaging (MRI), stem from the computational complexity of state-of-the-art architectures in high dimensions and curating sufficiently large datasets of volumes. To address these challenges, we introduce Raptor (Random Planar Tensor Reduction), a train-free method for generating semantically rich embeddings for volumetric data. Raptor leverages a frozen 2D foundation model, pretrained on natural images, to extract visual tokens from individual cross-sections of medical volumes. These tokens are then spatially compressed using random projections, significantly reducing computational complexity while retaining rich semantic information. Extensive experiments on 10 diverse medical volume tasks verify the superior performance of Raptor over state-of-the-art methods, including those pretrained exclusively on medical volumes (+3 SuPreM, +6 MISFM, +10 Merlin, +13 VoCo, and +14 SLIViT), while entirely bypassing the need for costly training. Our results highlight Raptor's effectiveness and versatility as a foundation for advancing deep learning-based methods for medical volumes (code: github.com/sriramlab/raptor).
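
The recipe in the abstract (frozen 2D encoder over cross-sections, then random projection) can be caricatured as below; `encode_slice` is a placeholder that ignores its input, standing in for a real 2D foundation model, and the projection sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_slice(slice_2d, token_dim=64):
    """Placeholder for a frozen 2D foundation model (a real model would use the slice)."""
    return rng.standard_normal(token_dim)

volume = rng.standard_normal((32, 224, 224))             # a 3D medical volume
tokens = np.stack([encode_slice(s) for s in volume])     # [n_slices, token_dim]

# Random projection compresses the stacked tokens into a fixed-size embedding.
proj = rng.standard_normal((tokens.size, 256)) / np.sqrt(tokens.size)
embedding = tokens.ravel() @ proj                        # train-free volume embedding
```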


Poster
#W-303
All-atom inverse protein folding through discrete flow matching

Kai Yi · Kiarash Jamali · Sjors Scheres

The recent breakthrough of AlphaFold3 in modeling complex biomolecular interactions, including those between proteins and ligands, nucleotides, or metal ions, creates new opportunities for protein design. In so-called inverse protein folding, the objective is to find a sequence of amino acids that adopts a target protein structure. Many inverse folding methods struggle to predict sequences for complexes that contain non-protein components, and perform poorly with complexes that adopt multiple structural states. To address these challenges, we present ADFLIP (All-atom Discrete FLow matching Inverse Protein folding), a generative model based on discrete flow-matching for designing protein sequences conditioned on all-atom structural contexts. ADFLIP progressively incorporates predicted amino acid side chains as structural context during sequence generation and enables the design of dynamic protein complexes through ensemble sampling across multiple structural states. Furthermore, ADFLIP implements training-free classifier guidance sampling, which allows the incorporation of arbitrary pre-trained models to optimise the designed sequence for desired protein properties. We evaluated the performance of ADFLIP on protein complexes with small-molecule ligands, nucleotides, or metal ions, including dynamic complexes for which structure ensembles were determined by nuclear magnetic resonance (NMR). Our model achieves state-of-the-art performance in single-structure and multi-structure inverse folding tasks, demonstrating excellent potential for all-atom protein design. The code is available at https://github.com/ykiiiiii/ADFLIP .


Poster
#W-304
Aligning Protein Conformation Ensemble Generation with Physical Feedback

Jiarui Lu · Xiaoyin Chen · Stephen Lu · Aurelie Lozano · Vijil Chenthamarakshan · Payel Das · Jian Tang

Protein dynamics play a crucial role in protein biological functions and properties, and their traditional study typically relies on time-consuming molecular dynamics (MD) simulations conducted in silico. Recent advances in generative modeling, particularly denoising diffusion models, have enabled efficient and accurate protein structure prediction and conformation sampling by learning distributions over crystallographic structures. However, effectively integrating physical supervision into these data-driven approaches remains challenging, as standard energy-based objectives often lead to intractable optimization. In this paper, we introduce Energy-based Alignment (EBA), a method that aligns generative models with feedback from physical models, efficiently calibrating them to appropriately balance conformational states based on their energy differences. Experimental results on the MD ensemble benchmark demonstrate that EBA achieves state-of-the-art performance in generating high-quality protein ensembles. By improving the physical plausibility of generated structures, our approach enhances model predictions and holds promise for applications in structural biology and drug discovery.


Poster
#W-305
CombiMOTS: Combinatorial Multi-Objective Tree Search for Dual-Target Molecule Generation

Thibaud Southiratn · Bonil Koo · Yijingxiu Lu · Sun Kim

Dual-target molecule generation, which focuses on discovering compounds capable of interacting with two target proteins, has garnered significant attention due to its potential for improving therapeutic efficiency, safety, and resistance mitigation. Existing approaches face two critical challenges. First, by simplifying the complex dual-target optimization problem to scalarized combinations of individual objectives, they fail to capture important trade-offs between target engagement and molecular properties. Second, they typically do not integrate synthetic planning into the generative process. This highlights a need for more appropriate objective function design and synthesis-aware methodologies tailored to the dual-target molecule generation task. In this work, we propose CombiMOTS, a Pareto Monte Carlo Tree Search (PMCTS) framework that generates dual-target molecules. CombiMOTS is designed to explore a synthesizable fragment space while employing vectorized optimization constraints to encapsulate target affinity and physicochemical properties. Extensive experiments on real-world databases demonstrate that CombiMOTS produces novel dual-target molecules with high docking scores, enhanced diversity, and balanced pharmacological characteristics, showcasing its potential as a powerful tool for dual-target drug discovery. The code and data are accessible through \url{https://github.com/Tibogoss/CombiMOTS}.
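
Pareto-style search needs a non-dominated filter over vector-valued molecule scores rather than a single scalar; the generic helper below (not the CombiMOTS implementation) illustrates that piece for objectives that are all maximized.

```python
import numpy as np

def pareto_front(scores):
    """Indices of non-dominated rows, with every objective to be maximized."""
    keep = []
    for i, s in enumerate(scores):
        dominated = any(np.all(o >= s) and np.any(o > s) for o in scores)
        if not dominated:
            keep.append(i)
    return keep

scores = np.array([[0.8, 0.3, 0.9],   # e.g. [affinity target A, affinity target B, QED]
                   [0.6, 0.7, 0.8],
                   [0.5, 0.2, 0.4]])
print(pareto_front(scores))           # -> [0, 1]; the last candidate is dominated
```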


Poster
#W-306
MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

Yuxin Zuo · Shang Qu · Yifei Li · Zhang-Ren Chen · Xuekai Zhu · Ermo Hua · Kaiyan Zhang · Ning Ding · Bowen Zhou

We introduce MedXpertQA, a highly challenging and comprehensive benchmark to evaluate expert-level medical knowledge and advanced reasoning. MedXpertQA includes 4,460 questions spanning 17 specialties and 11 body systems. It includes two subsets, Text for text evaluation and MM for multimodal evaluation. Notably, MM introduces expert-level exam questions with diverse images and rich clinical information, including patient records and examination results, setting it apart from traditional medical multimodal benchmarks with simple QA pairs generated from image captions. MedXpertQA applies rigorous filtering and augmentation to address the insufficient difficulty of existing benchmarks like MedQA, and incorporates specialty board questions to improve clinical relevance and comprehensiveness. We perform data synthesis to mitigate data leakage risk and conduct multiple rounds of expert reviews to ensure accuracy and reliability. We evaluate 18 leading models on MedXpertQA. Moreover, medicine is deeply connected to real-world decision-making, providing a rich and representative setting for assessing reasoning abilities beyond mathematics and code. To this end, we develop a reasoning-oriented subset to facilitate the assessment of o1-like models.


Poster
#W-307
UniMoMo: Unified Generative Modeling of 3D Molecules for De Novo Binder Design

Xiangzhe Kong · Zishen Zhang · Ziting Zhang · Rui Jiao · Jianzhu Ma · Wenbing Huang · Kai Liu · Yang Liu

The design of target-specific molecules such as small molecules, peptides, and antibodies is vital for biological research and drug discovery. Existing generative methods are restricted to single-domain molecules, failing to address versatile therapeutic needs or utilize cross-domain transferability to enhance model performance. In this paper, we introduce Unified generative Modeling of 3D Molecules (UniMoMo), the first framework capable of designing binders of multiple molecular domains using a single model. In particular, UniMoMo unifies the representations of different molecules as graphs of blocks, where each block corresponds to either a standard amino acid or a molecular fragment. Based on these unified representations, UniMoMo utilizes a geometric latent diffusion model for 3D molecular generation, featuring an iterative full-atom autoencoder to compress blocks into latent space points, followed by an E(3)-equivariant diffusion process. Extensive benchmarks across peptides, antibodies, and small molecules demonstrate the superiority of our unified framework over existing domain-specific models, highlighting the benefits of multi-domain training.


Poster
#W-308
Identifying biological perturbation targets through causal differential networks

Menghua Wu · Umesh Padia · Sean Murphy · Regina Barzilay · Tommi Jaakkola

Identifying variables responsible for changes to a biological system enables applications in drug target discovery and cell engineering. Given a pair of observational and interventional datasets, the goal is to isolate the subset of observed variables that were the targets of the intervention. Directly applying causal discovery algorithms is challenging: the data may contain thousands of variables with as few as tens of samples per intervention, and biological systems do not adhere to classical causality assumptions. We propose a causality-inspired approach to address this practical setting. First, we infer noisy causal graphs from the observational and interventional data. Then, we learn to map the differences between these graphs, along with additional statistical features, to sets of variables that were intervened upon. Both modules are jointly trained in a supervised framework, on simulated and real data that reflect the nature of biological interventions. This approach consistently outperforms baselines for perturbation modeling on seven single-cell transcriptomics datasets. We also demonstrate significant improvements over current causal discovery methods for predicting soft and hard intervention targets across a variety of synthetic data.


Poster
#W-309
Learning Invariant Causal Mechanism from Vision-Language Models

Zeen Song · Siyu Zhao · Xingyu Zhang · Jiangmeng Li · Changwen Zheng · Wenwen Qiang

Contrastive Language-Image Pretraining (CLIP) has achieved remarkable success, but its performance can degrade when fine-tuned in out-of-distribution (OOD) scenarios. We model the prediction process using a Structural Causal Model (SCM) and show that the causal mechanism involving both invariant and variant factors in training environments differs from that in test environments. In contrast, the causal mechanism with solely invariant factors remains consistent across environments. We theoretically prove the existence of a linear mapping from CLIP embeddings to invariant factors, which can be estimated using interventional data. Additionally, we provide a condition to guarantee low OOD risk of the invariant predictor. Based on these insights, we propose the Invariant Causal Mechanism of CLIP (CLIP-ICM) framework. CLIP-ICM involves collecting interventional data, estimating a linear projection matrix, and making predictions within the invariant subspace. Experiments on several OOD datasets show that CLIP-ICM significantly improves the performance of CLIP. Our method offers a simple but powerful enhancement, boosting the reliability of CLIP in real-world applications.


Poster
#W-310
BounDr.E: Predicting Drug-likeness via Biomedical Knowledge Alignment and EM-like One-Class Boundary Optimization

Dongmin Bang · Inyoung Sung · Yinhua Piao · Sangseon Lee · Sun Kim

The advent of generative AI now enables large-scale $\textit{de novo}$ design of molecules, but identifying viable drug candidates among them remains an open problem. Existing drug-likeness prediction methods often rely on ambiguous negative sets or purely structural features, limiting their ability to accurately classify drugs from non-drugs. In this work, we introduce BounDr.E: a novel modeling of drug-likeness as a compact space surrounding approved drugs through a dynamic one-class boundary approach. Specifically, we enrich the chemical space through biomedical knowledge alignment, and then iteratively tighten the drug-like boundary by pushing non-drug-like compounds outside via an Expectation-Maximization (EM)-like process. Empirically, BounDr.E achieves a 10\% F1-score improvement over the previous state-of-the-art and demonstrates robust cross-dataset performance, including zero-shot toxic compound filtering. Additionally, we showcase its effectiveness through comprehensive case studies in large-scale $\textit{in silico}$ screening. Our code and constructed benchmark data under various schemes are provided at: https://github.com/eugenebang/boundr_e.
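A toy numpy sketch of an EM-like one-class boundary: alternately estimate a compact region around approved drugs and re-label candidate compounds that fall outside it. This is illustrative only; the paper operates on knowledge-aligned embeddings with a learned boundary rather than the fixed-radius ball assumed here.

```python
# Toy EM-like one-class boundary: iterate between fitting a compact region
# around approved drugs and pushing out-of-boundary candidates outside.
import numpy as np

def em_like_boundary(drug_emb, candidate_emb, quantile=0.95, n_iter=10):
    inside = np.ones(len(candidate_emb), dtype=bool)   # start: all candidates drug-like
    for _ in range(n_iter):
        # "M-step": fit center and radius from drugs + currently-inside candidates
        pool = np.vstack([drug_emb, candidate_emb[inside]])
        center = pool.mean(axis=0)
        radius = np.quantile(np.linalg.norm(drug_emb - center, axis=1), quantile)
        # "E-step": candidates beyond the radius are pushed outside the boundary
        inside = np.linalg.norm(candidate_emb - center, axis=1) <= radius
    return center, radius, inside                       # inside == predicted drug-like

rng = np.random.default_rng(0)
drugs = rng.normal(0, 1, size=(500, 16))
candidates = np.vstack([rng.normal(0, 1, size=(300, 16)),    # drug-like
                        rng.normal(4, 1, size=(300, 16))])   # non-drug-like
center, radius, is_druglike = em_like_boundary(drugs, candidates)
print(is_druglike[:300].mean(), is_druglike[300:].mean())
```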


Poster
#W-311
Multi-Marginal Stochastic Flow Matching for High-Dimensional Snapshot Data at Irregular Time Points

Justin Lee · Behnaz Moradi-Jamei · Heman Shakeri

Modeling the evolution of high-dimensional systems from limited snapshot observations at irregular time points poses a significant challenge in quantitative biology and related fields. Traditional approaches often rely on dimensionality reduction techniques, which can oversimplify the dynamics and fail to capture critical transient behaviors in non-equilibrium systems. We present Multi-Marginal Stochastic Flow Matching (MMSFM), a novel extension of simulation-free score and flow matching methods to the multi-marginal setting, enabling the alignment of high-dimensional data measured at non-equidistant time points without reducing dimensionality. The use of measure-valued splines enhances robustness to irregular snapshot timing, and score matching prevents overfitting in high-dimensional spaces. We validate our framework on several synthetic and benchmark datasets and apply it to single-cell perturbation data from melanoma cell lines and gene expression data collected at uneven time points.


Poster
#W-312
InfoSEM: A Deep Generative Model with Informative Priors for Gene Regulatory Network Inference

Tianyu Cui · Song-Jun Xu · Artem Moskalev · Shuwei Li · Tommaso Mansi · Mangal Prakash · Rui Liao

Inferring Gene Regulatory Networks (GRNs) from gene expression data is crucial for understanding biological processes. While supervised models are reported to achieve high performance for this task, they rely on costly ground truth (GT) labels and risk learning gene-specific biases—such as class imbalances of GT interactions—rather than true regulatory mechanisms. To address these issues, we introduce InfoSEM, an unsupervised generative model that leverages textual gene embeddings as informative priors, improving GRN inference without GT labels. InfoSEM can also integrate GT labels as an additional prior when available, avoiding biases and further enhancing performance. Additionally, we propose a biologically motivated benchmarking framework that better reflects real-world applications such as biomarker discovery and reveals learned biases of existing supervised methods. InfoSEM outperforms existing models by 38.5% across four datasets when using textual embedding priors and further boosts performance by 11.1% when integrating labeled data as priors.


Poster
#W-313
A Variational Perspective on Generative Protein Fitness Optimization

Lea Bogensperger · Dominik Narnhofer · Ahmed Allam · Konrad Schindler · Michael Krauthammer

The goal of protein fitness optimization is to discover new protein variants with enhanced fitness for a given use. The vast search space and the sparsely populated fitness landscape, along with the discrete nature of protein sequences, pose significant challenges when trying to determine the gradient towards configurations with higher fitness. We introduce Variational Latent Generative Protein Optimization (VLGPO), a variational perspective on fitness optimization. Our method embeds protein sequences in a continuous latent space to enable efficient sampling from the fitness distribution and combines a (learned) flow matching prior over sequence mutations with a fitness predictor to guide optimization towards sequences with high fitness. VLGPO achieves state-of-the-art results on two different protein benchmarks of varying complexity. Moreover, the variational design with explicit prior and likelihood functions offers a flexible plug-and-play framework that can be easily customized to suit various protein design tasks.


Spotlight Poster
#W-314
Do Multiple Instance Learning Models Transfer?

Daniel Shao · Richard Chen · Andrew Song · Joel Runevic · Ming Y. Lu · Tong Ding · Faisal Mahmood

Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology for distilling embeddings from gigapixel tissue images into patient-level representations to predict clinical outcomes. However, MIL is frequently challenged by the constraints of working with small, weakly-supervised clinical datasets. Unlike fields such as natural language processing and computer vision, which effectively use transfer learning to improve model quality in data-scarce environments, the transferability of MIL models remains largely unexplored. We conduct the first comprehensive investigation into transfer learning capabilities of pretrained MIL models, evaluating 11 MIL models across 19 pretraining tasks spanning tissue subtyping, cancer grading, and molecular subtype prediction. We observe a substantial performance boost with finetuning pretrained models over training from randomly initialized weights, even with domain differences between pretraining and target tasks. Pretraining on pan-cancer datasets enables consistent generalization across organs and task types compared to single-disease pretraining. Remarkably, this pan-cancer pretraining leads to better transfer than that of a state-of-the-art slide-level foundation model, while using only 6.5\% of the training data. These findings indicate that MIL architectures exhibit robust adaptability, offering insights into the benefits of leveraging pretrained models to enhance performance in computational pathology.


Poster
#W-315
Reliable Algorithm Selection for Machine Learning-Guided Design

Clara Fannjiang · Ji Won Park

Algorithms for machine learning-guided design, or design algorithms, use machine learning-based predictions to propose novel objects with desired property values. Given a new design task—for example, to design novel proteins with high binding affinity to a therapeutic target—one must choose a design algorithm and specify any hyperparameters and predictive and/or generative models involved. How can these decisions be made such that the resulting designs are successful? This paper proposes a method for design algorithm selection, which aims to select design algorithms that will produce a distribution of design labels satisfying a user-specified success criterion—for example, that at least ten percent of designs’ labels exceed a threshold. It does so by combining designs’ predicted property values with held-out labeled data to reliably forecast characteristics of the label distributions produced by different design algorithms, building upon techniques from prediction-powered inference (Angelopoulos et al., 2023). The method is guaranteed with high probability to return design algorithms that yield successful label distributions (or the null set if none exist), if the density ratios between the design and labeled data distributions are known. We demonstrate the method’s effectiveness in simulated protein and RNA design tasks, in settings with either known or estimated density ratios.
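The key quantity the method forecasts can be illustrated with a point estimate in the spirit of prediction-powered inference (Angelopoulos et al., 2023): combine model predictions on the proposed (unlabeled) designs with a correction term computed from held-out labeled data. This is only a minimal sketch; the paper additionally handles density-ratio weighting and high-probability selection guarantees, and the names below are illustrative.

```python
# Minimal prediction-powered point estimate of a design algorithm's success
# rate P(label > threshold), combining predictions with a labeled-data rectifier.
import numpy as np

def pp_success_rate(pred_designs, pred_labeled, y_labeled, threshold):
    """pred_designs: model predictions on the proposed (unlabeled) designs.
    pred_labeled / y_labeled: predictions and true labels on held-out data."""
    naive = np.mean(pred_designs > threshold)                 # prediction-only estimate
    bias = np.mean((pred_labeled > threshold).astype(float)
                   - (y_labeled > threshold).astype(float))   # rectifier from labeled data
    return naive - bias

rng = np.random.default_rng(0)
y_labeled = rng.normal(size=2000)
pred_labeled = y_labeled + rng.normal(scale=0.5, size=2000)   # imperfect predictor
pred_designs = rng.normal(loc=0.3, scale=1.1, size=5000)      # predictions on new designs
est = pp_success_rate(pred_designs, pred_labeled, y_labeled, threshold=1.0)
# Select only design algorithms whose (suitably lower-bounded) estimate meets
# the user-specified success criterion, e.g. est >= 0.10.
```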


Poster
#W-316
"Why Is There a Tumor?": Tell Me the Reason, Show Me the Evidence

Mengmeng Ma · Tang Li · Yunxiang Peng · LIN LU · Volkan Beylergil · Binsheng Zhao · Oguz Akin · Xi Peng

Medical AI models excel at tumor detection and segmentation. However, their latent representations often lack explicit ties to clinical semantics, producing outputs less trusted in clinical practice. Most of the existing models generate either segmentation masks/labels (localizing where without why) or textual justifications (explaining why without where), failing to ground clinical concepts in spatially localized evidence. To bridge this gap, we propose to develop models that can justify the segmentation or detection using clinically relevant terms and point to visual evidence. We address two core challenges: First, we curate a rationale dataset to tackle the lack of paired images, annotations, and textual rationales for training. The dataset includes 180K image-mask-rationale triples with quality evaluated by expert radiologists. Second, we design rationale-informed optimization that disentangles and localizes fine-grained clinical concepts in a self-supervised manner without requiring pixel-level concept annotations. Experiments across medical benchmarks show that our model achieves superior performance in segmentation, detection, and beyond. Our code is available via an anonymous link.


Poster
#W-317
ViTally Consistent: Scaling Biological Representation Learning for Cell Microscopy

Kian Kenyon-Dean · Zitong Jerry Wang · John Urbanik · Konstantin Donhauser · Jason Hartford · Saber Saberian · Nil Sahin · Ihab Bendidi · Safiye Celik · Juan Vera · Marta Fay · Imran Haque · Oren Kraus

Deriving insights from experimentally generated datasets requires methods that can account for random and systematic measurement errors and remove them in order to accurately represent the underlying effects of the conditions being tested. Here we present a framework for pretraining on large-scale microscopy datasets that includes three steps: (1) curating a set of diverse and self-consistent training samples, (2) scaling training of an appropriate foundation model architecture on this dataset, (3) evaluating intermediate layers of the trained model to identify the best representation for downstream tasks. Using this strategy, we present the largest foundation model for cell microscopy data to our knowledge, a new 1.9 billion-parameter ViT-G/8 MAE trained on over 8 billion microscopy image crops. Compared to a previously published ViT-L/8 MAE, our new model achieves a 60\% improvement in linear separability of genetic perturbations and obtains the best overall performance on whole-genome relationship recall, batch correction replicate consistency, and compound-gene activity prediction benchmarks.


Spotlight Poster
#W-318
Visual and Domain Knowledge for Professional-level Graph-of-Thought Medical Reasoning

Rina Bao · Shilong Dong · Zhenfang Chen · Sheng He · Patricia Ellen Grant · Yangming Ou

Medical Visual Question Answering (MVQA) requires AI models to answer questions related to medical images, offering significant potential to assist medical professionals in evaluating and diagnosing diseases, thereby improving early interventions. However, existing MVQA datasets primarily focus on basic questions regarding visual perception and pattern recognition, without addressing the more complex questions that are critical in clinical diagnosis and decision-making. This paper introduces a new benchmark designed for professional-level medical reasoning, simulating the decision-making process. We achieve this by collecting MRI and clinical data related to Hypoxic-Ischemic Encephalopathy, enriched with expert annotations and insights. Building on this data, we generate clinical question-answer pairs and MRI interpretations to enable comprehensive diagnosis, interpretation, and prediction of neurocognitive outcomes. Our evaluation of current large vision-language models (LVLMs) shows limited performance on this benchmark, highlighting both the challenges of the task and the importance of this benchmark for advancing medical AI. Furthermore, we propose a novel ``Clinical Graph of Thoughts'' model, which integrates domain-specific medical knowledge and clinical reasoning processes with the interpretive abilities of LVLMs. The model demonstrates promising results, achieving around 15\% absolute gain on the most important neurocognitive outcome task, while the benchmark still reveals substantial opportunities for further research innovation.


Poster
#W-319
A Cross Modal Knowledge Distillation & Data Augmentation Recipe for Improving Transcriptomics Representations through Morphological Features

Ihab Bendidi · Yassir El Mesbahi · Alisandra Denton · Karush Suri · Kian Kenyon-Dean · Auguste Genovesio · Emmanuel Noutahi

Understanding cellular responses to stimuli is crucial for biological discovery and drug development. Transcriptomics provides interpretable, gene-level insights, while microscopy imaging offers rich predictive features but is harder to interpret. Weakly paired datasets, where samples share biological states, enable multimodal learning but are scarce, limiting their utility for training and multimodal inference. We propose a framework to enhance transcriptomics by distilling knowledge from microscopy images. Using weakly paired data, our method aligns and binds modalities, enriching gene expression representations with morphological information. To address data scarcity, we introduce (1) Semi-Clipped, an adaptation of CLIP for cross-modal distillation using pretrained foundation models, achieving state-of-the-art results, and (2) PEA (Perturbation Embedding Augmentation), a novel augmentation technique that enhances transcriptomics data while preserving inherent biological information. These strategies improve the predictive power and retain the interpretability of transcriptomics, enabling rich unimodal representations for complex biological tasks.


Poster
#W-320
Unified Screening for Multiple Diseases

Yiğit Narter · Alihan Hüyük · Mihaela van der Schaar · Cem Tekin

Current screening programs that focus on improving patient health while minimizing screening costs are tailored for individual diseases. Designing unified screening programs for multiple diseases requires carefully balancing competing disease risks, which is an open problem. In this work, we address this problem by casting unified screening as a referral problem, in which we choose to activate a subset of screening policies for individual diseases by accounting for competing risks that influence patient outcomes. We introduce a novel optimization framework that incorporates disease risks, budget constraints, and diagnostic error limits and characterize the structural properties of the optimal referral policy. For the unified screening of two diseases, we show that the optimal activation threshold for the screening of one disease depends on the risk of the other, resulting in decision boundaries with distinct risk-dependent profiles. We compare our unified model with independent screening programs that apply isolated activation thresholds for screening of each disease. Our approach optimizes screening decisions collectively, improving overall survival outcomes, particularly for patients with high disease risks.


Poster
#W-321
BinauralFlow: A Causal and Streamable Approach for High-Quality Binaural Speech Synthesis with Flow Matching Models

Susan Liang · Dejan Markovic · Israel D. Gebru · Steven Krenn · Todd Keebler · Jacob Sandakly · Frank Yu · Samuel Hassel · Chenliang Xu · Alexander Richard

Binaural rendering aims to synthesize binaural audio that mimics natural hearing based on a mono audio and the locations of the speaker and listener. Although many methods have been proposed to solve this problem, they struggle with rendering quality and streamable inference. Synthesizing high-quality binaural audio that is indistinguishable from real-world recordings requires precise modeling of binaural cues, room reverb, and ambient sounds. Additionally, real-world applications demand streaming inference. To address these challenges, we propose a flow matching based streaming binaural speech synthesis framework called BinauralFlow. We consider binaural rendering to be a generation problem rather than a regression problem and design a conditional flow matching model to render high-quality audio. Moreover, we design a causal U-Net architecture that estimates the current audio frame solely based on past information to tailor generative models for streaming inference. Finally, we introduce a continuous inference pipeline incorporating streaming STFT/ISTFT operations, a buffer bank, a midpoint solver, and an early skip schedule to improve rendering continuity and speed. Quantitative and qualitative evaluations demonstrate the superiority of our method over SOTA approaches. A perceptual study further reveals that our model is nearly indistinguishable from real-world recordings, with a 42% confusion rate.


Poster
#W-400
UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Perampalli Shravan Nayak · Xiangru Jian · Kevin Qinghong Lin · Juan A. Rodriguez · Montek Kalsi · Nicolas Chapados · M. Özsu · Aishwarya Agrawal · David Vazquez · Christopher Pal · Perouz Taslakian · Spandana Gella · Sai Rajeswar Mudumba

Autonomous agents that navigate Graphical User Interfaces (GUIs) to automate tasks like document editing and file management can greatly enhance computer workflows. While existing research focuses on online settings, desktop environments, critical for many professional and everyday tasks, remain underexplored due to data collection challenges and licensing issues. We introduce UI-Vision, the first comprehensive, license-permissive benchmark for offline, fine-grained evaluation of computer use agents in real-world desktop environments. Unlike online benchmarks, UI-Vision provides: (i) dense, high-quality annotations of human demonstrations, including bounding boxes, UI labels, and action trajectories (clicks, drags, and keyboard inputs) across 83 software applications, and (ii) three fine-to-coarse grained tasks—Element Grounding, Layout Grounding, and Action Prediction—with well-defined metrics to rigorously evaluate agents' performance in desktop environments. Our evaluation reveals critical limitations in state-of-the-art models like UI-TARS-72B, including issues with understanding professional software, spatial reasoning, and complex actions like drag-and-drop. These findings highlight the challenges in developing fully autonomous computer-use agents. With UI-Vision, we aim to advance the development of more capable agents for real-world desktop tasks.


Poster
#W-401
Context is Key: A Benchmark for Forecasting with Essential Textual Information

Andrew Williams · Arjun Ashok · Étienne Marcotte · Valentina Zantedeschi · Jithendaraa Subramanian · Roland Riachi · James Requeima · Alexandre Lacoste · Irina Rish · Nicolas Chapados · Alexandre Drouin

Forecasting is a critical task in decision-making across numerous domains. While historical numerical data provide a start, they fail to convey the complete context for reliable and accurate predictions. Human forecasters frequently rely on additional information, such as background knowledge and constraints, which can efficiently be communicated through natural language. However, in spite of recent progress with LLM-based forecasters, their ability to effectively integrate this textual information remains an open question. To address this, we introduce "Context is Key" (CiK), a time-series forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context, requiring models to integrate both modalities; crucially, every task in CiK requires understanding textual context to be solved successfully. We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters, and propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark. Our experiments highlight the importance of incorporating contextual information, demonstrate surprising performance when using LLM-based forecasting models, and also reveal some of their critical shortcomings. This benchmark aims to advance multimodal forecasting by promoting models that are both accurate and accessible to decision-makers with varied technical expertise. The benchmark can be visualized at https://servicenow.github.io/context-is-key-forecasting/v0.


Poster
#W-402
CMoS: Rethinking Time Series Prediction Through the Lens of Chunk-wise Spatial Correlations

Haotian Si · Changhua Pei · Jianhui LI · Dan Pei · Gaogang Xie

Recent advances in lightweight time series forecasting models suggest the inherent simplicity of time series forecasting tasks. In this paper, we present CMoS, a super-lightweight time series forecasting model. Instead of learning embeddings of the shapes, CMoS directly models the spatial correlations between different time series chunks. Additionally, we introduce a Correlation Mixing technique that enables the model to capture diverse spatial correlations with minimal parameters, and an optional Periodicity Injection technique to ensure faster convergence. Despite using as little as 1% of the parameter count of the lightweight model DLinear, experimental results demonstrate that CMoS outperforms existing state-of-the-art models across multiple datasets. Furthermore, the learned weights of CMoS exhibit great interpretability, providing practitioners with valuable insights into temporal structures within specific application scenarios.
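A minimal PyTorch sketch of the central idea as described, assuming the simplest setup: split the lookback window into chunks and learn a chunk-to-chunk weight matrix that produces the forecast chunks. Correlation Mixing and Periodicity Injection are omitted, and all sizes and names are illustrative rather than the paper's configuration.

```python
# Minimal sketch of chunk-wise mapping for forecasting: a learned weight per
# (input chunk, output chunk) pair, applied pointwise within chunks.
import torch
import torch.nn as nn

class ChunkMapper(nn.Module):
    def __init__(self, lookback=336, horizon=96, chunk=24):
        super().__init__()
        self.chunk = chunk
        self.n_in, self.n_out = lookback // chunk, horizon // chunk
        self.weights = nn.Parameter(0.01 * torch.randn(self.n_out, self.n_in))
        self.bias = nn.Parameter(torch.zeros(horizon))

    def forward(self, x):                                     # x: (batch, lookback)
        b = x.shape[0]
        chunks = x.reshape(b, self.n_in, self.chunk)          # (b, n_in, chunk)
        out = torch.einsum("oi,bic->boc", self.weights, chunks)
        return out.reshape(b, -1) + self.bias                 # (b, horizon)

model = ChunkMapper()
y_hat = model(torch.randn(8, 336))                            # -> (8, 96)
```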


Poster
#W-403
TimePoint: Accelerated Time Series Alignment via Self-Supervised Keypoint and Descriptor Learning

Ron Shapira Weber · shahar benishay · Andrey Lavrinenko · Shahaf E. Finder · Oren Freifeld

Fast and scalable alignment of time series is a fundamental challenge in many domains. The standard solution, Dynamic Time Warping (DTW), struggles with poor scalability and sensitivity to noise. We introduce TimePoint, a self-supervised method that dramatically accelerates DTW-based alignment while typically improving alignment accuracy by learning keypoints and descriptors from synthetic data. Inspired by 2D keypoint detection but carefully adapted to the unique challenges of 1D signals, TimePoint leverages efficient 1D diffeomorphisms, which effectively model nonlinear time warping, to generate realistic training data. This adaptation, along with fully convolutional and wavelet convolutional architectures, enables the extraction of informative keypoints and descriptors. Applying DTW to these sparse representations yields major speedups and typically higher alignment accuracy than standard DTW applied to the full signals. Despite being trained solely on synthetic data, TimePoint generalizes well to real-world time series. Extensive experiments demonstrate that TimePoint consistently achieves faster and more accurate alignments than standard DTW, making it a scalable solution for time-series analysis. Our code is available at https://github.com/BGU-CS-VIL/TimePoint.


Poster
#W-404
LOB-Bench: Benchmarking Generative AI for Finance - an Application to Limit Order Book Data

Peer Nagy · Sascha Frey · Kang Li · Bidipta Sarkar · Svitlana Vyetrenko · Stefan Zohren · Anisoara Calinescu · Jakob Foerster

While financial data presents one of the most challenging and interesting sequence modelling tasks due to high noise, heavy tails, and strategic interactions, progress in this area has been hindered by the lack of consensus on quantitative evaluation paradigms. To address this, we present LOB-Bench, a benchmark, implemented in Python, designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB) in the LOBSTER format. Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data, supporting flexible multivariate statistical evaluation. The benchmark also includes commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times, along with scores from a trained discriminator network. Lastly, LOB-Bench contains "market impact metrics", i.e. the cross-correlations and price response functions for specific events in the data. We benchmark generative autoregressive state-space models, a (C)GAN, as well as a parametric LOB model and find that the autoregressive GenAI approach beats traditional model classes.
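The flavor of the benchmark's distributional comparisons can be illustrated with a simple per-statistic score between real and generated LOB statistics, e.g. a 1-D Wasserstein distance. This sketch uses scipy and made-up arrays; it is not the LOB-Bench API.

```python
# Illustrative scoring in the spirit of LOB-Bench: compare distributions of
# LOB statistics (spread, inter-arrival times, ...) between real and
# generated message data. Not the package's actual interface.
import numpy as np
from scipy.stats import wasserstein_distance

def score_statistics(real: dict, generated: dict) -> dict:
    """real/generated map a statistic name to a 1-D array of samples."""
    return {name: wasserstein_distance(real[name], generated[name])
            for name in real}

rng = np.random.default_rng(0)
real = {"spread": rng.lognormal(0.0, 0.3, 10_000),
        "inter_arrival_ms": rng.exponential(5.0, 10_000)}
generated = {"spread": rng.lognormal(0.05, 0.35, 10_000),
             "inter_arrival_ms": rng.exponential(6.0, 10_000)}
print(score_statistics(real, generated))
```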


Poster
#W-405
ProDiff: Prototype-Guided Diffusion for Minimal Information Trajectory Imputation

Tianci Bu · Le Zhou · Wenchuan Yang · Jianhong Mou · Kang Yang · Suoyi Tan · Feng Yao · Jingyuan Wang · Xin Lu

Trajectory data is crucial for various applications but often suffers from incompleteness due to device limitations and diverse collection scenarios. Existing imputation methods rely on sparse trajectory or travel information, such as velocity, to infer missing points. However, these approaches assume that sparse trajectories retain essential behavioral patterns, an assumption that places significant demands on data acquisition and overlooks the potential of large-scale human trajectory embeddings. To address this, we propose ProDiff, a trajectory imputation framework that uses only two endpoints as minimal information. It integrates prototype learning to embed human movement patterns and a denoising diffusion probabilistic model for robust spatiotemporal reconstruction. Joint training with a tailored loss function ensures effective imputation. ProDiff outperforms state-of-the-art methods, improving accuracy by 6.28\% on FourSquare and 2.52\% on WuXi. Further analysis shows a 0.927 correlation between generated and real trajectories, demonstrating the effectiveness of our approach.


Poster
#W-406
Efficient Personalized Adaptation for Physiological Signal Foundation Model

Chenrui Wu · Haishuai Wang · Xiang Zhang · Chengqi Zhang · Jiajun Bu

Time series analysis is crucial across various fields like energy, environment, transportation, finance, and health. Deep learning has significantly advanced this field; in particular, the Time Series Foundation Model (TSFM) excels in multiple domains due to extensive pre-training. In this work, we focus on TSFM's challenges in medical practice: limited computing resources and medical data privacy. TSFM variants include fine-tuned models and those pre-trained for rapid deployment on diverse data. There may not be enough computing resources to train models on physiological signals locally in hospitals, and generalized TSFM is still inferior to task-specific methods on private, imbalanced local data. To address this, we propose PhysioPFM, a framework for efficiently personalizing TSFM. Our approach involves low-rank pre-training on public datasets, generator training on the trained LoRA weights, and efficient weight generation via local data. Experimental results demonstrate that integrating generated models with TSFM enhances performance and transferability, and reduces the need for additional sensitive data training.


Poster
#W-407
Towards Learning to Complete Anything in Lidar

Ayça Takmaz · Cristiano Saltori · Neehar Peri · Tim Meinhardt · Riccardo de Lutio · Laura Leal-Taixé · Aljosa Osep

We propose CAL (Complete Anything in Lidar) for Lidar-based shape completion in the wild. This is closely related to Lidar-based semantic/panoptic scene completion. However, contemporary methods can only complete and recognize objects from a closed vocabulary labeled in existing Lidar datasets. In contrast, our zero-shot approach leverages the temporal context from multi-modal sensor sequences to mine object shapes and semantic features of observed objects. These are then distilled into a Lidar-only instance-level completion and recognition model. Although we only mine partial shape completions, we find that our distilled model learns to infer full object shapes from multiple such partial observations across the dataset. We show that our model can be prompted on standard benchmarks for Semantic and Panoptic Scene Completion, localize objects as (amodal) 3D bounding boxes, and recognize objects beyond fixed class vocabularies.


Spotlight Poster
#W-408
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu · Yanjiang Guo · Pengchao Wang · Xiaoyu Chen · Yen-Jen Wang · Jianke Zhang · Koushil Sreenath · Chaochao Lu · Jianyu Chen

Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) have demonstrated the ability to predict future frames and showcase a strong understanding of the physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns an implicit inverse dynamics model conditioned on predicted future representations inside VDMs. To predict more precise futures, we fine-tune a pre-trained video foundation model on robot datasets along with internet human manipulation data. In experiments, VPP achieves an 18.6\% relative improvement on the Calvin ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6\% increase in success rates for complex real-world dexterous manipulation tasks. For your convenience, videos can be found at https://video-prediction-policy.github.io/


Poster
#W-409
OTTER: A Vision-Language-Action Model with Text-Aware Visual Feature Extraction

Huang Huang · Fangchen Liu · Letian Fu · Tingfan Wu · Mustafa Mukadam · Jitendra Malik · Ken Goldberg · Pieter Abbeel

Vision-Language-Action (VLA) models aim to predict robotic actions based on visual observations and language instructions. Existing approaches require fine-tuning pre-trained vision-language models (VLMs) as visual and language features are independently fed into downstream policies, degrading the pre-trained semantic alignments. We propose OTTER, a novel VLA architecture that leverages these existing alignments through explicit, text-aware visual feature extraction. Instead of processing all visual features, OTTER selectively extracts and passes only task-relevant visual features that are semantically aligned with the language instruction to the policy transformer. This allows OTTER to keep the pre-trained vision-language encoders frozen. Thereby, OTTER preserves and utilizes the rich semantic understanding learned from large-scale pre-training, enabling strong zero-shot generalization capabilities. In simulation and real-world experiments, OTTER significantly outperforms existing VLA models, demonstrating strong zero-shot generalization to novel objects and environments. Video, code, checkpoints, and dataset: https://ottervla.github.io/.


Poster
#W-410
SAH-Drive: A Scenario-Aware Hybrid Planner for Closed-Loop Vehicle Trajectory Generation

Yuqi Fan · Zhiyong Cui · Zhenning Li · Yilong Ren · Haiyang Yu

Reliable planning is crucial for achieving autonomous driving. Rule-based planners are efficient but lack generalization, while learning-based planners excel in generalization yet have limitations in real-time performance and interpretability. In long-tail scenarios, these challenges make planning particularly difficult. To leverage the strengths of both rule-based and learning-based planners, we propose the Scenario-Aware Hybrid Planner (SAH-Drive) for closed-loop vehicle trajectory planning. Inspired by human driving behavior, SAH-Drive combines a lightweight rule-based planner and a comprehensive learning-based planner, utilizing a dual-timescale decision neuron to determine the final trajectory. To enhance the computational efficiency and robustness of the hybrid planner, we also employ a diffusion proposal number regulator and a trajectory fusion module. The experimental results show that the proposed method significantly improves the generalization capability of the planning system, achieving state-of-the-art performance in interPlan, while maintaining computational efficiency without incurring substantial additional runtime.


Poster
#W-411
DexScale: Automating Data Scaling for Sim2Real Generalizable Robot Control

Guiliang Liu · Yueci Deng · Runyi Zhao · Huayi Zhou · Jian Chen · Jietao Chen · Ruiyan Xu · Yunxin Tai · Kui Jia

A critical prerequisite for achieving generalizable robot control is the availability of a large-scale robot training dataset. Due to the expense of collecting realistic robotic data, recent studies explored simulating and recording robot skills in virtual environments. While simulated data can be generated at higher speeds, lower costs, and larger scales, the applicability of such simulated data remains questionable due to the gap between simulated and realistic environments. To advance the Sim2Real generalization, in this study, we present DexScale, a data engine designed to perform automatic skills simulation and scaling for learning deployable robot manipulation policies. Specifically, DexScale ensures the usability of simulated skills by integrating diverse forms of realistic data into the simulated environment, preserving semantic alignment with the target tasks. For each simulated skill in the environment, DexScale facilitates effective Sim2Real data scaling by automating the process of domain randomization and adaptation. Tuned by the scaled dataset, the control policy achieves zero-shot Sim2Real generalization across diverse tasks, multiple robot embodiments, and widely studied policy model architectures, highlighting its importance in advancing Sim2Real embodied intelligence.


Poster
#W-412
Self-cross Feature based Spiking Neural Networks for Efficient Few-shot Learning

Qi Xu · Junyang Zhu · Dongdong Zhou · Hao Chen · Yang Liu · Jiangrong Shen · Qiang Zhang

Deep neural networks (DNNs) excel in computer vision tasks, especially few-shot learning (FSL), which is increasingly important for generalizing from limited examples. However, DNNs are computationally expensive and face scalability issues in real-world settings. Spiking Neural Networks (SNNs), with their event-driven nature and low energy consumption, are particularly efficient in processing sparse and dynamic data, though they still encounter difficulties in capturing complex spatiotemporal features and performing accurate cross-class comparisons. To further enhance the performance and efficiency of SNNs in few-shot learning, we propose a few-shot learning framework based on SNNs, which combines a self-feature extractor module and a cross-feature contrastive module to refine feature representation and reduce power consumption. We apply the combination of temporal efficient training loss and InfoNCE loss to optimize the temporal dynamics of spike trains and enhance the discriminative power. Experimental results show that the proposed FSL-SNN significantly improves the classification performance on the neuromorphic dataset N-Omniglot, and also achieves competitive performance with ANNs on static datasets such as CUB and miniImageNet with low power consumption.


Poster
#W-413
Testing the Limits of Fine-Tuning for Improving Visual Cognition in Vision Language Models

Luca M. Schulze Buschoff · Konstantinos Voudouris · Elif Akata · Matthias Bethge · Josh Tenenbaum · Eric Schulz

Pre-trained vision language models still fall short of human visual cognition. In an effort to improve visual cognition and align models with human behavior, we introduce visual stimuli and human judgments on visual cognition tasks, allowing us to systematically evaluate performance across cognitive domains under a consistent environment. We fine-tune models on ground truth data for intuitive physics and causal reasoning and find that this improves model performance in the respective fine-tuning domain. Furthermore, it can improve model alignment with human behavior. However, we find that task-specific fine-tuning does not contribute to robust human-like generalization to data with other visual characteristics or to tasks in other cognitive domains.


Poster
#W-414
Pre-Training Graph Contrastive Masked Autoencoders are Strong Distillers for EEG

Xinxu Wei · kanhao zhao · Yong Jiao · Hua Xie · Lifang He · Yu Zhang

Effectively utilizing extensive unlabeled high-density EEG data to improve performance in scenarios with limited labeled low-density EEG data presents a significant challenge. In this paper, we address this challenge by formulating it as a graph transfer learning and knowledge distillation problem. We propose a Unified Pre-trained Graph Contrastive Masked Autoencoder Distiller, named EEG-DisGCMAE, to bridge the gap between unlabeled and labeled as well as high- and low-density EEG data. Our approach introduces a novel unified graph self-supervised pre-training paradigm, which seamlessly integrates the graph contrastive pre-training with the graph masked autoencoder pre-training. Furthermore, we propose a graph topology distillation loss function, allowing a lightweight student model trained on low-density data to learn from a teacher model trained on high-density data during pre-training and fine-tuning. This method effectively handles missing electrodes through contrastive distillation. We validate the effectiveness of EEG-DisGCMAE across four classification tasks using two clinical EEG datasets with abundant data.


Poster
#W-415
Faster and Stronger: When ANN-SNN Conversion Meets Parallel Spiking Calculation

Zecheng Hao · Qichao Ma · Kang Chen · Yi Zhang · Zhaofei Yu · Tiejun Huang

Spiking Neural Network (SNN), as a brain-inspired and energy-efficient network, is currently facing the pivotal challenge of exploring a suitable and efficient learning framework. The predominant training methodologies, namely Spatial-Temporal Back-propagation (STBP) and ANN-SNN Conversion, are encumbered by substantial training overhead or pronounced inference latency, which impedes the advancement of SNNs in scaling to larger networks and navigating intricate application domains. In this work, we propose a novel parallel conversion learning framework, which establishes a mathematical mapping relationship between each time-step of the parallel spiking neurons and the cumulative spike firing rate. We theoretically validate the lossless and sorting properties of the conversion process and identify the optimal shifting distance for each step. Furthermore, by integrating the above framework with the distribution-aware error calibration technique, we can achieve efficient conversion towards more general activation functions or training-free settings. Extensive experiments have confirmed the significant performance advantages of our method for various conversion cases under ultra-low time latency. To the best of our knowledge, this is the first work that jointly utilizes parallel spiking calculation and ANN-SNN Conversion, providing a highly promising approach for SNN supervised training. Code is available at https://github.com/hzc1208/Parallel_Conversion.


Poster
#W-417
Fleet of Agents: Coordinated Problem Solving with Large Language Models

Lars Klein · Nearchos Potamitis · Roland Aydin · Robert West · Caglar Gulcehre · Akhil Arora

While numerous frameworks have been developed to enhance the reasoning abilities of large language models (LLMs), there is a scarcity of methods that effectively balance the trade-off between cost and quality. In this paper, we introduce Fleet of Agents (FoA), a novel and intuitive yet principled framework utilizing LLMs as agents to navigate through dynamic tree searches, employing a genetic-type particle filtering approach. FoA spawns a multitude of agents, each exploring the search space autonomously, followed by a selection phase where resampling based on a heuristic value function optimizes the balance between exploration and exploitation. This mechanism enables dynamic branching, adapting the exploration strategy based on discovered solutions. We conduct extensive experiments on four benchmark tasks, ``Game of 24'', ``Mini-Crosswords'', ``WebShop'', and ``SciBench'', utilizing four different LLMs, GPT-3.5, GPT-4, LLaMA3.2-11B, and LLaMA3.2-90B. On average across all tasks and LLMs, FoA obtains an absolute quality improvement of $\simeq 5\%$ while requiring only $\simeq 35\%$ of the cost of previous SOTA methods. Notably, our analyses reveal that (1) FoA achieves the best cost-quality trade-off among all benchmarked methods, and (2) FoA + LLaMA3.2-11B surpasses the Llama3.2-90B model. FoA is publicly available at [https://github.com/au-clan/FoA](https://github.com/au-clan/FoA).
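A skeleton of the genetic-type particle filtering loop described above, with `llm_step` and `value_fn` as placeholders rather than FoA's actual API: agents explore independently, then a resampling step reallocates the fleet toward states with higher heuristic value.

```python
# Skeleton of a genetic-type particle filtering search with LLM-like agents.
# `llm_step` and `value_fn` are illustrative placeholders, not FoA's interface.
import random

def fleet_search(initial_state, llm_step, value_fn, n_agents=16, n_rounds=5):
    states = [initial_state] * n_agents
    for _ in range(n_rounds):
        # independent exploration: each agent proposes its next state
        states = [llm_step(s) for s in states]
        # selection: resample agents in proportion to a heuristic value
        values = [max(value_fn(s), 1e-6) for s in states]
        states = random.choices(states, weights=values, k=n_agents)
    return max(states, key=value_fn)

# Toy usage: "states" are numbers and the goal is to reach 24.
best = fleet_search(
    0,
    llm_step=lambda s: s + random.choice([1, 2, 3]),
    value_fn=lambda s: 1.0 / (1.0 + abs(24 - s)),
)
```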


Poster
#W-418
A Reasoning-Based Approach to Cryptic Crossword Clue Solving

Martin Andrews · Sam Witteveen

Cryptic crossword clues are challenging language tasks for which new test sets are released daily by major newspapers on a global basis. Each cryptic clue contains both the definition of the answer to be placed in the crossword grid (in common with regular crosswords), and ‘wordplay’ that proves that the answer is correct (i.e. a human solver can be confident that an answer is correct without needing crossing words as confirmation). This work describes an LLM-based reasoning system built from open-licensed components that solves cryptic clues by (i) hypothesising answers; (ii) proposing wordplay explanations; and (iii) using a verifier system that operates on codified reasoning steps. Overall, this system establishes a new state-of-the-art performance on the challenging Cryptonite dataset of clues from The Times and The Telegraph newspapers in the UK. Because each proved solution is expressed in Python, interpretable wordplay reasoning for proven answers is available for inspection.


Poster
#W-419
MoHAVE: Mixture of Hierarchical Audio-Visual Experts for Robust Speech Recognition

Sungnyun Kim · Kangwook Jang · Sangmin Bae · Sungwoo Cho · Se-Young Yun

Audio-visual speech recognition (AVSR) has become critical for enhancing speech recognition in noisy environments by integrating both auditory and visual modalities. However, existing AVSR systems struggle to scale up without compromising computational efficiency. In this study, we introduce MoHAVE (Mixture of Hierarchical Audio-Visual Experts), a novel robust AVSR framework designed to address these scalability constraints. By leveraging a Mixture-of-Experts (MoE) architecture, MoHAVE activates modality-specific expert groups, ensuring dynamic adaptation to various audio-visual inputs with minimal computational overhead. Key contributions of MoHAVE include: (1) a sparse MoE framework that efficiently scales AVSR model capacity, (2) a hierarchical gating mechanism that dynamically utilizes the expert groups based on input context, enhancing adaptability and robustness, and (3) remarkable performance across robust AVSR benchmarks, including LRS3 and MuAViC transcription and translation tasks, setting a new standard for scalable speech recognition systems.


Poster
#W-420
Policy Filtration for RLHF to Mitigate Noise in Reward Models

Chuheng Zhang · Wei Shen · Li Zhao · Xuyun Zhang · Xiaolong Xu · Wanchun Dou · Jiang Bian

While direct policy optimization methods exist, pioneering LLMs are fine-tuned with reinforcement learning from human feedback (RLHF) to generate better responses under the supervision of a reward model learned from preference data. One major challenge of RLHF is the inaccuracy of the intermediate reward model, especially in tasks that require complex reasoning for the reward model to score a response. We find that the reliability of the reward model varies across responses assigned different rewards. This motivates us to filter the samples whose rewards may be unreliable to improve the signal-to-noise ratio during policy learning, resulting in Policy Filtration for Proximal Policy Optimization (PF-PPO). To choose a proper policy filtering strategy, we use the coefficient of determination ($R^2$) between the rewards and actual scores on filtered samples as the metric to help us find promising strategies, since it measures how well the rewards filtered by PF-PPO indicate real performance. We provide extensive experiments to validate the effectiveness of PF-PPO in code generation and math reasoning tasks. In code generation, PF-PPO achieves the state-of-the-art performance of 7-billion-parameter models on HumanEval (+7.9%), MBPP (+0.7%), and LeetCode Contest (+10.0%), which is a more challenging benchmark created by us. In math reasoning, PF-PPO yields performance increases using different reward models and benchmarks (Ape210K and CMATH).
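A sketch of the strategy-selection step described above: among candidate filtering rules, keep the one whose retained samples show the highest $R^2$ between reward-model scores and actual (verified) scores. The candidate strategies and data here are illustrative, not the paper's exact rules.

```python
# Sketch of choosing a policy-filtration strategy by the coefficient of
# determination (R^2) between rewards and actual scores on retained samples.
import numpy as np

def r_squared(rewards, scores):
    """Treat rewards as predictions of the actual scores."""
    ss_res = np.sum((scores - rewards) ** 2)
    ss_tot = np.sum((scores - scores.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def choose_filter(rewards, scores, strategies):
    best_name, best_r2 = None, -np.inf
    for name, keep_mask_fn in strategies.items():
        mask = keep_mask_fn(rewards)
        r2 = r_squared(rewards[mask], scores[mask])
        if r2 > best_r2:
            best_name, best_r2 = name, r2
    return best_name, best_r2

strategies = {
    "keep_all":    lambda r: np.ones_like(r, dtype=bool),
    "drop_middle": lambda r: (r < np.quantile(r, 0.3)) | (r > np.quantile(r, 0.7)),
    "keep_top":    lambda r: r > np.quantile(r, 0.5),
}
rng = np.random.default_rng(0)
rewards = rng.uniform(0, 1, 5000)
# synthetic ground truth: mid-range rewards are the least reliable
scores = rewards + rng.normal(0, 0.3, 5000) * (np.abs(rewards - 0.5) < 0.2)
print(choose_filter(rewards, scores, strategies))
```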


Poster
#W-421
LaRA: Benchmarking Retrieval-Augmented Generation and Long-Context LLMs – No Silver Bullet for LC or RAG Routing

Kuan Li · Liwen Zhang · Yong Jiang · Pengjun Xie · Fei Huang · Shuai Wang · Minhao Cheng

As Large Language Model (LLM) context windows expand, the necessity of Retrieval-Augmented Generation (RAG) for integrating external knowledge is debated. Existing RAG vs. long-context (LC) LLM comparisons are often inconclusive due to benchmark limitations. We introduce LaRA, a novel benchmark with 2326 test cases across four QA tasks and three long-context types, for rigorous evaluation. Our analysis of eleven LLMs reveals that the optimal choice between RAG and LC depends on a complex interplay of model capabilities, context length, task type, and retrieval characteristics, offering actionable guidelines for practitioners. Our code and dataset are provided at: https://github.com/Alibaba-NLP/LaRA


Poster
#W-500
Feature-Mapping Topology Optimization with Neural Heaviside Signed Distance Functions

Aleksandr Kolomeitsev · ANH-HUY PHAN

Topology optimization plays a crucial role in designing efficient and manufacturable structures. Traditional methods often yield free-form voids that, although providing design flexibility, introduce significant manufacturing challenges and require extensive post-processing. Conversely, feature-mapping topology optimization reduces post-processing efforts by constructing topologies using predefined geometric features. Nevertheless, existing approaches are significantly constrained by the limited set of geometric features available, the variety of parameters that each type of geometric feature can possess, and the necessity of employing differentiable signed distance functions. In this paper, we present a novel method that combines Neural Heaviside Signed Distance Functions (Heaviside SDFs) with structured latent shape representations to generate manufacturable voids directly within the optimization framework. Our architecture incorporates encoder and decoder networks to effectively approximate the Heaviside function and facilitate optimization within a unified latent space, thus addressing the feature diversity limitations of current feature-mapping techniques. Experimental results validate the effectiveness of our approach in balancing structural compliance, offering a new pathway to CAD-integrated design with minimal human intervention.


Poster
#W-501
Universal Biological Sequence Reranking for Improved De Novo Peptide Sequencing

Zijie Qiu · Jiaqi Wei · Xiang Zhang · Sheng Xu · Kai Zou · Zhi Jin · ZhiQiang Gao · Nanqing Dong · Siqi Sun

De novo peptide sequencing is a critical task in proteomics. However, the performance of current deep learning-based methods is limited by the inherent complexity of mass spectrometry data and the heterogeneous distribution of noise signals, leading to data-specific biases. We present RankNovo, the first deep reranking framework that enhances de novo peptide sequencing by leveraging the complementary strengths of multiple sequencing models. RankNovo employs a list-wise reranking approach, modeling candidate peptides as multiple sequence alignments and utilizing axial attention to extract informative features across candidates. Additionally, we introduce two new metrics, PMD (Peptide Mass Deviation) and RMD (Residual Mass Deviation), which offer delicate supervision by quantifying mass differences between peptides at both the sequence and residue levels. Extensive experiments demonstrate that RankNovo not only surpasses its base models used to generate training candidates for reranking pre-training, but also sets a new state-of-the-art benchmark. Moreover, RankNovo exhibits strong zero-shot generalization to unseen models—those whose generations were not exposed during training, highlighting its robustness and potential as a universal reranking framework for peptide sequencing. Our work presents a novel reranking strategy that fundamentally challenges existing single-model paradigms and advances the frontier of accurate de novo sequencing. Our source code is provided on GitHub.


Poster
#W-502
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Yiheng Xu · Zekun Wang · Junli Wang · Dunjie Lu · Tianbao Xie · Amrita Saha · Doyen Sahoo · Tao Yu · Caiming Xiong

Automating GUI tasks remains challenging due to reliance on textual representations, platform-specific action spaces, and limited reasoning capabilities. We introduce Aguvis, a unified vision-based framework for autonomous GUI agents that directly operates on screen images, standardizes cross-platform interactions and incorporates structured reasoning via inner monologue. To enable this, we construct Aguvis data collection, a large-scale dataset with multimodal grounding and reasoning annotations, and develop a two-stage training pipeline that separates GUI grounding from planning and reasoning. Experiments show that Aguvis achieves state-of-the-art performance across offline and real-world online benchmarks, marking the first fully autonomous vision-based GUI agent that operates without closed-source models. We open-source all datasets, models, and training recipes at https://aguvis-project.github.io to advance future research.


Poster
#W-503
LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation

Chen-Chia Chang · Wan-Hsuan Lin · Yikang Shen · Yiran Chen · Xin Zhang

Automation of analog topology design is crucial because the customized requirements of modern applications currently demand heavy manual engineering effort. The state-of-the-art work applies a sequence-to-sequence approach and supervised finetuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to $O(|V|^2)$ token length and suffers from low precision sensitivity to numeric inputs. In this work, we introduce LaMAGIC2, a succinct float-input canonical formulation with identifier (SFCI) for language model-based analog topology generation. SFCI addresses these challenges by improving component-type recognition through identifier-based representations, reducing token length complexity to $O(|V|)$, and enhancing numeric precision sensitivity for better performance under tight tolerances. Our experiments demonstrate that LaMAGIC2 achieves 34\% higher success rates under a tight tolerance of 0.01 and 10X lower MSEs compared to a prior method. LaMAGIC2 also exhibits better transferability for circuits with more vertices, with up to 58.5\% improvement. These advancements establish LaMAGIC2 as a robust framework for analog topology generation.


Poster
#W-504
Geometry-Informed Neural Networks

Arturs Berzins · Andreas Radler · Eric Volkmann · Sebastian Sanokowski · Sepp Hochreiter · Johannes Brandstetter

Geometry is a ubiquitous tool in computer graphics, design, and engineering. However, the lack of large shape datasets limits the application of state-of-the-art supervised learning methods and motivates the exploration of alternative learning strategies. To this end, we introduce geometry-informed neural networks (GINNs) -- a framework for training shape-generative neural fields without data by leveraging user-specified design requirements in the form of objectives and constraints. By adding diversity as an explicit constraint, GINNs avoid mode-collapse and can generate multiple diverse solutions, often required in geometry tasks. Experimentally, we apply GINNs to several problems spanning physics, geometry, and engineering design, showing control over geometrical and topological properties, such as surface smoothness or the number of holes. These results demonstrate the potential of training shape-generative models without data, paving the way for new generative design approaches without large datasets.


Poster
#W-505
Flat-LoRA: Low-Rank Adaptation over a Flat Loss Landscape

Tao Li · Zhengbao He · Yujun Li · Yasheng Wang · Lifeng Shang · Xiaolin Huang

Fine-tuning large-scale pre-trained models is prohibitively expensive in terms of computation and memory costs. Low-Rank Adaptation (LoRA), a popular Parameter-Efficient Fine-Tuning (PEFT) method, offers an efficient solution by optimizing only low-rank matrices. Despite recent progress in improving LoRA's performance, the relationship between the LoRA optimization space and the full parameter space is often overlooked. A solution that appears flat in the loss landscape of the LoRA space may still exhibit sharp directions in the full parameter space, potentially compromising generalization. We introduce Flat-LoRA, which aims to identify a low-rank adaptation situated in a flat region of the full parameter space. Instead of adopting the well-established sharpness-aware minimization approach, which incurs significant computation and memory overheads, we employ a Bayesian expectation loss objective to preserve training efficiency. Further, we design a refined strategy for generating random perturbations to enhance performance and carefully manage memory overhead using random seeds. Experiments across diverse tasks—including mathematical reasoning, coding abilities, dialogue generation, instruction following, and text-to-image generation—demonstrate that Flat-LoRA improves both in-domain and out-of-domain generalization. Code is available at https://github.com/nblt/Flat-LoRA.


Poster
#W-506
Efficient Distributed Optimization under Heavy-Tailed Noise

Su Hyeong Lee · Manzil Zaheer · Tian Li

Distributed optimization has become the default training paradigm in modern machine learning due to the growing scale of models and datasets. To mitigate communication overhead, local updates are often applied before global aggregation, resulting in a nested optimization approach with inner and outer steps. However, heavy-tailed stochastic gradient noise remains a significant challenge, particularly in attention-based models, hindering effective training. In this work, we propose TailOPT, an efficient framework designed to address heavy-tailed noise by leveraging adaptive optimization and novel clipping techniques. We establish convergence guarantees for the TailOPT framework under heavy-tailed noise with local updates and potentially unbounded gradient variance. Among its variants, we propose a memory- and communication-efficient instantiation (named $Bi^2Clip$) that performs coordinate-wise clipping from both above and below at both the inner and outer optimizers. $Bi^2Clip$ brings about benefits of adaptive optimization (e.g., Adam) without the cost of maintaining or transmitting additional gradient statistics. Empirically, TailOPT, including $Bi^2Clip$, demonstrates superior performance on various tasks and models compared with state-of-the-art methods, while being more efficient.
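A hedged sketch of coordinate-wise two-sided clipping inside a nested (local/global) update loop, in the spirit of $Bi^2Clip$. One natural reading of "clipping from both above and below" is that each coordinate's magnitude is clipped into an interval $[l, u]$ while preserving its sign; the exact rule, thresholds, and schedule in TailOPT may differ, so treat this as illustrative only.

```python
# Hedged sketch: coordinate-wise two-sided clipping applied to both the inner
# (local) updates and the outer (server) pseudo-gradient. Illustrative only.
import numpy as np

def biclip(g, low, high):
    """Clip each coordinate's magnitude into [low, high], keeping its sign."""
    return np.sign(g) * np.clip(np.abs(g), low, high)

def local_rounds(w, grad_fn, lr=0.01, steps=5, low=1e-3, high=1.0):
    for _ in range(steps):                      # inner optimizer with clipping
        w = w - lr * biclip(grad_fn(w), low, high)
    return w

def global_round(w, client_grad_fns, lr=0.5, low=1e-3, high=1.0):
    deltas = [local_rounds(w.copy(), g) - w for g in client_grad_fns]
    update = np.mean(deltas, axis=0)            # server-side pseudo-gradient
    return w + lr * biclip(update, low, high)   # outer optimizer with clipping

# Toy usage: two quadratic clients with heavy-tailed gradient noise.
rng = np.random.default_rng(0)
clients = [lambda w, c=c: 2 * (w - c) + rng.standard_t(df=2, size=w.shape)
           for c in (np.ones(4), -np.ones(4))]
w = np.zeros(4)
for _ in range(50):
    w = global_round(w, clients)
```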


Poster
#W-507
Online Conformal Prediction via Online Optimization

Felipe Areces · Christopher Mohri · Tatsunori Hashimoto · John Duchi

We introduce a family of algorithms for online conformal prediction with coverage guarantees for both adversarial and stochastic data. In the adversarial setting, we establish the standard guarantee: over time, a pre-specified target fraction of confidence sets cover the ground truth. For stochastic data, we provide a guarantee at every time instead of just on average over time: the probability that a confidence set covers the ground truth—conditioned on past observations—converges to a pre-specified target when the conditional quantiles of the errors are a linear function of past data. Complementary to our theory, our experiments spanning over $15$ datasets suggest that the performance improvement of our methods over baselines grows with the magnitude of the data’s dependence, even when baselines are tuned on the test set. We put these findings to the test by pre-registering an experiment for electricity demand forecasting in Texas, where our algorithms achieve over a $10$\% reduction in confidence set sizes, a more than $30$\% improvement in quantile and absolute losses with respect to the observed errors, and significant outcomes on all $78$ out of $78$ pre-registered hypotheses. We provide documentation for the pypi package implementing our algorithms here: \url{https://conformalopt.readthedocs.io/}.
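For intuition, here is a simplified member of the online-conformal family: an online gradient-style update of the working miscoverage level so that long-run coverage tracks the target. This is the classical adaptive update, not the paper's algorithms or their conditional-coverage guarantee; see the package above for the actual implementation.

```python
# Simplified online conformal loop: track a threshold via the empirical
# quantile of past nonconformity scores and adapt the working miscoverage
# level with an online (sub)gradient step toward the target coverage.
import numpy as np

def online_conformal(scores, alpha=0.1, lr=0.05):
    """scores[t] = nonconformity score of the true outcome at time t.
    Returns per-step thresholds and the realized coverage."""
    alpha_t, covered, thresholds = alpha, [], []
    for t in range(len(scores)):
        history = scores[:t]
        if len(history) == 0:
            q = np.inf                           # trivial set before any data
        else:
            q = np.quantile(history, min(max(1 - alpha_t, 0.0), 1.0))
        thresholds.append(q)
        err = float(scores[t] > q)               # 1 if the set missed the truth
        covered.append(1 - err)
        alpha_t += lr * (alpha - err)            # online update of miscoverage
    return np.array(thresholds), np.mean(covered)

rng = np.random.default_rng(0)
_, coverage = online_conformal(np.abs(rng.standard_t(df=3, size=2000)))
print(coverage)                                  # close to 0.9 for alpha = 0.1
```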


Poster
#W-508
Learning to Generate Projections for Reducing Dimensionality of Heterogeneous Linear Programming Problems

Tomoharu Iwata · Shinsaku Sakaue

We propose a data-driven method for reducing the dimensionality of linear programming problems (LPs) by generating instance-specific projection matrices using a neural network-based model. Once the model is trained using multiple LPs by maximizing the expected objective value, we can efficiently find high-quality feasible solutions of newly given LPs. Our method can shorten the computational time of any LP solver due to its solver-agnostic nature, provide feasible solutions by relying on a projection that reduces the number of variables, and handle LPs of different sizes using neural networks with permutation equivariance and invariance. We also provide a theoretical analysis of the generalization bound for learning a neural network to generate projection matrices that reduce the size of LPs. Our experimental results demonstrate that our method can obtain solutions with higher quality than the existing methods, while its computational time is significantly shorter than solving the original LPs.
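A minimal sketch of how an instance-specific projection reduces an LP (using SciPy and a placeholder projection matrix P; the paper's contribution is the neural model that generates P, which is not shown here): substituting x = P y yields a smaller LP over y, and any feasible y maps back to a feasible x.

import numpy as np
from scipy.optimize import linprog

def solve_projected_lp(c, A_ub, b_ub, P):
    # Original LP:  min c^T x  subject to  A_ub x <= b_ub.
    # With x = P y (P is n x k, k << n), the reduced LP over y has the same
    # constraint structure, and A_ub (P y) <= b_ub guarantees feasibility of x = P y.
    k = P.shape[1]
    res = linprog(P.T @ c, A_ub=A_ub @ P, b_ub=b_ub, bounds=[(None, None)] * k)
    y = res.x
    return P @ y, res.fun   # feasible solution of the original LP and its objective value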


Poster
#W-509
Schwarz–Schur Involution: Lightspeed Differentiable Sparse Linear Solvers

Yu Wang · Mazdak Abulnaga · Yaël Balbastre · Bruce Fischl

Sparse linear solvers are fundamental to science and engineering, applied in partial differential equations (PDEs), scientific computing, computer vision, and beyond. Indirect solvers possess characteristics that make them undesirable as stable differentiable modules; existing direct solvers, though reliable, are too expensive to be adopted in neural architectures. We accelerate direct sparse solvers, or generalized deconvolution, by up to three orders of magnitude, challenging the common assumption that direct solvers are too slow. We ``condense'' a sparse Laplacian matrix into a dense tensor, a compact data structure that batch-wise stores the Dirichlet-to-Neumann matrices, reducing the sparse solve to recursively merging pairs of much smaller dense matrices. The batched small dense systems are sliced and inverted in parallel to take advantage of dense GPU BLAS kernels, highly optimized in the era of deep learning. Our method is efficient, qualifying as a strong zero-shot baseline for AI-based PDE solving and a reliable differentiable module that can be integrated into machine learning pipelines.


Poster
#W-511
Decoupled SGDA for Games with Intermittent Strategy Communication

Ali Zindari · Parham Yazdkhasti · Anton Rodomanov · Tatjana Chavdarova · Sebastian Stich

We introduce Decoupled SGDA, a novel adaptation of Stochastic Gradient Descent Ascent (SGDA) tailored for multiplayer games with intermittent strategy communication. Unlike prior methods, Decoupled SGDA enables players to update strategies locally using outdated opponent strategies, significantly reducing communication overhead. For Strongly-Convex-Strongly-Concave (SCSC) games, it achieves near-optimal communication complexity comparable to the best-known GDA rates. For weakly coupled games, where the interaction between players is weak relative to the non-interactive part of the game, Decoupled SGDA significantly reduces communication costs compared to standard SGDA. Additionally, Decoupled SGDA outperforms federated minimax approaches in noisy, imbalanced settings. These results establish Decoupled SGDA as a transformative approach for distributed optimization in resource-constrained environments.


Poster
#W-512
Momentum-Driven Adaptivity: Towards Tuning-Free Asynchronous Federated Learning

Wenjing Yan · Xiangyu Zhong · Xiaolu Wang · Angela Yingjun Zhang

Asynchronous federated learning (AFL) has emerged as a promising solution to address system heterogeneity and improve the training efficiency of federated learning. However, existing AFL methods face two critical limitations: 1) they rely on strong assumptions about bounded data heterogeneity across clients, and 2) they require meticulous tuning of learning rates based on unknown system parameters. In this paper, we tackle these challenges by leveraging momentum-based optimization and adaptive learning strategies. We first propose MasFL, a novel momentum-driven AFL framework that successfully eliminates the need for data heterogeneity bounds by effectively utilizing historical descent directions across clients and iterations. By mitigating the staleness accumulation caused by asynchronous updates, we prove that MasFL achieves state-of-the-art convergence rates with linear speedup in both the number of participating clients and local updates. Building on this foundation, we further introduce AdaMasFL, an adaptive variant that incorporates gradient normalization into local updates. Remarkably, this integration removes all dependencies on problem-specific parameters, yielding a fully tuning-free AFL approach while retaining theoretical guarantees. Extensive experiments demonstrate that AdaMasFL consistently outperforms state-of-the-art AFL methods in runtime efficiency and exhibits exceptional robustness across diverse learning rate configurations and system conditions.


Poster
#W-513
Lean and Mean Adaptive Optimization via Subset-Norm and Subspace-Momentum with Convergence Guarantees

Thien Nguyen · Huy Nguyen

We introduce two complementary techniques for efficient optimization that reduce memory requirements while accelerating training of large-scale neural networks. The first technique, Subset-Norm step size, generalizes AdaGrad-Norm and AdaGrad(-Coordinate) through step-size sharing. Subset-Norm (SN) reduces AdaGrad's memory footprint from $O(d)$ to $O(\sqrt{d})$, where $d$ is the model size. For non-convex smooth objectives under coordinate-wise sub-gaussian noise, we show a noise-adapted high-probability convergence guarantee with improved dimensional dependence of SN over existing methods. Our second technique, Subspace-Momentum, reduces the momentum state's memory footprint by restricting momentum to a low-dimensional subspace while performing SGD in the orthogonal complement. We prove high-probability convergence rates for Subspace-Momentum under standard assumptions. Empirical evaluation on pre-training and fine-tuning LLMs demonstrates the effectiveness of our methods. For instance, combining Subset-Norm with Subspace-Momentum achieves Adam's validation perplexity for LLaMA 1B in approximately half the training tokens (6.8B vs 13.1B) while reducing Adam's optimizer-states memory footprint by more than 80\% with minimal additional hyperparameter tuning.
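The step-size-sharing idea behind Subset-Norm can be illustrated with a short sketch (PyTorch on plain tensors, one subset per matrix row; a simplified illustration, not the authors' exact partitioning): a single accumulated squared-gradient norm is kept per subset, so the adaptive state shrinks from one entry per coordinate to one entry per row.

import torch

def subset_norm_step(param, grad, accum, lr=1e-2, eps=1e-8):
    # accum has shape (rows, 1): one shared adaptive denominator per row (the "subset"),
    # instead of one per coordinate as in AdaGrad(-Coordinate).
    accum += grad.pow(2).sum(dim=-1, keepdim=True)   # accumulate squared norm per subset
    param -= lr * grad / (accum.sqrt() + eps)        # shared AdaGrad-Norm-style step size
    return param, accum

Here accum would be initialized to zeros of shape (rows, 1), so a $\sqrt{d} \times \sqrt{d}$ weight matrix carries only $O(\sqrt{d})$ adaptive state.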


Poster
#W-514
Efficient Curvature-Aware Hypergradient Approximation for Bilevel Optimization

Youran Dong · Junfeng Yang · Wei Yao · Jin Zhang

Bilevel optimization is a powerful tool for many machine learning problems, such as hyperparameter optimization and meta-learning. Estimating hypergradients (also known as implicit gradients) is crucial for developing gradient-based methods for bilevel optimization. In this work, we propose a computationally efficient technique for incorporating curvature information into the approximation of hypergradients and present a novel algorithmic framework based on the resulting enhanced hypergradient computation. We provide convergence rate guarantees for the proposed framework in both deterministic and stochastic scenarios, particularly showing improved computational complexity over popular gradient-based methods in the deterministic setting. This improvement in complexity arises from a careful exploitation of the hypergradient structure and the inexact Newton method. In addition to the theoretical speedup, numerical experiments demonstrate the significant practical performance benefits of incorporating curvature information.
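For reference, the hypergradient (implicit gradient) that such methods approximate is the standard implicit-differentiation identity for a bilevel problem $\min_x F(x) := f(x, y^*(x))$ with $y^*(x) = \arg\min_y g(x, y)$: $\nabla F(x) = \nabla_x f(x, y^*(x)) - \nabla^2_{xy} g(x, y^*(x)) \big[\nabla^2_{yy} g(x, y^*(x))\big]^{-1} \nabla_y f(x, y^*(x))$. This is the textbook expression rather than the paper's specific scheme; curvature-aware and inexact-Newton methods differ mainly in how cheaply and accurately they approximate the inverse-Hessian-vector product appearing in it.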


Poster
#W-515
Automatic Differentiation of Optimization Algorithms with Time-Varying Updates

Sheheryar Mehmood · Peter Ochs

Numerous optimization algorithms have a time-varying update rule thanks to, for instance, a changing step size, momentum parameter, or Hessian approximation. Often, such algorithms are used as solvers for the lower-level problem in bilevel optimization, and are unrolled when computing the gradient of the upper-level objective. In this paper, we apply unrolled or automatic differentiation to a time-varying iterative process and provide convergence (rate) guarantees for the resulting derivative iterates. We then adapt these convergence results and apply them to proximal gradient descent with variable step size and FISTA when solving partly-smooth problems. We test the convergence (rates) of these algorithms numerically through several experiments. Our theoretical and numerical results show that the convergence rate of the algorithm is reflected in its derivative iterates.


Poster
#W-516
Secant Line Search for Frank-Wolfe Algorithms

Deborah Hendrych · Sebastian Pokutta · Mathieu Besançon · David Martinez-Rubio

We present a new step-size strategy based on the secant method for Frank-Wolfe algorithms. This strategy, which requires mild assumptions about the function under consideration, can be applied to any Frank-Wolfe algorithm. It is as effective as full line search and, in particular, allows for adapting to the local smoothness of the function, such as in (Pedregosa et al., 2020), but comes with a significantly reduced computational cost, leading to higher effective rates of convergence. We provide theoretical guarantees and demonstrate the effectiveness of the strategy through numerical experiments.
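As a rough illustration (NumPy; generic secant root-finding on the directional derivative, without the safeguards a practical implementation would add), the step size along the Frank-Wolfe direction $d = s - x$ can be found by a secant iteration on $\varphi'(\gamma) = \langle \nabla f(x + \gamma d), d \rangle$, requiring only one gradient evaluation per iteration instead of a full line search.

import numpy as np

def secant_step_size(grad_f, x, d, g0=0.0, g1=1.0, tol=1e-8, max_iter=20):
    # Secant iteration on phi'(gamma) = <grad f(x + gamma d), d>, the derivative of
    # the objective along the Frank-Wolfe direction; its root is the exact line-search step.
    d0 = float(np.dot(grad_f(x + g0 * d), d))
    d1 = float(np.dot(grad_f(x + g1 * d), d))
    for _ in range(max_iter):
        if abs(d1 - d0) < tol:
            break
        g2 = g1 - d1 * (g1 - g0) / (d1 - d0)   # secant update
        g2 = min(max(g2, 0.0), 1.0)            # Frank-Wolfe steps live in [0, 1]
        g0, d0 = g1, d1
        g1, d1 = g2, float(np.dot(grad_f(x + g2 * d), d))
    return g1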


Poster
#W-517
Active Learning of Deep Neural Networks via Gradient-Free Cutting Planes

Erica Zhang · Fangzhao Zhang · Mert Pilanci

Active learning methods aim to improve sample complexity in machine learning. In this work, we investigate an active learning scheme via a novel gradient-free cutting-plane training method for ReLU networks of arbitrary depth and develop a convergence theory. We demonstrate, for the first time, that cutting-plane algorithms, traditionally used in linear models, can be extended to deep neural networks despite their nonconvexity and nonlinear decision boundaries. Moreover, this training method induces the first deep active learning scheme known to achieve convergence guarantees, revealing a geometric contraction rate of the feasible set. We validate the effectiveness of our proposed active learning method against popular deep active learning baselines via both synthetic data experiments and a sentiment classification task on real datasets.


Poster
#W-518
Scalable Approximation Algorithms for $p$-Wasserstein Distance and Its Variants

Nathaniel Lahn · Sharath Raghvendra · Emma Saarinen · Pouyan Shirzadian

The $p$-Wasserstein distance measures the cost of optimally transporting one distribution to another, where the cost of moving a unit mass from $a$ to $b$ is the $p^{th}$ power of the ground distance $\mathrm{d}(a,b)$ between them. Despite its strong theoretical properties, its use in practice -- especially for $p \ge 2$ -- is limited due to two key challenges: sensitivity to noise and a lack of scalable algorithms. We identify noise sensitivity as a key reason why some existing approximation algorithms for $p=1$ fail to generalize to $p \ge 2$ and then present new algorithms for approximating the $p$-Wasserstein distance and its variant. First, when $\mathrm{d}(\cdot,\cdot)$ is a metric, for any constant $p \ge 2$, we present a novel relative $O(\log n)$-approximation algorithm to compute the $p$-Wasserstein distance between any two discrete distributions of size $n$. The algorithm runs in $O(n^2 \log U\log \Delta\log n)$ time, where $\log U$ is the bit-length of the input probabilities and $\Delta$ is the ratio of the largest to the smallest pairwise distance. We use $p$ hierarchically well-separated trees to define a distance that approximates the $p$-Wasserstein cost within a factor of $O(\log n)$ and then present a simple primal-dual algorithm to compute the $p$-Wasserstein cost with respect to this distance. Second, due to the noise sensitivity of the $p$-Wasserstein distance, we show that existing combinatorial approaches require $\Omega(n^2/\delta^p)$ time to approximate the $p$-Wasserstein distance within an additive error of $\delta$. In contrast, we show that, for any arbitrary distance $\mathrm{d}(\cdot,\cdot)$, a recent noise-resistant variant of the $p$-Wasserstein distance, called the $p$-RPW distance, can be approximated in $O(n^2/\delta^3)$ time.


Poster
#W-519
SHIELD: Multi-task Multi-distribution Vehicle Routing Solver with Sparsity and Hierarchy

Yong Liang Goh · Zhiguang Cao · Yining Ma · Jianan Zhou · Mohammed Haroon Dupty · Wee Sun Lee

Recent advances toward foundation models for routing problems have shown great potential of a unified deep model for various VRP variants. However, they overlook the complex real-world customer distributions. In this work, we advance the Multi-Task VRP (MTVRP) setting to the more realistic yet challenging Multi-Task Multi-Distribution VRP (MTMDVRP) setting, and introduce SHIELD, a novel model that leverages both sparsity and hierarchy principles. Building on a deeper decoder architecture, we first incorporate the Mixture-of-Depths (MoD) technique to enforce sparsity. This improves both efficiency and generalization by allowing the model to dynamically select which nodes use or skip each decoder layer, providing the capacity needed to adaptively allocate computation for learning task/distribution-specific and shared representations. We also develop a context-based clustering layer that exploits the presence of hierarchical structures in the problems to produce better local representations. These two designs inductively bias the network to identify key features that are common across tasks and distributions, leading to significantly improved generalization on unseen ones. Our empirical results demonstrate the superiority of our approach over existing methods on 9 real-world maps with 16 VRP variants each.


Poster
#W-520
An Asymptotically Optimal Approximation Algorithm for Multiobjective Submodular Maximization at Scale

Fabian Spaeh · Atsushi Miyauchi

Maximizing a single submodular set function subject to a cardinality constraint is a well-studied and central topic in combinatorial optimization. However, finding a set that maximizes multiple functions at the same time is much less understood, even though it is a formulation which naturally occurs in robust maximization or problems with fairness considerations such as fair influence maximization or fair allocation. In this work, we consider the problem of maximizing the minimum over many submodular functions subject to a cardinality constraint, which is known as multiobjective submodular maximization. All known polynomial-time approximation algorithms either obtain a weak approximation guarantee or rely on the evaluation of the multilinear extension. The latter is expensive to evaluate and renders such algorithms impractical. We bridge this gap and introduce the first scalable and practical algorithm that obtains the best-known approximation guarantee. We furthermore introduce a novel application, fair centrality maximization, and show how it can be addressed via multiobjective submodular maximization. In our experimental evaluation, we show that our algorithm outperforms known algorithms in terms of objective value and running time.


Spotlight Poster
#W-521
Discrepancy Minimization in Input-Sparsity Time

Yichuan Deng · Xiaoyu Li · Zhao Song · Omri Weinstein

A recent work by [Larsen, SODA 2023] introduced a faster combinatorial alternative to Bansal's SDP algorithm for finding a coloring $x \in \\{-1, 1\\}^n$ that approximately minimizes the discrepancy $\mathrm{disc}(A, x) := \\| A x \\|_{\infty}$ of a real-valued $m \times n$ matrix $A$. Larsen's algorithm runs in $\widetilde{O}(mn^2)$ time compared to Bansal's $\widetilde{O}(mn^{4.5})$-time algorithm, with a slightly weaker logarithmic approximation ratio in terms of the hereditary discrepancy of $A$ [Bansal, FOCS 2010]. We present a combinatorial $\widetilde{O}(\mathrm{nnz}(A) + n^3)$-time algorithm with the same approximation guarantee as Larsen's, optimal for tall matrices where $m = \mathrm{poly}(n)$. Using a more intricate analysis and fast matrix multiplication, we further achieve a runtime of $\widetilde{O}(\mathrm{nnz}(A) + n^{2.53})$, breaking the cubic barrier for square matrices and surpassing the limitations of linear-programming approaches [Eldan and Singh, RS\&A 2018]. Our algorithm relies on two key ideas: (i) a new sketching technique for finding a projection matrix with a short $\ell_2$-basis using implicit leverage-score sampling, and (ii) a data structure for efficiently implementing the iterative Edge-Walk partial-coloring algorithm [Lovett and Meka, SICOMP 2015], and using an alternative analysis to enable ``lazy'' batch updates with low-rank corrections. Our results nearly close the computational gap between real-valued and binary matrices, for which input-sparsity time coloring was recently obtained by [Jain, Sah and Sawhney, SODA 2023].


Poster
#W-600
Intersectional Fairness in Reinforcement Learning with Large State and Constraint Spaces

Eric Eaton · Marcel Hussing · Michael Kearns · Aaron Roth · Sikata Sengupta · Jessica Sorrell

In traditional reinforcement learning (RL), the learner aims to solve a single objective optimization problem: find the policy that maximizes expected reward. However, in many real-world settings, it is important to optimize over multiple objectives simultaneously. For example, when we are interested in fairness, states might have feature annotations corresponding to multiple (intersecting) demographic groups to whom reward accrues, and our goal might be to maximize the reward of the group receiving the minimal reward. In this work, we consider a multi-objective optimization problem in which each objective is defined by a state-based reweighting of a single scalar reward function. This generalizes the problem of maximizing the reward of the minimum reward group. We provide oracle-efficient algorithms to solve these multi-objective RL problems even when the number of objectives is very large --- for tabular MDPs, as well as for large MDPs when the group functions have additional structure. The contribution of this paper is that we are able to solve this class of multi-objective RL problems with a possibly exponentially large class of constraints over intersecting groups in both tabular and large state space MDPs in an oracle-efficient manner. Finally, we experimentally validate our theoretical results and demonstrate applications on a preferential attachment graph MDP.


Poster
#W-601
Scaling Value Iteration Networks to 5000 Layers for Extreme Long-Term Planning

Yuhui Wang · Qingyuan Wu · Dylan Ashley · Francesco Faccio · Weida Li · Chao Huang · Jürgen Schmidhuber

The Value Iteration Network (VIN) is an end-to-end differentiable neural network architecture for planning. It exhibits strong generalization to unseen domains by incorporating a differentiable planning module that operates on a latent Markov Decision Process (MDP). However, VINs struggle to scale to long-term and large-scale planning tasks, such as navigating a $100\times 100$ maze---a task that typically requires thousands of planning steps to solve. We observe that this deficiency is due to two issues: the representation capacity of the latent MDP and the planning module's depth. We address these by augmenting the latent MDP with a dynamic transition kernel, dramatically improving its representational capacity, and, to mitigate the vanishing gradient problem, by introducing an "adaptive highway loss" that constructs skip connections to improve gradient flow. We evaluate our method on 2D/3D maze navigation environments, continuous control, and the real-world Lunar rover navigation task. We find that our new method, named Dynamic Transition VIN (DT-VIN), scales to 5000 layers and solves challenging versions of the above tasks. Altogether, we believe that DT-VIN represents a concrete step forward in performing long-term large-scale planning in complex environments.


Poster
#W-602
Action-Constrained Imitation Learning

Chia-Han Yeh · Tse-Sheng Nan · Risto Vuorio · Wei Hung · Hung-Yen Wu · Shao-Hua Sun · Ping-Chun Hsieh

Policy learning under action constraints plays a central role in ensuring safe behaviors in various robot control and resource allocation applications. In this paper, we study a new problem setting termed Action-Constrained Imitation Learning (ACIL), where an action-constrained imitator aims to learn from a demonstrative expert with a larger action space. The fundamental challenge of ACIL lies in the unavoidable mismatch of occupancy measure between the expert and the imitator caused by the action constraints. We tackle this mismatch through trajectory alignment and propose DTWIL, which replaces the original expert demonstrations with a surrogate dataset that follows similar state trajectories while adhering to the action constraints. Specifically, we recast trajectory alignment as a planning problem and solve it via Model Predictive Control, which aligns the surrogate trajectories with the expert trajectories based on the Dynamic Time Warping (DTW) distance. Through extensive experiments, we demonstrate that learning from the dataset generated by DTWIL significantly enhances performance across multiple robot control tasks and outperforms various benchmark imitation learning algorithms in terms of sample efficiency.


Poster
#W-603
A Reduction Framework for Distributionally Robust Reinforcement Learning under Average Reward

Zachary Roch · George Atia · Yue Wang

Robust reinforcement learning (RL) under the average reward criterion, which seeks to optimize long-term system performance in uncertain environments, remains a largely unexplored area. To address this challenge, we propose a reduction-based framework that transforms robust average reward optimization into the more extensively studied robust discounted reward optimization by employing a specific discount factor. Our framework provides two key advantages. Data efficiency: we design a model-based reduction algorithm that achieves near-optimal sample complexity, enabling efficient identification of optimal robust policies. Scalability: by bypassing the inherent challenges of scaling up average reward optimization, our framework facilitates the design of scalable, convergent algorithms for robust average reward optimization that leverage function approximation. Our algorithmic design, supported by theoretical and empirical analyses, provides a concrete solution to robust average reward RL with the first data efficiency and scalability guarantees, highlighting the framework’s potential to optimize long-term performance under model uncertainty in practical problems.


Poster
#W-604
Off-Policy Actor-Critic for Adversarial Observation Robustness: Virtual Alternative Training via Symmetric Policy Evaluation

Kosuke Nakanishi · Akihiro Kubo · Yuji Yasui · Shin Ishii

Recently, robust reinforcement learning (RL) methods designed to handle adversarial input observations have received significant attention, motivated by RL's inherent vulnerabilities. While existing approaches have demonstrated reasonable success, addressing worst-case scenarios over long time horizons requires both minimizing the agent's cumulative rewards for adversaries and training agents to counteract them through alternating learning. However, this process introduces mutual dependencies between the agent and the adversary, making interactions with the environment inefficient and hindering the development of off-policy methods. In this work, we propose a novel off-policy method that eliminates the need for additional environmental interactions by reformulating adversarial learning as a soft-constrained optimization problem. Our approach is theoretically supported by the symmetric property of policy evaluation between the agent and the adversary. The implementation is available at https://github.com/nakanakakosuke/VALT_SAC.


Poster
#W-605
Adaptive Exploration for Multi-Reward Multi-Policy Evaluation

Alessio Russo · Aldo Pacchiano

We study the policy evaluation problem in an online multi-reward multi-policy discounted setting, where multiple reward functions must be evaluated simultaneously for different policies. We adopt an $(\epsilon,\delta)$-PAC perspective to achieve $\epsilon$-accurate estimates with high confidence across finite or convex sets of rewards, a setting that has not been investigated in the literature. Building on prior work on Multi-Reward Best Policy Identification, we adapt the MR-NaS exploration scheme to jointly minimize sample complexity for evaluating different policies across different reward sets. Our approach leverages an instance-specific lower bound revealing how the sample complexity scales with a measure of value deviation, guiding the design of an efficient exploration policy. Although computing this bound entails a hard non-convex optimization, we propose an efficient convex approximation that holds for both finite and convex reward sets. Experiments in tabular domains demonstrate the effectiveness of this adaptive exploration scheme.


Poster
#W-606
A Sub-Problem Quantum Alternating Operator Ansatz for Correlation Clustering

Lucas Fabian Naumann · Jannik Irmai · Bjoern Andres

The Quantum Alternating Operator Ansatz (QAOA) is a hybrid quantum-classical variational algorithm for approximately solving combinatorial optimization problems on Noisy Intermediate-Scale Quantum (NISQ) devices. Although it has been successfully applied to a variety of problems, there is only limited work on correlation clustering due to the difficulty of modelling the problem constraints with the ansatz. Motivated by this, we present a generalization of QAOA that is more suitable for this problem. In particular, we modify QAOA in two ways: Firstly, we use nucleus sampling for the computation of the expected cost. Secondly, we split the problem into sub-problems, solving each individually with QAOA. We call this generalization the Sub-Problem Quantum Alternating Operator Ansatz (SQAOA) and show theoretically that optimal solutions for correlation clustering instances can be obtained with certainty when the depth of the ansatz tends to infinity. Further, we show experimentally that SQAOA achieves better approximation ratios than QAOA for correlation clustering, while using only one qubit per node of the respective problem instance and reducing the runtime (of simulations).


Poster
#W-607
Towards Universal Offline Black-Box Optimization via Learning Language Model Embeddings

Rong-Xi Tan · Ming Chen · Ke Xue · Yao Wang · Yaoyuan Wang · Fu Sheng · Chao Qian

The pursuit of universal black-box optimization (BBO) algorithms is a longstanding goal. However, unlike domains such as language or vision, where scaling structured data has driven generalization, progress in offline BBO remains hindered by the lack of unified representations for heterogeneous numerical spaces. Thus, existing offline BBO approaches are constrained to single-task and fixed-dimensional settings, failing to achieve cross-domain universal optimization. Recent advances in language models (LMs) offer a promising path forward: their embeddings capture latent relationships in a unifying way, making universal optimization across different data types possible. In this paper, we discuss multiple potential approaches, including an end-to-end learning framework in the form of next-token prediction, as well as prioritizing the learning of latent spaces with strong representational capabilities. To validate the effectiveness of these methods, we collect offline BBO tasks and data from open-source academic works for training. Experiments demonstrate the universality and effectiveness of our proposed methods. Our findings suggest that unifying language model priors and learning a string embedding space can overcome traditional barriers in universal BBO, paving the way for general-purpose BBO algorithms. The code is provided at https://github.com/lamda-bbo/universal-offline-bbo.


Poster
#W-608
Improved Lower Bounds for First-order Stochastic Non-convex Optimization under Markov Sampling

Zhenyu Sun · Ermin Wei

Unlike its vanilla counterpart with i.i.d. samples, stochastic optimization with Markovian sampling allows the sampling scheme to follow a Markov chain. This problem encompasses various applications that range from asynchronous distributed optimization to reinforcement learning. In this work, we lower bound the sample complexity of finding $\epsilon$-approximate critical solutions for any first-order methods when sampling is Markovian. We show that for samples drawn from stationary Markov processes with countable state space, any algorithm that accesses smooth, non-convex functions through queries to a stochastic gradient oracle requires at least $\Omega(\epsilon^{-4})$ samples. Moreover, for finite Markov chains, we show an $\Omega(\epsilon^{-2})$ lower bound and propose a new algorithm, called MaC-SAGE, that is proven to (nearly) match our lower bound.


Poster
#W-609
A Memory Efficient Randomized Subspace Optimization Method for Training Large Language Models

Yiming Chen · Yuan Zhang · Yin Liu · Kun Yuan · Zaiwen Wen

The memory challenges associated with training Large Language Models (LLMs) have become a critical concern, particularly when using the Adam optimizer. To address this issue, numerous memory-efficient techniques have been proposed, with GaLore standing out as a notable example designed to reduce the memory footprint of optimizer states. However, these approaches do not alleviate the memory burden imposed by activations, rendering them unsuitable for scenarios involving long context sequences or large mini-batches. Moreover, their convergence properties are still not well-understood in the literature. In this work, we introduce a Randomized Subspace Optimization framework for pre-training and fine-tuning LLMs. Our approach decomposes the high-dimensional training problem into a series of lower-dimensional subproblems. At each iteration, a random subspace is selected, and the parameters within that subspace are optimized. This structured reduction in dimensionality allows our method to simultaneously reduce memory usage for both activations and optimizer states. We establish comprehensive convergence guarantees and derive rates for various scenarios, accommodating different optimization strategies to solve the subproblems. Extensive experiments validate the superior memory and communication efficiency of our method, achieving performance comparable to GaLore and Adam.


Spotlight Poster
#W-610
Convergence of Mean-Field Langevin Stochastic Descent-Ascent for Distributional Minimax Optimization

Zhangyi Liu · Feng Liu · Rui Gao · Shuang Li

We study convergence properties of the discrete-time Mean-Field Langevin Stochastic Descent-Ascent (MFL-SDA) algorithm for solving distributional minimax optimization. These problems arise in various applications, such as zero-sum games, generative adversarial networks and distributionally robust learning. Despite the significance of MFL-SDA in these contexts, the discrete-time convergence rate remains underexplored. To address this gap, we establish a last-iterate convergence rate of $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ for MFL-SDA. This rate is nearly optimal when compared to the complexity lower bound of its Euclidean counterpart. This rate also matches the complexity of mean-field Langevin stochastic gradient descent for distributional minimization and the outer-loop iteration complexity of an existing double-loop algorithm for distributional minimax problems. By leveraging an elementary analysis framework that avoids PDE-based techniques, we overcome previous limitations and achieve a faster convergence rate.


Poster
#W-611
Optimal Transport Barycenter via Nonconvex-Concave Minimax Optimization

Kaheon Kim · Rentian Yao · Changbo Zhu · Xiaohui Chen

The optimal transport barycenter (a.k.a. Wasserstein barycenter) is a fundamental notion of averaging that extends from the Euclidean space to the Wasserstein space of probability distributions. Computation of the *unregularized* barycenter for discretized probability distributions on point clouds is a challenging task when the domain dimension $d > 1$. Most practical algorithms for approximating the barycenter problem are based on entropic regularization. In this paper, we introduce a nearly linear time $O(m \log{m})$ and linear space complexity $O(m)$ primal-dual algorithm, the *Wasserstein-Descent $\dot{\mathbb{H}}^1$-Ascent* (WDHA) algorithm, for computing the *exact* barycenter when the input probability density functions are discretized on an $m$-point grid. The key success of the WDHA algorithm hinges on alternating between two different yet closely related Wasserstein and Sobolev optimization geometries for the primal barycenter and dual Kantorovich potential subproblems. Under reasonable assumptions, we establish the convergence rate and iteration complexity of WDHA to its stationary point when the step size is appropriately chosen. Superior computational efficacy, scalability, and accuracy over the existing Sinkhorn-type algorithms are demonstrated on high-resolution (e.g., $1024 \times 1024$ images) 2D synthetic and real data.


Poster
#W-612
A Novel Characterization of the Population Area Under the Risk Coverage Curve (AURC) and Rates of Finite Sample Estimators

Han Zhou · Jordy Van Landeghem · Teodora Popordanoska · Matthew B Blaschko

The selective classifier (SC) has been proposed for rank-based uncertainty thresholding, which could have applications in safety-critical areas such as medical diagnostics, autonomous driving, and the justice system. The Area Under the Risk-Coverage Curve (AURC) has emerged as the foremost evaluation metric for assessing the performance of SC systems. In this work, we present a formal statistical formulation of the population AURC, presenting an equivalent expression that can be interpreted as a reweighted risk function. Through Monte Carlo methods, we derive empirical AURC plug-in estimators for finite sample scenarios. The weight estimators associated with these plug-in estimators are shown to be consistent, with low bias and tightly bounded mean squared error (MSE). The plug-in estimators are proven to converge at a rate of $\mathcal{O}(\sqrt{\ln(n)/n})$, demonstrating statistical consistency. We empirically validate the effectiveness of our estimators through experiments across multiple datasets, model architectures, and confidence score functions (CSFs), demonstrating consistency and effectiveness in fine-tuning AURC performance.


Poster
#W-613
MetaOptimize: A Framework for Optimizing Step Sizes and Other Meta-parameters

Arsalan Sharifnassab · Saber Salehkaleybar · Rich Sutton

We address the challenge of optimizing meta-parameters (hyperparameters) in machine learning, a key factor for efficient training and high model performance. Rather than relying on expensive meta-parameter search methods, we introduce MetaOptimize: a dynamic approach that adjusts meta-parameters, particularly step sizes (also known as learning rates), during training. More specifically, MetaOptimize can wrap around any first-order optimization algorithm, tuning step sizes on the fly to minimize a specific form of regret that considers the long-term impact of step sizes on training, through a discounted sum of future losses. We also introduce lower-complexity variants of MetaOptimize that, in conjunction with its adaptability to various optimization algorithms, achieve performance comparable to those of the best hand-crafted learning rate schedules across diverse machine learning tasks.
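For a flavor of on-the-fly step-size adaptation, here is a sketch of the classic hypergradient-descent rule of Baydin et al. (2018), which updates the learning rate using the inner product of consecutive gradients; this is a simpler, generic mechanism and not MetaOptimize's discounted-regret objective.

import numpy as np

def hypergradient_sgd(grad_f, x, alpha=1e-3, beta=1e-8, steps=1000):
    # The step size alpha is itself adapted during training: it grows when consecutive
    # gradients align (steady progress) and shrinks when they oppose (overshooting).
    g_prev = np.zeros_like(x)
    for _ in range(steps):
        g = grad_f(x)
        alpha += beta * float(np.sum(g * g_prev))  # hypergradient update of the step size
        x = x - alpha * g
        g_prev = g
    return x, alpha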


Poster
#W-614
Constant Stepsize Local GD for Logistic Regression: Acceleration by Instability

Michael Crawshaw · Blake Woodworth · Mingrui Liu

Existing analysis of Local (Stochastic) Gradient Descent for heterogeneous objectives requires stepsizes $\eta \leq 1/K$ where $K$ is the communication interval, which ensures monotonic decrease of the objective. In contrast, we analyze Local Gradient Descent for logistic regression with separable, heterogeneous data using any stepsize $\eta > 0$. With $R$ communication rounds and $M$ clients, we show convergence at a rate $\mathcal{O}(1/\eta K R)$ after an initial unstable phase lasting for $\widetilde{\mathcal{O}}(\eta K M)$ rounds. This improves upon the existing $\mathcal{O}(1/R)$ rate for general smooth, convex objectives. Our analysis parallels the single machine analysis of Wu et al. (2024) in which instability is caused by extremely large stepsizes, but in our setting another source of instability is large local updates with heterogeneous objectives.


Poster
#W-615
Distributed Retraction-Free and Communication-Efficient Optimization on the Stiefel Manifold

Yilong Song · Peijin Li · Bin Gao · Kun Yuan

Optimization problems on the Stiefel manifold, ranging from principal component analysis to enhancing neural network robustness, are ubiquitous in machine learning. The Landing algorithm avoids computationally expensive retraction operations on manifolds, making it highly competitive for large-scale problems. This paper extends this method to distributed settings, introducing EF-Landing, the first retraction-free and communication-efficient algorithm for distributed stochastic optimization on the Stiefel manifold. By incorporating communication compression and error feedback, EF-Landing ensures convergence and constraint feasibility while significantly reducing communication overhead. We provide sharp convergence guarantees, demonstrating that EF-Landing achieves the same asymptotic linear speedup convergence rate as existing methods without communication compression. Furthermore, our analysis is highly versatile, applying to both deterministic and stochastic settings and encompassing algorithms based on gradient descent or momentum-based gradient descent. We also generalize EF-Landing to operate on block-wise Stiefel manifolds, enabling greater flexibility for structured constraints. Extensive numerical experiments validate our theoretical results.


Poster
#W-616
Ringmaster ASGD: The First Asynchronous SGD with Optimal Time Complexity

Artavazd Maranjyan · Alexander Tyurin · Peter Richtarik

Asynchronous Stochastic Gradient Descent (Asynchronous SGD) is a cornerstone method for parallelizing learning in distributed machine learning. However, its performance suffers under arbitrarily heterogeneous computation times across workers, leading to suboptimal time complexity and inefficiency as the number of workers scales. While several Asynchronous SGD variants have been proposed, recent findings by Tyurin & Richtárik (NeurIPS 2023) reveal that none achieve optimal time complexity, leaving a significant gap in the literature. In this paper, we propose Ringmaster ASGD, a novel Asynchronous SGD method designed to address these limitations and tame the inherent challenges of Asynchronous SGD. We establish, through rigorous theoretical analysis, that Ringmaster ASGD achieves optimal time complexity under arbitrarily heterogeneous and dynamically fluctuating worker computation times. This makes it the first Asynchronous SGD method to meet the theoretical lower bounds for time complexity in such scenarios.


Poster
#W-617
FedOne: Query-Efficient Federated Learning for Black-box Discrete Prompt Learning

Ganyu Wang · Jinjie Fang · Maxwell (Juncheng) Yin · Bin Gu · Xi Chen · Boyu Wang · Yi Chang · Charles X. Ling

Black-Box Discrete Prompt Learning (BDPL) is a prompt-tuning method that optimizes discrete prompts without accessing model parameters or gradients, making prompt tuning on a cloud-based Large Language Model (LLM) feasible. Adapting Federated Learning (FL) to BDPL could further enhance prompt tuning performance by leveraging data from diverse sources. However, all previous research on federated black-box prompt tuning has neglected the substantial query cost associated with the cloud-based LLM service. To address this gap, we conducted a theoretical analysis of query efficiency within the context of federated black-box prompt tuning. Our findings revealed that degrading FedAvg to activate only one client per round, a strategy we called \textit{FedOne}, enabled optimal query efficiency in federated black-box prompt learning. Building on this insight, we proposed the FedOne framework, a federated black-box discrete prompt learning method designed to maximize query efficiency when interacting with cloud-based LLMs. We conducted numerical experiments on various aspects of our framework, demonstrating a significant improvement in query efficiency, which aligns with our theoretical results.


Poster
#W-618
PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization

Xinyi Wan · Penghui Qi · Guangxing Huang · Min Lin · Jialin Li

Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. Through an empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitations. Our experiments show that the per-device activation memory decreases effectively with the total number of stages, making PP a stronger alternative to TP, offering up to a 19\% acceleration with even lower memory consumption.


Poster
#W-619
FRUGAL: Memory-Efficient Optimization by Reducing State Overhead for Scalable Training

Filipp Zmushko · Aleksandr Beznosikov · Martin Takac · Samuel Horváth

With the increase in the number of parameters in large language models, the training process increasingly demands larger volumes of GPU memory. A significant portion of this memory is typically consumed by the optimizer state. To overcome this challenge, recent approaches such as low-rank adaptation (LoRA), low-rank gradient projection (GaLore), and blockwise optimization (BAdam) have been proposed. However, in all these algorithms, the effective rank of the weight updates remains low-rank, which can lead to a substantial loss of information from the gradient. This loss can be critically important, especially during the pre-training stage. In this paper, we introduce FRUGAL (Full-Rank Updates with GrAdient spLitting), a new memory-efficient optimization framework. FRUGAL leverages gradient splitting to perform low-dimensional updates using advanced algorithms (such as Adam), while updates along the remaining directions are executed via state-free methods like SGD or signSGD. Our framework can be integrated with various low-rank update selection techniques, including GaLore and BAdam. We provide theoretical convergence guarantees for our framework when using SGDM for low-dimensional updates and SGD for state-free updates. Additionally, our method consistently outperforms concurrent approaches, achieving state-of-the-art results in pre-training and fine-tuning tasks while balancing memory efficiency and performance metrics.
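The gradient-splitting idea can be sketched as follows (PyTorch on raw tensors; P is a placeholder matrix with orthonormal columns spanning the chosen low-dimensional subspace, and bias correction is omitted; a simplified illustration, not the released implementation): the in-subspace component receives a stateful Adam-like update, while the orthogonal remainder is updated with state-free signSGD, so the overall update stays full-rank.

import torch

def frugal_style_step(p, grad, P, exp_avg, exp_avg_sq, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    g = grad.reshape(-1)
    m = P.T @ g                                   # low-dimensional coordinates of the gradient
    g_rest = g - P @ m                            # component orthogonal to span(P): state-free part
    exp_avg.mul_(betas[0]).add_(m, alpha=1 - betas[0])            # Adam-style first moment (k-dim)
    exp_avg_sq.mul_(betas[1]).addcmul_(m, m, value=1 - betas[1])  # Adam-style second moment (k-dim)
    update = P @ (exp_avg / (exp_avg_sq.sqrt() + eps)) + g_rest.sign()
    p.data.add_(update.reshape(p.shape), alpha=-lr)
    return p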


Poster
#W-620
Riemannian Diffusion Adaptation for Distributed Optimization on Manifolds

Xiuheng Wang · Ricardo Borsoi · Cédric Richard · Ali Sayed

Online distributed optimization is particularly useful for solving optimization problems with streaming data collected by multiple agents over a network. When the solutions lie on a Riemannian manifold, such problems become challenging to solve, particularly when efficiency and continuous adaptation are required. This work tackles these challenges and devises a diffusion adaptation strategy for decentralized optimization over general manifolds. A theoretical analysis shows that the proposed algorithm is able to approach network agreement after sufficient iterations, which allows a non-asymptotic convergence result to be derived. We apply the algorithm to the online decentralized principal component analysis problem and Gaussian mixture model inference. Experimental results with both synthetic and real data illustrate its performance.


Poster
#W-621
Differentiable Quadratic Optimization For the Maximum Independent Set Problem

Ismail Alkhouri · Cedric Le Denmat · Yingjie Li · Cunxi Yu · Jia (Kevin) Liu · Rongrong Wang · Alvaro Velasquez

Combinatorial Optimization (CO) addresses many important problems, including the challenging Maximum Independent Set (MIS) problem. Alongside exact and heuristic solvers, differentiable approaches have emerged, often using continuous relaxations of quadratic objectives. Noting that an MIS in a graph is a Maximum Clique (MC) in its complement, we propose a new quadratic formulation for MIS by incorporating an MC term, improving convergence and exploration. We show that every maximal independent set corresponds to a local minimizer, derive conditions with respect to the MIS size, and characterize stationary points. To tackle the non-convexity of the objective, we propose optimizing several initializations in parallel using momentum-based gradient descent, complemented by an efficient MIS checking criterion derived from our theory. We dub our method as parallelized Clique-Informed Quadratic Optimization for MIS (pCQO-MIS). Our experimental results demonstrate the effectiveness of the proposed method compared to exact, heuristic, sampling, and data-centric approaches. Notably, our method avoids the out-of-distribution tuning and reliance on (un)labeled data required by data-centric methods, while achieving superior MIS sizes and competitive run-time relative to their inference time. Additionally, a key advantage of pCQO-MIS is that, unlike exact and heuristic solvers, the run-time scales only with the number of nodes in the graph, not the number of edges. Our code is available at the GitHub repository: https://github.com/ledenmat/pCQO-mis-benchmark/tree/refactor.
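To make the quadratic-relaxation idea concrete, here is a rough sketch of projected gradient ascent on a classic continuous MIS objective with parallel random initializations (PyTorch; adj is a dense {0,1} adjacency matrix as a float tensor with zero diagonal; the paper's objective additionally includes a Maximum-Clique term on the complement graph and a tailored checking criterion, which are not reproduced here).

import torch

def quadratic_mis(adj, n_init=64, gamma=2.0, lr=0.1, steps=500):
    # Maximize  sum_i x_i - (gamma / 2) * x^T A x  over the box [0, 1]^n,
    # from n_init random starts in parallel (one per batch row).
    n = adj.shape[0]
    x = torch.rand(n_init, n)
    for _ in range(steps):
        grad = 1.0 - gamma * x @ adj               # gradient of the relaxed objective
        x = (x + lr * grad).clamp_(0.0, 1.0)       # project back onto the box
    cand = (x > 0.5).float()                       # round each run to a candidate set
    feasible = ((cand @ adj) * cand).sum(dim=1) == 0   # no edge inside the selected set
    sizes = torch.where(feasible, cand.sum(dim=1), torch.full((n_init,), -1.0))
    return cand[sizes.argmax()]                    # largest independent set found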


Poster
#W-700
QPRL : Learning Optimal Policies with Quasi-Potential Functions for Asymmetric Traversal

Jumman Hossain · Nirmalya Roy

Reinforcement learning (RL) in real-world tasks such as robotic navigation often encounters environments with asymmetric traversal costs, where actions like climbing uphill versus moving downhill incur distinctly different penalties, or transitions may become irreversible. While recent quasimetric RL methods relax symmetry assumptions, they typically do not explicitly account for path-dependent costs or provide rigorous safety guarantees. We introduce Quasi-Potential Reinforcement Learning (QPRL), a novel framework that explicitly decomposes asymmetric traversal costs into a path-independent potential function ($\Phi$) and a path-dependent residual ($\Psi$). This decomposition allows efficient learning and stable policy optimization via a Lyapunov-based safety mechanism. Theoretically, we prove that QPRL achieves convergence with improved sample complexity of $\tilde{O}(\sqrt{T})$, surpassing prior quasimetric RL bounds of $\tilde{O}(T)$. Empirically, our experiments demonstrate that QPRL attains state-of-the-art performance across various navigation and control tasks, significantly reducing irreversible constraint violations by approximately $4\times$ compared to baselines.


Poster
#W-701
In-Context Reinforcement Learning From Suboptimal Historical Data

Juncheng Dong · Moyang Guo · Ethan Fang · Zhuoran Yang · Vahid Tarokh

Transformer models have achieved remarkable empirical successes, largely due to their in-context learning capabilities. Inspired by this, we explore training an autoregressive transformer for in-context reinforcement learning (ICRL). In this setting, we initially train a transformer on an offline dataset consisting of trajectories collected from various RL tasks, and then fix and use this transformer to create an action policy for new RL tasks. Notably, we consider the setting where the offline dataset contains trajectories sampled from suboptimal behavioral policies. In this case, standard autoregressive training corresponds to imitation learning and results in suboptimal performance. To address this, we propose the Decision Importance Transformer (DIT) framework, which emulates the actor-critic algorithm in an in-context manner. In particular, we first train a transformer-based value function that estimates the advantage functions of the behavior policies that collected the suboptimal trajectories. Then we train a transformer-based policy via a weighted maximum likelihood estimation loss, where the weights are constructed based on the trained value function to steer the suboptimal policies to the optimal ones. We conduct extensive experiments to test the performance of DIT on both bandit and Markov Decision Process problems. Our results show that DIT achieves superior performance, particularly when the offline dataset contains suboptimal historical data.
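The weighted maximum-likelihood step can be illustrated with a generic advantage-weighting sketch (PyTorch; the temperature beta and the clamping are illustrative choices, and the paper constructs its weights from a transformer-based value function rather than this simple exponential form).

import torch
import torch.nn.functional as F

def advantage_weighted_nll(logits, actions, advantages, beta=1.0):
    # Actions with larger estimated advantage get (exponentially) larger weight,
    # steering the learned policy away from the suboptimal behavior policy.
    weights = torch.exp(advantages / beta).clamp(max=100.0).detach()
    nll = F.cross_entropy(logits, actions, reduction="none")   # -log pi(a | s)
    return (weights * nll).mean()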


Poster
#W-702
Calibrated Value-Aware Model Learning with Probabilistic Environment Models

Claas Voelcker · Anastasiia Pedan · Arash Ahmadian · Romina Abachi · Igor Gilitschenski · Amir-massoud Farahmand

The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.


Poster
#W-703
Time-Aware World Model for Adaptive Prediction and Control

Anh Nhu · Sanghyun Son · Ming Lin

In this work, we introduce the Time-Aware World Model (TAWM), a model-based approach that explicitly incorporates temporal dynamics. By conditioning on the time-step size, $\Delta t$, and training over a diverse range of $\Delta t$ values – rather than sampling at a fixed time-step – TAWM learns both high- and low-frequency task dynamics across diverse control problems. Grounded in the information-theoretic insight that the optimal sampling rate depends on a system’s underlying dynamics, this time-aware formulation improves both performance and data efficiency. Empirical evaluations show that TAWM consistently outperforms conventional models across varying observation rates in a variety of control tasks, using the same number of training samples and iterations. Our code can be found online at: github.com/anh-nn01/Time-Aware-World-Model.


Poster
#W-704
Pessimism Principle Can Be Effective: Towards a Framework for Zero-Shot Transfer Reinforcement Learning

Chi Zhang · Ziying Jia · George Atia · Sihong He · Yue Wang

Transfer reinforcement learning aims to derive a near-optimal policy for a target environment with limited data by leveraging abundant data from related source domains. However, it faces two key challenges: the lack of performance guarantees for the transferred policy, which can lead to undesired actions, and the risk of negative transfer when multiple source domains are involved. We propose a novel framework based on the pessimism principle, which constructs and optimizes a conservative estimation of the target domain’s performance. Our framework effectively addresses the two challenges by providing an optimized lower bound on target performance, ensuring safe and reliable decisions, and by exhibiting monotonic improvement with respect to the quality of the source domains, thereby avoiding negative transfer. We construct two types of conservative estimations, rigorously characterize their effectiveness, and develop efficient distributed algorithms with convergence guarantees. Our framework provides a theoretically sound and practically robust solution for transfer learning in reinforcement learning.


Spotlight Poster
#W-705
Log-Sum-Exponential Estimator for Off-Policy Evaluation and Learning

Armin Behnamnia · Gholamali Aminian · Alireza Aghaei · Chengchun Shi · Vincent Tan · Hamid R Rabiee

Off-policy learning and evaluation leverage logged bandit feedback datasets, which contain context, action, propensity score, and feedback for each data point. These scenarios face significant challenges due to high variance and poor performance with low-quality propensity scores and heavy-tailed reward distributions. We address these issues by introducing a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. For off-policy evaluation, we derive upper bounds on the estimator's bias and variance. In the off-policy learning scenario, we establish bounds on the regret—the performance gap between our LSE estimator and the optimal policy—assuming bounded $(1+\epsilon)$-th moment of weighted reward. Notably, we achieve a convergence rate of $O(n^{-\epsilon/(1+\epsilon)})$ for the regret bounds, where $\epsilon\in[0,1]$ and $n$ is the size of logged bandit feedback dataset. Theoretical analysis is complemented by comprehensive empirical evaluations in both off-policy learning and evaluation scenarios, confirming the practical advantages of our approach. The code for our estimator is available at the following link: https://github.com/armin-behnamnia/lse-offpolicy-learning .
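For intuition, here is a minimal numerical sketch of a log-sum-exponential estimator (the exact form used in the paper, including the sign and scaling of the parameter, may differ; this is only an illustration of the operator): as the parameter tends to zero it recovers the ordinary inverse-propensity-score average, while a nonzero parameter damps the influence of heavy-tailed terms.

import numpy as np
from scipy.special import logsumexp

def lse_estimate(rewards, weights, lam=0.1):
    # V_hat = (1 / lam) * log( (1/n) * sum_i exp(lam * w_i * r_i) ),
    # where w_i are importance weights (target / logging propensity) and r_i are rewards.
    z = lam * np.asarray(weights) * np.asarray(rewards)
    return (logsumexp(z) - np.log(len(z))) / lam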


Poster
#W-706
Temporal Distance-aware Transition Augmentation for Offline Model-based Reinforcement Learning

Dongsu Lee · Minhae Kwon

The goal of offline reinforcement learning (RL) is to extract the best possible policy from the previously collected dataset considering the out-of-distribution (OOD) sample issue. Offline model-based RL (MBRL) is a captivating solution capable of alleviating such issues through a \textit{state-action transition augmentation} with a learned dynamic model. Unfortunately, offline MBRL methods have long been observed to fail in sparse-reward and long-horizon environments. In this work, we propose a novel MBRL method, dubbed Temporal Distance-Aware Transition Augmentation (TempDATA), that generates additional transitions in a geometrically structured representation space, instead of state space. For comprehending long-horizon behaviors efficiently, our main idea is to learn state abstraction, which captures a temporal distance from both trajectory and transition levels of state space. Our experiments empirically confirm that TempDATA outperforms previous offline MBRL methods and achieves matching or surpassing the performance of diffusion-based trajectory augmentation and goal-conditioned RL on the D4RL AntMaze, FrankaKitchen, CALVIN, and pixel-based FrankaKitchen.


Poster
#W-707
Reflect-then-Plan: Offline Model-Based Planning through a Doubly Bayesian Lens

Jihwan Jeong · Xiaoyu Wang · Jingmin Wang · Scott Sanner · Pascal Poupart

Offline reinforcement learning (RL) is crucial when online exploration is costly or unsafe but often struggles with high epistemic uncertainty due to limited data. Existing methods rely on fixed conservative policies, restricting adaptivity and generalization. To address this, we propose Reflect-then-Plan (RefPlan), a novel doubly Bayesian offline model-based (MB) planning approach. RefPlan unifies uncertainty modeling and MB planning by recasting planning as Bayesian posterior estimation. At deployment, it updates a belief over environment dynamics using real-time observations, incorporating uncertainty into MB planning via marginalization. Empirical results on standard benchmarks show that RefPlan significantly improves the performance of conservative offline RL policies. In particular, RefPlan maintains robust performance under high epistemic uncertainty and limited data, while demonstrating resilience to changing environment dynamics, improving the flexibility, generalizability, and robustness of offline-learned policies.


Poster
#W-708
Multi-Stage Manipulation with Demonstration-Augmented Reward, Policy, and World Model Learning

Adrià López Escoriza · Nicklas Hansen · Stone Tao · Tongzhou Mu · Hao Su

Long-horizon tasks in robotic manipulation present significant challenges in reinforcement learning (RL) due to the difficulty of designing dense reward functions and effectively exploring the expansive state-action space. However, despite a lack of dense rewards, these tasks often have a multi-stage structure, which can be leveraged to decompose the overall objective into manageable sub-goals. In this work, we propose DEMO³, a framework that exploits this structure for efficient learning from visual inputs. Specifically, our approach incorporates multi-stage dense reward learning, a bi-phasic training scheme, and world model learning into a carefully designed demonstration-augmented RL framework that strongly mitigates the challenge of exploration in long-horizon tasks. Our evaluations demonstrate that our method improves data-efficiency by an average of 40% and by 70% on particularly difficult tasks compared to state-of-the-art approaches. We validate this across 16 sparse-reward tasks spanning four domains, including challenging humanoid visual control tasks using as few as five demonstrations.


Poster
#W-709
Trajectory World Models for Heterogeneous Environments

Shaofeng Yin · Jialong Wu · Siqiao Huang · Xingjian Su · he · Jianye Hao · Mingsheng Long

Heterogeneity in sensors and actuators across environments poses a significant challenge to building large-scale pre-trained world models on top of this low-dimensional sensor information. In this work, we explore pre-training world models for heterogeneous environments by addressing key transfer barriers in both data diversity and model flexibility. We introduce UniTraj, a unified dataset comprising over one million trajectories from 80 environments, designed to scale data while preserving critical diversity. Additionally, we propose TrajWorld, a novel architecture capable of flexibly handling varying sensor and actuator information and capturing environment dynamics in-context. Pre-training TrajWorld on UniTraj yields substantial gains in transition prediction, achieves a new state-of-the-art for off-policy evaluation, and also delivers superior online performance of model predictive control. To the best of our knowledge, this work, for the first time, demonstrates the transfer benefits of world models across heterogeneous and complex control environments. Code and data are available at https://github.com/thuml/TrajWorld.


Poster
#W-710
Stealing That Free Lunch: Exposing the Limits of Dyna-Style Reinforcement Learning

Brett Barkley · David Fridovich-Keil

Dyna-style off-policy model-based reinforcement learning (DMBRL) algorithms are a family of techniques for generating synthetic state transition data and thereby enhancing the sample efficiency of off-policy RL algorithms. This paper identifies and investigates a surprising performance gap observed when applying DMBRL algorithms across different benchmark environments with proprioceptive observations. We show that, while DMBRL algorithms perform well in control tasks in OpenAI Gym, their performance can drop significantly in DeepMind Control Suite (DMC), even though these settings offer similar tasks and identical physics backends. Modern techniques designed to address several key issues that arise in these settings do not provide a consistent improvement across all environments, and overall our results show that adding synthetic rollouts to the training process --- the backbone of Dyna-style algorithms --- significantly degrades performance across most DMC environments. Our findings contribute to a deeper understanding of several fundamental challenges in model-based RL and show that, like many optimization fields, there is no free lunch when evaluating performance across diverse benchmarks in RL.


Poster
#W-711
The Impact of On-Policy Parallelized Data Collection on Deep Reinforcement Learning Networks

Walter Mayor · Johan Obando-Ceron · Aaron Courville · Pablo Samuel Castro

The use of parallel actors for data collection has been an effective technique used in reinforcement learning (RL) algorithms. The manner in which data is collected in these algorithms, controlled via the number of parallel environments and the rollout length, induces a form of bias-variance trade-off; the number of training passes over the collected data, on the other hand, must strike a balance between sample efficiency and overfitting. We conduct an empirical analysis of these trade-offs on PPO, one of the most popular RL algorithms that uses parallel actors, and establish connections to network plasticity and, more generally, optimization stability. We examine its impact on network architectures, as well as the hyper-parameter sensitivity when scaling data. Our analyses indicate that larger dataset sizes can increase final performance across a variety of settings, and that scaling parallel environments is more effective than increasing rollout lengths. These findings highlight the critical role of data collection strategies in improving agent performance.


Poster
#W-712
Robust Reward Alignment via Hypothesis Space Batch Cutting

Zhixian Xie · Haode Zhang · Yizhe Feng · Wanxin Jin

Reward design in reinforcement learning and optimal control is challenging. Preference-based alignment addresses this by enabling agents to learn rewards from ranked trajectory pairs provided by humans. However, existing methods often suffer from poor robustness to unknown false human preferences. In this work, we propose a robust and efficient reward alignment method based on a novel and geometrically interpretable perspective: hypothesis space batched cutting. Our method iteratively refines the reward hypothesis space through “cuts” based on batches of human preferences. Within each batch, human preferences, queried based on disagreement, are grouped using a voting function to determine the appropriate cut, ensuring a bounded human query complexity. To handle unknown erroneous preferences, we introduce a conservative cutting method within each batch, preventing erroneous human preferences from making overly aggressive cuts to the hypothesis space. This guarantees provable robustness against false preferences, while eliminating the need to explicitly identify them. We evaluate our method in a model predictive control setting across diverse tasks. The results demonstrate that our framework achieves comparable or superior performance to state-of-the-art methods in error-free settings while significantly outperforming existing methods when handling a high percentage of erroneous human preferences.


Poster
#W-713
Diversifying Robot Locomotion Behaviors with Extrinsic Behavioral Curiosity

Zhenglin Wan · Xingrui Yu · David Bossens · Yueming LYU · Qing Guo · Flint Xiaofeng Fan · Yew Soon ONG · Ivor Tsang

Imitation learning (IL) has shown promise in robot locomotion but is often limited to learning a single expert policy, constraining behavior diversity and robustness in unpredictable real-world scenarios. To address this, we introduce Quality Diversity Inverse Reinforcement Learning (QD-IRL), a novel framework that integrates quality-diversity optimization with IRL methods, enabling agents to learn diverse behaviors from limited demonstrations. This work introduces Extrinsic Behavioral Curiosity (EBC), which allows agents to receive additional curiosity rewards from an external critic based on how novel the behaviors are with respect to a large behavioral archive. To validate the effectiveness of EBC in exploring diverse locomotion behaviors, we evaluate our method on multiple robot locomotion tasks. EBC improves the performance of QD-IRL instances with GAIL, VAIL, and DiffAIL across all included environments by up to 185\%, 42\%, and 150\%, even surpassing expert performance by 20\% in Humanoid. Furthermore, we demonstrate that EBC is applicable to Gradient-Arborescence-based Quality Diversity Reinforcement Learning (QD-RL) algorithms, where it substantially improves performance and provides a generic technique for diverse robot locomotion. The source code of this work is provided at https://github.com/vanzll/EBC.


Poster
#W-714
Agent-Centric Actor-Critic for Asynchronous Multi-Agent Reinforcement Learning

Whiyoung Jung · Sunghoon Hong · Deunsol Yoon · Kanghoon Lee · Woohyung Lim

Multi-Agent Reinforcement Learning (MARL) struggles with coordination in sparse reward environments. Macro-actions —sequences of actions executed as single decisions— facilitate long-term planning but introduce asynchrony, complicating Centralized Training with Decentralized Execution (CTDE). Existing CTDE methods use padding to handle asynchrony, risking misaligned asynchronous experiences and spurious correlations. We propose the Agent-Centric Actor-Critic (ACAC) algorithm to manage asynchrony without padding. ACAC uses agent-centric encoders for independent trajectory processing, with an attention-based aggregation module integrating these histories into a centralized critic for improved temporal abstractions. The proposed structure is trained via a PPO-based algorithm with a modified Generalized Advantage Estimation for asynchronous environments. Experiments show ACAC accelerates convergence and enhances performance over baselines in complex MARL tasks.


Spotlight Poster
#W-715
Ad-Hoc Human-AI Coordination Challenge

Tin Dizdarevic · Ravi Hammond · Tobias Gessler · Anisoara Calinescu · Jonathan Cook · Matteo Gallici · Andrei Lupu · Jakob Foerster

Achieving seamless coordination between AI agents and humans is crucial for real-world applications, yet it remains a significant open challenge. Hanabi is a cooperative card game featuring imperfect information, constrained communication, theory of mind requirements, and coordinated action -- making it an ideal testbed for human-AI coordination. However, its use for human-AI interaction has been limited by the challenges of human evaluation. In this work, we introduce the Ad-Hoc Human-AI Coordination Challenge (AH2AC2) to overcome the constraints of costly and difficult-to-reproduce human evaluations. We develop \textit{human proxy agents}, trained on a large-scale human dataset, that serve as robust, cheap, and reproducible human-like evaluation partners in AH2AC2. To encourage the development of data-efficient methods, we open-source a dataset of 3,079 games, deliberately limiting the amount of available human gameplay data. We present baseline results for both two- and three-player Hanabi scenarios. To ensure fair evaluation, we host the proxy agents through a controlled evaluation system rather than releasing them publicly. The code is available at \href{https://github.com/FLAIROx/ah2ac2}{https://github.com/FLAIROx/ah2ac2}.


Poster
#W-716
Revisiting Cooperative Off-Policy Multi-Agent Reinforcement Learning

yueheng li · Guangming Xie · Zongqing Lu

Cooperative Multi-Agent Reinforcement Learning (MARL) has become a critical tool for addressing complex real-world problems. However, off-policy MARL methods, which rely on joint Q-functions, face significant scalability challenges due to the exponentially growing joint action space. In this work, we highlight a critical yet often overlooked issue: erroneous Q-target estimation, primarily caused by extrapolation error. Our analysis reveals that this error becomes increasingly severe as the number of agents grows, leading to unique challenges in MARL due to its expansive joint action space and the decentralized execution paradigm. To address these challenges, we propose a suite of techniques tailored for off-policy MARL, including annealed multi-step bootstrapping, averaged Q-targets, and restricted action representation. Experimental results demonstrate that these methods effectively mitigate erroneous estimations, yielding substantial performance improvements in challenging benchmarks such as SMAC, SMACv2, and Google Research Football.
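As an illustration of two of the listed ingredients, the sketch below computes an n-step bootstrapped Q-target whose horizon is annealed toward a single step over training, using a bootstrap value averaged over an ensemble of target estimates. The schedule and the averaging set are assumptions for illustration, not the paper's exact design.

import numpy as np

def annealed_horizon(step, total_steps, n_max=5):
    # Anneal the multi-step horizon from n_max toward 1 as training progresses.
    return max(1, int(round(n_max * (1.0 - step / total_steps))))

def n_step_target(rewards, ensemble_q, gamma, n):
    # n-step return plus a bootstrap value averaged over several target estimates
    # (one simple form of an "averaged Q-target").
    q_avg = float(np.mean(ensemble_q))
    return sum(gamma**i * rewards[i] for i in range(n)) + gamma**n * q_avg

rewards = [0.0, 0.0, 1.0, 0.0, 0.0]   # rewards r_t, ..., r_{t+4}
ensemble_q = [2.1, 1.9, 2.0]          # target-network estimates at the bootstrap state
for step in (0, 50_000, 100_000):
    n = annealed_horizon(step, total_steps=100_000)
    print(step, n, n_step_target(rewards, ensemble_q, gamma=0.99, n=n))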


Poster
#W-717
R3DM: Enabling Role Discovery and Diversity Through Dynamics Models in Multi-agent Reinforcement Learning

Harsh Goel · Mohammad Omama · Behdad Chalaki · Vaishnav Tadiparthi · Ehsan Moradi Pari · Sandeep Chinchali

Multi-agent reinforcement learning (MARL) has achieved significant progress in large-scale traffic control, autonomous vehicles, and robotics. Drawing inspiration from biological systems where roles naturally emerge to enable coordination, role-based MARL methods have been proposed to enhance cooperation learning for complex tasks. However, existing methods exclusively derive roles from an agent's past experience during training, neglecting their influence on its future trajectories. This paper introduces a key insight: an agent’s role should shape its future behavior to enable effective coordination. Hence, we propose Role Discovery and Diversity through Dynamics Models (R3DM), a novel role-based MARL framework that learns emergent roles by maximizing the mutual information between agents' roles, observed trajectories, and expected future behaviors. R3DM optimizes the proposed objective through contrastive learning on past trajectories to first derive intermediate roles that shape intrinsic rewards to promote diversity in future behaviors across different roles through a learned dynamics model. Benchmarking on SMAC and SMACv2 environments demonstrates that R3DM outperforms state-of-the-art MARL approaches, improving multi-agent coordination to increase win rates by up to 20%. The code is available at https://github.com/UTAustin-SwarmLab/R3DM.


Poster
#W-718
AssistanceZero: Scalably Solving Assistance Games

Cassidy Laidlaw · Eli Bronstein · Timothy Guo · Dylan Feng · Lukas Berglund · Justin Svegliato · Stuart Russell · Anca Dragan

Assistance games are a promising alternative to reinforcement learning from human feedback (RLHF) for training AI assistants. Assistance games resolve key drawbacks of RLHF, such as incentives for deceptive behavior, by explicitly modeling the interaction between assistant and user as a two-player game where the assistant cannot observe their shared goal. Despite their potential, assistance games have only been explored in simple settings. Scaling them to more complex environments is difficult because it requires both solving intractable decision-making problems under uncertainty and accurately modeling human users' behavior. We present the first scalable approach to solving assistance games and apply it to a new, challenging Minecraft-based assistance game with over $10^{400}$ possible goals. Our approach, AssistanceZero, extends AlphaZero with a neural network that predicts human actions and rewards, enabling it to plan under uncertainty. We show that AssistanceZero outperforms model-free RL algorithms and imitation learning in the Minecraft-based assistance game. In a human study, our AssistanceZero-trained assistant significantly reduces the number of actions participants take to complete building tasks in Minecraft. Our results suggest that assistance games are a tractable framework for training effective AI assistants in complex environments. Code and videos are available at https://anonymous.4open.science/w/scalably-solving-assistance-games/.


Poster
#W-719
Goal-Space Planning with Subgoal Models

Chunlok Lo · Kevin Roice · Parham Mohammad Panahi · Scott Jordan · Adam White · Gabor Mihucz · Farzane Aminmansour · Martha White

This paper investigates a new approach to model-based reinforcement learning using background planning: mixing (approximate) dynamic programming updates and model-free updates, similar to the Dyna architecture. Background planning with learned models is often worse than model-free alternatives, such as Double DQN, even though the former uses significantly more memory and computation. The fundamental problem is that learned models can be inaccurate and often generate invalid states, especially when iterated many steps. In this paper, we avoid this limitation by constraining background planning to a given set of (abstract) subgoals and learning only local, subgoal-conditioned models. This goal-space planning (GSP) approach is more computationally efficient, naturally incorporates temporal abstraction for faster long-horizon planning, and avoids learning the transition dynamics entirely. We show that our GSP algorithm can propagate value from an abstract space in a manner that helps a variety of base learners learn significantly faster in different domains.


Poster
#W-720
Faster Approximation Algorithms for k-Center via Data Reduction

Arnold Filtser · Shaofeng Jiang · Yi Li · Anurag Murty Naredla · Ioannis Psarros · Qiaoyuan Yang · Qin Zhang

We study efficient algorithms for the Euclidean $k$-Center problem, focusing on the regime of large $k$. We take the approach of data reduction by considering $\alpha$-coreset, which is a small subset $S$ of the dataset $P$ such that any $\beta$-approximation on $S$ is an $(\alpha + \beta)$-approximation on $P$. We give efficient algorithms to construct coresets whose size is $k \cdot o(n)$, which immediately speeds up existing approximation algorithms. Notably, we obtain a near-linear time $O(1)$-approximation when $k = n^c$ for any $0 < c < 1$. We validate the performance of our coresets on real-world datasets with large $k$, and we observe that the coreset speeds up the well-known Gonzalez algorithm by up to $4$ times, while still achieving similar clustering cost. Technically, one of our coreset results is based on a new efficient construction of consistent hashing with competitive parameters. This general tool may be of independent interest for algorithm design in high dimensional Euclidean spaces.
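For reference, the Gonzalez (farthest-first traversal) algorithm mentioned above is simple enough to sketch; running it on a small coreset instead of the full point set is exactly the kind of speed-up being measured, though the coreset construction itself is the paper's contribution and is not reproduced here.

import numpy as np

def gonzalez_k_center(points, k, seed=0):
    # Farthest-first traversal: the classical 2-approximation for k-center.
    rng = np.random.default_rng(seed)
    first = points[rng.integers(len(points))]
    centers = [first]
    dist = np.linalg.norm(points - first, axis=1)   # distance to nearest chosen center
    for _ in range(k - 1):
        idx = int(np.argmax(dist))                  # farthest point becomes the next center
        centers.append(points[idx])
        dist = np.minimum(dist, np.linalg.norm(points - points[idx], axis=1))
    return np.array(centers), float(dist.max())     # centers and max covering radius

pts = np.random.default_rng(1).normal(size=(5_000, 16))
centers, radius = gonzalez_k_center(pts, k=50)
print(centers.shape, radius)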


Poster
#W-721
Private Lossless Multiple Release

Joel Daniel Andersson · Lukas Retschmeier · Boel Nelson · Rasmus Pagh

Koufogiannis et al. (2016) showed a $\textit{gradual release}$ result for Laplace noise-based differentially private mechanisms: given an $\varepsilon$-DP release, a new release with privacy parameter $\varepsilon' > \varepsilon$ can be computed such that the combined privacy loss of both releases is at most $\varepsilon'$ and the distribution of the latter is the same as a single release with parameter $\varepsilon'$. They also showed gradual release techniques for Gaussian noise, later also explored by Whitehouse et al. (2022). In this paper, we consider a more general $\textit{multiple release}$ setting in which analysts hold private releases with different privacy parameters corresponding to different access/trust levels. These releases are determined one by one, with privacy parameters in arbitrary order. A multiple release is $\textit{lossless}$ if having access to a subset $S$ of the releases has the same privacy guarantee as the least private release in $S$, and each release has the same distribution as a single release with the same privacy parameter. Our main result is that lossless multiple release is possible for a large class of additive noise mechanisms. For the Gaussian mechanism we give a simple method for lossless multiple release with a short, self-contained analysis that does not require knowledge of the mathematics of Brownian motion. We also present lossless multiple release for the Laplace and Poisson mechanisms. Finally, we consider how to efficiently do gradual release of sparse histograms, and present a mechanism with running time independent of the number of dimensions.


Poster
#W-800
Batch List-Decodable Linear Regression via Higher Moments

Ilias Diakonikolas · Daniel Kane · Sushrut Karmalkar · Sihan Liu · Thanasis Pittas

We study the task of list-decodable linear regression using batches, recently introduced by Das et al. (2023). In this setting, we are given $m$ batches with each batch containing $n$ points in $\mathbb R^d$. A batch is called clean if the points it contains are i.i.d. samples from an unknown linear regression distribution. For a parameter $\alpha \in (0, 1/2)$, an unknown $\alpha$-fraction of the batches are clean and no assumptions are made on the remaining batches. The goal is to output a small list of vectors at least one of which is close to the true regressor vector in $\ell_2$-norm. Das et al. (2023) gave an efficient algorithm for this task, under natural distributional assumptions, with the following guarantee. Under the assumption that the batch size satisfies $n \geq \tilde{\Omega}(\alpha^{-1})$ and the total number of batches is $m = \text{poly}(d, n, 1/\alpha)$, their algorithm runs in polynomial time and outputs a list of $O(1/\alpha^2)$ vectors at least one of which is $\tilde{O}(\alpha^{-1/2}/\sqrt{n})$ close to the target regressor. Here we design a new polynomial-time algorithm for this task with significantly stronger guarantees under the assumption that the low-degree moments of the covariates distribution are Sum-of-Squares (SoS) certifiably bounded. Specifically, for any constant $\delta>0$, as long as the batch size is $n \geq \Omega_{\delta}(\alpha^{-\delta})$ and the degree-$\Theta(1/\delta)$ moments of the covariates are SoS certifiably bounded, our algorithm uses $m = \text{poly}((dn)^{1/\delta}, 1/\alpha)$ batches, runs in polynomial time, and outputs an $O(1/\alpha)$-sized list of vectors one of which is $O(\alpha^{-\delta/2}/\sqrt{n})$ close to the target. That is, our algorithm substantially improves both the minimum batch size and the final error guarantee, while achieving the optimal list size. Our approach leverages higher-order moment information by carefully combining the SoS paradigm interleaved with an iterative method and a novel list pruning procedure for this setting. In the process, we give an SoS proof of the Marcinkiewicz-Zygmund inequality that may be of broader applicability.


Poster
#W-801
Model Uncertainty Quantification by Conformal Prediction in Continual Learning

Rui Gao · Weiwei Liu

Continual learning has attracted increasing research attention in recent years due to its promising experimental results in real-world applications. In this paper, we study the issue of calibration in continual learning, i.e., reliably quantifying the uncertainty of model predictions. Conformal prediction (CP) provides a general framework for model calibration, which outputs prediction intervals or sets with a theoretical high coverage guarantee as long as the samples are exchangeable. However, the tasks in continual learning are learned in sequence, which violates the principle that data should be exchangeable. Meanwhile, the model learns the current task with limited or no access to data from previous tasks, which is not conducive to constructing the calibration set. To address these issues, we propose a CP-based method for model uncertainty quantification in continual learning (CPCL), which also reveals the connection between prediction interval length and forgetting. We analyze the oracle prediction interval in continual learning and theoretically prove the asymptotic coverage guarantee of CPCL. Finally, extensive experiments on simulated and real data empirically verify the validity of our proposed method.


Poster
#W-802
Understanding the Forgetting of (Replay-based) Continual Learning via Feature Learning: Angle Matters

Hongyi Wan · Shiyuan Ren · Wei Huang · Miao Zhang · Xiang Deng · Yixin Bao · Liqiang Nie

Continual learning (CL) is crucial for advancing human-level intelligence, but its theoretical understanding, especially regarding factors influencing forgetting, is still relatively limited. This work aims to build a unified theoretical framework for understanding CL using feature learning theory. Different from most existing studies that analyze forgetting under linear regression model or lazy training, we focus on a more practical two-layer convolutional neural network (CNN) with polynomial ReLU activation for sequential tasks within a signal-noise data model. Specifically, we theoretically reveal how the angle between task signal vectors influences forgetting that: acute or small obtuse angles lead to benign forgetting, whereas larger obtuse angles result in harmful forgetting. Furthermore, we demonstrate that the replay method alleviates forgetting by expanding the range of angles corresponding to benign forgetting. Our theoretical results suggest that mid-angle sampling, which selects examples with moderate angles to the prototype, can enhance the replay method's ability to mitigate forgetting. Experiments on synthetic and real-world datasets confirm our theoretical results and highlight the effectiveness of our mid-angle sampling strategy.


Poster
#W-803
Uniform Mean Estimation for Heavy-Tailed Distributions via Median-of-Means

Mikael Møller Høgsgaard · Andrea Paudice

The Median of Means (MoM) is a mean estimator that has gained popularity in the context of heavy-tailed data. In this work, we analyze its performance in the task of simultaneously estimating the mean of each function in a class $\mathcal{F}$ when the data distribution possesses only the first $p$ moments for $p \in (1,2]$. We prove a new sample complexity bound using a novel symmetrization technique that may be of independent interest. Additionally, we present applications of our result to $k$-means clustering with unbounded inputs and linear regression with general losses, improving upon existing works.
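A minimal sketch of the Median-of-Means estimator for a single function's mean appears below; the block count is a tuning choice and the heavy-tailed toy data are for illustration only.

import numpy as np

def median_of_means(x, num_blocks, seed=0):
    # Split the (shuffled) sample into blocks, average within each block, and
    # return the median of the block means; robust when only low-order moments exist.
    x = np.random.default_rng(seed).permutation(np.asarray(x))
    blocks = np.array_split(x, num_blocks)
    return float(np.median([b.mean() for b in blocks]))

sample = np.random.default_rng(2).standard_t(df=1.5, size=10_000)   # finite mean, infinite variance
print(median_of_means(sample, num_blocks=30), float(sample.mean()))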


Poster
#W-804
On the Statistical Mechanisms of Distributional Compositional Generalization

Jingwen Fu · Nanning Zheng

Distributional Compositional Generalization (DCG) refers to the ability to tackle tasks from new distributions by leveraging the knowledge of concepts learned from supporting distributions. In this work, we aim to explore the statistical mechanisms of DCG, which have been largely overlooked in previous studies. By statistically formulating the problem, this paper seeks to address two key research questions: 1) Can a method for one DCG problem be applicable to another? 2) What statistical properties can indicate a learning algorithm's capacity for knowledge composition in DCG tasks? \textbf{To address the first question}, an invariant measure is proposed to provide a dimension where all different methods converge. This measure underscores the critical role of data in enabling improvements without trade-offs. \textbf{As for the second question}, we reveal that by decoupling the impacts of insufficient data and knowledge composition, the ability of the learning algorithm to compose knowledge relies on the compatibility and sensitivity between the learning algorithm and the composition rule. In summary, the statistical analysis of the generalization mechanisms provided in this paper deepens our understanding of compositional generalization, offering complementary evidence on the importance of data in DCG tasks.


Poster
#W-805
Securing Equal Share: A Principled Approach for Learning Multiplayer Symmetric Games

Jiawei Ge · Yuanhao Wang · Wenzhe Li · Chi Jin

This paper examines multiplayer symmetric constant-sum games with more than two players in a competitive setting, such as Mahjong, Poker, and various board and video games. In contrast to two-player zero-sum games, equilibria in multiplayer games are neither unique nor non-exploitable, failing to provide meaningful guarantees when competing against opponents who play different equilibria or non-equilibrium strategies. This gives rise to a series of long-lasting fundamental questions in multiplayer games regarding suitable objectives, solution concepts, and principled algorithms. This paper takes an initial step towards addressing these challenges by focusing on the natural objective of *equal share*—securing an expected payoff of $C/n$ in an $n$-player symmetric game with a total payoff of $C$. We rigorously identify the theoretical conditions under which achieving an equal share is tractable and design a series of efficient algorithms, inspired by no-regret learning, that *provably* attain approximate equal share across various settings. Furthermore, we provide complementary lower bounds that justify the sharpness of our theoretical results. Our experimental results highlight worst-case scenarios where meta-algorithms from prior state-of-the-art systems for multiplayer games fail to secure an equal share, while our algorithm succeeds, demonstrating the effectiveness of our approach.


Poster
#W-806
Self-Play $Q$-Learners Can Provably Collude in the Iterated Prisoner's Dilemma

Quentin Bertrand · Juan Duque · Emilio Calvano · Gauthier Gidel

A growing body of computational studies shows that simple machine learning agents converge to cooperative behaviors in social dilemmas, such as collusive price-setting in oligopoly markets, raising questions about what drives this outcome. In this work, we provide theoretical foundations for this phenomenon in the context of self-play multi-agent Q-learners in the iterated prisoner’s dilemma. We characterize broad conditions under which such agents provably learn the cooperative Pavlov (win-stay, lose-shift) policy rather than the Pareto-dominated “always defect” policy. We validate our theoretical results through additional experiments, demonstrating their robustness across a broader class of deep learning algorithms.
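The kind of dynamics studied here can be simulated directly. The sketch below runs two tabular epsilon-greedy Q-learners in self-play on the iterated prisoner's dilemma with memory-one states; the payoffs, exploration, and learning rates are illustrative choices rather than the paper's exact conditions, so whether the Pavlov policy emerges in this toy run depends on those choices.

import numpy as np

C, D = 0, 1
PAYOFF = {(C, C): (3, 3), (C, D): (0, 4), (D, C): (4, 0), (D, D): (1, 1)}

def eps_greedy(q_row, eps, rng):
    return int(rng.integers(2)) if rng.random() < eps else int(np.argmax(q_row))

rng = np.random.default_rng(0)
n_states = 5                                  # 4 previous joint actions + 1 initial state
Q = [np.zeros((n_states, 2)), np.zeros((n_states, 2))]
alpha, gamma, eps = 0.1, 0.95, 0.1
state = 4

for _ in range(200_000):
    a0 = eps_greedy(Q[0][state], eps, rng)
    a1 = eps_greedy(Q[1][state], eps, rng)
    r0, r1 = PAYOFF[(a0, a1)]
    nxt = 2 * a0 + a1                         # next state encodes the last joint action
    Q[0][state, a0] += alpha * (r0 + gamma * Q[0][nxt].max() - Q[0][state, a0])
    Q[1][state, a1] += alpha * (r1 + gamma * Q[1][nxt].max() - Q[1][state, a1])
    state = nxt

# Greedy policy of player 0 in states (CC, CD, DC, DD); Pavlov corresponds to [0, 1, 1, 0].
print([int(np.argmax(Q[0][s])) for s in range(4)])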


Poster
#W-807
Fraud-Proof Revenue Division on Subscription Platforms

Abheek Ghosh · Tzeh Yuan Neoh · Nicholas Teh · Giannis Tyrovolas

We study a model of subscription-based platforms where users pay a fixed fee for unlimited access to content, and creators receive a share of the revenue. Existing approaches to detecting fraud predominantly rely on machine learning methods, engaging in an ongoing arms race with bad actors. We explore revenue division mechanisms that inherently disincentivize manipulation. We formalize three types of manipulation-resistance axioms and examine which existing rules satisfy these. We show that a mechanism widely used by streaming platforms not only fails to prevent fraud, but also makes detecting manipulation computationally intractable. We also introduce a novel rule, ScaledUserProp, that satisfies all three manipulation-resistance axioms. Finally, experiments with both real-world and synthetic streaming data support ScaledUserProp as a fairer alternative compared to existing rules.


Poster
#W-808
The impact of uncertainty on regularized learning in games

Pierre-Louis Cauvin · Davide Legacci · Panayotis Mertikopoulos

In this paper, we investigate how randomness and uncertainty influence learning in games. Specifically, we examine a perturbed variant of the dynamics of “follow-the-regularized-leader” (FTRL), where the players’ payoff observations and strategy updates are continually impacted by random shocks. Our findings reveal that, in a fairly precise sense, “uncertainty favors extremes”: in any game, regardless of the noise level, every player’s trajectory of play reaches an arbitrarily small neighborhood of a pure strategy in finite time (which we estimate). Moreover, even if the player does not ultimately settle at this strategy, they return arbitrarily close to some (possibly different) pure strategy infinitely often. This prompts the question of which sets of pure strategies emerge as robust predictions of learning under uncertainty. We show that (a) the only possible limits of the FTRL dynamics under uncertainty are pure Nash equilibria; and (b) a span of pure strategies is stable and attracting if and only if it is closed under better replies. Finally, we turn to games where the deterministic dynamics are recurrent—such as zero-sum games with interior equilibria—and show that randomness disrupts this behavior, causing the stochastic dynamics to drift toward the boundary on average.
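A minimal simulation of the perturbed dynamics described above: entropically regularized FTRL (exponential weights) in matching pennies, a zero-sum game with an interior equilibrium, where each player's score vector is hit by Gaussian shocks. The noise scale and step size are illustrative assumptions rather than the paper's exact model.

import numpy as np

A = np.array([[1.0, -1.0], [-1.0, 1.0]])      # matching pennies payoff for player x

def ftrl_choice(scores, eta=0.05):
    # Choice map of FTRL with entropic regularization: softmax of cumulative scores.
    z = eta * scores
    p = np.exp(z - z.max())
    return p / p.sum()

rng = np.random.default_rng(0)
scores_x, scores_y = np.zeros(2), np.zeros(2)
traj = []
for _ in range(50_000):
    x, y = ftrl_choice(scores_x), ftrl_choice(scores_y)
    scores_x += A @ y + rng.normal(0.0, 1.0, size=2)      # noisy payoff observation
    scores_y += -A.T @ x + rng.normal(0.0, 1.0, size=2)
    traj.append(x[0])

# Fraction of rounds in which player x's strategy is within 0.01 of a pure strategy.
traj = np.array(traj)
print(float(np.mean((traj < 0.01) | (traj > 0.99))))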


Poster
#W-809
Solving Zero-Sum Convex Markov Games

Fivos Kalogiannis · Emmanouil-Vasileios Vlatakis-Gkaragkounis · Ian Gemp · Georgios Piliouras

We contribute the first provable guarantees of global convergence to Nash equilibria (NE) in two-player zero-sum convex Markov games (cMGs) by using independent policy gradient methods. Convex Markov games, recently defined by Gemp et al. (2024), extend Markov decision processes to multi-agent settings with preferences that are convex over occupancy measures, offering a broad framework for modeling generic strategic interactions. However, even the fundamental min-max case of cMGs presents significant challenges, including inherent nonconvexity, the absence of Bellman consistency, and the complexity of the infinite horizon. Our results follow a two-step approach. First, leveraging properties of hidden-convex–hidden-concave functions, we show that a simple nonconvex regularization transforms the min-max optimization problem into a nonconvex–proximal Polyak-Łojasiewicz (NC-pPL) objective. Crucially, this regularization can stabilize the iterates of independent policy gradient methods and ultimately lead them to converge to equilibria. Second, building on this reduction, we address the general constrained min-max problems under NC-pPL and two-sided pPL conditions, providing the first global convergence guarantees for stochastic nested and alternating gradient descent-ascent methods, which we believe may be of independent interest.


Poster
#W-810
Preference-CFR: Beyond Nash Equilibrium for Better Game Strategies

Qi Ju · Thomas Tellier · Meng Sun · Zhemei Fang · YunFeng Luo

Artificial intelligence (AI) has surpassed top human players in a variety of games. In imperfect information games, these achievements have primarily been driven by Counterfactual Regret Minimization (CFR) and its variants for computing Nash equilibrium. However, most existing research has focused on maximizing payoff, while largely neglecting the importance of strategic diversity and the need for varied play styles, thereby limiting AI’s adaptability to different user preferences. To address this gap, we propose Preference-CFR (Pref-CFR), a novel method that incorporates two key parameters: preference degree and vulnerability degree. These parameters enable the AI to adjust its strategic distribution within an acceptable performance loss threshold, thereby enhancing its adaptability to a wider range of strategic demands. In our experiments with Texas Hold’em, Pref-CFR successfully trained Aggressive and Loose Passive styles that not only match original CFR-based strategies in performance but also display clearly distinct behavioral patterns. Notably, for certain hand scenarios, Pref-CFR produces strategies that diverge significantly from both conventional expert heuristics and original CFR outputs, potentially offering novel insights for professional players.
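For context, the sketch below shows regret matching, the per-information-set update at the core of CFR; Pref-CFR's preference and vulnerability parameters would further bias this distribution toward a desired play style, a modification that is not reproduced here.

import numpy as np

def regret_matching(cum_regret):
    # Regret matching: play actions in proportion to their positive cumulative regret.
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    if total <= 0.0:
        return np.full(len(cum_regret), 1.0 / len(cum_regret))
    return pos / total

# Two updates at a single information set with three actions (toy counterfactual values).
cum_regret = np.zeros(3)
for action_values in (np.array([0.2, 0.5, 0.1]), np.array([0.3, 0.1, 0.4])):
    strategy = regret_matching(cum_regret)
    cum_regret += action_values - strategy @ action_values
    print(regret_matching(cum_regret))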


Poster
#W-811
Settling the Maximin Share Fairness for Scheduling among Groups of Machines

Bo Li · Fangxiao WANG · Xing Shiji

We study the fair scheduling of jobs among groups of (unrelated) machines and focus on the maximin share (MMS) fairness at the group level. The problem was first introduced by Li et al. [NeurIPS 2023], where each group consists of a number of identical machines (or identical up to different speeds), and the cost of a group is determined by the minimum makespan on completing all jobs assigned to it. It is left as an open problem when the machines within each group are unrelated. In this paper, we first resolve this problem and design a polynomial-time algorithm that computes a 2-approximate MMS allocation via linear programming techniques. We complement this result with a hard instance, showing that no algorithm can be better than $(2-\frac{1}{n})$-approximate MMS, where $n$ is the number of machines. Thus the approximation ratio 2 is asymptotically tight. When the groups consist of identical machines, we improve the approximation ratio to $\frac{4}{3}$.


Poster
#W-812
Contract Design Under Approximate Best Responses

Francesco Bacchiocchi · Jiarui Gan · Matteo Castiglioni · Alberto Marchesi · Nicola Gatti

Principal-agent problems model scenarios where a principal aims at incentivizing an agent to take costly, unobservable actions through the provision of payments. Such interactions are ubiquitous in several real-world applications, ranging from blockchain to the delegation of machine learning tasks. In this paper, we initiate the study of hidden-action principal-agent problems under approximate best responses, in which the agent may select any action that is not too much suboptimal given the principal's payment scheme (a.k.a. contract). Our main result is a polynomial-time algorithm to compute an optimal contract under approximate best responses. This is perhaps surprising, as computing an optimal commitment under approximate best responses is known to be computationally intractable in Stackelberg games. We also investigate the learnability of contracts under approximate best responses, by providing a no-regret learning algorithm for a natural application scenario where the principal does not know anything about the environment.


Poster
#W-813
COSDA: Counterfactual-based Susceptibility Risk Framework for Open-Set Domain Adaptation

Wenxu Wang · Rui Zhou · Jing Wang · Yun Zhou · Cheng Zhu · Ruichun Tang · Bo Han · Nevin Zhang

Open-Set Domain Adaptation (OSDA) aims to transfer knowledge from the labeled source domain to the unlabeled target domain that contains unknown categories, thus facing the challenges of domain shift and unknown category recognition. While recent works have demonstrated the potential of causality for domain alignment, little exploration has been conducted on causal-inspired theoretical frameworks for OSDA. To fill this gap, we introduce the concept of Susceptibility and propose a novel Counterfactual-based susceptibility risk framework for OSDA, termed COSDA. Specifically, COSDA consists of three novel components: (i) a Susceptibility Risk Estimator (SRE) for capturing causal information, along with comprehensive derivations of the computable theoretical upper bound, forming a risk minimization framework under the OSDA paradigm; (ii) a Contrastive Feature Alignment (CFA) module, which is theoretically proven based on mutual information to satisfy the Exogeneity assumption and facilitate cross-domain feature alignment; (iii) a Virtual Multi-unknown-categories Prototype (VMP) pseudo-labeling strategy, providing label information by measuring how similar samples are to known and multiple virtual unknown category prototypes, thereby assisting in open-set recognition and intra-class discriminative feature learning. Extensive experiments demonstrate that our approach achieves state-of-the-art performance.


Poster
#W-814
The Role of Sparsity for Length Generalization in LLMs

Noah Golowich · Samy Jelassi · David Brandfonbrener · Sham Kakade · Eran Malach

Training large language models to predict beyond their training context lengths has drawn much attention in recent years, yet the principles driving such length generalization remain underexplored. We propose a new theoretical framework to study length generalization for the next-token prediction task, as performed by decoder-only transformers. Conceptually, we show that length generalization occurs as long as each predicted token depends on a small (fixed) number of previous tokens. We formalize such tasks via a notion we call k-sparse planted correlation distributions, and show that an idealized model of transformers, which generalizes attention heads, successfully length-generalizes on such tasks. As a bonus, our theoretical model allows us to provide justifications for techniques to modify positional embeddings which have been introduced to improve length generalization, such as position coupling. We support our theoretical results with experiments on synthetic tasks and natural language, which confirm that a key factor driving length generalization is indeed a ``sparse'' dependency structure of each token on the previous ones. Further, inspired by our theory, we introduce Predictive Position Coupling, a generalization of position coupling which trains the transformer to predict the position IDs used in a positional coupling approach. Predictive Position Coupling thereby allows us to broaden the array of tasks to which Position Coupling can successfully be applied to achieve length generalization.


Poster
#W-815
Optimal Transfer Learning for Missing Not-at-Random Matrix Completion

Akhil Jalan · Yassir Jedra · Arya Mazumdar · Soumendu Sundar Mukherjee · Purnamrita Sarkar

We study transfer learning for matrix completion in a Missing Not-at-Random (MNAR) setting that is motivated by biological problems. The target matrix $Q$ has entire rows and columns missing, making estimation impossible without side information. To address this, we use a noisy and incomplete source matrix $P$, which relates to $Q$ via a feature shift in latent space. We consider both the *active* and *passive* sampling of rows and columns. We establish minimax lower bounds for entrywise estimation error in each setting. Our computationally efficient estimation framework achieves this lower bound for the active setting, which leverages the source data to query the most informative rows and columns of $Q$. This avoids the need for *incoherence* assumptions required for rate optimality in the passive sampling setting. We demonstrate the effectiveness of our approach through comparisons with existing algorithms on real-world biological datasets.


Poster
#W-816
When Can Proxies Improve the Sample Complexity of Preference Learning?

Yuchen Zhu · Daniel Augusto de Souza · Zhengyan Shi · Mengyue Yang · Pasquale Minervini · Matt Kusner · Alexander D'Amour

We address the problem of reward hacking, where maximising a proxy reward does not necessarily increase the true reward. This is a key concern for Large Language Models (LLMs), as they are often fine-tuned on human preferences that may not accurately reflect a true objective. Existing work uses various tricks such as regularisation, tweaks to the reward model, and reward hacking detectors, to limit the influence that such proxy preferences have on a model. Luckily, in many contexts such as medicine, education, and law, a small amount of expert data is often available. In these cases, it is often unclear whether the addition of proxy data can improve policy learning. We outline a set of sufficient conditions on proxy feedback that, if satisfied, indicate that proxy data can provably improve the sample complexity of learning the ground truth policy. These conditions can inform the data collection process for specific tasks. The result implies a parameterisation for LLMs that achieves this improved sample complexity. We detail how one can adapt existing architectures to yield this improved sample complexity.


Poster
#W-817
Representations Shape Weak-to-Strong Generalization: Theoretical Insights and Empirical Predictions

Yihao Xue · Jiping Li · Baharan Mirzasoleiman

Weak-to-Strong Generalization (W2SG), where a weak model supervises a stronger one, serves as an important analogy for understanding how humans might guide superhuman intelligence in the future. Promising empirical results revealed that a strong model can surpass its weak supervisor. While recent work has offered theoretical insights into this phenomenon, a clear understanding of the interactions between weak and strong models that drive W2SG remains elusive. We investigate W2SG through a theoretical lens and show that it can be characterized using kernels derived from the principal components of weak and strong models' internal representations. These kernels can be used to define a space that, at a high level, captures what the weak model is unable to learn but is learnable by the strong model. The projection of labels onto this space quantifies how much the strong model falls short of its full potential due to weak supervision. This characterization also provides insights into how certain errors in weak supervision can be corrected by the strong model, regardless of overfitting. Our theory has significant practical implications, providing a representation-based metric that predicts W2SG performance trends without requiring labels, as shown in experiments on molecular predictions with transformers and 5 NLP tasks involving 52 LLMs.


Poster
#W-818
Time to Spike? Understanding the Representational Power of Spiking Neural Networks in Discrete Time

Duc Anh Nguyen · Ernesto Araya · Adalbert Fono · Gitta Kutyniok

Recent years have seen significant progress in developing spiking neural networks (SNNs) as a potential solution to the energy challenges posed by conventional artificial neural networks (ANNs). However, our theoretical understanding of SNNs remains relatively limited compared to the ever-growing body of literature on ANNs. In this paper, we study a discrete-time model of SNNs based on leaky integrate-and-fire (LIF) neurons, referred to as discrete-time LIF-SNNs, a widely used framework that still lacks solid theoretical foundations. We demonstrate that discrete-time LIF-SNNs realize piecewise constant functions defined on polyhedral regions, and more importantly, we quantify the network size required to approximate continuous functions. Moreover, we investigate the impact of latency (number of time steps) and depth (number of layers) on the complexity of the input space partitioning induced by discrete-time LIF-SNNs. Our analysis highlights the importance of latency and contrasts these networks with ANNs that use piecewise linear activation functions. Finally, we present numerical experiments to support our theoretical findings.


Poster
#W-819
Beyond Bradley-Terry Models: A General Preference Model for Language Model Alignment

Yifan Zhang · Ge Zhang · Yue Wu · Kangping Xu · Quanquan Gu

Modeling human preferences is crucial for aligning foundation models with human values. Traditional reward modeling methods, such as the Bradley-Terry (BT) reward model, fall short in expressiveness, particularly in addressing intransitive preferences. In this paper, we introduce \emph{preference embedding}, an approach that embeds responses into a latent space to capture intricate preference structures efficiently, achieving linear query complexity. Additionally, we propose preference score-based General Preference Optimization (GPO), which generalizes reward-based reinforcement learning from human feedback (RLHF). Experimental results show that our General Preference embedding Model (GPM) consistently outperforms the BT reward model on the RewardBench benchmark and effectively models cyclic preferences where any BT reward model behaves like a random guess. Furthermore, evaluations on downstream tasks such as AlpacaEval2.0, following the language model post-training with GPO and our general preference model, reveal performance improvements over BT models. These findings indicate that our method may enhance the alignment of foundation models with nuanced human values. The code is available at https://github.com/general-preference/general-preference-model.
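The expressiveness gap can be seen with a small example: under a Bradley-Terry model the preference probability depends only on a scalar reward gap, so preferences are always transitive, whereas a score based on low-dimensional response embeddings with a skew-symmetric form can encode a cycle. The parameterization below is a generic illustration of this point, not necessarily the paper's exact model.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Bradley-Terry: P(y1 beats y2) = sigmoid(r(y1) - r(y2)). If A beats B and B beats C
# with probability > 1/2, then r(A) > r(B) > r(C), so A must also beat C: no cycles.
rewards = {"A": 1.0, "B": 0.5, "C": 0.0}
print(sigmoid(rewards["A"] - rewards["C"]))

# Embedding-based preference score s(y1, y2) = v(y1)^T R v(y2) with skew-symmetric R.
R = np.array([[0.0, 1.0], [-1.0, 0.0]])                  # R^T = -R
angles = {"A": 0.0, "B": 2 * np.pi / 3, "C": 4 * np.pi / 3}
v = {k: np.array([np.cos(a), np.sin(a)]) for k, a in angles.items()}

def pref_prob(y1, y2):
    return sigmoid(v[y1] @ R @ v[y2])

for y1, y2 in (("A", "B"), ("B", "C"), ("C", "A")):
    print(y1, "beats", y2, "with prob", round(pref_prob(y1, y2), 3))   # all > 0.5: a cycle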


Poster
#W-820
Provable In-Context Vector Arithmetic via Retrieving Task Concepts

Dake Bu · Wei Huang · Andi Han · Atsushi Nitanda · Qingfu Zhang · Hau-San Wong · Taiji Suzuki

In-context learning (ICL) has garnered significant attention for its ability to grasp functions/tasks from demonstrations. Recent studies suggest the presence of a latent task/function vector in LLMs during ICL. Merullo et al. (2024) showed that LLMs leverage this vector alongside the residual stream for Word2Vec-like vector arithmetic, solving factual-recall ICL tasks. Additionally, recent work empirically highlighted the key role of Question-Answer data in enhancing factual-recall capabilities. Despite these insights, a theoretical explanation remains elusive. To move one step forward, we propose a theoretical framework building on empirically grounded hierarchical concept modeling. We develop an optimization theory, showing how nonlinear residual transformers trained via gradient descent on cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. We prove 0-1 loss convergence and show the strong generalization, including robustness to concept recombination and distribution shifts. These results elucidate the advantages of transformers over static embedding predecessors. Empirical simulations corroborate our theoretical insights.


Spotlight Poster
#W-821
Algorithm Development in Neural Networks: Insights from the Streaming Parity Task

Loek van Rossem · Andrew Saxe

Even when massively overparameterized, deep neural networks show a remarkable ability to generalize. Research on this phenomenon has focused on generalization within distribution, via smooth interpolation. Yet in some settings neural networks also learn to extrapolate to data far beyond the bounds of the original training set, sometimes even allowing for infinite generalization, implying that an algorithm capable of solving the task has been learned. Here we undertake a case study of the learning dynamics of recurrent neural networks trained on the streaming parity task in order to develop an effective theory of algorithm development. The streaming parity task is a simple but nonlinear task defined on sequences up to arbitrary length. We show that, with sufficient finite training experience, RNNs exhibit a phase transition to perfect infinite generalization. Using an effective theory for the representational dynamics, we find an implicit representational merger effect which can be interpreted as the construction of a finite automaton that reproduces the task. Overall, our results disclose one mechanism by which neural networks can generalize infinitely from finite training experience.


Spotlight Poster
#W-900
The Role of Randomness in Stability

Max Hopkins · Shay Moran

Stability is a central property in learning and statistics promising the output of an algorithm $\mathcal{A}$ does not change substantially when applied to similar datasets $S$ and $S'$. It is an elementary fact that any sufficiently stable algorithm (e.g.\ one returning the same result with high probability, satisfying privacy guarantees, etc.) must be randomized. This raises a natural question: can we quantify \textit{how much} randomness is needed for algorithmic stability? We study the randomness complexity of two influential notions of stability in learning: \textit{replicability} (which promises $\mathcal{A}$ usually outputs the same result when run over samples from the same distribution), and \textit{differential privacy} (which promises the output distribution of $\mathcal{A}$ remains similar under neighboring datasets). In particular, building on the ideas of (Dixon, Pavan, Vander Woude, and Vinodchandran ICML 2024) and (Cannone, Su, and Vadhan ITCS 2024), we prove a "weak-to-strong" boosting theorem for stability in these settings: the randomness complexity of a task $\mathcal{M}$ is tightly controlled by the best replication probability of any \textit{deterministic} algorithm solving $\mathcal{M}$, a parameter known as $\mathcal{M}$'s "global stability" (Chase, Moran, Yehudayoff FOCS 2023). Finally, we use this connection to characterize the randomness complexity of PAC Learning: a class has bounded randomness complexity iff it has finite Littlestone dimension, and moreover scales at worst logarithmically in the excess error of the learner. As a corollary, we resolve a question of (Chase, Chornomaz, Moran, and Yehudayoff STOC 2024) about the error-dependent list-replicability of agnostic learning.


Poster
#W-901
Approximation to Smooth Functions by Low-Rank Swish Networks

Zimeng Li · Hongjun LI · Jingyuan Wang · Ke Tang

While deep learning has witnessed remarkable achievements in a wide range of applications, its substantial computational cost imposes limitations on application scenarios of neural networks. To alleviate this problem, low-rank compression is proposed as a class of efficient and hardware-friendly network compression methods, which reduce computation by replacing large matrices in neural networks with products of two small ones. In this paper, we implement low-rank networks by inserting a sufficiently narrow linear layer without bias between each of two adjacent nonlinear layers. We prove that low-rank Swish networks with a fixed depth are capable of approximating any function from the Hölder ball $\mathcal{C}^{\beta, R}([0,1]^d)$ within an arbitrarily small error where $\beta$ is the smooth parameter and $R$ is the radius. Our proposed constructive approximation ensures that the width of linear hidden layers required for approximation is no more than one-third of the width of nonlinear layers, which implies that the computational cost can be decreased by at least one-third compared with a network with the same depth and width of nonlinear layers but without narrow linear hidden layers. Our theoretical finding can offer a theoretical basis for low-rank compression from the perspective of universal approximation theory.
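A minimal PyTorch sketch of the construction described above: a Swish (SiLU) network in which each pair of adjacent nonlinear layers is connected through a narrow, bias-free linear layer, factorizing the full weight matrix into two small ones. The specific widths and rank are illustrative, and the paper's constructive approximation is not reproduced here.

import torch
import torch.nn as nn

def lowrank_swish_net(in_dim, width, rank, depth, out_dim):
    # Between every two adjacent nonlinear (SiLU) layers, insert a narrow linear
    # layer without bias of width `rank`; the paper's bound suggests
    # rank <= width / 3 suffices for the stated approximation guarantee.
    layers = [nn.Linear(in_dim, width), nn.SiLU()]
    for _ in range(depth - 1):
        layers += [
            nn.Linear(width, rank, bias=False),   # inserted narrow linear layer
            nn.Linear(rank, width),
            nn.SiLU(),
        ]
    layers.append(nn.Linear(width, out_dim))
    return nn.Sequential(*layers)

net = lowrank_swish_net(in_dim=8, width=96, rank=32, depth=4, out_dim=1)
print(net(torch.randn(5, 8)).shape)               # torch.Size([5, 1])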


Poster
#W-902
On the Learnability of Distribution Classes with Adaptive Adversaries

Tosca Lechner · Alex Bie · Gautam Kamath

We consider the question of learnability of distribution classes in the presence of adaptive adversaries -- that is, adversaries capable of intercepting the samples requested by a learner and applying manipulations with full knowledge of the samples before passing them on to the learner. This stands in contrast to oblivious adversaries, who can only modify the underlying distribution the samples come from but not their i.i.d.\ nature. We formulate a general notion of learnability with respect to adaptive adversaries, taking into account the budget of the adversary. We show that learnability with respect to additive adaptive adversaries is a strictly stronger condition than learnability with respect to additive oblivious adversaries.


Poster
#W-903
On the Training Convergence of Transformers for In-Context Classification of Gaussian Mixtures

Wei Shen · Ruida Zhou · Jing Yang · Cong Shen

Although transformers have demonstrated impressive capabilities for in-context learning (ICL) in practice, theoretical understanding of the underlying mechanism that allows transformers to perform ICL is still in its infancy. This work aims to theoretically study the training dynamics of transformers for in-context classification tasks. We demonstrate that, for in-context classification of Gaussian mixtures under certain assumptions, a single-layer transformer trained via gradient descent converges to a globally optimal model at a linear rate. We further quantify the impact of the training and testing prompt lengths on the ICL inference error of the trained transformer. We show that when the lengths of training and testing prompts are sufficiently large, the prediction of the trained transformer approaches the ground truth distribution of the labels. Experimental results corroborate the theoretical findings.


Spotlight Poster
#W-904
On Learning Parallel Pancakes with Mostly Uniform Weights

Ilias Diakonikolas · Daniel Kane · Sushrut Karmalkar · Jasper Lee · Thanasis Pittas

We study the complexity of learning $k$-mixtures of Gaussians ($k$-GMMs) on $\mathbb R^d$. This task is known to have complexity $d^{\Omega(k)}$ in full generality. To circumvent this exponential lower bound on the number of components, research has focused on learning families of GMMs satisfying additional structural properties. A natural assumption posits that the component weights are not exponentially small and that the components have the same unknown covariance. Recent work gave a $d^{O(\log(1/w_{\min}))}$-time algorithm for this class of GMMs, where $w_{\min}$ is the minimum weight. Our first main result is a Statistical Query (SQ) lower bound showing that this quasi-polynomial upper bound is essentially best possible, even for the special case of uniform weights. Specifically, we show that it is SQ-hard to distinguish between such a mixture and the standard Gaussian. We further explore how the distribution of weights affects the complexity of this task. Our second main result is a quasi-polynomial upper bound for the aforementioned testing task when most of the weights are uniform while a small fraction of the weights are potentially arbitrary.


Poster
#W-905
How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias

Ruiquan Huang · Yingbin LIANG · Jing Yang

Language recognition tasks are fundamental in natural language processing (NLP) and have been widely used to benchmark the performance of large language models (LLMs). These tasks also play a crucial role in explaining the working mechanisms of transformers. In this work, we focus on two representative tasks in the category of regular language recognition, known as 'even pairs' and 'parity check', the aim of which is to determine whether the occurrences of certain subsequences in a given sequence are even. Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks by theoretically analyzing its training dynamics under gradient descent. While even pairs can be solved directly by a one-layer transformer, parity check needs to be solved by integrating Chain-of-Thought (CoT), either into the inference stage of a transformer well-trained for the even pairs task, or into the training of a one-layer transformer. For both problems, our analysis shows that the joint training of attention and linear layers exhibits two distinct phases. In the first phase, the attention layer grows rapidly, mapping data sequences into separable vectors. In the second phase, the attention layer becomes stable, while the linear layer grows logarithmically and converges in direction to a max-margin hyperplane that correctly separates the attention layer outputs into positive and negative samples, and the loss decreases at a rate of $O(1/t)$. Our experiments validate those theoretical results.
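For concreteness, here is one plausible formalization of the two tasks on binary sequences, based only on the abstract's description that they check whether occurrences of certain subsequences are even; the paper's exact definitions may differ.

def parity_check(bits):
    # Accept iff the number of 1s in the sequence is even.
    return sum(bits) % 2 == 0

def even_pairs(bits):
    # Accept iff the number of adjacent (0, 1) pairs is even (one possible reading).
    count = sum(1 for a, b in zip(bits, bits[1:]) if (a, b) == (0, 1))
    return count % 2 == 0

print(parity_check([1, 0, 1, 1]), even_pairs([0, 1, 1, 0, 1]))   # False True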


Poster
#W-906
Representation Preserving Multiclass Agnostic to Realizable Reduction

Steve Hanneke · Qinglin Meng · Amirreza Shaeiri

We study the problem of multiclass classification when the number of labels can be unbounded within the PAC learning framework. Our main contribution is a theory that demonstrates a simple and elegant agnostic to realizable reduction for this framework. This resolves an open problem raised by the recent work of (Hopkins et al., 2022). Notably, our result is the first representation preserving multiclass agnostic to realizable reduction, in contrast with the compression based approach of the work of (David et al., 2017). Furthermore, our main theorem is stated in an abstract framework, called ``Unified PAC Learning'', which encompasses a range of frameworks, including multiclass PAC learning, list PAC learning, and multilabel PAC learning. In addition, we explore representation preserving reductions to the realizable setting for two noise models, namely Massart noise and Tsybakov noise, in the multiclass PAC learning framework. We believe our technique may find other applications in ensuing studies of theoretical machine learning.


Poster
#W-907
Constrained Pareto Set Identification with Bandit Feedback

Cyrille Kone · Emilie Kaufmann · Laura Richert

In this paper, we address the problem of identifying the Pareto Set under feasibility constraints in a multivariate bandit setting. Specifically, given a $K$-armed bandit with unknown means $\mu_1, \dots, \mu_K \in \mathbb{R}^d$, the goal is to identify the set of arms whose mean is not uniformly worse than that of another arm (i.e., not smaller for all objectives), while satisfying some known set of linear constraints, expressing, for example, some minimal performance on each objective. Our focus lies in fixed-confidence identification, for which we introduce an algorithm that significantly outperforms racing-like algorithms and the intuitive two-stage approach that first identifies feasible arms and then their Pareto Set. We further prove an information-theoretic lower bound on the sample complexity of any algorithm for constrained Pareto Set identification, showing that the sample complexity of our approach is near-optimal. Our theoretical results are supported by an extensive empirical evaluation on a series of benchmarks.
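For intuition, the sketch below (an assumed formulation, not the paper's fixed-confidence algorithm) computes the constrained Pareto set directly from known mean vectors $\mu_1, \dots, \mu_K$ and known linear constraints $A\mu \ge b$; the bandit problem instead only observes noisy samples of these means:

```python
# Illustrative sketch: given true mean vectors mu_k in R^d and known linear
# constraints A @ mu >= b, return the feasible arms that are not dominated by
# any other feasible arm. In the bandit setting these means are unknown and
# must be estimated from samples.
import numpy as np

def dominates(x, y):
    # x dominates y if x >= y coordinate-wise and x > y in some coordinate.
    return np.all(x >= y) and np.any(x > y)

def constrained_pareto_set(means, A, b):
    feasible = [k for k, mu in enumerate(means) if np.all(A @ mu >= b)]
    pareto = []
    for k in feasible:
        if not any(dominates(means[j], means[k]) for j in feasible if j != k):
            pareto.append(k)
    return pareto

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    means = rng.uniform(0, 1, size=(6, 2))   # 6 arms, 2 objectives
    A = np.eye(2)                            # constraint: each objective
    b = np.array([0.3, 0.3])                 # must reach at least 0.3
    print("constrained Pareto set:", constrained_pareto_set(means, A, b))
```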


Poster
#W-908
Multi-Armed Bandits with Interference: Bridging Causal Inference and Adversarial Bandits

Su Jia · Peter Frazier · Nathan Kallus

Experimentation with interference poses a significant challenge in contemporary online platforms. Prior research on experimentation with interference has concentrated on the final output of a policy. Cumulative performance, while equally important, is less well understood. To address this gap, we introduce the problem of Multi-armed Bandits with Interference (MABI), where the learner assigns an arm to each of $N$ experimental units over $T$ rounds. The reward of each unit in each round depends on the treatments of all units, where the interference between two units decays in their distance. The reward functions are chosen by an adversary and may vary arbitrarily over time and across different units. We first show that the optimal expected regret (against the best fixed-arm policy) is $\tilde O(\sqrt T)$, and can be achieved by a switchback policy. However, the regret (as a random variable) of any switchback policy suffers from high variance, since such a policy does not account for $N$. We propose a policy based on a novel clustered randomization scheme, whose regret (i) is optimal in expectation and (ii) admits a high-probability bound that vanishes in $N$.
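As an illustration of the switchback idea mentioned above, the following sketch (a simplified, assumed version, not the paper's policy or its clustered randomization scheme) assigns one global arm to all $N$ units and re-randomizes it only at block boundaries:

```python
# Simplified, assumed sketch of a switchback design: every unit receives the
# same arm in each round, and the shared arm is re-drawn only when a new block
# starts, so interference is controlled by switching the whole system at once.
import random

def switchback_assignments(n_rounds, n_units, n_arms, block_len=2, seed=0):
    rng = random.Random(seed)
    schedule = []
    arm = rng.randrange(n_arms)
    for t in range(n_rounds):
        if t % block_len == 0:                 # switch the global arm at block boundaries
            arm = rng.randrange(n_arms)
        schedule.append([arm] * n_units)       # every unit gets the same arm
    return schedule

if __name__ == "__main__":
    for t, arms in enumerate(switchback_assignments(6, n_units=4, n_arms=2)):
        print("round", t, "assignments:", arms)
```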


Poster
#W-909
A Trichotomy for List Transductive Online Learning

Steve Hanneke · Amirreza Shaeiri

List learning is an important topic in both theoretical and empirical machine learning research, playing a key role in the recent breakthrough result of (Brukhim et al., 2022) on the characterization of multiclass PAC learnability, as well as addressing label ambiguity in computer vision classification tasks, among others. In this paper, we study the problem of list transductive online learning. In this framework, the learner outputs a list of multiple labels for each instance rather than just one, as in traditional multiclass classification. In the realizable setting, we demonstrate a trichotomy of possible rates of the minimax number of mistakes. In particular, if the learner plays for $\text{T} \in \mathbb{N}$ rounds, its minimax number of mistakes can only be of the orders $\Theta(\text{T})$, $\Theta(\log \text{T})$, or $\Theta(1)$. This resolves an open question raised by (Hanneke et al., 2024). On the other hand, in the agnostic setting, we characterize the learnability by constructively proving the $\widetilde{\mathcal{O}}(\sqrt{\text{T}})$ upper bound on the minimax expected regret. Along the way, we also answer another open question asked by (Moran et al., 2023). To establish these results, we introduce two new combinatorial complexity dimensions, called the Level-constrained $(\mathrm{L+1})$-Littlestone dimension and the Level-constrained $(\mathrm{L+1})$-Branching dimension, where $\mathrm{L} \in \mathbb{N}$ is the list size. Finally, we conclude our work by raising an open question regarding eliminating the list-size factor, which seems to be a crucial step, as it has consistently appeared in previous works on this subject.


Poster
#W-910
Offline Learning for Combinatorial Multi-armed Bandits

Xutong Liu · Xiangxiang Dai · Jinhang Zuo · Siwei Wang · Carlee Joe-Wong · John C. S. Lui · Wei Chen

The combinatorial multi-armed bandit (CMAB) is a fundamental sequential decision-making framework, extensively studied over the past decade. However, existing work primarily focuses on the online setting, overlooking the substantial costs of online interactions and the readily available offline datasets. To overcome these limitations, we introduce Off-CMAB, the first offline learning framework for CMAB. Central to our framework is the combinatorial lower confidence bound (CLCB) algorithm, which combines pessimistic reward estimations with combinatorial solvers. To characterize the quality of offline datasets, we propose two novel data coverage conditions and prove that, under these conditions, CLCB achieves a near-optimal suboptimality gap, matching the theoretical lower bound up to a logarithmic factor. We validate Off-CMAB through practical applications, including learning to rank, large language model (LLM) caching, and social influence maximization, showing its ability to handle nonlinear reward functions, general feedback models, and out-of-distribution action samples that exclude optimal or even feasible actions. Extensive experiments on synthetic and real-world datasets further highlight the superior performance of CLCB.
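To make the pessimism-plus-solver recipe concrete, here is a minimal sketch under simplified assumptions (Hoeffding-style confidence widths and a toy top-$m$ oracle standing in for the application-specific combinatorial solver); it illustrates the idea only and is not the CLCB implementation:

```python
# Sketch of pessimism + combinatorial solver: estimate each base arm's mean
# from an offline dataset, subtract a confidence width to get a lower
# confidence bound (LCB), and hand the LCBs to a combinatorial oracle.
import math
from collections import defaultdict

def lower_confidence_bounds(dataset, n_arms, delta=0.05):
    # dataset: list of (arm, reward) pairs collected offline.
    counts, sums = defaultdict(int), defaultdict(float)
    for arm, reward in dataset:
        counts[arm] += 1
        sums[arm] += reward
    lcbs = []
    for a in range(n_arms):
        if counts[a] == 0:
            lcbs.append(float("-inf"))     # never observed: maximally pessimistic
        else:
            mean = sums[a] / counts[a]
            width = math.sqrt(math.log(2 * n_arms / delta) / (2 * counts[a]))
            lcbs.append(mean - width)
    return lcbs

def top_m_solver(lcbs, m):
    # Placeholder oracle: pick the m base arms with the largest LCBs.
    return sorted(range(len(lcbs)), key=lambda a: lcbs[a], reverse=True)[:m]

if __name__ == "__main__":
    data = [(0, 1.0), (0, 0.0), (1, 1.0), (1, 1.0), (1, 1.0), (2, 0.0)]
    lcbs = lower_confidence_bounds(data, n_arms=4)
    print("LCBs:", lcbs, "chosen super-arm:", top_m_solver(lcbs, m=2))
```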


Poster
#W-911
Nearly Optimal Algorithms for Contextual Dueling Bandits from Adversarial Feedback

Qiwei Di · Jiafan He · Quanquan Gu

Learning from human feedback plays an important role in aligning generative models, such as large language models (LLMs). However, the effectiveness of this approach can be influenced by adversaries, who may intentionally provide misleading preferences to manipulate the output in an undesirable or harmful direction. To tackle this challenge, we study a specific model within this problem domain: contextual dueling bandits with adversarial feedback, where the true preference label can be flipped by an adversary. We propose an algorithm, named robust contextual dueling bandits ($\texttt{RCDB}$), which is based on uncertainty-weighted maximum likelihood estimation. Our algorithm achieves an $\tilde O(d\sqrt{T}/\kappa+dC/\kappa)$ regret bound, where $T$ is the number of rounds, $d$ is the dimension of the context, $\kappa$ is the lower bound of the derivative of the link function, and $0 \le C \le T$ is the total amount of adversarial feedback. We also prove a lower bound to show that our regret bound is nearly optimal, both in scenarios with and without ($C=0$) adversarial feedback. Our work is the first to achieve nearly minimax optimal regret for dueling bandits in the presence of adversarial preference feedback. Additionally, for the sigmoid link function, we develop a novel algorithm that incorporates the effect of local derivatives into the maximum likelihood estimation (MLE) analysis through a refined method for estimating the link function's derivative. This method helps us eliminate the $\kappa$ dependence in the leading term with respect to $T$, which reduces the exponential dependence on the parameter radius $B$ to a polynomial dependence. We conduct experiments to evaluate our proposed algorithm $\texttt{RCDB}$ against various types of adversarial feedback. Experimental results demonstrate its superiority over state-of-the-art dueling bandit algorithms in the presence of adversarial feedback.
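The sketch below illustrates one plausible instantiation of uncertainty-weighted maximum likelihood estimation for a linear preference model (the weighting rule and the optimizer here are assumptions for illustration, not the exact $\texttt{RCDB}$ procedure):

```python
# Hedged sketch of uncertainty-weighted MLE for a linear preference model:
# each duel yields a feature difference phi_t and a (possibly flipped) label
# y_t; observations with large uncertainty under the design matrix are
# downweighted before fitting a regularized logistic MLE by gradient descent.
import numpy as np

def uncertainty_weights(Phi, lam=1.0, alpha=1.0):
    # weight_t = min(1, alpha / ||phi_t||_{Sigma^{-1}}), Sigma = lam*I + Phi^T Phi
    Sigma = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi
    Sigma_inv = np.linalg.inv(Sigma)
    norms = np.sqrt(np.einsum("td,de,te->t", Phi, Sigma_inv, Phi))
    return np.minimum(1.0, alpha / np.maximum(norms, 1e-12))

def weighted_logistic_mle(Phi, y, weights, lam=1.0, lr=0.5, steps=2000):
    theta = np.zeros(Phi.shape[1])
    n = len(y)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-Phi @ theta))
        grad = (Phi.T @ (weights * (p - y)) + lam * theta) / n
        theta -= lr * grad
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    theta_star = np.array([1.0, -0.5])
    Phi = rng.normal(size=(200, 2))
    y = (rng.random(200) < 1 / (1 + np.exp(-Phi @ theta_star))).astype(float)
    y[:10] = 1 - y[:10]                        # a few adversarially flipped labels
    w = uncertainty_weights(Phi)
    print("estimate:", weighted_logistic_mle(Phi, y, w))
```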


Poster
#W-912
Tracking Most Significant Shifts in Infinite-Armed Bandits

Joe Suk · Jung-hun Kim

We study an infinite-armed bandit problem where actions' mean rewards are initially sampled from a reservoir distribution. Most prior works in this setting focused on stationary rewards (Berry et al., 1997; Wang et al., 2008; Bonald and Proutiere, 2013; Carpentier and Valko, 2015), with the more challenging adversarial/non-stationary variant only recently studied in the context of rotting/decreasing rewards (Kim et al., 2022; 2024). Furthermore, optimal regret upper bounds were previously achieved only with parameter knowledge of the non-stationarity and only for certain regimes of regularity of the reservoir. This work shows the first parameter-free optimal regret bounds while also relaxing these distributional assumptions. We also study a natural notion of significant shift for this problem, inspired by recent developments in finite-armed MAB (Suk & Kpotufe, 2022). We show that tighter regret bounds in terms of significant shifts can be adaptively attained. Our enhanced rates depend only on the rotting non-stationarity, exhibiting an interesting phenomenon for this problem: rising non-stationarity does not add to its difficulty.


Poster
#W-913
Online Learning in the Random-Order Model

Martino Bernasconi · Andrea Celli · Riccardo Colini Baldeschi · Federico Fusco · Stefano Leonardi · Matteo Russo

In the random-order model for online learning, the sequence of losses is chosen upfront by an adversary and presented to the learner after a random permutation. Any random-order input is asymptotically equivalent to a stochastic i.i.d.~one, but, for finite times, it may exhibit significant non-stationarity, which can hinder the performance of stochastic learning algorithms. While algorithms for adversarial inputs naturally maintain their regret guarantees in random order, simple no-regret algorithms exist for the stochastic model that fail against random-order instances. In this paper, we propose a general procedure to adapt stochastic learning algorithms to the random-order model without substantially affecting their regret guarantees. This allows us to recover improved regret bounds for prediction with delays, bandits with switching costs, and online learning with constraints. Finally, we investigate online classification and prove that, in random order, learnability is characterized by the VC dimension rather than by the Littlestone dimension, thus providing a further separation from the general adversarial model.
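The protocol itself is easy to state in code; the following toy sketch (with an arbitrary, illustrative loss sequence) shows an adversary committing to losses upfront and the learner receiving them in a uniformly random order:

```python
# Toy sketch of the random-order protocol: the full loss sequence is fixed in
# advance, a uniform random permutation is drawn, and the learner observes the
# permuted sequence one loss at a time.
import random

def random_order_stream(losses, seed=0):
    order = list(range(len(losses)))
    random.Random(seed).shuffle(order)         # uniform random permutation
    for t in order:
        yield losses[t]

if __name__ == "__main__":
    # Highly non-stationary upfront sequence: an all-ones block then an all-zeros block.
    adversarial_losses = [1.0] * 5 + [0.0] * 5
    print(list(random_order_stream(adversarial_losses)))
```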


Poster
#W-914
Optimal and Practical Batched Linear Bandit Algorithm

Sanghoon Yu · Min-hwan Oh

We study the linear bandit problem under limited adaptivity, known as the batched linear bandit. While existing approaches can achieve near-optimal regret in theory, they are often computationally prohibitive or underperform in practice. We propose BLAE, a novel batched algorithm that integrates arm elimination with regularized G-optimal design, achieving the minimax optimal regret (up to logarithmic factors in $T$) in both large-$K$ and small-$K$ regimes for the first time, while using only $O(\log\log T)$ batches. Our analysis introduces new techniques for batch-wise optimal design and refined concentration bounds. Crucially, BLAE demonstrates low computational overhead and strong empirical performance, outperforming state-of-the-art methods in extensive numerical evaluations. Thus, BLAE is the first algorithm to combine provable minimax-optimality in all regimes and practical superiority in batched linear bandits.
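For readers unfamiliar with the two building blocks named above, the sketch below illustrates a regularized G-optimal design computed with Frank-Wolfe-style updates and a confidence-based elimination step, under simplified assumptions; it is not the BLAE algorithm itself:

```python
# Hedged sketch of (i) an approximate regularized G-optimal design via
# Frank-Wolfe iterations and (ii) confidence-based arm elimination; both are
# illustrations of the ingredients, not the batched BLAE procedure.
import numpy as np

def g_optimal_design(arms, lam=1e-3, iters=200):
    # arms: (K, d) feature matrix. Returns a distribution over arms that
    # approximately minimizes the maximum prediction variance.
    K, d = arms.shape
    pi = np.full(K, 1.0 / K)
    for k in range(iters):
        V = lam * np.eye(d) + arms.T @ (pi[:, None] * arms)
        V_inv = np.linalg.inv(V)
        variances = np.einsum("kd,de,ke->k", arms, V_inv, arms)
        a = int(np.argmax(variances))          # arm with largest variance
        gamma = 1.0 / (k + 2)                  # standard Frank-Wolfe step size
        pi = (1 - gamma) * pi
        pi[a] += gamma
    return pi

def eliminate(means, widths):
    # Keep arms whose upper confidence bound reaches the best lower bound.
    best_lcb = np.max(means - widths)
    return [a for a in range(len(means)) if means[a] + widths[a] >= best_lcb]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    arms = rng.normal(size=(6, 3))
    print("design:", np.round(g_optimal_design(arms), 3))
    means = arms @ np.array([1.0, 0.0, 0.0])   # toy estimated rewards
    print("active arms:", eliminate(means, np.full(len(arms), 0.2)))
```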


Poster
#W-915
Contextual Online Decision Making with Infinite-Dimensional Functional Regression

Haichen Hu · Rui Ai · Stephen Bates · David Simchi-Levi

Contextual sequential decision-making is fundamental to machine learning, with applications in bandits, sequential hypothesis testing, and online risk control. These tasks often rely on statistical measures like expectation, variance, and quantiles. In this paper, we propose a universal algorithmic framework that learns the full underlying distribution, enabling a unified approach to all contextual online decision-making problems. The challenge lies in the uncountably infinite-dimensional regression, where existing contextual bandit algorithms all yield infinite regret. We innovatively propose an efficient infinite-dimensional functional regression oracle for contextual cumulative distribution functions (CDFs) and model every datum as a combination of context-dependent CDF basis functions. Our analysis reveals that the decay rate of the eigenvalue sequence of the design integral operator governs the regression error rate, and consequently, the utility regret rate. Specifically, when the eigenvalue sequence exhibits a polynomial decay of order $\frac{1}{\gamma}\ge 1$, the utility regret is bounded by $\tilde{O}( T^{\frac{3\gamma+2}{2(\gamma+2)}})$. The case $\gamma=0$ recovers the existing optimal rate in the contextual bandit literature with finite-dimensional regression, as does the case of exponentially decaying eigenvalues. We also provide a numerical method to compute the eigenvalue sequence of integral operators, enabling the practical implementation of our framework.


Poster
#W-916
Optimal Algorithm for Max-Min Fair Bandit

Zilong Wang · Zhiyao Zhang · Shuai Li

Multi-player multi-armed bandit (MP-MAB) has been widely studied owing to its diverse applications across numerous domains. We consider an MP-MAB problem where $N$ players compete for $K$ arms over $T$ rounds. The reward distributions are heterogeneous: each player has a different expected reward for the same arm. When multiple players select the same arm, they collide and obtain zero reward. In this paper, our target is to find the max-min fairness matching that maximizes the reward of the player who receives the lowest reward. This paper improves the existing max-min regret upper bound of $O(\exp(1/\Delta) + K^3 \log T\log \log T)$. More specifically, our decentralized fair elimination algorithm (DFE) deals with heterogeneity and collisions carefully and attains a regret upper bound of $O((N^2+K)\log T / \Delta)$, where $\Delta$ is the minimum reward gap between the max-min value and sub-optimal arms. In addition, this paper provides an $\Omega(\max\{ N^2, K \} \log T / \Delta)$ regret lower bound for this problem, which indicates that our algorithm is optimal with respect to the key parameters $T, N, K$, and $\Delta$. Numerical experiments further demonstrate the efficiency and improvement of our algorithms.
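The max-min fairness objective can be illustrated with a brute-force oracle over collision-free assignments (shown below); this assumes the expected rewards are known, whereas DFE must learn them online in a decentralized way:

```python
# Brute-force sketch of the max-min fairness objective: given the N x K matrix
# of expected rewards, find the collision-free assignment of players to
# distinct arms that maximizes the minimum per-player reward. Illustration
# only; the learning problem has unknown rewards and no central coordinator.
from itertools import permutations
import numpy as np

def max_min_matching(rewards):
    n_players, n_arms = rewards.shape
    best_value, best_assignment = float("-inf"), None
    for arms in permutations(range(n_arms), n_players):
        value = min(rewards[p, a] for p, a in enumerate(arms))
        if value > best_value:
            best_value, best_assignment = value, arms
    return best_value, best_assignment

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rewards = rng.uniform(0, 1, size=(3, 4))   # 3 players, 4 arms
    print(max_min_matching(rewards))
```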


Poster
#W-917
The Batch Complexity of Bandit Pure Exploration

Adrienne Tuynman · Rémy Degenne

In a fixed-confidence pure exploration problem in stochastic multi-armed bandits, an algorithm iteratively samples arms and should stop as early as possible and return the correct answer to a query about the arms' distributions. We are interested in batched methods, which change their sampling behaviour only a few times, between batches of observations. We give an instance-dependent lower bound on the number of batches used by any sample-efficient algorithm for any pure exploration task. We then give a general batched algorithm and prove upper bounds on its expected sample complexity and batch complexity. We illustrate both lower and upper bounds on best-arm identification and thresholding bandits.


Poster
#W-918
Adaptive Sample Sharing for Multi Agent Linear Bandits

Hamza Cherkaoui · Merwan Barlier · Igor Colin

The multi-agent linear bandit is a well-known setting for which designing efficient collaboration between agents remains challenging. This paper studies the impact of data sharing among agents on regret minimization. Unlike most existing approaches, our contribution does not rely on any assumptions about the structure of the bandit parameters. Our main result formalizes the trade-off between the bias and the uncertainty of the bandit parameter estimation for efficient collaboration. This result is the cornerstone of the Bandit Adaptive Sample Sharing (BASS) algorithm, whose efficiency over the current state of the art is validated through theoretical analysis and empirical evaluations on both synthetic and real-world datasets. Furthermore, we demonstrate that, when the agents' parameters display a cluster structure, our algorithm accurately recovers them.


Poster
#W-919
Policy Optimization for CMDPs with Bandit Feedback: Learning Stochastic and Adversarial Constraints

Francesco Emanuele Stradi · Anna Lunghi · Matteo Castiglioni · Alberto Marchesi · Nicola Gatti

We study online learning in constrained Markov decision processes (CMDPs) in which rewards and constraints may be either stochastic or adversarial. In such settings, Stradi et al. (2024) proposed the first best-of-both-worlds algorithm able to seamlessly handle stochastic and adversarial constraints, achieving optimal regret and constraint violation bounds in both cases. This algorithm suffers from two major drawbacks. First, it only works under full feedback, which severely limits its applicability in practice. Moreover, it relies on optimizing over the space of occupancy measures, which requires solving convex optimization problems, a highly inefficient task. In this paper, we provide the first best-of-both-worlds algorithm for CMDPs with bandit feedback. Specifically, when the constraints are stochastic, the algorithm achieves $\widetilde{\mathcal{O}}(\sqrt{T})$ regret and constraint violation, while, when they are adversarial, it attains $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation and a tight fraction of the optimal reward. Moreover, our algorithm is based on a policy optimization approach, which is much more efficient than occupancy-measure-based methods.


Poster
#W-920
Contextual Linear Bandits with Delay as Payoff

Mengxiao Zhang · Yingfei Wang · Haipeng Luo

A recent work by Schlisselberg et al. (2025) studies a delay-as-payoff model for stochastic multi-armed bandits, where the payoff (either loss or reward) is delayed for a period that is proportional to the payoff. While this captures many real-world applications, the simple multi-armed bandit setting limits the practicality of their results. In this paper, we address this limitation by studying the delay-as-payoff model for contextual linear bandits. Specifically, we start from the case with a fixed action set and propose an efficient algorithm whose regret overhead compared to the standard no-delay case is only of order $D\Delta_{\max}\log T$, where $T$ is the total horizon, $D$ is the maximum delay, and $\Delta_{\max}$ is the maximum suboptimality gap. When the payoff is loss, we also show further improvement of the bound, demonstrating a separation between reward and loss similar to Schlisselberg et al. (2025). Contrary to standard linear bandit algorithms that construct a least squares estimator and confidence ellipsoid, the main novelty of our algorithm is to apply a phased arm elimination procedure by only picking the **volumetric spanners** of the action set, which addresses challenges arising from both payoff-dependent delays and large action sets. We further extend our results to the case with varying action sets by adopting the reduction from Hanna et al. (2023). Finally, we implement our algorithm and showcase its effectiveness and superior performance in experiments.


Poster
#W-921
Leveraging Model Guidance to Extract Training Data from Personalized Diffusion Models

Xiaoyu Wu · Jiaru Zhang · Steven Wu

Diffusion Models (DMs) have evolved into advanced image generation tools, especially for few-shot fine-tuning where a pretrained DM is fine-tuned on a small set of images to capture specific styles or objects. Many people upload these personalized checkpoints online, fostering communities such as Civitai and HuggingFace. However, model owners may overlook the potential risks of data leakage by releasing their fine-tuned checkpoints. Moreover, concerns regarding copyright violations arise when unauthorized data is used during fine-tuning. In this paper, we ask: "Can training data be extracted from these fine-tuned DMs shared online?" A successful extraction would present not only data leakage threats but also offer tangible evidence of copyright infringement. To answer this, we propose FineXtract, a framework for extracting fine-tuning data. Our method approximates fine-tuning as a gradual shift in the model's learned distribution---from the original pretrained DM toward the fine-tuning data. By extrapolating the models before and after fine-tuning, we guide the generation toward high-probability regions within the fine-tuned data distribution. We then apply a clustering algorithm to extract the most probable images from those generated using this extrapolated guidance. Experiments on DMs fine-tuned with datasets such as WikiArt, DreamBooth, and real-world checkpoints posted online validate the effectiveness of our method, extracting approximately 20% of fine-tuning data in most cases, significantly surpassing baseline performance. The code is available.
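The extrapolation idea can be sketched in a few lines; the guidance rule below is an assumed form analogous to classifier-free guidance (function names and the exact combination are hypothetical, not the released FineXtract code):

```python
# Hedged sketch of guidance by extrapolating a pretrained and a fine-tuned
# denoiser: the guided noise prediction is pushed away from the pretrained
# model and toward the fine-tuned one, amplifying the distribution shift
# introduced by fine-tuning. Stand-in arrays replace real model outputs.
import numpy as np

def extrapolated_guidance(eps_pretrained, eps_finetuned, scale=3.0):
    # eps_*: noise predictions of the two checkpoints at the same (x_t, t).
    return eps_finetuned + scale * (eps_finetuned - eps_pretrained)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    eps_pre = rng.normal(size=(4, 4))                    # toy "image" of predictions
    eps_ft = eps_pre + 0.1 * rng.normal(size=(4, 4))     # fine-tuning shifts predictions slightly
    guided = extrapolated_guidance(eps_pre, eps_ft, scale=3.0)
    amplification = np.abs(guided - eps_pre).mean() / np.abs(eps_ft - eps_pre).mean()
    print("mean shift amplification:", amplification)
```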