Poster Session
Poster Session 3 West
West Exhibition Hall B2-B3
MindLLM: A Subject-Agnostic and Versatile Model for fMRI-to-text Decoding
Weikang Qiu · Zheng Huang · Haoyu Hu · Aosong Feng · Yujun Yan · ZHITAO YING
Decoding functional magnetic resonance imaging (fMRI) signals into text has been a key challenge in the neuroscience community, with the potential to advance brain-computer interfaces and uncover deeper insights into brain mechanisms. However, existing approaches often struggle with suboptimal predictive performance, limited task variety, and poor generalization across subjects. In response to this, we propose MindLLM, a model designed for subject-agnostic and versatile fMRI-to-text decoding. MindLLM consists of an fMRI encoder and an off-the-shelf LLM. The fMRI encoder employs a neuroscience-informed attention mechanism, which is capable of accommodating subjects with varying input shapes and thus achieves high-performance subject-agnostic decoding. Moreover, we introduce Brain Instruction Tuning (BIT), a novel approach that enhances the model's ability to capture diverse semantic representations from fMRI signals, facilitating more versatile decoding. We evaluate MindLLM on comprehensive fMRI-to-text benchmarks. Results demonstrate that our model outperforms the baselines, improving downstream tasks by $12.0\%$, unseen subject generalization by $24.5\%$, and novel task adaptation by $25.0\%$. Furthermore, the attention patterns in MindLLM provide interpretable insights into its decision-making process.
Tackling View-Dependent Semantics in 3D Language Gaussian Splatting
Jiazhong Cen · Xudong Zhou · Jiemin Fang · Changsong Wen · Lingxi Xie · xiaopeng zhang · Wei Shen · Qi Tian
Recent advancements in 3D Gaussian Splatting (3D-GS) enable high-quality 3D scene reconstruction from RGB images. Many studies extend this paradigm for language-driven open-vocabulary scene understanding. However, most of them simply project 2D semantic features onto 3D Gaussians and overlook a fundamental gap between 2D and 3D understanding: a 3D object may exhibit various semantics from different viewpoints—a phenomenon we term view-dependent semantics. To address this challenge, we propose LaGa (Language Gaussians), which establishes cross-view semantic connections by decomposing the 3D scene into objects. Then, it constructs view-aggregated semantic representations by clustering semantic descriptors and reweighting them based on multi-view semantics. Extensive experiments demonstrate that LaGa effectively captures key information from view-dependent semantics, enabling a more comprehensive understanding of 3D scenes. Notably, under the same settings, LaGa achieves a significant improvement of +18.7\% mIoU over the previous SOTA on the LERF-OVS dataset. Our code is available at: https://github.com/https://github.com/SJTU-DeepVisionLab/LaGa.
Polynomial Time Learning Augmented Algorithms for NP-hard Permutation Problems
Evripidis Bampis · Bruno Escoffier · Dimitris Fotakis · Panagiotis Patsilinakos · Michalis Xefteris
We consider a learning augmented framework for NP-hard permutation problems. The algorithm has access to predictions telling, given a pair $u,v$ of elements, whether $u$ is before $v$ or not in an optimal solution. Building on the work of Braverman and Mossel (SODA 2008), we show that for a class of optimization problems including scheduling, network design and other graph permutation problems, these predictions allow to solve them in polynomial time with high probability, provided that predictions are true with probability at least $1/2+\epsilon$. Moreover, this can be achieved with a parsimonious access to the predictions.
QuEst: Enhancing Estimates of Quantile-Based Distributional Measures Using Model Predictions
Xinyu Yang · Tom Zollo · Benjamin Eyre · Amogh Inamdar · David Madras · Richard Zemel
As machine learning models grow increasingly competent, their predictions can supplement scarce or expensive data in various important domains. In support of this paradigm, algorithms have emerged to combine a small amount of high-fidelity observed data with a much larger set of imputed model outputs to estimate some quantity of interest. Yet current hybrid-inference tools target only means or single quantiles, limiting their applicability for many critical domains and use cases. We present QuEst, a principled framework to merge observed and imputed data to deliver point estimates and rigorous confidence intervals for a wide family of quantile-based distributional measures. QuEst covers a range of measures, from tail risk (CVaR) to population segments such as quartiles, that are central to fields such as economics, sociology, education, medicine, and more. We extend QuEst to multidimensional metrics, and introduce an additional optimization technique to further reduce variance in this and other hybrid estimators. We demonstrate the utility of our framework through experiments in economic modeling, opinion polling, and language model auto-evaluation.
Fragments to Facts: Partial-Information Fragment Inference from LLMs
Lucas Rosenblatt · Bin Han · Robert Wolfe · Bill Howe
Large language models (LLMs) can leak sensitive training data through memorization and membership inference attacks. Prior work has primarily focused on strong adversarial assumptions, including attacker access to entire samples or long, ordered prefixes, leaving open the question of how vulnerable LLMs are when adversaries have only partial, unordered sample information. For example, if an attacker knows a patient has "hypertension," under what conditions can they query a model fine-tuned on patient data to learn the patient also has "osteoarthritis?" In this paper, we introduce a more general threat model under this weaker assumption and show that fine-tuned LLMs are susceptible to these fragment-specific extraction attacks. To systematically investigate these attacks, we propose two data-blind methods: (1) a likelihood ratio attack inspired by methods from membership inference, and (2) a novel approach, PRISM, which regularizes the ratio by leveraging an external prior. Using examples from medical and legal settings, we show that both methods are competitive with a data-aware baseline classifier that assumes access to labeled in-distribution data, underscoring their robustness.
Combinatorial Reinforcement Learning with Preference Feedback
Joongkyu Lee · Min-hwan Oh
In this paper, we consider combinatorial reinforcement learning with preference feedback, where a learning agent sequentially offers an action—an assortment of multiple items—to a user, whose preference feedback follows a multinomial logistic (MNL) model. This framework allows us to model real-world scenarios, particularly those involving long-term user engagement, such as in recommender systems and online advertising. However, this framework faces two main challenges: (1) the unknown value of each item, unlike traditional MNL bandits that only address single-step preference feedback, and (2) the difficulty of ensuring optimism while maintaining tractable assortment selection in the combinatorial action space with unknown values. In this paper, we assume a contextual MNL preference model, where the mean utilities are linear, and the value of each item is approximated by a general function. We propose an algorithm, MNL-VQL, that addresses these challenges, making it both computationally and statistically efficient. As a special case, for linear MDPs (with the MNL preference feedback), we establish the first regret lower bound in this framework and show that MNL-VQL achieves near-optimal regret. To the best of our knowledge, this is the first work to provide statistical guarantees in combinatorial RL with preference feedback.
Partially Observable Reinforcement Learning with Memory Traces
Onno Eberhard · Michael Muehlebach · Claire Vernade
Partially observable environments present a considerable computational challenge in reinforcement learning due to the need to consider long histories. Learning with a finite window of observations quickly becomes intractable as the window length grows. In this work, we introduce memory traces. Inspired by eligibility traces, these are compact representations of the history of observations in the form of exponential moving averages. We prove sample complexity bounds for the problem of offline on-policy evaluation that quantify the return errors achieved with memory traces for the class of Lipschitz continuous value estimates. We establish a close connection to the window approach, and demonstrate that, in certain environments, learning with memory traces is significantly more sample efficient. Finally, we underline the effectiveness of memory traces empirically in online reinforcement learning experiments for both value prediction and control.
Multi-objective Linear Reinforcement Learning with Lexicographic Rewards
Bo Xue · Dake Bu · Ji Cheng · Yuanyu Wan · Qingfu Zhang
Reinforcement Learning (RL) with linear transition kernels and reward functions has recently attracted growing attention due to its computational efficiency and theoretical advancements. However, prior theoretical research in RL has primarily focused on single-objective problems, resulting in limited theoretical development for multi-objective reinforcement learning (MORL). To bridge this gap, we examine MORL under lexicographic reward structures, where rewards comprise $m$ hierarchically ordered objectives. In this framework, the agent the agent maximizes objectives sequentially, prioritizing the highest-priority objective before considering subsequent ones. We introduce the first MORL algorithm with provable regret guarantees. For any objective $i \in \\{1, 2, \ldots, m\\}$, our algorithm achieves a regret bound of $\widetilde{O}(\Lambda^i(\lambda) \cdot \sqrt{d^2H^4 K})$, where $\Lambda^i(\lambda) = 1 + \lambda + \cdots + \lambda^{i-1}$, $\lambda$ quantifies the trade-off between conflicting objectives, $d$ is the feature dimension, $H$ is the episode length, and $K$ is the number of episodes. Furthermore, our algorithm can be applied in the misspecified setting, where the regret bound for the $i$-th objective becomes $\widetilde{O}(\Lambda^i(\lambda)\cdot(\sqrt{d^2H^4K}+\epsilon dH^2K))$, with $\epsilon$ denoting the degree of misspecification.
Near-optimal Regret Using Policy Optimization in Online MDPs with Aggregate Bandit Feedback
Tal Lancewicki · Yishay Mansour
We study online finite-horizon Markov Decision Processes with adversarially changing loss and aggregate bandit feedback (a.k.a full-bandit). Under this type of feedback, the agent observes only the total loss incurred over the entire trajectory, rather than the individual losses at each intermediate step within the trajectory. We introduce the first Policy Optimization algorithms for this setting. In the known-dynamics case, we achieve the first *optimal* regret bound of $\tilde \Theta(H^2\sqrt{SAK})$, where $K$ is the number of episodes, $H$ is the episode horizon, $S$ is the number of states, and $A$ is the number of actions. In the unknown dynamics case we establish regret bound of $\tilde O(H^3 S \sqrt{AK})$, significantly improving the best known result by a factor of $H^2 S^5 A^2$.
A Theoretical Justification for Asymmetric Actor-Critic Algorithms
Gaspard Lambrechts · Damien Ernst · Aditya Mahajan
In reinforcement learning for partially observable environments, many successful algorithms have been developed within the asymmetric learning paradigm. This paradigm leverages additional state information available at training time for faster learning. Although the proposed learning objectives are usually theoretically sound, these methods still lack a precise theoretical justification for their potential benefits. We propose such a justification for asymmetric actor-critic algorithms with linear function approximators by adapting a finite-time convergence analysis to this setting. The resulting finite-time bound reveals that the asymmetric critic eliminates error terms arising from aliasing in the agent state.
We establish a continuous-time framework for analyzing Deep Q-Networks (DQNs) via stochastic control and Forward-Backward Stochastic Differential Equations (FBSDEs). Considering a continuous-time Markov Decision Process (MDP) driven by a square-integrable martingale, we analyze DQN approximation properties. We show that DQNs can approximate the optimal Q-function on compact sets with arbitrary accuracy and high probability, leveraging residual network approximation theorems and large deviation bounds for the state-action process. We then analyze the convergence of a general Q-learning algorithm for training DQNs in this setting, adapting stochastic approximation theorems. Our analysis emphasizes the interplay between DQN layer count, time discretization, and the role of viscosity solutions (primarily for the value function $V^*$) in addressing potential non-smoothness of the optimal Q-function. This work bridges deep reinforcement learning and stochastic control, offering insights into DQNs in continuous-time settings, relevant for applications with physical systems or high-frequency data.
AdaWorld: Learning Adaptable World Models with Latent Actions
Shenyuan Gao · Siyuan Zhou · Yilun Du · Jun Zhang · Chuang Gan
World models aim to learn action-controlled future prediction and have proven essential for the development of intelligent agents. However, most existing world models rely heavily on substantial action-labeled data and costly training, making it challenging to adapt to novel environments with heterogeneous actions through limited interactions. This limitation can hinder their applicability across broader domains. To overcome this limitation, we propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. This is achieved by extracting latent actions from videos in a self-supervised manner, capturing the most critical transitions between frames. We then develop an autoregressive world model that conditions on these latent actions. This learning paradigm enables highly adaptable world models, facilitating efficient transfer and learning of new actions even with limited interactions and finetuning. Our comprehensive experiments across multiple environments demonstrate that AdaWorld achieves superior performance in both simulation quality and visual planning.
Demystifying the Paradox of Importance Sampling with an Estimated History-Dependent Behavior Policy in Off-Policy Evaluation
Hongyi Zhou · Josiah Hanna · Jin Zhu · Ying Yang · Chengchun Shi
This paper studies off-policy evaluation (OPE) in reinforcement learning with a focus on behavior policy estimation for importance sampling. Prior work has shown empirically that estimating a history-dependent behavior policy can lead to lower mean squared error (MSE) even when the true behavior policy is Markovian. However, the question of why the use of history should lower MSE remains open. In this paper, we theoretically demystify this paradox by deriving a bias-variance decomposition of the MSE of ordinary importance sampling (IS) estimators, demonstrating that history-dependent behavior policy estimation decreases their asymptotic variances while increasing their finite-sample biases. Additionally, as the estimated behavior policy conditions on a longer history, we show a consistent decrease in variance. We extend these findings to a range of other OPE estimators, including the sequential IS estimator, the doubly robust estimator and the marginalized IS estimator, with the behavior policy estimated either parametrically or non-parametrically.
Multiple-policy Evaluation via Density Estimation
Yilei Chen · Aldo Pacchiano · Ioannis Paschalidis
We study the multiple-policy evaluation problem where we are given a set of $K$ policies and the goal is to evaluate their performance (expected total reward over a fixed horizon) to an accuracy $\epsilon$ with probability at least $1-\delta$. We propose an algorithm named CAESAR for this problem. Our approach is based on computing an approximately optimal sampling distribution and using the data sampled from it to perform the simultaneous estimation of the policy values. CAESAR has two phases. In the first phase, we produce coarse estimates of the visitation distributions of the target policies at a low order sample complexity rate that scales with $\tilde{O}(\frac{1}{\epsilon})$. In the second phase, we approximate the optimal sampling distribution and compute the importance weighting ratios for all target policies by minimizing a step-wise quadratic loss function inspired by the DualDICE objective. Up to low order and logarithmic terms CAESAR achieves a sample complexity $\tilde{O}\left(\frac{H^4}{\epsilon^2}\sum_{h=1}^H\max_{k\in[K]}\sum_{s,a}\frac{(d_h^{\pi^k}(s,a))^2}{\mu^*_h(s,a)}\right)$, where $d^{\pi}$ is the visitation distribution of policy $\pi$, $\mu^*$ is the optimal sampling distribution, and $H$ is the horizon.
Provable Zero-Shot Generalization in Offline Reinforcement Learning
Zhiyong Wang · Chen Yang · John C. S. Lui · Dongruo Zhou
In this work, we study offline reinforcement learning (RL) with zero-shot generalization property (ZSG), where the agent has access to an offline dataset including experiences from different environments, and the goal of the agent is to train a policy over the training environments which performs well on test environments without further interaction. Existing work showed that classical offline RL fails to generalize to new, unseen environments. We propose pessimistic empirical risk minimization (PERM) and pessimistic proximal policy optimization (PPPO), which leverage pessimistic policy evaluation to guide policy learning and enhance generalization. We show that both PERM and PPPO are capable of finding a near-optimal policy with ZSG. Our result serves as a first step in understanding the foundation of the generalization phenomenon in offline reinforcement learning.
Learning with Expected Signatures: Theory and Applications
Lorenzo Lucchese · Mikko S. Pakkanen · Almut E. D. Veraart
The expected signature maps a collection of data streams to a lower dimensional representation, with a remarkable property: the resulting feature tensor can fully characterize the data generating distribution. This "model-free"' embedding has been successfully leveraged to build multiple domain-agnostic machine learning (ML) algorithms for time series and sequential data. The convergence results proved in this paper bridge the gap between the expected signature's empirical discrete-time estimator and its theoretical continuous-time value, allowing for a more complete probabilistic interpretation of expected signature-based ML methods. Moreover, when the data generating process is a martingale, we suggest a simple modification of the expected signature estimator with significantly lower mean squared error and empirically demonstrate how it can be effectively applied to improve predictive performance.
Efficient Diffusion Models for Symmetric Manifolds
Oren Mangoubi · Neil He · Nisheeth K. Vishnoi
We introduce a framework for designing efficient diffusion models for $d$-dimensional symmetric-space Riemannian manifolds, including the torus, sphere, special orthogonal group and unitary group. Existing manifold diffusion models often depend on heat kernels, which lack closed-form expressions and require either $d$ gradient evaluations or exponential-in-$d$ arithmetic operations per training step. We introduce a new diffusion model for symmetric manifolds with a spatially-varying covariance, allowing us to leverage a projection of Euclidean Brownian motion to bypass heat kernel computations. Our training algorithm minimizes a novel efficient objective derived via Ito's Lemma, allowing each step to run in $O(1)$ gradient evaluations and nearly-linear-in-$d$ ($O(d^{1.19})$) *arithmetic* operations, reducing the gap between diffusions on symmetric manifolds and Euclidean space. Manifold symmetries ensure the diffusion satisfies an "average-case" Lipschitz condition, enabling accurate and efficient sample generation. Empirically, our model outperforms prior methods in training speed and improves sample quality on synthetic datasets on the torus, special orthogonal group, and unitary group.
Diffusion Models are Secretly Exchangeable: Parallelizing DDPMs via Auto Speculation
Hengyuan Hu · Aniket Das · Dorsa Sadigh · Nima Anari
Denoising Diffusion Probabilistic Models (DDPMs) have emerged as powerful tools for generative modeling. However, their sequential computation requirements lead to significant inference-time bottlenecks. In this work, we utilize the connection between DDPMs and Stochastic Localization to prove that, under an appropriate reparametrization, the increments of DDPM satisfy an exchangeability property. This general insight enables near-black-box adaptation of various performance optimization techniques from autoregressive models to the diffusion setting. To demonstrate this, we introduce _Autospeculative Decoding_ (ASD), an extension of the widely used speculative decoding algorithm to DDPMs that does not require any auxiliary draft models. Our theoretical analysis shows that ASD achieves a $\tilde{O}(K^{\frac{1}{3}})$ parallel runtime speedup over the $K$ step sequential DDPM. We also demonstrate that a practical implementation of autospeculative decoding accelerates DDPM inference significantly in various domains.
The Limits of Tractable Marginalization
Oliver Broadrick · Sanyam Agarwal · Guy Van den Broeck · Markus Bläser
Marginalization -- summing a function over all assignments to a subset of its inputs -- is a fundamental computational problem with applications from probabilistic inference to formal verification.Despite its computational hardness in general, there exist many classes of functions (e.g., probabilistic models) for which marginalization remains tractable, and they can all be commonly expressed by arithmetic circuits computing multilinear polynomials.This raises the question, can *all* functions with polynomial time marginalization algorithms be succinctly expressed by such circuits? We give a negative answer, exhibiting simple functions with tractable marginalization yet no efficient representation by known models, assuming $\\mathsf{FP} \\neq \\#\\mathsf{P}$ (an assumption implied by $\\mathsf{P} \\neq \\mathsf{NP}$). To this end, we identify a hierarchy of complexity classes corresponding to stronger forms of marginalization, all of which are efficiently computable on the known circuit models. We conclude with a completeness result, showing that whenever there is an efficient real RAM performing virtual evidence marginalization for a function, then there are small arithmetic circuits for that function's multilinear representation.
Exact Upper and Lower Bounds for the Output Distribution of Neural Networks with Random Inputs
Andrey Kofnov · Daniel Kapla · Ezio Bartocci · Efstathia Bura
We derive exact upper and lower bounds for the cumulative distribution function (cdf) of the output of a neural network (NN) over its entire support subject to noisy (stochastic) inputs. The upper and lower bounds converge to the true cdf over its domain as the resolution increases. Our method applies to any feedforward NN using continuous monotonic piecewise twice continuously differentiable activation functions (e.g., ReLU, tanh and softmax) and convolutional NNs, which were beyond the scope of competing approaches. The novelty and instrumental tool of our approach is to bound general NNs with ReLU NNs. The ReLU NN-based bounds are then used to derive the upper and lower bounds of the cdf of the NN output. Experiments demonstrate that our method delivers guaranteed bounds of the predictive output distribution over its support, thus providing exact error guarantees, in contrast to competing approaches.
Distributed Conformal Prediction via Message Passing
Haifeng Wen · Hong XING · Osvaldo Simeone
Post-hoc calibration of pre-trained models is critical for ensuring reliable inference, especially in safety-critical domains such as healthcare. Conformal Prediction (CP) offers a robust post-hoc calibration framework, providing distribution-free statistical coverage guarantees for prediction sets by leveraging held-out datasets. In this work, we address a decentralized setting where each device has limited calibration data and can communicate only with its neighbors over an arbitrary graph topology. We propose two message-passing-based approaches for achieving reliable inference via CP: quantile-based distributed conformal prediction (Q-DCP) and histogram-based distributed conformal prediction (H-DCP). Q-DCP employs distributed quantile regression enhanced with tailored smoothing and regularization terms to accelerate convergence, while H-DCP uses a consensus-based histogram estimation approach. Through extensive experiments, we investigate the trade-offs between hyperparameter tuning requirements, communication overhead, coverage guarantees, and prediction set sizes across different network topologies. The code of our work is released on: https://github.com/HaifengWen/Distributed-Conformal-Prediction.
Blink of an eye: a simple theory for feature localization in generative models
Marvin Li · Aayush Karan · Sitan Chen
Large language models can exhibit unexpected behavior in the blink of an eye. In a recent computer use demo, a language model switched from coding to Googling pictures of Yellowstone, and these sudden shifts in behavior have also been observed in reasoning patterns and jailbreaks. This phenomenon is not unique to autoregressive models: in diffusion models, key features of the final output are decided in narrow ``critical windows'' of the generation process. In this work we develop a simple, unifying theory to explain this phenomenon. Using the formalism of stochastic localization for generative models, we show that it emerges generically as the generation process localizes to a sub-population of the distribution it models. While critical windows have been studied at length in diffusion models, existing theory heavily relies on strong distributional assumptions and the particulars of Gaussian diffusion. In contrast to existing work our theory (1) applies to autoregressive and diffusion models; (2) makes very few distributional assumptions; (3) quantitatively improves previous bounds even when specialized to diffusions; and (4) requires basic mathematical tools. Finally, we validate our predictions empirically for LLMs and find that critical windows often coincide with failures in problem solving for various math and reasoning benchmarks.
Federated Oriented Learning: A Practical One-Shot Personalized Federated Learning Framework
Guan Huang · Tao Shu
Personalized Federated Learning (PFL) has become a promising learning paradigm, enabling the training of high-quality personalized models through multiple communication rounds between clients and a central server. However, directly applying traditional PFL in real-world environments where communication is expensive, limited, or infeasible is challenging, as seen in Low Earth Orbit (LEO) satellite constellations, which face severe communication constraints due to their high mobility, limited contact windows. To address these issues, we introduce Federated Oriented Learning (FOL), a novel four-stage one-shot PFL algorithm designed to enhance local model performance by leveraging neighboring models within stringent communication constraints. FOL comprises model pretraining, model collection, model alignment (via fine-tuning, pruning, post fine-tuning, and ensemble refinement), and knowledge distillation stages. We establish two theoretical guarantees on empirical risk discrepancy between student and teacher models and the convergence of the distillation process. Extensive experiments on datasets Wildfire, Hurricane, CIFAR-10, CIFAR-100, and SVHN demonstrate that FOL consistently outperforms state-of-the-art one-shot Federated Learning (OFL) methods; for example, it achieves accuracy improvements of up to 39.24\% over the baselines on the Wildfire dataset.
Statistical Test for Feature Selection Pipelines by Selective Inference
Tomohiro Shiraishi · Tatsuya Matsukawa · Shuichi Nishino · Ichiro Takeuchi
A data analysis pipeline is a structured sequence of steps that transforms raw data into meaningful insights by integrating various analysis algorithms. In this paper, we propose a novel statistical test to assess the significance of data analysis pipelines. Our approach enables the systematic development of valid statistical tests applicable to any feature selection pipeline composed of predefined components. We develop this framework based on selective inference, a statistical technique that has recently gained attention for data-driven hypotheses. As a proof of concept, we focus on feature selection pipelines for linear models, composed of three missing value imputation algorithms, three outlier detection algorithms, and three feature selection algorithms. We theoretically prove that our statistical test can control the probability of false positive feature selection at any desired level, and demonstrate its validity and effectiveness through experiments on synthetic and real data. Additionally, we present an implementation framework that facilitates testing across any configuration of these feature selection pipelines without extra implementation costs.
Making Hard Problems Easier with Custom Data Distributions and Loss Regularization: A Case Study in Modular Arithmetic
Eshika Saxena · Alberto Alfarano · Emily Wenger · Kristin Lauter
Recent work showed that ML-based attacks on Learning with Errors (LWE), a hard problem used in post-quantum cryptography, outperform classical algebraic attacks in certain settings. Although promising, ML attacks struggle to scale to more complex LWE settings. Prior work connected this issue to the difficulty of training ML models to do modular arithmetic, a core feature of the LWE problem. To address this, we develop techniques that significantly boost the performance of ML models on modular arithmetic tasks—enabling the models to sum up to $N=128$ elements modulo $q \le 974269$. Our core innovation is the use of custom training data distributions and a carefully designed loss function that better represents the problem structure. We apply an initial proof of concept of our techniques to LWE specifically and find that they allow recovery of 2x harder secrets than prior work. Our techniques also help ML models learn other well-studied problems better, including copy, associative recall, and parity, motivating further study.
NMA-tune: Generating Highly Designable and Dynamics Aware Protein Backbones
Urszula Julia Komorowska · Francisco Vargas · Alessandro Rondina · Pietro Lió · Mateja Jamnik
Protein's backbone flexibility is a crucial property that heavily influences its functionality. Recent work in the field of protein diffusion probabilistic modelling has leveraged Normal Mode Analysis (NMA) and, for the first time, introduced information about large scale protein motion into the generative process. However, obtaining molecules with both the desired dynamics and designable quality has proven challenging. In this work, we present NMA-tune, a new method that introduces the dynamics information to the protein design stage. NMA-tune uses a trainable component to condition the backbone generation on the lowest normal mode of oscillation. We implement NMA-tune as a plug-and-play extension to RFdiffusion, show that the proportion of samples with high quality structure and the desired dynamics is improved as compared to other methods without the trainable component, and we show the presence of the targeted modes in the Molecular Dynamics simulations.
WGFormer: An SE(3)-Transformer Driven by Wasserstein Gradient Flows for Molecular Ground-State Conformation Prediction
Fanmeng Wang · Minjie Cheng · Hongteng Xu
Predicting molecular ground-state conformation (i.e., energy-minimized conformation) is crucial for many chemical applications such as molecular docking and property prediction. Classic energy-based simulation is time-consuming when solving this problem, while existing learning-based methods have advantages in computational efficiency but sacrifice accuracy and interpretability.In this work, we propose a novel and effective method to bridge the energy-based simulation and the learning-based strategy, which designs and learns a Wasserstein gradient flow-driven SE(3)-Transformer, called WGFormer, for ground-state conformation prediction. Specifically, our method tackles this task within an auto-encoding framework, which encodes low-quality conformations by the proposed WGFormer and decodes corresponding ground-state conformations by an MLP.The architecture of WGFormer corresponds to Wasserstein gradient flows --- it optimizes conformations by minimizing an energy function defined on the latent mixture models of atoms, thereby significantly improving performance and interpretability.Extensive experiments demonstrate that our method consistently outperforms state-of-the-art competitors, providing a new and insightful paradigm to predict ground-state conformation.The code is available at https://github.com/FanmengWang/WGFormer.
MF-LAL: Drug Compound Generation Using Multi-Fidelity Latent Space Active Learning
Peter Eckmann · Dongxia Wu · Germano Heinzelmann · Michael Gilson · Rose Yu
Current generative models for drug discovery primarily use molecular docking as an oracle to guide the generation of active compounds. However, such models are often not useful in practice because even compounds with high docking scores do not consistently show real-world experimental activity. More accurate methods for activity prediction exist, such as molecular dynamics based binding free energy calculations, but they are too computationally expensive to use in a generative model. To address this challenge, we propose Multi-Fidelity Latent space Active Learning (MF-LAL), a generative modeling framework that integrates a set of oracles with varying cost-accuracy tradeoffs. Using active learning, we train a surrogate model for each oracle and use these surrogates to guide generation of compounds with high predicted activity. Unlike previous approaches that separately learn the surrogate model and generative model, MF-LAL combines the generative and multi-fidelity surrogate models into a single framework, allowing for more accurate activity prediction and higher quality samples. Our experiments on two disease-relevant proteins show that MF-LAL produces compounds with significantly better binding free energy scores than other single and multi-fidelity approaches (~50% improvement in mean binding free energy score). The code is available at https://github.com/Rose-STL-Lab/MF-LAL.
DISCO: learning to DISCover an evolution Operator for multi-physics-agnostic prediction
Rudy Morel · Jiequn Han · Edouard Oyallon
We address the problem of predicting the next states of a dynamical system governed by unknown temporal partial differential equations (PDEs) using only a short trajectory. While standard transformers provide a natural black-box solution to this task, the presence of a well-structured evolution operator in the data suggests a more tailored and efficient approach. Specifically, when the PDE is fully known, classical numerical solvers can evolve the state accurately with only a few parameters. Building on this observation, we introduce DISCO, a model that uses a large hypernetwork to process a short trajectory and generate the parameters of a much smaller operator network, which then predicts the next states through time integration. Our framework decouples dynamics estimation -- i.e., DISCovering an evolution Operator from a short trajectory -- from state prediction -- i.e., evolving this operator. Experiments show that pretraining our model on diverse physics datasets achieves state-of-the-art performance while requiring significantly fewer epochs. Moreover, it generalizes well to unseen initial conditions and remains competitive when fine-tuned on downstream tasks.
PolyConf: Unlocking Polymer Conformation Generation through Hierarchical Generative Models
Fanmeng Wang · Wentao Guo · Qi Ou · Hongshuai Wang · Haitao Lin · Hongteng Xu · Zhifeng Gao
Polymer conformation generation is a critical task that enables atomic-level studies of diverse polymer materials. While significant advances have been made in designing conformation generation methods for small molecules and proteins, these methods struggle to generate polymer conformations due to their unique structural characteristics.Meanwhile, the scarcity of polymer conformation datasets further limits the progress, making this important area largely unexplored.In this work, we propose PolyConf, a pioneering tailored polymer conformation generation method that leverages hierarchical generative models to unlock new possibilities.Specifically, we decompose the polymer conformation into a series of local conformations (i.e., the conformations of its repeating units), generating these local conformations through an autoregressive model, and then generating their orientation transformations via a diffusion model to assemble them into the complete polymer conformation.Moreover, we develop the first benchmark with a high-quality polymer conformation dataset derived from molecular dynamics simulations to boost related research in this area.The comprehensive evaluation demonstrates that PolyConf consistently outperforms existing conformation generation methods, thus facilitating advancements in polymer modeling and simulation.The whole work is available at https://polyconf-icml25.github.io.
Empower Structure-Based Molecule Optimization with Gradient Guided Bayesian Flow Networks
Keyue Qiu · Yuxuan Song · Jie Yu · Hongbo Ma · Ziyao Cao · Zhilong Zhang · Yushuai Wu · Mingyue Zheng · Hao Zhou · Wei-Ying Ma
Structure-based molecule optimization (SBMO) aims to optimize molecules with both continuous coordinates and discrete types against protein targets.A promising direction is to exert gradient guidance on generative models given its remarkable success in images, but it is challenging to guide discrete data and risks inconsistencies between modalities.To this end, we leverage a continuous and differentiable space derived through Bayesian inference, presenting Molecule Joint Optimization (MolJO), the gradient-based SBMO framework that facilitates joint guidance signals across different modalities while preserving SE(3)-equivariance.We introduce a novel backward correction strategy that optimizes within a sliding window of the past histories, allowing for a seamless trade-off between explore-and-exploit during optimization.MolJO achieves state-of-the-art performance on CrossDocked2020 benchmark (Success Rate 51.3\%, Vina Dock -9.05 and SA 0.78), more than 4x improvement in Success Rate compared to the gradient-based counterpart, and 2x ``Me-Better'' Ratio as much as 3D baselines. Furthermore, we extend MolJO to a wide range of optimization settings, including multi-objective optimization and challenging tasks in drug design such as R-group optimization and scaffold hopping, further underscoring its versatility.Code is available at https://github.com/AlgoMole/MolCRAFT.
DUNIA: Pixel-Sized Embeddings via Cross-Modal Alignment for Earth Observation Applications
Ibrahim Fayad · Max Zimmer · Martin Schwartz · Fabian Gieseke · Philippe CIAIS · Gabriel Belouze · Sarah Brood · Aurélien de Truchis · Alexandre d'Aspremont
Significant efforts have been directed towards adapting self-supervised multimodal learning for Earth observation applications. However, most current methods produce coarse patch-sized embeddings, limiting their effectiveness and integration with other modalities like LiDAR. To close this gap, we present DUNIA, an approach to learn pixel-sized embeddings through cross-modal alignment between images and full-waveform LiDAR data. As the model is trained in a contrastive manner, the embeddings can be directly leveraged in the context of a variety of environmental monitoring tasks in a zero-shot setting. In our experiments, we demonstrate the effectiveness of the embeddings for seven such tasks: canopy height mapping, fractional canopy cover, land cover mapping, tree species identification, plant area index, crop type classification, and per-pixel waveform-based vertical structure mapping. The results show that the embeddings, along with zero-shot classifiers, often outperform specialized supervised models, even in low-data regimes. In the fine-tuning setting, we show strong performances near or better than the state-of-the-art on five out of six tasks.
Learning Adversarial MDPs with Stochastic Hard Constraints
Francesco Emanuele Stradi · Matteo Castiglioni · Alberto Marchesi · Nicola Gatti
We study online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints, under bandit feedback. We consider three scenarios. In the first one, we address general CMDPs, where we design an algorithm attaining sublinear regret and cumulative positive constraints violation. In the second scenario, under the mild assumption that a policy strictly satisfying the constraints exists and is known to the learner, we design an algorithm that achieves sublinear regret while ensuring that constraints are satisfied at every episode with high probability. In the last scenario, we only assume the existence of a strictly feasible policy, which is not known to the learner, and we design an algorithm attaining sublinear regret and constant cumulative positive constraints violation. Finally, we show that in the last two scenarios, a dependence on the Slater's parameter is unavoidable. To the best of our knowledge, our work is the first to study CMDPs involving both adversarial losses and hard constraints. Thus, our algorithms can deal with general non-stationary environments subject to requirements much stricter than those manageable with existing ones, enabling their adoption in a much wider range of applications.
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
Feng Wang · Yaodong Yu · Wei Shao · Yuyin Zhou · Alan Yuille · Cihang Xie
Since the introduction of Vision Transformer (ViT), patchification has long been regarded as a common image pre-processing approach for plain visual architectures. By compressing the spatial size of images, this approach can effectively shorten the token sequence and reduce the computational cost of ViT-like plain architectures. In this work, we aim to thoroughly examine the information loss caused by this patchification-based compressive encoding paradigm and how it affects visual understanding. We conduct extensive patch size scaling experiments and excitedly observe an intriguing scaling law in patchification: the models can consistently benefit from decreased patch sizes and attain improved predictive performance, until it reaches the minimum patch size of 1*1, i.e., pixel tokenization. This conclusion is broadly applicable across different vision tasks, various input scales, and diverse architectures such as ViT and the recent Mamba models. Moreover, as a by-product, we discover that with smaller patches, task-specific decoder heads become less critical for dense prediction. In the experiments, we successfully scale up the visual sequence to an exceptional length of 50,176 tokens, achieving a competitive test accuracy of 84.6% with a base-sized model on the ImageNet-1k benchmark. We hope this study can provide insights and theoretical foundations for future works of building non-compressive vision models.
SPEX: Scaling Feature Interaction Explanations for LLMs
Justin S. Kang · Landon Butler · Abhineet Agarwal · Yigit Efe Erginbas · Ramtin Pedarsani · Bin Yu · Kannan Ramchandran
Large language models (LLMs) have revolutionized machine learning due to their ability to capture complex interactions between input features. Popular post-hoc explanation methods like SHAP provide *marginal* feature attributions, while their extensions to interaction importances only scale to small input lengths ($\approx 20$). We propose *Spectral Explainer* (SPEX), a model-agnostic interaction attribution algorithm that efficiently scales to large input lengths ($\approx 1000)$. SPEX exploits underlying natural sparsity among interactions—common in real-world data—and applies a sparse Fourier transform using a channel decoding algorithm to efficiently identify important interactions. We perform experiments across three difficult long-context datasets that require LLMs to utilize interactions between inputs to complete the task. For large inputs, SPEX outperforms marginal attribution methods by up to 20\% in terms of faithfully reconstructing LLM outputs. Further, SPEX successfully identifies key features and interactions that strongly influence model output. For one of our datasets, *HotpotQA*, SPEX provides interactions that align with human annotations. Finally, we use our model-agnostic approach to generate explanations to demonstrate abstract reasoning in closed-source LLMs (*GPT-4o mini*) and compositional reasoning in vision-language models.
DS-VLM: Diffusion Supervision Vision Language Model
Zhen Sun · Yunhang Shen · Jie Li · Xing Sun · Pingyang Dai · Liujuan Cao · Rongrong Ji
Vision-Language Models (VLMs) face two critical limitations in visual representation learning: degraded supervision due to information loss during gradient propagation, and the inherent semantic sparsity of textual supervision compared to visual data. We propose the Diffusion Supervision Vision-Language Model (DS-VLM), a plug-and-play framework that introduces diffusion-based direct supervision for vision-language alignment. By reconstructing input images through a diffusion model conditioned on outputs of the visual encoder and the connector, our method establishes a short-path gradient propagation channel from pixel space to visual features. This approach simultaneously preserves high-level semantic alignment through conventional text supervision while enhancing visual feature quality via pixel-level reconstruction constraints. Extensive experiments conducted across various visual encoders and LLMs of different scales demonstrate the effectiveness of our approach.
Multi-Modal Object Re-identification via Sparse Mixture-of-Experts
Yingying Feng · Jie Li · Chi Xie · Lei Tan · Jiayi Ji
We present MFRNet, a novel network for multi-modal object re-identification that integrates multi-modal data features to effectively retrieve specific objects across different modalities. Current methods suffer from two principal limitations: (1) insufficient interaction between pixel-level semantic features across modalities, and (2) difficulty in balancing modality-shared and modality-specific features within a unified architecture. To address these challenges, our network introduces two core components. First, the Feature Fusion Module (FFM) enables fine-grained pixel-level feature generation and flexible cross-modal interaction. Second, the Feature Representation Module (FRM) efficiently extracts and combines modality-specific and modality-shared features, achieving strong discriminative ability with minimal parameter overhead. Extensive experiments on three challenging public datasets (RGBNT201, RGBNT100, and MSVR310) demonstrate the superiority of our approach in terms of both accuracy and efficiency, with 8.4% mAP and 6.9% accuracy improved in RGBNT201 with negligible additional parameters.
Guided Structural Inference: Leveraging Priors with Soft Gating Mechanisms
Aoran Wang · Xinnan Dai · Jun Pang
Existing methods for inferring latent relational structures struggle to integrate partial prior knowledge, such as known edges, node-degree constraints, and global sparsity, without destabilizing training or conflicting with probabilistic assumptions. We propose Soft-Gated Structural Inference (SGSI), a VAE framework that seamlessly incorporates domain constraints via (1) soft gating with learnable edge masks to preserve gradients, (2) cloning-clamping of deterministic edges to avoid distributional conflicts, and (3) adaptive regularization to balance data-driven learning with domain constraints. By excluding known edges from stochastic inference, SGSI reallocates capacity to uncertain interactions, optimizing the information bottleneck trade-off. Experiments on 16 datasets show SGSI improves edge recovery by up to $9$\% AUROC over baselines, scales to larger graphs ($94.2$\% AUROC), and maintains stable training. SGSI bridges domain expertise with data-driven learning, enabling interpretable and robust structural discovery in dynamical systems.
What Has a Foundation Model Found? Inductive Bias Reveals World Models
Keyon Vafa · Peter Chang · Ashesh Rambachan · Sendhil Mullainathan
Foundation models are premised on the idea that sequence prediction can uncover deeper domain understanding, much like how Kepler's predictions of planetary motion later led to the discovery of Newtonian mechanics. However, evaluating whether these models truly capture deeper structure remains a challenge. We develop a technique for evaluating foundation models that examines how they adapt to synthetic datasets generated from some postulated world model. Our technique measures whether the foundation model's inductive bias aligns with the world model, and so we refer to it as an inductive bias probe. Across multiple domains, we find that foundation models can excel at their training tasks yet fail to develop inductive biases towards the underlying world model when adapted to new tasks. We particularly find that foundation models trained on orbital trajectories consistently fail to apply Newtonian mechanics when adapted to new physics tasks. Further analysis reveals that these models behave as if they develop task-specific heuristics that fail to generalize.
Relating Misfit to Gain in Weak-to-Strong Generalization Beyond the Squared Loss
Abhijeet Mulgund · Chirag Pabbaraju
The paradigm of weak-to-strong generalization constitutes the training of a strong AI model on data labeled by a weak AI model, with the goal that the strong model nevertheless outperforms its weak supervisor on the target task of interest. For the setting of real-valued regression with the squared loss, recent work quantitatively characterizes the gain in performance of the strong model over the weak model in terms of the misfit between the strong and weak model. We generalize such a characterization to learning tasks whose loss functions correspond to arbitrary Bregman divergences when the strong class is convex. This extends the misfit-based characterization of performance gain in weak-to-strong generalization to classification tasks, as the cross-entropy loss can be expressed in terms of a Bregman divergence. In most practical scenarios, however, the strong model class may not be convex. We therefore weaken this assumption and study weak-to-strong generalization for convex combinations of $k$ strong models in the strong class, in the concrete setting of classification. This allows us to obtain a similar misfit-based characterization of performance gain, up to an additional error term that vanishes as $k$ gets large. Our theoretical findings are supported by thorough experiments on synthetic as well as real-world datasets.
Enhancing Spectral GNNs: From Topology and Perturbation Perspectives
Taoyang Qin · Ke-Jia CHEN · Zheng Liu
Spectral Graph Neural Networks process graph signals using the spectral properties of the normalized graph Laplacian matrix. However, the frequent occurrence of repeated eigenvalues limits the expressiveness of spectral GNNs. To address this, we propose a higher-dimensional sheaf Laplacian matrix, which not only encodes the graph's topological information but also increases the upper bound on the number of distinct eigenvalues. The sheaf Laplacian matrix is derived from carefully designed perturbations of the block form of the normalized graph Laplacian, yielding a perturbed sheaf Laplacian (PSL) matrix with more distinct eigenvalues. We provide a theoretical analysis of the expressiveness of spectral GNNs equipped with the PSL and establish perturbation bounds for the eigenvalues. Extensive experiments on benchmark datasets for node classification demonstrate that incorporating the perturbed sheaf Laplacian enhances the performance of spectral GNNs.
Finding Wasserstein Ball Center: Efficient Algorithm and The Applications in Fairness
Yuntao Wang · Yuxuan Li · Qingyuan Yang · Hu Ding
Wasserstein Barycenter (WB) is a fundamental geometric optimization problem in machine learning, whose objective is to find a representative probability measure that minimizes the sum of Wasserstein distances to given distributions. WB has a number of applications in various areas. However, WB may lead to unfair outcome towards underrepresented groups in some applications (e.g., a "minority'' distribution may be far away from the obtained WB under Wasserstein distance). To address this issue, we propose an alternative objective called "Wasserstein Ball Center (WBC)''. Specifically, WBC is a distribution that encompasses all input distributions within the minimum Wasserstein distance, which can be formulated as a ``minmax'' optimization problem. We show that the WBC problem with fixed support is equivalent to solving a large-scale linear programming (LP) instance, which is quite different from the previously studied LP model for WB. By incorporating some novel observations on the induced normal equation, we propose an efficient algorithm that accelerates the interior point method by $O(\min(N^2m, Nm^2, m^4))$ times ("$N$'' is the number of distributions and "$m$'' is the support size). Finally, we conduct a set of experiments on both synthetic and real-world datasets, demonstrating the computational efficiency of our algorithm, and showing its ability to provide more fairness for input distributions.
Neural Interpretable PDEs: Harmonizing Fourier Insights with Attention for Scalable and Interpretable Physics Discovery
Ning Liu · Yue Yu
Attention mechanisms have emerged as transformative tools in core AI domains such as natural language processing and computer vision. Yet, their largely untapped potential for modeling intricate physical systems presents a compelling frontier. Learning such systems often entails discovering operators that map between functional spaces using limited instances of function pairs---a task commonly framed as a severely ill-posed inverse PDE problem. In this work, we introduce Neural Interpretable PDEs (NIPS), a novel neural operator architecture that builds upon and enhances Nonlocal Attention Operators (NAO) in both predictive accuracy and computational efficiency. NIPS employs a linear attention mechanism to enable scalable learning and integrates a learnable kernel network that acts as a channel-independent convolution in Fourier space. As a consequence, NIPS eliminates the need to explicitly compute and store large pairwise interactions, effectively amortizing the cost of handling spatial interactions into the Fourier transform. Empirical evaluations demonstrate that NIPS consistently surpasses NAO and other baselines across diverse benchmarks, heralding a substantial leap in scalable, interpretable, and efficient physics learning. Our code and data accompanying this paper are available at https://github.com/fishmoon1234/Nonlocal-Attention-Operator.
QEM-Bench: Benchmarking Learning-based Quantum Error Mitigation and QEMFormer as a Multi-ranged Context Learning Baseline
Tianyi Bao · Ruizhe Zhong · Xinyu Ye · Yehui Tang · Junchi Yan
Quantum Error Mitigation (QEM) has emerged as a pivotal technique for enhancing the reliability of noisy quantum devices in the Noisy Intermediate-Scale Quantum (NISQ) era. Recently, machine learning (ML)-based QEM approaches have demonstrated strong generalization capabilities without sampling overheads compared to conventional methods. However, evaluating these techniques is often hindered by a lack of standardized datasets and inconsistent experimental settings across different studies. In this work, we present QEM-Bench, a comprehensive benchmark suite of twenty-two datasets covering diverse circuit types and noise profiles, which provides a unified platform for comparing and advancing ML-based QEM methods. We further propose a refined ML-based QEM pipeline QEMFormer, which leverages a feature encoder that preserves local, global, and topological information, along with a two-branch model that captures short-range and long-range dependencies within the circuit. Empirical evaluations on QEM-Bench illustrate the superior performance of QEMFormer over existing baselines, underscoring the potential of integrated ML-QEM strategies.
Action-Minimization Meets Generative Modeling: Efficient Transition Path Sampling with the Onsager-Machlup Functional
Sanjeev Raja · Martin Šípka · Michael Psenka · Tobias Kreiman · Michal Pavelka · Aditi Krishnapriyan
Transition path sampling (TPS), which involves finding probable paths connecting two points on an energy landscape, remains a challenge due to the complexity of real-world atomistic systems. Current machine learning approaches rely on expensive training procedures and under-utilize growing quantities of atomistic data, limiting scalability and generalization. Generative models of atomistic conformational ensembles sample temporally independent states from energy landscapes, but their application to TPS remains mostly unexplored. In this work, we address TPS by interpreting candidate paths as trajectories sampled from stochastic dynamics induced by the learned score function of generative models, namely denoising diffusion and flow matching. Under these dynamics, finding high-likelihood transition paths becomes equivalent to minimizing the Onsager-Machlup (OM) action functional, enabling us to repurpose pre-trained generative models for TPS in a zero-shot fashion. We demonstrate our approach on a Müller-Brown potential and several fast-folding proteins, where we obtain diverse, physically realistic transition pathways, as well as tetrapeptides, where we demonstrate successful TPS on systems not seen by the generative model during training. Our method can be easily incorporated into new generative models, making it practically relevant as models continue to scale and improve.
Chaos Meets Attention: Transformers for Large-Scale Dynamical Prediction
Yi He · Yiming Yang · Xiaoyuan Cheng · Hai Wang · Xiao Xue · Boli Chen · Yukun Hu
Generating long-term trajectories of dissipative chaotic systems autoregressively is a highly challenging task. The inherent positive Lyapunov exponents amplify prediction errors over time.Many chaotic systems possess a crucial property — ergodicity on their attractors, which makes long-term prediction possible. State-of-the-art methods address ergodicity by preserving statistical properties using optimal transport techniques. However, these methods face scalability challenges due to the curse of dimensionality when matching distributions. To overcome this bottleneck, we propose a scalable transformer-based framework capable of stably generating long-term high-dimensional and high-resolution chaotic dynamics while preserving ergodicity. Our method is grounded in a physical perspective, revisiting the Von Neumann mean ergodic theorem to ensure the preservation of long-term statistics in the $\mathcal{L}^2$ space. We introduce novel modifications to the attention mechanism, making the transformer architecture well-suited for learning large-scale chaotic systems. Compared to operator-based and transformer-based methods, our model achieves better performances across five metrics, from short-term prediction accuracy to long-term statistics. In addition to our methodological contributions, we introduce new chaotic system benchmarks: a machine learning dataset of 140$k$ snapshots of turbulent channel flow and a processed high-dimensional Kolmogorov Flow dataset, along with various evaluation metrics for both short- and long-term performances. Both are well-suited for machine learning research on chaotic systems.
Scalable Equilibrium Sampling with Sequential Boltzmann Generators
Charlie Tan · Joey Bose · Chen Lin · Leon Klein · Michael Bronstein · Alexander Tong
Scalable sampling of molecular states in thermodynamic equilibrium is a long-standing challenge in statistical physics. Boltzmann generators tackle this problem by pairing normalizing flows with importance sampling to obtain uncorrelated samples under the target distribution. In this paper, we extend the Boltzmann generator framework with two key contributions, denoting our framework Sequential Boltzmann Generators (SBG). The first is a highly efficient Transformer-based normalizing flow operating directly on all-atom Cartesian coordinates. In contrast to the equivariant continuous flows of prior methods, we leverage exactly invertible non-equivariant architectures which are highly efficient during both sample generation and likelihood evaluation. This efficiency unlocks more sophisticated inference strategies beyond standard importance sampling. In particular, we perform inference-time scaling of flow samples using a continuous-time variant of sequential Monte Carlo, in which flow samples are transported towards the target distribution with annealed Langevin dynamics. SBG achieves state-of-the-art performance w.r.t. all metrics on peptide systems, demonstrating the first equilibrium sampling in Cartesian coordinates of tri-, tetra- and hexa-peptides that were thus far intractable for prior Boltzmann generators.
Symmetry-Driven Discovery of Dynamical Variables in Molecular Simulations
Jeet Mohapatra · Nima Dehmamy · Csaba Both · Subhro Das · Tommi Jaakkola
We introduce a novel approach for discovering effective degrees of freedom (DOF) in molecular dynamics simulations by mapping the DOF to approximate symmetries of the energy landscape. Unlike most existing methods, we do not require trajectory data but instead rely on knowledge of the forcefield (energy function) around the initial state. We present a scalable symmetry loss function compatible with existing force-field frameworks and a Hessian-based method efficient for smaller systems. Our approach enables systematic exploration of conformational space by connecting structural dynamics to energy landscape symmetries. We apply our method to two systems, Alanine dipeptide and Chignolin, recovering their known important conformations. Our approach can prove useful for efficient exploration in molecular simulations with potential applications in protein folding and drug discovery.
Learning the Electronic Hamiltonian of Large Atomic Structures
Chen Hao Xia · Manasa Kaniselvan · Alexandros Nikolaos Ziogas · Marko Mladenovic · Rayen Mahjoub · Alexander Maeder · Mathieu Luisier
Graph neural networks (GNNs) have shown promise in learning the ground-state electronic properties of materials, subverting *ab initio* density functional theory (DFT) calculations when the underlying lattices can be represented as small and/or repeatable unit cells (i.e., molecules and periodic crystals). Realistic systems are, however, non-ideal and generally characterized by higher structural complexity. As such, they require large (10+ Å) unit cells and thousands of atoms to be accurately described. At these scales, DFT becomes computationally prohibitive, making GNNs especially attractive. In this work, we present a strictly local equivariant GNN capable of learning the electronic Hamiltonian (**H**) of realistically extended materials. It incorporates an *augmented partitioning* approach that enables training on arbitrarily large structures while preserving local atomic environments beyond boundaries. We demonstrate its capabilities by predicting the electronic Hamiltonian of various systems with up to 3,000 nodes (atoms), 500,000+ edges, ~ 28 million orbital interactions (nonzero entries of **H**), and $\leq$0.53\% error in the eigenvalue spectra. Our work expands the applicability of current electronic property prediction methods to some of the most challenging cases encountered in computational materials science, namely systems with disorder, interfaces, and defects.
Beyond Atoms: Enhancing Molecular Pretrained Representations with 3D Space Modeling
Shuqi Lu · Xiaohong Ji · Bohang Zhang · Lin Yao · Siyuan Liu · Zhifeng Gao · Linfeng Zhang · Guolin Ke
Molecular pretrained representations (MPR) has emerged as a powerful approach for addressing the challenge of limited supervised data in applications such as drug discovery and material design. While early MPR methods relied on 1D sequences and 2D graphs, recent advancements have incorporated 3D conformational information to capture rich atomic interactions. However, these prior models treat molecules merely as discrete atom sets, overlooking the space surrounding them. We argue from a physical perspective that only modeling these discrete points is insufficient. We first present a simple yet insightful observation: naively adding randomly sampled virtual points beyond atoms can surprisingly enhance MPR performance. In light of this, we propose a principled framework that incorporates the entire 3D space spanned by molecules. We implement the framework via a novel Transformer-based architecture, dubbed SpaceFormer, with three key components:(1)grid-based space discretization; (2)grid sampling/merging; and (3)efficient 3D positional encoding.Extensive experiments show that SpaceFormer significantly outperforms previous 3D MPR models across various downstream tasks with limited data, validating the benefit of leveraging the additional 3D space beyond atoms in MPR models.
Broadband Ground Motion Synthesis by Diffusion Model with Minimal Condition
Jaeheun Jung · Jaehyuk Lee · ChangHae Jung · Hanyoung Kim · Bosung Jung · Donghun Lee
Shock waves caused by earthquakes can be devastating. Generating realistic earthquake-caused ground motion waveforms help reducing losses in lives and properties, yet generative models for the task tend to generate subpar waveforms. We present High-fidelity Earthquake Groundmotion Generation System (HEGGS) and demonstrate its superior performance using earthquakes from North American, East Asian, and European regions. HEGGS exploits the intrinsic characteristics of earthquake dataset and learns the waveforms using an end-to-end differentiable generator containing conditional latent diffusion model and hi-fidelity waveform construction model. We show the learning efficiency of HEGGS by training it on a single GPU machine and validate its performance using earthquake databases from North America, East Asia, and Europe, using diverse criteria from waveform generation tasks and seismology. Once trained, HEGGS can generate three dimensional E-N-Z seismic waveforms with accurate P/S phase arrivals, envelope correlation, signal-to-noise ratio, GMPE analysis, frequency content analysis, and section plot analysis.
EcoMapper: Generative Modeling for Climate-Aware Satellite Imagery
Muhammed Göktepe · Amir Hossein Shamseddin · Erencan Uysal · Javier Monteagudo · Lukas Drees · Aysim Toker · Senthold Asseng · Malte von Bloh
Satellite imagery is essential for Earth observation, enabling applications like crop yield prediction, environmental monitoring, and climatechange assessment. However, integrating satellite imagery with climate data remains a challenge, limiting its utility for forecasting and scenario analysis. We introduce a novel dataset of 2.9 million Sentinel-2 images spanning 15 land cover types with corresponding climate records, forming the foundation for two satellite image generation approaches using fine-tuned Stable Diffusion 3 models. The first is a text-to-image generation model that uses textual prompts with climate and land cover details to produce realistic synthetic imagery for specific regions. The second leverages ControlNet for multi-conditional image generation, preserving spatial structures while mapping climate data or generating time-series to simulate landscape evolution. By combining synthetic image generation with climate and land cover data, our work advances generative modeling in remote sensing, offering realistic inputs for environmental forecasting and new possibilities for climate adaptation and geospatial analysis.
Offline Model-based Optimization for Real-World Molecular Discovery
Dong-Hee Shin · Young-Han Son · Hyun Jung Lee · Deok-Joong Lee · Tae-Eui Kam
Molecular discovery has attracted significant attention in scientific fields for its ability to generate novel molecules with desirable properties. Although numerous methods have been developed to tackle this problem, most rely on an online setting that requires repeated online evaluation of candidate molecules using the oracle. However, in real-world molecular discovery, the oracle is often represented by wet-lab experiments, making this online setting impractical due to the significant time and resource demands. To fill this gap, we propose the Molecular Stitching (MolStitch) framework, which utilizes a fixed offline dataset to explore and optimize molecules without the need for repeated oracle evaluations. Specifically, MolStitch leverages existing molecules from the offline dataset to generate novel `stitched molecules' that combine their desirable properties. These stitched molecules are then used as training samples to fine-tune the generative model using preference optimization techniques. Experimental results on various offline multi-objective molecular optimization problems validate the effectiveness of MolStitch. The source code is available online.
HEAP: Hyper Extended A-PDHG Operator for Constrained High-dim PDEs
Mingquan Feng · Weixin Liao · Yixin Huang · Yifan Fu · Qifu Zheng · Junchi Yan
Neural operators have emerged as a promising approach for solving high-dimensional partial differential equations (PDEs). However, existing neural operators often have difficulty in dealing with constrained PDEs, where the solution must satisfy additional equality or inequality constraints beyond the governing equations. To close this gap, we propose a novel neural operator, Hyper Extended Adaptive PDHG (HEAP) for constrained high-dim PDEs, where the learned operator evolves in the parameter space of PDEs. We first show that the evolution operator learning can be formulated as a quadratic programming (QP) problem, then unroll the adaptive primal-dual hybrid gradient (APDHG) algorithm as the QP-solver into the neural operator architecture. This allows us to improve efficiency while retaining theoretical guarantees of the constrained optimization. Empirical results on a variety of high-dim PDEs show that HEAP outperforms the state-of-the-art neural operator model.
QuanONet: Quantum Neural Operator with Application to Differential Equation
Ruocheng Wang · Zhuo Xia · Ge Yan · Junchi Yan
Differential equations are essential and popular in science and engineering. Learning-based methods including neural operators, have emerged as a promising paradigm. We explore its quantum counterpart, and propose QuanONet -- a quantum neural operator which has not been well studied in literature compared with their counterparts in other machine learning areas. We design a novel architecture as a hardware-efficient ansatz, in the era of noisy intermediate-scale quantum (NISQ). Its circuit is pure quantum. By lying its ground on the operator approximation theorem for its quantum counterpart, QuanONet in theory can fit various differential equation operators. We also propose its modified version TF-QuanONet with ability to adaptively fit the dominant frequency of the problem. The real-device empirical results on problems including anti-derivative operators, Diffusion-reaction Systems demonstrate that QuanONet outperforms peer quantum methods when their model sizes are set akin to QuanONet.
Enhancing Ligand Validity and Affinity in Structure-Based Drug Design with Multi-Reward Optimization
Seungbeom Lee · Munsun Jo · Jungseul Ok · Dongwoo Kim
Deep learning-based Structure-based drug design aims to generate ligand molecules with desirable properties for protein targets. While existing models have demonstrated competitive performance in generating ligand molecules, they primarily focus on learning the chemical distribution of training datasets, often lacking effective steerability to ensure the desired chemical quality of generated molecules. To address this issue, we propose a multi-reward optimization framework that fine-tunes generative models for attributes, such as binding affinity, validity, and drug-likeness, together. Specifically, we derive direct preference optimization for a Bayesian flow network, used as a backbone for molecule generation, and integrate a reward normalization scheme to adopt multiple objectives. Experimental results show that our method generates more realistic ligands than baseline models while achieving higher binding affinity, expanding the Pareto front empirically observed in previous studies.
From Thousands to Billions: 3D Visual Language Grounding via Render-Supervised Distillation from 2D VLMs
Ang Cao · Sergio Arnaud · Oleksandr Maksymets · Jianing Yang · Ayush Jain · Ada Martin · Vincent-Pierre Berges · Paul McVay · Ruslan Partsey · Aravind Rajeswaran · Franziska Meier · Justin Johnson · Jeong Joon Park · Alexander Sax
3D vision-language grounding faces a fundamental data bottleneck: while 2D models train on billions of images, 3D models have access to only thousands of labeled scenes--a six-order-of-magnitude gap that severely limits performance. We introduce \textbf{\emph{LIFT-GS}}, a practical distillation technique that overcomes this limitation by using differentiable rendering to bridge 3D and 2D supervision. LIFT-GS predicts 3D Gaussian representations from point clouds and uses them to render predicted language-conditioned 3D masks into 2D views, enabling supervision from 2D foundation models (SAM, CLIP, LLaMA) without requiring any 3D annotations. This render-supervised formulation enables end-to-end training of complete encoder-decoder architectures and is inherently model-agnostic. LIFT-GS achieves state-of-the-art results with 25.7\% mAP on open-vocabulary instance segmentation (vs. 20.2\% prior SOTA) and consistent 10-30\% improvements on referential grounding tasks. Remarkably, pretraining effectively multiplies fine-tuning datasets by 2×, demonstrating strong scaling properties that suggest 3D VLG currently operates in a severely data-scarce regime. Project page: \url{https://liftgs.github.io}.
FourierMamba: Fourier Learning Integration with State Space Models for Image Deraining
Dong Li · Yidi Liu · Xueyang Fu · Jie Huang · Senyan Xu · Qi Zhu · Zheng-Jun Zha
Image deraining aims to remove rain streaks from rainy images and restore clear backgrounds. Currently, some research that employs the Fourier transform has proved to be effective for image deraining, due to it acting as an effective frequency prior for capturing rain streaks. However, despite there exists dependency of low frequency and high frequency in images, these Fourier-based methods rarely exploit the correlation of different frequencies for conjuncting their learning procedures, limiting the full utilization of frequency information for image deraining. Alternatively, the recently emerged Mamba technique depicts its effectiveness and efficiency for modeling correlation in various domains (e.g., spatial, temporal), and we argue that introducing Mamba into its unexplored Fourier spaces to correlate different frequencies would help improve image deraining. This motivates us to propose a new framework termed FourierMamba, which performs image deraining with Mamba in the Fourier space. Owing to the unique arrangement of frequency orders in Fourier space, the core of FourierMamba lies in the scanning encoding of different frequencies, where the low-high frequency order formats exhibit differently in the spatial dimension (unarranged in axis) and channel dimension (arranged in axis). Therefore, we design FourierMamba that correlates Fourier space information in the spatial and channel dimensions with distinct designs. Specifically, in the spatial dimension Fourier space, we introduce the zigzag coding to scan the frequencies to rearrange the orders from low to high frequencies, thereby orderly correlating the connections between frequencies; in the channel dimension Fourier space with arranged orders of frequencies in axis, we can directly use Mamba to perform frequency correlation and improve the channel information representation. Extensive experiments reveal that our method outperforms state-of-the-art methods both qualitatively and quantitatively.
FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification
Zhen Sun · Lei Tan · Yunhang Shen · Chengmao Cai · Xing Sun · Pingyang Dai · Liujuan Cao · Rongrong Ji
Multimodal person re-identification (Re-ID) aims to match pedestrian images across different modalities. However, most existing methods focus on limited cross-modal settings and fail to support arbitrary query-retrieval combinations, hindering practical deployment. We propose FlexiReID, a flexible framework that supports seven retrieval modes across four modalities: RGB, infrared, sketches, and text. FlexiReID introduces an adaptive mixture-of-experts (MoE) mechanism to dynamically integrate diverse modality features and a cross-modal query fusion module to enhance multimodal feature extraction. To facilitate comprehensive evaluation, we construct CIRS-PEDES, a unified dataset extending four popular Re-ID datasets to include all four modalities. Extensive experiments demonstrate that FlexiReID achieves state-of-the-art performance and offers strong generalization in complex scenarios.
Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Fanqing Meng · Jiaqi Liao · Xinyu Tan · Quanfeng Lu · WENQI SHAO · Kaipeng Zhang · Yu Cheng · Dianqi Li · Ping Luo
Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing the universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive \textbf{Phy}sics \textbf{Gen}eration \textbf{Ben}chmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, which could comprehensively assesses models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure utilizing appropriate advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense, which align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications. We will release the data and codes at https://github.com/OpenGVLab/PhyGenBench
Efficient Noise Calculation in Deep Learning-based MRI Reconstructions
Onat Dalmaz · Arjun Desai · Reinhard Heckel · Tolga Cukur · Akshay Chaudhari · Brian Hargreaves
Accelerated MRI reconstruction involves solving an ill-posed inverse problem where noise in acquired data propagates to the reconstructed images. Noise analyses are central to MRI reconstruction for providing an explicit measure of solution fidelity and for guiding the design and deployment of novel reconstruction methods. However, deep learning (DL)-based reconstruction methods have often overlooked noise propagation due to inherent analytical and computational challenges, despite its critical importance. This work proposes a theoretically grounded, memory-efficient technique to calculate voxel-wise variance for quantifying uncertainty due to acquisition noise in accelerated MRI reconstructions. Our approach is based on approximating the noise covariance using the DL network's Jacobian, which is intractable to calculate. To circumvent this, we derive an unbiased estimator for the diagonal of this covariance matrix—voxel-wise variance—, and introduce a Jacobian sketching technique to efficiently implement it. We evaluate our method on knee and brain MRI datasets for both data-driven and physics-driven networks trained in supervised and unsupervised manners. Compared to empirical references obtained via Monte-Carlo simulations, our technique achieves near-equivalent performance while reducing computational and memory demands by an order of magnitude or more. Furthermore, our method is robust across varying input noise levels, acceleration factors, and diverse undersampling schemes, highlighting its broad applicability. Our work reintroduces accurate and efficient noise analysis as a central tenet of reconstruction algorithms, holding promise to reshape how we evaluate and deploy DL-based MRI.
Probing Visual Language Priors in VLMs
Tiange Luo · Ang Cao · Gunhee Lee · Justin Johnson · Honglak Lee
Vision-Language Models (VLMs) may over-rely on visual language priors from their training data rather than true visual reasoning. To investigate this, we introduce ViLP, a benchmark featuring deliberately out-of-distribution images synthesized via image generation models and out-of-distribution Q\&A pairs. Each question in ViLP is coupled with three potential answers and three corresponding images: one that can be resolved by text priors alone and two that demand visual reasoning. Although humans achieve near-perfect accuracy, modern VLMs falter; for instance, GPT-4o achieves only 66.17\% on ViLP. To alleviate this, we propose a self-improving framework in which models generate new VQA data and then apply pixel-level and semantic corruptions to form ``good-bad" image pairs for self-training. Our proposed training objective, Image-DPO, compels VLMs to focus more on the actual visual inputs, and we demonstrate its effectiveness in LLaVA-v1.5 and Cambrian. Project Page: \href{https://vilp-team.github.io/}{ViLP}.
SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding
Woohyeon Park · Woojin Kim · Jaeik Kim · Jaeyoung Do
Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information with an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By contrasting these visual information iteratively, SECOND significantly reduces perceptual hallucinations and outperforms a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale application in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.
Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage
Saehyung Lee · Seunghyun Yoon · Trung Bui · Jing Shi · Sungroh Yoon
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that the proposed evaluation method aligns better with human judgments of factuality than existing metrics. Moreover, we show that current approaches for enhancing MLLM factuality often fail in hyper-detailed image captioning tasks. In contrast, our approach significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM's performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.
Beyond Entropy: Region Confidence Proxy for Wild Test-Time Adaptation
Zixuan Hu · Yichun Hu · Xiaotong Li · SHIXIANG TANG · LINGYU DUAN
Wild Test-Time Adaptation (WTTA) is proposed to adapt a source model to unseen domains under extreme data scarcity and multiple shifts. Previous approaches mainly focused on sample selection strategies, while overlooking the fundamental problem on underlying optimization. Initially, we critically analyze the widely-adopted entropy minimization framework in WTTA and uncover its significant limitations in noisy optimization dynamics that substantially hinder adaptation efficiency. Through our analysis, we identify region confidence as a superior alternative to traditional entropy, however, its direct optimization remains computationally prohibitive for real-time applications. In this paper, we introduce a novel region-integrated method ReCAP that bypasses the lengthy process. Specifically, we propose a probabilistic region modeling scheme that flexibly captures semantic changes in embedding space. Subsequently, we develop a finite-to-infinite asymptotic approximation that transforms the intractable region confidence into a tractable and upper-bounded proxy. These innovations significantly unlock the overlooked potential dynamics in local region in a concise solution. Our extensive experiments demonstrate the consistent superiority of ReCAP over existing methods across various datasets and wild scenarios. The source code will be available at https://github.com/hzcar/ReCAP.
Demystifying Catastrophic Forgetting in Two-Stage Incremental Object Detector
Qirui Wu · Shizhou Zhang · De Cheng · Yinghui Xing · di xu · Peng Wang · Yanning Zhang
Catastrophic forgetting is a critical chanllenge for incremental object detection (IOD). Most existing methods treat the detector monolithically, relying on instance replay or knowledge distillation without analyzing component-specific forgetting. Through dissection of Faster R-CNN, we reveal a key insight: Catastrophic forgetting is predominantly localized to the RoI Head classifier, while regressors retain robustness across incremental stages. This finding challenges conventional assumptions, motivating us to develop a framework termed NSGP-RePRE. Regional Prototype Replay (RePRE) mitigates classifier forgetting via replay of two types of prototypes: coarse prototypes represent class-wise semantic centers of RoI features, while fine-grained prototypes model intra-class variations. Null Space Gradient Projection (NSGP) is further introduced to eliminate prototype-feature misalignment by updating the feature extractor in directions orthogonal to subspace of old inputs via gradient projection, aligning RePRE with incremental learning dynamics. Our simple yet effective design allows NSGP-RePRE to achieve state-of-the-art performance on the Pascal VOC and MS COCO datasets under various settings. Our work not only advances IOD methodology but also provide pivotal insights for catastrophic forgetting mitigation in IOD. Code will be available soon.
Boosting Virtual Agent Learning and Reasoning: A Step-Wise, Multi-Dimensional, and Generalist Reward Model with Benchmark
Bingchen Miao · Yang Wu · Minghe Gao · Qifan Yu · Wendong Bu · Wenqiao Zhang · liyunfei · Siliang Tang · Tat-Seng Chua · Juncheng Li
The development of Generalist Virtual Agents (GVAs) has shown significant promise in autonomous task execution. However, current training paradigms face critical limitations, including reliance on outcome supervision and labor-intensive human annotations. To address these challenges, we propose Similar, a step-wise multi-dimensional generalist reward model, which offers fine-grained signals for agent training and can choose better actions for inference-time scaling. Specifically, we begin by systematically defining five dimensions for evaluating agent actions. Building on this framework, we design an MCTS-P algorithm to automatically collect and annotate step-wise, five-dimensional agent execution data. Using this data, we train Similar with our crafted Triple-M strategy. Furthermore, we introduce the first benchmark in the virtual agent domain for step-wise, multi-dimensional reward model training and evaluation, named SRM. This benchmark consists of two components: SRMTrain, which serves as the training set for Similar, and SRMEval, a manually selected test set for evaluating the reward model. Experimental results demonstrate that Similar, through its step-wise, multi-dimensional assessment and synergistic gain, provides GVAs with effective intermediate signals during both training and inference-time scaling. The code is available at https://github.com/antgroup/Similar.
Synthetic videos nowadays is widely used to complement data scarcity and diversity of real-world videos.Current synthetic datasets primarily replicate real-world scenarios, leaving impossible, counterfactual and anti-reality video concepts underexplored. This work aims to answer two questions: 1) Can today's video generation models effectively follow prompts to create impossible video content? 2) Are today's video understanding models good enough for understanding impossible videos?To this end, we introduce IPV-Bench, a novel benchmark designed to evaluate and foster progress in video understanding and generation. IPV-Bench is underpinned by a comprehensive taxonomy, encompassing 4 domains, 14 categories.It features diverse scenes that defy physical, biological, geographical, or social laws. Based on the taxonomy, a prompt suite is constructed to evaluate video generation models, challenging their prompt following and creativity capabilities. In addition, a video benchmark is curated to assess Video-LLMs on their ability of understanding impossible videos, which particularly requires reasoning on temporal dynamics and world knowledge. Comprehensive evaluations reveal limitations and insights for future directions of video models, paving the way for next-generation video models.
FreeMesh: Boosting Mesh Generation with Coordinates Merging
Jian Liu · Haohan Weng · Biwen Lei · Xianghui Yang · Zibo Zhao · Zhuo Chen · Song Guo · Tao Han · Chunchao Guo
The next-coordinate prediction paradigm has emerged as the de facto standard in current auto-regressive mesh generation methods.Despite their effectiveness, there is no efficient measurement for the various tokenizers that serialize meshes into sequences. In this paper, we introduce a new metric Per-Token-Mesh-Entropy (PTME) to evaluate the existing mesh tokenizers theoretically without any training.Building upon PTME, we propose a plug-and-play tokenization technique called coordinate merging.It further improves the compression ratios of existing tokenizers by rearranging and merging the most frequent patterns of coordinates.Through experiments on various tokenization methods like MeshXL, MeshAnything V2, and Edgerunner, we further validate the performance of our method.We hope that the proposed PTME and coordinate merging can enhance the existing mesh tokenizers and guide the further development of native mesh generation.
Temporal Misalignment in ANN-SNN Conversion and its Mitigation via Probabilistic Spiking Neurons
Velibor Bojkovic · Xiaofeng Wu · Bin Gu
Spiking Neural Networks (SNNs) offer a more energy-efficient alternative to Artificial Neural Networks (ANNs) by mimicking biological neural principles, establishing them as a promising approach to mitigate the increasing energy demands of large-scale neural models. However, fully harnessing the capabilities of SNNs remains challenging due to their discrete signal processing and temporal dynamics. ANN-SNN conversion has emerged as a practical approach, enabling SNNs to achieve competitive performance on complex machine learning tasks. In this work, we identify a phenomenon in the ANN-SNN conversion framework, termed temporal misalignment, in which random spike rearrangement across SNN layers leads to performance improvements. Based on this observation, we introduce biologically plausible two-phase probabilistic (TPP) spiking neurons, further enhancing the conversion process. We demonstrate the advantages of our proposed method both theoretically and empirically through comprehensive experiments on CIFAR-10/100, CIFAR10-DVS, and ImageNet across a variety of architectures, achieving state-of-the-art results.
Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection
Xiang Fang · Arvind Easwaran · Blaise Genest
Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unavailable outputs. Most OOD detection methods require many ID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where only a few labeled ID samples are available. Therefore, few-shot OOD detection is much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network: Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distribution. To compensate for the absence of OOD and scarcity of ID image samples, we leverage CLIP, connecting text with images, engineering learnable ID and OOD textual prompts. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts, and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. Experimental results show that AMCN outperforms other state-of-the-art works.
LOCATE 3D: Real-World Object Localization via Self-Supervised Learning in 3D
Paul McVay · Sergio Arnaud · Ada Martin · Arjun Majumdar · Krishna Murthy Jatavallabhula · Phillip Thomas · Ruslan Partsey · Daniel Dugas · Abha Gejji · Alexander Sax · Vincent-Pierre Berges · Mikael Henaff · Ayush Jain · Ang Cao · Ishita Prasad · Mrinal Kalakrishnan · Michael Rabbat · Nicolas Ballas · Mahmoud Assran · Oleksandr Maksymets · Aravind Rajeswaran · Franziska Meier
We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model. Code, models and dataset can be found at the project website: locate3d.atmeta.com
How Do Images Align and Complement LiDAR? Towards a Harmonized Multi-modal 3D Panoptic Segmentation
Yining Pan · Qiongjie Cui · Xulei Yang · Na Zhao
LiDAR-based 3D panoptic segmentation often struggles with the inherent sparsity of data from LiDAR sensors, which makes it challenging to accurately recognize distant or small objects. Recently, a few studies have sought to overcome this challenge by integrating LiDAR inputs with camera images, leveraging the rich and dense texture information provided by the latter. While these approaches have shown promising results, they still face challenges, such as misalignment during data augmentation and the reliance on post-processing steps. To address these issues, we propose Image-Assists-LiDAR (IAL), a novel multi-modal 3D panoptic segmentation framework.In IAL, we first introduce a modality-synchronized data augmentation strategy, PieAug, to ensure alignment between LiDAR and image inputs from the start. Next, we adopt a transformer decoder to directly predict panoptic segmentation results. To effectively fuse LiDAR and image features into tokens for the decoder, we design a Geometric-guided Token Fusion (GTF) module. Additionally, we leverage the complementary strengths of each modality as priors for query initialization through a Prior-based Query Generation (PQG) module, enhancing the decoder’s ability to generate accurate instance masks. Our IAL framework achieves state-of-the-art performance compared to previous multi-modal 3D panoptic segmentation methods on two widely used benchmarks. Code and models are publicly available at https://github.com/IMPL-Lab/IAL.git.
Geometric Feature Embedding for Effective 3D Few-Shot Class Incremental Learning
Xiangqi Li · Libo Huang · Zhulin An · Weilun Feng · Chuanguang Yang · Boyu Diao · Fei Wang · Yongjun Xu
3D few-shot class incremental learning (FSCIL) aims to learn new point cloud categories from limited samples while preventing the forgetting of previously learned categories. This research area significantly enhances the capabilities of self-driving vehicles and computer vision systems. Existing 3D FSCIL approaches primarily utilize multimodal pre-trained models to extract the semantic features, heavily dependent on meticulously designed high-quality prompts and fine-tuning strategies. To reduce this dependence, this paper proposes a novel method for 3D FSCIL with Embedded Geometric features (3D-FLEG). Specifically, 3D-FLEG develops a point cloud geometric feature extraction module to capture category-related geometric characteristics. To address the modality heterogeneity issues that arise from integrating geometric and text features, 3D-FLEG introduces a geometric feature embedding module. By augmenting text prompts with spatial geometric features through these modules, 3D-FLEG can learn robust representations of new categories even with limited samples, while mitigating forgetting of the previously learned categories. Experiments conducted on several publicly available 3D point cloud datasets, including ModelNet, ShapeNet, ScanObjectNN, and CO3D, demonstrate 3D-FLEG's superiority over existing state-of-the-art 3D FSCIL methods. Code is available at https://github.com/lixiangqi707/3D-FLEG.
3D-LMVIC: Learning-based Multi-View Image Compression with 3D Gaussian Geometric Priors
Yujun Huang · Bin Chen · Niu Lian · Xin Wang · Baoyi An · Tao Dai · Shutao Xia
Existing multi-view image compression methods often rely on 2D projection-based similarities between views to estimate disparities. While effective for small disparities, such as those in stereo images, these methods struggle with the more complex disparities encountered in wide-baseline multi-camera systems, commonly found in virtual reality and autonomous driving applications. To address this limitation, we propose 3D-LMVIC, a novel learning-based multi-view image compression framework that leverages 3D Gaussian Splatting to derive geometric priors for accurate disparity estimation. Furthermore, we introduce a depth map compression model to minimize geometric redundancy across views, along with a multi-view sequence ordering strategy based on a defined distance measure between views to enhance correlations between adjacent views. Experimental results demonstrate that 3D-LMVIC achieves superior performance compared to both traditional and learning-based methods. Additionally, it significantly improves disparity estimation accuracy over existing two-view approaches.
Enhancing Foundation Models with Federated Domain Knowledge Infusion
Jiaqi Wang · Jingtao Li · Weiming Zhuang · Chen Chen · Lingjuan Lyu · Fenglong Ma
Vision foundation models (FMs) like CLIP have exhibited exceptional capabilities in visual and linguistic understanding, particularly in zero-shot inference tasks. However, these models struggle with data that significantly deviates from their training samples, necessitating fine-tuning, which is often infeasible in centralized settings due to data privacy concerns. Federated learning (FL) combined with parameter-efficient fine-tuning (PEFT) offers a potential solution, yet existing methods face issues with domain-specific characteristics and out-of-domain generalization. We propose a cross-silo Federated Adapter Generalization (FedAG), a novel federated fine-tuning approach that leverages multiple fine-grained adapters to capture domain-specific knowledge while enhancing out-of-domain generalization. Our method uses quality-aware in-domain mutual learning and attention-regularized cross-domain learning to integrate domain-specific insights effectively. Experiments of the CLIP model on three domain-shifting datasets, ImageCLEF-DA, Office-Home, and DomainNet, demonstrate the superior performance of FedAG in both in-domain and out-of-domain scenarios. We envision this work as a milestone for generalizing CLIP to handle the challenge of out-of-domain knowledge under federated learning setting.
Capturing Temporal Dynamics in Large-Scale Canopy Tree Height Estimation
Jan Pauls · Max Zimmer · Berkant Turan · Sassan Saatchi · Philippe CIAIS · Sebastian Pokutta · Fabian Gieseke
With the rise in global greenhouse gas emissions, accurate large-scale tree canopy height maps are essential for understanding forest structure, estimating above-ground biomass, and monitoring ecological disruptions. To this end, we present a novel approach to generate large-scale, high-resolution canopy height maps over time. Our model accurately predicts canopy height over multiple years given Sentinel 2 time series satellite data. Using GEDI LiDAR data as the ground truth for training the model, we present the first 10 m resolution temporal canopy height map of the European continent for the period 2019–2022. As part of this product, we also offer a detailed canopy height map for 2020, providing more precise estimates than previous studies. Our pipeline and the resulting temporal height map are publicly available, enabling comprehensive large-scale monitoring of forests and, hence, facilitating future research and ecological analyses. For an interactive viewer, see https://europetreemap.projects.earthengine.app/view/europeheight.
Code-Generated Graph Representations Using Multiple LLM Agents for Material Properties Prediction
Jiao Huang · Qianli Xing · Jinglong Ji · Bo Yang
Graph neural networks have recently demonstrated remarkable performance in predicting material properties. Crystalline material data is manually encoded into graph representations.Existing methods incorporate different attributes into constructing representations to satisfy the constraints arising from symmetries of material structure.However, existing methods for obtaining graph representations are specific to certain constraints, which are ineffective when facing new constraints.In this work, we propose a code generation framework with multiple large language model agents to obtain representations named Rep-CodeGen with three iterative stages simulating an evolutionary algorithm. To the best of our knowledge, Rep-CodeGen is the first framework for automatically generating code to obtain representations that can be used when facing new constraints. Furthermore, a type of representation from generated codes by our framework satisfies six constraints, with codes satisfying three constraints as bases. Extensive experiments on two real-world material datasets show that a property prediction method based on such a graph representation achieves state-of-the-art performance in material property prediction tasks.
Context Matters: Query-aware Dynamic Long Sequence Modeling of Gigapixel Images
Zhengrui Guo · Qichen Sun · Jiabo MA · Lishuang Feng · Jinzhuo Wang · Hao Chen
Whole slide image (WSI) analysis presents significant computational challenges due to the massive number of patches in gigapixel images. While transformer architectures excel at modeling long-range correlations through self-attention, their quadratic computational complexity makes them impractical for computational pathology applications. Existing solutions like local-global or linear self-attention reduce computational costs but compromise the strong modeling capabilities of full self-attention. In this work, we propose Querent, i.e., the query-aware long contextual dynamic modeling framework, which achieves a theoretically bounded approximation of full self-attention while delivering practical efficiency. Our method adaptively predicts which surrounding regions are most relevant for each patch, enabling focused yet unrestricted attention computation only with potentially important contexts. By using efficient region-wise metadata computation and importance estimation, our approach dramatically reduces computational overhead while preserving global perception to model fine-grained patch correlations. Through comprehensive experiments on biomarker prediction, gene mutation prediction, cancer subtyping, and survival analysis across over 10 WSI datasets, our method demonstrates superior performance compared to the state-of-the-art approaches. Codes are available at https://github.com/dddavid4real/Querent.
Enforcing Latent Euclidean Geometry in Single-Cell VAEs for Manifold Interpolation
Alessandro Palma · Sergei Rybakov · Leon Hetzel · Stephan Günnemann · Fabian Theis
Latent space interpolations are a powerful tool for navigating deep generative models in applied settings. An example is single-cell RNA sequencing, where existing methods model cellular state transitions as latent space interpolations with variational autoencoders, often assuming linear shifts and Euclidean geometry. However, unless explicitly enforced, linear interpolations in the latent space may not correspond to geodesic paths on the data manifold, limiting methods that assume Euclidean geometry in the data representations. We introduce FlatVI, a novel training framework that regularises the latent manifold of discrete-likelihood variational autoencoders towards Euclidean geometry, specifically tailored for modelling single-cell count data. By encouraging straight lines in the latent space to approximate geodesic interpolations on the decoded single-cell manifold, FlatVI enhances compatibility with downstream approaches that assume Euclidean latent geometry. Experiments on synthetic data support the theoretical soundness of our approach, while applications to time-resolved single-cell RNA sequencing data demonstrate improved trajectory reconstruction and manifold interpolation.
Beyond Sensor Data: Foundation Models of Behavioral Data from Wearables Improve Health Predictions
Eray Erturk · Fahad Kamran · Salar Abbaspourazad · Sean Jewell · Harsh Sharma · Yujie Li · Sinead Williamson · Nicholas Foti · Joseph Futoma
Wearable devices record physiological and behavioral signals that can improve health predictions. While foundation models are increasingly used for such predictions, they have been primarily applied to low-level sensor data, despite behavioral data often being more informative due to their alignment with physiologically relevant timescales and quantities. We develop foundation models of such behavioral signals using over 2.5B hours of wearable data from 162K individuals, systematically optimizing architectures and tokenization strategies for this unique dataset. Evaluated on 57 health-related tasks, our model shows strong performance across diverse real-world applications including individual-level classification and time-varying health state prediction. The model excels in behavior-driven tasks like sleep prediction, and improves further when combined with representations of raw sensor data. These results underscore the importance of tailoring foundation model design to wearables and demonstrate the potential to enable new health applications.
Position: Supervised Classifiers Answer the Wrong Questions for OOD Detection
Yucen Li · Daohan Lu · Polina Kirichenko · Shikai Qiu · Tim G. J. Rudner · C. Bayan Bruss · Andrew Wilson
To detect distribution shifts and improve model safety, many out-of-distribution (OOD) detection methods rely on the predictive uncertainty or features of supervised models trained on in-distribution data. In this position paper, we critically re-examine this popular family of OOD detection procedures, and we argue that these methods are fundamentally answering the wrong questions for OOD detection. There is no simple fix to this misalignment, since a classifier trained only on in-distribution classes cannot be expected to identify OOD points; for instance, a cat-dog classifier may confidently misclassify an airplane if it contains features that distinguish cats from dogs, despite generally appearing nothing alike. We find that uncertainty-based methods incorrectly conflate high uncertainty with being OOD, while feature-based methods incorrectly conflate far feature-space distance with being OOD. We show how these pathologies manifest as irreducible errors in OOD detection and identify common settings where these methods are ineffective. Additionally, interventions to improve OOD detection such as feature-logit hybrid methods, scaling of model and data size, epistemic uncertainty representation, and outlier exposure also fail to address this fundamental misalignment in objectives.
AutoElicit: Using Large Language Models for Expert Prior Elicitation in Predictive Modelling
Alexander Capstick · Rahul G. Krishnan · Payam Barnaghi
Large language models (LLMs) acquire a breadth of information across various domains. However, their computational complexity, cost, and lack of transparency often hinder their direct application for predictive tasks where privacy and interpretability are paramount. In fields such as healthcare, biology, and finance, specialised and interpretable linear models still hold considerable value. In such domains, labelled data may be scarce or expensive to obtain. Well-specified prior distributions over model parameters can reduce the sample complexity of learning through Bayesian inference; however, eliciting expert priors can be time-consuming. We therefore introduce AutoElicit to extract knowledge from LLMs and construct priors for predictive models. We show these priors are informative and can be refined using natural language. We perform a careful study contrasting AutoElicit with in-context learning and demonstrate how to perform model selection between the two methods. We find that AutoElicit yields priors that can substantially reduce error over uninformative priors, using fewer labels, and consistently outperform in-context learning. We show that AutoElicit saves over 6 months of labelling effort when building a new predictive model for urinary tract infections from sensor recordings of people living with dementia.
MedRAX: Medical Reasoning Agent for Chest X-ray
Adibvafa Fallahpour · Jun Ma · Alif Munim · Hongwei Lyu · BO WANG
Chest X-rays (CXRs) play an integral role in driving critical decisions in disease management and patient care. While recent innovations have led to specialized models for various CXR interpretation tasks, these solutions often operate in isolation, limiting their practical utility in clinical practice. We present MedRAX, the first versatile AI agent that seamlessly integrates state-of-the-art CXR analysis tools and multimodal large language models into a unified framework. MedRAX dynamically leverages these models to address complex medical queries without requiring additional training. To rigorously evaluate its capabilities, we introduce ChestAgentBench, a comprehensive benchmark containing 2,500 complex medical queries across 7 diverse categories. Our experiments demonstrate that MedRAX achieves state-of-the-art performance compared to both open-source and proprietary models, representing a significant step toward the practical deployment of automated CXR interpretation systems. Data and code have been publicly available at https://github.com/bowang-lab/MedRAX
Scalable Generation of Spatial Transcriptomics from Histology Images via Whole-Slide Flow Matching
Tinglin Huang · Tianyu Liu · Mehrtash Babadi · Wengong Jin · ZHITAO YING
Spatial transcriptomics (ST) has emerged as a powerful technology for bridging histology imaging with gene expression profiling. However, its application has been limited by low throughput and the need for specialized experimental facilities. Prior works sought to predict ST from whole-slide histology images to accelerate this process, but they suffer from two major limitations. First, they do not explicitly model cell-cell interaction as they factorize the joint distribution of whole-slide ST data and predict the gene expression of each spot independently. Second, their encoders struggle with memory constraints due to the large number of spots (often exceeding 10,000) in typical ST datasets. Herein, we propose STFlow, a flow matching generative model that considers cell-cell interaction by modeling the joint distribution of gene expression of an entire slide. It also employs an efficient slide-level encoder with local spatial attention, enabling whole-slide processing without excessive memory overhead. On the recently curated HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms state-of-the-art baselines and achieves over 18% relative improvements over the pathology foundation models.
MMedPO: Aligning Medical Vision-Language Models with Clinical-Aware Multimodal Preference Optimization
Kangyu Zhu · Peng Xia · Yun Li · Hongtu Zhu · Sheng Wang · Huaxiu Yao
The advancement of Large Vision-Language Models (LVLMs) has propelled their application in the medical field. However, Medical LVLMs (Med-LVLMs) encounter factuality challenges due to modality misalignment, where the models prioritize textual knowledge over visual input, leading to hallucinations that contradict information in medical images. Previous attempts to enhance modality alignment in Med-LVLMs through preference optimization have inadequately addressed clinical relevance in preference data, making these samples easily distinguishable and reducing alignment effectiveness. In response, we propose MMedPO, a novel multimodal medical preference optimization approach that considers the clinical relevance of preference samples to enhance Med-LVLM alignment. MMedPO curates multimodal preference data by introducing two types of dispreference: (1) plausible hallucinations injected through target Med-LVLMs or GPT-4o to produce medically inaccurate responses, and (2) lesion region neglect achieved through local lesion-noising, disrupting visual understanding of critical areas. We then calculate clinical relevance for each sample based on scores from multiple Med-LLMs and visual tools, enabling effective alignment. Our experiments demonstrate that MMedPO significantly enhances factual accuracy in Med-LVLMs, achieving substantial improvements over existing preference optimization methods by 14.2% and 51.7% on the Med-VQA and report generation tasks, respectively. Our code are available in https://github.com/aiming-lab/MMedPO}{https://github.com/aiming-lab/MMedPO.
Active Learning for Efficient Discovery of Optimal Combinatorial Perturbations
Jason Qin · Hans-Hermann Wessels · Carlos Fernandez-Granda · Yuhan Hao
Combinatorial CRISPR screening enables large-scale identification of synergistic gene pairs for combination therapies, but exhaustive experimentation is infeasible. We introduce NAIAD, an active learning framework that efficiently discovers optimal gene pairs by leveraging single-gene perturbation effects and adaptive gene embeddings that scale with the training data size, mitigating overfitting in small-sample learning while capturing complex gene interactions as more data is collected. Evaluated on four CRISPR datasets with over 350,000 interactions, NAIAD trained on small datasets outperforms existing models by up to 40\%. Its recommendation system prioritizes gene pairs with maximum predicted effects, accelerating discovery with fewer experiments. We also extend NAIAD to optimal drug combination identification among 2,000 candidates. Overall, NAIAD enhances combinatorial perturbation design and drives advances in genomics research and therapeutic development in combination therapy. Our code is publicly available at: https://github.com/NeptuneBio/NAIAD
A Model of Place Field Reorganization During Reward Maximization
M Ganesh Kumar · Blake Bordelon · Jacob A Zavatone-Veth · Cengiz Pehlevan
When rodents learn to navigate in a novel environment, a high density of place fields emerges at reward locations, fields elongate against the trajectory, and individual fields change spatial selectivity while demonstrating stable behavior. Why place fields demonstrate these characteristic phenomena during learning remains elusive. We develop a normative framework using a reward maximization objective, whereby the temporal difference (TD) error drives place field reorganization to improve policy learning. Place fields are modeled using Gaussian radial basis functions to represent states in an environment, and directly synapse to an actor-critic for policy learning. Each field's amplitude, center, and width, as well as downstream weights, are updated online at each time step to maximize rewards. We demonstrate that this framework unifies three disparate phenomena observed in navigation experiments. Furthermore, we show that these place field phenomena improve policy convergence when learning to navigate to a single target and relearning multiple new targets. To conclude, we develop a simple normative model that recapitulates several aspects of hippocampal place field learning dynamics and unifies mechanisms to offer testable predictions for future experiments.
Differential Coding for Training-Free ANN-to-SNN Conversion
Zihan Huang · Wei Fang · Tong Bu · Peng Xue · Zecheng Hao · Wenxuan Liu · Yuanhong Tang · Zhaofei Yu · Tiejun Huang
Spiking Neural Networks (SNNs) exhibit significant potential due to their low energy consumption. Converting Artificial Neural Networks (ANNs) to SNNs is an efficient way to achieve high-performance SNNs. However, many conversion methods are based on rate coding, which requires numerous spikes and longer time-steps compared to directly trained SNNs, leading to increased energy consumption and latency. This article introduces differential coding for ANN-to-SNN conversion, a novel coding scheme that reduces spike counts and energy consumption by transmitting changes in rate information rather than rates directly, and explores its application across various layers. Additionally, the threshold iteration method is proposed to optimize thresholds based on activation distribution when converting Rectified Linear Units (ReLUs) to spiking neurons. Experimental results on various Convolutional Neural Networks (CNNs) and Transformers demonstrate that the proposed differential coding significantly improves accuracy while reducing energy consumption, particularly when combined with the threshold iteration method, achieving state-of-the-art performance. The source codes of the proposed method are available at https://github.com/h-z-h-cell/ANN-to-SNN-DCGS.
Generating Hypotheses of Dynamic Causal Graphs in Neuroscience: Leveraging Generative Factor Models of Observed Time Series
Zachary Brown · David Carlson
The field of hypothesis generation promises to reduce costs in neuroscience by narrowing the range of interventional studies needed to study various phenomena. Existing machine learning methods can generate scientific hypotheses from complex datasets, but many approaches assume causal relationships are static over time, limiting their applicability to systems with dynamic, state-dependent behavior, such as the brain. While some techniques attempt dynamic causal discovery through factor models, they often restrict relationships to linear patterns or impose other simplifying assumptions. We propose a novel method that models dynamic graphs as a conditionally weighted superposition of static graphs, where each static graph can capture nonlinear relationships. This approach enables the detection of complex, time-varying interactions between variables beyond linear limitations. Our method improves f1-scores of predicted dynamic causal patterns by roughly 22-28% on average over baselines in some of our experiments, with some improvements reaching well over 60%. A case study on real brain data demonstrates our method's ability to uncover relationships linked to specific behavioral states, offering valuable insights into neural dynamics.
Flow-field inference from neural data using deep recurrent networks
Timothy Doyeon Kim · Thomas Luo · Tankut Can · Kamesh Krishnamurthy · Jonathan Pillow · Carlos Brody
Neural computations underlying processes such as decision-making, working memory, and motor control are thought to emerge from neural population dynamics. But estimating these dynamics remains a significant challenge. Here we introduce Flow-field Inference from Neural Data using deep Recurrent networks (FINDR), an unsupervised deep learning method for inferring low-dimensional, nonlinear, stochastic dynamics underlying neural population activity. Using spike train data from frontal brain regions of rats performing an auditory decision-making task, we demonstrate that FINDR performs competitively with existing methods in capturing the heterogeneous responses of individual neurons. When trained to disentangle task-relevant and irrelevant activity, FINDR uncovers interpretable low-dimensional dynamics. These dynamics can be visualized as flow fields and attractors, enabling direct tests of attractor-based theories of neural computation. We suggest FINDR as a powerful method for revealing the low-dimensional task-relevant dynamics of neural populations and their associated computations.
NeuroTree: Hierarchical Functional Brain Pathway Decoding for Mental Health Disorders
Jun-En Ding · Dongsheng Luo · Chenwei Wu · Feng Liu
Mental disorders are among the most widespread diseases globally. Analyzing functional brain networks through functional magnetic resonance imaging (fMRI) is crucial for understanding mental disorder behaviors. Although existing fMRI-based graph neural networks (GNNs) have demonstrated significant potential in brain network feature extraction, they often fail to characterize complex relationships between brain regions and demographic information in mental disorders. To overcome these limitations, we propose a learnable NeuroTree framework that integrates a $k$-hop AGE-GCN with neural ordinary differential equations (ODEs) and contrastive masked functional connectivity (CMFC) to enhance similarities and dissimilarities of brain region distance. Furthermore, NeuroTree effectively decodes fMRI network features into tree structures, which improves the capture of high-order brain regional pathway features and enables the identification of hierarchical neural behavioral patterns essential for understanding disease-related brain subnetworks. Our empirical evaluations demonstrate that NeuroTree achieves state-of-the-art performance across two distinct mental disorder datasets. It provides valuable insights into age-related deterioration patterns, elucidating their underlying neural mechanisms. The code and datasets are available at https://github.com/Ding1119/NeuroTree.
TTFSFormer: A TTFS-based Lossless Conversion of Spiking Transformer
Lusen Zhao · Zihan Huang · Ding Jianhao · Zhaofei Yu
ANN-to-SNN conversion has emerged as a key approach to train Spiking Neural Networks (SNNs), particularly for Transformer architectures, as it maps pre-trained ANN parameters to SNN equivalents without requiring retraining, thereby preserving ANN accuracy while eliminating training costs. Among various coding methods used in ANN-to-SNN conversion, time-to-first-spike (TTFS) coding, which allows each neuron to at most one spike, offers significantly lower energy consumption. However, while previous TTFS-based SNNs have achieved comparable performance with convolutional ANNs, the attention mechanism and nonlinear layers in Transformer architectures remains a challenge by existing SNNs with TTFS coding. This paper proposes a new neuron structure for TTFS coding that expands its representational range and enhances the capability to process nonlinear functions, along with detailed designs of nonlinear neurons for different layers in Transformer. Experimental results on different models demonstrate that our proposed method can achieve high accuracy with significantly lower energy consumption. To the best of our knowledge, this is the first work to focus on converting Transformer to SNN with TTFS coding.
Can Biologically Plausible Temporal Credit Assignment Rules Match BPTT for Neural Similarity? E-prop as an Example
Yuhan Helena Liu · Guangyu Robert Yang · Christopher Cueva
Understanding how the brain learns may be informed by studying biologically plausible learning rules. These rules, often approximating gradient descent learning to respect biological constraints such as locality, must meet two critical criteria to be considered an appropriate brain model: (1) good neuroscience task performance and (2) alignment with neural recordings. While extensive research has assessed the first criterion, the second remains underexamined. Employing methods such as Procrustes analysis on well-known neuroscience datasets, this study demonstrates the existence of a biologically plausible learning rule — namely e-prop, which is based on gradient truncation and has demonstrated versatility across a wide range of tasks — that can achieve neural data similarity comparable to Backpropagation Through Time (BPTT) when matched for task accuracy. Our findings also reveal that model architecture and initial conditions can play a more significant role in determining neural similarity than the specific learning rule. Furthermore, we observe that BPTT-trained models and their biologically plausible counterparts exhibit similar dynamical properties at comparable accuracies. These results underscore the substantial progress made in developing biologically plausible learning rules, highlighting their potential to achieve both competitive task performance and neural data similarity.
SE(3)-Equivariant Diffusion Policy in Spherical Fourier Space
Xupeng Zhu · Fan Wang · Robin Walters · Jane Shi
Diffusion Policies are effective at learning closed-loop manipulation policies from human demonstrations but generalize poorly to novel arrangements of objects in 3D space, hurting real-world performance. To address this issue, we propose Spherical Diffusion Policy (SDP), an SE(3) equivariant diffusion policy that adapts trajectories according to 3D transformations of the scene. Such equivariance is achieved by embedding the states, actions, and the denoising process in spherical Fourier space. Additionally, we employ novel spherical FiLM layers to condition the action denoising process equivariantly on the scene embeddings. Lastly, we propose a spherical denoising temporal U-net that achieves spatiotemporal equivariance with computational efficiency. In the end, SDP is end-to-end SE(3) equivariant, allowing robust generalization across transformed 3D scenes. SDP demonstrates a large performance improvement over strong baselines in 20 simulation tasks and 5 physical robot tasks including single-arm and bi-manual embodiments. Code is available at https://github.com/amazon-science/SphericalDiffusionPolicy.
Three-Dimensional Trajectory Prediction with 3DMoTraj Dataset
Hao Zhou · Xu Yang · Mingyu Fan · Lu Qi · Xiangtai Li · Ming-Hsuan Yang · Fei Luo
With the growing interest in embodied and spatial intelligence, accurately predicting trajectories in 3D environments has become increasingly critical. However, no datasets have been explicitly designed to study 3D trajectory prediction. To this end, we contribute a 3D motion trajectory (3DMoTraj) dataset collected from unmanned underwater vehicles (UUVs) operating in oceanic environments. Mathematically, trajectory prediction becomes significantly more complex when transitioning from 2D to 3D. To tackle this challenge, we analyze the prediction complexity of 3D trajectories and propose a new method consisting of two key components: decoupled trajectory prediction and correlated trajectory refinement. The former decouples inter-axis correlations, thereby reducing prediction complexity and generating coarse predictions. The latter refines the coarse predictions by modeling their inter-axis correlations. Extensive experiments show that our method significantly improves 3D trajectory prediction accuracy and outperforms state-of-the-art methods. Both the 3DMoTraj dataset and the method are available at https://github.com/zhouhao94/3DMoTraj.
WOMD-Reasoning: A Large-Scale Dataset for Interaction Reasoning in Driving
Yiheng Li · Cunxin Fan · Chongjian GE · Seth Zhao · Chenran Li · Chenfeng Xu · Huaxiu Yao · Masayoshi Tomizuka · Bolei Zhou · Chen Tang · Mingyu Ding · Wei Zhan
Language models uncover unprecedented abilities in analyzing driving scenarios, owing to their limitless knowledge accumulated from text-based pre-training. Naturally, they should particularly excel in analyzing rule-based interactions, such as those triggered by traffic laws, which are well documented in texts. However, such interaction analysis remains underexplored due to the lack of dedicated language datasets that address it. Therefore, we propose Waymo Open Motion Dataset-Reasoning (WOMD-Reasoning), a comprehensive large-scale Q&As dataset built on WOMD focusing on describing and reasoning traffic rule-induced interactions in driving scenarios. WOMD-Reasoning also presents by far the largest multi-modal Q&A dataset, with 3 million Q&As on real-world driving scenarios, covering a wide range of driving topics from map descriptions and motion status descriptions to narratives and analyses of agents' interactions, behaviors, and intentions. To showcase the applications of WOMD-Reasoning, we design Motion-LLaVA, a motion-language model fine-tuned on WOMD-Reasoning. Quantitative and qualitative evaluations are performed on WOMD-Reasoning dataset as well as the outputs of Motion-LLaVA, supporting the data quality and wide applications of WOMD-Reasoning, in interaction predictions, traffic rule compliance plannings, etc. The dataset and its vision modal extension are available on https://waymo.com/open/download/. The codes & prompts to build it are available on https://github.com/yhli123/WOMD-Reasoning.
Hierarchical Planning for Complex Tasks with Knowledge Graph-RAG and Symbolic Verification
Flavio Petruzzellis · Cristina Cornelio · Pietro Lió
Large Language Models (LLMs) have shown promise as robotic planners but often struggle with long-horizon and complex tasks, especially in specialized environments requiring external knowledge. While hierarchical planning and Retrieval-Augmented Generation (RAG) address some of these challenges, they remain insufficient on their own and a deeper integration is required for achieving more reliable systems. To this end, we propose a neuro-symbolic approach that enhances LLMs-based planners with Knowledge Graph-based RAG for hierarchical plan generation. This method decomposes complex tasks into manageable subtasks, further expanded into executable atomic action sequences. To ensure formal correctness and proper decomposition, we integrate a Symbolic Validator, which also functions as a failure detector by aligning expected and observed world states. Our evaluation against baseline methods demonstrates the consistent significant advantages of integrating hierarchical planning, symbolic verification, and RAG across tasks of varying complexity and different LLMs. Additionally, our experimental setup and novel metrics not only validate our approach for complex planning but also serve as a tool for assessing LLMs' reasoning and compositional capabilities. Code available at https://github.com/corneliocristina/HVR.
Latent Diffusion Planning for Imitation Learning
Amber Xie · Oleh Rybkin · Dorsa Sadigh · Chelsea Finn
Recent progress in imitation learning has been enabled by policy architectures that scale to complex visuomotor tasks, multimodal distributions, and large datasets. However, these methods often rely on learning from large amount of expert demonstrations.To address these shortcomings, we propose Latent Diffusion Planning (LDP), a modular approach consisting of a planner which can leverage action-free demonstrations, and an inverse dynamics model which can leverage suboptimal data, that both operate over a learned latent space. First, we learn a compact latent space through a variational autoencoder, enabling effective forecasting of future states in image-based domains. Then, we train a planner and an inverse dynamics model with diffusion objectives. By separating planning from action prediction, LDP can benefit from the denser supervision signals of suboptimal and action-free data.On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches, as they cannot leverage such additional data.
ABNet: Adaptive explicit-Barrier Net for Safe and Scalable Robot Learning
Wei Xiao · Johnson Tsun-Hsuan Wang · Chuang Gan · Daniela Rus
Safe learning is central to AI-enabled robots where a single failure may lead to catastrophic results. Existing safe learning methods are not scalable, inefficient and hard to train, and tend to generate unstable signals under noisy inputs that are challenging to be deployed for robots. To address these challenges, we propose Adaptive explicit-Barrier Net (ABNet) in which barriers explicitly show up in the closed-form model that guarantees safety. The ABNet has the potential to incrementally scale toward larger safe foundation models. Each head of ABNet could learn safe control policies from different features and focuses on specific part of the observation. In this way, we do not need to directly construct a large model for complex tasks, which significantly facilitates the training of the model while ensuring its stable output. Most importantly, we can still formally prove the safety guarantees of the ABNet. We demonstrate the efficiency and strength of ABNet in 2D robot obstacle avoidance, safe robot manipulation, and vision-based end-to-end autonomous driving, with results showing much better robustness and guarantees over existing models.
SeedLoRA: A Fusion Approach to Efficient LLM Fine-Tuning
Yong Liu · Di Fu · Shenggan Cheng · Zirui Zhu · Yang Luo · Minhao Cheng · Cho-Jui Hsieh · Yang You
Despite Low-Rank Adaptation (LoRA)'s popularity for fine-tuning large models, it often exhibits a noticeable performance gap compared to full fine-tuning, particularly in complex tasks such as mathematical reasoning and code generation. Motivated by this discrepancy, we propose a novel fusion approach for LoRA fine-tuned models. Our key insight is that LoRA models trained with different random seeds on the same task often exhibit complementary strengths. In contrast to existing research that typically focuses on fusing models trained on diverse tasks, we explore the potential of combining multiple LoRA models fine-tuned on the same task with different random seeds. This intra-task fusion method aims to leverage the strengths of various fine-tuned models to create a more robust and effective adaptation. To validate our approach, we conducted comprehensive experiments across three key areas: mathematical reasoning, code generation, and general instruction-tuning tasks. The results demonstrate that our fusion method significantly enhances LoRA's performance, outperforming both standalone LoRA models and current fusion methods. Notably, this advancement substantially narrows the gap between LoRA and full fine-tuning, thus offering a more effective approach to model adaptation without the GPU memory burden of full parameter fine-tuning.
Subspace Optimization for Large Language Models with Convergence Guarantees
Yutong He · Pengrui Li · Yipeng Hu · Chuyan Chen · Kun Yuan
Subspace optimization algorithms, such as GaLore (Zhao et al., 2024), have gained attention for pre-training and fine-tuning large language models (LLMs) due to their memory efficiency. However, their convergence guarantees remain unclear, particularly in stochastic settings. In this paper, we reveal that GaLore does not always converge to the optimal solution and provide an explicit counterexample to support this finding. We further explore the conditions under which GaLore achieves convergence, showing that it does so when either (i) a sufficiently large mini-batch size is used or (ii) the gradient noise is isotropic. More significantly, we introduce GoLore (Gradient random Low-rank projection), a novel variant of GaLore that provably converges in typical stochastic settings, even with standard batch sizes. Our convergence analysis extends naturally to other subspace optimization algorithms. Finally, we empirically validate our theoretical results and thoroughly test the proposed mechanisms. Codes are available at https://github.com/pkumelon/Golore.
Alberta Wells Dataset: Pinpointing Oil and Gas Wells from Satellite Imagery
Pratinav Seth · Michelle Lin · BREFO YAW · Jade Boutot · Mary Kang · David Rolnick
Millions of abandoned oil and gas wells are scattered across the world, leaching methane into the atmosphere and toxic compounds into the groundwater. Many of these locations are unknown, preventing the wells from being plugged and their polluting effects averted. Remote sensing is a relatively unexplored tool for pinpointing abandoned wells at scale. We introduce the first large-scale Benchmark dataset for this problem, leveraging high-resolution multi-spectral satellite imagery from Planet Labs. Our curated Dataset comprises over 213,000 wells (abandoned, suspended, and active) from Alberta, a region with especially high well density, sourced from the Alberta Energy Regulator and verified by domain experts. We evaluate baseline algorithms for well detection and segmentation, showing the promise of computer vision approaches but also significant room for improvement.
HyperIV: Real-time Implied Volatility Smoothing
Yongxin Yang · Wenqi Chen · Chao Shu · Timothy Hospedales
We propose HyperIV, a novel approach for real-time implied volatility smoothing that eliminates the need for traditional calibration procedures. Our method employs a hypernetwork to generate parameters for a compact neural network that constructs complete volatility surfaces within 2 milliseconds, using only 9 market observations. Moreover, the generated surfaces are guaranteed to be free of static arbitrage. Extensive experiments across 8 index options demonstrate that HyperIV achieves superior accuracy compared to existing methods while maintaining computational efficiency. The model also exhibits strong cross-asset generalization capabilities, indicating broader applicability across different market instruments. These key features -- rapid adaptation to market conditions, guaranteed absence of arbitrage, and minimal data requirements -- make HyperIV particularly valuable for real-time trading applications. We make code available at https://github.com/qmfin/hyperiv.
Dataflow-Guided Neuro-Symbolic Language Models for Type Inference
gen li · Yao Wan · Hongyu Zhang · Zhou Zhao · Wenbin Jiang · Xuanhua Shi · Hai Jin · Zheng Wang
Language Models (LMs) are increasingly used for type inference, aiding in error detection and software development. Some real-world deployments of LMs require the model to run on local machines to safeguard the intellectual property of the source code. This setting often limits the size of the LMs that can be used. We present Nester, the first neuro-symbolic approach that enhances LMs for type inference by integrating symbolic learning without increasing model size. Nester breaks type inference into sub-tasks based on the data and control flow of the input code, encoding them as a modular high-level program. This program executes multi-step actions, such as evaluating expressions and analyzing conditional branches of the target code, combining static typing with LMs to infer potential types. Evaluated on the ManyTypes4Py dataset in Python, Nester outperforms two state-of-the-art type inference methods (HiTyper and TypeGen), achieving 70.7\% Top-1 Exact Match, which is 18.3\% and 3.6\% higher than HiTyper and TypeGen, respectively. For complex type annotations like typing.Optional and typing.Union, Nester achieves 51.0\% and 16.7\%, surpassing TypeGen by 28.3\% and 5.8\%.
Voronoi-grid-based Pareto Front Learning and Its Application to Collaborative Federated Learning
Mengmeng Chen · Xiaohu Wu · QIQI LIU · Tiantian He · Yew Soon ONG · Yaochu Jin · Qicheng Lao · Han Yu
Multi-objective optimization (MOO) exists extensively in machine learning, and aims to find a set of Pareto-optimal solutions, called the Pareto front, e.g., it is fundamental for multiple avenues of research in federated learning (FL). Pareto-Front Learning (PFL) is a powerful method implemented using Hypernetworks (PHNs) to approximate the Pareto front. This method enables the acquisition of a mapping function from a given preference vector to the solutions on the Pareto front. However, most existing PFL approaches still face two challenges: (a) sampling rays in high-dimensional spaces; (b) failing to cover the entire Pareto Front which has a convex shape. Here, we introduce a novel PFL framework, called as PHN-HVVS, which decomposes the design space into Voronoi grids and deploys a genetic algorithm (GA) for Voronoi grid partitioning within high-dimensional space. We put forward a new loss function, which effectively contributes to more extensive coverage of the resultant Pareto front and maximizes the HV Indicator. Experimental results on multiple MOO machine learning tasks demonstrate that PHN-HVVS outperforms the baselines significantly in generating Pareto front. Also, we illustrate that PHN-HVVS advances the methodologies of several recent problems in the FL field. The code is available athttps://github.com/buptcmm/phnhvvs.
Machines and Mathematical Mutations: Using GNNs to Characterize Quiver Mutation Classes
Jesse He · Helen Jenne · Herman Chau · Davis Brown · Mark Raugas · Sara Billey · Henry Kvinge
Machine learning is becoming an increasingly valuable tool in mathematics, enabling one to identify subtle patterns across collections of examples so vast that they would be impossible for a single researcher to feasibly review and analyze. In this work, we use graph neural networks to investigate quiver mutation---an operation that transforms one quiver (or directed multigraph) into another---which is central to the theory of cluster algebras with deep connections to geometry, topology, and physics. In the study of cluster algebras, the question of mutation equivalence is of fundamental concern: given two quivers, can one efficiently determine if one quiver can be transformed into the other through a sequence of mutations? In this paper, we use graph neural networks and AI explainability techniques to independently discover mutation equivalence criteria for quivers of type $\tilde{D}$. Along the way, we also show that even without explicit training to do so, our model captures structure within its hidden representation that allows us to reconstruct known criteria from type $D$, adding to the growing evidence that modern machine learning models are capable of learning abstract and general rules from mathematical data.
Physics Aware Neural Networks for Unsupervised Binding Energy Prediction
Ke Liu · Hao Chen · Chunhua Shen
Developing models for protein-ligand interactions holds substantial significance for drug discovery. Supervised methods often failed due to the lack of labeled data for predicting the protein-ligand binding energy, like antibodies. Therefore, unsupervised approaches are urged to make full use of the unlabeled data. To tackle the problem, we propose an efficient, unsupervised protein-ligand binding energy prediction model via the conservation of energy (CEBind), which follows the physical laws. Specifically, given a protein-ligand complex, we randomly sample forces for each atom in the ligand. Then these forces are applied rigidly to the ligand to perturb its position, following the law of rigid body dynamics. Finally, CEBind predicts the energy of both the unperturbed complex and the perturbed complex. The energy gap between two complexes equals the work of the outer forces, following the law of conservation of energy. Extensive experiments are conducted on the unsupervised protein-ligand binding energy prediction benchmarks, comparing them with previous works. Empirical results and theoretic analysis demonstrate that CEBind is more efficient and outperforms previous unsupervised models on benchmarks.
ETTA: Elucidating the Design Space of Text-to-Audio Models
Sang-gil Lee · Zhifeng Kong · ARUSHI GOEL · Sungwon Kim · Rafael Valle · Bryan Catanzaro
Recent years have seen significant progress in Text-To-Audio (TTA) synthesis, enabling users to enrich their creative workflows with synthetic audio generated from natural language prompts. Despite this progress, the effects of data, model architecture, training objective functions, and sampling strategies on target benchmarks are not well understood. With the purpose of providing a holistic understanding of the design space of TTA models, we set up a large-scale empirical experiment focused on diffusion and flow matching models. Our contributions include: 1) AF-Synthetic, a large dataset of high quality synthetic captions obtained from an audio understanding model; 2) a systematic comparison of different architectural, training, and inference design choices for TTA models; 3) an analysis of sampling methods and their Pareto curves with respect to generation quality and inference speed. We leverage the knowledge obtained from this extensive analysis to propose our best model dubbed Elucidated Text-To-Audio (ETTA). When evaluated on AudioCaps and MusicCaps, ETTA provides improvements over the baselines trained on publicly available data, while being competitive with models trained on proprietary data. Finally, we show ETTA's improved ability to generate creative audio following complex and imaginative captions -- a task that is more challenging than current benchmarks.
Learning Initial Basis Selection for Linear Programming via Duality-Inspired Tripartite Graph Representation and Comprehensive Supervision
Anqi Lu · Junchi Yan
For the fundamental linear programming (LP) problems, the simplex method remains popular, which usually requires an appropriate initial basis as a warm start to accelerate the solving process. Predicting an initial basis close to an optimal one can often accelerate the solver, but a closer initial basis does not always result in greater acceleration. To achieve better acceleration, we propose a GNN model based on a tripartite graph representation inspired by LP duality. This approach enables more effective feature extraction for general LP problems and enhances the expressiveness of GNNs. Additionally, we introduce novel loss functions targeting basic variable selection and basis feasibility, along with data preprocessing schemes, to further improve learning capability. In addition to achieving high prediction accuracy, we enhance the quality of the initial basis for practical use. Experimental results show that our approach greatly surpasses the state-of-the-art method in predicting initial basis with greater accuracy and in reducing the number of iterations and solving time of the LP solver.
AUTOCIRCUIT-RL: Reinforcement Learning-Driven LLM for Automated Circuit Topology Generation
Prashanth Vijayaraghavan · Luyao Shi · Ehsan Degan · Vandana Mukherjee · Xin Zhang
Analog circuit topology synthesis is integral to Electronic Design Automation (EDA), enabling the automated creation of circuit structures tailored to specific design requirements. However, the vast design search space and strict constraint adherence make efficient synthesis challenging. Leveraging the versatility of Large Language Models (LLMs), we propose AUTOCIRCUIT-RL, a novel reinforcement learning (RL)-based framework for automated analog circuit synthesis. The framework operates in two phases: instruction tuning, where an LLM learns to generate circuit topologies from structured prompts encoding design constraints, and RL refinement, which further improves the instruction-tuned model using reward models that evaluate validity, efficiency, and output voltage. The refined model is then used directly to generate topologies that satisfy the design constraints. Empirical results show that AUTOCIRCUIT-RL generates ~12% more valid circuits and improves efficiency by ~14% compared to the best baselines, while reducing duplicate generation rates by ~38%. It achieves over 60% success in synthesizing valid circuits with limited training data, demonstrating strong generalization. These findings highlight the framework's effectiveness in scaling to complex circuits while maintaining efficiency and constraint adherence, marking a significant advancement in AI-driven circuit design.
SK-VQA: Synthetic Knowledge Generation at Scale for Training Context-Augmented Multimodal LLMs
Xin Su · Man Luo · Kris Pan · Tien Pei Chou · Vasudev Lal · Phillip Howard
Multimodal retrieval-augmented generation (RAG) plays a crucial role in domains such as knowledge-based visual question answering (KB-VQA), where models should effectively integrate additional knowledge to generate a response. However, existing vision and language models (VLMs) are not inherently designed for context-augmented generation, limiting their effectiveness in such tasks. While synthetic data generation has recently gained attention for training large VLMs, its application for context-augmented generation remains underexplored. To address this gap, we introduce SKVQA, a large-scale synthetic multimodal dataset containing over 2 million visual question-answer pairs, each associated with external knowledge sources to determine the final answer. Compared to previous datasets, SKVQA exhibits 11× more unique questions, greater domain diversity, and a broader spectrum of image sources. Through human evaluations, we confirm the high quality of the generated question-answer pairs and their contextual relevance. Extensive experiments show that SKVQA serves both as a challenging benchmark for knowledge-based VQA and as an effective training resource for adapting generative multimodal models to context-augmented generation. Our results further indicate that models trained on SKVQA demonstrate enhanced generalization in both context-aware VQA and multimodal RAG settings.
EditLord: Learning Code Transformation Rules for Code Editing
Weichen Li · Albert Jan · Baishakhi Ray · Junfeng Yang · Chengzhi Mao · Kexin Pei
Code editing is a foundational task in software development, where its effectiveness depends on whether it introduces desired code property changes without changing the original code's intended functionality. Existing approaches often formulate code editing as an implicit end-to-end task, omitting the fact that code-editing procedures inherently consist of discrete and explicit steps, and thus suffer from suboptimal performance and lack of robustness and generalization. We introduce EditLord, a code editing framework that makes the code transformation steps explicit. Our key insight is to employ a language model (LM) as an inductive learner to extract code editing rules from the training code pairs as concise meta-rule sets.Such rule sets will be manifested for each training sample to augment them for finetuning or assist in prompting- and iterative-based code editing.EditLord outperforms the state-of-the-art by an average of 22.7% in editing performance and 58.1% in robustness while achieving 20.2% higher functional correctness, across critical software engineering and security applications, LM models, and editing modes.
Pairwise Maximum Likelihood For Multi-Class Logistic Regression Model With Multiple Rare Classes
Xuetong Li · Danyang Huang · Hansheng Wang
We study in this work the problem of multi-class logistic regression with one major class and multiple rare classes, which is motivated by a real application in TikTok live stream data. The model is inspired by the two-class logistic regression model of Wang (2020) but with surprising theoretical findings, which in turn motivate new estimation methods with excellent statistical and computational efficiency. Specifically, since rigorous theoretical analysis suggests that the resulting maximum likelihood estimators of different rare classes should be asymptotically independent, we consider to solve multiple pairwise two-class logistic regression problems instead of optimizing the joint log-likelihood function with computational challenge in multi-class problem, which are computationally much easier and can be conducted in a fully parallel way. To further reduce the computation cost, a subsample-based pairwise likelihood estimator is developed by down-sampling the major class. We show rigorously that the resulting estimators could be as asymptotically efficient as the global maximum likelihood estimator under appropriate regularity conditions. Extensive simulation studies are presented to support our theoretical findings and a TikTok live stream dataset is analyzed for illustration purpose.
Online advertising is a vital revenue source for major internet platforms. Recently, joint advertising, which assigns a bundle of two advertisers in an ad slot instead of allocating a single advertiser, has emerged as an effective method for enhancing allocation efficiency and revenue. However, existing mechanisms for joint advertising fail to realize the optimality, as they tend to focus on individual advertisers and overlook bundle structures. This paper identifies an optimal mechanism for joint advertising in a single-slot setting. For multi-slot joint advertising, we propose BundleNet, a novel bundle-based neural network approach specifically designed for joint advertising. Our extensive experiments demonstrate that the mechanisms generated by BundleNet approximate the theoretical analysis results in the single-slot setting and achieve state-of-the-art performance in the multi-slot setting. This significantly increases platform revenue while ensuring approximate dominant strategy incentive compatibility and individual rationality.
TIMING: Temporality-Aware Integrated Gradients for Time Series Explanation
Hyeongwon Jang · Changhun Kim · Eunho Yang
Recent explainable artificial intelligence (XAI) methods for time series primarily estimate point-wise attribution magnitudes, while overlooking the directional impact on predictions, leading to suboptimal identification of significant points. Our analysis shows that conventional Integrated Gradients (IG) effectively capture critical points with both positive and negative impacts on predictions. However, current evaluation metrics fail to assess this capability, as they inadvertently cancel out opposing feature contributions. To address this limitation, we propose novel evaluation metrics—Cumulative Prediction Difference (CPD) and Cumulative Prediction Preservation(CPP)—to systematically assess whether attribution methods accurately identify significant positive and negative points in time series XAI. Under these metrics, conventional IG outperforms recent counterparts. However, directly applying IG to time series data may lead to suboptimal outcomes, as generated paths ignore temporal relationships and introduce out-of-distribution samples. To overcome these challenges, we introduce TIMING, which enhances IG by incorporating temporal awareness while maintaining its theoretical properties. Extensive experiments on synthetic and real-world time series benchmarks demonstrate that TIMING outperforms existing timeseries XAI baselines. Our code is available at https://github.com/drumpt/TIMING.
Temporal Query Network for Efficient Multivariate Time Series Forecasting
Shengsheng Lin · Haojun Chen · Haijie Wu · Chunyun Qiu · Weiwei Lin
Sufficiently modeling the correlations among variables (aka channels) is crucial for achieving accurate multivariate time series forecasting (MTSF). In this paper, we propose a novel technique called Temporal Query (TQ) to more effectively capture multivariate correlations, thereby improving model performance in MTSF tasks. Technically, the TQ technique employs periodically shifted learnable vectors as queries in the attention mechanism to capture global inter-variable patterns, while the keys and values are derived from the raw input data to encode local, sample-level correlations. Building upon the TQ technique, we develop a simple yet efficient model named Temporal Query Network (TQNet), which employs only a single-layer attention mechanism and a lightweight multi-layer perceptron (MLP). Extensive experiments demonstrate that TQNet learns more robust multivariate correlations, achieving state-of-the-art forecasting accuracy across 12 challenging real-world datasets. Furthermore, TQNet achieves high efficiency comparable to linear-based methods even on high-dimensional datasets, balancing performance and computational cost. The code is available at: https://github.com/ACAT-SCUT/TQNet.
VerbalTS: Generating Time Series from Texts
Shuqi Gu · Chuyue Li · Baoyu Jing · Kan Ren
Time series synthesis has become a foundational task in modern society, underpinning decision-making across various scenes. Recent approaches primarily generate time series from structured conditions, such as attribute-based metadata. However, these methods struggle to capture the full complexity of time series, as the predefined structures often fail to reflect intricate temporal dynamics or other nuanced characteristics. Moreover, constructing structured metadata requires expert knowledge, making large-scale data labeling costly and impractical. In this paper, we introduce VerbalTS, a novel framework for generating time series from unstructured textual descriptions, offering a more expressive and flexible solution to time series synthesis. To bridge the gap between unstructured text and time series data, VerbalTS employs a multi-focal alignment and generation framework, effectively modeling their complex relationships. Experiments on two synthetic and four real-world datasets demonstrate that VerbalTS outperforms existing methods in both generation quality and semantic alignment with textual conditions.
Breaking the Curse of Multiagency in Robust Multi-Agent Reinforcement Learning
Laixi Shi · Jingchu Gai · Eric Mazumdar · Yuejie Chi · Adam Wierman
Standard multi-agent reinforcement learning (MARL) algorithms are vulnerable to sim-to-real gaps. To address this, distributionally robust Markov games (RMGs) have been proposed to enhance robustness in MARL by optimizing the worst-case performance when game dynamics shift within a prescribed uncertainty set. RMGs remains under-explored, from reasonable problem formulation to the development of sample-efficient algorithms. Two notorious and open challenges are the formulation of the uncertainty set and whether the corresponding RMGs can overcome the curse of multiagency, where the sample complexity scales exponentially with the number of agents. In this work, we propose a natural class of RMGs inspired by behavioral economics, where each agent's uncertainty set is shaped by both the environment and the integrated behavior of other agents. We first establish the well-posedness of this class of RMGs by proving the existence of game-theoretic solutions such as robust Nash equilibria and coarse correlated equilibria (CCE). Assuming access to a generative model, we then introduce a sample-efficient algorithm for learning the CCE whose sample complexity scales polynomially with all relevant parameters. To the best of our knowledge, this is the first algorithm to break the curse of multiagency for RMGs, regardless of the uncertainty set formulation.
LIVS: A Pluralistic Alignment Dataset for Inclusive Public Spaces
Rashid Mushkani · Perampalli Shravan Nayak · Hugo Berard · Allison Cohen · Shin Koseki · Hadrien Bertrand
We introduce the Local Intersectional Visual Spaces (LIVS) dataset, a benchmark for multi-criteria alignment, developed through a two-year participatory process with 30 community organizations to support the pluralistic alignment of text-to-image (T2I) models in inclusive urban planning. The dataset encodes 37,710 pairwise comparisons across 13,462 images, structured along six criteria—Accessibility, Safety, Comfort, Invitingness, Inclusivity, and Diversity—derived from 634 community-defined concepts. Using Direct Preference Optimization (DPO), we fine-tune Stable Diffusion XL to reflect multi-criteria spatial preferences and evaluate the LIVS dataset and the fine-tuned model through four case studies: (1) DPO increases alignment with annotated preferences, particularly when annotation volume is high; (2) preference patterns vary across participant identities, underscoring the need for intersectional data; (3) human-authored prompts generate more distinctive visual outputs than LLM-generated ones, influencing annotation decisiveness; and (4) intersectional groups assign systematically different ratings across criteria, revealing the limitations of single-objective alignment. While DPO improves alignment under specific conditions, the prevalence of neutral ratings indicates that community values are heterogeneous and often ambiguous. LIVS provides a benchmark for developing T2I models that incorporate local, stakeholder-driven preferences, offering a foundation for context-aware alignment in spatial design.
AI for Global Climate Cooperation: Modeling Global Climate Negotiations, Agreements, and Long-Term Cooperation in RICE-N
Tianyu Zhang · Andrew Williams · Phillip Wozny · Kai-Hendrik Cohrs · Koen Ponse · Marco Jiralerspong · Soham Phade · Sunil Srinivasa · Lu Li · Yang Zhang · Prateek Gupta · Erman Acar · Irina Rish · Yoshua Bengio · Stephan Zheng
Global cooperation on climate change mitigation is essential to limit temperature increases while supporting long-term, equitable economic growth and sustainable development. Achieving such cooperation among diverse regions, each with different incentives, in a dynamic environment shaped by complex geopolitical and economic factors, without a central authority, is a profoundly challenging game-theoretic problem. This article introduces RICE-N, a multi-region integrated assessment model that simulates the global climate, economy, and climate negotiations and agreements. RICE-N uses multi-agent reinforcement learning (MARL) to encourage agents to develop strategic behaviors based on the environmental dynamics and the actions of the others. We present two negotiation protocols: (1) Bilateral Negotiation, an exemplary protocol and (2) Basic Club, inspired from Climate Clubs and the carbon border adjustment mechanism (Nordhaus, 2015; Comissions, 2022). We compare their impact against a no-negotiation baseline with various mitigation strategies, showing that both protocols significantly reduce temperature growth at the cost of a minor drop in production while ensuring a more equitable distribution of the emission reduction costs.
Sparse Autoencoders for Hypothesis Generation
Rajiv Movva · Kenny Peng · Nikhil Garg · Jon Kleinberg · Emma Pierson
We describe HypotheSAEs, a general method to hypothesize interpretable relationships between text data (e.g., headlines) and a target variable (e.g., clicks). HypotheSAEs has three steps: (1) train a sparse autoencoder on text embeddings to produce interpretable features describing the data distribution, (2) select features that predict the target variable, and (3) generate a natural language interpretation of each feature (e.g., mentions being surprised or shocked) using an LLM. Each interpretation serves as a hypothesis about what predicts the target variable. Compared to baselines, our method better identifies reference hypotheses on synthetic datasets (at least +0.06 in F1) and produces more predictive hypotheses on real datasets (~twice as many significant findings), despite requiring 1-2 orders of magnitude less compute than recent LLM-based methods. HypotheSAEs also produces novel discoveries on two well-studied tasks: explaining partisan differences in Congressional speeches and identifying drivers of engagement with online headlines.
Structured Preconditioners in Adaptive Optimization: A Unified Analysis
Shuo Xie · Tianhao Wang · Sashank J. Reddi · Sanjiv Kumar · Zhiyuan Li
We present a novel unified analysis for a broad class of adaptive optimization algorithms with structured (e.g., layerwise, diagonal, and kronecker-factored) preconditioners for both online regret minimization and offline convex optimization. Our analysis not only provides matching rate to several important structured preconditioned algorithms including diagonal AdaGrad, full-matrix AdaGrad, and AdaGrad-Norm, but also gives an improved convergence rate for a one-sided variant of Shampoo over that of original Shampoo. Interestingly, more structured preconditioners (e.g., diagonal Adagrad, AdaGrad-Norm which use less space and compute) are often presented as computationally efficient approximations to full-matrix Adagrad, aiming for improved optimization performance through better approximations. Our unified analysis challenges this prevailing view and reveals, perhaps surprisingly, that more structured preconditioners, despite using less space and computation per step, can outperform their less structured counterparts. To demonstrate this, we show that one-sided Shampoo, which is relatively much cheaper than full-matrix AdaGrad could outperform it both theoretically and experimentally.
Armijo Line-search Can Make (Stochastic) Gradient Descent Provably Faster
Sharan Vaswani · Reza Babanezhad
Armijo line-search (Armijo-LS) is a standard method to set the step-size for gradient descent (GD). For smooth functions, Armijo-LS alleviates the need to know the global smoothness constant $L$ and adapts to the ``local'' smoothness, enabling GD to converge faster. Existing theoretical analyses show that GD with Armijo-LS ($\texttt{GD-LS}$) can result in constant factor improvements over GD with a $1/L$ step-size (denoted as $\texttt{GD(1/L)}$). We strengthen these results and show that if the objective function satisfies a certain non-uniform smoothness condition, $\texttt{GD-LS}$ can result in a faster convergence rate than $\texttt{GD(1/L)}$. In particular, we prove that for convex objectives corresponding to logistic regression and multi-class classification, $\texttt{GD-LS}$ can converge to the optimum at a linear rate, and hence improves over the sublinear convergence of $\texttt{GD(1/L)}$. Furthermore, for non-convex objectives satisfying gradient domination (e.g., those corresponding to the softmax policy gradient in RL or generalized linear models with a logistic link function), $\texttt{GD-LS}$ can match the fast convergence of algorithms tailored for these specific settings. Finally, we prove that under the interpolation assumption, for convex losses, stochastic GD with a stochastic line-search can match the fast convergence of $\texttt{GD-LS}$.
MERIT: Maximum-normalized Element-wise Ratio for Language Model Large-batch Training
Yang Luo · Zangwei Zheng · Ziheng Qin · Zirui Zhu · Yong Liu · Yang You
Large-batch training has become a cornerstone in accelerating the training of deep neural networks, yet it poses challenges in optimization and generalization. Existing optimizers like AdamW present performance degradation during language models' large-batch training, due to the information bottleneck in attention layers caused by the sharp increase of max attention logit. While the LAMB optimizer partially addresses this issue, some attention layers still face this issue. The reason is that $l_2$-norm-based trust ratios in LAMB are less effective in directly influencing the max value of query/key weights. Furthermore, the weight-wise trust ratio in LAMB is error-prone as it overlooks relationships of weight values within rows or columns. Building on these observations, we propose a novel optimizer, MERIT, which leverages the max-norm to calculate the trust ratio to constrain the max attention logit more effectively. Moreover, we further construct element-wise trust ratios to provide more robust update scaling by focusing on local weight structures. Extensive experiments of large-batch training across various sizes of GPT-2 models demonstrate the superior performance of MERIT. Notably, during the training of GPT-2 Medium, MERIT enables a 6k batch size without any performance degradation compared to the standard batch size (480) with 48B training tokens.This work highlights the importance of considering the max attention logit and finer-granularity trust ratio in large-batch training. It successfully improves the training stability and paves the way for larger batch usage, enabling faster development and iteration of large language models. Code is available at https://github.com/NUS-HPC-AI-Lab/MERIT.
Sharpness-Aware Minimization (SAM) has been demonstrated to improve the generalization performance of overparameterized models by seeking flat minima on the loss landscape through optimizing model parameters that incur the largest loss within a neighborhood. Nevertheless, such min-max formulations are computationally challenging especially when the problem is highly non-convex. Additionally, focusing only on the worst-case local solution while ignoring potentially many other local solutions may be suboptimal when searching for flat minima. In this work, we propose Tilted SAM (TSAM), a smoothed generalization of SAM inspired by exponential tilting that effectively assigns higher priority to local solutions that incur larger losses. TSAM is parameterized by a tilt hyperparameter $t$ and reduces to SAM as $t$ approaches infinity. We show that TSAM is smoother than SAM and thus easier to optimize, and it explicitly favors flatter minima. We develop algorithms motivated by the discretization of Hamiltonian dynamics to solve TSAM. Empirically, TSAM arrives at flatter local minima and results in superior test performance than the baselines of SAM and ERM across a range of image and text tasks.
In this work, we propose a new architecture-agnostic method for training idempotent neural networks. An idempotent operator satisfies $f(x) = f(f(x))$, meaning it can be applied iteratively with no effect beyond the first application. Some neural networks used in data transformation tasks, such as image generation and augmentation, can represent non-linear idempotent projections. Using methods from perturbation theory we derive the recurrence relation ${\mathbf{K}' \leftarrow 3\mathbf{K}^2 - 2\mathbf{K}^3}$ for iteratively projecting a real-valued matrix $\mathbf{K}$ onto the manifold of idempotent matrices. Our analysis shows that for linear, single-layer MLP networks this projection 1) has idempotent fixed points, and 2) is attracting only around idempotent points. We give an extension to non-linear networks by considering our approach as a substitution of the gradient for the canonical loss function, achieving an architecture-agnostic training scheme. We provide experimental results for MLP- and CNN-based architectures with significant improvement in idempotent error over the canonical gradient-based approach. Finally, we demonstrate practical applications of the method as we train a generative network successfully using only a simple reconstruction loss paired with our method.
Nonlinearly Preconditioned Gradient Methods under Generalized Smoothness
Konstantinos Oikonomidis · Jan Quan · Emanuel Laude · Panagiotis Patrinos
We analyze nonlinearly preconditioned gradient methods for solving smooth minimization problems. We introduce a generalized smoothness property, based on the notion of abstract convexity, that is broader than Lipschitz smoothness and provide sufficient first- and second-order conditions. Notably, our framework encapsulates algorithms associated with the gradient clipping method and brings out novel insights for the class of $(L_0,L_1)$-smooth functions that has received widespread interest recently, thus allowing us to extend beyond already established methods. We investigate the convergence of the proposed method in both the convex and nonconvex setting.
Linear convergence of Sinkhorn's algorithm for generalized static Schrödinger bridge
Rahul Choudhary · Hanbaek Lyu
The classical static Schrödinger Bridge (SSB) problem, which seeks the most likely stochastic evolution between two marginal probability measures, has been studied extensively in the optimal transport and statistical physics communities, and more recently in machine learning communities in the surge of generative models. The standard approach to solve SSB is to first identify its Kantorovich dual and use Sinkhorn's algorithm to find the optimal potential functions. While the original SSB is only a strictly convex minimization problem, this approach is known to warrant linear convergence under mild assumptions. In this work, we consider a generalized SSB allowing any strictly increasing divergence functional, far generalizing the entropy functional $x\log x$ in the standard SSB. This problem naturally arises in a wide range of seemingly unrelated problems in entropic optimal transport, random graphs/matrices, and combinatorics. We establish Kantorovich duality and linear convergence of Sinkhorn's algorithm for the generalized SSB problem under mild conditions. Our results provide a new rigorous foundation for understanding Sinkhorn-type iterative methods in the context of large-scale generalized Schrödinger bridges.
Provable and Practical Online Learning Rate Adaptation with Hypergradient Descent
Ya-Chi Chu · Wenzhi Gao · Yinyu Ye · Madeleine Udell
This paper investigates the convergence properties of the hypergradient descent method ($\texttt{HDM}$), a 25-year-old heuristic originally proposed for adaptive stepsize selection in stochastic first-order methods. We provide the first rigorous convergence analysis of $\texttt{HDM}$ using the online learning framework and apply this analysis to develop a new state-of-the-art adaptive gradient methods with empirical and theoretical support. Notably, $\texttt{HDM}$ automatically identifies the optimal stepsize for the local optimization landscape and achieves local superlinear convergence. Our analysis explains the instability of $\texttt{HDM}$ reported in the literature and proposes efficient strategies to address it. We also develop two $\texttt{HDM}$ variants with heavy-ball and Nesterov momentum. Experiments on deterministic convex problems show $\texttt{HDM}$ with heavy-ball momentum ($\texttt{HDM-HB}$) exhibits robust performance and significantly outperforms other adaptive first-order methods. Moreover, $\texttt{HDM-HB}$ often matches the performance of $\texttt{L-BFGS}$, an efficient and practical quasi-Newton method, using less memory and cheaper iterations.
On Linear Convergence in Smooth Convex-Concave Bilinearly-Coupled Saddle-Point Optimization: Lower Bounds and Optimal Algorithms
Ekaterina Borodich · Alexander Gasnikov · Dmitry Kovalev
We revisit the smooth convex-concave bilinearly-coupled saddle-point problem of the form $\min_x\max_y f(x) + \langle y,\mathbf{B} x\rangle - g(y)$. In the highly specific case where function $f(x)$ is strongly convex and function $g(y)$ is affine, or both functions are affine, there exist lower bounds on the number of gradient evaluations and matrix-vector multiplications required to solve the problem, as well as matching optimal algorithms. A notable aspect of these algorithms is that they are able to attain linear convergence, i.e., the number of iterations required to solve the problem is proportional to $\log(1/\epsilon)$. However, the class of bilinearly-coupled saddle-point problems for which linear convergence is possible is much wider and can involve general smooth non-strongly convex functions $f(x)$ and $g(y)$. Therefore, *we develop the first lower complexity bounds and matching optimal linearly converging algorithms for this problem class*. Our lower complexity bounds are much more general, but they cover and unify the existing results in the literature. On the other hand, our algorithm implements the separation of complexities, which, for the first time, enables the simultaneous achievement of both optimal gradient evaluation and matrix-vector multiplication complexities, resulting in the best theoretical performance to date.
CaDA: Cross-Problem Routing Solver with Constraint-Aware Dual-Attention
Han Li · Fei Liu · Zhi Zheng · Yu Zhang · Zhenkun Wang
Vehicle routing problems (VRPs) are significant combinatorial optimization problems (COPs) holding substantial practical importance. Recently, neural combinatorial optimization (NCO), which involves training deep learning models on extensive data to learn vehicle routing heuristics, has emerged as a promising approach due to its efficiency and the reduced need for manual algorithm design. However, applying NCO across diverse real-world scenarios with various constraints necessitates cross-problem capabilities. Current cross-problem NCO methods for VRPs typically employ a constraint-unaware model, limiting their cross-problem performance. Furthermore, they rely solely on global connectivity, which fails to focus on key nodes and leads to inefficient representation learning. This paper introduces a \underline{C}onstraint-\underline{A}ware \underline{D}ual-\underline{A}ttention Model (CaDA), designed to address these limitations. CaDA incorporates a constraint prompt that efficiently represents different problem variants. Additionally, it features a dual-attention mechanism with a global branch for capturing broader graph-wide information and a sparse branch that selectively focuses on the key node connections. We comprehensively evaluate our model on 16 different VRPs and compare its performance against existing cross-problem VRP solvers. CaDA achieves state-of-the-art results across all tested VRPs. Our ablation study confirms that each component contributes to its cross-problem learning performance. The source code for CaDA is publicly available at \url{https://github.com/CIAM-Group/CaDA}.
Discrete and Continuous Difference of Submodular Minimization
George Orfanides · Tim Hoheisel · Marwa El Halabi
Submodular functions, defined on continuous or discrete domains, arise in numerous applications. We study the minimization of the difference of submodular (DS) functions, over both domains, extending prior work restricted to set functions.We show that all functions on discrete domains and all smooth functions on continuous domains are DS. For discrete domains, we observe that DS minimization is equivalent to minimizing the difference of two convex (DC) functions, as in the set function case. We propose a novel variant of the DC Algorithm (DCA) and apply it to the resulting DC Program, obtaining comparable theoretical guarantees as in the set function case. The algorithm can be applied to continuous domains via discretization. Experiments demonstrate that our method outperforms baselines in integer compressive sensing and integer least squares.
Self-Supervised Transformers as Iterative Solution Improvers for Constraint Satisfaction
Yudong W Xu · Wenhao Li · Scott Sanner · Elias Khalil
We present a Transformer-based framework for Constraint Satisfaction Problems (CSPs). CSPs find use in many applications and thus accelerating their solution with machine learning is of wide interest. Most existing approaches rely on supervised learning from feasible solutions or reinforcement learning, paradigms that require either feasible solutions to these NP-Complete CSPs or large training budgets and a complex expert-designed reward signal. To address these challenges, we propose ConsFormer, a self-supervised framework that leverages a Transformer as a solution refiner. ConsFormer constructs a solution to a CSP iteratively in a process that mimics local search. Instead of using feasible solutions as labeled data, we devise differentiable approximations to the discrete constraints of a CSP to guide model training. Our model is trained to improve random assignments for a single step but is deployed iteratively at test time, circumventing the bottlenecks of supervised and reinforcement learning. Experiments on Sudoku, Graph Coloring, Nurse Rostering, and MAXCUT demonstrate that our method can tackle out-of-distribution CSPs simply through additional iterations.
GPTAQ: Efficient Finetuning-Free Quantization for Asymmetric Calibration
Yuhang Li · Ruokai Yin · Donghyun Lee · Shiting Xiao · Priyadarshini Panda
We introduce GPTAQ, a novel finetuning-free quantization method for compressing large-scale transformer architectures.Unlike the previous GPTQ method, which independently calibrates each layer, we always match the quantized layer's output to the exact output in the full-precision model, resulting in a scheme that we call asymmetric calibration. Such a scheme can effectively reduce the quantization error accumulated in previous layers. We analyze this problem using optimal brain compression to derive a close-formed solution. The new solution explicitly minimizes the quantization error as well as the accumulated asymmetry error. Furthermore, we utilize various techniques to parallelize the solution calculation, including channel parallelization, neuron decomposition, and Cholesky reformulation for matrix fusion. As a result, GPTAQ is easy to implement, simply using 20 more lines of code than GPTQ but improving its performance under low-bit quantization. Remarkably, on a single GPU, we quantize a 405B language transformer as well as EVA-02—the rank first vision transformer that achieves 90% pretraining Imagenet accuracy. Code is available at Github.
DSBRouter: End-to-end Global Routing via Diffusion Schr\"{o}dinger Bridge
Liangliang Shi · Shenhui Zhang · Xingbo Du · Nianzu Yang · Junchi Yan
Global routing (GR) is a fundamental task in modern chip design and various learning techniques have been devised. However, a persistent challenge is the inherent lack of a mechanism to guarantee the routing connectivity in network's prediction results, necessitating post-processing search or reinforcement learning (RL) to enforce the connectivity. In this paper, we propose a neural GR solver called DSBRouter, leveraging the Diffusion Schr\"{o}dinger Bridge (DSB) model for GR. During training, unlike previous works that learn the mapping from noise to routes, we establish a bridge between the initial pins and the routing via DSB, which learns the forward and backward mapping between them. For inference, based on the evaluation metric (e.g. low overflow), we further introduce a sampling scheme with evaluation-based guidance to enhance the routing predictions. Note that DSBRouter is an end-to-end model that does not require a post-step to ensure connectivity. Empirical results show that it achieves SOTA performance on the overflow reduction in ISPD98 and part of ISPD07. In some cases, DSBRouter can even generate routes with zero overflow.
Regularized Langevin Dynamics for Combinatorial Optimization
Shengyu Feng · Yiming Yang
This work proposes a simple yet effective sampling framework for combinatorial optimization (CO). Our method builds on discrete Langevin dynamics (LD), an efficient gradient-guided generative paradigm. However, we observe that directly applying LD often leads to limited exploration. To overcome this limitation, we propose the Regularized Langevin Dynamics (RLD), which enforces an expected distance between the sampled and current solutions, effectively avoiding local minima. We develop two CO solvers on top of RLD, one based on simulated annealing (SA), and the other one based on neural network (NN). Empirical results on three classic CO problems demonstrate that both of our methods can achieve comparable or better performance against the previous state-of-the-art (SOTA) SA- and NN-based solvers. In particular, our SA algorithm reduces the runtime of the previous SOTA SA method by up to 80\%, while achieving equal or superior performance. In summary, RLD offers a promising framework for enhancing both traditional heuristics and NN models to solve CO problems. Our code is available at https://github.com/Shengyu-Feng/RLD4CO.
EquivaMap: Leveraging LLMs for Automatic Equivalence Checking of Optimization Formulations
Haotian Zhai · Connor Lawless · Ellen Vitercik · Liu Leqi
A fundamental problem in combinatorial optimization is identifying equivalent formulations. Despite the growing need for automated equivalence checks---driven, for example, by optimization copilots, which generate problem formulations from natural language descriptions---current approaches rely on simple heuristics that fail to reliably check formulation equivalence.Inspired by Karp reductions, in this workwe introduce Quasi-Karp equivalence, a formal criterion for determining when two optimization formulations are equivalentbased on the existence of a mapping between their decision variables. We propose EquivaMap, a framework that leverages large language models to automatically discover such mappings for scalable, reliable equivalence checking, with a verification stage that ensures mapped solutions preserve feasibility and optimality without additional solver calls. To evaluate our approach, we construct EquivaFormulation, the first open-source dataset of equivalent optimization formulations, generatedby applying transformations such as adding slack variables or valid inequalitiesto existing formulations.Empirically, EquivaMapsignificantly outperforms existing methods, achieving substantial improvements in correctly identifying formulation equivalence.
Scalable First-order Method for Certifying Optimal k-Sparse GLMs
Jiachang Liu · Soroosh Shafiee · Andrea Lodi
This paper investigates the problem of certifying optimality for sparse generalized linear models (GLMs), where sparsity is enforced through an $\ell_0$ cardinality constraint. While branch-and-bound (BnB) frameworks can certify optimality by pruning nodes using dual bounds, existing methods for computing these bounds are either computationally intensive or exhibit slow convergence, limiting their scalability to large-scale problems. To address this challenge, we propose a first-order proximal gradient algorithm designed to solve the perspective relaxation of the problem within a BnB framework. Specifically, we formulate the relaxed problem as a composite optimization problem and demonstrate that the proximal operator of the non-smooth component can be computed exactly in log-linear time complexity, eliminating the need to solve a computationally expensive second-order cone program. Furthermore, we introduce a simple restart strategy that enhances convergence speed while maintaining low per-iteration complexity. Extensive experiments on synthetic and real-world datasets show that our approach significantly accelerates dual bound computations and is highly effective in providing optimality certificates for large-scale problems.
FSL-SAGE: Accelerating Federated Split Learning via Smashed Activation Gradient Estimation
Srijith Nair · Michael Lin · Peizhong Ju · Amirreza Talebi · Elizabeth Bentley · Jia (Kevin) Liu
Collaborative training methods like Federated Learning (FL) and Split Learning(SL) enable distributed machine learning without sharing raw data. However, FL assumes clients can train entire models, which is infeasible for large-scalemodels.In contrast, while SL alleviates the client memory constraint in FL by offloading most training to the server, it increases network latency due to its sequential nature. Other methods address the conundrum by using local loss functions for parallel client-side training to improve efficiency, but they lack server feedback and potentially suffer poor accuracy. We propose FSL-SAGE (Federated Split Learning via Smashed Activation Gradient Estimation), a new federated split learning algorithm that estimates server-side gradient feedback via auxiliary models.These auxiliary models periodically adapt to emulate server behavior on localdatasets. We show that FSL-SAGE achieves a convergence rate of $\mathcal{O}(1/\sqrt{T})$, where $T$ is the number of communication rounds.This result matches FedAvg, while significantly reducing communication costs andclient memory requirements. Our empirical results also verify that it outperforms existing state-of-the-artFSL methods, offering both communication efficiency and accuracy.
HALoS: Hierarchical Asynchronous Local SGD over Slow Networks for Geo-Distributed Large Language Model Training
Geon-Woo Kim · Junbo Li · Shashidhar Gandham · Omar Baldonado · Adithya Gangidi · Pavan Balaji · Zhangyang “Atlas” Wang · Aditya Akella
Training large language models (LLMs) increasingly relies on geographically distributed accelerators, causing prohibitive communication costs across regions and uneven utilization of heterogeneous hardware. We propose HALoS, a hierarchical asynchronous optimization framework that tackles these issues by introducing local parameter servers (LPSs) within each region and a global parameter server (GPS) that merges updates across regions. This hierarchical design minimizes expensive inter-region communication, reduces straggler effects, and leverages fast intra-region links. We provide a rigorous convergence analysis for HALoS under non-convex objectives, including theoretical guarantees on the role of hierarchical momentum in asynchronous training. Empirically, HALoS attains up to 7.5× faster convergence than synchronous baselines in geo-distributed LLM training and improves upon existing asynchronous methods by up to 2.1×. Crucially, HALoS preserves the model quality of fully synchronous SGD—matching or exceeding accuracy on standard language modeling and downstream benchmarks—while substantially lowering total training time. These results demonstrate that hierarchical, server-side update accumulation and global model merging are powerful tools for scalable, efficient training of new-era LLMs in heterogeneous, geo-distributed environments.
Achieving Linear Speedup and Near-Optimal Complexity for Decentralized Optimization over Row-stochastic Networks
Liyuan Liang · Xinyi Chen · Gan Luo · Kun Yuan
A key challenge in decentralized optimization is determining the optimal convergence rate and designing algorithms to achieve it. While this problem has been extensively addressed for doubly-stochastic and column-stochastic mixing matrices, the row-stochastic scenario remains unexplored. This paper bridges this gap by introducing effective metrics to capture the influence of row-stochastic mixing matrices and establishing the first convergence lower bound for decentralized learning over row-stochastic networks. However, existing algorithms fail to attain this lower bound due to two key issues: deviation in the descent direction caused by the adapted gradient tracking (GT) and instability introduced by the Pull-Diag protocol. To address descent deviation, we propose a novel analysis framework demonstrating that Pull-Diag-GT achieves linear speedup—the first such result for row-stochastic decentralized optimization. Moreover, by incorporating a multi-step gossip (MG) protocol, we resolve the instability issue and attain the lower bound, achieving near-optimal complexity for decentralized optimization over row-stochastic networks.
Sample, Scrutinize and Scale: Effective Inference-Time Search by Scaling Verification
Eric Zhao · Pranjal Awasthi · Sreenivas Gollapudi
Sampling-based search, a simple paradigm for utilizing test-time compute, involves generating multiple candidate responses and selecting the best one---typically by verifying each response for correctness. In this paper, we study the scaling trends governing sampling-based search. Among our findings is that simply scaling up a minimalist implementation that uses only random sampling and direct self-verification results in sustained performance improvements that, for example, elevate the Gemini v1.5 Pro model's reasoning capabilities past that of o1-Preview on popular benchmarks. We partially attribute the scalability of sampling-based search to a phenomenon of implicit scaling, where sampling a larger pool of responses in turn improves verification accuracy. We further identify two useful principles for improving self-verification capabilities with test-time compute: (1) comparing across responses provides helpful signals about the locations of errors and hallucinations, and (2) different model output styles are useful for different contexts---chains of thought are useful for reasoning but harder to verify. We also find that, though accurate verification can be elicited, frontier models demonstrate remarkably weak out-of-box verification capabilities and introduce a benchmark to measure progress on these deficiencies.
Efficiently Serving Large Multimodal Models Using EPD Disaggregation
Gursimran Singh · Xinglu Wang · Yifan Hu · Timothy Yu · Linzi Xing · Wei Jiang · Zhefeng Wang · Bai Xiaolong · Yi Li · Ying Xiong · Yong Zhang · Zhenan Fan
Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This step negatively affects key Service Level Objectives (SLOs), such as time to first token (TTFT) and time per output token (TPOT). We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouples these steps, unlocking new opportunities and optimizations. These include a mechanism to cache multimedia tokens for efficient transfer, a novel way to parallelize the encoding load within a request, a module for optimal resource allocation for disaggregated serving, and a novel role-switching method to handle changing workload characteristics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15× lower peak memory utilization), batch sizes (up to 22× larger), 10× more images per request, and 2.2× larger KV caches. Furthermore, it leads to significant improvements in SLO attainment (up to 90–100% improvement) and TTFT (up to 71% reduction), compared to systems that do not disaggregate. The code is available at https://github.com/vbdi/epdserve.
Behavior-Regularized Diffusion Policy Optimization for Offline Reinforcement Learning
Chen-Xiao Gao · Chenyang Wu · Mingjun Cao · Chenjun Xiao · Yang Yu · Zongzhang Zhang
Behavior regularization, which constrains the policy to stay close to some behavior policy, is widely used in offline reinforcement learning (RL) to manage the risk of hazardous exploitation of unseen actions. Nevertheless, existing literature on behavior-regularized RL primarily focuses on explicit policy parameterizations, such as Gaussian policies. Consequently, it remains unclear how to extend this framework to more advanced policy parameterizations, such as diffusion models. In this paper, we introduce BDPO, a principled behavior-regularized RL framework tailored for diffusion-based policies, thereby combining the expressive power of diffusion policies and the robustness provided by regularization. The key ingredient of our method is to calculate the Kullback-Leibler (KL) regularization analytically as the accumulated discrepancies in reverse-time transition kernels along the diffusion trajectory. By integrating the regularization, we develop an efficient two-time-scale actor-critic RL algorithm that produces the optimal policy while respecting the behavior constraint. Comprehensive evaluations conducted on synthetic 2D tasks and continuous control tasks from the D4RL benchmark validate its effectiveness and superior performance.
Learning to Trust Bellman Updates: Selective State-Adaptive Regularization for Offline RL
Qin-Wen Luo · Ming-Kun Xie · Ye-Wen Wang · Sheng-Jun Huang
Offline reinforcement learning (RL) aims to learn an effective policy from a static dataset.To alleviate extrapolation errors, existing studies often uniformly regularize the value function or policy updates across all states.However, due to substantial variations in data quality, the fixed regularization strength often leads to a dilemma: Weak regularization strength fails to address extrapolation errors and value overestimation, while strong regularization strength shifts policy learning toward behavior cloning, impeding potential performance enabled by Bellman updates.To address this issue, we propose the selective state-adaptive regularization method for offline RL. Specifically, we introduce state-adaptive regularization coefficients to trust state-level Bellman-driven results, while selectively applying regularization on high-quality actions, aiming to avoid performance degradation caused by tight constraints on low-quality actions.By establishing a connection between the representative value regularization method, CQL, and explicit policy constraint methods, we effectively extend selective state-adaptive regularization to these two mainstream offline RL approaches.Extensive experiments demonstrate that the proposed method significantly outperforms the state-of-the-art approaches in both offline and offline-to-online settings on the D4RL benchmark. The implementation is available at https://github.com/QinwenLuo/SSAR.
Stable Offline Value Function Learning with Bisimulation-based Representations
Brahma Pavse · Yudong Chen · Qiaomin Xie · Josiah Hanna
In reinforcement learning, offline value function learning is the procedure of using an offline dataset to estimate the expected discounted return from each state when taking actions according to a fixed target policy. The stability of this procedure, i.e., whether it converges to its fixed-point, critically depends on the representations of the state-action pairs. Poorly learned representations can make value function learning unstable, or even divergent. Therefore, it is critical to stabilize value function learning by explicitly shaping the state-action representations. Recently, the class of bisimulation-based algorithms have shown promise in shaping representations for control. However, it is still unclear if this class of methods can \emph{stabilize} value function learning. In this work, we investigate this question and answer it affirmatively. We introduce a bisimulation-based algorithm called kernel representations for offline policy evaluation (\textsc{krope}). \textsc{krope} uses a kernel to shape state-action representations such that state-action pairs that have similar immediate rewards and lead to similar next state-action pairs under the target policy also have similar representations. We show that \textsc{krope}: 1) learns stable representations and 2) leads to lower value error than baselines. Our analysis provides new theoretical insight into the stability properties of bisimulation-based methods and suggests that practitioners can use these methods to improve the stability and accuracy of offline evaluation of reinforcement learning agents.
Decision Mixer: Integrating Long-term and Local Dependencies via Dynamic Token Selection for Decision-Making
Hongling Zheng · Li Shen · Yong Luo · Deheng Ye · Bo Du · Jialie SHEN · Dacheng Tao
The Conditional Sequence Modeling (CSM) paradigm, benefiting from the transformer's powerful distribution modeling capabilities, has demonstrated considerable promise in offline Reinforcement Learning (RL) tasks. Depending on the task's nature, it is crucial to carefully balance the interplay between inherent local features and long-term dependencies in Markov decision trajectories to mitigate potential performance degradation and unnecessary computational overhead. In this paper, we propose Decision Mixer (DM), which addresses the conflict between features of different scales in the modeling process from the perspective of dynamic integration. Drawing inspiration from conditional computation, we design a plug-and-play dynamic token selection mechanism to ensure the model can effectively allocate attention to different features based on task characteristics. Additionally, we employ an auxiliary predictor to alleviate the short-sightedness issue in the autoregressive sampling process. DM achieves state-of-the-art performance on various standard RL benchmarks while requiring significantly fewer computational resources, offering a viable solution for building efficient and scalable RL foundation models. Code is available at here.
Latent Action Learning Requires Supervision in the Presence of Distractors
Alexander Nikulin · Ilya Zisman · Denis Tarasov · Nikita Lyubaykin · Andrei Polubarov · Igor Kiselev · Vladislav Kurenkov
Recently, latent action learning, pioneered by Latent Action Policies (LAPO), have shown remarkable pre-training efficiency on observation-only data, offering potential for leveraging vast amounts of video available on the web for embodied AI. However, prior work has focused on distractor-free data, where changes between observations are primarily explained by ground-truth actions. Unfortunately, real-world videos contain action-correlated distractors that may hinder latent action learning. Using Distracting Control Suite (DCS) we empirically investigate the effect of distractors on latent action learning and demonstrate that LAPO struggle in such scenario. We propose LAOM, a simple LAPO modification that improves the quality of latent actions by 8x, as measured by linear probing. Importantly, we show that providing supervision with ground-truth actions, as few as 2.5% of the full dataset, during latent action learning improves downstream performance by 4.2x on average. Our findings suggest that integrating supervision during Latent Action Models (LAM) training is critical in the presence of distractors, challenging the conventional pipeline of first learning LAM and only then decoding from latent to ground-truth actions.
Accurate and Efficient World Modeling with Masked Latent Transformers
Maxime Burchi · Radu Timofte
The Dreamer algorithm has recently obtained remarkable performance across diverse environment domains by training powerful agents with simulated trajectories. However, the compressed nature of its world model's latent space can result in the loss of crucial information, negatively affecting the agent's performance. Recent approaches, such as $\Delta$-IRIS and DIAMOND, address this limitation by training more accurate world models. However, these methods require training agents directly from pixels, which reduces training efficiency and prevents the agent from benefiting from the inner representations learned by the world model. In this work, we propose an alternative approach to world modeling that is both accurate and efficient. We introduce EMERALD (Efficient MaskEd latent tRAnsformer worLD model), a world model using a spatial latent state with MaskGIT predictions to generate accurate trajectories in latent space and improve the agent performance. On the Crafter benchmark, EMERALD achieves new state-of-the-art performance, becoming the first method to surpass human experts performance within 10M environment steps. Our method also succeeds to unlock all 22 Crafter achievements at least once during evaluation.
Safety-Polarized and Prioritized Reinforcement Learning
Ke Fan · Jinpeng Zhang · Xuefeng Zhang · Yunze Wu · Jingyu Cao · Yuan Zhou · Jianzhu Ma
Motivated by the first priority of safety in many real-world applications, we propose \textsc{MaxSafe}, a chance-constrained bi-level optimization framework for safe reinforcement learning. \textsc{MaxSafe} first minimizes the unsafe probability and then maximizes the return among the safest policies. We provide a tailored Q-learning algorithm for the \textsc{MaxSafe} objective, featuring a novel learning process for \emph{optimal action masks} with theoretical convergence guarantees. To enable the application of our algorithm to large-scale experiments, we introduce two key techniques: \emph{safety polarization} and \emph{safety prioritized experience replay}. Safety polarization generalizes the optimal action masking by polarizing the Q-function, which assigns low values to unsafe state-action pairs, effectively discouraging their selection. In parallel, safety prioritized experience replay enhances the learning of optimal action masks by prioritizing samples based on temporal-difference (TD) errors derived from our proposed state-action reachability estimation functions. This approach efficiently addresses the challenges posed by sparse cost signals. Experiments on diverse autonomous driving and safe control tasks show that our methods achieve near-maximal safety and an optimal reward-safety trade-off.
MENTOR: Mixture-of-Experts Network with Task-Oriented Perturbation for Visual Reinforcement Learning
Suning Huang · Zheyu Zhang · Tianhai Liang · Yihan Xu · Zhehao Kou · Chenhao Lu · Guowei Xu · Zhengrong Xue · Huazhe Xu
Visual deep reinforcement learning (RL) enables robots to acquire skills from visual input for unstructured tasks. However, current algorithms suffer from low sample efficiency, limiting their practical applicability. In this work, we present MENTOR, a method that improves both the architecture and optimization of RL agents. Specifically, MENTOR replaces the standard multi-layer perceptron (MLP) with a mixture-of-experts (MoE) backbone and introduces a task-oriented perturbation mechanism. MENTOR outperforms state-of-the-art methods across three simulation benchmarks and achieves an average of 83\% success rate on three challenging real-world robotic manipulation tasks, significantly surpassing the 32% success rate of the strongest existing model-free visual RL algorithm. These results underscore the importance of sample efficiency in advancing visual RL for real-world robotics. Experimental videos are available at https://suninghuang19.github.io/mentor_page/.
Continual Reinforcement Learning by Planning with Online World Models
Zichen Liu · Guoji Fu · Chao Du · Wee Sun Lee · Min Lin
Continual reinforcement learning (CRL) refers to a naturalistic setting where an agent needs to endlessly evolve, by trial and error, to solve multiple tasks that are presented sequentially. One of the largest obstacles to CRL is that the agent may forget how to solve previous tasks when learning a new task, known as catastrophic forgetting. In this paper, we propose to address this challenge by planning with online world models. Specifically, we learn a Follow-The-Leader shallow model online to capture the world dynamics, in which we plan using model predictive control to solve a set of tasks specified by any reward functions. The online world model is immune to forgetting by construction with a proven regret bound of $\mathcal{O}(\sqrt{K^2D\log(T)})$ under mild assumptions. The planner searches actions solely based on the latest online model, thus forming a FTL Online Agent (OA) that updates incrementally. To assess OA, we further design Continual Bench, a dedicated environment for CRL, and compare with several strong baselines under the same model-planning algorithmic framework. The empirical results show that OA learns continuously to solve new tasks while not forgetting old skills, outperforming agents built on deep world models with various continual learning techniques.
Active Fine-Tuning of Multi-Task Policies
Marco Bagatella · Jonas Hübotter · Georg Martius · Andreas Krause
Pre-trained generalist policies are rapidly gaining relevance in robot learning due to their promise of fast adaptation to novel, in-domain tasks.This adaptation often relies on collecting new demonstrations for a specific task of interest and applying imitation learning algorithms, such as behavioral cloning.However, as soon as several tasks need to be learned, we must decide which tasks should be demonstrated and how often?We study this multi-task problem and explore an interactive framework in which the agent adaptively selects the tasks to be demonstrated.We propose AMF (Active Multi-task Fine-tuning), an algorithm to maximize multi-task policy performance under a limited demonstration budget by collecting demonstrations yielding the largest information gain on the expert policy.We derive performance guarantees for AMF under regularity assumptions and demonstrate its empirical effectiveness to efficiently fine-tune neural policies in complex and high-dimensional environments.
We present Gradient Boosting Reinforcement Learning (GBRL), a framework that adapts the strengths of Gradient Boosting Trees (GBT) to reinforcement learning (RL) tasks. While neural networks (NNs) have become the de facto choice for RL, they face significant challenges with structured and categorical features and tend to generalize poorly to out-of-distribution samples. These are challenges for which GBTs have traditionally excelled in supervised learning. However, GBT's application in RL has been limited. The design of traditional GBT libraries is optimized for static datasets with fixed labels, making them incompatible with RL's dynamic nature, where both state distributions and reward signals evolve during training. GBRL overcomes this limitation by continuously interleaving tree construction with environment interaction. Through extensive experiments, we demonstrate that GBRL outperforms NNs in domains with structured observations and categorical features, while maintaining competitive performance on standard continuous control benchmarks. Like its supervised learning counterpart, GBRL demonstrates superior robustness to out-of-distribution samples and better handles irregular state-action relationships.
Robust Offline Reinforcement Learning with Linearly Structured $f$-Divergence Regularization
Cheng Tang · Zhishuai Liu · Pan Xu
The Robust Regularized Markov Decision Process (RRMDP) is proposed to learn policies robust to dynamics shifts by adding regularization to the transition dynamics in the value function. Existing methods mostly use unstructured regularization, potentially leading to conservative policies under unrealistic transitions. To address this limitation, we propose a novel framework, the $d$-rectangular linear RRMDP ($d$-RRMDP), which introduces latent structures into both transition kernels and regularization. We focus on offline reinforcement learning, where an agent learns policies from a precollected dataset in the nominal environment. We develop the Robust Regularized Pessimistic Value Iteration (R2PVI) algorithm that employs linear function approximation for robust policy learning in $d$-RRMDPs with $f$-divergence based regularization terms on transition kernels. We provide instance-dependent upper bounds on the suboptimality gap of R2PVI policies, demonstrating that these bounds are influenced by how well the dataset covers state-action spaces visited by the optimal robust policy under robustly admissible transitions. We establish information-theoretic lower bounds to verify that our algorithm is near-optimal. Finally, numerical experiments validate that R2PVI learns robust policies and exhibits superior computational efficiency compared to baseline methods.
When Maximum Entropy Misleads Policy Optimization
Ruipeng Zhang · Ya-Chien Chang · Sicun Gao
The Maximum Entropy Reinforcement Learning (MaxEnt RL) framework is a leading approach for achieving efficient learning and robust performance across many RL tasks. However, MaxEnt methods have also been shown to struggle with performance-critical control problems in practice, where non-MaxEnt algorithms can successfully learn. In this work, we analyze how the trade-off between robustness and optimality affects the performance of MaxEnt algorithms in complex control tasks: while entropy maximization enhances exploration and robustness, it can also mislead policy optimization, leading to failure in tasks that require precise, low-entropy policies. Through experiments on a variety of control problems, we concretely demonstrate this misleading effect. Our analysis leads to better understanding of how to balance reward design and entropy maximization in challenging control problems.
NBDI: A Simple and Effective Termination Condition for Skill Extraction from Task-Agnostic Demonstrations
Myunsoo Kim · Hayeong Lee · Seong-Woong Shim · JunHo Seo · Byung-Jun Lee
Intelligent agents are able to make decisions based on different levels of granularity and duration. Recent advances in skill learning enabled the agent to solve complex, long-horizon tasks by effectively guiding the agent in choosing appropriate skills. However, the practice of using fixed-length skills can easily result in skipping valuable decision points, which ultimately limits the potential for further exploration and faster policy learning.In this work, we propose to learn a simple and effective termination condition that identifies decision points through a state-action novelty module that leverages agent experience data.Our approach, Novelty-based Decision Point Identification (NBDI), outperforms previous baselines in complex, long-horizon tasks, and remains effective even in the presence of significant variations in the environment configurations of downstream tasks, highlighting the importance of decision point identification in skill learning.
Inverse Optimization via Learning Feasible Regions
Ke Ren · Peyman Mohajerin Esfahani · Angelos Georghiou
We study inverse optimization (IO), where the goal is to use a parametric optimization program as the hypothesis class to infer relationships between input-decision pairs. Most of the literature focuses on learning only the objective function, as learning the constraint function (i.e., feasible regions) leads to nonconvex training programs. Motivated by this, we focus on learning feasible regions for known linear objectives, and introduce two training losses along with a hypothesis class to parameterize the constraint function. Our hypothesis class surpasses the previous objective-only method by naturally capturing discontinuous behaviors in input-decision pairs. We introduce a customized block coordinate descent algorithm with a smoothing technique to solve the training problems, while for further restricted hypothesis classes, we reformulate the training optimization as a tractable convex program or mixed integer linear program. Synthetic experiments and two power system applications including comparisons with state-of-the-art approaches showcase and validate the proposed approach.
Fast, Accurate Manifold Denoising by Tunneling Riemannian Optimization
Shiyu Wang · Mariam Avagyan · Yihan Shen · Arnaud Lamy · Tingran Wang · Szabolcs Marka · Zsuzsanna Marka · John Wright
Learned denoisers play a fundamental role in various signal generation (e.g., diffusion models) and reconstruction (e.g., compressed sensing) architectures, whose success derives from their ability to leverage low-dimensional structure in data. Existing denoising methods, however, either rely on local approximations that require a linear scan of the entire dataset or treat denoising as generic function approximation problems, sacrificing efficiency and interpretability. We consider the problem of efficiently denoising a new noisy data point sampled from an unknown manifold $\mathcal M \in \mathbb{R}^D$, using only noisy samples. This work proposes a framework for test-time efficient manifold denoising, by framing the concept of "learning-to-denoise" as *"learning-to-optimize"*. We have two technical innovations: (i) *online learning* methods which learn to optimize over the manifold of clean signals using only noisy data, effectively "growing" an optimizer one sample at a time. (ii) *mixed-order* methods which guarantee that the learned optimizers achieve global optimality, ensuring both efficiency and near-optimal denoising performance. We corroborate these claims with theoretical analyses of both the complexity and denoising performance of mixed-order traversal. Our experiments on scientific manifolds demonstrate significantly improved complexity-performance tradeoffs compared to nearest neighbor search, which underpins existing provable denoising approaches based on exhaustive search.
Variance-Reduced Forward-Reflected-Backward Splitting Methods for Nonmonotone Generalized Equations
Quoc Tran-Dinh
We develop two novel stochastic variance-reduction methods to approximate solutions of a class of nonmonotone [generalized] equations. Our algorithms leverage a new combination of ideas from the forward-reflected-backward splitting method and a class of unbiased variance-reduced estimators. We construct two new stochastic estimators within this class, inspired by the well-known SVRG and SAGA estimators. These estimators significantly differ from existing approaches used in minimax and variational inequality problems. By appropriately choosing parameters, both algorithms achieve state-of-the-art oracle complexity of $\mathcal{O}(n + n^{2/3} \epsilon^{-2})$ for obtaining an $\epsilon$-solution in terms of the operator residual norm for a class of nonmonotone problems, where $n$ is the number of summands and $\epsilon$ signifies the desired accuracy. This complexity aligns with the best-known results in SVRG and SAGA methods for stochastic nonconvex optimization. We test our algorithms on some numerical examples and compare them with existing methods. The results demonstrate promising improvements offered by the new methods compared to their competitors.
Non-Asymptotic and Non-Lipschitzian Bounds on Optimal Values in Stochastic Optimization Under Heavy Tails
Jindong Tong · Hongcheng Liu · Johannes Royset
This paper focuses on non-asymptotic confidence bounds (CB) for the optimal values of stochastic optimization (SO) problems. Existing approaches often rely on two conditions that may be restrictive: The need for a global Lipschitz constant and the assumption of light-tailed distributions. Beyond either of the conditions, it remains largely unknown whether computable CBs can be constructed. In view of this literature gap, we provide three key findings below: (i) Based on the conventional formulation of sample average approximation (SAA), we derive non-Lipschitzian CBs for convex SP problems under heavy tails. (ii) We explore diametrical risk minimization (DRM)---a recently introduced modification to SAA---and attain non-Lipschitzian CBs for nonconvex SP problems in light-tailed settings. (iii) We extend our analysis of DRM to handle heavy-tailed randomness by utilizing properties in formulations for training over-parameterized classification models.
Stacey: Promoting Stochastic Steepest Descent via Accelerated $\ell_p$-Smooth Nonconvex Optimization
Xinyu Luo · Cedar Site Bai · Bolian Li · Petros Drineas · Ruqi Zhang · Brian Bullins
While popular optimization methods such as SGD, AdamW, and Lion depend on steepest descent updates in either $\ell_2$ or $\ell_\infty$ norms, there remains a critical gap in handling the non-Euclidean structure observed in modern deep networks training. In this work, we address this need by introducing a new accelerated $\ell_p$ steepest descent algorithm, called Stacey, which uses interpolated primal-dual iterate sequences to effectively navigate non-Euclidean smooth optimization tasks. In addition to providing novel theoretical guarantees for the foundations of our algorithm, we empirically compare our approach against these popular methods on tasks including image classification and language model (LLM) pretraining, demonstrating both faster convergence and higher final accuracy. We further evaluate different values of $p$ across various models and datasets, underscoring the importance and efficiency of non-Euclidean approaches over standard Euclidean methods. Code can be found at https://github.com/xinyuluo8561/Stacey.
With rapid advancements in machine learning, first-order algorithms have emerged as the backbone of modern optimization techniques, owing to their computational efficiency and low memory requirements. Recently, the connection between accelerated gradient methods and damped heavy-ball motion, particularly within the framework of Hamiltonian dynamics, has inspired the development of innovative quantum algorithms for continuous optimization. One such algorithm, Quantum Hamiltonian Descent (QHD), leverages quantum tunneling to escape saddle points and local minima, facilitating the discovery of global solutions in complex optimization landscapes. However, QHD faces several challenges, including slower convergence rates compared to classical gradient methods and limited robustness in highly non-convex problems due to the non-local nature of quantum states. Furthermore, the original QHD formulation primarily relies on function value information, which limits its effectiveness. Inspired by insights from high-resolution differential equations that have elucidated the acceleration mechanisms in classical methods, we propose an enhancement to QHD by incorporating gradient information, leading to what we call gradient-based QHD. This gradient-based QHD achieves faster convergence and significantly increases the likelihood of identifying global solutions. Numerical simulations on challenging problem instances demonstrate that this gradient-based QHD outperforms existing quantum and classical methods by at least an order of magnitude.
Contextual Optimization Under Model Misspecification: A Tractable and Generalizable Approach
Omar Bennouna · Jiawei Zhang · Saurabh Amin · Asuman Ozdaglar
Contextual optimization problems are prevalent in decision-making applications where historical data and contextual features are used to learn predictive models that inform optimal actions. However, practical applications often suffer from model misspecification due to incomplete knowledge of the underlying data-generating process, leading to suboptimal decisions. Existing approaches primarily address the well-specified case, leaving a critical gap in handling misspecified models. In this paper, we propose a novel Integrated Learning and Optimization (ILO) framework that explicitly accounts for model misspecification by introducing a tractable surrogate loss function with strong theoretical guarantees on generalizability, tractability, and optimality. Our surrogate loss aligns with the true decision performance objective, ensuring robustness to misspecification without imposing restrictive assumptions. The proposed approach effectively mitigates the challenges of non-convexity and non-smoothness in the target loss function, leading to efficient optimization procedures. We provide rigorous theoretical analysis and experimental validation, demonstrating superior performance compared to state-of-the-art methods. Our work offers a principled solution to the practically relevant challenge of model misspecification in contextual optimization.
Enhancing Parallelism in Decentralized Stochastic Convex Optimization
Ofri Eisen · Ron Dorfman · Kfir Levy
Decentralized learning has emerged as a powerful approach for handling large datasets across multiple machines in a communication-efficient manner. However, such methods often face scalability limitations, as increasing the number of machines beyond a certain point negatively impacts convergence rates. In this work, we propose Decentralized Anytime SGD, a novel decentralized learning algorithm that significantly extends the critical parallelism threshold, enabling the effective use of more machines without compromising performance. Within the stochastic convex optimization (SCO) framework, we establish a theoretical upper bound on parallelism that surpasses the current state-of-the-art, allowing larger networks to achieve favorable statistical guarantees and closing the gap with centralized learning in highly connected topologies.
RobustZero: Enhancing MuZero Reinforcement Learning Robustness to State Perturbations
Yushuai Li · Hengyu Liu · Torben Pedersen · Yuqiang He · Kim Larsen · Lu Chen · Christian Jensen · Jiachen Xu · TIANYI LI
The MuZero reinforcement learning method has achieved superhuman performance at games, and advances that enable MuZero to contend with complex actions now enable use of MuZero-class methods in real-world decision-making applications. However, some real-world applications are susceptible to state perturbations caused by malicious attacks and noisy sensors. To enhance the robustness of MuZero-class methods to state perturbations, we propose RobustZero, the first MuZero-class method that is $\underline{robust}$ to worst-case and random-case state perturbations, with $\underline{zero}$ prior knowledge of the environment’s dynamics. We present a training framework for RobustZero that features a self-supervised representation network, targeting the generation of a consistent initial hidden state, which is key to obtain consistent policies before and after state perturbations, and it features a unique loss function that facilitates robustness. We present an adaptive adjustment mechanism to enable model update, enhancing robustness to both worst-case and random-case state perturbations. Experiments on two classical control environments, three energy system environments, three transportation environments, and four Mujoco environments demonstrate that RobustZero can outperform state-of-the-art methods at defending against state perturbations.
Knowledge Retention in Continual Model-Based Reinforcement Learning
Haotian Fu · Yixiang Sun · Michael L. Littman · George Konidaris
We propose DRAGO, a novel approach for continual model-based reinforcement learning aimed at improving the incremental development of world models across a sequence of tasks that differ in their reward functions but not the state space or dynamics. DRAGO comprises two key components: Synthetic Experience Rehearsal, which leverages generative models to create synthetic experiences from past tasks, allowing the agent to reinforce previously learned dynamics without storing data, and Regaining Memories Through Exploration, which introduces an intrinsic reward mechanism to guide the agent toward revisiting relevant states from prior tasks. Together, these components enable the agent to maintain a comprehensive and continually developing world model, facilitating more effective learning and adaptation across diverse environments. Empirical evaluations demonstrate that DRAGO is able to preserve knowledge across tasks, achieving superior performance in various continual learning scenarios.
Mitigating Plasticity Loss in Continual Reinforcement Learning by Reducing Churn
Hongyao Tang · Johan Obando-Ceron · Pablo Samuel Castro · Aaron Courville · Glen Berseth
Plasticity, or the ability of an agent to adapt to new tasks, environments, or distributions, is crucial for continual learning. In this paper, we study the loss of plasticity in deep continual RL from the lens of churn: network output variability induced by the data in each training batch. We demonstrate that (1) the loss of plasticity is accompanied by the exacerbation of churn due to the gradual rank decrease of the Neural Tangent Kernel (NTK) matrix; (2) reducing churn helps prevent rank collapse and adjusts the step size of regular RL gradients adaptively. Moreover, we introduce Continual Churn Approximated Reduction (C-CHAIN) and demonstrate it improves learning performance and outperforms baselines in a diverse range of continual learning environments on OpenAI Gym Control, ProcGen, DeepMind Control Suite, and MinAtar benchmarks.
Value-Based Deep RL Scales Predictably
Oleh Rybkin · Michal Nauman · Preston Fu · Charlie Snell · Pieter Abbeel · Sergey Levine · Aviral Kumar
Scaling data and compute is critical in modern machine learning. However, scaling also demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from low compute or low data runs, without ever running the large-scale experiment. In this paper, we show predictability of value-based off-policy deep RL. First, we show that data and compute requirements to reach a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can extrapolate data requirements into a higher compute regime, and compute requirements into a higher data regime. Second, we determine the optimal allocation of total budget across data and compute to obtain given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between different hyperparameters, which is used to counteract effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.
LipsNet++: Unifying Filter and Controller into a Policy Network
Xujie Song · Liangfa Chen · Tong Liu · Wenxuan Wang · Yinuo Wang · Shentao Qin · Yinsong Ma · Jingliang Duan · Shengbo Li
Deep reinforcement learning (RL) is effective for decision-making and control tasks like autonomous driving and embodied AI. However, RL policies often suffer from the action fluctuation problem in real-world applications, resulting in severe actuator wear, safety risk, and performance degradation. This paper identifies the two fundamental causes of action fluctuation: observation noise and policy non-smoothness. We propose LipsNet++, a novel policy network with Fourier filter layer and Lipschitz controller layer to separately address both causes. The filter layer incorporates a trainable filter matrix that automatically extracts important frequencies while suppressing noise frequencies in the observations. The controller layer introduces a Jacobian regularization technique to achieve a low Lipschitz constant, ensuring smooth fitting of a policy function. These two layers function analogously to the filter and controller in classical control theory, suggesting that filtering and control capabilities can be seamlessly integrated into a single policy network. Both simulated and real-world experiments demonstrate that LipsNet++ achieves the state-of-the-art noise robustness and action smoothness. The code and videos are publicly available at https://xjsong99.github.io/LipsNet_v2.
ARS: Adaptive Reward Scaling for Multi-Task Reinforcement Learning
MYUNG-SIK CHO · Jong Eui Park · Jeonghye Kim · Youngchul Sung
Multi-task reinforcement learning (RL) encounters significant challenges due to varying task complexities and their reward distributions from the environment. To address these issues, in this paper, we propose Adaptive Reward Scaling (ARS), a novel framework that dynamically adjusts reward magnitudes and leverages a periodic network reset mechanism. ARS introduces a history-based reward scaling strategy that ensures balanced reward distributions across tasks, enabling stable and efficient training. The reset mechanism complements this approach by mitigating overfitting and ensuring robust convergence. Empirical evaluations on the Meta-World benchmark demonstrate that ARS significantly outperforms baseline methods, achieving superior performance on challenging tasks while maintaining overall learning efficiency. These results validate ARS's effectiveness in tackling diverse multi-task RL problems, paving the way for scalable solutions in complex real-world applications.
SOLD: Slot Object-Centric Latent Dynamics Models for Relational Manipulation Learning from Pixels
Malte Mosbach · Jan Ewertz · Angel Villar-Corrales · Sven Behnke
Learning a latent dynamics model provides a task-agnostic representation of an agent's understanding of its environment. Leveraging this knowledge for model-based reinforcement learning (RL) holds the potential to improve sample efficiency over model-free methods by learning from imagined rollouts. Furthermore, because the latent space serves as input to behavior models, the informative representations learned by the world model facilitate efficient learning of desired skills. Most existing methods rely on holistic representations of the environment’s state. In contrast, humans reason about objects and their interactions, predicting how actions will affect specific parts of their surroundings. Inspired by this, we propose Slot-Attention for Object-centric Latent Dynamics (SOLD), a novel model-based RL algorithm that learns object-centric dynamics models in an unsupervised manner from pixel inputs. We demonstrate that the structured latent space not only improves model interpretability but also provides a valuable input space for behavior models to reason over. Our results show that SOLD outperforms DreamerV3 and TD-MPC2 - state-of-the-art model-based RL algorithms - across a range of multi-object manipulation environments that require both relational reasoning and dexterous control. Videos and code are available at https:// slot-latent-dynamics.github.io.
SENSEI: Semantic Exploration Guided by Foundation Models to Learn Versatile World Models
Cansu Sancaktar · Christian Gumbsch · Andrii Zadaianchuk · Pavel Kolev · Georg Martius
Exploration is a cornerstone of reinforcement learning (RL). Intrinsic motivation attempts to decouple exploration from external, task-based rewards. However, established approaches to intrinsic motivation that follow general principles such as information gain, often only uncover low-level interactions. In contrast, children’s play suggests that they engage in meaningful high-level behavior by imitating or interacting with their caregivers. Recent work has focused on using foundation models to inject these semantic biases into exploration. However, these methods often rely on unrealistic assumptions, such as language-embedded environments or access to high-level actions. We propose SEmaNtically Sensible ExploratIon (SENSEI), a framework to equip model-based RL agents with an intrinsic motivation for semantically meaningful behavior. SENSEI distills a reward signal of interestingness from Vision Language Model (VLM) annotations, enabling an agent to predict these rewards through a world model. Using model-based RL, SENSEI trains an exploration policy that jointly maximizes semantic rewards and uncertainty. We show that in both robotic and video game-like simulations SENSEI discovers a variety of meaningful behaviors from image observations and low-level actions. SENSEI provides a general tool for learning from foundation model feedback, a crucial research direction, as VLMs become more powerful.
Return Capping: Sample Efficient CVaR Policy Gradient Optimisation
Harry Mead · Clarissa Costen · Bruno Lacerda · Nick Hawes
When optimising for conditional value at risk (CVaR) using policy gradients (PG), current methods rely on discarding a large proportion of trajectories, resulting in poor sample efficiency. We propose a reformulation of the CVaR optimisation problem by capping the total return of trajectories used in training, rather than simply discarding them, and show that this is equivalent to the original problem if the cap is set appropriately. We show, with empirical results in an number of environments, that this reformulation of the problem results in consistently improved performance compared to baselines. We have made all our code available here: \url{https://github.com/HarryMJMead/cvar-return-capping}.
Imitation Learning from a Single Temporally Misaligned Video
William Huey · Yuki (Huaxiaoyue) Wang · Anne Wu · Yoav Artzi · Sanjiban Choudhury
We examine the problem of learning sequential tasks from a single visual demonstration.A key challenge arises when demonstrations are temporally misaligned due to variations in timing, differences in embodiment, or inconsistencies in execution. Existing approaches treat imitation as a distribution-matching problem, aligning individual frames between the agent and the demonstration. However, we show that such frame-level matching fails to enforce temporal ordering or ensure consistent progress.Our key insight is that matching should instead be defined at the level of sequences. We propose that perfect matching occurs when one sequence successfully covers all the subgoals in the same order as the other sequence. We present ORCA (ORdered Coverage Alignment), a dense per-timestep reward function that measures the probability of the agent covering demonstration frames in the correct order. On temporally misaligned demonstrations, we show that agents trained with the ORCA reward achieve $4.5$x improvement ($0.11 \rightarrow 0.50$ average normalized returns) for Meta-world tasks and $6.6$x improvement ($6.55 \rightarrow 43.3$ average returns) for Humanoid-v4 tasks compared to the best frame-level matching algorithms. We also provide empirical analysis showing that ORCA is robust to varying levels of temporal misalignment. The project website is at https://portal-cornell.github.io/orca/
Explicit Exploration for High-Welfare Equilibria in Game-Theoretic Multiagent Reinforcement Learning
Austin Nguyen · Anri Gu · Michael Wellman
Iterative extension of empirical game models through deep reinforcement learning (RL) has proved an effective approach for finding equilibria in complex games. When multiple equilibria exist, we may also be interested in finding solutions with particular characteristics. We address this issue of equilibrium selection in the context of Policy Space Response Oracles (PSRO), a flexible game-solving framework based on deep RL, by skewing the strategy exploration process towards higher-welfare solutions. At each iteration, we create an exploration policy that imitates high welfare-yielding behavior and train a response to the current solution, regularized to be similar to the exploration policy. With no additional simulation expense, our approach, named Ex$^2$PSRO, tends to find higher welfare equilibria than vanilla PSRO in two benchmarks: a sequential bargaining game and a social dilemma game. Further experiments demonstrate Ex$^2$PSRO's composability with other PSRO variants and illuminate the relationship between exploration policy choice and algorithmic performance.
Risk-Sensitive Theory of Mind: Coordinating with Agents of Unknown Bias using Cumulative Prospect Theory
Mason O. Smith · Wenlong Zhang
Humans are often modeled as rational actors by interactive agents when they are in fact frequently observed to make biased decisions. This erroneous assumption may cause an agent’s model of the human to fail, especially when interaction occurs in bias-inducing settings that prompt risky decisions. To address this, this paper formulates a risk-sensitive multi-agent coordination problem and presents the novel Risk-Sensitive Theory of Mind (RS-ToM) framework that allows an autonomous agent to reason about and adapt to a partner of unknown risk-sensitivity. In simulated studies, we show that an agent with an RS-ToM is able to better coordinate with such a partner when compared to an agent that assumes their partner is rational. Thus, we observe significant improvements to team performance, coordination fluency, compliance with partner risk-preferences, and predictability. The presented results suggest that an RS-ToM will be able to model and plan with partners that exhibit these risk-sensitive biases in the real world.
Finite-Sample Convergence Bounds for Trust Region Policy Optimization in Mean Field Games
Antonio Ocello · Daniil Tiapkin · Lorenzo Mancini · Mathieu Lauriere · Eric Moulines
We introduce Mean Field Trust Region Policy Optimization (MF-TRPO), a novel algorithm designed to compute approximate Nash equilibria for ergodic Mean Field Games (MFGs) in finite state-action spaces. Building on the well-established performance of TRPO in the reinforcement learning (RL) setting, we extend its methodology to the MFG framework, leveraging its stability and robustness in policy optimization. Under standard assumptions in the MFG literature, we provide a rigorous analysis of MF-TRPO, establishing theoretical guarantees on its convergence. Our results cover both the exact formulation of the algorithm and its sample-based counterpart, where we derive high-probability guarantees and finite sample complexity. This work advances MFG optimization by bridging RL techniques with mean-field decision-making, offering a theoretically grounded approach to solving complex multi-agent problems.
Leveraging Skills from Unlabeled Prior Data for Efficient Online Exploration
Max Wilcoxson · Qiyang Li · Kevin Frans · Sergey Levine
Unsupervised pretraining has been transformative in many supervised domains. However, applying such ideas to reinforcement learning (RL) presents a unique challenge in that fine-tuning does not involve mimicking task-specific data, but rather exploring and locating the solution through iterative self-improvement. In this work, we study how unlabeled offline trajectory data can be leveraged to learn efficient exploration strategies. While prior data can be used to pretrain a set of low-level skills, or as additional off-policy data for online RL, it has been unclear how to combine these ideas effectively for online exploration. Our method SUPE (Skills from Unlabeled Prior data for Exploration) demonstrates that a careful combination of these ideas compounds their benefits. Our method first extracts low-level skills using a variational autoencoder (VAE), and then pseudo-labels unlabeled trajectories with optimistic rewards and high-level action labels, transforming prior data into high-level, task-relevant examples that encourage novelty-seeking behavior. Finally, SUPE uses these transformed examples as additional off-policy data for online RL to learn a high-level policy that composes pretrained low-level skills to explore efficiently. In our experiments, SUPE consistently outperforms prior strategies across a suite of 42 long-horizon, sparse-reward tasks.
Novelty Detection in Reinforcement Learning with World Models
Geigh Zollicoffer · Kenneth Eaton · Jonathan Balloch · Julia Kim · Wei Zhou · Robert Wright · Mark Riedl
Reinforcement learning (RL) using world models has found significant recent successes.However, when a sudden change to world mechanics or properties occurs then agent performance and reliability can dramatically decline.We refer to the sudden change in visual properties or state transitions as novelties.Implementing novelty detection within generated world model frameworks is a crucialtask for protecting the agent when deployed. In this paper, we propose straightforward bounding approaches to incorporate novelty detection into world model RL agents by utilizing the misalignment of the world model's hallucinated states and the true observed states as a novelty score. We provideeffective approaches to detecting novelties in a distribution of transitions learned by an agent ina world model. Finally, we show the advantage ofour work in a novel environment compared to traditional machine learning novelty detection methods as well as currently accepted RL-focused novelty detection algorithms.
Local Manifold Approximation and Projection for Manifold-Aware Diffusion Planning
Kyowoon Lee · Jaesik Choi
Recent advances in diffusion-based generative modeling have demonstrated significant promise in tackling long-horizon, sparse-reward tasks by leveraging offline datasets. While these approaches have achieved promising results, their reliability remains inconsistent due to the inherent stochastic risk of producing infeasible trajectories, limiting their applicability in safety-critical applications. We identify that the primary cause of these failures is inaccurate guidance during the sampling procedure, and demonstrate the existence of manifold deviation by deriving a lower bound on the guidance gap. To address this challenge, we propose Local Manifold Approximation and Projection (LoMAP), a training-free method that projects the guided sample onto a low-rank subspace approximated from offline datasets, preventing infeasible trajectory generation. We validate our approach on standard offline reinforcement learning benchmarks that involve challenging long-horizon planning. Furthermore, we show that, as a standalone module, LoMAP can be incorporated into the hierarchical diffusion planner, providing further performance enhancements.
Monte Carlo Tree Diffusion for System 2 Planning
Jaesik Yoon · Hyeonseo Cho · Doojin Baek · Yoshua Bengio · Sungjin Ahn
Diffusion models have recently emerged as a powerful tool for planning. However, unlike Monte Carlo Tree Search (MCTS)—whose performance naturally improves with inference-time computation scaling—standard diffusion‐based planners offer only limited avenues for the scalability. In this paper, we introduce Monte Carlo Tree Diffusion (MCTD), a novel framework that integrates the generative strength of diffusion models with the adaptive search capabilities of MCTS. Our method reconceptualizes denoising as a tree‐structured process, allowing partially denoised plans to be iteratively evaluated, pruned, and refined. By selectively expanding promising trajectories while retaining the flexibility to revisit and improve suboptimal branches, MCTD achieves the benefits of MCTS such as controlling exploration-exploitation trade-offs within the diffusion framework. Empirical results on challenging long‐horizon tasks show that MCTD outperforms diffusion baselines, yielding higher‐quality solutions as inference-time computation increases.
Provable Policy Gradient for Robust Average-Reward MDPs Beyond Rectangularity
Qiuhao Wang · Yuqi Zha · Chin Pang Ho · Marek Petrik
Robust Markov Decision Processes (MDPs) offer a promising framework for computing reliable policies under model uncertainty. While policy gradient methods have gained increasing popularity in robust discounted MDPs, their application to the average-reward criterion remains largely unexplored. This paper proposes a Robust Projected Policy Gradient (RP2G), the first generic policy gradient method for robust average-reward MDPs (RAMDPs) that is applicable beyond the typical rectangularity assumption on transition ambiguity. In contrast to existing robust policy gradient algorithms, RP2G incorporates an adaptive decreasing tolerance mechanism for efficient policy updates at each iteration. We also present a comprehensive convergence analysis of RP2G for solving ergodic tabular RAMDPs. Furthermore, we establish the first study of the inner worst-case transition evaluation problem in RAMDPs, proposing two gradient-based algorithms tailored for rectangular and general ambiguity sets, each with provable convergence guarantees. Numerical experiments confirm the global convergence of our new algorithm and demonstrate its superior performance.
Offline Opponent Modeling with Truncated Q-driven Instant Policy Refinement
Yuheng Jing · Kai Li · Bingyun Liu · Ziwen Zhang · Haobo Fu · Qiang Fu · Junliang Xing · Jian Cheng
Offline Opponent Modeling (OOM) aims to learn an adaptive autonomous agent policy that dynamically adapts to opponents using an offline dataset from multi-agent games. Previous work assumes that the dataset is optimal. However, this assumption is difficult to satisfy in the real world. When the dataset is suboptimal, existing approaches struggle to work. To tackle this issue, we propose a simple and general algorithmic improvement framework, Truncated Q-driven Instant Policy Refinement (TIPR), to handle the suboptimality of OOM algorithms induced by datasets. The TIPR framework is plug-and-play in nature. Compared to original OOM algorithms, it requires only two extra steps: (1) Learn a horizon-truncated in-context action-value function, namely Truncated Q, using the offline dataset. The Truncated Q estimates the expected return within a fixed, truncated horizon and is conditioned on opponent information. (2) Use the learned Truncated Q to instantly decide whether to perform policy refinement and to generate policy after refinement during testing. Theoretically, we analyze the rationale of Truncated Q from the perspective of No Maximization Bias probability. Empirically, we conduct extensive comparison and ablation experiments in four representative competitive environments. TIPR effectively improves various OOM algorithms pretrained with suboptimal datasets.
Fundamental Bias in Inverting Random Sampling Matrices with Application to Sub-sampled Newton
Chengmei Niu · Zhenyu Liao · Zenan Ling · Michael Mahoney
A substantial body of work in machine learning (ML) and randomized numerical linear algebra (RandNLA) has exploited various sorts of random sketching methodologies, including random sampling and random projection, with much of the analysis using Johnson--Lindenstrauss and subspace embedding techniques. Recent studies have identified the issue of inversion bias -- the phenomenon that inverses of random sketches are not unbiased, despite the unbiasedness of the sketches themselves. This bias presents challenges for the use of random sketches in various ML pipelines, such as fast stochastic optimization, scalable statistical estimators, and distributed optimization. In the context of random projection, the inversion bias can be easily corrected for dense Gaussian projections (which are, however, too expensive for many applications). Recent work has shown how the inversion bias can be corrected for sparse sub-gaussian projections. In this paper, we show how the inversion bias can be corrected for random sampling methods, both uniform and non-uniform leverage-based, as well as for structured random projections, including those based on the Hadamard transform. Using these results, we establish problem-independent local convergence rates for sub-sampled Newton methods.
Theoretical guarantees on the best-of-n alignment policy
Ahmad Beirami · Alekh Agarwal · Jonathan Berant · Alexander D'Amour · Jacob Eisenstein · Chirag Nagpal · Ananda Suresh
A simple and effective method for the inference-time alignment of generative models is the best-of-$n$ policy, where $n$ samples are drawn from a reference policy, ranked based on a reward function, and the highest ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log (n) - (n-1)/n.$ We disprove the validity of this claim, and show that it is an upper bound on the actual KL divergence. We also explore the tightness of this upper bound in different regimes, and propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$ and derive bounds on the tightness of this characterization. We conclude with analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy, which demonstrate that very good tradeoffs are achievable with $n < 1000$.
Convergence of Consistency Model with Multistep Sampling under General Data Assumptions
Yiding Chen · Yiyi Zhang · Owen Oertell · Wen Sun
Diffusion models accomplish remarkable success in data generation tasks across various domains. However, the iterative sampling process is computationally expensive. Consistency models are proposed to learn consistency functions to map from noise to data directly, which allows one-step fast data generation and multistep sampling to improve sample quality. In this paper, we study the convergence of consistency models when the self-consistency property holds approximately under the training distribution. Our analysis requires only mild data assumption and applies to a family of forward processes. When the target data distribution has bounded support or has tails that decay sufficiently fast, we show that the samples generated by the consistency model are close to the target distribution in Wasserstein distance; when the target distribution satisfies some smoothness assumption, we show that with an additional perturbation step for smoothing, the generated samples are close to the target distribution in total variation distance. We provide two case studies with commonly chosen forward processes to demonstrate the benefit of multistep sampling.
Social predictions do not passively describe the future; they actively shape it. They inform actions and change individual expectations in ways that influence the likelihood of the predicted outcome. Given these dynamics, to what extent can social events be predicted? This question was discussed throughout the 20th century by authors like Merton, Morgenstern, Simon, and others who considered it a central issue in social science methodology. In this work, we provide a modern answer to this old problem. Using recent ideas from performative prediction and outcome indistinguishability, we establish that one can always efficiently predict social events accurately, regardless of how predictions influence data. While achievable, we also show that these predictions are often undesirable, highlighting the limitations of previous desiderata. We end with a discussion of various avenues forward.
Fundamental limits of learning in sequence multi-index models and deep attention networks: high-dimensional asymptotics and sharp thresholds
Emanuele Troiani · Hugo Cui · Yatin Dandi · FLORENT KRZAKALA · Lenka Zdeborová
In this manuscript, we study the learning of deep attention neural networks, defined as the composition of multiple self-attention layers, with tied and low-rank weights. We first establish a mapping of such models to sequence multi-index models, a generalization of the widely studied multi-index model to sequential covariates, for which we establish a number of general results. In the context of Bayes-optimal learning, in the limit of large dimension $D$ and proportionally large number of samples $N$, we derive a sharp asymptotic characterization of the optimal performance as well as the performance of the best-known polynomial-time algorithm for this setting --namely approximate message-passing--, and characterize sharp thresholds on the minimal sample complexity required for better-than-random prediction performance. Our analysis uncovers, in particular, how the different layers are learned sequentially. Finally, we discuss how this sequential learning can also be observed in a realistic setup.
PDUDT: Provable Decentralized Unlearning under Dynamic Topologies
Jing Qiao · Yu Liu · Zengzhe Chen · Mingyi Li · YUAN YUAN · Xiao Zhang · Dongxiao Yu
This paper investigates decentralized unlearning, aiming to eliminate the impact of a specific client on the whole decentralized system. However, decentralized communication characterizations pose new challenges for effective unlearning: the indirect connections make it difficult to trace the specific client's impact, while the dynamic topology limits the scalability of retraining-based unlearning methods.In this paper, we propose the first **P**rovable **D**ecentralized **U**nlearning algorithm under **D**ynamic **T**opologies called PDUDT. It allows clients to eliminate the influence of a specific client without additional communication or retraining. We provide rigorous theoretical guarantees for PDUDT, showing it is statistically indistinguishable from perturbed retraining. Additionally, it achieves an efficient convergence rate of $\mathcal{O}(\frac{1}{T})$ in subsequent learning, where $T$ is the total communication rounds. This rate matches state-of-the-art results. Experimental results show that compared with the Retrain method, PDUDT saves more than 99\% of unlearning time while achieving comparable unlearning performance.
Laplace Transform Based Low-Complexity Learning of Continuous Markov Semigroups
Vladimir Kostic · Karim Lounici · Hélène Halconruy · Timothée Devergne · Pietro Novelli · Massimiliano Pontil
Markov processes serve as universal models for many real-world random processes. This paper presents a data-driven approach to learning these models through the spectral decomposition of the infinitesimal generator (IG) of the Markov semigroup. Its unbounded nature complicates traditional methods such as vector-valued regression and Hilbert-Schmidt operator analysis. Existing techniques, including physics-informed kernel regression, are computationally expensive and limited in scope, with no recovery guarantees for transfer operator methods when the time-lag is small. We propose a novel method leveraging the IG's resolvent, characterized by the Laplace transform of transfer operators. This approach is robust to time-lag variations, ensuring accurate eigenvalue learning even for small time-lags. Our statistical analysis applies to a broader class of Markov processes than current methods while reducing computational complexity from quadratic to linear in the state dimension. Finally, we demonstrate our theoretical findings in several experiments.
Learning Parametric Distributions from Samples and Preferences
Marc Jourdan · Gizem Yüce · Nicolas Flammarion
Recent advances in language modeling have underscored the role of preference feedback in enhancing model performance. This paper investigates the conditions under which preference feedback improves parameter estimation in classes of continuous parametric distributions. In our framework, the learner observes pairs of samples from an unknown distribution along with their relative preferences depending on the same unknown parameter. We show that preferences-based M-estimators achieve a better asymptotic variance than sample-only M-estimators, further improved by deterministic preferences. Leveraging the hard constraints revealed by deterministic preferences, we propose an estimator achieving an estimation error scaling of $\mathcal{O}(1/n)$---a significant improvement over the $\Theta(1/\sqrt{n})$ rate attainable with samples alone. Next, we establish a lower bound that matches this accelerated rate; up to problem-dependent constants. While the assumptions underpinning our analysis are restrictive, they are satisfied by notable cases such as Gaussian or Laplace distributions for preferences based on the log-probability reward.
PAC Learning with Improvements
Idan Attias · Avrim Blum · Keziah Naggita · Donya Saless · Dravyansh Sharma · Matthew Walter
One of the most basic lower bounds in machine learning is that in nearly any nontrivial setting, it takes at least $1/\epsilon$ samples to learn to error $\epsilon$ (and more, if the classifier being learned is complex). However, suppose that data points are agents who have the ability to improve by a small amount if doing so will allow them to receive a (desired) positive classification. In that case, we may actually be able to achieve zero error by just being "close enough". For example, imagine a hiring test used to measure an agent's skill at some job such that for some threshold $\theta$, agents who score above $\theta$ will be successful and those who score below $\theta$ will not (i.e., learning a threshold on the line). Suppose also that by putting in effort, agents can improve their skill level by some small amount $r$. In that case, if we learn an approximation $\hat{\theta}$ of $\theta$ such that $\theta \leq \hat{\theta} \leq \theta + r$ and use it for hiring, we can actually achieve error zero, in the sense that (a) any agent classified as positive is truly qualified, and (b) any agent who truly is qualified can be classified as positive by putting in effort. Thus, the ability for agents to improve has the potential to allow for a goal one could not hope to achieve in standard models, namely zero error.\In this paper, we explore this phenomenon more broadly, giving general results and examining under what conditions the ability of agents to improve can allow for a reduction in the sample complexity of learning, or alternatively, can make learning harder. We also examine both theoretically and empirically what kinds of improvement-aware algorithms can take into account agents who have the ability to improve to a limited extent when it is in their interest to do so.
Distributed Nonparametric Estimation: from Sparse to Dense Samples per Terminal
Deheng Yuan · Tao Guo · Zhongyi Huang
Consider the communication-constrained problem of nonparametric function estimation, in which each distributed terminal holds multiple i.i.d. samples. Under certain regularity assumptions, we characterize the minimax optimal rates for all regimes, and identify phase transitions of the optimal rates as the samples per terminal vary from sparse to dense. This fully solves the problem left open by previous works, whose scopes are limited to regimes with either dense samples or a single sample per terminal. To achieve the optimal rates, we design a layered estimation protocol by exploiting protocols for the parametric density estimation problem. We show the optimality of the protocol using information-theoretic methods and strong data processing inequalities, and incorporating the classic balls and bins model. The optimal rates are immediate for various special cases such as density estimation, Gaussian, binary, Poisson and heteroskedastic regression models.
Metastable Dynamics of Chain-of-Thought Reasoning: Provable Benefits of Search, RL and Distillation
Juno Kim · Denny Wu · Jason Lee · Taiji Suzuki
A key paradigm to improve the reasoning capabilities of large language models (LLMs) is to allocate more inference-time compute to search against a verifier or reward model. This process can then be utilized to refine the pretrained model or distill its reasoning patterns into more efficient models. In this paper, we study inference-time computation by viewing chain-of-thought (CoT) generation as a metastable Markov process: easy reasoning steps (e.g., algebraic manipulations) form densely connected clusters, while hard reasoning steps (e.g., applying a relevant theorem) create sparse, low-probability edges between clusters, leading to phase transitions at longer timescales. Under this framework, we prove that implementing a search protocol that rewards sparse edges improves CoT by decreasing the expected number of steps to reach different clusters. In contrast, we establish a limit on reasoning capability when the model is restricted to local information of the pretrained graph. We also show that the information gained by search can be utilized to obtain a better reasoning model: (1) the pretrained model can be directly finetuned to favor sparse edges via policy gradient methods, and moreover (2) a compressed \emph{metastable representation} of the reasoning dynamics can be distilled into a smaller, more efficient model.
Machine learning, the predominant approach in the field of artificial intelligence, enables computers to learn from data and experience. In the supervised learning framework, accurate and efficient learning of dependencies between data instances and their corresponding labels requires auxiliary information about the data distribution and the target function. This central concept aligns with the notion of regularization in statistical learning theory. Real-world datasets are often characterized by multiscale data instance distributions and well-behaved, smooth target functions. Scale-invariant probability distributions, such as power-law distributions, provide notable examples of multiscale data instance distributions in various contexts. This paper introduces a hierarchical learning model that leverages such a multiscale data structure with a multiscale entropy-based training procedure and explores its statistical and computational advantages. The hierarchical learning model is inspired by the logical progression in human learning from easy to complex tasks and features interpretable levels. In this model, the logarithm of any data instance’s norm can be construed as the data instance's complexity, and the allocation of computational resources is tailored to this complexity, resulting in benefits such as increased inference speed. Furthermore, our multiscale analysis of the statistical risk yields stronger guarantees compared to conventional uniform convergence bounds.
Heterogeneous Data Game: Characterizing the Model Competition Across Multiple Data Sources
Renzhe Xu · Kang Wang · Bo Li
Data heterogeneity across multiple sources is common in real-world machine learning (ML) settings. Although many methods focus on enabling a single model to handle diverse data, real-world markets often comprise multiple competing ML providers. In this paper, we propose a game-theoretic framework—the Heterogeneous Data Game—to analyze how such providers compete across heterogeneous data sources. We investigate the resulting pure Nash equilibria (PNE), showing that they can be non-existent, homogeneous (all providers converge on the same model), or heterogeneous (providers specialize in distinct data sources). Our analysis spans monopolistic, duopolistic, and more general markets, illustrating how factors such as the ``temperature'' of data-source choice models and the dominance of certain data sources shape equilibrium outcomes. We offer theoretical insights into both homogeneous and heterogeneous PNEs, guiding regulatory policies and practical strategies for competitive ML marketplaces.
Provably Efficient Algorithm for Best Scoring Rule Identification in Online Principal-Agent Information Acquisition
Zichen Wang · Chuanhao Li · Huazheng Wang
We investigate the problem of identifying the optimal scoring rule within the principal-agent framework for online information acquisition problem. We focus on the principal's perspective, seeking to determine the desired scoring rule through interactions with the agent. To address this challenge, we propose two algorithms: OIAFC and OIAFB, tailored for fixed confidence and fixed budget settings, respectively. Our theoretical analysis demonstrates that OIAFC can extract the desired $(\epsilon, \delta)$-scoring rule with a efficient instance-dependent sample complexity or an instance-independent sample complexity. Our analysis also shows that OIAFB matches the instance-independent performance bound of OIAFC, while both algorithms share the same complexity across fixed confidence and fixed budget settings.
We consider the problem of learning to exploit learning algorithms through repeated interactions in games. Specifically, we focus on the case of repeated two player, finite-action games, in which an optimizer aims to steer a no-regret learner to a Stackelberg equilibrium without knowledge of its payoffs. We first show that this is impossible if the optimizer only knows that the learner is using an algorithm from the general class of no-regret algorithms. This suggests that the optimizer requires more information about the learner's objectives or algorithm to successfully exploit them. Building on this intuition, we reduce the problem for the optimizer to that of recovering the learner's payoff structure. We demonstrate the effectiveness of this approach if the learner’s algorithm is drawn from a smaller class by analyzing two examples: one where the learner uses an ascent algorithm, and another where the learner uses stochastic mirror ascent with known regularizer and step sizes.
Maintaining Proportional Committees with Dynamic Candidate Sets
Chris Dong · Jannik Peters
Multiwinner voting is the study of electing a fixed-size committee given individual agents' preferences over candidates. Most research in this field has been limited to a static setting, with only one election over a fixed set of candidates. However, this approach overlooks the dynamic nature of applications, where candidate sets are subject to change.We extend the study of proportionality in multiwinner voting to dynamic settings, allowing candidates to join or leave the election and demanding that each chosen committee satisfies proportionality without differing too much from the previously selected committee. We consider approval preferences, ranked preferences, and the proportional clustering setting. In these settings, we either give algorithms making few changes or show that such algorithms cannot exist for various proportionality axioms. In particular, we show that such algorithms cannot exist for ranked preferences and provide amortized and exact algorithms for several proportionality notions in the other two settings.
The rise of Generative AI (GenAI) has significantly impacted human-based forums like Stack Overflow, which are essential for generating high-quality data. This creates a negative feedback loop, hindering the development of GenAI systems, which rely on such data to provide accurate responses. In this paper, we provide a possible remedy: A novel strategy we call selective response. Selective response implies that GenAI could strategically provide inaccurate (or conservative) responses to queries involving emerging topics and novel technologies, thereby driving users to use human-based forums. We show that selective response can potentially have a compounding effect on the data generation process, increasing both GenAI's revenue and user welfare in the long term. From an algorithmic perspective, we propose an approximately optimal approach to maximize GenAI's revenue under social welfare constraints. From a regulatory perspective, we derive sufficient and necessary conditions for selective response to improve welfare.
Machine learning models play a key role for service providers looking to gain market share in consumer markets. However, traditional learning approaches do not take into account the existence of additional providers, who compete with each other for consumers. Our work aims to study learning in this market setting, as it affects providers, consumers, and the market itself. We begin by analyzing such markets through the lens of the learning objective, and show that accuracy cannot be the only consideration. We then propose a method for classification under competition, so that a learner can maximize market share in the presence of competitors. We show that our approach benefits the providers as well as the consumers, and find that the timing of market entry and model updates can be crucial. We display the effectiveness of our approach across a range of domains, from simple distributions to noisy datasets, and show that the market as a whole remains stable by converging quickly to an equilibrium.
Collaborative Mean Estimation Among Heterogeneous Strategic Agents: Individual Rationality, Fairness, and Truthful Contribution
Alex Clinton · Yiding Chen · Jerry Zhu · Kirthevasan Kandasamy
We study a collaborative learning problem where $m$ agents aim to estimate a vector $\mu =(\mu_1,\ldots,\mu_d)\in \mathbb{R}^d$ by sampling from associated univariate normal distributions $(\mathcal{N}(\mu_k, \sigma^2))\_{k\in[d]}$. Agent $i$ incurs a cost $c_{i,k}$ to sample from $\mathcal{N}(\mu_k, \sigma^2)$. Instead of working independently, agents can exchange data, collecting cheaper samples and sharing them in return for costly data, thereby reducing both costs and estimation error. We design a mechanism to facilitate such collaboration, while addressing two key challenges: ensuring *individually rational (IR) and fair outcomes* so all agents benefit, and *preventing strategic behavior* (e.g. non-collection, data fabrication) to avoid socially undesirable outcomes.We design a mechanism and an associated Nash equilibrium (NE) which minimizes the social penalty-sum of agents' estimation errors and collection costs-while being IR for all agents. We achieve a $\mathcal{O}(\sqrt{m})$-approximation to the minimum social penalty in the worst case and an $\mathcal{O}(1)$-approximation under favorable conditions. Additionally, we establish three hardness results: no nontrivial mechanism guarantees *(i)* a dominant strategy equilibrium where agents report truthfully, *(ii)* is IR for every strategy profile of other agents, *(iii)* or avoids a worst-case $\Omega(\sqrt{m})$ price of stability in any NE. Finally, by integrating concepts from axiomatic bargaining, we demonstrate that our mechanism supports fairer outcomes than one which minimizes social penalty.
In open-environment applications, data are often collected from heterogeneous modalities with distinct encodings, resulting in feature space heterogeneity. This heterogeneity inherently induces label shift, making cross-modal knowledge transfer particularly challenging when the source and target data exhibit simultaneous heterogeneous feature spaces and shifted label distributions. Existing studies address only partial aspects of this issue, leaving the broader problem unresolved. To bridge this gap, we introduce a new concept of Heterogeneous Label Shift (HLS), targeting this critical but underexplored challenge. We first analyze the impact of heterogeneous feature spaces and label distribution shifts on model generalization and introduce a novel error decomposition theorem. Based on these insights, we propose a bound minimization HLS framework that decouples and tackles feature heterogeneity and label shift accordingly. Extensive experiments on various benchmarks for cross-modal classification validate the effectiveness and practical relevance of the proposed approach.
Towards a Formal Theory of Representational Compositionality
Eric Elmoznino · Thomas Jiralerspong · Yoshua Bengio · Guillaume Lajoie
Compositionality is believed to be fundamental to intelligence. In humans, it underlies the structure of thought and language. In AI, it enables a powerful form of out-of-distribution generalization, in which a model systematically adapts to novel combinations of known concepts. However, while we have strong intuitions about what compositionality is, we lack satisfying formal definitions for it. Here, we propose such a definition called representational compositionality that is conceptually simple, quantitative, and grounded in algorithmic information theory. Intuitively, representational compositionality states that a compositional representation is both expressive and describable as a simple function of parts. We validate our definition on both real and synthetic data, and show how it unifies disparate intuitions from across the literature in both AI and cognitive science. We hope that our definition can inspire the design of novel, theoretically-driven models that better capture the mechanisms of compositional thought. We make our code available at https://github.com/EricElmoznino/complexity_compositionality.
Test-Time Training Provably Improves Transformers as In-context Learners
Halil Alperen Gozeten · Muhammed Emrullah Ildiz · Xuechen Zhang · Mahdi Soltanolkotabi · Marco Mondelli · Samet Oymak
Test-time training (TTT) methods explicitly update the weights of a model to adapt to the specific test instance, and they have found success in a variety of settings, including most recently language modeling and reasoning. To demystify this success, we investigate a gradient-based TTT algorithm for in-context learning, where we train a transformer model on the in-context demonstrations provided in the test prompt. Specifically, we provide a comprehensive theoretical characterization of linear transformers when the update rule is a single gradient step. Our theory (i) delineates the role of alignment between pretraining distribution and target task, (ii) demystifies how TTT can alleviate distribution shift, and (iii) quantifies the sample complexity of TTT including how it can significantly reduce the eventual sample size required for in-context learning. As our empirical contribution, we study the benefits of TTT for TabPFN, a tabular foundation model. In line with our theory, we demonstrate that TTT significantly reduces the required sample size for tabular classification (3 to 5 times fewer) unlocking substantial inference efficiency with a negligible training cost.
Can Diffusion Models Learn Hidden Inter-Feature Rules Behind Images?
Yujin Han · Andi Han · Wei Huang · Chaochao Lu · Difan Zou
Despite the remarkable success of diffusion models (DMs) in data generation, they exhibit specific failure cases with unsatisfactory outputs. We focus on one such limitation: the ability of DMs to learn hidden rules between image features. Specifically, for image data with dependent features ($\mathbf{x}$) and ($\mathbf{y}$) (e.g., the height of the sun ($\mathbf{x}$) and the length of the shadow ($\mathbf{y}$)), we investigate whether DMs can accurately capture the inter-feature rule ($p(\mathbf{y}|\mathbf{x})$). Empirical evaluations on mainstream DMs (e.g., Stable Diffusion 3.5) reveal consistent failures, such as inconsistent lighting-shadow relationships and mismatched object-mirror reflections. Inspired by these findings, we design four synthetic tasks with strongly correlated features to assess DMs' rule-learning abilities. Extensive experiments show that while DMs can identify coarse-grained rules, they struggle with fine-grained ones. Our theoretical analysis demonstrates that DMs trained via denoising score matching (DSM) exhibit constant errors in learning hidden rules, as the DSM objective is not compatible with rule conformity. To mitigate this, we introduce a common technique - incorporating additional classifier guidance during sampling, which achieves (limited) improvements. Our analysis reveals that the subtle signals of fine-grained rules are challenging for the classifier to capture, providing insights for future exploration.
Can Transformers Reason Logically? A Study in SAT Solving
Leyan Pan · Vijay Ganesh · Jacob Abernethy · Chris Esposo · Wenke Lee
We formally study the logical reasoning capabilities of decoder-only Transformers in the context of the boolean satisfiability (SAT) problem. First, we prove by construction that decoder-only Transformers can decide 3-SAT, in a non-uniform model of computation, using backtracking and deduction via Chain-of-Thought (CoT).Second, we implement our construction as a PyTorch model with a tool (PARAT) that we designed to empirically demonstrate its correctness and investigate its properties.Third, rather than \textit{programming} a transformer to reason, we evaluate empirically whether it can be \textit{trained} to do so by learning directly from algorithmic traces (``reasoning paths'') from our theoretical construction. The trained models demonstrate strong out-of-distribution generalization on problem sizes seen during training but has limited length generalization, which is consistent with the implications of our theoretical result.
Probabilistic Factorial Experimental Design for Combinatorial Interventions
Divya Shyamal · Jiaqi Zhang · Caroline Uhler
A _combinatorial intervention_, consisting of multiple treatments applied to a single unit with potential interactive effects, has substantial applications in fields such as biomedicine, engineering, and beyond. Given $p$ possible treatments, conducting all possible $2^p$ combinatorial interventions can be laborious and quickly becomes infeasible as $p$ increases. Here we introduce the _probabilistic factorial experimental design_, formalized from how scientists perform lab experiments. In this framework, the experimenter selects a dosage for each possible treatment and applies it to a group of units. Each unit independently receives a random combination of treatments, sampled from a product Bernoulli distribution determined by the dosages. Additionally, the experimenter can carry out such experiments over multiple rounds, adapting the design in an active manner. We address the optimal experimental design problem within a novel intervention model that imposes bounded-degree interactions between treatments. In the passive setting, we provide a closed-form solution for the near-optimal design. Our results prove that a dosage of $\frac{1}{2}$ for each treatment is optimal up to a factor of $1+O(\frac{\ln(n)}{n})$ for estimating any $k$-way interaction model, regardless of $k$, and imply that $O\big(kp^{3k}\ln(p)\big)$ observations are required to accurately estimate this model. For the multi-round setting, we provide a near-optimal acquisition function that can be numerically optimized. We also explore several extensions of the design problem and finally validate our findings through simulations.
High-dimensional inference addresses scenarios where the dimension of the data approaches, or even surpasses, the sample size. In these settings, the regularized $M$-estimator is a common technique for inferring parameters. (Negahban et al.,2009) establish a unified framework for establishing convergence rates in the context of high-dimensional scaling, demonstrating that estimation errors are confined within a restricted set, and revealing fast convergence rates. The key assumption underlying their work is the convexity of the loss function. However, many loss functions in high-dimensional contexts are nonconvex. This leads to the question: if the loss function is nonconvex, do estimation errors still fall within a restricted set? If yes, can we recover convergence rates of the estimation error under nonconvex situations? This paper provides affirmative answers to these critical questions.
Improved Learning via k-DTW: A Novel Dissimilarity Measure for Curves
Amer Krivosija · Alexander Munteanu · André Nusser · Chris Schwiegelshohn
This paper introduces $k$-Dynamic Time Warping ($k$-DTW), a novel dissimilarity measure for polygonal curves. $k$-DTW has stronger metric properties than Dynamic Time Warping (DTW) and is more robust to outliers than the Fréchet distance, which are the two gold standards of dissimilarity measures for polygonal curves. We show interesting properties of $k$-DTW and give an exact algorithm as well as a $(1+\varepsilon)$-approximation algorithm for $k$-DTW by a parametric search for the $k$-th largest matched distance. We prove the first dimension-free learning bounds for curves and further learning theoretic results. $k$-DTW not only admits smaller sample size than DTW for the problem of learning the median of curves, where some factors depending on the curves' complexity $m$ are replaced by $k$, but we also show a surprising separation on the associated Rademacher and Gaussian complexities: $k$-DTW admits strictly smaller bounds than DTW, by a factor $\tilde\Omega(\sqrt{m})$ when $k\ll m$. We complement our theoretical findings with an experimental illustration of the benefits of using $k$-DTW for clustering and nearest neighbor classification.
Training Dynamics of In-Context Learning in Linear Attention
Yedi Zhang · Aaditya Singh · Peter Latham · Andrew Saxe
While attention-based models have demonstrated the remarkable ability of in-context learning (ICL), the theoretical understanding of how these models acquired this ability through gradient descent training is still preliminary. Towards answering this question, we study the gradient descent dynamics of multi-head linear self-attention trained for in-context linear regression. We examine two parametrizations of linear self-attention: one with the key and query weights merged as a single matrix (common in theoretical studies), and one with separate key and query matrices (closer to practical settings). For the merged parametrization, we show that the training dynamics has two fixed points and the loss trajectory exhibits a single, abrupt drop. We derive an analytical time-course solution for a certain class of datasets and initialization. For the separate parametrization, we show that the training dynamics has exponentially many fixed points and the loss exhibits saddle-to-saddle dynamics, which we reduce to scalar ordinary differential equations. During training, the model implements principal component regression in context with the number of principal components increasing over training time. Overall, we provide a theoretical description of how ICL abilities evolve during gradient descent training of linear attention, revealing abrupt acquisition or progressive improvements depending on how the key and query are parametrized.
Certifiably Robust Model Evaluation in Federated Learning under Meta-Distributional Shifts
Amir Najafi · Samin Mahdizadeh Sani · Farzan Farnia
We address the challenge of certifying the performance of a federated learning model on an unseen target network using only measurements from the source network that trained the model. Specifically, consider a source network "A" with $K$ clients, each holding private, non-IID datasets drawn from heterogeneous distributions, modeled as samples from a broader meta-distribution $\mu$. Our goal is to provide certified guarantees for the model’s performance on a different, unseen network "B", governed by an unknown meta-distribution $\mu'$, assuming the deviation between $\mu$ and $\mu'$ is bounded—either in Wasserstein distance or an $f$-divergence. We derive worst-case uniform guarantees for both the model’s average loss and its risk CDF, the latter corresponding to a novel, adversarially robust version of the Dvoretzky–Kiefer–Wolfowitz (DKW) inequality. In addition, we show how the vanilla DKW bound enables principled certification of the model's true performance on unseen clients within the same (source) network. Our bounds are efficiently computable, asymptotically minimax optimal, and preserve clients' privacy.We also establish non-asymptotic generalization bounds that converge to zero as $K$ grows and the minimum per-client sample size exceeds $\mathcal{O}(\log K)$. Empirical evaluations confirm the practical utility of our bounds across real-world tasks. The project code is available at: github.com/samin-mehdizadeh/Robust-Evaluation-DKW
Unified Analysis of Continuous Weak Features Learning with Applications to Learning from Missing Data
Kosuke Sugiyama · Masato Uchida
This paper addresses weak features learning (WFL), focusing on learning scenarios characterized by low-quality input features (weak features; WFs) that arise due to missingness, measurement errors, or ambiguous observations. We present a theoretical formalization and error analysis of WFL for continuous WFs (continuous WFL), which has been insufficiently explored in existing literature. A previous study established formalization and error analysis for WFL with discrete WFs (discrete WFL); however, this analysis does not extend to continuous WFs due to the inherent constraints of discreteness. To address this, we propose a theoretical framework specifically designed for continuous WFL, systematically capturing the interactions between feature estimation models for WFs and label prediction models for downstream tasks. Furthermore, we derive the theoretical conditions necessary for both sequential and iterative learning methods to achieve consistency. By integrating the findings of this study on continuous WFL with the existing theory of discrete WFL, we demonstrate that the WFL framework is universally applicable, providing a robust theoretical foundation for learning with low-quality features across diverse application domains.
A Generalization Theory for Zero-Shot Prediction
Ronak Mehta · Zaid Harchaoui
A modern paradigm for generalization in machine learning and AI consists of pre-training a task-agnostic foundation model, generally obtained using self-supervised and multimodal contrastive learning. The resulting representations can be used for prediction on a downstream task for which no labeled data is available. We present a theoretical framework to better understand this approach, called zero-shot prediction. We identify the target quantities that zero-shot prediction aims to learn, or learns in passing, and the key conditional independence relationships that enable its generalization ability.
Limitations of measure-first protocols in quantum machine learning
Casper Gyurik · Riccardo Molteni · Vedran Dunjko
In recent times, there have been major developments in two distinct yet connected domains of quantum information. On the one hand, substantial progress has been made in so-called randomized measurement protocols. Here, a number of properties of unknown quantum states can be deduced from surprisingly few measurement outcomes, using schemes such as classical shadows. On the other hand, significant progress has been made in quantum machine learning. For example, exponential advantages have been proven when the data consists of quantum states and quantum algorithms can coherently measure multiple copies of input states. In this work, we aim to understand the implications and limitations of combining randomized measurement protocols with quantum machine learning, although the implications are broader. Specifically, we investigate quantum machine learning algorithms that, when dealing with quantum data, can either process it entirely using quantum methods or measure the input data through a fixed measurement scheme and utilize the resulting classical information. We prove limitations for the general class of quantum machine learning algorithms that use fixed measurement schemes on the input quantum states.Our results have several implications. From the perspective of randomized measurement procedures, we show limitations of measure-first protocols in the average case, improving on the state-of-the-art which only focuses on worst-case scenarios. Additionally, previous lower bounds were only known for physically unrealizable states. We improve upon this by employing quantum pseudorandom functions to prove that a learning separation also exists when dealing with physically realizable states, which may be encountered in experiments. From a machine learning perspective, our results are crucial for defining a physically meaningful task that shows fully quantum machine learning processing is not only more efficient but also necessary for solving certain problems. The tasks at hand are also realistic, as the algorithms and proven separations hold when working with efficiently preparable states and remain robust in the presence of measurement and preparation errors.
Dimension-Independent Rates for Structured Neural Density Estimation
Vandermeulen · Wai Ming Tai · Bryon Aragam
We show that deep neural networks can achieve dimension-independent rates of convergence for learning structured densities typical of image, audio, video, and text data. For example, in images, where each pixel becomes independent of the rest of the image when conditioned on pixels at most $t$ steps away, a simple $L^2$-minimizing neural network can attain a rate of $n^{-1/((t+1)^2+4)}$, where $t$ is independent of the ambient dimension $d$, i.e. the total number of pixels. We further provide empirical evidence that, in real-world applications, $t$ is often a small constant, thus effectively circumventing the curse of dimensionality. Moreover, for sequential data (e.g., audio or text) exhibiting a similar local dependence structure, our analysis shows a rate of $n^{-1/(t+5)}$, offering further evidence of dimension independence in practical scenarios.
The Power of Random Features and the Limits of Distribution-Free Gradient Descent
Ari Karchmer · Eran Malach
We study the relationship between gradient-based optimization of parametric models (e.g., neural networks) and optimization of linear combinations of random features. Our main result shows that if a parametric model can be learned using mini-batch stochastic gradient descent (bSGD) without making assumptions about the data distribution, then with high probability, the target function can also be approximated using a polynomial-sized combination of random features. The size of this combination depends on the number of gradient steps and numerical precision used in the bSGD process. This finding reveals fundamental limitations of distribution-free learning in neural networks trained by gradient descent, highlighting why making assumptions about data distributions is often crucial in practice.Along the way, we also introduce a new theoretical framework called average probabilistic dimension complexity (adc), which extends the probabilistic dimension complexity developed by Kamath et al. (2020). We prove that adc has a polynomial relationship with statistical query dimension, and use this relationship to demonstrate an infinite separation between adc and standard dimension complexity.
ATA: Adaptive Task Allocation for Efficient Resource Management in Distributed Machine Learning
Artavazd Maranjyan · El Mehdi Saad · Peter Richtarik · Francesco Orabona
Asynchronous methods are fundamental for parallelizing computations in distributed machine learning. They aim to accelerate training by fully utilizing all available resources. However, their greedy approach can lead to inefficiencies using more computation than required, especially when computation times vary across devices. If the computation times were known in advance, training could be fast and resource-efficient by assigning more tasks to faster workers. The challenge lies in achieving this optimal allocation without prior knowledge of the computation time distributions. In this paper, we propose ATA (Adaptive Task Allocation), a method that adapts to heterogeneous and random distributions of worker computation times. Through rigorous theoretical analysis, we show that ATA identifies the optimal task allocation and performs comparably to methods with prior knowledge of computation times. Experimental results further demonstrate that ATA is resource-efficient, significantly reducing costs compared to the greedy approach, which can be arbitrarily expensive depending on the number of workers.
Learning to Incentivize in Repeated Principal-Agent Problems with Adversarial Agent Arrivals
Junyan Liu · ARNAB MAITI · Artin Tajdini · Kevin Jamieson · Lillian Ratliff
We initiate the study of a repeated principal-agent problem over a finite horizon $T$, where a principal sequentially interacts with $K\geq 2$ types of agents arriving in an *adversarial* order. At each round, the principal strategically chooses one of the $N$ arms to incentivize for an arriving agent of *unknown type*. The agent then chooses an arm based on its own utility and the provided incentive, and the principal receives a corresponding reward. The objective is to minimize regret against the best incentive in hindsight. Without prior knowledge of agent behavior, we show that the problem becomes intractable, leading to linear regret. We analyze two key settings where sublinear regret is achievable. In the first setting, the principal knows the arm each agent type would select greedily for any given incentive. Under this setting, we propose an algorithm that achieves a regret bound of $\mathcal{O}(\min(\sqrt{KT\log N},K\sqrt{T}))$ and provide a matching lower bound up to a $\log K$ factor. In the second setting, an agent's response varies smoothly with the incentive and is governed by a Lipschitz constant $L$. Under this setting, we show that there is an algorithm with a regret bound of $\tilde{\mathcal{O}}((LN)^{1/3}T^{2/3})$ and establish a matching lower bound up to logarithmic factors. Finally, we extend our algorithmic results for both settings by allowing the principal to incentivize multiple arms simultaneously in each round.
Beyond Minimax Rates in Group Distributionally Robust Optimization via a Novel Notion of Sparsity
Quan Nguyen · Nishant Mehta · Cristóbal Guzmán
The minimax sample complexity of group distributionally robust optimization (GDRO) has been determined up to a $\log(K)$ factor, where $K$ is the number of groups. In this work, we venture beyond the minimax perspective via a novel notion of sparsity that we call $(\lambda, \beta)$-sparsity. In short, this condition means that at any parameter $\theta$, there is a set of at most $\beta$ groups whose risks at $\theta$ are all at least $\lambda$ larger than the risks of the other groups. To find an $\epsilon$-optimal $\theta$, we show via a novel algorithm and analysis that the $\epsilon$-dependent term in the sample complexity can swap a linear dependence on $K$ for a linear dependence on the potentially much smaller $\beta$. This improvement leverages recent progress in sleeping bandits, showing a fundamental connection between the two-player zero-sum game optimization framework for GDRO and per-action regret bounds in sleeping bandits. We next show an adaptive algorithm which, up to logarithmic factors, obtains a sample complexity bound that adapts to the best $(\lambda, \beta)$-sparsity condition that holds. We also show how to obtain a dimension-free semi-adaptive sample complexity bound with a computationally efficient method.Finally, we demonstrate the practicality of the $(\lambda, \beta)$-sparsity condition and the improved sample efficiency of our algorithms on both synthetic and real-life datasets.
Fixing the Loose Brake: Exponential-Tailed Stopping Time in Best Arm Identification
Kapilan Balagopalan · Tuan Nguyen · Yao Zhao · Kwang-Sung Jun
The best arm identification problem requires identifying the best alternative (i.e., arm) in active experimentation using the smallest number of experiments (i.e., arm pulls), which is crucial for cost-efficient and timely decision-making processes. In the fixed confidence setting, an algorithm must stop data-dependently and return the estimated best arm with a correctness guarantee. Since this stopping time is random, we desire its distribution to have light tails. Unfortunately, many existing studies focus on high probability or in expectation bounds on the stopping time, which allow heavy tails and, for high probability bounds, even not stopping at all. We first prove that this never-stopping event can indeed happen for some popular algorithms. Motivated by this, we propose algorithms that provably enjoy an exponential-tailed stopping time, which improves upon the polynomial tail bound reported by Kalyanakrishnan et al. (2012). The first algorithm is based on a fixed budget algorithm called Sequential Halving along with a doubling trick. The second algorithm is a meta algorithm that takes in any fixed confidence algorithm with a high probability stopping guarantee and turns it into one that enjoys an exponential-tailed stopping time. Our results imply that there is much more to be desired for contemporary fixed confidence algorithms.
Many real-world bandit applications are characterized by sparse rewards, which can significantly hinder learning efficiency. Leveraging problem-specific structures for careful distribution modeling is recognized as essential for improving estimation efficiency in statistics. However, this approach remains under-explored in the context of bandits. To address this gap, we initiate the study of zero-inflated bandits, where the reward is modeled using a classic semi-parametric distribution known as the zero-inflated distribution. We develop algorithms based on the Upper Confidence Bound and Thompson Sampling frameworks for this specific structure. The superior empirical performance of these methods is demonstrated through extensive numerical studies.
Avoiding Catastrophe in Online Learning by Asking for Help
Benjamin Plaut · Hanlin Zhu · Stuart Russell
Most learning algorithms with formal regret guarantees assume that all mistakes are recoverable and essentially rely on trying all possible behaviors. This approach is problematic when some mistakes are catastrophic, i.e., irreparable. We propose an online learning problem where the goal is to minimize the chance of catastrophe. Specifically, we assume that the payoff in each round represents the chance of avoiding catastrophe in that round and try to maximize the product of payoffs (the overall chance of avoiding catastrophe) while allowing a limited number of queries to a mentor. We also assume that the agent can transfer knowledge between similar inputs. We first show that in general, any algorithm either queries the mentor at a linear rate or is nearly guaranteed to cause catastrophe. However, in settings where the mentor policy class is learnable in the standard online model, we provide an algorithm whose regret and rate of querying the mentor both approach 0 as the time horizon grows. Although our focus is the product of payoffs, we provide matching bounds for the typical additive regret. Conceptually, if a policy class is learnable in the absence of catastrophic risk, it is learnable in the presence of catastrophic risk if the agent can ask for help.
Implicit Riemannian Optimism with Applications to Min-Max Problems
Christophe Roux · David Martinez-Rubio · Sebastian Pokutta
We introduce a Riemannian optimistic online learning algorithm for Hadamard manifolds based on inexact implicit updates. Unlike prior work, our method can handle in-manifold constraints, and matches the best known regret bounds in the Euclidean setting with no dependence on geometric constants, like the minimum curvature. Building on this, we develop algorithms for g-convex, g-concave smooth min-max problems on Hadamard manifolds. Notably, one method nearly matches the gradient oracle complexity of the lower bound for Euclidean problems, for the first time.
We consider the problem of online learning where the sequence of actions played by the learner must adhere to an unknown safety constraint at every round. The goal is to minimize regret with respect to the best safe action in hindsight while simultaneously satisfying the safety constraint with high probability on each round. We provide a general meta-algorithm that leverages an online regression oracle to estimate the unknown safety constraint, and converts the predictions of an online learning oracle to predictions that adhere to the unknown safety constraint. On the theoretical side, our algorithm's regret can be bounded by the regret of the online regression and online learning oracles, the eluder dimension of the model class containing the unknown safety constraint, and a novel complexity measure that characterizes the difficulty of safe learning. We complement our result with an asymptotic lower bound that shows that the aforementioned complexity measure is necessary. When the constraints are linear, we instantiate our result to provide a concrete algorithm with $\sqrt{T}$ regret using a scaling transformation that balances optimistic exploration with pessimistic constraint satisfaction.
Unlocking the Power of Rehearsal in Continual Learning: A Theoretical Perspective
Junze Deng · Qinhang Wu · Peizhong Ju · Sen Lin · Yingbin LIANG · Ness Shroff
Rehearsal-based methods have shown superior performance in addressing catastrophic forgetting in continual learning (CL) by storing and training on a subset of past data alongside new data in current task. While such a concurrent rehearsal strategy is widely used, it remains unclear if this approach is always optimal. Inspired by human learning, where sequentially revisiting tasks helps mitigate forgetting, we explore whether sequential rehearsal can offer greater benefits for CL compared to standard concurrent rehearsal. To address this question, we conduct a theoretical analysis of rehearsal-based CL in overparameterized linear models, comparing two strategies: 1) Concurrent Rehearsal, where past and new data are trained together, and 2) Sequential Rehearsal, where new data is trained first, followed by revisiting past data sequentially. By explicitly characterizing forgetting and generalization error, we show that sequential rehearsal performs better when tasks are less similar. These insights further motivate a novel Hybrid Rehearsal method, which trains similar tasks concurrently and revisits dissimilar tasks sequentially. We characterize its forgetting and generalization performance, and our experiments with deep neural networks further confirm that the hybrid approach outperforms standard concurrent rehearsal. This work provides the first comprehensive theoretical analysis of rehearsal-based CL.
A Near Linear Query Lower Bound for Submodular Maximization
Binghui Peng · Aviad Rubinstein
We revisit the problem of selecting $k$-out-of-$n$ elements with the goal of optimizing an objective function, and ask whether it can be solved approximately with sublinear query complexity. For objective functions that are monotone submodular, [Li, Feldman, Kazemi, Karbasi, NeurIPS'22; Kuhnle, AISTATS'21] gave an $\Omega(n/k)$ query lower bound for approximating to within any constant factor. We strengthen their lower bound to a nearly tight $\tilde{\Omega}(n)$. This lower bound holds even for estimating the value of the optimal subset. When the objective function is additive, we prove that finding an approximately optimal subset still requires near-linear query complexity, but we can estimate the value of the optimal subset in $\tilde{O}(n/k)$ queries, and that this is tight up to polylog factors.
Exploiting Similarity for Computation and Communication-Efficient Decentralized Optimization
Yuki Takezawa · Xiaowen Jiang · Anton Rodomanov · Sebastian Stich
Reducing communication complexity is critical for efficient decentralized optimization. The proximal decentralized optimization (PDO) framework is particularly appealing, as methods within this framework can exploit functional similarity among nodes to reduce communication rounds. Specifically, when local functions at different nodes are similar, these methods achieve faster convergence with fewer communication steps. However, existing PDO methods often require highly accurate solutions to subproblems associated with the proximal operator, resulting in significant computational overhead. In this work, we propose the Stabilized Proximal Decentralized Optimization (SPDO) method, which achieves state-of-the-art communication and computational complexities within the PDO framework. Additionally, we refine the analysis of existing PDO methods by relaxing subproblem accuracy requirements and leveraging average functional similarity. Experimental results demonstrate that SPDO significantly outperforms existing methods.
Faster Stochastic Optimization with Arbitrary Delays via Adaptive Asynchronous Mini-Batching
Amit Attia · Ofir Gaash · Tomer Koren
We consider the problem of asynchronous stochastic optimization, where an optimization algorithm makes updates based on stale stochastic gradients of the objective that are subject to an arbitrary (possibly adversarial) sequence of delays. We present a procedure which, for any given $q \in (0,1]$, transforms any standard stochastic first-order method to an asynchronous method with convergence guarantee depending on the $q$-quantile delay of the sequence. This approach leads to convergence rates of the form $O(\tau_q/qT+\sigma/\sqrt{qT})$ for non-convex and $O(\tau_q^2/(q T)^2+\sigma/\sqrt{qT})$ for convex smooth problems, where $\tau_q$ is the $q$-quantile delay, generalizing and improving on existing results that depend on the average delay. We further show a method that automatically adapts to all quantiles simultaneously, without any prior knowledge of the delays, achieving convergence rates of the form $O(\inf_{q} \tau_q/qT+\sigma/\sqrt{qT})$ for non-convex and $O(\inf_{q} \tau_q^2/(q T)^2+\sigma/\sqrt{qT})$ for convex smooth problems. Our technique is based on asynchronous mini-batching with a careful batch-size selection and filtering of stale gradients.
General framework for online-to-nonconvex conversion: Schedule-free SGD is also effective for nonconvex optimization
Kwangjun Ahn · Gagik Magakyan · Ashok Cutkosky
This work investigates the effectiveness of schedule-free methods, developed by A. Defazio et al. (NeurIPS 2024), in nonconvex optimization settings, inspired by their remarkable empirical success in training neural networks. Specifically, we show that schedule-free SGD achieves optimal iteration complexity for nonsmooth, non-convex optimization problems. Our proof begins with the development of a general framework for online-to-nonconvex conversion, which converts a given online learning algorithm into an optimization algorithm for nonconvex losses. Our general framework not only recovers existing conversions but also leads to two novel conversion schemes. Notably, one of these new conversions corresponds directly to schedule-free SGD, allowing us to establish its optimality. Additionally, our analysis provides valuable insights into the parameter choices for schedule-free SGD, addressing a theoretical gap that the convex theory cannot explain.