

Poster Session

Poster Session 6 East

East Exhibition Hall A-B
Thu 17 Jul 4:30 p.m. PDT — 7 p.m. PDT


Poster
#E-1000
Understanding the difficulties of posterior predictive estimation

Abhinav Agrawal · Justin Domke

Predictive posterior densities (PPDs) are essential in approximate inference for quantifying predictive uncertainty and comparing inference methods. Typically, PPDs are estimated by simple Monte Carlo (MC) averages. In this paper, we expose a critical under-recognized issue: the signal-to-noise ratio (SNR) of the simple MC estimator can sometimes be extremely low, leading to unreliable estimates. Our main contribution is a theoretical analysis demonstrating that even with exact inference, SNR can decay rapidly with an increase in (a) the mismatch between training and test data, (b) the dimensionality of the latent space, or (c) the size of test data relative to training data. Through several examples, we empirically verify these claims and show that these factors indeed lead to poor SNR and unreliable PPD estimates (sometimes, estimates are off by hundreds of nats even with a million samples). While not the primary focus, we also explore an adaptive importance sampling approach as an illustrative way to mitigate the problem, where we learn the proposal distribution by maximizing a variational proxy to the SNR. Taken together, our findings highlight an important challenge and provide essential insights for reliable estimation.
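
A minimal sketch (not the authors' code) of the simple MC estimator in question, for a conjugate Gaussian model where the posterior is exact. The train/test shift and sample counts are illustrative assumptions; re-running the estimator exposes its spread, and low SNR shows up as large variability across replicates:

```python
import numpy as np

rng = np.random.default_rng(0)

# Conjugate Gaussian model: theta ~ N(0, 1), y | theta ~ N(theta, 1).
n_train, shift = 50, 3.0             # shift controls train/test mismatch (assumption)
y_train = rng.normal(0.0, 1.0, n_train)
y_test = rng.normal(shift, 1.0, 20)  # test data drawn from a shifted distribution

# Exact posterior over theta given the training data.
post_var = 1.0 / (1.0 + n_train)
post_mean = post_var * y_train.sum()

def log_ppd_mc(S):
    """Simple Monte Carlo estimate of log p(y_test | y_train)."""
    thetas = rng.normal(post_mean, np.sqrt(post_var), S)
    # log p(y_test | theta_s), summed over test points, for each posterior sample
    log_liks = np.array([
        -0.5 * np.sum((y_test - t) ** 2) - 0.5 * len(y_test) * np.log(2 * np.pi)
        for t in thetas
    ])
    m = log_liks.max()
    return m + np.log(np.mean(np.exp(log_liks - m)))  # stable log-mean-exp

# Re-run the estimator to see its spread; a large spread signals low SNR.
estimates = np.array([log_ppd_mc(S=1000) for _ in range(30)])
print(f"mean={estimates.mean():.2f} nats, std={estimates.std():.2f} nats")
```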


Poster
#E-1001
Deep Bayesian Filter for Bayes-Faithful Data Assimilation

Yuta Tarumi · Keisuke Fukuda · Shin-ichi Maeda

Data assimilation for nonlinear state space models (SSMs) is inherently challenging due to non-Gaussian posteriors. We propose Deep Bayesian Filtering (DBF), a novel approach to data assimilation in nonlinear SSMs. DBF introduces latent variables $h_t$ in addition to physical variables $z_t$, ensuring Gaussian posteriors by (i) constraining state transitions in the latent space to be linear and (ii) learning a Gaussian inverse observation operator $r(h_t|o_t)$. This structured posterior design enables analytical recursive computation, avoiding the accumulation of Monte Carlo sampling errors over time steps. DBF optimizes these operators and other latent SSM parameters by maximizing the evidence lower bound. Experiments demonstrate that DBF outperforms existing methods in scenarios with highly non-Gaussian posteriors.
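
Because DBF constrains latent transitions to be linear and learns a Gaussian inverse observation operator, the posterior recursion stays analytical. A minimal sketch of the generic Kalman-style recursion this structure enables; the dimensions, noise levels, and the fixed pseudo-observation are illustrative stand-ins, and in DBF the operator $r(h_t|o_t)$ is a learned network:

```python
import numpy as np

def kalman_step(mu, P, A, Q, m_obs, R_obs):
    """One analytical filtering step in the latent space.

    Predict with linear dynamics h_t = A h_{t-1} + noise(Q), then fuse the
    Gaussian 'inverse observation' N(m_obs, R_obs), the role played by the
    learned r(h_t | o_t) in DBF, as a product of Gaussians.
    """
    # Predict step (the posterior stays Gaussian because A is linear).
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Update step: product of two Gaussians over h_t, in information form.
    P_post = np.linalg.inv(np.linalg.inv(P_pred) + np.linalg.inv(R_obs))
    mu_post = P_post @ (np.linalg.solve(P_pred, mu_pred) + np.linalg.solve(R_obs, m_obs))
    return mu_post, P_post

# Toy run: 2-D latent state, contracting linear dynamics (all values illustrative).
d = 2
A = 0.9 * np.eye(d)
Q = 0.1 * np.eye(d)
mu, P = np.zeros(d), np.eye(d)
for t in range(5):
    m_obs = np.array([1.0, -1.0])   # stand-in for the output of r(h_t | o_t)
    R_obs = 0.5 * np.eye(d)
    mu, P = kalman_step(mu, P, A, Q, m_obs, R_obs)
print(mu, np.diag(P))
```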


Poster
#E-1002
Splitting & Integrating: Out-of-Distribution Detection via Adversarial Gradient Attribution

Jiayu Zhang · Xinyi Wang · Zhibo Jin · Zhiyu Zhu · Jianlong Zhou · Fang Chen · Huaming Chen

Out-of-distribution (OOD) detection is essential for enhancing the robustness and security of deep learning models in unknown and dynamic data environments. Gradient-based OOD detection methods, such as GAIA, analyse the explanation pattern representations of in-distribution (ID) and OOD samples by examining the sensitivity of model outputs w.r.t. model inputs, resulting in superior performance compared to traditional OOD detection methods. However, we argue that the non-zero gradient behaviors of OOD samples do not exhibit significant distinguishability, especially when ID samples are perturbed by random perturbations in high-dimensional spaces, which negatively impacts the accuracy of OOD detection. In this paper, we propose a novel OOD detection method called S & I based on layer Splitting and gradient Integration via Adversarial Gradient Attribution. Specifically, our approach involves splitting the model's intermediate layers and iteratively updating adversarial examples layer by layer. We then integrate the attribution gradients from each intermediate layer along the attribution path from adversarial examples to the actual input, yielding true explanation pattern representations for both ID and OOD samples. Experiments demonstrate that our S & I algorithm achieves state-of-the-art results, with average FPR95 values of 29.05% (ResNet34)/38.61% (WRN40) on the CIFAR100 benchmark and 37.31% (BiT-S) on the ImageNet benchmark. Our code is available at: https://github.com/LMBTough/S-I


Poster
#E-1003
Low-Rank Adapting Models for Sparse Autoencoders

Matthew Chen · Josh Engels · Max Tegmark

Sparse autoencoders (SAEs) aim to decompose language model representations into a sparse set of linear latent vectors. Recent works have improved SAEs using language model gradients, but these techniques require many expensive backward passes during training and still cause a significant increase in cross entropy loss when SAE reconstructions are inserted into the model. In this work, we address these limitations by taking a fundamentally different approach: we use low-rank adaptation (LoRA) to finetune the *language model itself* around a previously trained SAE. We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs. In these settings, our method reduces the cross entropy loss gap by 30% - 55% when SAEs are inserted during the forward pass. We also find that compared to end-to-end (e2e) SAEs, our approach achieves the same downstream cross entropy loss 3$\times$ to 20$\times$ faster on Gemma-2-2B and 2$\times$ to 10$\times$ faster on Llama-3.2-1B. We further show that our technique improves downstream metrics and can adapt multiple SAEs at once. Our results demonstrate that improving model interpretability is not limited to post-hoc SAE training; Pareto improvements can also be achieved by directly optimizing the model itself.
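
A minimal sketch of the idea under stated assumptions, not the authors' code: a toy linear "model" layer gains trainable LoRA factors, a previously trained SAE is frozen, and only the LoRA parameters are optimized so the model performs well with the SAE reconstruction inserted into its forward pass. A toy MSE objective stands in for the LM's cross-entropy:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer with a trainable low-rank update (standard LoRA)."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Toy stand-ins: a "language model" layer and a frozen, previously trained SAE.
d, hidden = 32, 128
layer = LoRALinear(nn.Linear(d, d))
sae_enc, sae_dec = nn.Linear(d, hidden), nn.Linear(hidden, d)
for p in list(sae_enc.parameters()) + list(sae_dec.parameters()):
    p.requires_grad_(False)

def forward_with_sae(x):
    h = layer(x)
    return sae_dec(torch.relu(sae_enc(h)))   # SAE reconstruction replaces h

# Fine-tune only the LoRA factors so the model works well *around* the SAE.
opt = torch.optim.Adam([layer.A, layer.B], lr=1e-3)
x, target = torch.randn(64, d), torch.randn(64, d)
loss = nn.functional.mse_loss(forward_with_sae(x), target)
loss.backward()
opt.step()
```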


Poster
#E-1004
B-score: Detecting biases in large language models using response history

An Vo · Mohammad Reza Taesiri · Daeyoung Kim · Anh Nguyen

Large language models (LLMs) often exhibit strong biases, e.g., against women or in favor of the number 7. We investigate whether LLMs would be able to output less biased answers when allowed to observe their prior answers to the same question in a multi-turn conversation. To understand which types of questions invite more biased answers, we test LLMs on our proposed set of questions that span 9 topics and belong to three types: (1) Subjective; (2) Random; and (3) Objective. Interestingly, LLMs are able to "de-bias" themselves in a multi-turn conversation in response to questions that seek a Random, unbiased answer. Furthermore, we propose B-score, a novel metric that is effective in detecting biases in Subjective, Random, Easy, and Hard questions. On MMLU, HLE, and CSQA, leveraging B-score substantially improves the verification accuracy of LLM answers (i.e., accepting correct LLM answers and rejecting incorrect ones) compared to using verbalized confidence scores or the frequency of single-turn answers alone. Code and data are available at: b-score.github.io.


Poster
#E-1005
How Do Transformers Learn Variable Binding in Symbolic Programs?

Yiwei Wu · Atticus Geiger · Raphaël Millière

Variable binding---the ability to associate variables with values---is fundamental to symbolic computation and cognition. Although classical architectures typically implement variable binding via addressable memory, it is not well understood how modern neural networks lacking built-in binding operations may acquire this capacity. We investigate this by training a Transformer to dereference queried variables in symbolic programs where variables are assigned either numerical constants or other variables. Each program requires following chains of variable assignments up to four steps deep to find the queried value, and also contains irrelevant chains of assignments acting as distractors. Our analysis reveals a developmental trajectory with three distinct phases during training: (1) random prediction of numerical constants, (2) a shallow heuristic prioritizing early variable assignments, and (3) the emergence of a systematic mechanism for dereferencing assignment chains. Using causal interventions, we find that the model learns to exploit the residual stream as an addressable memory space, with specialized attention heads routing information across token positions. This mechanism allows the model to dynamically track variable bindings across layers, resulting in accurate dereferencing. Our results show how Transformer models can learn to implement systematic variable binding without explicit architectural support, bridging connectionist and symbolic approaches.


Poster
#E-1006
Towards flexible perception with visual memory

Robert Geirhos · Priyank Jaini · Austin Stone · Sourabh Medapati · Xi Yi · George Toderici · Abhijit Ogale · Jonathon Shlens

Training a neural network is a monolithic endeavor, akin to carving knowledge into stone: once the process is completed, editing the knowledge in a network is nearly impossible, since all information is distributed across the network's weights. We here explore a simple, compelling alternative by marrying the representational power of deep neural networks with the flexibility of a database. Decomposing the task of image classification into image similarity (from a pre-trained embedding) and search (via fast nearest neighbor retrieval from a knowledge database), we build a simple and flexible visual memory that has the following key capabilities: (1.) The ability to flexibly add data across scales: from individual samples all the way to entire classes and billion-scale data; (2.) The ability to remove data through unlearning and memory pruning; (3.) An interpretable decision-mechanism on which we can intervene to control its behavior. Taken together, these capabilities comprehensively demonstrate the benefits of an explicit visual memory. We hope that it might contribute to a conversation on how knowledge should be represented in deep vision models---beyond carving it in "stone" weights.
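
A minimal sketch of the decomposition described above, with random vectors standing in for embeddings from a frozen pretrained encoder and a plain majority vote as the aggregation rule (both assumptions, not the paper's exact pipeline). Note how adding knowledge is appending rows and unlearning is deleting them:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for embeddings produced by a frozen pretrained encoder (assumption).
memory_keys = rng.normal(size=(1000, 64))        # database embeddings
memory_labels = rng.integers(0, 10, size=1000)   # database labels

def classify(query, k=5):
    """Nearest-neighbor lookup in the visual memory, majority vote over k."""
    sims = memory_keys @ query / (
        np.linalg.norm(memory_keys, axis=1) * np.linalg.norm(query)
    )
    top = np.argsort(-sims)[:k]
    votes = np.bincount(memory_labels[top], minlength=10)
    return votes.argmax(), top

pred, neighbors = classify(rng.normal(size=64))

# "Unlearn" the retrieved samples by simply removing them from the database.
memory_keys = np.delete(memory_keys, neighbors, axis=0)
memory_labels = np.delete(memory_labels, neighbors)
```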


Poster
#E-1007
InfoCons: Identifying Interpretable Critical Concepts in Point Clouds via Information Theory

Feifei Li · Mi Zhang · Zhaoxiang Wang · Min Yang

Interpretability of point cloud (PC) models becomes imperative given their deployment in safety-critical scenarios such as autonomous vehicles. We focus on attributing PC model outputs to interpretable critical concepts, defined as meaningful subsets of the input point cloud. To enable human-understandable diagnostics of model failures, an ideal critical subset should be faithful (preserving points that causally influence predictions) and conceptually coherent (forming semantically meaningful structures that align with human perception). We propose InfoCons, an explanation framework that applies information-theoretic principles to decompose the point cloud into 3D concepts, enabling the examination of their causal effect on model predictions with learnable priors. We evaluate InfoCons on synthetic datasets for classification, comparing it qualitatively and quantitatively with four baselines. We further demonstrate its scalability and flexibility on two real-world datasets and in two applications that utilize critical scores of PC.


Poster
#E-1008
SAE-V: Interpreting Multimodal Models for Enhanced Alignment

Hantao Lou · Changye Li · Jiaming Ji · Yaodong Yang

With the integration of the image modality, the semantic space of multimodal large language models (MLLMs) is more complex than that of text-only models, making their interpretability more challenging and their alignment less stable. In particular, they are susceptible to low-quality data, which can lead to inconsistencies between modalities, hallucinations, and biased outputs. As a result, developing interpretability methods for MLLMs is crucial for improving alignment quality and efficiency. In text-only LLMs, Sparse Autoencoders (SAEs) have gained attention for their ability to interpret latent representations. However, extending SAEs to multimodal settings presents new challenges due to modality fusion and the difficulty of isolating cross-modal representations. To address these challenges, we introduce SAE-V, a mechanistic interpretability framework that extends the SAE paradigm to MLLMs. By identifying and analyzing interpretable features along with their corresponding data, SAE-V enables fine-grained interpretation of both model behavior and data quality, facilitating a deeper understanding of cross-modal interactions and alignment dynamics. Moreover, by utilizing cross-modal feature weighting, SAE-V provides an intrinsic data filtering mechanism to enhance model alignment without requiring additional models. Specifically, when applied to the alignment process of MLLMs, SAE-V-based data filtering achieves more than 110% of baseline performance with less than 50% of the data. Our results highlight SAE-V’s ability to enhance interpretability and alignment in MLLMs, providing insights into their internal mechanisms.


Poster
#E-1009
CoDy: Counterfactual Explainers for Dynamic Graphs

Zhan Qu · Daniel Gomm · Michael Färber

Temporal Graph Neural Networks (TGNNs) are widely used to model dynamic systems where relationships and features evolve over time. Although TGNNs demonstrate strong predictive capabilities in these domains, their complex architectures pose significant challenges for explainability. Counterfactual explanation methods provide a promising solution by illustrating how modifications to input graphs can influence model predictions. To address this challenge, we present CoDy—Counterfactual Explainer for Dynamic Graphs—a model-agnostic, instance-level explanation approach that identifies counterfactual subgraphs to interpret TGNN predictions. CoDy employs a search algorithm that combines Monte Carlo Tree Search with heuristic selection policies, efficiently exploring a vast search space of potential explanatory subgraphs by leveraging spatial, temporal, and local event impact information. Extensive experiments against state-of-the-art factual and counterfactual baselines demonstrate CoDy's effectiveness, with improvements of 16% in AUFSC+ over the strongest baseline. Our code is available at: https://github.com/daniel-gomm/CoDy


Poster
#E-1100
Weight matrices compression based on PDB model in deep neural networks

Xiaoling Wu · Junpeng Zhu · Zeng Li

Weight matrix compression has been demonstrated to effectively reduce overfitting and improve the generalization performance of deep neural networks. Compression is primarily achieved by filtering out noisy eigenvalues of the weight matrix. In this work, a novel Population Double Bulk (PDB) model is proposed to characterize the eigenvalue behavior of the weight matrix, which is more general than the existing Population Unit Bulk (PUB) model. Based on the PDB model and Random Matrix Theory (RMT), we develop a new PDBLS algorithm for determining the boundary between noisy and informative eigenvalues. A PDB Noise-Filtering algorithm is further introduced to reduce the rank of the weight matrix for compression. Experiments show that our PDB model fits the empirical distribution of eigenvalues of the weight matrix better than the PUB model, and our compressed weight matrices have lower rank at the same level of test accuracy. In some cases, our compression method can even improve generalization performance when labels contain noise. The code is available at https://github.com/xlwu571/PDBLS.
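
The PDBLS boundary itself is not given in the abstract, but the general recipe, comparing the weight spectrum against an RMT bulk edge and truncating what falls inside, can be sketched with the classical Marchenko-Pastur edge standing in for the PDB boundary:

```python
import numpy as np

def mp_noise_filter(W, sigma2=None):
    """Low-rank compression of a weight matrix by RMT-style noise filtering.

    Uses the Marchenko-Pastur bulk edge as the noise/information boundary;
    the PDB model of the paper generalizes this single-bulk assumption.
    """
    p, n = W.shape
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    eigs = s ** 2 / n                     # eigenvalues of W W^T / n
    if sigma2 is None:
        sigma2 = np.median(eigs)          # crude noise-scale estimate (assumption)
    edge = sigma2 * (1 + np.sqrt(p / n)) ** 2   # MP upper edge
    keep = eigs > edge                    # spikes above the bulk carry information
    return (U[:, keep] * s[keep]) @ Vt[keep], keep.sum()

rng = np.random.default_rng(0)
signal = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 400))  # rank-5 signal
W = signal + rng.normal(scale=1.0, size=(200, 400))
W_hat, rank = mp_noise_filter(W, sigma2=1.0)   # noise scale known in this toy
print("retained rank:", rank)                  # typically recovers ~5
```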


Poster
#E-1101
Solving Linear-Gaussian Bayesian Inverse Problems with Decoupled Diffusion Sequential Monte Carlo

Filip Ekström Kelvinius · Zheng Zhao · Fredrik Lindsten

A recent line of research has exploited pre-trained generative diffusion models as priors for solving Bayesian inverse problems. We contribute to this research direction by designing a sequential Monte Carlo method for linear-Gaussian inverse problems which builds on "decoupled diffusion", where the generative process is designed such that larger updates to the sample are possible. The method is asymptotically exact and we demonstrate the effectiveness of our Decoupled Diffusion Sequential Monte Carlo (DDSMC) algorithm on synthetic data as well as protein and image data. Further, we demonstrate how the approach can be extended to discrete data.


Poster
#E-1102
Learning Gaussian DAG Models without Condition Number Bounds

Constantinos Daskalakis · Vardis Kandiros · Rui Yao

We study the problem of learning the topology of a directed Gaussian Graphical Model under the equal-variance assumption, where the graph has $n$ nodes and maximum in-degree $d$. Prior work has established that $O(d \log n)$ samples are sufficient for this task. However, an important factor that is often overlooked in these analyses is the dependence on the condition number of the covariance matrix of the model. Indeed, all algorithms from prior work require a number of samples that grows polynomially with this condition number. In many cases this is unsatisfactory, since the condition number could grow polynomially with $n$, rendering these prior approaches impractical in high-dimensional settings. In this work, we provide an algorithm that recovers the underlying graph and prove that the number of samples required is independent of the condition number. Furthermore, we establish lower bounds that nearly match the upper bound up to a $d$-factor, thus providing an almost tight characterization of the true sample complexity of the problem. Moreover, under a further assumption that all the variances of the variables are bounded, we design a polynomial-time algorithm that recovers the underlying graph, at the cost of an additional polynomial dependence of the sample complexity on $d$. We complement our theoretical findings with simulations on synthetic datasets that confirm our predictions.


Spotlight Poster
#E-1103
Parallel Simulation for Log-concave Sampling and Score-based Diffusion Models

Huanjian Zhou · Masashi Sugiyama

Sampling from high-dimensional probability distributions is fundamental in machine learning and statistics. As datasets grow larger, computational efficiency becomes increasingly important, particularly in reducing *adaptive complexity*, namely the number of sequential rounds required for sampling algorithms. While recent works have introduced several parallelizable techniques, they often exhibit suboptimal convergence rates and remain significantly weaker than the latest lower bounds for log-concave sampling. To address this, we propose a novel parallel sampling method that improves the adaptive complexity dependence on dimension $d$, reducing it from $\widetilde{\mathcal{O}}(\log^2 d)$ to $\widetilde{\mathcal{O}}(\log d)$. Our approach builds on parallel simulation techniques from scientific computing.


Poster
#E-1104
Sampling from Binary Quadratic Distributions via Stochastic Localization

Chenguang Wang · Kaiyuan Cui · Weichen Zhao · Tianshu Yu

Sampling from binary quadratic distributions (BQDs) is a fundamental but challenging problem in discrete optimization and probabilistic inference. Previous work established theoretical guarantees for stochastic localization (SL) in continuous domains, where MCMC methods efficiently estimate the required posterior expectations during SL iterations. However, achieving similar convergence guarantees for discrete MCMC samplers in posterior estimation presents unique theoretical challenges. In this work, we present the first application of SL to general BQDs, proving that after a certain number of iterations, the external field of the posterior distributions constructed by SL tends to infinity almost everywhere; these posteriors therefore satisfy Poincaré inequalities with probability close to 1, leading to polynomial-time mixing. This theoretical breakthrough enables efficient sampling from general BQDs, even those that may not originally possess fast mixing properties. Furthermore, our analysis, which covers a broad family of discrete MCMC samplers based on Glauber dynamics and Metropolis-Hastings algorithms, demonstrates the wide applicability of our theoretical framework. Experiments on instances with quadratic unconstrained binary objectives, including maximum independent set, maximum cut, and maximum clique problems, demonstrate consistent improvements in sampling efficiency across different discrete MCMC samplers.


Poster
#E-1105
Enabling Optimal Decisions in Rehearsal Learning under CARE Condition

Wen-Bo Du · Hao-Yi Lei · Lue Tao · Tian-Zuo Wang · Zhi-Hua Zhou

In the field of machine learning (ML), an essential type of decision-related problem is known as AUF (Avoiding Undesired Future): if an ML model predicts an undesired outcome, how can decisions be made to prevent it? Recently, a novel framework called rehearsal learning has been proposed to address the AUF problem. Despite its utility in modeling uncertainty for decision-making, it remains unclear under what conditions and how optimal actions that maximize the AUF probability can be identified. In this paper, we propose CARE (CAnonical REctangle), a condition under which the maximum AUF probability can be achieved. Under the CARE condition, we present a projection-Newton algorithm to select actions and prove that the algorithm achieves superlinear convergence to the optimal one. Besides, we provide a generalization method for adapting the algorithm to AUF scenarios beyond the CARE condition. Finally, we demonstrate that a closed-form solution exists when the outcome is a singleton variable, substantially reducing the time complexity of decision-making. Experiments validate the effectiveness and efficiency of our method.


Poster
#E-1106
Trajectory Inference with Smooth Schrödinger Bridges

Wanli Hong · Yuliang Shi · Jonathan Niles-Weed

Motivated by applications in trajectory inference and particle tracking, we introduce Smooth Schrödinger Bridges. Our proposal generalizes prior work by allowing the reference process in the multi-marginal Schrödinger Bridge problem to be a smooth Gaussian process, leading to more regular and interpretable trajectories in applications. Though naïvely smoothing the reference process leads to a computationally intractable problem, we identify a class of processes (including the Matérn processes) for which the resulting Smooth Schrödinger Bridge problem can be lifted to a simpler problem on phase space, which can be solved in polynomial time. We develop a practical approximation of this algorithm that outperforms existing methods on numerous simulated and real single-cell RNAseq datasets.


Poster
#E-1107
Neighbour-Driven Gaussian Process Variational Autoencoders for Scalable Structured Latent Modelling

Xinxing Shi · Xiaoyu Jiang · Mauricio Álvarez

Gaussian Process (GP) Variational Autoencoders (VAEs) extend standard VAEs by replacing the fully factorised Gaussian prior with a GP prior, thereby capturing richer correlations among latent variables. However, performing exact GP inference in large-scale GPVAEs is computationally prohibitive, often forcing existing approaches to rely on restrictive kernel assumptions or large sets of inducing points. In this work, we propose a neighbour-driven approximation strategy that exploits local adjacencies in the latent space to achieve scalable GPVAE inference. By confining computations to the nearest neighbours of each data point, our method preserves essential latent dependencies, allowing more flexible kernel choices and mitigating the need for numerous inducing points. Through extensive experiments on tasks including representation learning, data imputation, and conditional generation, we demonstrate that our approach outperforms other GPVAE variants in both predictive performance and computational efficiency.


Poster
#E-1108
Optimal Sensor Scheduling and Selection for Continuous-Discrete Kalman Filtering with Auxiliary Dynamics

Mohamad Al Ahdab · John Leth · Zheng-Hua Tan

We study the Continuous-Discrete Kalman Filter (CD-KF) for State-Space Models (SSMs) where continuous-time dynamics are observed via multiple sensors with discrete, irregularly timed measurements. Our focus extends to scenarios in which the measurement process is coupled with the states of an auxiliary SSM. For instance, higher measurement rates may increase energy consumption or heat generation, while a sensor’s accuracy can depend on its own spatial trajectory or that of the measured target. Each sensor thus carries distinct costs and constraints associated with its measurement rate and additional constraints and costs on the auxiliary state. We model measurement occurrences as independent Poisson processes with sensor-specific rates and derive an upper bound on the mean posterior covariance matrix of the CD-KF along the mean auxiliary state. The bound is continuously differentiable with respect to the measurement rates, which enables efficient gradient-based optimization. Exploiting this bound, we propose a finite-horizon optimal control framework to optimize measurement rates and auxiliary-state dynamics jointly. We further introduce a deterministic method for scheduling measurement times from the optimized rates. Empirical results in state-space filtering and dynamic temporal Gaussian process regression demonstrate that our approach achieves improved trade-offs between resource usage and estimation accuracy.


Poster
#E-1109
Posterior Inference with Diffusion Models for High-dimensional Black-box Optimization

Taeyoung Yun · Kiyoung Om · Jaewoo Lee · Sujin Yun · Jinkyoo Park

Optimizing high-dimensional and complex black-box functions is crucial in numerous scientific applications. While Bayesian optimization (BO) is a powerful method for sample-efficient optimization, it struggles with the curse of dimensionality and with scaling to thousands of evaluations. Recently, leveraging generative models to solve black-box optimization problems has emerged as a promising framework. However, those methods often underperform compared to BO methods due to limited expressivity and the difficulty of uncertainty estimation in high-dimensional spaces. To overcome these issues, we introduce DiBO, a novel framework for solving high-dimensional black-box optimization problems. Our method iterates between two stages. First, we train a diffusion model to capture the data distribution and deep ensembles to predict function values with uncertainty quantification. Second, we cast candidate selection as a posterior inference problem to balance exploration and exploitation in high-dimensional spaces. Concretely, we fine-tune diffusion models to amortize posterior inference. Extensive experiments demonstrate that our method outperforms state-of-the-art baselines across synthetic and real-world tasks. Our code is publicly available at https://github.com/umkiyoung/DiBO.


Poster
#E-1200
Generalized Category Discovery via Reciprocal Learning and Class-Wise Distribution Regularization

Duo Liu · Zhiquan Tan · Linglan Zhao · Zhongqiang Zhang · Xiangzhong Fang · Weiran Huang

Generalized Category Discovery (GCD) aims to identify unlabeled samples by leveraging the base knowledge from labeled ones, where the unlabeled set consists of both base and novel classes. Since clustering methods are time-consuming at inference, parametric-based approaches have become more popular. However, recent parametric-based methods suffer from inferior base discrimination due to unreliable self-supervision. To address this issue, we propose a Reciprocal Learning Framework (RLF) that introduces an auxiliary branch devoted to base classification. During training, the main branch filters the pseudo-base samples to the auxiliary branch. In response, the auxiliary branch provides more reliable soft labels for the main branch, leading to a virtuous cycle. Furthermore, we introduce Class-wise Distribution Regularization (CDR) to mitigate the learning bias towards base classes. CDR essentially increases the prediction confidence of the unlabeled data and boosts the novel class performance. Combined with both components, our proposed method, RLCD, achieves superior performance in all classes with negligible extra computation. Comprehensive experiments across seven GCD datasets validate its superiority. Our codes are available at https://github.com/APORduo/RLCD.


Poster
#E-1201
Unsupervised Learning for Class Distribution Mismatch

Pan Du · Zhao · Xinai Lu · Nian Liu · Zhikai Li · Chaoyu Gong · Suyun Zhao · Hong Chen · Cuiping Li · Kai Wang · Yang You

Class distribution mismatch (CDM) refers to the discrepancy between class distributions in training data and target tasks. Previous methods address this by designing classifiers to categorize classes known during training, while grouping unknown or new classes into an "other" category. However, they focus on semi-supervised scenarios and heavily rely on labeled data, limiting their applicability and performance. To address this, we propose Unsupervised Learning for Class Distribution Mismatch (UCDM), which constructs positive-negative pairs from unlabeled data for classifier training. Our approach randomly samples images and uses a diffusion model to add or erase semantic classes, synthesizing diverse training pairs. Additionally, we introduce a confidence-based labeling mechanism that iteratively assigns pseudo-labels to valuable real-world data and incorporates them into the training process. Extensive experiments on three datasets demonstrate UCDM’s superiority over previous semi-supervised methods. Specifically, with a 60% mismatch proportion on the Tiny-ImageNet dataset, our approach, without relying on labeled data, surpasses OpenMatch (with 40 labels per class) by 35.1%, 63.7%, and 72.5% in classifying known, unknown, and new classes.


Spotlight Poster
#E-1202
Learning Dynamics under Environmental Constraints via Measurement-Induced Bundle Structures

Dongzhe Zheng · Wenjie Mei

Learning unknown dynamics under environmental (or external) constraints is fundamental to many fields (e.g., modern robotics) and is particularly challenging when constraint information is only locally available and uncertain. Existing approaches that require global constraints or use probabilistic filtering fail to fully exploit the geometric structure inherent in local measurements (obtained using, e.g., sensors) and constraints. This paper presents a geometric framework unifying measurements, constraints, and dynamics learning through a fiber bundle structure over the state space. This naturally induced geometric structure enables measurement-aware Control Barrier Functions that adapt to local sensing (or measurement) conditions. By integrating Neural ODEs, our framework learns continuous-time dynamics while preserving geometric constraints, with theoretical guarantees of learning convergence and constraint satisfaction that depend on sensing quality. The geometric framework not only enables efficient dynamics learning but also suggests promising directions for integration with reinforcement learning approaches. Extensive simulations demonstrate significant improvements in both learning efficiency and constraint satisfaction over traditional methods, especially under limited and uncertain sensing conditions.


Poster
#E-1203
Online Differentially Private Conformal Prediction for Uncertainty Quantification

Qiangqiang Zhang · Ting Li · Xinwei Feng · Xiaodong Yan · Jinhan Xie

Traditional conformal prediction faces significant challenges with the rise of streaming data and increasing concerns over privacy. In this paper, we introduce a novel online differentially private conformal prediction framework, designed to construct dynamic, model-free private prediction sets. Unlike existing approaches that either disregard privacy or require full access to the entire dataset, our proposed method ensures individual privacy with a one-pass algorithm, ideal for real-time, privacy-preserving decision-making. Theoretically, we establish guarantees for long-run coverage at the nominal confidence level. Moreover, we extend our method to conformal quantile regression, which is fully adaptive to heteroscedasticity. We validate the effectiveness and applicability of the proposed method through comprehensive simulations and real-world studies on the ELEC2 and PAMAP2 datasets.
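
As a point of reference, a one-pass online conformal update (an adaptive-conformal-style recursion, shown only to make the streaming structure concrete; the paper's method additionally randomizes the update to provide differential privacy, which is omitted here) might look like:

```python
import numpy as np

def online_conformal(scores, alpha=0.1, lr=0.05):
    """One-pass online quantile tracking for conformal prediction.

    q is the running conformal threshold; the update nudges it so that the
    long-run miscoverage rate approaches alpha. A DP variant would add
    calibrated noise to this update; that part is not sketched here.
    """
    q, coverage = 1.0, []
    for s in scores:                  # s = nonconformity score of the new point
        err = float(s > q)            # 1 if the point falls outside the set
        q += lr * (err - alpha)       # widen after misses, shrink otherwise
        coverage.append(1 - err)
    return q, np.mean(coverage)

rng = np.random.default_rng(0)
q, cov = online_conformal(np.abs(rng.normal(size=5000)), alpha=0.1)
print(f"final threshold {q:.2f}, empirical coverage {cov:.3f}")
```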


Poster
#E-1204
Large Continual Instruction Assistant

Jingyang Qiao · zhizhong zhang · Xin Tan · Yanyun Qu · Shouhong Ding · Yuan Xie

Continual Instruction Tuning (CIT) is adopted to continually instruct large models to follow human intent, dataset by dataset. It has been observed that existing gradient updates heavily degrade performance on previous datasets during the CIT process. In contrast, the Exponential Moving Average (EMA) can track previous parameters, which helps decrease forgetting. Nonetheless, its static balance weight cannot cope with ever-changing datasets, leading to an imbalance between plasticity and stability. In this paper, we propose a general continual instruction tuning framework to address this challenge. Starting from the trade-off prerequisite and the EMA update rule, we derive an ideal condition for plasticity and stability. Based on a Taylor expansion of the loss function, we find that the optimal balance weight can be automatically determined by the gradients and learned parameters. Therefore, we propose a stable-plasticity balanced coefficient to avoid knowledge interference. Based on the semantic similarity of the instructions, we can determine whether to retrain or expand the training parameters and allocate the most suitable parameters for the testing instances. Extensive experiments across multiple continual instruction tuning benchmarks demonstrate that our approach not only enhances anti-forgetting capabilities but also significantly improves overall continual tuning performance. Our code is available at https://github.com/JingyangQiao/CoIN.
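
The EMA recursion at the core of the framework is easy to sketch; the per-step balance weight below is a placeholder heuristic, since the paper's closed-form coefficient comes from its Taylor-expansion analysis:

```python
import torch

@torch.no_grad()
def ema_update(ema_params, model_params, beta):
    """theta_ema <- beta * theta_ema + (1 - beta) * theta."""
    for p_ema, p in zip(ema_params, model_params):
        p_ema.mul_(beta).add_(p, alpha=1.0 - beta)

def balance_weight(grad_norm, base=0.99, scale=10.0):
    """Placeholder for the stability-plasticity coefficient: larger gradients
    (more new-task learning) shift weight toward the fresh parameters.
    The closed-form coefficient derived in the paper replaces this heuristic."""
    return base / (1.0 + scale * grad_norm)

model = torch.nn.Linear(8, 2)
ema = [p.detach().clone() for p in model.parameters()]

x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
loss = torch.nn.functional.cross_entropy(model(x), y)
loss.backward()
gnorm = sum(p.grad.norm() for p in model.parameters()).item()
ema_update(ema, model.parameters(), beta=balance_weight(gnorm))
```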


Poster
#E-1205
On the Diversity of Adversarial Ensemble Learning

Jun-Qi Guo · Meng-Zhang Qian · Wei Gao · Zhi-Hua Zhou

Diversity has been one of the most crucial factors in the design of adversarial ensemble methods. This work focuses on two fundamental problems: how to define diversity for adversarial ensembles, and how diversity correlates with algorithmic performance. We first show that it is an NP-hard problem to precisely calculate the diversity of two networks in adversarial ensemble learning, which makes it different from prior diversity analysis. We present the first diversity decomposition under the first-order approximation for adversarial ensemble learning. Specifically, the adversarial ensemble loss can be decomposed into the average of individual adversarial losses, gradient diversity, prediction diversity, and cross diversity. Hence, it is not sufficient to merely consider gradient diversity in the characterization of diversity, as in previous adversarial ensemble methods. We similarly present a diversity decomposition for classification with the cross-entropy loss. Based on the theoretical analysis, we develop a new ensemble method via orthogonal adversarial predictions to simultaneously improve gradient diversity and cross diversity. We finally conduct experiments to validate the effectiveness of our method.


Poster
#E-1206
Efficient Bisection Projection to Ensure Neural-Network Solution Feasibility for Optimization over General Set

Enming Liang · Minghua Chen

Neural networks (NNs) have emerged as promising tools for solving constrained optimization problems in real-time. However, ensuring constraint satisfaction for NN-generated solutions remains challenging due to prediction errors. Existing methods to ensure NN feasibility either suffer from high computational complexity or are limited to specific constraint types. We present Bisection Projection, an efficient approach to ensure NN solution feasibility for optimization over general compact sets with non-empty interiors. Our method comprises two key components: (i) a dedicated NN (called IPNN) that predicts interior points (IPs) with low eccentricity, which naturally accounts for approximation errors; (ii) a bisection algorithm that leverages these IPs to recover solution feasibility when initial NN solutions violate constraints. We establish theoretical guarantees by providing sufficient conditions for IPNN feasibility and proving bounded optimality loss of the bisection operation under IP predictions. Extensive evaluations on real-world non-convex problems demonstrate that Bisection Projection achieves superior feasibility and computational efficiency compared to existing methods, while maintaining comparable optimality gaps.
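
The bisection component is straightforward to make concrete: given a strictly feasible interior point (in the paper, predicted by IPNN) and an infeasible NN solution, binary-search along the segment between them. A sketch for a generic membership oracle, with the unit ball as a stand-in constraint set:

```python
import numpy as np

def bisection_projection(x_nn, x_ip, feasible, tol=1e-8):
    """Recover feasibility by bisecting on the segment [x_ip, x_nn].

    x_ip must be strictly feasible (interior); if x_nn is already feasible it
    is returned unchanged, otherwise the last feasible point on the segment
    is returned.
    """
    if feasible(x_nn):
        return x_nn
    lo, hi = 0.0, 1.0   # x(t) = x_ip + t * (x_nn - x_ip); x(0) is feasible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if feasible(x_ip + mid * (x_nn - x_ip)):
            lo = mid
        else:
            hi = mid
    return x_ip + lo * (x_nn - x_ip)

# Toy compact set: the unit ball. The interior point would come from IPNN.
feasible = lambda x: np.linalg.norm(x) <= 1.0
x = bisection_projection(np.array([2.0, 2.0]), np.zeros(2), feasible)
print(x, np.linalg.norm(x))   # lands on the boundary, norm ~= 1
```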


Poster
#E-1207
Probabilistic Group Mask Guided Discrete Optimization for Incremental Learning

Fengqiang Wan · Yang Yang

Incremental learning (IL) aims to sequentially learn new tasks while mitigating catastrophic forgetting. Among various IL strategies, parameter-isolation methods stand out by using mask techniques to allocate distinct parameters to each task, explicitly addressing forgetting. However, existing approaches often disregard parameter dependencies, resulting in an over-reliance on newly allocated parameters. To address this issue, we propose Probabilistic Group Mask selection (PGM), a group-wise approach that captures parameter dependencies by exploring candidate masks within each group. Specifically, PGM partitions parameters into groups with multiple candidate masks, assigning probabilities to these masks and leveraging Gumbel-Softmax for differentiable sampling, enabling efficient optimization of the discrete mask selection process. Our theoretical analysis demonstrates that incorporating parameter dependencies enhances sub-network selection. Experiments conducted on standard benchmarks confirm its superior effectiveness compared to existing IL approaches. The source code is available at: \url{https://github.com/njustkmg/ICML25-PGM}.
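
A minimal sketch of the group-wise selection mechanism as described, using PyTorch's Gumbel-Softmax with a straight-through estimator; the candidate masks, group sizes, and temperature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

n_groups, n_candidates, group_size = 4, 3, 8

# Fixed candidate binary masks per group and learnable selection logits.
candidates = (torch.rand(n_groups, n_candidates, group_size) > 0.5).float()
logits = torch.nn.Parameter(torch.zeros(n_groups, n_candidates))

def sample_group_mask(tau=1.0, hard=True):
    """Differentiable selection of one candidate mask per group.

    Gumbel-Softmax yields (nearly) one-hot selection weights; hard=True
    applies a straight-through estimator so the forward pass uses discrete
    masks while gradients still flow to the logits.
    """
    sel = F.gumbel_softmax(logits, tau=tau, hard=hard)   # (groups, candidates)
    # Weighted sum over candidates -> one concatenated parameter mask.
    return torch.einsum("gc,gcd->gd", sel, candidates).reshape(-1)

mask = sample_group_mask()
params = torch.randn(n_groups * group_size)
masked = params * mask   # only the selected sub-network's parameters pass through
```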


Poster
#E-1208
Learning from True-False Labels via Multi-modal Prompt Retrieving

Zhongnian Li · Jinghao Xu · Peng Ying · Meng Wei · Xinzheng Xu

Pre-trained Vision-Language Models (VLMs) exhibit strong zero-shot classification abilities, demonstrating great potential for generating weakly supervised labels. Unfortunately, existing weakly supervised learning methods fall short in generating accurate labels via VLMs. In this paper, we propose a novel weakly supervised labeling setting, namely True-False Labels (TFLs), which can achieve high accuracy when generated by VLMs. A TFL indicates whether an instance belongs to a label that is randomly and uniformly sampled from the candidate label set. Specifically, we theoretically derive a risk-consistent estimator to explore and utilize the conditional probability distribution information of TFLs. Besides, we propose a convolution-based Multi-modal Prompt Retrieving (MRP) method to bridge the gap between the knowledge of VLMs and target learning tasks. Experimental results demonstrate the effectiveness of the proposed TFL setting and MRP learning method. The code to reproduce the experiments is at https://github.com/Tranquilxu/TMP.


Poster
#E-1209
Random Policy Evaluation Uncovers Policies of Generative Flow Networks

Haoran He · Emmanuel Bengio · Qingpeng Cai · Ling Pan

The Generative Flow Network (GFlowNet) is a probabilistic framework in which an agent learns a stochastic policy and flow functions to sample objects with probability proportional to an unnormalized reward function. GFlowNets share a strong connection with reinforcement learning (RL), which typically aims to maximize reward. A number of recent works explored connections between GFlowNets and maximum entropy (MaxEnt) RL, which incorporates entropy regularization into the standard RL objective. However, the relationship between GFlowNets and standard RL remains largely unexplored, despite the inherent similarities in their sequential decision-making nature. While GFlowNets can discover diverse solutions through specialized flow-matching objectives, connecting them to standard RL can simplify their implementation through well-established RL principles and also improve RL's capabilities in diverse solution discovery (a critical requirement in many real-world applications); bridging this gap can further unlock the potential of both fields. In this paper, we bridge this gap by revealing a fundamental connection between GFlowNets and one of the most basic components of RL -- policy evaluation. Surprisingly, we find that the value function obtained from evaluating a uniform policy is closely associated with the flow functions in GFlowNets. Building upon these insights, we introduce a rectified random policy evaluation (RPE) algorithm, which achieves the same reward-matching effect as GFlowNets based on simply evaluating a fixed random policy, offering a new perspective. Empirical results across extensive benchmarks demonstrate that RPE achieves competitive results compared to previous approaches, shedding light on the previously overlooked connection between (non-MaxEnt) RL and GFlowNets.
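
Policy evaluation of a uniform policy is the elementary ingredient here, and it is easy to make concrete on a toy DAG via backward induction (the rewards and graph are illustrative; relating the resulting values to GFlowNet flows, and the rectification step, are the paper's contributions, not shown):

```python
import numpy as np

# A tiny DAG: states 0..4, terminal states carry unnormalized rewards.
children = {0: [1, 2], 1: [3, 4], 2: [4], 3: [], 4: []}
reward = {3: 2.0, 4: 1.0}

def eval_uniform_policy():
    """Evaluate the uniform random policy by backward induction on the DAG."""
    V = {}
    for s in sorted(children, reverse=True):   # children come after parents here
        if not children[s]:
            V[s] = reward[s]                   # terminal: value = reward
        else:
            V[s] = np.mean([V[c] for c in children[s]])  # uniform over actions
    return V

print(eval_uniform_policy())   # V relates to GFlowNet flows per the paper
```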


Poster
#E-1300
Towards the Efficient Inference by Incorporating Automated Computational Phenotypes under Covariate Shift

Chao Ying · Jun Jin · Yi Guo · Xiudi Li · Muxuan Liang · Jiwei Zhao

Collecting gold-standard phenotype data via manual extraction is typically labor-intensive and slow, whereas automated computational phenotypes (ACPs) offer a systematic and much faster alternative. However, simply replacing the gold-standard with ACPs, without acknowledging their differences, could lead to biased results and misleading conclusions. Motivated by the complexity of incorporating ACPs while maintaining the validity of downstream analyses, in this paper, we consider a semi-supervised learning setting that consists of both labeled data (with gold-standard) and unlabeled data (without gold-standard), under the covariate shift framework. We develop doubly robust and semiparametrically efficient estimators that leverage ACPs for general target parameters in the unlabeled and combined populations. In addition, we carefully analyze the efficiency gains achieved by incorporating ACPs, comparing scenarios with and without their inclusion. Notably, we identify that ACPs for the unlabeled data, instead of for the labeled data, drive the enhanced efficiency gains. To validate our theoretical findings, we conduct comprehensive synthetic experiments and apply our method to multiple real-world datasets, confirming the practical advantages of our approach.


Poster
#E-1301
Socialized Coevolution: Advancing a Better World through Cross-Task Collaboration

Xinjie Yao · Yu Wang · Pengfei Zhu · Wanyu Lin · Ruipu Zhao · Zhoupeng Guo · Weihao Li · Qinghua Hu

Traditional machine societies rely on data-driven learning, overlooking interactions and limiting knowledge acquisition from model interplay. To address these issues, we revisit the development of machine societies by drawing inspiration from the evolutionary processes of human societies. Motivated by Social Learning (SL), this paper introduces a practical paradigm of Socialized Coevolution (SC). Compared to most existing methods focused on knowledge distillation and multi-task learning, our work addresses a more challenging problem: not only enhancing the capacity to solve new downstream tasks but also improving the performance of existing tasks through inter-model interactions. Inspired by cognitive science, we propose Dynamic Information Socialized Collaboration (DISC), which achieves SC through interactions between models specialized in different downstream tasks. Specifically, we introduce the dynamic hierarchical collaboration and dynamic selective collaboration modules to enable dynamic and effective interactions among models, allowing them to acquire knowledge from these interactions. Finally, we explore potential future applications of combining SL and SC, discuss open questions, and propose directions for future research, aiming to spark interest in this emerging and exciting interdisciplinary field. Our code will be publicly available at https://github.com/yxjdarren/SC.


Spotlight Poster
#E-1302
Upweighting Easy Samples in Fine-Tuning Mitigates Forgetting

Sunny Sanyal · Hayden Prairie · Rudrajit Das · Ali Kavis · Sujay Sanghavi

Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities, a phenomenon known as "catastrophic forgetting". This is especially an issue when one does not have access to the data and recipe used to develop the pre-trained model. Under this constraint, most existing methods for mitigating forgetting are inapplicable. To address this challenge, we propose a *sample weighting scheme for the fine-tuning data* solely based on the pre-trained model's losses. Specifically, we upweight the easy samples on which the pre-trained model's loss is low and vice versa to limit the drift from the pre-trained model. Our approach is orthogonal and yet complementary to existing methods; while such methods mostly operate on parameter or gradient space, we concentrate on the sample space. We theoretically analyze the impact of fine-tuning with our method in a linear setting, showing that it stalls learning in a certain subspace, which inhibits overfitting to the target task. We empirically demonstrate the efficacy of our method on both language and vision tasks. As an example, when fine-tuning Gemma 2 2B on MetaMathQA, our method results in only a $0.8$% drop in accuracy on GSM8K (another math dataset) compared to standard fine-tuning, while preserving $5.4$% more accuracy on the pre-training datasets.
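
A minimal sketch of the sample-weighting idea under stated assumptions: weights come from the frozen pre-trained model's per-sample losses, with a temperature-scaled softmax over negative losses as one natural (but assumed) choice of weighting function:

```python
import torch

@torch.no_grad()
def easy_sample_weights(pretrained, x, y, temp=1.0):
    """Per-sample weights from the *pre-trained* model's losses: low-loss
    ("easy") samples get large weights. The exact weighting function in the
    paper may differ; a temperature-scaled softmax is one plausible choice."""
    losses = torch.nn.functional.cross_entropy(pretrained(x), y, reduction="none")
    return torch.softmax(-losses / temp, dim=0) * len(losses)  # mean weight ~ 1

def weighted_finetune_step(model, pretrained, x, y, opt):
    w = easy_sample_weights(pretrained, x, y)
    losses = torch.nn.functional.cross_entropy(model(x), y, reduction="none")
    loss = (w * losses).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

pretrained = torch.nn.Linear(16, 4)   # stand-in for the frozen base model
model = torch.nn.Linear(16, 4)
model.load_state_dict(pretrained.state_dict())
opt = torch.optim.SGD(model.parameters(), lr=0.1)
weighted_finetune_step(model, pretrained, torch.randn(32, 16),
                       torch.randint(0, 4, (32,)), opt)
```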


Poster
#E-1303
Fast and Robust: Task Sampling with Posterior and Diversity Synergies for Adaptive Decision-Makers in Randomized Environments

Yun Qu · Cheems Wang · Yixiu Mao · Yiqin Lv · Xiangyang Ji

Task-robust adaptation is a long-standing pursuit in sequential decision-making. Some risk-averse strategies, e.g., the conditional value-at-risk principle, are incorporated in domain randomization or meta reinforcement learning to prioritize difficult tasks in optimization, which demands costly intensive evaluations. The efficiency issue prompts the development of robust active task sampling to train adaptive policies, where risk-predictive models can surrogate policy evaluation. This work characterizes robust active task sampling as a secret Markov decision process, develops theoretical and practical insights, and establishes robustness concepts in risk-averse scenarios. Importantly, we propose an easy-to-implement method, referred to as Posterior and Diversity Synergized Task Sampling (PDTS), to accommodate fast and robust sequential decision-making. Extensive experiments show that PDTS unlocks the potential of robust active task sampling, significantly improves zero-shot and few-shot adaptation robustness in challenging tasks, and even accelerates the learning process under certain scenarios.


Poster
#E-1304
Modeling Multi-Task Model Merging as Adaptive Projective Gradient Descent

Yongxian Wei · Anke Tang · Li Shen · Zixuan Hu · Chun Yuan · Xiaochun Cao

Merging multiple expert models offers a promising approach for performing multi-task learning without accessing their original data. Existing methods attempt to alleviate task conflicts by sparsifying task vectors or promoting orthogonality among them. However, they overlook the fundamental target of model merging: the merged model performs as closely as possible to task-specific models on respective tasks. We find these methods inevitably discard task-specific information that, while causing conflicts, is crucial for performance. Based on our findings, we frame model merging as a constrained optimization problem ($\textit{i.e.}$, minimizing the gap between the merged model and individual models, subject to the constraint of retaining shared knowledge) and solve it via adaptive projective gradient descent. Specifically, we align the merged model with individual models by decomposing and reconstituting the loss function, alleviating conflicts through $\textit{data-free}$ optimization of task vectors. To retain shared knowledge, we optimize this objective by projecting gradients within a $\textit{shared subspace}$ spanning all tasks. Moreover, we view merging coefficients as adaptive learning rates and propose a task-aware, training-free strategy. Experiments show that our plug-and-play approach consistently outperforms previous methods, achieving state-of-the-art results across diverse architectures and tasks in both vision and NLP domains.


Poster
#E-1305
BSLoRA: Enhancing the Parameter Efficiency of LoRA with Intra-Layer and Inter-Layer Sharing

Yuhua Zhou · Ruifeng Li · Changhai Zhou · Fei Yang · Aimin Pan

Low-Rank Adaptation (LoRA) is a widely adopted parameter-efficient fine-tuning method for large language models (LLMs) to adapt to downstream tasks. However, in scenarios where multiple LoRA models are deployed simultaneously, standard LoRA introduces substantial trainable parameters, resulting in significant memory overhead and inference latency, particularly when supporting thousands of downstream tasks on a single server. While existing methods reduce stored parameters via parameter sharing, they fail to capture both local and global information simultaneously. To address this issue, we propose the Bi-Share LoRA (BSLoRA), which extends local LoRA with intra-LoRA and inter-LoRA parameter sharing to better capture local and global information. This approach reduces trainable parameters while maintaining or even enhancing model performance. Additionally, we design three transformation methods to improve the compatibility and collaborative efficiency of shared parameters with varying shapes, enhancing overall adaptability. Experiments on the 7B, 8B, and 13B versions of Llama show that BSLoRA, with only 44.59% of the parameters of standard LoRA, outperforms LoRA by approximately 0.33% on commonsense reasoning and 2.08% on MMLU benchmarks. Code is available at https://github.com/yuhua-zhou/BSLoRA.git.


Poster
#E-1306
WeGeFT: Weight‑Generative Fine‑Tuning for Multi‑Faceted Efficient Adaptation of Large Models

Chinmay Savadikar · Xi Song · Tianfu Wu

Fine-tuning large pretrained Transformer models can focus on either introducing a small number of new learnable parameters (parameter efficiency) or editing representations of a small number of tokens using lightweight modules (representation efficiency). While the pioneering method LoRA (Low-Rank Adaptation) inherently balances parameter, compute, and memory efficiency, many subsequent variants trade off compute and memory efficiency and/or performance to further reduce fine-tuning parameters. To address this limitation and unify parameter-efficient and representation-efficient fine-tuning, we propose Weight-Generative Fine-Tuning (WeGeFT, pronounced wee-gift), a novel approach that learns to generate fine-tuning weights directly from the pretrained weights. WeGeFT employs a simple low-rank formulation consisting of two linear layers, either shared across multiple layers of the pretrained model or individually learned for different layers. This design achieves multi-faceted efficiency in parameters, representations, compute, and memory, while maintaining or exceeding the performance of LoRA and its variants. Extensive experiments on commonsense reasoning, arithmetic reasoning, instruction following, code generation, and visual recognition verify the effectiveness of our proposed WeGeFT.


Poster
#E-1307
Self-Bootstrapping for Versatile Test-Time Adaptation

Shuaicheng Niu · Guohao Chen · Peilin Zhao · Tianyi Wang · Pengcheng Wu · Zhiqi Shen

In this paper, we seek to develop a versatile test-time adaptation (TTA) objective for a variety of tasks — classification and regression across image-, object-, and pixel-level predictions. We achieve this through a self-bootstrapping scheme that optimizes prediction consistency between the test image (as target) and its deteriorated view. The key challenge lies in devising effective augmentations/deteriorations that: i) preserve the image’s geometric information, e.g., object sizes and locations, which is crucial for TTA on object/pixel-level tasks, and ii) provide sufficient learning signals for TTA. To this end, we analyze how common distribution shifts affect the image's information power across spatial frequencies in the Fourier domain, and reveal that low-frequency components carry high power and masking these components supplies more learning signals, while masking high-frequency components cannot. In light of this, we randomly mask the low-frequency amplitude of an image in its Fourier domain for augmentation. Meanwhile, we also augment the image with noise injection to compensate for missing learning signals at high frequencies, by enhancing the information power there. Experiments show that, either independently or as a plug-and-play module, our method achieves superior results across classification, segmentation, and 3D monocular detection tasks with both transformer and CNN models.
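
A minimal sketch of the augmentation described above: mask the amplitude of a low-frequency disk in the Fourier domain and inject noise. The mask radius and noise level are illustrative values, and the paper masks randomly rather than with this deterministic disk:

```python
import torch

def low_freq_amplitude_mask(img, radius=0.1, noise_std=0.05):
    """Build a deteriorated view for self-bootstrapping TTA (sketch).

    Masks the amplitude of low spatial frequencies (which carry most of the
    information power) and injects noise to add learning signal at high
    frequencies. 'radius' and 'noise_std' are illustrative assumptions.
    """
    f = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    amp, phase = f.abs(), f.angle()
    h, w = img.shape[-2:]
    yy, xx = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    dist = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2).sqrt()
    mask = (dist > radius * min(h, w)).float()     # zero out the low-freq disk
    f_masked = torch.polar(amp * mask, phase)      # recombine amplitude + phase
    out = torch.fft.ifft2(torch.fft.ifftshift(f_masked, dim=(-2, -1))).real
    return out + noise_std * torch.randn_like(out)

aug = low_freq_amplitude_mask(torch.rand(3, 64, 64))
```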


Poster
#E-1308
Learnable Spatial-Temporal Positional Encoding for Link Prediction

Katherine Tieu · Dongqi Fu · Zihao Li · Ross Maciejewski · Jingrui He

Accurate predictions rely on the expressiveness of graph deep learning frameworks like graph neural networks and graph transformers, where a positional encoding mechanism has become indispensable in recent state-of-the-art (SOTA) works to record canonical position information. However, current positional encodings are limited in at least three aspects: (1) most positional encodings are pre-defined, fixed functions, which are inadequate to adapt to complex attributed graphs; (2) a few pioneering works propose learnable positional encodings but are still limited to structural information, leaving real-world time-evolving topological and feature information untouched; (3) most positional encodings must be paired with a transformer's attention mechanism to fully release their power, where dense or relational attention is often unaffordable on large-scale structured data. Hence, we study the possibility of Learnable Spatial-Temporal Positional Encoding in an effective and efficient manner and propose a simple temporal link prediction model named L-STEP. Briefly, for L-STEP, we (1) prove that the proposed positional learning scheme preserves the graph property from the spatial-temporal spectral viewpoint, (2) verify that MLPs can fully exploit the expressiveness and reach Transformers' performance on that encoding, (3) change different initial positional encoding inputs to show robustness, (4) analyze the theoretical complexity and obtain less empirical running time than SOTA, and (5) demonstrate superior temporal link prediction performance on 13 classic datasets against 10 algorithms in both transductive and inductive settings, using 3 different sampling strategies. L-STEP also obtains leading performance in the newest large-scale TGB benchmark.


Poster
#E-1309
Sparse Causal Discovery with Generative Intervention for Unsupervised Graph Domain Adaptation

Junyu Luo · Yuhao Tang · Yiwei Fu · Xiao Luo · Zhizhuo Kou · Zhiping Xiao · Wei Ju · Wentao Zhang · Ming Zhang

Unsupervised Graph Domain Adaptation (UGDA) leverages labeled source domain graphs to achieve effective performance in unlabeled target domains despite distribution shifts. However, existing methods often yield suboptimal results due to the entanglement of causal-spurious features and the failure of global alignment strategies. We propose SLOGAN (Sparse Causal Discovery with Generative Intervention), a novel approach that achieves stable graph representation transfer through sparse causal modeling and dynamic intervention mechanisms. Specifically, SLOGAN first constructs a sparse causal graph structure, leveraging mutual information bottleneck constraints to disentangle sparse, stable causal features while compressing domain-dependent spurious correlations through variational inference. To address residual spurious correlations, we innovatively design a generative intervention mechanism that breaks local spurious couplings through cross-domain feature recombination while maintaining causal feature semantic consistency via covariance constraints. Furthermore, to mitigate error accumulation in target domain pseudo-labels, we introduce a category-adaptive dynamic calibration strategy, ensuring stable discriminative learning. Extensive experiments on multiple real-world datasets demonstrate that SLOGAN significantly outperforms existing baselines.


Poster
#E-1400
Improving Multi-Class Calibration through Normalization-Aware Isotonic Techniques

Alon Arad · Saharon Rosset

Accurate and reliable probability predictions are essential for multi-class supervised learning tasks, where well-calibrated models enable rational decision-making. While isotonic regression has proven effective for binary calibration, its extension to multi-class problems via one-vs-rest calibration often produces suboptimal results, limiting its practical adoption. In this work, we propose novel isotonic normalization-aware techniques for multi-class calibration, grounded in natural and intuitive assumptions expected by practitioners. Unlike prior approaches, our methods inherently account for probability normalization by either incorporating normalization directly into the optimization process (NA-FIR) or modeling the problem as a cumulative bivariate isotonic regression (SCIR). Empirical evaluations on a variety of text and image classification datasets across different model architectures reveal that our approach consistently improves log loss and expected calibration error (ECE) metrics. These findings underscore the potential of our approach to enhance non-parametric multi-class calibration practices, offering an adaptable solution for real-world applications.
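
For context, the classical one-vs-rest isotonic baseline with after-the-fact renormalization (the approach the abstract calls suboptimal, and which NA-FIR/SCIR replace by building normalization into the fit) can be sketched as:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def ovr_isotonic_calibrate(probs_val, y_val, probs_test):
    """One-vs-rest isotonic calibration with post-hoc renormalization.

    This is the classical baseline; the paper's methods instead make the
    normalization constraint part of the isotonic fit itself.
    """
    n_classes = probs_val.shape[1]
    models = []
    for k in range(n_classes):
        iso = IsotonicRegression(out_of_bounds="clip", y_min=0.0, y_max=1.0)
        iso.fit(probs_val[:, k], (y_val == k).astype(float))
        models.append(iso)
    cal = np.column_stack([m.predict(probs_test[:, k])
                           for k, m in enumerate(models)])
    cal += 1e-12
    return cal / cal.sum(axis=1, keepdims=True)   # renormalize after the fact

rng = np.random.default_rng(0)
logits = rng.normal(size=(500, 3)) * 3.0          # overconfident toy scores
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
y = rng.integers(0, 3, 500)
print(ovr_isotonic_calibrate(probs[:400], y[:400], probs[400:]).sum(axis=1)[:3])
```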


Spotlight Poster
#E-1401
TLLC: Transfer Learning-based Label Completion for Crowdsourcing

Wenjun Zhang · Liangxiao Jiang · Chaoqun Li

Label completion serves as a preprocessing approach to handling the sparse crowdsourced label matrix problem, significantly boosting the effectiveness of downstream label aggregation. In recent advances, worker modeling has proven to be a powerful strategy for further improving the performance of label completion. However, in real-world scenarios, workers typically annotate only a few instances, leading to insufficient worker modeling and thus limiting the improvement of label completion. To address this issue, we propose a novel transfer learning-based label completion (TLLC) method. Specifically, we first identify all high-confidence instances from the whole crowdsourced data as a source domain and use it to pretrain a Siamese network. The abundant annotated instances in the source domain provide essential knowledge for worker modeling. Then, we transfer the pretrained network to the target domain with the instances annotated by each worker separately, ensuring that worker modeling captures the unique characteristics of each worker. Finally, we leverage the new embeddings learned by the transferred network to complete each worker's missing labels. Extensive experiments on several widely used real-world datasets demonstrate the effectiveness of TLLC. Our codes and datasets are available at https://github.com/jiangliangxiao/TLLC.


Poster
#E-1402
Cut out and Replay: A Simple yet Versatile Strategy for Multi-Label Online Continual Learning

Xinrui Wang · Shao-Yuan Li · Jiaqiang Zhang · Songcan Chen

Multi-Label Online Continual Learning (MOCL) requires models to learn continuously from endless multi-label data streams, facing complex challenges including persistent catastrophic forgetting, potential missing labels, and uncontrollable imbalanced class distributions. While existing MOCL methods attempt to address these challenges through various techniques, \textit{they all overlook label-specific region identification and feature learning} - a fundamental solution rooted in multi-label learning but challenging to achieve in the online setting with incremental and partial supervision. To this end, we first leverage the inherent structural information of input data to evaluate and verify the innate localization capability of different pre-trained models. Then, we propose CUTER (CUT-out-and-Experience-Replay), a simple yet versatile strategy that provides fine-grained supervision signals by identifying, strengthening, and cutting out label-specific regions for efficient experience replay. It not only enables models to simultaneously address the catastrophic forgetting, missing label, and class imbalance challenges, but also serves as an orthogonal solution that seamlessly integrates with existing approaches. Extensive experiments on multiple multi-label image benchmarks demonstrate the superiority of our proposed method. The code is available at \href{https://github.com/wxr99/Cut-Replay}{https://github.com/wxr99/Cut-Replay}


Poster
#E-1403
Gradient Aligned Regression via Pairwise Losses

Dixian Zhu · Tianbao Yang · Livnat Jerby

Regression is a fundamental task in machine learning that has garnered extensive attention over the past decades. The conventional approach to regression employs loss functions that primarily concentrate on aligning the model prediction with the ground truth for each individual data sample. Recent research has introduced novel perspectives by incorporating label similarity into regression through additional pairwise regularization or contrastive learning on the latent feature space, demonstrating their effectiveness. However, there are two drawbacks to these approaches: (i) their pairwise operations in the latent feature space are computationally more expensive than conventional regression losses; (ii) they lack theoretical insights. In this work, we propose GAR (Gradient Aligned Regression) as a competitive alternative in label space, constituted by a conventional regression loss and two pairwise label-difference losses for gradient alignment in both magnitude and direction. GAR enjoys: (i) the same level of efficiency as conventional regression losses, because the quadratic complexity of the proposed pairwise losses can be reduced to linear complexity; (ii) theoretical insights connecting learning pairwise label differences to learning the gradient of the ground-truth function. We limit our current scope to regression in the clean-data setting, without noise, outliers, or distributional shifts. We demonstrate the effectiveness of the proposed method on two synthetic datasets and on eight real-world tasks from six benchmark datasets against eight competitive baselines. Running-time experiments demonstrate the superior efficiency of the proposed GAR compared to existing methods with pairwise regularization or contrastive learning in the latent feature space. Additionally, ablation studies confirm the effectiveness of each component of GAR. The code is open sourced at https://github.com/DixianZhu/GAR.
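
To see why such pairwise losses need not cost $O(N^2)$, here is a hedged illustration (not necessarily GAR's exact loss) of the standard algebraic reduction for a pairwise squared-difference penalty on residuals:

```python
import numpy as np

# With residuals d_i = f_i - y_i, the pairwise loss
# sum_{i,j} (d_i - d_j)^2 expands to 2N*sum(d_i^2) - 2*(sum d_i)^2,
# so the O(N^2) double sum collapses to two O(N) reductions.
rng = np.random.default_rng(0)
f, y = rng.normal(size=1000), rng.normal(size=1000)
d, n = f - y, 1000

quadratic = ((d[:, None] - d[None, :]) ** 2).sum()  # O(N^2) double sum
linear = 2 * n * (d ** 2).sum() - 2 * d.sum() ** 2  # equivalent, O(N)
assert np.isclose(quadratic, linear)
```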


Poster
#E-1404
Learnware Specification via Dual Alignment

Wei Chen · Jun-Xiang Mao · Xiaozheng Wang · Min-Ling Zhang

The learnware paradigm aims to establish a learnware dock system that contains numerous learnwares, each consisting of a well-trained model and a specification, enabling users to reuse high-performing models for their tasks instead of training from scratch. The specification, as a unique characterization of the model's specialties, dominates the effectiveness of model reuse. Existing specification methods mainly employ distribution alignment to generate specifications. However, this approach overlooks the model's discriminative performance, hindering an adequate characterization of its specialties. In this paper, we claim that it is beneficial to incorporate such discriminative performance for high-quality specification generation. Accordingly, a novel specification approach named Dali, i.e., Learnware Specification via Dual ALIgnment, is proposed. In Dali, the characterization of the model's discriminative performance is modeled as discriminative alignment, which is considered along with distribution alignment in the specification generation process. Theoretical and empirical analyses clearly demonstrate that the proposed approach is capable of facilitating model reuse in the learnware paradigm with high-quality specification generation.


Poster
#E-1405
Scalable Model Merging with Progressive Layer-wise Distillation

Jing Xu · Jiazheng Li · Jingzhao Zhang

Model merging offers an effective way to integrate the capabilities of multiple fine-tuned models. However, the performance degradation of the merged model remains a challenge, particularly when little or no data is available. This paper first highlights the necessity of domain-specific data for model merging by proving that data-agnostic algorithms can have arbitrarily bad worst-case performance. Building on this theoretical insight, we explore the relationship between model merging and distillation, introducing a novel few-shot merging algorithm, ProDistill (Progressive Layer-wise Distillation). Contrary to the common belief that layer-wise training hurts performance, we show that layer-wise teacher-student distillation not only enhances scalability but also improves model merging performance. We conduct extensive experiments to show that, compared to existing few-shot merging methods, ProDistill achieves state-of-the-art performance, with up to 6.14\% and 6.61\% improvements on vision and NLU tasks. Furthermore, we extend the experiments to models with over 10B parameters, showcasing the exceptional scalability of ProDistill.
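
A minimal sketch of what progressive layer-wise teacher-student distillation for merging could look like; the module structure, optimizer, and loss below are our assumptions, not the authors' implementation:

```python
import torch

def merge_layerwise(merged_layers, teachers, x, steps=100, lr=1e-3):
    """merged_layers: one trainable module per depth; teachers: a list
    of expert models, each indexable by depth; x: few-shot inputs."""
    h_merged, h_teach = x, [x] * len(teachers)
    for depth, layer in enumerate(merged_layers):
        with torch.no_grad():  # expert activations at this depth
            h_teach = [t[depth](h) for t, h in zip(teachers, h_teach)]
        opt = torch.optim.Adam(layer.parameters(), lr=lr)
        for _ in range(steps):  # match every expert's layer output
            loss = sum(torch.nn.functional.mse_loss(layer(h_merged), ht)
                       for ht in h_teach)
            opt.zero_grad(); loss.backward(); opt.step()
        with torch.no_grad():  # freeze, then move one layer deeper
            h_merged = layer(h_merged)
    return merged_layers
```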


Poster
#E-1406
Zero-shot Meta-learning for Tabular Prediction Tasks with Adversarially Pre-trained Transformer

Yulun Wu · Doron Bergman

We present an Adversarially Pre-trained Transformer (APT) that is able to perform zero-shot meta-learning on tabular prediction tasks without using any real-world dataset to pre-train the model, extending the recent development of Prior-Data Fitted Networks (PFNs) and TabPFN. Specifically, APT is pre-trained with adversarial synthetic data agents, which continually shift their underlying data-generating distribution and deliberately challenge the model with different synthetic datasets. In addition, we propose a mixture block model architecture that is able to handle classification tasks with an arbitrary number of classes, addressing the class size limitation -- a crucial weakness of prior tabular zero-shot learning algorithms. In experiments, we show that our framework matches state-of-the-art performance on small tabular classification tasks without filtering on dataset characteristics such as the number of classes and the number of missing values, while maintaining an average runtime under one second. On common benchmark dataset suites in both classification and regression, we show that adversarial pre-training is able to enhance TabPFN's performance. In our analysis, we demonstrate that the adversarial synthetic data agents generate a more diverse collection of data compared to the ordinary random generator in TabPFN. In addition, we demonstrate that our mixture block neural design has improved generalizability and greatly accelerates pre-training.


Poster
#E-1407
When Data-Free Knowledge Distillation Meets Non-Transferable Teacher: Escaping Out-of-Distribution Trap is All You Need

Ziming Hong · Runnan Chen · Zengmao Wang · Bo Han · Bo Du · Tongliang Liu

Data-free knowledge distillation (DFKD) transfers knowledge from a teacher to a student without access to the real in-distribution (ID) data. Its common solution is to use a generator to synthesize fake data as a substitute for real ID data. However, existing works typically assume teachers are trustworthy, leaving the robustness and security of DFKD from untrusted teachers largely unexplored. In this work, we conduct the first investigation into distilling non-transferable learning (NTL) teachers using DFKD, where transferability from an ID domain to an out-of-distribution (OOD) domain is prohibited. We find that NTL teachers fool DFKD by diverting the generator's attention from the useful ID knowledge to the misleading OOD knowledge. This hinders ID knowledge transfer while prioritizing OOD knowledge transfer. To mitigate this issue, we propose Adversarial Trap Escaping (ATEsc) to benefit DFKD by identifying and filtering out OOD-like synthetic samples. Specifically, inspired by the evidence that NTL teachers show stronger adversarial robustness on OOD samples than on ID samples, we split synthetic samples into two groups according to their robustness. The fragile group is treated as ID-like data and used for normal knowledge distillation, while the robust group is treated as OOD-like data and utilized for forgetting OOD knowledge. Extensive experiments demonstrate the effectiveness of ATEsc in improving DFKD against NTL teachers.
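
An illustrative simplification of such a robustness-based split (the perturbation type and budget are our choices for the sketch): samples whose teacher prediction survives a small adversarial step are treated as OOD-like, fragile ones as ID-like:

```python
import torch

def split_by_robustness(teacher, x, eps=2 / 255):
    """Return (id_like, ood_like) synthetic batches via one FGSM step."""
    x = x.clone().requires_grad_(True)
    logits = teacher(x)
    pred = logits.argmax(dim=1)
    loss = torch.nn.functional.cross_entropy(logits, pred)
    (grad,) = torch.autograd.grad(loss, x)
    x_adv = (x + eps * grad.sign()).detach()
    robust = teacher(x_adv).argmax(dim=1) == pred  # prediction unchanged
    return x[~robust].detach(), x[robust].detach()
```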


Poster
#E-1408
Representation Surgery in Model Merging with Probabilistic Modeling

Qi Wei · Shuo He · Enneng Yang · Tingcong Liu · Haobo Wang · Lei Feng · Bo An

Model merging aims to achieve multitask performance by merging multiple expert models without access to the raw training data. Recent research identified the \textit{representation bias} of model merging, characterized by a discrepancy in the representation distribution between the merged and individual models, which hinders the performance of model merging methods. To mitigate the representation bias, a task-specific MLP, Surgery, was built to model the bias, which is then removed from the merged representation. However, this strategy is still suboptimal due to the limited modeling capability of its deterministic nature. To address this issue, we present ProbSurgery, a probabilistic module specifically designed to accurately model the representation bias. This module generates an embedding distribution for each sample and outputs the representation bias through a sampling process. ProbSurgery offers superior representational capacity by naturally handling the uncertainty resulting from parameter interference when merging multiple models. Besides, we provide a theoretical analysis to reveal the advantage of the probabilistic approach and propose an extension of ProbSurgery for adapting to the task-sharing setting. Extensive experiments verify the effectiveness of ProbSurgery for representation surgery while maintaining generalization capabilities in real-world scenarios, including out-of-distribution and domain-shift challenges.


Poster
#E-1409
CABS: Conflict-Aware and Balanced Sparsification for Enhancing Model Merging

Zongzhen Yang · Binhang Qi · Hailong Sun · Wenrui Long · Ruobing Zhao · Xiang Gao

Model merging based on task vectors, i.e., the parameter differences between fine-tuned models and a shared base model, provides an efficient way to integrate multiple task-specific models into a multitask model without retraining. Recent works have endeavored to address the conflicts between task vectors, one of the significant challenges faced by model merging, through sparsification; however, two issues significantly limit their performance: *high parameter overlap* and *unbalanced weight distribution*. To address these issues, we propose a simple yet effective framework called **CABS** (Conflict-Aware and Balanced Sparsification), consisting of **C**onflict-**A**ware Sparsification (CA) and **B**alanced **S**parsification (BS). CA reduces parameter overlap by applying masks during sequential pruning, ensuring that each task vector retains distinct, non-overlapping parameters. BS leverages $n$:$m$ pruning to preserve critical weights while maintaining an even distribution across layers. Our comprehensive experiments demonstrate that CABS outperforms state-of-the-art methods across diverse tasks and model sizes.
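
For the BS component, a compact sketch of generic $n$:$m$ magnitude pruning applied to a task vector (the tensor shape and divisibility assumption are ours):

```python
import torch

def n_m_sparsify(task_vector: torch.Tensor, n: int = 2, m: int = 4):
    """Keep the n largest-magnitude weights in every block of m.

    Assumes task_vector.numel() is divisible by m; because every block
    retains exactly n entries, sparsity stays balanced across the layer.
    """
    blocks = task_vector.reshape(-1, m)
    keep = blocks.abs().topk(n, dim=1).indices
    mask = torch.zeros_like(blocks).scatter_(1, keep, 1.0)
    return (blocks * mask).reshape(task_vector.shape)
```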


Poster
#E-1500
Learning Compact Semantic Information for Incomplete Multi-View Missing Multi-Label Classification

Jie Wen · Yadong Liu · Zhanyan Tang · Yuting He · Yulong Chen · Mu Li · Chengliang Liu

Multi-view data involves various data forms, such as multi-feature, multi-sequence and multimodal data, providing rich semantic information for downstream tasks. The inherent challenge of incomplete multi-view missing multi-label learning lies in how to effectively utilize limited supervision and insufficient data to learn discriminative representation. Starting from the sufficiency of multi-view shared information for downstream tasks, we argue that the existing contrastive learning paradigms on missing multi-view data show limited consistency representation learning ability, leading to the bottleneck in extracting multi-view shared information. In response, we propose to minimize task-independent redundant information by pursuing the maximization of cross-view mutual information. Additionally, to alleviate the hindrance caused by missing labels, we develop a dual-branch soft pseudo-label cross-imputation strategy to improve classification performance. Extensive experiments on multiple benchmarks validate our advantages and demonstrate strong compatibility with both missing and complete data.


Poster
#E-1501
Towards Escaping from Class Dependency Modeling for Multi-Dimensional Classification

Teng Huang · Bin-Bin Jia · Min-Ling Zhang

In multi-dimensional classification (MDC), the semantics of objects are characterized by multiple class variables from different dimensions. Existing MDC approaches focus on designing effective class dependency modeling strategies to enhance classification performance. However, the intercoupling of multiple class variables poses a significant challenge to the precise modeling of class dependencies. In this paper, we make the first attempt towards escaping from class dependency modeling for addressing MDC problems. Accordingly, a novel MDC approach named DCOM is proposed by decoupling the interactions of different dimensions in MDC. Specifically, DCOM endeavors to identify a latent factor that encapsulates the most salient and critical feature information. This factor will facilitate partial conditional independence among class variables conditioned on both the original feature vector and the learned latent embedding. Once the conditional independence is established, classification models can be readily induced by employing simple neural networks on each dimension. Extensive experiments conducted on benchmark data sets demonstrate that DCOM outperforms other state-of-the-art MDC approaches.


Poster
#E-1502
Preserving AUC Fairness in Learning with Noisy Protected Groups

Mingyang Wu · Li Lin · Wenbin Zhang · Xin Wang · Zhenhuan Yang · Shu Hu

The Area Under the ROC Curve (AUC) is a key metric for classification, especially under class imbalance, with growing research focus on optimizing AUC over accuracy in applications like medical image analysis and deepfake detection. This makes fairness in AUC optimization crucial, as biases can impact protected groups. While various fairness mitigation techniques exist, fairness considerations in AUC optimization remain in their early stages, with most research focusing on improving AUC fairness under the assumption of clean protected groups. However, these studies often overlook the impact of noisy protected groups, leading to fairness violations in practice. To address this, we propose the first robust AUC fairness approach under noisy protected groups with theoretical fairness guarantees using distributionally robust optimization. Extensive experiments on tabular and image datasets show that our method outperforms state-of-the-art approaches in preserving AUC fairness. The code is available at https://github.com/Purdue-M2/AUCFairnesswithNoisyGroups.


Poster
#E-1503
KAN-AD: Time Series Anomaly Detection with Kolmogorov–Arnold Networks

Quan Zhou · Changhua Pei · Fei Sun · HanJing · Zhengwei Gao · haiming zhang · Gaogang Xie · Dan Pei · Jianhui LI

Time series anomaly detection (TSAD) underpins real-time monitoring in cloud services and web systems, allowing rapid identification of anomalies to prevent costly failures. Most TSAD methods driven by forecasting models tend to overfit by emphasizing minor fluctuations. Our analysis reveals that effective TSAD should focus on modeling "normal" behavior through smooth local patterns. To achieve this, we reformulate time series modeling as approximating the series with smooth univariate functions. The local smoothness of each univariate function ensures that the fitted time series remains resilient against local disturbances. However, a direct KAN implementation proves susceptible to these disturbances due to the inherently localized characteristics of B-spline functions. We thus propose KAN-AD, replacing B-splines with truncated Fourier expansions and introducing a novel lightweight learning mechanism that emphasizes global patterns while staying robust to local disturbances. On four popular TSAD benchmarks, KAN-AD achieves an average 15% improvement in detection accuracy (with peaks exceeding 27%) over state-of-the-art baselines. Remarkably, it requires fewer than 1,000 trainable parameters, resulting in a 50% faster inference speed compared to the original KAN, demonstrating the approach's efficiency and practical viability.
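
A hedged sketch of the core substitution: one univariate function parameterized by a truncated Fourier expansion rather than B-splines, so it is globally smooth by construction (the harmonic count K is an assumed hyperparameter; a full KAN layer would stack many such functions):

```python
import math
import torch

class FourierUnivariate(torch.nn.Module):
    """phi(x) = sum_k a_k cos(2*pi*k*x) + b_k sin(2*pi*k*x)."""

    def __init__(self, K: int = 5):
        super().__init__()
        self.a = torch.nn.Parameter(torch.zeros(K))  # cosine coefficients
        self.b = torch.nn.Parameter(torch.zeros(K))  # sine coefficients
        self.register_buffer("freqs", torch.arange(1, K + 1).float())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (...,) scalar inputs; output keeps the same shape
        phase = x.unsqueeze(-1) * self.freqs * 2 * math.pi
        return (phase.cos() * self.a + phase.sin() * self.b).sum(dim=-1)
```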


Poster
#E-1504
TreeLoRA: Efficient Continual Learning via Layer-Wise LoRAs Guided by a Hierarchical Gradient-Similarity Tree

Yu-Yang Qian · Yuan-Ze Xu · Zhen-Yu Zhang · Peng Zhao · Zhi-Hua Zhou

Many real-world applications collect data in a streaming environment, where learning tasks are encountered sequentially. This necessitates continual learning (CL) to update models online, enabling adaptation to new tasks while preserving past knowledge to prevent catastrophic forgetting. Nowadays, with the flourishing of large pre-trained models (LPMs), efficiency has become increasingly critical for CL due to their substantial computational demands and growing parameter sizes. In this paper, we introduce TreeLoRA (K-D Tree of Low-Rank Adapters), a novel approach that constructs layer-wise adapters by leveraging hierarchical gradient similarity to enable efficient CL, particularly for LPMs. To reduce the computational burden of task similarity estimation, we employ bandit techniques to develop an algorithm based on lower confidence bounds to efficiently explore the task structure. Furthermore, we use sparse gradient updates to facilitate parameter optimization, making the approach better suited for LPMs. Theoretical analysis is provided to justify the rationale behind our approach, and experiments on both vision transformers (ViTs) and large language models (LLMs) demonstrate the effectiveness and efficiency of our approach across various domains, including vision and natural language processing tasks.


Poster
#E-1505
Slimming the Fat-Tail: Morphing-Flow for Adaptive Time Series Modeling

Tianyu Liu · kai sun · Fuchun Sun · Yu Luo · Yuanlong Zhang

Temporal sequences, even after stationarization, often exhibit leptokurtic distributions with fat tails and persistent distribution shifts. These properties destabilize feature dynamics, amplify model variance, and hinder model convergence in time series forecasting. To address this, we propose Morphing-Flow (MoF), a framework that combines a spline-based transform layer (Flow) and a test-time-trained method (Morph), which adaptively normalizes non-stationary, fat-tailed distributions while preserving critical extreme features. MoF ensures that inputs remain within a network's effective activation space—a structured, normal-like distribution—even under distributional drift. Experiments across eight datasets show that MoF achieves state-of-the-art performance: with a simple linear backbone architecture, it matches the performance of state-of-the-art models on datasets such as Electricity and ETTh2. When paired with a patch-based Mamba architecture, MoF outperforms its closest competitor by 6.3% on average and reduces forecasting errors in fat-tailed datasets such as Exchange by 21.7%. Moreover, MoF acts as a plug-and-play module, boosting performance in existing models without architectural changes.


Poster
#E-1506
TimeDART: A Diffusion Autoregressive Transformer for Self-Supervised Time Series Representation

Daoyu Wang · Mingyue Cheng · Zhiding Liu · Qi Liu

Self-supervised learning has garnered increasing attention in time series analysis for benefiting various downstream tasks and reducing reliance on labeled data. Despite its effectiveness, existing methods often struggle to comprehensively capture both long-term dynamic evolution and subtle local patterns in a unified manner. In this work, we propose \textbf{TimeDART}, a novel self-supervised time series pre-training framework that unifies two powerful generative paradigms to learn more transferable representations. Specifically, we first employ a causal Transformer encoder, accompanied by a patch-based embedding strategy, to model the evolving trends from left to right. Building on this global modeling, we further introduce a denoising diffusion process to capture fine-grained local patterns through forward diffusion and reverse denoising. Finally, we optimize the model in an autoregressive manner. As a result, TimeDART effectively accounts for both global and local sequence features in a coherent way. We conduct extensive experiments on public datasets for time series forecasting and classification. The experimental results demonstrate that TimeDART consistently outperforms the compared methods, validating the effectiveness of our approach. Our code is available at \url{https://github.com/Melmaphother/TimeDART}.


Poster
#E-1507
DTZO: Distributed Trilevel Zeroth Order Learning with Provable Non-Asymptotic Convergence

Yang Jiao · Kai Yang · Chengtao Jian

Trilevel learning (TLL) with zeroth order constraints is a fundamental problem in machine learning, arising in scenarios where gradient information is inaccessible due to data privacy or model opacity, such as in federated learning, healthcare, and financial systems. These problems are notoriously difficult to solve due to their inherent complexity and the lack of first order information. Moreover, in many practical scenarios, data may be distributed across various nodes, necessitating strategies to address trilevel learning problems without centralizing data on servers to uphold data privacy. To this end, an effective distributed trilevel zeroth order learning framework DTZO is proposed in this work to address the trilevel learning problems with level-wise zeroth order constraints in a distributed manner. The proposed DTZO is versatile and can be adapted to a wide range of (grey-box) trilevel learning problems with partial zeroth order constraints. In DTZO, the cascaded polynomial approximation can be constructed without relying on gradients or sub-gradients, leveraging a novel cut, i.e., zeroth order cut. Furthermore, we theoretically carry out the non-asymptotic convergence rate analysis for the proposed DTZO in achieving the $\epsilon$-stationary point. Extensive experiments have been conducted to demonstrate and validate the superior performance of the proposed DTZO.


Poster
#E-1508
Efficient Network Automatic Relevance Determination

Hongwei Zhang · Ziqi Ye · Xinyuan Wang · Xin Guo · Zenglin Xu · Yuan Cheng · Zixin Hu · Yuan Qi

We propose Network Automatic Relevance Determination (NARD), an extension of ARD for linear probabilistic models, to simultaneously model sparse relationships between inputs $X \in \mathbb R^{d \times N}$ and outputs $Y \in \mathbb R^{m \times N}$, while capturing the correlation structure among the $Y$. NARD employs a matrix normal prior that contains a sparsity-inducing parameter to identify and discard irrelevant features, thereby promoting sparsity in the model. Algorithmically, it iteratively updates both the precision matrix and the relationship between $Y$ and the refined inputs. To mitigate the computational inefficiency of the $\mathcal O(m^3 + d^3)$ cost per iteration, we introduce Sequential NARD, which evaluates features sequentially, and a Surrogate Function Method, which leverages an efficient approximation of the marginal likelihood and simplifies the calculation of the determinant and inverse of an intermediate matrix. Combining the Sequential update with the Surrogate Function method further reduces computational costs. The computational complexity per iteration for these three methods is reduced to $\mathcal O(m^3+p^3)$, $\mathcal O(m^3 + d^2)$, and $\mathcal O(m^3+p^2)$, respectively, where $p \ll d$ is the final number of features in the model. Our methods demonstrate significant improvements in computational efficiency with comparable performance on both synthetic and real-world datasets.


Poster
#E-1509
Balancing Model Efficiency and Performance: Adaptive Pruner for Long-tailed Data

Zhe Zhao · HaiBin Wen · Pengkun Wang · ShuangWang · Zhenkun Wang · Qingfu Zhang · Yang Wang

Long-tailed distribution datasets are prevalent in many machine learning tasks, yet existing neural network models still face significant challenges when handling such data. This paper proposes a novel adaptive pruning strategy, LTAP (Long-Tailed Adaptive Pruner), aimed at balancing model efficiency and performance to better address the challenges posed by long-tailed data distributions. LTAP introduces multi-dimensional importance scoring criteria and designs a dynamic weight adjustment mechanism to adaptively determine the pruning priority of parameters for different classes. By focusing on protecting parameters critical for tail classes, LTAP significantly enhances computational efficiency while maintaining model performance. This method combines the strengths of long-tailed learning and neural network pruning, overcoming the limitations of existing approaches in handling imbalanced data. Extensive experiments demonstrate that LTAP outperforms existing methods on various long-tailed datasets, achieving a good balance between model compression rate, computational efficiency, and classification accuracy. This research provides new insights into solving model optimization problems in long-tailed learning and is significant for improving the performance of neural networks on imbalanced datasets. The code is available at https://github.com/DataLab-atom/LT-VOTE.


Poster
#E-1600
OrcaLoca: An LLM Agent Framework for Software Issue Localization

Zhongming Yu · Hejia Zhang · Yujie Zhao · Hanxian Huang · Matrix Yao · Ke Ding · Jishen Zhao

Recent developments in Large Language Model (LLM) agents are revolutionizing Autonomous Software Engineering (ASE), enabling automated coding, problem fixes, and feature improvements. However, localization -- precisely identifying software problems by navigating to relevant code sections -- remains a significant challenge. Current approaches often yield suboptimal results due to a lack of effective integration between LLM agents and precise code search mechanisms. This paper introduces OrcaLoca, an LLM agent framework that improves accuracy for software issue localization by integrating priority-based scheduling for LLM-guided action, action decomposition with relevance scoring, and distance-aware context pruning. Experimental results demonstrate that OrcaLoca becomes the new open-source state-of-the-art (SOTA) in function match rate (65.33%) on SWE-bench Lite. It also improves the final resolved rate of an open-source framework by 6.33 percentage points through its patch generation integration.


Poster
#E-1601
Nested Expectations with Kernel Quadrature

Zonghao Chen · Masha Naslidnyk · Francois-Xavier Briol

This paper considers the challenging computational task of estimating nested expectations. Existing algorithms, such as nested Monte Carlo or multilevel Monte Carlo, are known to be consistent but require a large number of samples at both the inner and outer levels to converge. Instead, we propose a novel estimator consisting of nested kernel quadrature estimators, and we prove that it has a faster convergence rate than all baseline methods when the integrands have sufficient smoothness. We then demonstrate empirically that our proposed method does indeed require the fewest samples to estimate nested expectations across a range of real-world application areas, from Bayesian optimisation to option pricing and health economics.
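
For reference, the nested Monte Carlo baseline in its simplest form -- estimating $E_X[f(E_{Y|X}[g(X,Y)])]$ with $N$ outer and $M$ inner samples; the integrands and samplers below are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.abs                        # outer nonlinearity
g = lambda x, y: np.sin(x + y)    # inner integrand

def nested_mc(N=1_000, M=100):
    total = 0.0
    for _ in range(N):
        x = rng.normal()                          # outer sample
        inner = g(x, rng.normal(size=M)).mean()   # inner expectation
        total += f(inner)
    return total / N
```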


Poster
#E-1602
An Instrumental Value for Data Production and its Application to Data Pricing

Rui Ai · Boxiang Lyu · Zhaoran Wang · Zhuoran Yang · Haifeng Xu

We develop a framework for capturing the instrumental value of data production processes, which accounts for two key factors: (a) the context of the agent's decision-making; (b) how much data or information the buyer already possesses. We "micro-found" our data valuation function by establishing its connection to classic notions of signals and information design in economics. When instantiated in Bayesian linear regression, our value naturally corresponds to information gain. Applying our proposed data value in Bayesian linear regression for monopoly pricing, we show that if the seller can fully customize data production, she can extract the first-best revenue (i.e., full surplus) from any population of buyers, i.e., achieving first-degree price discrimination. If data can only be constructed from an existing data pool, this limits the seller's ability to customize, and achieving first-best revenue becomes generally impossible. However, we design a mechanism that achieves seller revenue at most $\log(\kappa)$ less than the first-best, where $\kappa$ is the condition number associated with the data matrix. As a corollary, the seller extracts the first-best revenue in the multi-armed bandits special case.


Poster
#E-1603
On the Dynamic Regret of Following the Regularized Leader: Optimism with History Pruning

Naram Mhaisen · George Iosifidis

We revisit the Follow the Regularized Leader (FTRL) framework for Online Convex Optimization (OCO) over compact sets, focusing on achieving dynamic regret guarantees. Prior work has highlighted the framework’s limitations in dynamic environments due to its tendency to produce "lazy" iterates. However, building on insights showing FTRL's ability to produce "agile" iterates, we show that it can indeed recover known dynamic regret bounds through optimistic composition of future costs and careful linearization of past costs, which can lead to pruning some of them. This new analysis of FTRL against dynamic comparators yields a principled way to interpolate between greedy and agile updates and offers several benefits, including refined control over regret terms, optimism without cyclic dependence, and the application of minimal recursive regularization akin to AdaFTRL. More broadly, we show that it is not the "lazy" projection style of FTRL that hinders (optimistic) dynamic regret, but the decoupling of the algorithm’s state (linearized history) from its iterates, allowing the state to grow arbitrarily. Instead, pruning synchronizes these two when necessary.
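
As background for the discussion of "lazy" iterates, a minimal FTRL sketch with linearized costs and quadratic regularization over a Euclidean ball (the feasible set and regularizer are our illustrative choices):

```python
import numpy as np

def ftrl_iterates(grads, radius=1.0, eta=0.1):
    """x_{t+1} = argmin_{||x|| <= radius} <G_t, x> + ||x||^2 / (2*eta),
    where G_t is the running sum of linearized (gradient) costs."""
    g_sum = np.zeros_like(grads[0])
    xs = []
    for g in grads:
        g_sum += g
        x = -eta * g_sum                  # unconstrained minimizer
        norm = np.linalg.norm(x)
        xs.append(x if norm <= radius else x * (radius / norm))
    return xs
```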


Poster
#E-1604
AutoAL: Automated Active Learning with Differentiable Query Strategy Search

Yifeng Wang · Xueying Zhan · Siyu Huang

As deep learning continues to evolve, the need for data efficiency becomes increasingly important. Considering labeling large datasets is both time-consuming and expensive, active learning (AL) provides a promising solution to this challenge by iteratively selecting the most informative subsets of examples to train deep neural networks, thereby reducing the labeling cost. However, the effectiveness of different AL algorithms can vary significantly across data scenarios, and determining which AL algorithm best fits a given task remains a challenging problem. This work presents the first differentiable AL strategy search method, named AutoAL, which is designed on top of existing AL sampling strategies. AutoAL consists of two neural nets, named SearchNet and FitNet, which are optimized concurrently under a differentiable bi-level optimization framework. For any given task, SearchNet and FitNet are iteratively co-optimized using the labeled data, learning how well a set of candidate AL algorithms perform on that task. With the optimal AL strategies identified, SearchNet selects a small subset from the unlabeled pool for querying their annotations, enabling efficient training of the task model. Experimental results demonstrate that AutoAL consistently achieves superior accuracy compared to all candidate AL algorithms and other selective AL approaches, showcasing its potential for adapting and integrating multiple existing AL methods across diverse tasks and domains.


Poster
#E-1605
Improved and Oracle-Efficient Online $\ell_1$-Multicalibration

Rohan Ghuge · Vidya Muthukumar · Sahil Singla

We study *online multicalibration*, a framework for ensuring calibrated predictions across multiple groups in adversarial settings, across $T$ rounds. Although online calibration is typically studied in the $\ell_1$ norm, prior approaches to online multicalibration have taken the indirect approach of obtaining rates in other norms (such as $\ell_2$ and $\ell_{\infty}$) and then transferred these guarantees to $\ell_1$ at additional loss. In contrast, we propose a direct method that achieves improved and oracle-efficient rates of $\widetilde{\mathcal{O}}(T^{-1/3})$ and $\widetilde{\mathcal{O}}(T^{-1/4})$ respectively, for online $\ell_1$-multicalibration. Our key insight is a novel reduction of online $\ell_1$-multicalibration to an online learning problem with product-based rewards, which we refer to as *online linear-product optimization* ($\mathtt{OLPO}$). To obtain the improved rate of $\widetilde{\mathcal{O}}(T^{-1/3})$, we introduce a linearization of $\mathtt{OLPO}$ and design a no-regret algorithm for this linearized problem. Although this method guarantees the desired sublinear rate (nearly matching the best rate for online calibration), it is computationally expensive when the group family $\mathcal{H}$ is large or infinite, since it enumerates all possible groups. To address scalability, we propose a second approach to $\mathtt{OLPO}$ that makes only a polynomial number of calls to an offline optimization (*multicalibration evaluation*) oracle, resulting in *oracle-efficient* online $\ell_1$-multicalibration with a corresponding rate of $\widetilde{\mathcal{O}}(T^{-1/4})$. Our framework also extends to certain infinite families of groups (e.g., all linear functions on the context space) by exploiting a $1$-Lipschitz property of the $\ell_1$-multicalibration error with respect to $\mathcal{H}$.


Poster
#E-1606
Online Clustering of Dueling Bandits

Zhiyong Wang · Jiahang Sun · Mingze Kong · Jize Xie · Qinghua Hu · John C. S. Lui · Zhongxiang Dai

The contextual multi-armed bandit (MAB) is a widely used framework for problems requiring sequential decision-making under uncertainty, such as recommendation systems. In applications involving a large number of users, the performance of contextual MAB can be significantly improved by facilitating collaboration among multiple users. This has been achieved by the clustering of bandits (CB) methods, which adaptively group the users into different clusters and achieve collaboration by allowing the users in the same cluster to share data. However, classical CB algorithms typically rely on numerical reward feedback, which may not be practical in certain real-world applications. For instance, in recommendation systems, it is more realistic and reliable to solicit preference feedback between pairs of recommended items rather than absolute rewards. To address this limitation, we introduce the first "clustering of dueling bandit algorithms" to enable collaborative decision-making based on preference feedback. We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB) which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB) which uses a neural network to model complex, non-linear user reward functions. Both algorithms are supported by rigorous theoretical analyses, demonstrating that user collaboration leads to improved regret bounds. Extensive empirical evaluations on synthetic and real-world datasets further validate the effectiveness of our methods, establishing their potential in real-world applications involving multiple users with preference-based feedback.


Poster
#E-1607
SDMG: Smoothing Your Diffusion Models for Powerful Graph Representation Learning

Junyou Zhu · Langzhou He · Chao Gao · Dongpeng Hou · Zhen Su · Philip Yu · Juergen Kurths · Frank Hellmann

Diffusion probabilistic models (DPMs) have recently demonstrated impressive generative capabilities. There is emerging evidence that their sample reconstruction ability can yield meaningful representations for recognition tasks. In this paper, we demonstrate that the objectives underlying generation and representation learning are not perfectly aligned. Through a spectral analysis, we find that minimizing the mean squared error (MSE) between the original graph and its reconstructed counterpart does not necessarily optimize representations for downstream tasks. Instead, focusing on reconstructing a small subset of features, specifically those capturing global information, proves to be more effective for learning powerful representations. Motivated by these insights, we propose a novel framework, the Smooth Diffusion Model for Graphs (SDMG), which introduces a multi-scale smoothing loss and low-frequency information encoders to promote the recovery of global, low-frequency details, while suppressing irrelevant high-frequency noise. Extensive experiments validate the effectiveness of our method, suggesting a promising direction for advancing diffusion models in graph representation learning.


Poster
#E-1608
TraceGrad: a Framework Learning Expressive SO(3)-equivariant Non-linear Representations for Electronic-Structure Hamiltonian Prediction

Shi Yin · Xinyang Pan · fengyan wang · Lixin He

We propose a framework to combine strong non-linear expressiveness with strict SO(3)-equivariance in prediction of the electronic-structure Hamiltonian, by exploring the mathematical relationships between SO(3)-invariant and SO(3)-equivariant quantities and their representations. The proposed framework, called TraceGrad, first constructs theoretical SO(3)-invariant trace quantities derived from the Hamiltonian targets and uses these invariant quantities as supervisory labels to guide the learning of high-quality SO(3)-invariant features. Given that SO(3)-invariance is preserved under non-linear operations, the learning of invariant features can extensively utilize non-linear mappings, thereby fully capturing the non-linear patterns inherent in physical systems. Building on this, we propose a gradient-based mechanism to induce SO(3)-equivariant encodings of various degrees from the learned SO(3)-invariant features. This mechanism can incorporate powerful non-linear expressive capabilities into SO(3)-equivariant features whose physical dimensions correspond to the regression targets, while theoretically preserving equivariant properties, establishing a strong foundation for predicting the electronic-structure Hamiltonian. Experimental results on eight challenging benchmark databases demonstrate that our method achieves state-of-the-art performance in Hamiltonian prediction.


Poster
#E-1609
FEAT-KD: Learning Concise Representations for Single and Multi-Target Regression via TabNet Knowledge Distillation

Kei Sen Fong · Mehul Motani

In this work, we propose a novel approach that combines the strengths of FEAT and TabNet through knowledge distillation (KD), which we term FEAT-KD. FEAT is an intrinsically interpretable machine learning (ML) algorithm that constructs a weighted linear combination of concisely-represented features discovered via genetic programming optimization, which can often be inefficient. FEAT-KD leverages TabNet's deep-learning-based optimization and feature selection mechanisms instead. FEAT-KD finds a weighted linear combination of concisely-represented, symbolic features that are derived from piece-wise distillation of a trained TabNet model. We analyze FEAT-KD on regression tasks from two perspectives: (i) compared to TabNet, FEAT-KD significantly reduces model complexity while retaining competitive predictive performance, effectively converting a black-box deep learning model into a more interpretable white-box representation, (ii) compared to FEAT, our method consistently outperforms in prediction accuracy, produces more compact models, and reduces the complexity of learned symbolic expressions. In addition, we demonstrate that FEAT-KD easily supports multi-target regression, in which the shared features contribute to the interpretability of the system. Our results suggest that FEAT-KD is a promising direction for interpretable ML, bridging the gap between deep learning's predictive power and the intrinsic transparency of symbolic models.


Poster
#E-1700
AKRMap: Adaptive Kernel Regression for Trustworthy Visualization of Cross-Modal Embeddings

Yilin Ye · Junchao Huang · Xingchen ZENG · Jiazhi Xia · Wei Zeng

Cross-modal embeddings form the foundation for multi-modal models. However, visualization methods for interpreting cross-modal embeddings have been primarily confined to traditional dimensionality reduction (DR) techniques like PCA and t-SNE. These DR methods primarily focus on feature distributions within a single modality, whilst failing to incorporate metrics (e.g., CLIPScore) across multiple modalities. This paper introduces AKRMap, a new DR technique designed to visualize cross-modal embedding metrics with enhanced accuracy by learning kernel regression of the metric landscape in the projection space. Specifically, AKRMap constructs a supervised projection network guided by a post-projection kernel regression loss, and employs adaptive generalized kernels that can be jointly optimized with the projection. This approach enables AKRMap to efficiently generate visualizations that capture complex metric distributions, while also supporting interactive features such as zoom and overlay for deeper exploration. Quantitative experiments demonstrate that AKRMap outperforms existing DR methods in generating more accurate and trustworthy visualizations. We further showcase the effectiveness of AKRMap in visualizing and comparing cross-modal embeddings for text-to-image models. Code and demo are available at https://github.com/yilinye/AKRMap.
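
For intuition, the fixed-kernel Nadaraya-Watson regression that such post-projection losses build on, smoothing a metric (e.g., CLIPScore) over 2D projected points; AKRMap instead learns adaptive generalized kernels jointly with the projection, and the bandwidth and shapes here are our assumptions:

```python
import numpy as np

def kernel_regress(grid_xy, proj_xy, metric, h=0.1):
    """grid_xy: (G, 2) query points; proj_xy: (N, 2) projected
    embeddings; metric: (N,) metric values to smooth over the map."""
    d2 = ((grid_xy[:, None, :] - proj_xy[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-d2 / (2 * h * h))            # Gaussian kernel weights
    return (w * metric).sum(axis=1) / (w.sum(axis=1) + 1e-12)
```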


Poster
#E-1701
How Do Large Language Monkeys Get Their Power (Laws)?

Rylan Schaeffer · Joshua Kazdan · John Hughes · Jordan Juravsky · Sara Price · Aengus Lynch · Erik Jones · Robert Kirk · Azalia Mirhoseini · Sanmi Koyejo

Recent research across mathematical problem solving, proof assistant programming, and multimodal jailbreaking documents a striking finding: when (multimodal) language models tackle a suite of tasks with multiple attempts per task -- succeeding if any attempt is correct -- the negative log of the average success rate scales as a power law in the number of attempts. In this work, we identify an apparent puzzle: a simple mathematical calculation predicts that on each problem, the failure rate should fall exponentially with the number of attempts. We confirm this prediction empirically, raising a question: from where does aggregate polynomial scaling emerge? We then answer this question by demonstrating that per-problem exponential scaling can be made consistent with aggregate polynomial scaling if the distribution of single-attempt success probabilities is heavy-tailed, such that a small fraction of tasks with extremely low success probabilities collectively warps the aggregate success trend into a power law -- even as each problem scales exponentially on its own. We further demonstrate that this distributional perspective explains previously observed deviations from power-law scaling and provides a simple method for forecasting the power-law exponent with an order of magnitude lower relative error, or equivalently, ${\sim}2$-$4$ orders of magnitude less inference compute. Overall, our work contributes to a better understanding of how neural language model performance improves with scaling inference compute and to the development of scaling-predictable evaluations of (multimodal) language models.
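
The argument can be reproduced in a few lines; the Beta distribution below is our illustrative choice of a heavy-tailed success-probability distribution, not the paper's fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.beta(0.3, 3.0, size=100_000)  # heavy mass near p = 0 across tasks

for k in [1, 10, 100, 1_000]:
    pass_at_k = 1 - (1 - p) ** k      # per-task success is exponential in k
    print(k, -np.log(pass_at_k.mean()))
# The printed values shrink roughly as k**(-0.3): a power law in k,
# even though every individual task's failure rate decays exponentially.
```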


Poster
#E-1702
AAAR-1.0: Assessing AI’s Potential to Assist Research

Renze Lou · Hanzi Xu · Sijia Wang · Jiangshu Du · Ryo Kamoi · Xiaoxin Lu · Jian Xie · Yuxuan Sun · Yusen Zhang · Jihyun Ahn · Hongchao Fang · Zhuoyang Zou · Wenchao Ma · Xi Li · Kai Zhang · Congying Xia · Lifu Huang · Wenpeng Yin

Numerous studies have assessed the proficiency of AI systems, particularly large language models (LLMs), in facilitating everyday tasks such as email writing, question answering, and creative content generation. However, researchers face unique challenges and opportunities in leveraging LLMs for their own work, such as brainstorming research ideas, designing experiments, and writing or reviewing papers. In this study, we introduce AAAR-1.0, a benchmark dataset designed to evaluate LLM performance in three fundamental, expertise-intensive research tasks: (i) EquationInference, assessing the correctness of equations based on the contextual information in paper submissions; (ii) ExperimentDesign, designing experiments to validate research ideas and solutions; and (iii) PaperWeakness, identifying weaknesses in paper submissions. AAAR-1.0 differs from prior benchmarks in two key ways: first, it is explicitly research-oriented, with tasks requiring deep domain expertise; second, it is researcher-oriented, mirroring the primary activities that researchers engage in on a daily basis. An evaluation of both open-source and proprietary LLMs reveals their potential as well as their limitations in conducting sophisticated research tasks. We will release AAAR-1.0 and keep iterating it into new versions.


Poster
#E-1703
Do Vision-Language Models Really Understand Visual Language?

Yifan Hou · Buse Giledereli · Yilei Tu · Mrinmaya Sachan

Visual language is a system of communication that conveys information through symbols, shapes, and spatial arrangements. Diagrams are a typical example of a visual language depicting complex concepts and their relationships in the form of an image. The symbolic nature of diagrams presents significant challenges for building models capable of understanding them. Yet, recent studies seem to suggest that Large Vision-Language Models (LVLMs) can even tackle complex reasoning tasks involving diagrams. In this paper, we investigate this phenomenon by developing a comprehensive test suite to evaluate the diagram comprehension capability of LVLMs. Our test suite uses a variety of questions focused on concept entities and their relationships over a set of synthetic as well as real diagrams across several domains to evaluate the recognition and reasoning abilities of models. Our evaluation of six LVLMs shows that while these models can accurately identify and reason about entities, their ability to understand relationships is notably limited. Further testing reveals that the decent performance on diagram understanding largely stems from leveraging their background knowledge as shortcuts to identify and reason about the relational information. Thus, we conclude that LVLMs have a limited capability for genuine diagram understanding, and their impressive performance in diagram reasoning is an illusion emanating from other confounding factors, such as the background knowledge in the models.


Poster
#E-1704
Reflection-Bench: Evaluating Epistemic Agency in Large Language Models

Lingyu Li · Yixu Wang · Haiquan Zhao · Shuqi Kong · Yan Teng · Chunbo Li · Yingchun Wang

With large language models (LLMs) increasingly deployed as cognitive engines for AI agents, their reliability and effectiveness critically hinge on intrinsic epistemic agency, which remains understudied. Epistemic agency, the ability to flexibly construct, adapt, and monitor beliefs about dynamic environments, represents a base-model-level capacity independent of specific tools, modules, or applications. We characterize the holistic process underlying epistemic agency, which unfolds in seven interrelated dimensions: prediction, decision-making, perception, memory, counterfactual thinking, belief updating, and meta-reflection. Correspondingly, we propose Reflection-Bench, a cognitive-psychology-inspired benchmark consisting of seven tasks with long-term relevance and minimal data leakage. Through a comprehensive evaluation of 16 models using three prompting strategies, we identify a clear three-tier performance hierarchy and significant limitations of current LLMs, particularly in meta-reflection capabilities. While state-of-the-art LLMs demonstrate rudimentary signs of epistemic agency, our findings suggest several promising research directions, including enhancing core cognitive functions, improving cross-functional coordination, and developing adaptive processing mechanisms. Our code and data are available at https://github.com/AI45Lab/ReflectionBench.


Poster
#E-1705
Minerva: A Programmable Memory Test Benchmark for Language Models

Menglin Xia · Victor Ruehle · Saravanakumar Rajmohan · Reza Shokri

How effectively can LLM-based AI assistants utilize their memory (context) to perform various tasks? Traditional data benchmarks, which are often manually crafted, suffer from several limitations: they are static, susceptible to overfitting, difficult to interpret, and lack actionable insights--failing to pinpoint the specific capabilities a model lacks when it does not pass a test. In this paper, we present a framework for automatically generating a comprehensive set of tests to evaluate models' abilities to use their memory effectively. Our framework extends the range of capability tests beyond the commonly explored (passkey, key-value, needle in the haystack) search, a dominant focus in the literature. Specifically, we evaluate models on atomic tasks such as searching, recalling, editing, matching, comparing information in context memory, performing basic operations when inputs are structured into distinct blocks, and maintaining state while operating on memory, simulating real-world data. Additionally, we design composite tests to investigate the models' ability to perform more complex, integrated tasks. Our benchmark enables an interpretable, detailed assessment of memory capabilities of LLMs.


Poster
#E-1706
High-Dimensional Tensor Regression With Oracle Properties

Wenbin Wang · Yu Shi · Ziping Zhao

The emergence of multi-dimensional data presents significant challenges for traditional regression models based on matrices or vectors, particularly in capturing multi-directional correlations. In response, tensor regression has been proposed as a powerful framework for modeling linear relationships among multi-dimensional variables. In this paper, we introduce a high-dimensional tensor-response tensor regression model under low-dimensional structural assumptions, such as sparsity and low-rankness. Assuming the underlying tensor lies within an unknown low-dimensional subspace, we consider a least squares estimation framework with non-convex penalties. Theoretically, we derive general risk bounds for the resulting estimators and demonstrate that they achieve the oracle statistical rates under mild technical conditions. To compute the proposed estimators efficiently, we introduce an accelerated proximal gradient algorithm demonstrating rapid convergence in practice. Extensive experiments on synthetic and real-world datasets validate the effectiveness of the proposed regression model and showcase the practical utility of the theoretical findings.
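
A generic sketch of the accelerated proximal gradient pattern (FISTA-style) with a soft-thresholding prox for an $\ell_1$-type penalty; the paper's non-convex penalties and tensor structure would swap in a different proximal operator:

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def accelerated_prox_grad(grad_f, x0, step, lam, iters=200):
    """Minimize f(x) + lam*||x||_1 via FISTA-style momentum."""
    x, y, t = x0.copy(), x0.copy(), 1.0
    for _ in range(iters):
        x_next = soft_threshold(y - step * grad_f(y), step * lam)
        t_next = (1 + np.sqrt(1 + 4 * t * t)) / 2
        y = x_next + ((t - 1) / t_next) * (x_next - x)  # momentum step
        x, t = x_next, t_next
    return x
```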


Spotlight Poster
#E-1707
SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?

Samuel Miserendino · Michele Wang · Tejal Patwardhan · Johannes Heidecke

We introduce SWE-Lancer, a benchmark of over 1,400 freelance software engineering tasks from Upwork, valued at \$1 million USD total in real-world payouts. SWE-Lancer encompasses both independent engineering tasks — ranging from \$50 bug fixes to \$32,000 feature implementations — and managerial tasks, where models choose between technical implementation proposals. Independent tasks are graded with end-to-end tests triple-verified by experienced software engineers, while managerial decisions are assessed against the choices of the original hired engineering managers. We evaluate model performance and find that frontier models are still unable to solve the majority of tasks. To facilitate future research, we open-source a unified Docker image and a public evaluation split. By mapping model performance to monetary value, we hope SWE-Lancer enables greater research into the economic impact of AI model development.


Poster
#E-1708
Federated Node-Level Clustering Network with Cross-Subgraph Link Mending

Jingxin Liu · Renda Han · Wenxuan Tu · Haotian Wang · Junlong Wu · Jieren Cheng

Subgraphs of a complete graph are usually distributed across multiple devices and can only be accessed locally because the raw data cannot be directly shared. However, existing node-level federated graph learning suffers from at least one of the following issues: 1) heavy reliance on labeled graph samples that are difficult to obtain in real-world applications, and 2) partitioning a complete graph into several subgraphs inevitably causes missing links, leading to sub-optimal sample representations. To solve these issues, we propose a novel $\underline{\text{Fed}}$erated $\underline{\text{N}}$ode-level $\underline{\text{C}}$lustering $\underline{\text{N}}$etwork (FedNCN), which mends the destroyed cross-subgraph links using clustering prior knowledge. Specifically, within each client, we first design an MLP-based projector to implicitly preserve key clustering properties of a subgraph in a denoising-learning-like manner, and then upload the resulting clustering signals, which are hard to reconstruct, for subsequent cross-subgraph link restoration. On the server, we maximize the potential affinity between subgraphs stemming from clustering signals by graph similarity estimation and minimize redundant links via the N-Cut criterion. Moreover, we employ a GNN-based generator to learn consensus prototypes from this mended graph, enabling the MLP-GNN jointly optimized learner to enhance data privacy during data transmission and further improve the local model for better clustering. Extensive experiments demonstrate the superiority of FedNCN.


Poster
#E-1709
A Peer-review Look on Multi-modal Clustering: An Information Bottleneck Realization Method

Zhengzheng Lou · Hang Xue · Chaoyang Zhang · Shizhe Hu

Despite their superior capability in complementary information exploration and consistent clustering structure learning, most current weight-based multi-modal clustering methods still suffer from three limitations: 1) lack of trustworthiness in the learned weights; 2) isolated view weight learning; 3) extra weight parameters. Motivated by the peer-review mechanism in academia, we take a new peer-review perspective on the multi-modal clustering problem and propose to iteratively treat one modality as the "author" and the remaining modalities as "reviewers" so as to reach a peer-review score for each modality. This essentially explores the underlying relationships among modalities. To improve trustworthiness, we further design a new trustworthy score with a self-supervised working mechanism. Building on these, we propose a novel Peer-review Trustworthy Information Bottleneck (PTIB) method for weighted multi-modal clustering, where both scores are simultaneously taken into account for accurate and parameter-free modality weight learning. Extensive experiments on eight multi-modal datasets suggest that PTIB outperforms state-of-the-art multi-modal clustering methods.


Poster
#E-1800
Generalizing Causal Effects from Randomized Controlled Trials to Target Populations across Diverse Environments

Baohong Li · Yingrong Wang · Anpeng Wu · ma ming · Ruoxuan Xiong · Kun Kuang

Generalizing causal effects from Randomized Controlled Trials (RCTs) to target populations across diverse environments is of significant practical importance, as RCTs are often costly and logistically complex to conduct. A key challenge is environmental shift, defined as changes in the distribution and availability of covariates between source and target environments. A common approach to addressing this challenge is to identify a separating set -- covariates that govern both treatment effect heterogeneity and environmental differences -- and combine RCT samples with target populations matched on this set. However, this approach assumes that the separating set is fully observed and shared across datasets, an assumption often violated in practice. We propose a novel Two-Stage Doubly Robust (2SDR) method that relaxes this assumption by allowing the separating set to be observed in only one of the two datasets. 2SDR leverages shadow variables to impute missing components of the separating set and generalize treatment effects across environments in a two-stage procedure. We show the identification of causal effects in target environments under 2SDR and demonstrate its effectiveness through extensive experiments on both synthetic and real-world datasets.


Poster
#E-1801
Unraveling the Interplay between Carryover Effects and Reward Autocorrelations in Switchback Experiments

Qianglin Wen · Chengchun Shi · Ying Yang · Niansheng Tang · Hongtu Zhu

A/B testing has become the gold standard for policy evaluation in modern technology industries. Motivated by the widespread use of switchback experiments in A/B testing, this paper conducts a comprehensive comparative analysis of various switchback designs in Markovian environments. Unlike many existing works, which derive the optimal design based on specific and relatively simple estimators, our analysis covers a range of state-of-the-art estimators developed in the reinforcement learning literature. It reveals that the effectiveness of different switchback designs depends crucially on (i) the size of the carryover effect and (ii) the autocorrelations among reward errors over time. Meanwhile, these findings are estimator-agnostic, i.e., they apply to all the aforementioned estimators. Based on these insights, we provide a workflow offering guidelines for practitioners on designing switchback experiments in A/B testing.


Spotlight Poster
#E-1802
Counterfactual Graphical Models: Constraints and Inference

Juan Correa · Elias Bareinboim

Graphical models have been widely used as parsimonious encoders of constraints of the underlying probability models. When organized in a structured way, these models can facilitate the derivation of non-trivial constraints, the inference of quantities of interest, and the optimization of their estimands. In particular, causal diagrams allow for the efficient representation of structural constraints of the underlying causal system. In this paper, we introduce an efficient graphical construction called Ancestral Multi-world Networks that is sound and complete for reading counterfactual independences from a causal diagram using d-separation. Moreover, we introduce the counterfactual (ctf-) calculus, which can be used to transform counterfactual quantities using three rules licensed by the constraints encoded in the diagram. This result generalizes Pearl’s celebrated do-calculus from interventional to counterfactual reasoning.


Poster
#E-1803
Latent Variable Causal Discovery under Selection Bias

Haoyue Dai · Yiwen Qiu · Ignavier Ng · Xinshuai Dong · Peter Spirtes · Kun Zhang

Addressing selection bias in latent variable causal discovery is important yet underexplored, largely due to a lack of suitable statistical tools: while various tools beyond basic conditional independencies have been developed to handle latent variables, none have been adapted for selection bias. We take a step toward filling this gap by studying rank constraints, which, as a generalization of conditional independence constraints, exploit the ranks of covariance submatrices in linear Gaussian models. We show that although selection can significantly complicate the joint distribution, interestingly, the ranks in the biased covariance matrices still preserve meaningful information about both causal structures and selection mechanisms. We provide a graph-theoretic characterization of such rank constraints. Using this tool, we demonstrate that the one-factor model, a classical latent variable model, can be identified under selection bias. Simulations and real-world experiments confirm the effectiveness of using our rank constraints.
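
To make the rank-constraint idea concrete, here is a toy NumPy check of the numerical rank of a cross-covariance submatrix in a one-factor linear Gaussian model; the variable names, loadings, and tolerance are illustrative assumptions, not the paper's procedure.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    L = rng.normal(size=n)  # single latent factor
    # Four observed variables, each a loading on L plus independent noise.
    X = np.stack([c * L + 0.5 * rng.normal(size=n) for c in (1.0, 0.8, 1.2, 0.9)])

    # Cross-covariance between {X1, X2} and {X3, X4}: rank 1 under one factor,
    # since every entry equals c_i * c_j * Var(L).
    Sigma = np.cov(X)
    sub = Sigma[np.ix_([0, 1], [2, 3])]
    print(np.linalg.matrix_rank(sub, tol=1e-2))  # expected: 1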


Poster
#E-1804
Bivariate Causal Discovery with Proxy Variables: Integral Solving and Beyond

Yong Wu · Yanwei Fu · Shouyan Wang · Xinwei Sun

Bivariate causal discovery is challenging when unmeasured confounders exist. To adjust for the bias, previous methods employed a proxy variable (i.e., a negative control outcome, NCO) to test the treatment-outcome relationship through integral equations, assuming that violation of these equations indicates a causal relationship. Building on this, they established asymptotic properties for causal hypothesis testing. However, these methods either relied on parametric assumptions or required discretizing continuous variables, which may lead to information loss. Moreover, it is unclear when the underlying integral-related assumption holds, making it difficult to justify these methods in practice. To address these problems, we first consider the scenario where only the NCO is available. We propose a novel non-parametric procedure that enjoys asymptotic guarantees while preserving more information. Moreover, we find that when the NCO affects the outcome, the integral-related assumption may not hold, rendering the causal relation unidentifiable. Informed by this, we further consider the scenario where a negative control exposure (NCE) is also available. In this scenario, we construct another integral restriction aided by this proxy, which can discover causation even when the NCO affects the outcome. We demonstrate these findings and the effectiveness of our proposals through comprehensive numerical studies.


Poster
#E-1805
Federated Causal Structure Learning with Non-identical Variable Sets

Yunxia Wang · Fuyuan CAO · Kui Yu · Jiye Liang

Federated causal structure learning aims to infer causal relationships from data stored on individual clients, where privacy concerns preclude directly sharing the raw data. Most existing methods assume identical variable sets across clients and present federated strategies for aggregating local updates. However, in practice, clients often observe overlapping but non-identical variable sets, and non-overlapping variables may introduce spurious dependencies. Moreover, existing strategies typically reflect only the overall quality of local graphs, ignoring the varying importance of relationships within each graph. In this paper, we study federated causal structure learning with non-identical variable sets, aiming to design an effective strategy for aggregating “correct” and “good” (non-)causal relationships across distributed datasets. Specifically, we first develop theories for detecting spurious dependencies, examining whether the learned relationships are “correct” or not. Furthermore, we define stable relationships as those that are both “correct” and “good” across multiple graphs, and finally design a two-level priority selection strategy for aggregating local updates, obtaining a global causal graph over the integrated variables. Experimental results on synthetic, benchmark and real-world data demonstrate the effectiveness of our method.


Poster
#E-1806
LLMScan: Causal Scan for LLM Misbehavior Detection

Mengdi Zhang · Goh Kiat · Peixin Zhang · Jun Sun · Lin Rose · Hongyu Zhang

Despite the success of Large Language Models (LLMs) across various fields, their potential to generate untruthful and harmful responses poses significant risks, particularly in critical applications. This highlights the urgent need for systematic methods to detect and prevent such misbehavior. While existing approaches target specific issues such as harmful responses, this work introduces LLMScan, an innovative LLM monitoring technique based on causality analysis, offering a comprehensive solution. LLMScan systematically monitors the inner workings of an LLM through the lens of causal inference, operating on the premise that the LLM's `brain' behaves differently when generating harmful or untruthful responses. By analyzing the causal contributions of the LLM's input tokens and transformer layers, LLMScan effectively detects misbehavior. Extensive experiments across various tasks and models reveal clear distinctions in the causal distributions between normal behavior and misbehavior, enabling the development of accurate, lightweight detectors for a variety of misbehavior detection tasks.


Poster
#E-1807
Causal Abstraction Learning based on the Semantic Embedding Principle

Gabriele D'Acunto · Fabio Massimo Zennaro · Yorgos Felekis · Paolo Di Lorenzo

Structural causal models (SCMs) allow us to investigate complex systems at multiple levels of resolution. The causal abstraction (CA) framework formalizes the mapping between high- and low-level SCMs. We address CA learning in a challenging and realistic setting, where SCMs are inaccessible, interventional data is unavailable, and sample data is misaligned. A key principle of our framework is semantic embedding, formalized as the high-level distribution lying on a subspace of the low-level one. This principle naturally links linear CA to the geometry of the Stiefel manifold. We present a category-theoretic approach to SCMs that enables the learning of a CA by finding a morphism between the low- and high-level probability measures, adhering to the semantic embedding principle. Consequently, we formulate a general CA learning problem. As an application, we solve the latter problem for linear CA, considering Gaussian measures and the Kullback-Leibler divergence as an objective. Given the nonconvexity of the learning task, we develop three algorithms building upon existing paradigms for Riemannian optimization. We demonstrate that the proposed methods succeed on both synthetic and real-world brain data with different degrees of prior information about the structure of the CA.


Poster
#E-1808
Measuring Variable Importance in Heterogeneous Treatment Effects with Confidence

Joseph Paillard · Angel REYERO LOBO · Vitaliy Kolodyazhniy · Thirion Bertrand · Denis-Alexander Engemann

Causal machine learning (ML) promises to provide powerful tools for estimating individual treatment effects. While causal methods have placed some emphasis on heterogeneity in treatment response, it is of paramount importance to clarify the nature of this heterogeneity by highlighting which variables drive it. We propose PermuCATE, an algorithm based on the Conditional Permutation Importance (CPI) method, for statistically rigorous global variable importance assessment in the estimation of the Conditional Average Treatment Effect (CATE). Theoretical analysis of the finite-sample regime and empirical studies show that PermuCATE has lower variance than the Leave-One-Covariate-Out (LOCO) method and provides a reliable measure of variable importance. This property increases statistical power, which is crucial for causal inference applications with finite sample sizes. We empirically demonstrate the benefits of PermuCATE in simulated and real datasets, including complex settings with high-dimensional, correlated variables.


Poster
#E-1809
TANGO: Clustering with Typicality-Aware Nonlocal Mode-Seeking and Graph-Cut Optimization

Haowen Ma · Zhiguo Long · Hua Meng

Density-based mode-seeking methods generate a density-ascending dependency from low-density points towards higher-density neighbors. Current mode-seeking methods identify modes by breaking some dependency connections, but they rely heavily on local data characteristics and require case-by-case threshold settings or human intervention to be effective on different datasets. To address this issue, we introduce a novel concept called typicality, obtained by examining the locally defined dependency from a global perspective, to quantify how likely a point is to be a mode. We devise an algorithm that effectively and efficiently identifies modes with the help of this global-view typicality. To implement and validate our idea, we design a clustering method called TANGO, which not only leverages typicality to detect modes, but also utilizes graph-cut with an improved path-based similarity to aggregate data into final clusters. Moreover, this paper also provides theoretical analysis of the proposed algorithm. Experimental results on several synthetic and extensive real-world datasets demonstrate the effectiveness and superiority of TANGO. The code is available at https://github.com/SWJTU-ML/TANGO_code.
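
A toy NumPy sketch of the density-ascending dependency the abstract starts from: each point links to its nearest neighbor of strictly higher density, in the spirit of density-peaks clustering. The Gaussian KDE bandwidth is an illustrative assumption, and TANGO's typicality score and graph-cut aggregation are not shown.

    import numpy as np

    def ascending_dependency(X, bandwidth=0.5):
        # Pairwise distances and a simple Gaussian kernel density estimate.
        D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        density = np.exp(-(D / bandwidth) ** 2).sum(axis=1)
        parent = np.full(len(X), -1)
        for i in range(len(X)):
            higher = np.where(density > density[i])[0]
            if higher.size:  # nearest neighbor among higher-density points
                parent[i] = higher[np.argmin(D[i, higher])]
        return parent        # parent == -1 marks a candidate mode

    X = np.random.default_rng(1).normal(size=(200, 2))
    print((ascending_dependency(X) == -1).sum())  # number of candidate modes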


Poster
#E-1810
Learning from Sample Stability for Deep Clustering

Zhixin Li · Yuheng Jia · Hui LIU · Junhui Hou

Deep clustering, an unsupervised technique independent of labels, necessitates tailored supervision for model training. Prior methods explore supervision signals such as similarity and pseudo-labels, yet overlook per-sample analysis of the training process. Our study correlates sample stability during unsupervised training with clustering accuracy and network memorization on a per-sample basis. Unstable representations across epochs often lead to mispredictions, indicating difficulty in memorization and atypicality. Leveraging these findings, we introduce, for the first time, supervision signals based on sample stability at the representation level. Our proposed strategy serves as a versatile tool to enhance various deep clustering techniques. Experiments across benchmark datasets show that incorporating sample stability into training can improve the performance of deep clustering. The code is available at https://github.com/LZX-001/LFSS.


Poster
#E-1811
Unified K-Means Clustering with Label-Guided Manifold Learning

Qianqian Wang · Mengping Jiang · Zhengming Ding · Quanxue Gao

K-Means clustering is a classical and effective unsupervised learning method prized for its simplicity and efficiency. However, it faces notable challenges, including sensitivity to random initial centroid selection, a limited ability to discover the intrinsic manifold structure of nonlinear datasets, and difficulty in achieving balanced clustering in practical scenarios. To overcome these weaknesses, we introduce a novel framework for K-Means that leverages manifold learning. This approach eliminates the need for centroid calculation and utilizes a cluster indicator matrix to align the manifold structures, thereby enhancing clustering accuracy. Beyond the traditional Euclidean distance, our model incorporates Gaussian kernel distance, K-nearest-neighbor distance, and low-pass filtering distance to effectively handle data that is not linearly separable. Furthermore, we introduce a balanced regularizer to achieve balanced clustering results. Detailed experimental results demonstrate the efficacy of our proposed methodology.


Poster
#E-1812
PROTOCOL: Partial Optimal Transport-enhanced Contrastive Learning for Imbalanced Multi-view Clustering

Xuqian Xue · Yiming Lei · Qi Cai · Hongming Shan · Junping Zhang

While contrastive multi-view clustering has achieved remarkable success, it implicitly assumes a balanced class distribution. However, real-world multi-view data primarily exhibit class-imbalanced distributions. Consequently, existing methods suffer performance degradation due to their inability to perceive and model such imbalance. To address this challenge, we present the first systematic study of imbalanced multi-view clustering, focusing on two fundamental problems: i. perceiving the class-imbalance distribution, and ii. mitigating the representation degradation of minority samples. We propose PROTOCOL, a novel PaRtial Optimal TranspOrt-enhanced COntrastive Learning framework for imbalanced multi-view clustering. First, for class-imbalance perception, we map multi-view features into a consensus space and reformulate imbalanced clustering as a partial optimal transport (POT) problem, augmented with progressive mass constraints and weighted KL divergence for class distributions. Second, we develop POT-enhanced class-rebalanced contrastive learning at both the feature and class levels, incorporating logit adjustment and class-sensitive learning to enhance minority-sample representations. Extensive experiments demonstrate that PROTOCOL significantly improves clustering performance on imbalanced multi-view data, filling a critical research gap in this field.


Poster
#E-1900
Variational Counterfactual Intervention Planning to Achieve Target Outcomes

Xin Wang · Shengfei Lyu · Luo Chi · Xiren Zhou · Huanhuan Chen

A key challenge in personalized healthcare is identifying optimal intervention sequences to guide temporal systems toward target outcomes, a novel problem we formalize as counterfactual target achievement. In addressing this problem, directly adopting counterfactual estimation methods faces compounding errors due to the unobservability of counterfactuals. To overcome this, we propose Variational Counterfactual Intervention Planning (VCIP), which reformulates the problem by modeling the conditional likelihood of achieving target outcomes, implemented through variational inference. By leveraging the g-formula to bridge the gap between interventional and observational log-likelihoods, VCIP enables reliable training from observational data. Experiments on both synthetic and real-world datasets show that VCIP significantly outperforms existing methods in target achievement accuracy.
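
For reference, the one-step form of the g-formula the abstract invokes expresses an interventional distribution in terms of observational quantities, given confounders $x$:

$$p\big(y \mid \mathrm{do}(a)\big) = \int p(y \mid a, x)\, p(x)\, dx.$$

This identity is what lets an interventional log-likelihood be trained from observational terms; VCIP develops a sequential, variational version of it.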


Poster
#E-1901
Bayesian Active Learning for Bivariate Causal Discovery

Yuxuan Wang · Mingzhou Liu · Xinwei Sun · Wei Wang · Yizhou Wang

Determining the direction of relationships between variables is fundamental for understanding complex systems across scientific domains. While observational data can uncover relationships between variables, it cannot distinguish between cause and effect without experimental interventions. To effectively uncover causality, previous works have proposed intervention strategies that sequentially optimize the intervention values. However, most of these approaches primarily maximize information-theoretic gains, which may not effectively measure the reliability of direction determination. In this paper, we formulate causal direction identification as a hypothesis-testing problem and propose a Bayes-factor-based intervention strategy, which can quantify the evidence strength of one hypothesis (e.g., causal) over the other (e.g., non-causal). To balance the immediate and future gains of testing strength, we propose a sequential intervention objective over intervention values in multiple steps. By analyzing the objective function, we develop a dynamic programming algorithm that reduces the complexity from non-polynomial to polynomial. Experimental results on bivariate systems, tree-structured graphs, and an embodied AI environment demonstrate the effectiveness of our framework in direction determination and its extensibility to both multivariate settings and real-world applications.


Poster
#E-1902
FedClean: A General Robust Label Noise Correction for Federated Learning

Xiaoqian Jiang · Jing Zhang

Many federated learning scenarios encounter label noise in client-side datasets. The resulting degradation in global model performance raises an urgent need to address this noise. This paper proposes FedClean, a novel, general, and robust label noise correction method for federated learning. FedClean first uses local centralized noisy-label learning to select clean samples for training a global model. Then, it employs a two-stage correction scheme to correct noisy labels from the two distinct perspectives of local noisy-label learning and the global model. FedClean also introduces a novel model aggregation method that further reduces the impact of label noise. FedClean assumes neither the existence of clean clients nor specific noise distributions, showing maximum versatility. Extensive experimental results show that FedClean effectively identifies and rectifies label noise even if all clients exhibit label noise, outperforming state-of-the-art noisy-label learning methods for federated learning.


Poster
#E-1903
OpenworldAUC: Towards Unified Evaluation and Optimization for Open-world Prompt Tuning

Cong Hua · Qianqian Xu · Zhiyong Yang · Zitai Wang · Shilong Bao · Qingming Huang

Prompt tuning adapts Vision-Language Models like CLIP to open-world tasks with minimal training costs. In this direction, one typical paradigm evaluates model performance **separately** on known classes (*i.e.*, base domain) and unseen classes (*i.e.*, new domain). However, real-world scenarios require models to handle inputs **without prior domain knowledge**. This practical challenge has spurred the development of **open-world prompt tuning**, which demands a unified evaluation of two stages: 1) detecting whether an input belongs to the base or new domain (**P1**), and 2) classifying the sample into its correct class (**P2**). What's more, as domain distributions are generally unknown, a proper metric should be insensitive to varying base/new sample ratios (**P3**). However, we find that current metrics, including HM, overall accuracy, and AUROC, fail to satisfy these three properties simultaneously. To bridge this gap, we propose $\mathsf{OpenworldAUC}$, a unified metric that jointly assesses detection and classification through pairwise instance comparisons. To optimize $\mathsf{OpenworldAUC}$ effectively, we introduce **Gated Mixture-of-Prompts (GMoP)**, which employs domain-specific prompts and a gating mechanism to dynamically balance detection and classification. Theoretical guarantees ensure generalization of GMoP under practical conditions. Experiments on 15 benchmarks in open-world scenarios show GMoP achieves SOTA performance on $\mathsf{OpenworldAUC}$ and other metrics.


Poster
#E-1904
Understanding Model Reprogramming for CLIP via Decoupling Visual Prompts

Chengyi Cai · Zesheng Ye · Lei Feng · Jianzhong Qi · Feng Liu

Model reprogramming adapts pretrained models to downstream tasks by modifying only the input and output spaces. Visual reprogramming (VR) is one instance for vision tasks that adds a trainable noise pattern (i.e., a visual prompt) to input images to facilitate downstream classification. The existing VR approaches for CLIP train a single visual prompt using all descriptions of different downstream classes. However, the limited learning capacity may result in (1) a failure to capture diverse aspects of the descriptions (e.g., shape, color, and texture), and (2) a possible bias toward less informative attributes that do not help distinguish between classes. In this paper, we introduce a decoupling-and-reweighting framework. Our decoupled visual prompts (DVP) are optimized using descriptions grouped by explicit causes (DVP-cse) or unsupervised clusters (DVP-cls). Then, we integrate the outputs of these visual prompts with a probabilistic reweighting matrix (PRM) that measures their contributions to each downstream class. Theoretically, DVP lowers the empirical risk bound. Experimentally, DVP outperforms baselines on average across 11 downstream datasets. Notably, the DVP-PRM integration enables insights into how individual visual prompts influence classification decisions, providing a probabilistic framework for understanding reprogramming.


Poster
#E-1905
Testing Conditional Mean Independence Using Generative Neural Networks

Yi Zhang · Linjun Huang · Yun Yang · Xiaofeng Shao

Conditional mean independence (CMI) testing is crucial for statistical tasks including model determination and variable importance evaluation. In this work, we introduce a novel population CMI measure and a bootstrap-based testing procedure that utilizes deep generative neural networks to estimate the conditional mean functions involved in the population measure. The test statistic is thoughtfully constructed to ensure that even slowly decaying nonparametric estimation errors do not affect the asymptotic accuracy of the test. Our approach demonstrates strong empirical performance in scenarios with high-dimensional covariates and response variable, can handle multivariate responses, and maintains nontrivial power against local alternatives outside an $n^{-1/2}$ neighborhood of the null hypothesis. We also use numerical simulations and real-world imaging data applications to highlight the efficacy and versatility of our testing procedure.


Poster
#E-1906
Autoencoder-Based Hybrid Replay for Class-Incremental Learning

Milad Khademi Nori · Il-Min Kim · Guanghui Wang

In class-incremental learning (CIL), effective incremental learning strategies are essential to mitigate task confusion and catastrophic forgetting, especially as the number of tasks $t$ increases. Current exemplar replay strategies impose $\mathcal{O}(t)$ memory/compute complexities. We propose an autoencoder-based hybrid replay (AHR) strategy that leverages our new hybrid autoencoder (HAE) as a compressor to alleviate the requirement for large memory, achieving $\mathcal{O}(0.1t)$ in the worst case with a computing complexity of $\mathcal{O}(t)$ while accomplishing state-of-the-art performance. The decoder later recovers the exemplar data stored in the latent space, rather than in raw format. Additionally, HAE is designed for both discriminative and generative modeling, enabling classification and replay capabilities, respectively. HAE adopts the charged-particle-system energy minimization equations and a repulsive force algorithm for the incremental embedding and distribution of new class centroids in its latent space. Our results demonstrate that AHR consistently outperforms recent baselines across multiple benchmarks while operating with the same memory/compute budgets. The source code is included in the supplementary material and will be open-sourced upon publication.
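
A minimal PyTorch sketch of the latent-replay idea (store compact codes, decode them into exemplars later); the tiny linear encoder/decoder, sizes, and buffer logic are illustrative assumptions — HAE's hybrid discriminative/generative design and charged-particle embedding are not reproduced here.

    import torch
    import torch.nn as nn

    # Illustrative sizes: 784-dim inputs compressed to 32-dim codes.
    enc = nn.Linear(784, 32)   # compressor (stand-in for the paper's encoder)
    dec = nn.Linear(32, 784)   # recovers exemplars from stored codes
    latent_buffer, label_buffer = [], []

    def store_exemplars(x, y):
        # Store compact latent codes instead of raw exemplars.
        with torch.no_grad():
            latent_buffer.append(enc(x))
            label_buffer.append(y)

    def replay_batch():
        # Decode stored codes back into (approximate) exemplars for replay.
        z, y = torch.cat(latent_buffer), torch.cat(label_buffer)
        with torch.no_grad():
            return dec(z), y

    store_exemplars(torch.randn(16, 784), torch.randint(0, 10, (16,)))
    x_replay, y_replay = replay_batch()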


Poster
#E-1907
Parametric Scaling Law of Tuning Bias in Conformal Prediction

Hao Zeng · Kangdao Liu · Bingyi Jing · Hongxin Wei

Conformal prediction is a popular framework of uncertainty quantification that constructs prediction sets with coverage guarantees. To uphold the exchangeability assumption, many conformal prediction methods necessitate an additional hold-out set for parameter tuning. Yet, the impact of violating this principle on coverage remains underexplored, making it ambiguous in practical applications. In this work, we empirically find that the tuning bias, i.e., the coverage gap introduced by leveraging the same dataset for tuning and calibration, is negligible for simple parameter tuning in many conformal prediction methods. In particular, we observe a scaling law for the tuning bias: the bias increases with parameter space complexity and decreases with calibration set size. Formally, we establish a theoretical framework to quantify the tuning bias and provide rigorous proof of the scaling law by deriving an upper bound on the bias. Finally, we discuss how to reduce the tuning bias, guided by the theories we develop.
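
A small NumPy experiment in the spirit of the abstract's setup: tune a parameter on the calibration set itself versus on a fresh hold-out set, and compare test coverage. The synthetic Gaussian scores and the particular tuned parameter (a weight chosen to minimize the calibration quantile) are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, n_cal, n_test = 0.1, 200, 100_000
    grid = np.linspace(0.0, 1.0, 21)

    def coverage(reuse):
        cal = np.abs(rng.normal(size=(n_cal, 2)))
        tune = cal if reuse else np.abs(rng.normal(size=(n_cal, 2)))
        k = int(np.ceil((n_cal + 1) * (1 - alpha)))
        # "Tuning": pick the score weight with the smallest empirical quantile.
        best_w = min(grid, key=lambda w: np.sort(w * tune[:, 0] + (1 - w) * tune[:, 1])[k - 1])
        qhat = np.sort(best_w * cal[:, 0] + (1 - best_w) * cal[:, 1])[k - 1]
        test = np.abs(rng.normal(size=(n_test, 2)))
        return (best_w * test[:, 0] + (1 - best_w) * test[:, 1] <= qhat).mean()

    # The gap between the two printed coverages is the tuning bias.
    print(coverage(reuse=True), coverage(reuse=False))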


Poster
#E-1908
Conformal Prediction with Cellwise Outliers: A Detect-then-Impute Approach

Qian Peng · Yajie Bao · Haojie Ren · Zhaojun Wang · Changliang Zou

Conformal prediction is a powerful tool for constructing prediction intervals for black-box models, providing a finite sample coverage guarantee for exchangeable data. However, this exchangeability is compromised when some entries of the test feature are contaminated, such as in the case of cellwise outliers. To address this issue, this paper introduces a novel framework called *detect-then-impute conformal prediction*. This framework first employs an outlier detection procedure on the test feature and then utilizes an imputation method to fill in those cells identified as outliers. To quantify the uncertainty in the processed test feature, we adaptively apply the detection and imputation procedures to the calibration set, thereby constructing exchangeable features for the conformal prediction interval of the test label. We develop two practical algorithms, $\texttt{PDI-CP}$ and $\texttt{JDI-CP}$, and provide a distribution-free coverage analysis under some commonly used detection and imputation procedures. Notably, $\texttt{JDI-CP}$ achieves a finite sample $1-2\alpha$ coverage guarantee. Numerical experiments on both synthetic and real datasets demonstrate that our proposed algorithms exhibit robust coverage properties and comparable efficiency to the oracle baseline.
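
A bare-bones NumPy sketch of the detect-then-impute step for a single test feature; the z-score detector and median imputation are illustrative stand-ins for the "commonly used detection and imputation procedures" the abstract refers to, and PDI-CP/JDI-CP themselves are not shown.

    import numpy as np

    def detect_then_impute(x, X_ref, z=3.0):
        # Cellwise detection by z-score against reference statistics,
        # followed by median imputation of the flagged cells.
        mu, sd = X_ref.mean(axis=0), X_ref.std(axis=0)
        outlier = np.abs(x - mu) > z * sd
        return np.where(outlier, np.median(X_ref, axis=0), x)

    rng = np.random.default_rng(0)
    X_cal = rng.normal(size=(500, 5))
    x_test = rng.normal(size=5)
    x_test[2] = 50.0                       # a contaminated cell
    print(detect_then_impute(x_test, X_cal))

In the paper's framework, the same detect-and-impute map is applied adaptively to the calibration set, so the processed calibration features remain exchangeable with the processed test feature.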


Poster
#E-1909
Supercharging Graph Transformers with Advective Diffusion

Qitian Wu · Chenxiao Yang · Kaipeng Zeng · Michael Bronstein

The capability to generalize is a cornerstone for the success of modern learning systems. For non-Euclidean data such as graphs, which particularly involve topological structures, one important aspect neglected by prior studies is how machine learning models generalize under topological shifts. This paper proposes AdvDIFFormer, a physics-inspired graph Transformer model designed to address this challenge. The model is derived from advective diffusion equations, which describe a class of continuous message-passing processes with observed and latent topological structures. We show that AdvDIFFormer has provable capability for controlling generalization error under topological shifts, which in contrast cannot be guaranteed by graph diffusion models, i.e., the generalization of common graph neural networks in continuous space. Empirically, the model demonstrates superiority in various predictive tasks across information networks, molecular screening, and protein interactions.


Poster
#E-1910
When Dynamic Data Selection Meets Data Augmentation: Achieving Enhanced Training Acceleration

Suorong Yang · Peng Ye · Furao Shen · Dongzhan Zhou

Dynamic data selection aims to accelerate training without sacrificing performance. However, reducing training data inherently limits data diversity, potentially hindering generalization. While data augmentation is widely used to enhance diversity, it is typically not optimized in conjunction with selection. As a result, directly combining these techniques fails to fully exploit their synergies. To tackle this challenge, we propose a novel online data training framework that, for the first time, unifies dynamic data selection and augmentation, achieving both training efficiency and enhanced performance. Our method estimates each sample's joint distribution of local density and multimodal semantic consistency, allowing for the targeted selection of augmentation-suitable samples while suppressing the inclusion of noisy or ambiguous data. This enables a more significant reduction in dataset size without sacrificing model generalization. Experimental results demonstrate that our method outperforms existing state-of-the-art approaches on various benchmark datasets and architectures, e.g., reducing training costs by 50% on ImageNet-1k with no performance loss. Furthermore, our approach enhances noise resistance and improves model robustness, reinforcing its practical utility in real-world scenarios.


Poster
#E-1911
MME-CoT: Benchmarking Chain-of-Thought in Large Multimodal Models for Reasoning Quality, Robustness, and Efficiency

Dongzhi Jiang · Renrui Zhang · Ziyu Guo · Yanwei Li · Yu Qi · Xinyan Chen · Liuhui Wang · Jianhan Jin · Claire Guo · Shen Yan · Bo Zhang · Chaoyou Fu · Peng Gao · Hongsheng Li

Answering questions with Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs), yet its impact on Large Multimodal Models (LMMs) still lacks a systematic assessment and in-depth investigation. In this paper, we introduce MME-CoT, a specialized benchmark evaluating the CoT reasoning performance of LMMs, spanning six domains: math, science, OCR, logic, space-time, and general scenes. As the first comprehensive study in this area, we propose a thorough evaluation suite incorporating three novel metrics that assess reasoning quality, robustness, and efficiency at a fine-grained level. Leveraging curated high-quality data and a unique evaluation strategy, we conduct an in-depth analysis of state-of-the-art LMMs, uncovering several key insights: 1) models with a reflection mechanism demonstrate superior CoT quality, with Kimi k1.5 outperforming GPT-4o and achieving the highest-quality results; 2) CoT prompting often degrades LMM performance on perception-heavy tasks, suggesting a potentially harmful overthinking behavior; and 3) although CoT quality is high, LMMs with reflection exhibit significant inefficiency in both the normal response and self-correction phases. We hope MME-CoT serves as a foundation for advancing multimodal reasoning in LMMs.


Poster
#E-1912
Efficient Logit-based Knowledge Distillation of Deep Spiking Neural Networks for Full-Range Timestep Deployment

Chengting Yu · Xiaochen Zhao · Lei Liu · Shu Yang · Gaoang Wang · Erping Li · Aili Wang

Spiking Neural Networks (SNNs) are emerging as a brain-inspired alternative to traditional Artificial Neural Networks (ANNs), prized for their potential energy efficiency on neuromorphic hardware. Despite this, SNNs often suffer from accuracy degradation compared to ANNs and face deployment challenges due to fixed inference timesteps, which require retraining for adjustments, limiting operational flexibility. To address these issues, our work considers the spatio-temporal property inherent in SNNs, and proposes a novel distillation framework for deep SNNs that optimizes performance across full-range timesteps without specific retraining, enhancing both efficacy and deployment adaptability. We provide both theoretical analysis and empirical validations to illustrate that training guarantees the convergence of all implicit models across full-range timesteps. Experimental results on CIFAR-10, CIFAR-100, CIFAR10-DVS, and ImageNet demonstrate state-of-the-art performance among distillation-based SNNs training methods. Our code is available at https://github.com/Intelli-Chip-Lab/snntemporaldecoupling_distillation.


Poster
#E-2000
WAVE: Weighted Autoregressive Varying Gate for Time Series Forecasting

Jiecheng Lu · Xu Han · Yan Sun · Shihao Yang

We propose a Weighted Autoregressive Varying gatE (WAVE) attention mechanism equipped with both Autoregressive (AR) and Moving-average (MA) components. It can adapt to various attention mechanisms, enhancing and decoupling their ability to capture long-range and local temporal patterns in time series data. In this paper, we first demonstrate that, for the time series forecasting (TSF) task, the previously overlooked decoder-only autoregressive Transformer model can achieve results comparable to the best baselines when appropriate tokenization and training methods are applied. Moreover, inspired by the ARMA model from statistics and recent advances in linear attention, we introduce the full ARMA structure into existing autoregressive attention mechanisms. By using an indirect MA weight generation method, we incorporate the MA term while maintaining the time complexity and parameter size of the underlying efficient attention models. We further explore how indirect parameter generation can produce implicit MA weights that align with the modeling requirements for local temporal impacts. Experimental results show that WAVE attention that incorporates the ARMA structure consistently improves the performance of various AR attentions on TSF tasks, achieving state-of-the-art results.
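
For reference, the classical ARMA($p$, $q$) structure that WAVE ports into attention combines autoregressive terms on past values with moving-average terms on past innovations:

$$x_t = \sum_{i=1}^{p} \phi_i\, x_{t-i} + \varepsilon_t + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j}.$$

Loosely speaking, the $\phi$-like weights correspond to the existing autoregressive attention, while the $\theta$-like MA weights are the ones WAVE generates indirectly so that the time complexity and parameter size of the underlying efficient attention are preserved.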


Poster
#E-2001
TimeStacker: A Novel Framework with Multilevel Observation for Capturing Nonstationary Patterns in Time Series Forecasting

Qinglong Liu · Cong Xu · Wenhao Jiang · Kaixuan Wang · Lin Ma · Haifeng Li

Real-world time series inherently exhibit significant non-stationarity, posing substantial challenges for forecasting. To address this issue, this paper proposes a novel prediction framework, TimeStacker, designed to overcome the limitations of existing models in capturing the characteristics of non-stationary signals. By employing a unique stacking mechanism, TimeStacker effectively captures global signal features while thoroughly exploring local details. Furthermore, the framework integrates a frequency-based self-attention module, significantly enhancing its feature modeling capabilities. Experimental results demonstrate that TimeStacker achieves outstanding performance across multiple real-world datasets, including those from the energy, finance, and weather domains. It not only delivers superior predictive accuracy but also exhibits remarkable advantages with fewer parameters and higher computational efficiency.


Poster
#E-2002
TimeFilter: Patch-Specific Spatial-Temporal Graph Filtration for Time Series Forecasting

Yifan Hu · Guibin Zhang · Peiyuan Liu · Disen Lan · Naiqi Li · Dawei Cheng · Tao Dai · Shutao Xia · Shirui Pan

Time series forecasting methods generally fall into two main categories: Channel Independent (CI) and Channel Dependent (CD) strategies. While CI overlooks important covariate relationships, CD captures all dependencies without distinction, introducing noise and reducing generalization. Recent advances in Channel Clustering (CC) aim to refine dependency modeling by grouping channels with similar characteristics and applying tailored modeling techniques. However, coarse-grained clustering struggles to capture complex, time-varying interactions effectively. To address these challenges, we propose TimeFilter, a GNN-based framework for adaptive and fine-grained dependency modeling. After constructing the graph from the input sequence, TimeFilter refines the learned spatial-temporal dependencies by filtering out irrelevant correlations while preserving the most critical ones in a patch-specific manner. Extensive experiments on 13 real-world datasets from diverse application domains demonstrate the state-of-the-art performance of TimeFilter. The code is available at https://github.com/TROUBADOUR000/TimeFilter.


Poster
#E-2003
Implicit Bias of Gradient Descent for Non-Homogeneous Deep Networks

Yuhang Cai · Kangjie Zhou · Jingfeng Wu · Song Mei · Michael Lindsey · Peter Bartlett

We establish the asymptotic implicit bias of gradient descent (GD) for generic non-homogeneous deep networks under exponential loss. Specifically, we characterize three key properties of GD iterates starting from a sufficiently small empirical risk, where the threshold is determined by a measure of the network's non-homogeneity. First, we show that a normalized margin induced by the GD iterates increases nearly monotonically. Second, we prove that while the norm of the GD iterates diverges to infinity, the iterates themselves converge in direction. Finally, we establish that this directional limit satisfies the Karush–Kuhn–Tucker (KKT) conditions of a margin maximization problem. Prior works on implicit bias have focused exclusively on homogeneous networks; in contrast, our results apply to a broad class of non-homogeneous networks satisfying a mild near-homogeneity condition. In particular, our results apply to networks with residual connections and non-homogeneous activation functions, thereby resolving an open problem posed by Ji & Telgarsky (2020).


Poster
#E-2004
Simplicity Bias and Optimization Threshold in Two-Layer ReLU Networks

Etienne Boursier · Nicolas Flammarion

Understanding the generalization of overparametrized models remains a fundamental challenge in machine learning. The literature mostly studies generalization from an interpolation point of view, taking convergence towards a global minimum of the training loss for granted. This interpolation paradigm does not seem valid for complex tasks such as in-context learning or diffusion. It has instead been empirically observed that trained models go from global minima to spurious local minima of the training loss as the number of training samples becomes larger than a level we call the optimization threshold. This paper theoretically explores this phenomenon in the context of two-layer ReLU networks. We demonstrate that, despite overparametrization, networks might converge towards simpler solutions rather than interpolating the training data, which can lead to a drastic improvement on the test loss. Our analysis relies on the so-called early alignment phase, during which neurons align towards specific directions. This directional alignment leads to a simplicity bias, wherein the network approximates the ground truth model without converging to the global minimum of the training loss. Our results suggest that this bias, resulting in an optimization threshold beyond which interpolation is no longer reached, is beneficial and enhances the generalization of trained models.


Poster
#E-2005
Task Generalization with Autoregressive Compositional Structure: Can Learning from $D$ Tasks Generalize to $D^T$ Tasks?

Amirhesam Abedsoltan · Huaqing Zhang · Kaiyue Wen · Hongzhou Lin · Jingzhao Zhang · Misha Belkin

Large language models (LLMs) exhibit remarkable task generalization, solving tasks they were never explicitly trained on with only a few demonstrations. This raises a fundamental question: when can learning from a small set of tasks generalize to a large task family? In this paper, we investigate task generalization through the lens of autoregressive compositional structure, where each task is a composition of $T$ operations and each operation is drawn from a finite family of $D$ subtasks. This yields a total task class of size $D^T$. We first show that generalization to all $D^T$ tasks is theoretically achievable by training on only $\tilde{O}(D)$ tasks. Empirically, we demonstrate that Transformers achieve such exponential task generalization on sparse parity functions via in-context learning (ICL) and chain-of-thought (CoT) reasoning. We further demonstrate this exponential generalization in arithmetic and language translation, extending beyond parity functions.
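
A toy illustration of the counting argument, under assumed elementary operations: with $D$ subtasks composed over $T$ steps, the task family has size $D^T$, even though only the $D$ components need to be learned. The three bit-vector operations below are illustrative choices, not the paper's task family.

    from itertools import product

    # D elementary operations on a bit-vector state (illustrative choices).
    ops = {
        "flip0":  lambda s: (1 - s[0],) + s[1:],
        "swap01": lambda s: (s[1], s[0]) + s[2:],
        "parity": lambda s: s + (sum(s) % 2,),
    }
    D, T = len(ops), 4
    tasks = list(product(ops, repeat=T))  # a task = a length-T composition
    print(len(tasks), "==", D ** T)       # 81 == 81

    def run(task, state):
        for name in task:
            state = ops[name](state)
        return state

    print(run(tasks[0], (0, 1, 0)))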


Spotlight Poster
#E-2006
Sharp Generalization for Nonparametric Regression by Over-Parameterized Neural Networks: A Distribution-Free Analysis in Spherical Covariate

Yingzhen Yang

Sharp generalization bounds for neural networks trained by gradient descent (GD) are of central interest in statistical learning theory and deep learning. In this paper, we consider nonparametric regression by an over-parameterized two-layer NN trained by GD. We show that, if the neural network is trained by GD with early stopping, then the trained network renders a sharp rate of the nonparametric regression risk of $O(\epsilon_n^2)$, which is the same rate as that for classical kernel regression trained by GD with early stopping, where $\epsilon_n$ is the critical population rate of the Neural Tangent Kernel (NTK) associated with the network and $n$ is the size of the training data. Remarkably, our result does not require distributional assumptions on the covariate as long as the covariate lies on the unit sphere, in strong contrast with many existing results which rely on specific distributions such as the spherical uniform data distribution or distributions satisfying certain restrictive conditions. As a special case of our general result, when the eigenvalues of the associated NTK decay at a rate of $\lambda_j \asymp j^{-\frac{d}{d-1}}$ for $j \ge 1$, which happens under certain distributional assumptions such as the training features following the spherical uniform distribution, we immediately obtain the minimax optimal rate of $O(n^{-\frac{d}{2d-1}})$, which recovers the main results of several existing works in this direction. The neural network width in our general result is lower bounded by a function of only $d$ and $\epsilon_n$, and such width does not depend on the minimum eigenvalue of the empirical NTK matrix, whose lower bound usually requires additional assumptions on the training data. Our results are built upon two significant technical results which are of independent interest. First, uniform convergence to the NTK is established during the training process by GD, so that we can have a nice decomposition of the neural network function at any step of GD into a function in the Reproducing Kernel Hilbert Space associated with the NTK and an error function with a small $L^{\infty}$-norm. Second, local Rademacher complexity is employed to tightly bound the Rademacher complexity of the function class comprising all the possible neural network functions obtained by GD. Our result formally fills the gap between training a classical kernel regression model and training an over-parameterized but finite-width neural network by GD for nonparametric regression without distributional assumptions on the spherical covariate.


Poster
#E-2007
Understanding Sharpness Dynamics in NN Training with a Minimalist Example: The Effects of Dataset Difficulty, Depth, Stochasticity, and More

Geonhui Yoo · Minhak Song · Chulhee Yun

When training deep neural networks with gradient descent, sharpness often increases---a phenomenon known as progressive sharpening---before saturating at the edge of stability. Although commonly observed in practice, the underlying mechanisms behind progressive sharpening remain poorly understood. In this work, we study this phenomenon using a minimalist model: a deep linear network with a single neuron per layer. We show that this simple model effectively captures the sharpness dynamics observed in recent empirical studies, offering a simple testbed to better understand neural network training. Moreover, we theoretically analyze how dataset properties, network depth, stochasticity of optimizers, and step size affect the degree of progressive sharpening in the minimalist model. We then empirically demonstrate how these theoretical insights extend to practical scenarios. This study offers a deeper understanding of sharpness dynamics in neural network training, highlighting the interplay between depth, training data, and optimizers.


Poster
#E-2008
Tree-Sliced Wasserstein Distance with Nonlinear Projection

Thanh Tran · Viet Hoang Tran · Thanh Chu · Trang Pham · Laurent Ghaoui · Tam Le · Tan Nguyen

Tree-Sliced methods have recently emerged as an alternative to the traditional Sliced Wasserstein (SW) distance, replacing one-dimensional lines with tree-based metric spaces and incorporating a splitting mechanism for projecting measures. This approach enhances the ability to capture the topological structures of integration domains in Sliced Optimal Transport while maintaining low computational costs. Building on this foundation, we propose a novel nonlinear projectional framework for the Tree-Sliced Wasserstein (TSW) distance, substituting the linear projections in earlier versions with general projections, while ensuring the injectivity of the associated Radon Transform and preserving the well-definedness of the resulting metric. By designing appropriate projections, we construct efficient metrics for measures on both Euclidean spaces and spheres. Finally, we validate our proposed metric through extensive numerical experiments for Euclidean and spherical datasets. Applications include gradient flows, self-supervised learning, and generative models, where our methods demonstrate significant improvements over recent SW and TSW variants.
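
For orientation, the baseline SW distance that tree-sliced methods generalize projects the measures onto random lines and averages 1D Wasserstein distances, which reduce to differences of sorted samples. The Monte Carlo estimate below assumes equal sample sizes and $p = 2$.

    import numpy as np

    def sliced_wasserstein(X, Y, n_proj=100, seed=0):
        rng = np.random.default_rng(seed)
        d = X.shape[1]
        total = 0.0
        for _ in range(n_proj):
            theta = rng.normal(size=d)
            theta /= np.linalg.norm(theta)      # random direction on the sphere
            x1d, y1d = np.sort(X @ theta), np.sort(Y @ theta)
            total += np.mean((x1d - y1d) ** 2)  # closed-form 1D squared W2
        return np.sqrt(total / n_proj)

    X = np.random.default_rng(1).normal(size=(500, 3))
    Y = np.random.default_rng(2).normal(size=(500, 3)) + 1.0
    print(sliced_wasserstein(X, Y))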


Poster
#E-2009
Widening the Network Mitigates the Impact of Data Heterogeneity on FedAvg

Like Jian · Dong Liu

Federated learning (FL) enables decentralized clients to train a model collaboratively without sharing local data. A key distinction between FL and centralized learning is that clients' data are non-independent and identically distributed, which poses significant challenges in training a global model that generalizes well across heterogeneous local data distributions. In this paper, we analyze the convergence of overparameterized FedAvg with gradient descent (GD). We prove that the impact of data heterogeneity diminishes as the width of neural networks increases, ultimately vanishing when the width approaches infinity. In the infinite-width regime, we further prove that both the global and local models in FedAvg behave as linear models, and that FedAvg achieves the same generalization performance as centralized learning with the same number of GD iterations. Extensive experiments validate our theoretical findings across various network architectures, loss functions, and optimization methods.


Poster
#E-2010
Sketch to Adapt: Fine-Tunable Sketches for Efficient LLM Adaptation

Tianyi Zhang · Junda Su · Aditya Desai · Oscar Wu · Zhaozhuo Xu · Anshumali Shrivastava

Adapting pre-trained large language models (LLMs) is crucial but challenging due to their enormous size. Parameter-efficient fine-tuning (PEFT) techniques typically employ additive adapters applied to frozen model weights. To further reduce memory usage, model weights are often compressed through quantization. However, existing PEFT methods often yield suboptimal model quality because they rely on restrictive assumptions, such as low-rank constraints on adapters to limit the number of trainable parameters. We find that sketching, a popular data compression technique, can serve as an efficient LLM adaptation strategy while avoiding the low-rank assumption. We introduce SketchTune, a compressive adaptation strategy that compresses LLM weights into compact fine-tunable sketches, integrating compression and adaptation into a unified framework. This integration eliminates the need for complex two-path computation in existing PEFT techniques, enabling faster and more memory-efficient training and inference. SketchTune is supported by mathematical insights into matrix classes that are better approximated using sketching rather than low-rank methods. Our extensive evaluations with Llama and Mistral models demonstrate that SketchTune outperforms leading PEFT methods across diverse tasks while using substantially smaller base models and comparable trainable parameters. As a highlight, SketchTune outperforms LoRA, DoRA, and S2FT on commonsense and math benchmarks using 2.6-3.5$\times$ smaller base models and exceeds LoftQ in accuracy by 14.48\% on GSM8K with 7.3$\times$ fewer trainable parameters.


Poster
#E-2011
LaCache: Ladder-Shaped KV Caching for Efficient Long-Context Modeling of Large Language Models

Dachuan Shi · Yonggan Fu · Xiangchi Yuan · Zhongzhi Yu · Haoran You · Sixu Li · Xin Dong · Jan Kautz · Pavlo Molchanov · Yingyan (Celine) Lin

Recent advancements in Large Language Models (LLMs) have spurred interest in numerous applications requiring robust long-range capabilities, essential for processing extensive input contexts and continuously generating extended outputs. As sequence lengths increase, the number of Key-Value (KV) pairs in LLMs escalates, creating a significant efficiency bottleneck. In this paper, we propose a new KV cache optimization paradigm called LaCache, a training-free method for efficient and accurate generative inference of LLMs. LaCache enables LLMs to simultaneously address both of the critical challenges in long-range modeling: robust long-range capabilities and continuous generation without running out of memory (OOM). Specifically, LaCache integrates two key innovations: (1) a ladder-shaped KV cache pattern that stores KV pairs not only sequentially (left-to-right within each layer) but also across layers (from shallow to deep), providing an extended span for capturing long-range dependencies under a fixed storage budget, thereby boosting long-range capabilities; and (2) an iterative compaction mechanism that progressively compresses older caches, freeing up space for new tokens within a fixed cache size. This token-distance-based dynamic compression enables more effective continuous generation under constrained cache budgets. Experiments across various tasks, benchmarks, and LLM models consistently validate LaCache's effectiveness in enhancing LLMs' long-range capabilities. Our code is available at https://github.com/GATECH-EIC/LaCache.


Poster
#E-2012
Linear Mode Connectivity between Multiple Models modulo Permutation Symmetries

Akira Ito · Masanori Yamada · Atsutoshi Kumagai

Ainsworth et al. empirically demonstrated that linear mode connectivity (LMC) can be achieved between two independently trained neural networks (NNs) by applying an appropriate parameter permutation. LMC is satisfied if a linear path with non-increasing test loss exists between the models, suggesting that NNs trained with stochastic gradient descent (SGD) converge to a single approximately convex low-loss basin under permutation symmetries. However, Ainsworth et al. verified LMC for two models and provided only limited discussion on its extension to multiple models. In this paper, we conduct a more detailed empirical analysis. First, we show that existing permutation search methods designed for two models can fail to transfer multiple models into the same convex low-loss basin. Next, we propose a permutation search method using a straight-through estimator for multiple models (STE-MM). We then experimentally demonstrate that even when multiple models are given, the test loss of the merged model remains nearly the same as the losses of the original models when using STE-MM, and the loss barriers between all permuted model pairs are also small. Additionally, from the perspective of the trace of the Hessian matrix, we show that the loss sharpness around the merged model decreases as the number of models increases with STE-MM, indicating that LMC for multiple models is more likely to hold. The source code implementing our method is available at https://github.com/e5-a/STE-MM.
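
For context, the two-model weight-matching baseline (in the spirit of Ainsworth et al.) that STE-MM generalizes can be posed as a linear assignment per layer. This single-hidden-layer NumPy sketch is illustrative: it omits the coordinate descent over layers used for deeper networks and is not the STE-MM procedure itself.

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    def match_hidden_units(W1_a, W2_a, W1_b, W2_b):
        # One hidden layer: f(x) = W2 @ act(W1 @ x).  Find the permutation of
        # model B's hidden units that best aligns its weights with model A's.
        cost = W1_a @ W1_b.T + W2_a.T @ W2_b    # unit-to-unit similarity
        _, perm = linear_sum_assignment(-cost)  # maximize total similarity
        return W1_b[perm], W2_b[:, perm]        # permuted model B weights

    rng = np.random.default_rng(0)
    W1_a, W2_a = rng.normal(size=(64, 10)), rng.normal(size=(3, 64))
    W1_b, W2_b = rng.normal(size=(64, 10)), rng.normal(size=(3, 64))
    W1_b_perm, W2_b_perm = match_hidden_units(W1_a, W2_a, W1_b, W2_b)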


Poster
#E-2100
Linear Transformers as VAR Models: Aligning Autoregressive Attention Mechanisms with Autoregressive Forecasting

Jiecheng Lu · Shihao Yang

Autoregressive attention-based time series forecasting (TSF) has drawn increasing interest, with mechanisms like linear attention often outperforming vanilla attention. However, deeper Transformer architectures frequently misalign with autoregressive objectives, obscuring the underlying VAR structure embedded within linear attention and hindering their ability to capture the data generative processes in TSF. In this work, we first show that a single linear attention layer can be interpreted as a dynamic vector autoregressive (VAR) structure. We then explain that existing multi-layer Transformers have structural mismatches with the autoregressive forecasting objective, which impair interpretability and generalization ability. To address this, we show that by rearranging the MLP, attention, and input-output flow, multi-layer linear attention can also be aligned as a VAR model. Then, we propose Structural Aligned Mixture of VAR (SAMoVAR), a linear Transformer variant that integrates interpretable dynamic VAR weights for multivariate TSF. By aligning the Transformer architecture with autoregressive objectives, SAMoVAR delivers improved performance, interpretability, and computational efficiency compared to SOTA TSF models.
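
One way to see the VAR reading the paper starts from: a causal, unnormalized linear attention output is a weighted sum of past value vectors,

$$y_t = \sum_{s \le t} \big(\phi(q_t)^{\top} \phi(k_s)\big)\, v_s = \sum_{s \le t} a_{t,s}\, v_s,$$

so when $v_s$ is a linear map of the input token $x_s$, the data-dependent weights $a_{t,s}$ act as dynamic VAR coefficients; the paper shows how to preserve this alignment when layers are stacked.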


Poster
#E-2101
Measuring In-Context Computation Complexity via Hidden State Prediction

Vincent Herrmann · Róbert Csordás · Jürgen Schmidhuber

Detecting when a neural sequence model does "interesting" computation is an open problem. The next-token prediction loss is a poor indicator: low loss can stem from trivially predictable sequences that are uninteresting, while high loss may reflect unpredictable but also irrelevant information that can be ignored by the model. We propose a better metric: measuring the model's ability to predict its own future hidden states. We show empirically that this metric, in contrast to the next-token prediction loss, correlates with the intuitive interestingness of the task. To measure predictability, we introduce the architecture-agnostic "prediction of hidden states" (PHi) layer that serves as an information bottleneck on the main pathway of the network (e.g., the residual stream in Transformers). We propose a novel learned predictive prior that enables us to measure the novel information gained in each computation step, which serves as our metric. We show empirically that our metric predicts the description length of formal languages learned in-context, the complexity of mathematical reasoning problems, and the correctness of self-generated reasoning chains.
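The core measurement can be sketched loosely as follows; this toy predictor (a single linear map with a squared-error surrogate, rather than the paper's learned predictive prior) only illustrates the idea of scoring each step by how hard the next hidden state is to predict.

```python
# Loose sketch of the PHi idea: predict the network's next hidden state from
# the current one and read out the per-step prediction error as a signal of
# "information gained" at that step. Architecture details are placeholders.
import torch
import torch.nn as nn

class HiddenStatePredictor(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.predict = nn.Linear(d, d)

    def forward(self, hidden):                 # hidden: (batch, time, d)
        pred = self.predict(hidden[:, :-1])    # predict h_{t+1} from h_t
        target = hidden[:, 1:].detach()
        per_step = ((pred - target) ** 2).mean(dim=-1)  # (batch, time-1)
        return per_step                        # high = hard-to-predict step
```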


Poster
#E-2102
Time-VLM: Exploring Multimodal Vision-Language Models for Augmented Time Series Forecasting

Siru Zhong · Weilin Ruan · Ming Jin · Huan Li · Qingsong Wen · Yuxuan Liang

Recent advancements in time series forecasting have explored augmenting models with text or vision modalities to improve accuracy. While text provides contextual understanding, it often lacks fine-grained temporal details. Conversely, vision captures intricate temporal patterns but lacks semantic context, limiting the complementary potential of these modalities. To address this, we propose Time-VLM, a novel multimodal framework that leverages pre-trained Vision-Language Models (VLMs) to bridge temporal, visual, and textual modalities for enhanced forecasting. Our framework comprises three key components: (1) a Retrieval-Augmented Learner, which extracts enriched temporal features through memory bank interactions; (2) a Vision-Augmented Learner, which encodes time series as informative images; and (3) a Text-Augmented Learner, which generates contextual textual descriptions. These components collaborate with frozen pre-trained VLMs to produce multimodal embeddings, which are then fused with temporal features for final prediction. Extensive experiments demonstrate that Time-VLM achieves superior performance, particularly in few-shot and zero-shot scenarios, thereby establishing a new direction for multimodal time series forecasting. Code is available at https://github.com/CityMind-Lab/ICML25-TimeVLM.


Poster
#E-2103
A Closer Look at Transformers for Time Series Forecasting: Understanding Why They Work and Where They Struggle

Yu Chen · Nathalia Céspedes · Payam Barnaghi

Time-series forecasting is crucial across various domains, including finance, healthcare, and energy. Transformer models, originally developed for natural language processing, have demonstrated significant potential in addressing challenges associated with time-series data. These models utilize different tokenization strategies (point-wise, patch-wise, and variate-wise) to represent time-series data, each resulting in a different scope for the attention maps. Despite the emergence of sophisticated architectures, simpler transformers consistently outperform their more complex counterparts in widely used benchmarks. This study examines why point-wise transformers are generally less effective, why intra- and inter-variate attention mechanisms yield similar outcomes, and which architectural components drive the success of simpler models. By analyzing mutual information and evaluating models on synthetic datasets, we demonstrate that intra-variate dependencies are the primary contributors to prediction performance on benchmarks, while inter-variate dependencies have a minor impact. Techniques such as Z-score normalization and skip connections are also crucial. However, these results are largely influenced by the self-dependent and stationary nature of benchmark datasets. By validating our findings on real-world healthcare data, we provide insights for designing more effective transformers for practical applications.
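For concreteness, the three tokenization strategies can be sketched as reshaping operations; the patch length and tensor layout below are illustrative choices, not the exact ones used in the paper.

```python
# Illustrative sketch of the three tokenization strategies for a
# multivariate series of shape (batch, length, n_vars).
import torch

x = torch.randn(8, 96, 7)  # (batch, time steps, variates)

# Point-wise: one token per time step; each token holds all variates.
point_tokens = x                              # (8, 96, 7)

# Patch-wise: one token per window of P consecutive steps, per variate.
P = 16
patch_tokens = (
    x.permute(0, 2, 1)                        # (8, 7, 96)
     .unfold(dimension=2, size=P, step=P)     # (8, 7, 6, 16)
     .reshape(8, 7 * 6, P)                    # (8, 42, 16): 42 tokens
)

# Variate-wise: one token per variate; each token holds the whole series.
variate_tokens = x.permute(0, 2, 1)           # (8, 7, 96): 7 tokens
```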


Poster
#E-2104
ML$^2$-GCL: Manifold Learning Inspired Lightweight Graph Contrastive Learning

Jianqing Liang · Zhiqiang Li · Xinkai Wei · Yuan Liu · Zhiqiang Wang

Graph contrastive learning has attracted great interest as a dominant and promising self-supervised representation learning approach in recent years. While existing works follow the basic principle of pulling positive pairs closer and pushing negative pairs far away, they still suffer from several critical problems, such as the semantic disturbance introduced by augmentation strategies, the failure of GCNs to capture long-range dependence, and the rigidity and inefficiency of node sampling techniques. To address these issues, we propose Manifold Learning Inspired Lightweight Graph Contrastive Learning (ML$^2$-GCL), which inherits the merits of both manifold learning and GCNs. ML$^2$-GCL avoids the potential risks of semantic disturbance by using only a single view. It achieves global nonlinear structure recovery from locally linear fits, which compensates for the defects of GCNs. Its most notable advantage is its light weight, owing to a closed-form solution for positive-pair weights and the removal of pairwise distance calculations. Theoretical analysis proves the existence of the optimal closed-form solution. Extensive empirical results on various benchmarks and evaluation protocols demonstrate the effectiveness and light weight of ML$^2$-GCL. We release the code at https://github.com/a-hou/ML2-GCL.
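The "locally linear fits" echo classic manifold learning; a minimal sketch of closed-form local reconstruction weights in the LLE style (not necessarily the exact closed-form solution derived in the paper) looks like this:

```python
# Classic LLE-style closed-form weights: reconstruct a point from its
# neighbors under a sum-to-one constraint. Illustrative only.
import numpy as np

def local_weights(x, neighbors, reg=1e-3):
    """Solve w = argmin ||x - sum_j w_j n_j||^2  s.t.  sum_j w_j = 1."""
    Z = neighbors - x                        # (k, d): shift neighbors to origin
    G = Z @ Z.T                              # (k, k) local Gram matrix
    G += reg * np.trace(G) * np.eye(len(G))  # regularize for stability
    w = np.linalg.solve(G, np.ones(len(G)))
    return w / w.sum()                       # enforce the sum-to-one constraint
```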


Poster
#E-2105
Clustering Properties of Self-Supervised Learning

Xi Weng · Jianing An · Xudong Ma · Binhang Qi · Jie Luo · Xi Yang · Jin Song Dong · Lei Huang

Self-supervised learning (SSL) methods via joint embedding architectures have proven remarkably effective at capturing semantically rich representations with strong clustering properties, even in the absence of label supervision. Despite this, few of them have explored leveraging these untapped properties to improve themselves. In this paper, we provide evidence, through various metrics, that the encoder's output exhibits superior and more stable clustering properties compared to other components. Building on this insight, we propose a novel positive-feedback SSL method, termed Representation Self-Assignment (ReSA), which leverages the model's clustering properties to promote learning in a self-guided manner. Extensive experiments on standard SSL benchmarks reveal that models pretrained with ReSA outperform other state-of-the-art SSL methods by a significant margin. Finally, we analyze how ReSA facilitates better clustering properties, demonstrating that it effectively enhances clustering performance at both fine-grained and coarse-grained levels, shaping representations that are inherently more structured and semantically meaningful.


Poster
#E-2106
M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture

Hongyang Lei · Xiaolong Cheng · Qi Qin · Dan Wang · Huazhen Huang · Qingqing Gu · Yetao Wu · Luo Ji

Current multimodal learning strategies primarily optimize in the original token space. Such a framework is easy to incorporate with the backbone of a pretrained language model, but it might result in modality collapse. To alleviate such issues, we leverage the Joint-Embedding Predictive Architecture (JEPA) for multimodal tasks: a predictor converts the input embedding into the output embedding space, and cross-modal alignment is then conducted in the latent space. We implement this predictor with a Multi-Gate Mixture of Experts (MMoE) and accordingly name the framework M3-JEPA. The gating function disentangles the modality-specific and shared information and derives information-theoretic optimality. The framework is trained with both contrastive and regularization losses, and optimized by alternating gradient descent (AGD) across different multimodal tasks. Through carefully designed experiments, we show that M3-JEPA can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in both training and inference. Our observations suggest that M3-JEPA might become a new basis for self-supervised learning in the open world.


Poster
#E-2107
Structure-informed Risk Minimization for Robust Ensemble Learning

Fengchun Qiao · Yanlin Chen · Xi Peng

Ensemble learning is a powerful approach for improving generalization under distribution shifts, but its effectiveness heavily depends on how individual models are combined. Existing methods often optimize ensemble weights based on validation data, which may not represent unseen test distributions, leading to suboptimal performance in out-of-distribution (OoD) settings. Inspired by Distributionally Robust Optimization (DRO), we propose Structure-informed Risk Minimization (SRM), a principled framework that learns robust ensemble weights without access to test data. Unlike standard DRO, which defines uncertainty sets based on divergence metrics alone, SRM incorporates structural information of training distributions, ensuring that the uncertainty set aligns with plausible real-world shifts. This approach mitigates the over-pessimism of traditional worst-case optimization while maintaining robustness. We introduce a computationally efficient optimization algorithm with theoretical guarantees and demonstrate that SRM achieves superior OoD generalization compared to existing ensemble combination strategies across diverse benchmarks. Code is available at: https://github.com/deep-real/SRM.


Poster
#E-2108
Adversarial Inputs for Linear Algebra Backends

Jonas Möller · Lukas Pirch · Felix Weissberg · Sebastian Baunsgaard · Thorsten Eisenhofer · Konrad Rieck

Linear algebra is a cornerstone of neural network inference. The efficiency of popular frameworks, such as TensorFlow and PyTorch, critically depends on backend libraries providing highly optimized matrix multiplications and convolutions. A diverse range of these backends exists across platforms, including Intel MKL, Nvidia CUDA, and Apple Accelerate. Although these backends provide equivalent functionality, subtle variations in their implementations can lead to seemingly negligible differences during inference. In this paper, we investigate these minor discrepancies and demonstrate how they can be selectively amplified by adversaries. Specifically, we introduce Chimera examples, inputs to models that elicit conflicting predictions depending on the employed backend library. These inputs can even be constructed with integer values, creating a vulnerability exploitable from real-world input domains. We analyze the prevalence and extent of the underlying attack surface and propose corresponding defenses to mitigate this threat.
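The root cause the paper exploits is easy to reproduce: floating-point addition is not associative, so two mathematically identical matrix products can differ once the backend changes its reduction order. A minimal demonstration, varying only the reduction order on one machine:

```python
# Numerically "equivalent" linear algebra differs with reduction order alone.
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal((256, 256)).astype(np.float32)

direct = a @ b                                 # backend's own reduction order
blocked = sum(a[:, i:i + 64] @ b[i:i + 64, :]  # same product, different order
              for i in range(0, 256, 64))

print(np.abs(direct - blocked).max())          # tiny but nonzero, e.g. ~1e-5
```

Chimera examples amplify exactly such discrepancies until they flip a model's prediction on one backend but not another.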


Spotlight Poster
#E-2109
Improving Zero-Shot Adversarial Robustness in Vision-Language Models by Closed-form Alignment of Adversarial Path Simplices

Junhao Dong · Piotr Koniusz · Yifei Zhang · Hao Zhu · Weiming Liu · Xinghua Qu · Yew Soon ONG

Vision-Language Models (VLMs) such as CLIP excel at zero-shot classification due to large-scale pre-training but are vulnerable to adversarial examples. Adversarial fine-tuning robustifies zero-shot models by aligning prediction scores of individual adversaries with their clean counterparts, but it typically overlooks intermediate adversarial samples along the adversarial trajectory crossing the decision boundary. Such intermediate adversaries and their vicinity produce informative representations capturing the decision boundary in detail. They can be improved by sampling adversarial candidates from simplices formed by joining two consecutive vertices on the adversarial trajectory and their clean counterpart. However, sampling simplices for adversaries is very costly. To train robust VLMs, we overcome these limitations via a Taylor expansion, formulating an upper bound on the alignment loss that depends on the Jacobian/Hessian obtained at clean samples. As regions between clean and intermediate adversarial samples capture a larger decision landscape, we robustify the VLM with plausible adversaries drawn from simplices through our closed-form formulation, which is equivalent to infinite uniform sampling of the simplex. We obtain state-of-the-art robustness across 15 datasets and diverse vision-language tasks.


Poster
#E-2110
ROME is Forged in Adversity: Robust Distilled Datasets via Information Bottleneck

Zheng Zhou · Wenquan Feng · Qiaosheng Zhang · Shuchang Lyu · Qi Zhao · Guangliang Cheng

Dataset Distillation (DD) compresses large datasets into smaller, synthetic subsets, enabling models trained on them to achieve performance comparable to those trained on the full data. However, these models remain vulnerable to adversarial attacks, limiting their use in safety-critical applications. While adversarial robustness has been extensively studied in related fields, research on improving DD robustness is still limited. To address this, we propose ROME, a novel method that enhances the adversarial RObustness of DD by leveraging the InforMation BottlenEck (IB) principle. ROME includes two components: a performance-aligned term to preserve accuracy and a robustness-aligned term to improve robustness by aligning feature distributions between synthetic and perturbed images. Furthermore, we introduce the Improved Robustness Ratio (I-RR), a refined metric to better evaluate DD robustness. Extensive experiments on CIFAR-10 and CIFAR-100 demonstrate that ROME outperforms existing DD methods in adversarial robustness, achieving maximum I-RR improvements of nearly 40% under white-box attacks and nearly 35% under black-box attacks. Our code is available at https://github.com/zhouzhengqd/ROME.


Poster
#E-2111
You Always Recognize Me (YARM): Robust Texture Synthesis Against Multi-View Corruption

Weihang Ran · Wei Yuan · Yinqiang Zheng

Damage to imaging systems and complex external environments often introduce corruption, which can impair the performance of deep learning models pretrained on high-quality image data. Previous methods have focused on restoring degraded images or fine-tuning models to adapt to out-of-distribution data. However, these approaches struggle with complex, unknown corruptions and often reduce model accuracy on high-quality data. Inspired by the use of warning colors and camouflage in the real world, we propose designing a robust appearance that can enhance model recognition of low-quality image data. Furthermore, we demonstrate that certain universal features in radiance fields can be applied across objects of the same class with different geometries. We also examine the impact of different proxy models on the transferability of robust appearances. Extensive experiments demonstrate the effectiveness of our proposed method, which outperforms existing image restoration and model fine-tuning approaches across different experimental settings, and retains effectiveness when transferred to models with different architectures. Code will be available at https://github.com/SilverRAN/YARM.


Poster
#E-2112
Geometric Median (GM) Matching for Robust k-Subset Selection from Noisy Data

Anish Acharya · Sujay Sanghavi · Alex Dimakis · Inderjit Dhillon

Data pruning -- the combinatorial task of selecting a small and representative subset from a large dataset -- is crucial for mitigating the enormous computational costs associated with training data-hungry modern deep learning models at scale. Since large-scale data collections are invariably noisy, developing data pruning strategies that remain robust even in the presence of corruption is critical in practice. Existing data pruning methods often fail under high corruption rates due to their reliance on empirical mean estimation, which is highly sensitive to outliers. In response, this work proposes Geometric Median (GM) Matching, a novel k-subset selection strategy that leverages the geometric median (GM), a robust estimator with an optimal breakdown point of 1/2, to enhance resilience against noisy data. Our method iteratively selects a $k$-subset such that the mean of the subset approximates the GM of the (potentially) noisy dataset, ensuring robustness even under arbitrary corruption. We provide theoretical guarantees, showing that GM Matching enjoys an improved $\mathcal{O}(1/k)$ convergence rate, outperforming the $\mathcal{O}(1/\sqrt{k})$ scaling of uniform sampling, even under arbitrary corruption. Extensive experiments across image classification and image generation tasks demonstrate that GM Matching consistently outperforms existing pruning approaches, particularly in high-corruption settings, making it a strong baseline for robust data pruning.
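A rough sketch of the two ingredients, assuming the standard Weiszfeld iteration for the GM and a herding-style greedy selection (the paper's actual algorithm and guarantees are more refined):

```python
# Illustrative sketch: Weiszfeld iteration for the geometric median, plus a
# greedy k-subset selection whose running mean tracks the GM.
import numpy as np

def geometric_median(X, iters=100, eps=1e-8):
    y = X.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(X - y, axis=1)
        w = 1.0 / np.maximum(d, eps)           # inverse-distance weights
        y = (w[:, None] * X).sum(axis=0) / w.sum()
    return y

def gm_matching(X, k):
    gm = geometric_median(X)
    chosen, running_sum = [], np.zeros(X.shape[1])
    for t in range(1, k + 1):
        # Pick the point whose inclusion moves the subset mean toward the GM.
        gaps = np.linalg.norm((running_sum + X) / t - gm, axis=1)
        i = int(np.argmin(gaps))
        chosen.append(i)
        running_sum += X[i]
    return chosen
```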


Poster
#E-2200
HaploVL: A Single-Transformer Baseline for Multi-Modal Understanding

Rui Yang · Lin Song · Yicheng Xiao · Runhui Huang · Yixiao Ge · Ying Shan · Hengshuang Zhao

Recent advancements in large language models (LLMs) have significantly propelled the development of large multi-modal models (LMMs), highlighting the potential for general and intelligent assistants. However, most LMMs model visual and textual modalities separately, leading to recent efforts to develop native LMMs using a single transformer. Despite the promise, these native models are resource-intensive and often exhibit performance gaps compared to their compositional counterparts. To alleviate this issue, we propose a simple yet efficient method to construct a baseline for the native and end-to-end large multi-modal model in a single transformer. First, we propose a new early-fusion LMM that can fuse multi-modal inputs in the early stage and respond to visual instructions in an auto-regressive manner. Second, we devise an efficient training recipe for the proposed model, which harnesses the prior knowledge of the pre-trained models, addressing both the performance limitations and the challenge of resource consumption. The proposed model demonstrates superior performance compared to other LMMs using one transformer and significantly narrows the performance gap with compositional LMMs.


Poster
#E-2201
MAS-GPT: Training LLMs to Build LLM-based Multi-Agent Systems

Rui Ye · shuo tang · Rui Ge · Yaxin Du · Zhenfei Yin · Siheng Chen · Jing Shao

LLM-based multi-agent systems (MAS) have shown significant potential in tackling diverse tasks. However, to design effective MAS, existing approaches heavily rely on manual configurations or multiple calls of advanced LLMs, resulting in inadaptability and high inference costs. In this paper, we simplify the process of building an MAS by reframing it as a generative language task, where the input is a user query and the output is a corresponding MAS. To address this novel task, we unify the representation of MAS as executable code and propose a consistency-oriented data construction pipeline to create a high-quality dataset comprising coherent and consistent query-MAS pairs. Using this dataset, we train MAS-GPT, an open-source medium-sized LLM that is capable of generating query-adaptive MAS within a single LLM inference. The generated MAS can be seamlessly applied to process user queries and deliver high-quality responses. Extensive experiments on 9 benchmarks and 5 LLMs show that the proposed MAS-GPT consistently outperforms 10+ baseline MAS methods in diverse settings, indicating MAS-GPT's high effectiveness, efficiency and strong generalization ability. The code is released at \url{https://github.com/rui-ye/MAS-GPT}.


Poster
#E-2202
Reward Modeling with Ordinal Feedback: Wisdom of the Crowd

Shang Liu · Yu Pan · Guanting Chen · Xiaocheng Li

The canonical setup of learning a reward model (RM) from human preferences with binary feedback discards potentially useful samples (such as "tied" between the two responses) and loses fine-grained information (such as "slightly better"). This paper proposes a framework for learning RMs under ordinal feedback, generalizing binary feedback to arbitrary granularity. We first identify a marginal unbiasedness condition, which generalizes the existing assumption of binary feedback. The condition is validated via the sociological concept called the "wisdom of the crowd". Under this condition, we develop a natural probability model and prove the benefits of fine-grained feedback in terms of reducing the Rademacher complexity, which may be of independent interest to another problem: the bias-variance trade-off in knowledge distillation. The framework also sheds light on designing guidelines for human annotators. Our numerical experiments validate that: (1) fine-grained feedback leads to better RM learning for both in- and out-of-distribution settings; (2) incorporating a certain proportion of tied samples boosts RM learning.
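One natural instantiation of ordinal feedback is to replace the hard 0/1 targets of the binary Bradley-Terry loss with graded ones; the label values below are illustrative, not the paper's prescribed scheme.

```python
# Sketch: the binary Bradley-Terry cross-entropy generalizes to ordinal
# feedback by using graded targets (tie = 0.5, "slightly better" = 0.75).
import torch
import torch.nn.functional as F

def ordinal_rm_loss(r_a, r_b, label):
    """r_a, r_b: reward scores; label in [0, 1] is P(A preferred over B)."""
    logits = r_a - r_b
    return F.binary_cross_entropy_with_logits(logits, label)

# Usage: a batch mixing a clear win, a tie, and a slight preference.
r_a = torch.tensor([2.0, 0.1, 1.0])
r_b = torch.tensor([0.5, 0.2, 0.8])
label = torch.tensor([1.0, 0.5, 0.75])
print(ordinal_rm_loss(r_a, r_b, label))
```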


Poster
#E-2203
The Berkeley Function Calling Leaderboard (BFCL): From Tool Use to Agentic Evaluation of Large Language Models

Shishir G. Patil · Huanzhi Mao · Fanjia Yan · Charlie Ji · Vishnu Suresh · Ion Stoica · Joseph E Gonzalez

Function calling, also called tool use, refers to an LLM's ability to invoke external functions, APIs, or user-defined tools in response to user queries—an essential capability for agentic LLM applications. Despite its prominence, there did not exist a standard benchmark to evaluate function calling abilities, for two reasons: the challenging nature of evaluating when a function call is valid, and the challenge of acquiring diverse, real-world functions. We present the Berkeley Function Calling Leaderboard (BFCL), a comprehensive benchmark designed to evaluate function calling capabilities in a wide range of real-world settings. The BFCL benchmark evaluates serial and parallel function calls, across various programming languages, using a novel Abstract Syntax Tree (AST) evaluation method that can easily scale to thousands of functions. We construct the benchmark using a combination of expert-curated and user-contributed functions and associated prompts. Finally, the BFCL benchmark evaluates the ability of models to abstain and reason in stateful, multi-step agentic settings. Evaluating a wide range of models, we observe that while state-of-the-art LLMs excel at single-turn calls, memory, dynamic decision-making, and long-horizon reasoning remain open challenges. Since its preview, BFCL has become the de facto standard for evaluating function calls, and can be accessed at gorilla.cs.berkeley.edu/leaderboard.html.
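The AST-based idea can be sketched in a few lines: parse the emitted call and compare its structure to the expected name and arguments instead of matching raw strings. This simplified check (the benchmark's accepted-value logic is richer) is only illustrative.

```python
# Sketch of AST-based function-call checking: structural comparison, robust
# to formatting differences such as quoting style or argument order.
import ast

def call_matches(generated: str, expected_name: str, expected_args: dict) -> bool:
    try:
        node = ast.parse(generated, mode="eval").body
    except SyntaxError:
        return False
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return False
    if node.func.id != expected_name:
        return False
    try:
        got = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    except ValueError:
        return False                      # non-literal argument value
    return got == expected_args

print(call_matches("get_weather(unit='C', city='Paris')",
                   "get_weather", {"city": "Paris", "unit": "C"}))  # True
```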


Poster
#E-2204
La RoSA: Enhancing LLM Efficiency via Layerwise Rotated Sparse Activation

Kai Liu · Bowen Xu · Shaoyu Wu · Xin Chen · Hao Zhou · Yongliang Tao · lulu hu

Activation sparsity can reduce the computational overhead and memory transfers during the forward pass of Large Language Model (LLM) inference. Existing methods face limitations, either demanding time-consuming recovery training that hinders real-world adoption, or relying on empirical magnitude-based pruning, which causes fluctuating sparsity and unstable inference speed-up. This paper introduces LaRoSA (Layerwise Rotated Sparse Activation), a novel method for activation sparsification designed to improve LLM efficiency without requiring additional training or magnitude-based pruning. We leverage layerwise orthogonal rotations to transform input activations into rotated forms that are more suitable for sparsification. By employing a Top-K selection approach within the rotated activations, we achieve consistent model-level sparsity and reliable wall-clock time speed-up. LaRoSA is effective across various sizes and types of LLMs, demonstrating minimal performance degradation and robust inference acceleration. Specifically, for LLaMA2-7B at 40\% sparsity, LaRoSA achieves a mere 0.17 perplexity gap with a consistent 1.30× wall-clock time speed-up, and reduces the accuracy gap in zero-shot tasks compared to the dense model to just 0.54\%, while surpassing TEAL by 1.77\% and CATS by 17.14\%.
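A minimal sketch of the rotate/Top-K/rotate-back pattern described above, using a random orthogonal matrix where LaRoSA uses carefully chosen layerwise rotations:

```python
# Sketch: rotate activations into another basis, keep the K largest
# magnitudes per vector, and rotate back. Random Q is for illustration only.
import torch

def rotated_topk_sparsify(x, Q, k):
    """x: (..., d) activations; Q: (d, d) orthogonal; keep top-k per vector."""
    z = x @ Q                                   # rotate
    idx = z.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(z).scatter_(-1, idx, 1.0)
    return (z * mask) @ Q.T                     # prune, then rotate back

d = 64
Q, _ = torch.linalg.qr(torch.randn(d, d))       # random orthogonal matrix
x = torch.randn(2, d)
y = rotated_topk_sparsify(x, Q, k=int(0.6 * d)) # fixed, consistent sparsity
```

Because exactly `d - k` coordinates are zeroed for every vector, the sparsity level is deterministic, which is what yields the stable wall-clock speed-ups the abstract emphasizes.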


Poster
#E-2205
Data Mixing Optimization for Supervised Fine-Tuning of Large Language Models

Yuan Li · Zhengzhong Liu · Eric Xing

Optimizing data mixtures for supervised fine-tuning (SFT) of large language models (LLMs) is critical for developing general-purpose models, yet this area remains underexplored. In this paper, we frame data mixing as an optimization problem and introduce a novel method designed to minimize validation loss. Our approach parametrizes the loss by modeling effective data transferred and leveraging scaling laws for fine-tuning. By experimenting with various small-scale data mixtures, we fit these parameters and derive the optimal weights. We provide both mathematical proofs and empirical results demonstrating that our algorithm achieves excellent overall and individual performance across all domains. Through controlled experiments, we show that models trained with our optimized weights perform on par with those using optimal weights determined via grid search, with per-domain loss only 0.66\% higher than the best domain loss from grid search on average. Additionally, we show that reweighting popular SFT datasets using our method improves both validation loss and downstream performance. Finally, we discuss how our method can generalize to guide data selection for domain-specific models and provide insights into SFT.


Poster
#E-2206
Gumiho: A Hybrid Architecture to Prioritize Early Tokens in Speculative Decoding

Jinze Li · Yixing Xu · Haiduo Huang · Xuanwu Yin · Dong Li · Edith Ngai · Emad Barsoum

Speculative decoding (SPD) aims to accelerate the auto-regressive token generation process of a target Large Language Model (LLM). Some approaches employ a draft model with multiple heads to predict a sequence of future tokens, where each head handles a token in the sequence. The target LLM verifies the predicted sequence and accepts aligned tokens, enabling efficient multi-token generation. However, existing methods assume that all tokens within a sequence are equally important, employing identical head structures and relying on a single generation paradigm, either serial or parallel. In contrast, we theoretically demonstrate that initial tokens in the draft sequence are more important than later ones. Building on this insight, we propose Gumiho, a hybrid model combining serial and parallel heads. Specifically, given the critical importance of early tokens, we employ a sophisticated Transformer architecture for the early draft heads in a serial configuration to improve accuracy. For later tokens, we utilize multiple lightweight MLP heads operating in parallel to enhance efficiency. By allocating more advanced model structures and longer running times to the early heads, Gumiho achieves improved overall performance. The experimental results demonstrate that our method outperforms existing approaches, fully validating its effectiveness. Our code is available at https://github.com/AMD-AIG-AIMA/Gumiho.


Poster
#E-2207
unMORE: Unsupervised Multi-Object Segmentation via Center-Boundary Reasoning

Yafei YANG · Zihui Zhang · Bo Yang

We study the challenging problem of unsupervised multi-object segmentation on single images. Existing methods, which rely on image reconstruction objectives to learn objectness or leverage pretrained image features to group similar pixels, often succeed only in segmenting simple synthetic objects or discovering a limited number of real-world objects. In this paper, we introduce unMORE, a novel two-stage pipeline designed to identify many complex objects in real-world images. The key to our approach involves explicitly learning three levels of carefully defined object-centric representations in the first stage. Subsequently, our multi-object reasoning module utilizes these learned object priors to discover multiple objects in the second stage. Notably, this reasoning module is entirely network-free and does not require human labels. Extensive experiments demonstrate that unMORE significantly outperforms all existing unsupervised methods across 6 real-world benchmark datasets, including the challenging COCO dataset, achieving state-of-the-art object segmentation results. Remarkably, our method excels in crowded images where all baselines collapse. Our code and data are available at https://github.com/vLAR-group/unMORE.


Poster
#E-2208
Controllable Data Generation with Hierarchical Neural Representations

Sheyang Tang · xiaoyu xu · Jiayan Qiu · Zhou Wang

Implicit Neural Representations (INRs) represent data as continuous functions using the parameters of a neural network, where data information is encoded in the parameter space. Therefore, modeling the distribution of such parameters is crucial for building generalizable INRs. Existing approaches learn a joint distribution of these parameters via a latent vector to generate new data, but such a flat latent often fails to capture the inherent hierarchical structure of the parameter space, leading to entangled data semantics and limited control over the generation process. Here, we propose a Controllable Hierarchical Implicit Neural Representation (CHINR) framework, which explicitly models conditional dependencies across layers in the parameter space. Our method consists of two stages: In Stage-1, we construct a Layers-of-Experts (LoE) network, where each layer modulates distinct semantics through a unique latent vector, enabling disentangled and expressive representations. In Stage-2, we introduce a Hierarchical Conditional Diffusion Model (HCDM) to capture conditional dependencies across layers, allowing for controllable and hierarchical data generation at various semantic granularities. Extensive experiments across different modalities demonstrate that CHINR improves generalizability and offers flexible hierarchical control over the generated content.


Spotlight Poster
#E-2209
ActionPiece: Contextually Tokenizing Action Sequences for Generative Recommendation

Yupeng Hou · Jianmo Ni · Zhankui He · Noveen Sachdeva · Wang-Cheng Kang · Ed Chi · Julian McAuley · Derek Cheng

Generative recommendation (GR) is an emerging paradigm where user actions are tokenized into discrete token patterns and autoregressively generated as predictions. However, existing GR models tokenize each action independently, assigning the same fixed tokens to identical actions across all sequences without considering contextual relationships. This lack of context-awareness can lead to suboptimal performance, as the same action may hold different meanings depending on its surrounding context. To address this issue, we propose ActionPiece to explicitly incorporate context when tokenizing action sequences. In ActionPiece, each action is represented as a set of item features. Given the action sequence corpora, we construct the vocabulary by merging feature patterns as new tokens, based on their co-occurrence frequency both within individual sets and across adjacent sets. Considering the unordered nature of feature sets, we further introduce set permutation regularization, which produces multiple segmentations of action sequences with the same semantics. Our code is available at: https://github.com/google-deepmind/action_piece.


Poster
#E-2210
Reaction Graph: Towards Reaction-Level Modeling for Chemical Reactions with 3D Structures

Yingzhao Jian · Yue Zhang · Ying Wei · Hehe Fan · Yi Yang

Accurately modeling chemical reactions using Artificial Intelligence (AI) can accelerate discovery and development, especially in fields like drug design and material science. Although AI has made remarkable advancements in single molecule recognition, such as predicting molecular properties, the study of interactions between molecules, particularly chemical reactions, has been relatively overlooked. In this paper, we introduce Reaction Graph (RG), a unified graph representation that encapsulates the 3D molecular structures within chemical reactions. RG integrates the molecular graphs of reactants and products into a cohesive framework, effectively capturing the interatomic relationships pertinent to the reaction process. Additionally, it incorporates the 3D structure information of molecules in a simple yet effective manner. We conduct experiments on a range of tasks, including chemical reaction classification, condition prediction, and yield prediction. RG achieves the highest accuracy across six datasets, demonstrating its effectiveness. The code is available at https://github.com/Shadow-Dream/Reaction-Graph.


Poster
#E-2211
TtBA: Two-third Bridge Approach for Decision-Based Adversarial Attack

Feiyang Wang · Xingquan Zuo · Hai Huang · Gang Chen

A key challenge in black-box adversarial attacks is the high query complexity in hard-label settings, where only the top-1 predicted label from the target deep model is accessible. In this paper, we propose a novel normal-vector-based method called Two-third Bridge Attack (TtBA). An innovative bridge direction is introduced, defined as a weighted combination of the current unit perturbation direction and its unit normal vector, controlled by a weight parameter $k$. We further use binary search to identify $k=k_\text{bridge}$, whose decision boundary is identical to that of the current direction. Notably, we observe that $k=2/3 k_\text{bridge}$ yields a near-optimal perturbation direction, ensuring the stealthiness of the attack. In addition, we investigate the critical importance of local optima during the perturbation direction optimization process and propose a simple and effective approach to detect and escape such local optima. Experimental results on MNIST, FASHION-MNIST, CIFAR10, CIFAR100, and ImageNet datasets demonstrate the strong performance and scalability of our approach. Compared to state-of-the-art non-targeted and targeted attack methods, TtBA consistently delivers superior performance across most experimented datasets and deep learning models. Code is available at https://anonymous.4open.science/r/TtBA-6ECF.


Poster
#E-2212
DiffAdvMAP: Flexible Diffusion-Based Framework for Generating Natural Unrestricted Adversarial Examples

Zhengzhao Pan · Hua Chen · Xiaogang Zhang

Unrestricted adversarial examples (UAEs) have posed greater threats to deep neural networks (DNNs) than perturbation-based adversarial examples (AEs) because they can make extensive changes to images without being restricted to a fixed-norm perturbation budget. Although current diffusion-based methods can generate more natural UAEs than other unrestricted attack methods, the overall effectiveness of such methods is restricted since they are designed for specific attack conditions. Additionally, the naturalness of UAEs still has room for improvement, as these methods primarily focus on leveraging diffusion models as strong priors to enhance the generation process. This paper proposes a flexible framework named Diffusion-based Adversarial Maximum a Posterior (DiffAdvMAP) to generate more natural UAEs for various scenarios. DiffAdvMAP approaches the generation of UAEs by sampling images from posterior distributions, which is achieved by approximating the posterior distribution of UAEs using the prior distribution of real data learned by the diffusion model. This process enhances the naturalness of the UAEs. By incorporating an adversarial constraint to ensure the effectiveness of the attack, DiffAdvMAP exhibits excellent attack ability and defense robustness. A reconstruction constraint is designed to enhance its flexibility, allowing DiffAdvMAP to be tailored to various attack scenarios. Experimental results on ImageNet show that we achieve a better trade-off between image quality, flexibility, and transferability than baseline unrestricted adversarial attack methods.


Poster
#E-2300
HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking

Runquan Gui · Zhihai Wang · Jie Wang · Chi Ma · Huiling Zhen · Mingxuan Yuan · Jianye Hao · Defu Lian · Enhong Chen · Feng Wu

Recent advancements have significantly enhanced the performance of large language models (LLMs) in tackling complex reasoning tasks, achieving notable success in domains like mathematical and logical reasoning. However, these methods encounter challenges with complex planning tasks, primarily due to extended reasoning steps, diverse constraints, and the challenge of handling multiple distinct sub-tasks. To address these challenges, we propose HyperTree Planning (HTP), a novel reasoning paradigm that constructs hypertree-structured planning outlines for effective planning. The hypertree structure enables LLMs to engage in hierarchical thinking by flexibly employing the divide-and-conquer strategy, effectively breaking down intricate reasoning steps, accommodating diverse constraints, and managing multiple distinct sub-tasks in a well-organized manner. We further introduce an autonomous planning framework that completes the planning process by iteratively refining and expanding the hypertree-structured planning outlines. Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, resulting in a 3.6$\times$ performance improvement over o1-preview.


Spotlight Poster
#E-2301
Everything Everywhere All at Once: LLMs can In-Context Learn Multiple Tasks in Superposition

Zheyang Xiong · Jack Cai · John Cooper · Albert Ge · Vasilis Papageorgiou · Zack Sifakis · Angeliki Giannou · Ziqian Lin · Liu Yang · Saurabh Agarwal · Grigorios Chrysos · Samet Oymak · Kangwook Lee · Dimitris Papailiopoulos

Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computationally distinct ICL tasks simultaneously, during a single inference call, a capability we term "task superposition". We provide empirical evidence of this phenomenon across various LLM families and scales and show that this phenomenon emerges even if we train the model to in-context learn one task at a time. We offer theoretical explanations that this capability is well within the expressive power of transformers. We also explore how LLMs internally compose task vectors during superposition. Furthermore, we show that larger models can solve more ICL tasks in parallel, and better calibrate their output distribution. Our findings offer insights into the latent capabilities of LLMs, further substantiate the perspective of "LLMs as superposition of simulators", and raise questions about the mechanisms enabling simultaneous task execution.


Poster
#E-2302
Earley-Driven Dynamic Pruning for Efficient Structured Decoding

Xintong Sun · Chi Wei · Minghao Tian · Shiwen Ni

Large Language Models (LLMs) have shown remarkable capabilities, yet ensuring their outputs conform to strict structural or grammatical constraints remains challenging, which is critical in function calls and domain-specific language (DSL) generation. Constrained decoding with context-free grammar is a flexible approach to guarantee LLMs' adherence to a specific format by dynamically building a token logits mask. However, creating this mask requires checking the validity of all tokens in the LLM vocabulary at every decoding step, which often incurs significant overheads in existing constrained decoding engines. To address this challenge, we propose $\textbf{ZapFormat}$, a novel $\textbf{dynamic pruning}$ strategy based on the Earley algorithm that identifies and eliminates invalid or redundant Earley states in real-time, significantly reducing memory occupation of the Earley algorithm's states. This further enables us to use a state cache to speed up structured generations on a large number of queries. We implemented ZapFormat in a new constrained decoding engine called Formatron which also incorporates existing optimizations. Through comprehensive experiments on structured generation tasks, including JSON generation, JSON Schema validation, and semantic parsing, we demonstrate that Formatron not only $\textbf{consistently maintains}$ high-precision compliant outputs but also achieves $\textbf{significant improvements}$ in inference speed up to 2x compared to state-of-the-art implementations. More importantly, Formatron is generally applicable across various LLM architectures. We release Formatron as open source at https://github.com/Dan-wanna-M/formatron.


Poster
#E-2303
Imagine While Reasoning in Space: Multimodal Visualization-of-Thought

Chengzu Li · Wenshan Wu · Huanyu Zhang · Yan Xia · Shaoguang Mao · Li Dong · Ivan Vulić · Furu Wei

Chain-of-Thought (CoT) prompting has proven highly effective for enhancing complex reasoning in Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs). Yet, it struggles in complex spatial reasoning tasks. Nonetheless, human cognition extends beyond language alone, enabling the remarkable capability to think in both words and images. Inspired by this mechanism, we propose a new reasoning paradigm, Multimodal Visualization-of-Thought (MVoT). It enables visual thinking in MLLMs by generating image visualizations of their reasoning traces. To ensure high-quality visualization, we introduce token discrepancy loss into autoregressive MLLMs. This innovation significantly improves both visual coherence and fidelity. We validate this approach through several dynamic spatial reasoning tasks. Experimental results reveal that MVoT demonstrates competitive performance across tasks. Moreover, it exhibits robust and reliable improvements in the most challenging scenarios where CoT fails. Ultimately, MVoT establishes new possibilities for complex reasoning tasks where visual thinking can effectively complement verbal reasoning.


Poster
#E-2304
PENCIL: Long Thoughts with Short Memory

Chenxiao Yang · Nati Srebro · David McAllester · Zhiyuan Li

While state-of-the-art LLMs have demonstrated great promise of using long Chains-of-Thought (CoT) to boost reasoning, scaling it up to more challenging problems is fundamentally limited by suboptimal memory usage — intermediate computations accumulate indefinitely in context even when they are no longer needed for future thoughts. We introduce PENCIL, which incorporates a novel reduction mechanism into the autoregressive generation process that recursively cleans up intermediate thoughts based on patterns learned from training. By alternately generating and erasing, PENCIL can think deeper to solve harder problems using a shorter context and less compute. Empirically, for example, we demonstrate that PENCIL, with a small 25M-parameter transformer and a 2048-token context length, solves Einstein's puzzle — a task that challenges much larger models like GPT-4. Theoretically, we prove PENCIL can perform universal efficient computation by simulating any Turing machine with optimal time and space complexity, and thus can solve arbitrary computable tasks that are otherwise intractable for vanilla CoT.


Poster
#E-2305
Mask-Enhanced Autoregressive Prediction: Pay Less Attention to Learn More

Xialie Zhuang · Zhikai Jia · Jianjin Li · Zhenyu Zhang · Li Shen · Zheng Cao · Shiwei Liu

Large Language Models (LLMs) have been found to struggle with accurately retrieving key information. To address this, we propose Mask-Enhanced Autoregressive Prediction (MEAP), a simple yet effective training paradigm that seamlessly integrates Masked Language Modeling (MLM) into Next-Token Prediction (NTP) to enhance the latter's in-context retrieval capabilities. Specifically, MEAP first randomly masks a small fraction of input tokens and then directly performs standard autoregressive next-token prediction using a decoder-only Transformer. MEAP eliminates the need for bidirectional attention or encoder-decoder architectures for MLM, incurring no additional computational overhead during pre-training or inference. Extensive experiments demonstrate that MEAP substantially outperforms NTP on key information retrieval and long-context reasoning tasks, while performing on par or better on commonsense reasoning tasks. The benefits of MEAP also extend to supervised fine-tuning, where it shows remarkable advantages in lost-in-the-middle scenarios, outperforming NTP by 11.77 percentage points. Our analysis indicates that MEAP's effectiveness arises from its ability to promote more distinguishable attention scores by concentrating on a reduced set of non-masked tokens. This mechanism improves the model's focus on task-relevant signals while mitigating the influence of peripheral context. These findings position MEAP as a promising training paradigm for large language models. Code has been submitted.
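The training recipe is simple enough to sketch in a few lines; the mask ratio and mask token below are illustrative choices, not the paper's exact settings.

```python
# Sketch of the MEAP input pipeline: randomly mask a small fraction of input
# tokens, then train with the usual next-token objective on the originals.
import torch

def meap_inputs(input_ids, mask_token_id, mask_ratio=0.1):
    """Return masked inputs and standard NTP labels (shift happens elsewhere)."""
    masked = input_ids.clone()
    mask = torch.rand_like(input_ids, dtype=torch.float) < mask_ratio
    masked[mask] = mask_token_id
    labels = input_ids.clone()     # targets remain the original tokens
    return masked, labels
```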


Poster
#E-2306
TruthFlow: Truthful LLM Generation via Representation Flow Correction

Hanyu Wang · Bochuan Cao · Yuanpu Cao · Jinghui Chen

Large language models (LLMs) are known to struggle with consistently generating truthful responses. While various representation intervention techniques have been proposed, these methods typically apply a universal representation correction vector to all input queries, limiting their effectiveness against diverse queries in practice. In this study, we introduce TruthFlow, a novel method that leverages the Flow Matching technique for query-specific truthful representation correction. Specifically, TruthFlow first uses a flow model to learn query-specific correction vectors that transition representations from hallucinated to truthful states. Then, during inference, the trained flow model generates these correction vectors to enhance the truthfulness of LLM outputs. Experimental results demonstrate that TruthFlow significantly improves performance on open-ended generation tasks across various advanced LLMs evaluated on TruthfulQA. Moreover, the trained TruthFlow model exhibits strong transferability, performing effectively on other unseen hallucination benchmarks.


Poster
#E-2307
MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment

Tianze Wang · Dongnan Gui · Yifan Hu · Shuhang Lin · Linjun Zhang

Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.
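Log-linear combination has a direct decoding-time reading: a weighted sum of per-model log-probabilities gives the aggregated policy's next-token scores. A minimal sketch (the batch stochastic mirror descent that fits the weights is the paper's contribution and is omitted here):

```python
# Sketch: mixing single-objective policies log-linearly at decoding time.
# sum_i w_i * log p_i(token)  is the log of  prod_i p_i(token)^{w_i}.
import torch

def mixed_next_token_logits(per_model_logits, weights):
    """per_model_logits: list of (vocab,) tensors; weights sum to 1."""
    logps = [torch.log_softmax(l, dim=-1) for l in per_model_logits]
    return sum(w * lp for w, lp in zip(weights, logps))

logits_helpful, logits_safe = torch.randn(32000), torch.randn(32000)
mixed = mixed_next_token_logits([logits_helpful, logits_safe], [0.7, 0.3])
```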


Poster
#E-2308
GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation

Jiashu HE · Mingyu Ma · Jinxuan Fan · Dan Roth · Wei Wang · Alejandro Ribeiro

Existing approaches based on context prompting or reinforcement learning (RL) to improve the reasoning capacities of large language models (LLMs) depend on the LLMs' internal knowledge to produce reliable Chain-Of-Thought (CoT). However, no matter the size of LLMs, certain problems cannot be resolved in a single forward pass. Meanwhile, agent-based reasoning systems require access to a comprehensive nonparametric knowledge base, which is often costly or not feasible for use in scientific and niche domains. We present Graph Inspired Veracity Extrapolation (GIVE), a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input. GIVE guides the LLM agent to select the most pertinent expert data ($\textbf{observe}$), engage in query-specific associative thinking ($\textbf{reflect}$), and then synthesize this information to produce the final output ($\textbf{speak}$). Extensive experiments demonstrated the following benefits of our framework: (1) GIVE increases the performance of LLMs across various sizes. (2) In some scenarios, GIVE allows smaller LLMs to surpass larger, more sophisticated ones in scientific tasks ($\textbf{GPT3.5T + GIVE > GPT4}$). (3) GIVE is effective on scientific and open-domain assessments. (4) GIVE is a training-free method that enables LLMs to tackle new problems that extend beyond their training data (up to $\textbf{43.5}$\% $\rightarrow$ $\textbf{88.2}$\% accuracy improvement). (5) GIVE allows LLM agents to reason using both restricted (very small) and noisy (very large) knowledge sources, accommodating knowledge graphs (KG) ranging from $\textbf{135}$ to more than $\textbf{840k}$ nodes. (6) The reasoning process involved in GIVE is fully interpretable. Our code is available at https://github.com/Jason-Tree/GIVE


Poster
#E-2309
Memory Layers at Scale

Vincent-Pierre Berges · Barlas Oğuz · Daniel HAZIZA · Scott Yih · Luke Zettlemoyer · Gargi Ghosh

Memory layers use a trainable key-value lookup mechanism to add extra parameters to a model without increasing FLOPs. Conceptually, sparsely activated memory layers complement compute-heavy dense feed-forward layers, providing dedicated capacity to store and retrieve information cheaply. This work takes memory layers beyond proof-of-concept, proving their utility at contemporary scale. On downstream tasks, language models augmented with our improved memory layer outperform dense models with more than twice the computation budget, as well as mixture-of-expert models when matched for both compute and parameters. We find gains are especially pronounced for factual tasks. We provide a fully parallelizable memory layer implementation, demonstrating scaling laws with up to 128B memory parameters, pretrained to 1 trillion tokens, comparing to base models with up to 8B parameters.
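The basic mechanism can be sketched as a top-k lookup over trainable key and value tables; the product-key factorization and sharding used at scale are omitted from this toy version.

```python
# Sketch of a trainable key-value memory layer: a query selects its top-k
# trainable keys, and the output is a softmax-weighted sum of their values.
import torch
import torch.nn as nn

class MemoryLayer(nn.Module):
    def __init__(self, n_slots=4096, d_model=256, topk=32):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.values = nn.Parameter(torch.randn(n_slots, d_model) * 0.02)
        self.topk = topk

    def forward(self, x):                      # x: (batch, d_model)
        scores = x @ self.keys.T               # (batch, n_slots)
        top, idx = scores.topk(self.topk, dim=-1)
        gates = torch.softmax(top, dim=-1)     # (batch, topk)
        return (gates.unsqueeze(-1) * self.values[idx]).sum(dim=1)

y = MemoryLayer()(torch.randn(4, 256))        # -> (4, 256)
```

Only `topk` of the `n_slots` value rows participate per query, which is why parameters can grow without a matching increase in FLOPs.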


Poster
#E-2310
Model Swarms: Collaborative Search to Adapt LLM Experts via Swarm Intelligence

Shangbin Feng · Zifeng Wang · Yike Wang · Sayna Ebrahimi · Hamid Palangi · Lesly Miculicich · Achin Kulshrestha · Nathalie Rauschmayr · Yejin Choi · Yulia Tsvetkov · Chen-Yu Lee · Tomas Pfister

We propose Model Swarms, a collaborative search algorithm to adapt LLMs via swarm intelligence, the collective behavior guiding individual systems. Specifically, Model Swarms starts with a pool of LLM experts and a utility function. Guided by the best-found checkpoints across models, diverse LLM experts collaboratively move in the weight space and optimize a utility function representing model adaptation objectives. Compared to existing model composition approaches, Model Swarms offers tuning-free model adaptation, works in low-data regimes with as few as 200 examples, and does not require assumptions about specific experts in the swarm or how they should be composed. Extensive experiments demonstrate that Model Swarms could flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests, improving over 12 model composition baselines by up to 21.0% across tasks and contexts. Further analysis reveals that LLM experts discover previously unseen capabilities in initial checkpoints and that Model Swarms enable the weak-to-strong transition of experts through the collaborative search process.
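The weight-space search is in the spirit of particle swarm optimization; a loose sketch of one update step over flattened expert weights (coefficients are placeholders, not the paper's settings):

```python
# Sketch: swarm-style search in weight space. Each expert is a particle whose
# flattened weights drift toward its personal best and the global best under
# the utility function.
import numpy as np

def swarm_step(weights, velocity, personal_best, global_best,
               inertia=0.5, c1=0.3, c2=0.3, rng=np.random.default_rng()):
    r1, r2 = rng.random(), rng.random()
    velocity = (inertia * velocity
                + c1 * r1 * (personal_best - weights)
                + c2 * r2 * (global_best - weights))
    return weights + velocity, velocity
```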


Poster
#E-2311
Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback

Yafu Li · Xuyang Hu · Xiaoye Qu · Linjie Li · Yu Cheng

Large language models (LLMs) have demonstrated impressive performance but often lack the flexibility to adapt to human preferences quickly without retraining. Inspired by the recent efforts on test-time scaling, we make the first attempt to propose Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, eliminating the need to update model parameters. Instead of relying on purely numerical rewards, TPO translates reward signals into \emph{textual} critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass its aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth of the inference process. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly.


Poster
#E-2312
GraphGPT: Generative Pre-trained Graph Eulerian Transformer

Qifang Zhao · Weidong Ren · Tianyu Li · Hong Liu · Xingsheng He · Xiaoxiao Xu

We introduce GraphGPT, a novel self-supervised generative pre-trained model for graph learning based on the Graph Eulerian Transformer (GET). First, we propose GET, which combines a standard transformer encoder or decoder architecture with an innovative graph-to-sequence transformation method. This method converts graphs or sampled subgraphs into sequences of tokens representing nodes, edges, and attributes in a reversible manner using Eulerian paths. We pre-train GET using either of two self-supervised tasks: next-token prediction (NTP) and scheduled masked-token prediction (SMTP). The pre-trained model is then fine-tuned for downstream tasks such as graph-, edge-, and node-level prediction. Despite its simplicity, GraphGPT achieves performance comparable to or surpassing state-of-the-art methods on multiple large-scale Open Graph Benchmark (OGB) datasets. It demonstrates exceptional results on the molecular property prediction dataset PCQM4Mv2 and the protein-protein interaction dataset ogbl-ppa. Notably, generative pre-training enables scaling GraphGPT to 2 billion parameters while maintaining performance gains — a breakthrough that overcomes the scalability limitations of traditional Graph Neural Networks (GNNs) and prior graph transformers (GTs). To advance research in graph foundation models and facilitate scientific discovery in chemistry, materials science, and related fields, we have released the source code (https://github.com/alibaba/graph-gpt) and model checkpoints (https://www.modelscope.cn/organization/Alibaba-DT).
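The graph-to-sequence step rests on Eulerian paths; a minimal sketch using Hierholzer's algorithm, assuming the graph is connected and an Eulerian path exists (GET's handling of attributes and general graphs is not reproduced):

```python
# Sketch of a reversible graph-to-sequence step: emit node tokens along an
# Eulerian path (iterative Hierholzer). Consecutive pairs in the output
# correspond to edges, so the edge multiset is recoverable from the sequence.
from collections import defaultdict

def eulerian_path(edges):
    adj = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
        adj[v].append((u, i))
    # Start at an odd-degree node if one exists, else anywhere.
    start = next((u for u in adj if len(adj[u]) % 2 == 1), next(iter(adj)))
    used, stack, path = set(), [start], []
    while stack:
        u = stack[-1]
        while adj[u] and adj[u][-1][1] in used:   # drop already-used edges
            adj[u].pop()
        if adj[u]:
            v, i = adj[u].pop()
            used.add(i)
            stack.append(v)
        else:
            path.append(stack.pop())
    return path[::-1]   # node token sequence

print(eulerian_path([(0, 1), (1, 2), (2, 0), (0, 3)]))  # e.g. [0, 2, 1, 0, 3]
```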


Poster
#E-2400
SpeCache: Speculative Key-Value Caching for Efficient Generation of LLMs

Shibo Jie · Yehui Tang · Kai Han · Zhi-Hong Deng · Jing Han

Transformer-based large language models (LLMs) have already achieved remarkable results on long-text tasks, but limited GPU memory (VRAM) struggles to accommodate the linearly growing demand for the key-value (KV) cache as the sequence length increases, which has become a bottleneck for the application of LLMs on long sequences. Existing KV cache compression methods include eviction, merging, or quantization of the KV cache to reduce its size. However, compression results in irreversible information forgetting, potentially affecting the accuracy of subsequent decoding. In this paper, we propose SpeCache, which takes full advantage of large and easily expandable CPU memory to offload the complete KV cache, and dynamically fetches KV pairs back at each decoding step based on their importance, measured by a low-precision copy of the KV cache kept in VRAM. To avoid the inference latency caused by CPU-GPU communication, SpeCache speculatively predicts the KV pairs that the next token might attend to, allowing us to prefetch them before the next decoding step, which enables parallelization of prefetching and computation. Experiments on LongBench and Needle-in-a-Haystack benchmarks verify that SpeCache effectively reduces VRAM usage while avoiding information forgetting for long sequences without re-training, even at a 10x KV cache compression ratio.


Poster
#E-2401
Learning to Plan & Reason for Evaluation with Thinking-LLM-as-a-Judge

Swarnadeep Saha · Xian Li · Marjan Ghazvininejad · JASON WESTON · Tianlu Wang

LLM-as-a-Judge models generate chain-of-thought (CoT) sequences intended to capture the step-by-step reasoning process that underlies the final evaluation of a response. However, due to the lack of human-annotated CoTs for evaluation, the required components and structure of effective reasoning traces remain understudied. Consequently, previous approaches often (1) constrain reasoning traces to hand-designed components, such as a list of criteria, reference answers, or verification questions, and (2) structure them such that planning is intertwined with the reasoning for evaluation. In this work, we propose EvalPlanner, a preference optimization algorithm for Thinking-LLM-as-a-Judge that first generates an unconstrained evaluation plan, followed by its execution, and then the final judgment. In a self-training loop, EvalPlanner iteratively optimizes over synthetically constructed evaluation plans and executions, leading to better final verdicts. Our method achieves new state-of-the-art performance for generative reward models on RewardBench and PPE, despite being trained on a smaller amount of, and synthetically generated, preference pairs. Additional experiments on other benchmarks like RM-Bench, JudgeBench, and FollowBenchEval further highlight the utility of both planning and reasoning for building robust LLM-as-a-Judge reasoning models.


Poster
#E-2402
Demystifying Long Chain-of-Thought Reasoning

Edward Yeo · Yuxuan Tong · Xinyao Niu · Graham Neubig · Xiang Yue

Scaling inference compute has become a key driver of advanced reasoning in large language models (LLMs). A proven approach for scaling inference compute is to generate long chains-of-thought (CoTs), enabling models to engage in structured reasoning strategies such as backtracking and error correction. Reinforcement learning (RL) has emerged as a crucial method for developing these capabilities, yet the conditions under which long CoTs emerge remain unclear, and RL training requires careful design choices. In this study, we systematically investigate the underlying mechanics of long CoT reasoning—examining the factors that enable models to generate extended reasoning trajectories. Through extensive supervised fine-tuning (SFT) and RL experiments, we identify three key findings: 1) while SFT is not strictly necessary, it significantly simplifies training and improves efficiency; 2) reasoning capabilities tend to emerge with increased training compute but are not guaranteed, making reward shaping essential for stabilizing CoT length growth; and 3) scaling verifiable reward signals is critical for RL, and we find that leveraging noisy, web-extracted solutions with filtering mechanisms shows promising potential, particularly in out-of-distribution (OOD) reasoning tasks such as STEM problem-solving. These insights provide practical guidance for optimizing training strategies to enhance long CoT reasoning in LLMs.
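
On finding 2), a toy version of length-aware reward shaping is sketched below. The functional form and budget are hypothetical choices of ours, not the paper's scheme, but they show how a shaped reward can keep correctness primary while damping unbounded CoT growth during RL.

```python
# Hypothetical reward shaping (illustrative form, not the paper's):
# correctness dominates, but tokens beyond a length budget are penalized
# so CoT length growth stays stable during RL training.
def shaped_reward(correct: bool, cot_tokens: int,
                  budget: int = 4096, penalty: float = 1e-4) -> float:
    base = 1.0 if correct else -1.0
    overflow = max(0, cot_tokens - budget)
    return base - penalty * overflow

print(shaped_reward(True, 3000))   # 1.0: within budget
print(shaped_reward(True, 9000))   # 0.5096: correct but overly long
```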


Poster
#E-2403
Compositional Causal Reasoning Evaluation in Language Models

Jacqueline Maasch · Alihan Hüyük · Xinnuo Xu · Aditya Nori · Javier Gonzalez

Causal reasoning and compositional reasoning are two core aspirations in AI. Measuring the extent of these behaviors requires principled evaluation methods. We explore a unified perspective that considers both behaviors simultaneously, termed compositional causal reasoning (CCR): the ability to infer how causal measures compose and, equivalently, how causal quantities propagate through graphs. We instantiate a framework for the systematic evaluation of CCR for the average treatment effect and the probability of necessity and sufficiency. As proof of concept, we demonstrate CCR evaluation for language models in the Llama, Phi, and GPT families. On a math word problem, our framework revealed a range of taxonomically distinct error patterns. CCR errors increased with the complexity of causal paths for all models except o1.
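
For intuition on how causal quantities compose, consider a linear chain SCM, where the ATE of X on Z is the product of the edge-level ATEs. The toy check below is our own illustration of that textbook fact and is unrelated to the benchmark's actual tasks.

```python
# Worked toy example of causal-effect composition: in a linear chain
# X -> Y -> Z with no confounding, ATE(X -> Z) = ATE(X -> Y) * ATE(Y -> Z).
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.binomial(1, 0.5, n).astype(float)   # randomized treatment
y = 2.0 * x + rng.normal(size=n)            # ATE(X -> Y) = 2
z = -1.5 * y + rng.normal(size=n)           # ATE(Y -> Z) = -1.5

# since X is randomized, a difference of means estimates the ATE
ate_xz = z[x == 1].mean() - z[x == 0].mean()
print(round(ate_xz, 2))                     # ~ -3.0 = 2 * (-1.5)
```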


Poster
#E-2404
Policy Guided Tree Search for Enhanced LLM Reasoning

Yang Li

Despite their remarkable capabilities, large language models often struggle with tasks requiring complex reasoning and planning. While existing approaches like Chain-of-Thought prompting and tree search techniques show promise, they are limited by their reliance on predefined heuristics and computationally expensive exploration strategies. We propose Policy-Guided Tree Search (PGTS), a framework that combines reinforcement learning with structured tree exploration to efficiently navigate reasoning paths. Our key innovation is a learned policy that dynamically decides between expanding, branching, backtracking, or terminating exploration, eliminating the need for manual heuristics or exhaustive search. Experiments across mathematical reasoning, logical deduction, and planning benchmarks demonstrate that PGTS achieves superior reasoning performance while significantly reducing computational costs compared to existing methods. These results establish PGTS as a scalable and effective solution for tackling complex reasoning tasks with LLMs.
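
The control flow of such a search might be skeletonized as follows. The learned policy is stubbed with a random choice, and expand/branch are collapsed into a single child-creation step for brevity; all names here are illustrative, not the paper's code.

```python
# Skeleton of a policy-guided tree search loop (illustrative only; the
# learned policy is replaced by a random stub).
import random

ACTIONS = ["expand", "branch", "backtrack", "terminate"]

def policy(node):                       # stand-in for the learned policy
    return random.choice(ACTIONS)

def pgts(root, expand_fn, max_steps=8):
    path, node = [root], root
    for _ in range(max_steps):
        action = policy(node)
        if action == "terminate":
            break
        if action == "backtrack" and len(path) > 1:
            path.pop()
            node = path[-1]
        else:                           # expand / branch: create a child
            node = expand_fn(node)
            path.append(node)
    return path

print(pgts("problem", lambda s: s + " -> step")[-1])
```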


Poster
#E-2405
Do NOT Think That Much for 2+3=? On the Overthinking of Long Reasoning Models

Xingyu Chen · Jiahao Xu · Tian Liang · Zhiwei He · Jianhui Pang · Dian Yu · Linfeng Song · Qiuzhi Liu · Mengfei Zhou · Zhuosheng Zhang · Rui Wang · Zhaopeng Tu · Haitao Mi · Dong Yu

The remarkable performance of long reasoning models can be attributed to their ability to emulate human-like long-time thinking during inference. These models employ extended chain-of-thought (CoT) processes, exploring multiple strategies to enhance problem-solving capabilities. However, a critical question remains: how can computational resources be scaled intelligently and efficiently at test time? This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where long reasoning models generate redundant solutions that contribute minimally to accuracy and diversity, thereby wasting computational resources on simple problems for minimal benefit. We introduce novel efficiency metrics from both outcome and process perspectives to evaluate the rational use of computational resources by long reasoning models. Using a self-training paradigm, we propose strategies to mitigate overthinking, simplifying reasoning processes without compromising accuracy. Experimental results show that our approach successfully reduces computational overhead while preserving model performance across a range of test sets with varying difficulty levels, such as GSM8K, MATH500, GPQA, and AIME. Our code is open-source and available at https://github.com/galaxyChen/overthinking.


Poster
#E-2406
CoMemo: LVLMs Need Image Context with Image Memory

Shi Liu · Weijie Su · Xizhou Zhu · Wenhai Wang · Jifeng Dai

Recent advancements in Large Vision-Language Models built upon Large Language Models have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo, a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. The project page is available at https://lalbj.github.io/projects/CoMemo/.


Poster
#E-2407
Mixture of Hidden-Dimensions: Not All Hidden-States’ Dimensions are Needed in Transformer

Yilong Chen · Junyuan Shang · Zhenyu Zhang · Jiawei Sheng · Tingwen Liu · Shuohuan Wang · Yu Sun · Hua Wu · Haifeng Wang

Transformer models encounter inefficiency when scaling hidden dimensions due to the uniform expansion of parameters. When delving into the sparsity of hidden dimensions, we observe that only a small subset of dimensions are highly activated, where some dimensions are commonly activated across tokens, and others are uniquely activated for individual tokens. To leverage this, we propose MoHD (Mixture of Hidden Dimensions), a sparse architecture that combines shared sub-dimensions for common features and dynamically routes specialized sub-dimensions per token. To address the potential information loss from sparsity, we introduce activation scaling and group fusion mechanisms. MoHD efficiently expands hidden dimensions with minimal computational increases, outperforming vanilla Transformers in both parameter efficiency and task performance across 10 NLP tasks. MoHD achieves 1.7% higher performance with 50% fewer activated parameters, and 3.7% higher performance with 3× total parameter expansion at constant activated-parameter cost. MoHD offers a new perspective on model scaling, showcasing the potential of hidden-dimension sparsity.


Spotlight Poster
#E-2408
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety

Zihan Guan · Mengxuan Hu · Ronghang Zhu · Sheng Li · Anil Vullikanti

Recent studies have uncovered a troubling vulnerability in the fine-tuning stage of large language models (LLMs): even fine-tuning on entirely benign datasets can lead to a significant increase in the harmfulness of LLM outputs. Building on this finding, our red teaming study takes this threat one step further by developing a more effective attack. Specifically, we analyze and identify samples within benign datasets that contribute most to safety degradation, then fine-tune LLMs exclusively on these samples. We approach this problem from an outlier detection perspective and propose Self-Inf-N to detect and extract outliers for fine-tuning. Our findings reveal that fine-tuning LLMs on 100 outlier samples selected by Self-Inf-N from the benign datasets severely compromises LLM safety alignment. Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards. Code is available at https://github.com/GuanZihan/Benign-Samples-Matter.


Poster
#E-2409
AlphaDPO: Adaptive Reward Margin for Direct Preference Optimization

Junkang Wu · xue wang · Zhengyi Yang · Jiancan Wu · Jinyang Gao · Bolin Ding · Xiang Wang · Xiangnan He

Aligning large language models (LLMs) with human preferences requires balancing policy optimization with computational stability. While recent offline methods like DPO and SimPO bypass reinforcement learning's complexity, they face critical limitations: DPO relies on static reference models that degrade with policy updates, and SimPO assumes a uniform target reward margin that ignores instance-wise preference strength. We propose AlphaDPO, an adaptive preference optimization framework that dynamically reparameterizes the reference distribution to address these issues. Our key innovation lies in an implicit reference model, $\hat{\pi}_{\text{ref}} \propto U(y|x)(\pi_\theta/\pi_{\text{ref}})^\alpha$, which interpolates between policy-driven specialization and uniform exploration while enabling instance-adaptive reward margins. Theoretically, we prove AlphaDPO implicitly controls the sequential KL divergence between iterative policy updates, ensuring stability even with poorly calibrated reference models. Empirically, AlphaDPO achieves state-of-the-art performance on AlpacaEval 2 (58.7\% LC win rate) and Arena-Hard (35.7\% win rate) across Mistral2-7B, Llama3-8B, and Gemma2-9B, demonstrating robust alignment without multi-stage training. Our work establishes adaptive reference reparameterization as a principled mechanism for preference optimization.


Poster
#E-2410
GRU: Mitigating the Trade-off between Unlearning and Retention for LLMs

Yue Wang · Qizhou Wang · Feng Liu · Wei Huang · Yali Du · Xiaojiang Du · Bo Han

Large language model (LLM) unlearning has demonstrated its essential role in removing privacy- and copyright-related responses, crucial for the legal and safe application of LLMs. However, the pursuit of complete unlearning often comes at substantial cost to a model's general functionality, leading to a notorious trade-off between unlearning and retention. This motivates us to explore enhanced unlearning schemes that can mitigate this trade-off. Specifically, we propose Gradient Rectified Unlearning (GRU), an improved framework that regulates the directions of gradient updates during the unlearning procedure such that their side effects on other, unrelated responses are minimized. GRU is simple and general to implement, demonstrating practical effectiveness across a variety of well-established unlearning benchmarks.
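
One natural reading of "regulating gradient directions" is a PCGrad-style projection, sketched below on flattened gradients. This is our assumption about the flavor of the update rule, not necessarily GRU's exact formulation.

```python
# PCGrad-style rectification sketch (our assumption, not necessarily GRU's
# exact rule): drop the component of the unlearning gradient that opposes
# the retention gradient.
import torch

def rectify(g_unlearn: torch.Tensor, g_retain: torch.Tensor) -> torch.Tensor:
    dot = torch.dot(g_unlearn, g_retain)
    if dot < 0:                          # the update would hurt retention
        g_unlearn = g_unlearn - (dot / g_retain.norm() ** 2) * g_retain
    return g_unlearn

g_u = torch.tensor([1.0, -2.0])
g_r = torch.tensor([1.0, 1.0])
print(rectify(g_u, g_r))                 # tensor([ 1.5000, -1.5000])
```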


Poster
#E-2411
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Rui Yang · Hanyang(Jeremy) Chen · Junyu Zhang · Mark Zhao · Cheng Qian · Kangrui Wang · Qineng Wang · Teja Koripella · Marziyeh Movahedi · Manling Li · Heng Ji · Huan Zhang · Tong Zhang

Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 24 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only $28.9\%$ on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code and dataset are available at [https://embodiedbench.github.io](https://embodiedbench.github.io).


Poster
#E-2412
Hypo3D: Exploring Hypothetical Reasoning in 3D

Ye Mao · Weixun Luo · Junpeng Jing · Anlan Qiu · Krystian Mikolajczyk

The rise of vision-language foundation models marks an advancement in bridging the gap between human and machine capabilities in 3D scene reasoning. Existing 3D reasoning benchmarks assume real-time scene accessibility, which is impractical due to the high cost of frequent scene updates. To this end, we introduce Hypothetical 3D Reasoning, namely Hypo3D, a benchmark designed to evaluate models' ability to reason without access to real-time scene data. Models need to imagine the scene state based on a provided change description before reasoning. Hypo3D is formulated as a 3D Visual Question Answering (VQA) benchmark, comprising 7,727 context changes across 700 indoor scenes, resulting in 14,885 question-answer pairs. An anchor-based world frame is established for all scenes, ensuring consistent reference to a global frame for directional terms in context changes and QAs. Extensive experiments show that state-of-the-art foundation models struggle to reason effectively in hypothetically changed scenes. This reveals a substantial performance gap compared to humans, particularly in scenarios involving movement changes and directional reasoning. Even when the change is irrelevant to the question, models often incorrectly adjust their answers. The code and dataset are publicly available at: https://matchlab-imperial.github.io/Hypo3D.


Poster
#E-2500
Robust Multimodal Large Language Models Against Modality Conflict

Zongmeng Zhang · Wengang Zhou · Jie Zhao · Houqiang Li

Despite the impressive capabilities of multimodal large language models (MLLMs) in vision-language tasks, they are prone to hallucinations in real-world scenarios. This paper investigates the hallucination phenomenon in MLLMs from the perspective of modality conflict. Unlike existing works focusing on the conflicts between model responses and inputs, we study the inherent conflicts in inputs from different modalities that place MLLMs in a dilemma and directly lead to hallucinations. We formally define the modality conflict and construct a dataset named Multimodal Modality Conflict (MMMC) to simulate this phenomenon in vision-language tasks. Three methods based on prompt engineering, supervised fine-tuning, and reinforcement learning are proposed to alleviate the hallucination caused by modality conflict. Extensive experiments are conducted on the MMMC dataset to analyze the merits and demerits of these methods. Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance. Our work sheds light on the unnoticed modality conflict that leads to hallucinations and provides more insights into the robustness of MLLMs.


Poster
#E-2501
DocKS-RAG: Optimizing Document-Level Relation Extraction through LLM-Enhanced Hybrid Prompt Tuning

Xiaolong Xu · Yibo Zhou · Haolong Xiang · Xiaoyong Li · Xuyun Zhang · Lianyong Qi · Wanchun Dou

Document-level relation extraction (RE) aims to extract comprehensive correlations between entities and relations from documents. Most existing works conduct transfer learning on pre-trained language models (PLMs), which allows richer contextual representations to improve performance. However, such PLM-based methods struggle to incorporate structural knowledge, such as entity-entity interactions. Moreover, current works struggle to infer the implicit relations between entities across different sentences, which results in poor predictions. To deal with the above issues, we propose a novel and effective framework, named DocKS-RAG, which introduces extra structural knowledge and semantic information to further enhance the performance of document-level RE. Specifically, we construct a Document-level Knowledge Graph from the observable documentation data to better capture the structural information between entities and relations. Then, a Sentence-level Semantic Retrieval-Augmented Generation mechanism is designed to consider the similarity in different sentences by retrieving the relevant contextual semantic information. Furthermore, we present a hybrid prompt tuning method on large language models (LLMs) for specific document-level RE tasks. Finally, extensive experiments conducted on two benchmark datasets demonstrate that our proposed framework improves all the metrics compared with state-of-the-art methods.


Poster
#E-2502
Teaching Language Models to Critique via Reinforcement Learning

Zhihui Xie · Jie chen · Liyu Chen · Weichao Mao · Jingjing Xu · Lingpeng Kong

Teaching large language models (LLMs) to critique and refine their outputs is crucial for building systems that can iteratively improve, yet it is fundamentally limited by the ability to provide *accurate judgments* and *actionable suggestions*. In this work, we study LLM critics for code generation and propose $\texttt{CTRL}$, a framework for $\texttt{C}$ritic $\texttt{T}$raining via $\texttt{R}$einforcement $\texttt{L}$earning, which trains a critic model to generate feedback that maximizes correction performance for a fixed generator model without human supervision. Our results demonstrate that critics trained with $\texttt{CTRL}$ significantly enhance pass rates and mitigate compounding errors across both base and stronger generator models. Furthermore, we show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision, achieving up to 106.1\% relative improvements across challenging code generation benchmarks.


Spotlight Poster
#E-2503
Aligning with Logic: Measuring, Evaluating and Improving Logical Preference Consistency in Large Language Models

Yinhong Liu · Zhijiang Guo · Tianya Liang · Ehsan Shareghi · Ivan Vulić · Nigel Collier

Large Language Models (LLMs) are expected to be predictable and trustworthy to support reliable decision-making systems. Yet current LLMs often show inconsistencies in their judgments. In this work, we examine \textit{logical preference consistency} as a foundational requirement for building more dependable LLM systems, ensuring stable and coherent decision-making while minimizing erratic or contradictory outputs. To quantify logical preference consistency, we propose a universal evaluation framework based on three fundamental properties: transitivity, commutativity, and negation invariance. Through extensive experimentation across diverse LLMs, we demonstrate that these properties serve as strong indicators of judgment robustness. Furthermore, we introduce a data refinement and augmentation technique, REPAIR, that enhances logical consistency while maintaining alignment with human preferences. Finally, we show that improving consistency leads to better performance in LLM-driven logic-based algorithms, reinforcing stability and coherence in decision-making systems.
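
As a concrete instance of one of the three properties, the snippet below counts transitivity violations in a set of pairwise judgments. The data structure is our own minimal illustration, not the paper's evaluation harness.

```python
# Minimal transitivity checker: A > B and B > C should imply A > C.
# prefer[(a, b)] == True means the judge preferred response a over b.
from itertools import permutations

def transitivity_violations(items, prefer):
    bad = 0
    for a, b, c in permutations(items, 3):
        if prefer[(a, b)] and prefer[(b, c)] and not prefer[(a, c)]:
            bad += 1
    return bad

items = ["r1", "r2", "r3"]
prefer = {("r1", "r2"): True,  ("r2", "r1"): False,
          ("r2", "r3"): True,  ("r3", "r2"): False,
          ("r1", "r3"): False, ("r3", "r1"): True}   # a preference cycle
print(transitivity_violations(items, prefer))        # > 0 -> inconsistent
```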


Poster
#E-2504
Gravity-Bench-v1: A Benchmark on Gravitational Physics Discovery for Agents

Nolan Koblischke · Hyunseok Jang · Kristen Menou · Mohamad Ali-Dib

Modern science emerged from reasoning over repeatedly-observed planetary motions. We present Gravity-Bench-v1, an environment-based benchmark that challenges AI agents on tasks that parallel this historical development. Gravity-Bench-v1 evaluates agents on the discovery of physics concealed within a dynamic environment, using rigorous gravitational dynamics simulations. Gravity-Bench includes out-of-distribution cases, i.e. with physics that deviates from the real world, to evaluate true scientific generalization capabilities. Agents must plan to collect data within an experimental budget and must perform a dynamic form of data analysis and reasoning to solve tasks efficiently. Our benchmark admits an open-ended space of solutions. Reference solutions for each task are provided to calibrate AI performance against human expertise. Technically at an upper-undergraduate level, our benchmark proves challenging to baseline AI agents. Gravity-Bench-v1 and planned extensions should help map out AI progress towards scientific discovery capabilities.


Poster
#E-2505
Interpreting the Repeated Token Phenomenon in Large Language Models

Itay Yona · Ilia Shumailov · Jamie Hayes · Yossi Gandelsman

Large Language Models (LLMs), despite their impressive capabilities, often fail to accurately repeat a single word when prompted to, and instead output unrelated text. This unexplained failure mode represents a vulnerability, allowing even end users to diverge models away from their intended behavior. We aim to explain the causes for this phenomenon and link it to the concept of "attention sinks", an emergent LLM behavior crucial for fluency, in which the initial token receives disproportionately high attention scores. Our investigation identifies the neural circuit responsible for attention sinks and shows how long repetitions disrupt this circuit. We extend this finding to other nonrepeating sequences that exhibit similar circuit disruptions. To address this, we propose a targeted patch that effectively resolves the issue without negatively impacting the overall performance of the model. This study provides a mechanistic explanation for an LLM vulnerability, demonstrating how interpretability can diagnose and address issues, and offering insights that pave the way for more secure and reliable models.


Poster
#E-2506
Non-Asymptotic Length Generalization

Thomas Chen · Tengyu Ma · Zhiyuan Li

Length generalization is the ability of a learning algorithm to learn a hypothesis which generalizes to longer inputs than the inputs in the training set. In this paper, we provide provable guarantees of length generalization for various classes of functions in an idealized setting. First, we formalize the framework of non-asymptotic length generalization, which requires a computable upper bound for the minimum input length that guarantees length generalization, as a function of the complexity of the ground-truth function under some given complexity measure. We refer to this minimum input length required to length-generalize as the length complexity. We show that the Minimum-Complexity Interpolator learning algorithm achieves optimal length complexity. We further show that whether a function class admits non-asymptotic length generalization is equivalent to the decidability of its language equivalence problem, which implies that there is no computable upper bound for the length complexity of Context-Free Grammars. On the positive side, we show that the length complexity of Deterministic Finite Automata is $2n - 2$ where $n$ is the number of states of the ground-truth automaton. Our main results are upper bounds on the length complexity for a subset of a transformer-related function class called C-RASP (Yang & Chiang, 2024). We show that the length complexity of 1-layer C-RASP functions is $O(T^2)$ when the ground-truth function has precision $T$, and that the length complexity of 2-layer C-RASP functions is $O(T^{O(K)})$ when the ground-truth function has precision $T$ and $K$ heads.


Poster
#E-2507
What Do Learning Dynamics Reveal About Generalization in LLM Mathematical Reasoning?

Katie Kang · Amrith Setlur · Dibya Ghosh · Jacob Steinhardt · Claire Tomlin · Sergey Levine · Aviral Kumar

Modern large language models (LLMs) excel at fitting finetuning data, but often struggle on unseen examples. In order to teach models genuine reasoning abilities rather than superficial pattern matching, our work aims to better understand how the learning dynamics of LLM finetuning shapes downstream generalization. Our analysis focuses on reasoning tasks, whose problem structure allows us to distinguish between memorization (the exact replication of reasoning steps from the training data) and performance (the correctness of the final solution). We find that a model's performance on test prompts can be effectively characterized by a training metric we call pre-memorization train accuracy: the accuracy of model samples on training queries before they begin to copy the exact reasoning steps from the training set. On the dataset level, this metric is able to almost perfectly predict test accuracy, achieving $R^2$ of $\geq 0.9$ across various models (Llama3 8B, Gemma2 9B), datasets (GSM8k, MATH), and training configurations. On a per-example level, this metric is also indicative of whether individual model predictions are robust to perturbations in the training query. By connecting a model's learning dynamics to test performance, pre-memorization train accuracy can inform training decisions, such as the makeup of the training data. Our experiments on data curation show that prioritizing examples with low pre-memorization accuracy leads to 1.5-2x improvements in data efficiency compared to i.i.d. data scaling and other data scaling techniques.
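
A minimal sketch of the metric as described might look as follows. The matching rule (exact string equality per sample) and the data layout are simplifications of ours, since the paper's procedure involves samples drawn at checkpoints over training.

```python
# Sketch of pre-memorization train accuracy (simplified): count a sample's
# answer correctness only while its reasoning is not yet an exact copy of
# the training solution for that query.
def pre_memorization_accuracy(sample_histories, train_solutions, gold_answers):
    """sample_histories[i]: (reasoning, answer) pairs over training for query i."""
    hits, total = 0, 0
    for i, history in enumerate(sample_histories):
        for reasoning, answer in history:
            if reasoning.strip() == train_solutions[i].strip():
                break                    # memorization began; stop counting
            hits += answer == gold_answers[i]
            total += 1
    return hits / max(total, 1)

hist = [[("guess 4+3", "7"), ("add: 4+3=7", "7")]]
print(pre_memorization_accuracy(hist, ["add: 4+3=7"], ["7"]))   # 1.0
```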


Poster
#E-2508
Overtrained Language Models Are Harder to Fine-Tune

Jacob Mitchell Springer · Sachin Goyal · Kaiyue Wen · Tanishq Kumar · Xiang Yue · Sadhika Malladi · Graham Neubig · Aditi Raghunathan

Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon \textbf{catastrophic overtraining}. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2\% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.


Poster
#E-2509
Perception in Reflection

Yana Wei · Liang Zhao · Kangheng Lin · En Yu · Yuang Peng · Runpei Dong · Jianjian Sun · Haoran Wei · Zheng Ge · Xiangyu Zhang · Vishal Patel

We present a perception-in-reflection paradigm designed to transcend the limitations of current large vision-language models (LVLMs), which are expected yet often fail to achieve perfect perception initially. Specifically, we propose Reflective Perception (RePer), a dual-model reflection mechanism that systematically alternates between policy and critic models, enabling iterative refinement of visual perception. This framework is powered by Reflective Perceptual Learning (RPL), which reinforces intrinsic reflective capabilities through a methodically constructed visual reflection dataset and reflective unlikelihood training. Comprehensive experimental evaluation demonstrates RePer's quantifiable improvements in image understanding, captioning precision, and hallucination reduction. Notably, RePer achieves strong alignment between model attention patterns and human visual focus, while RPL optimizes fine-grained and free-form preference alignment. These advancements establish perception in reflection as a robust paradigm for future multimodal agents, particularly in tasks requiring complex reasoning and multi-step manipulation. Project Page: https://weiyana.github.io/Perception-in-Reflection


Poster
#E-2510
CROW: Eliminating Backdoors from Large Language Models via Internal Consistency Regularization

Nay Myat Min · Long H. Pham · Yige Li · Jun Sun

Large Language Models (LLMs) are vulnerable to backdoor attacks that manipulate outputs via hidden triggers. Existing defense methods—designed for vision/text classification tasks—fail for text generation. We propose Internal Consistency Regularization (CROW), a defense leveraging the observation that backdoored models exhibit unstable layer-wise hidden representations when triggered, while clean models show smooth transitions. CROW enforces consistency across layers via adversarial perturbations and regularization during finetuning, neutralizing backdoors without requiring clean reference models or trigger knowledge—only a small clean dataset. Experiments across Llama-2 (7B, 13B), CodeLlama (7B, 13B), and Mistral-7B demonstrate CROW’s effectiveness: it achieves significant reductions in attack success rates across diverse backdoor strategies (sentiment steering, targeted refusal, code injection) while preserving generative performance. CROW’s architecture-agnostic design enables practical deployment.
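
The consistency term could be sketched as below. The cosine-based form is a plausible instantiation of ours rather than the paper's exact loss, and the adversarial-perturbation half of the method is omitted.

```python
# Hedged sketch of a layer-wise consistency regularizer in the spirit of
# CROW: penalize abrupt changes between consecutive hidden states (the
# paper combines this with adversarial perturbations during finetuning).
import torch
import torch.nn.functional as F

def consistency_loss(hidden_states):
    """hidden_states: list of [batch, seq, dim] tensors, one per layer."""
    loss = 0.0
    for h_prev, h_next in zip(hidden_states[:-1], hidden_states[1:]):
        cos = F.cosine_similarity(h_prev, h_next, dim=-1)   # per-token
        loss = loss + (1.0 - cos).mean()                    # smooth transitions
    return loss / (len(hidden_states) - 1)

hs = [torch.randn(2, 5, 16) for _ in range(4)]
print(consistency_loss(hs))
```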


Poster
#E-2511
Resolving Lexical Bias in Model Editing

Hammad Rizwan · Domenic Rosati · Ga Wu · Hassan Sajjad

Model editing aims to modify the outputs of large language models after they are trained. Previous approaches have often involved direct alterations to model weights, which can result in model degradation. Recent techniques avoid making modifications to the model's weights by using an adapter that applies edits to the model when triggered by semantic similarity in the representation space. We demonstrate that current adapter methods are critically vulnerable to strong lexical biases, leading to issues such as applying edits to irrelevant prompts with overlapping words. This paper presents a principled approach to learning a disentangled representation space that facilitates precise localization of edits by maintaining distance between irrelevant prompts while preserving proximity among paraphrases. In our empirical study, we show that our method (Projector Editor Networks for Model Editing - PENME) achieves state-of-the-art model editing results while being more computationally efficient during inference than previous methods and adaptable across different architectures.


Poster
#E-2512
Programming Every Example: Lifting Pre-training Data Quality Like Experts at Scale

Fan Zhou · Zengzhi Wang · Qian Liu · Junlong Li · Pengfei Liu

Large language model pre-training has traditionally relied on human experts to craft heuristics for improving corpus quality, resulting in numerous rules developed to date. However, these fixed rules lack the flexibility to address the unique characteristics of individual examples, yet crafting sample-wise rules is impractical for human experts. In this paper, we show that even small language models, with only 0.3B parameters, can exhibit substantial data refining capabilities. We propose Programming Every Example (ProX), a novel framework that treats data refinement as a programming task, and enables the model to refine corpora by generating and executing fine-grained operations, such as string normalization, for each individual example at scale. Experiments show that models trained on ProX-refined data consistently outperform other baselines across 10 benchmarks, demonstrating effectiveness across model sizes (up to 1.7B) and pre-training corpora (C4, RedPajama-V2, FineWeb, FineWeb-Edu, and DCLM). ProX also shows great potential in continual pre-training: in the math domain, ProX boosts 7B models by up to 20% within 10B tokens—results typically achieved with much larger-scale training (e.g., 200B tokens). We believe ProX offers a way to curate high-quality pre-training data and ultimately contributes to efficient LLM development.


Poster
#E-2600
Retraining-free Merging of Sparse MoE via Hierarchical Clustering

I-Chun Chen · Hsu-Shen Liu · Wei-Fang Sun · Chen-Hao Chao · Yen-Chang Hsu · Chun-Yi Lee

Sparse Mixture-of-Experts (SMoE) models represent a significant advancement in large language model (LLM) development through their efficient parameter utilization. These models achieve substantial performance improvements at reduced inference costs. However, the deployment of SMoE models faces constraints from the extensive memory requirements of expert components in resource-limited environments. To address these limitations, this paper introduces Hierarchical Clustering for Sparsely activated Mixture of Experts (HC-SMoE), a task-agnostic expert merging framework for parameter reduction without retraining. HC-SMoE introduces a novel hierarchical clustering approach based on expert outputs to ensure merging robustness independent of routing decisions. The proposed output-based clustering method enables effective capture of functional relationships between experts for large-scale architectures. We provide theoretical analysis and comprehensive evaluations across multiple zero-shot language tasks to demonstrate HC-SMoE's effectiveness in state-of-the-art models including Qwen and Mixtral. The experimental results validate HC-SMoE's superior performance and practical applicability for real-world deployments. Our implementation is available at https://github.com/wazenmai/HC-SMoE.
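
In spirit, output-based merging could look like the sketch below: cluster experts by mean output signatures on calibration data, then weight-average experts within each cluster. The calibration features and linkage choice are illustrative assumptions, not the paper's configuration.

```python
# Illustrative sketch of output-based hierarchical expert merging.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def merge_experts(expert_weights, expert_outputs, n_clusters):
    """expert_outputs: [num_experts, feat] mean activations on calibration data."""
    Z = linkage(expert_outputs, method="average")
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    merged = {c: np.mean([w for w, l in zip(expert_weights, labels) if l == c],
                         axis=0)
              for c in set(labels)}
    return merged, labels

W = [np.random.randn(8, 8) for _ in range(6)]   # toy expert weight matrices
O = np.random.randn(6, 16)                      # toy output signatures
merged, labels = merge_experts(W, O, n_clusters=3)
print(labels)                                   # cluster id per expert
```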


Poster
#E-2601
C-3PO: Compact Plug-and-Play Proxy Optimization to Achieve Human-like Retrieval-Augmented Generation

Guoxin Chen · Minpeng Liao · Peiying Yu · Dingmin Wang · Zile Qiao · Chao Yang · Xin Zhao · Kai Fan

Retrieval-augmented generation (RAG) systems face a fundamental challenge in aligning independently developed retrievers and large language models (LLMs). Existing approaches typically involve modifying either component or introducing simple intermediate modules, resulting in practical limitations and sub-optimal performance. Inspired by human search behavior, which typically involves a back-and-forth process of proposing search queries and reviewing documents, we propose C-3PO, a proxy-centric framework that facilitates communication between retrievers and LLMs through a lightweight multi-agent system. Our framework implements three specialized agents that collaboratively optimize the entire RAG pipeline without altering the retriever and LLMs. These agents work together to assess the need for retrieval, generate effective queries, and select information suitable for the LLMs. To enable effective multi-agent coordination, we develop a tree-structured rollout approach for reward credit assignment in reinforcement learning. Extensive experiments in both in-domain and out-of-distribution scenarios demonstrate that C-3PO significantly enhances RAG performance while maintaining plug-and-play flexibility and superior generalization capabilities.


Spotlight Poster
#E-2602
CodeIO: Condensing Reasoning Patterns via Code Input-Output Prediction

Junlong Li · Daya Guo · Dejian Yang · Runxin Xu · Yu Wu · Junxian He

Reasoning is a fundamental capability of Large Language Models. While prior research predominantly focuses on enhancing narrow skills like math or code generation, improving performance on many other reasoning tasks remains challenging due to sparse and fragmented training data. To address this issue, we propose CodeI/O, a novel approach that systematically condenses diverse reasoning patterns inherently embedded in contextually-grounded codes, through transforming the original code into a code input-output prediction format. By training models to predict inputs/outputs given code and test cases entirely in natural language as Chain-of-Thought (CoT) rationales, we expose them to universal reasoning primitives—like logic flow planning, state-space searching, decision tree traversal, and modular decomposition—while decoupling structured reasoning from code-specific syntax and preserving procedural rigor. Experimental results demonstrate CodeI/O leads to consistent improvements across symbolic, scientific, logic, math & numerical, and commonsense reasoning tasks. By matching the existing ground-truth outputs or re-executing the code with predicted inputs, we can verify each prediction and further enhance the CoTs through multi-turn revision, resulting in CodeI/O++ and achieving higher performance. Our data and models will be publicly available.
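
A single training instance in this format can be assembled as in the sketch below. The prompt template is our guess at the spirit of the format, not the paper's exact wording; the label is verifiable by executing the code.

```python
# Sketch of turning a function plus a test case into a CodeI/O-style
# output-prediction example with a verifiable ground-truth label.
def make_output_prediction_example(code: str, fn, test_input):
    prompt = (f"Given the code:\n{code}\n"
              f"Predict the output for input {test_input!r}. "
              f"Reason step by step in natural language.")
    label = fn(*test_input)               # ground truth via execution
    return prompt, label

code = "def f(xs):\n    return sorted(xs)[-2]"
def f(xs): return sorted(xs)[-2]
prompt, label = make_output_prediction_example(code, f, ([3, 1, 4, 1, 5],))
print(prompt, "\nLabel:", label)          # Label: 4
```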


Poster
#E-2603
QLASS: Boosting Language Agent Inference via Q-Guided Stepwise Search

Zongyu Lin · Yao Tang · Xingcheng Yao · Da Yin · ziniu hu · Yizhou Sun · Kai-Wei Chang

Language agents have become a promising solution to complex interactive tasks. One of the key ingredients to the success of language agents is the reward model on the trajectory of the agentic workflow, which provides valuable guidance during training or inference. However, due to the lack of annotations of intermediate interactions, most existing works use an outcome reward model to optimize policies across entire trajectories. This may lead to sub-optimal policies and hinder the overall performance. To address this, we propose QLASS (Q-guided Language Agent Stepwise Search), to automatically generate annotations by estimating Q-values in a stepwise manner for open language agents. By introducing a reasoning tree and performing process reward modeling, QLASS provides effective intermediate guidance for each step. With the stepwise guidance, we propose a Q-guided generation strategy to enable language agents to better adapt to long-term value, resulting in significant performance improvement during model inference on complex interactive agent tasks. Notably, even with almost half the annotated data, QLASS retains strong performance, demonstrating its efficiency in handling limited supervision. We also empirically demonstrate that QLASS can lead to more effective decision making through qualitative analysis.


Poster
#E-2604
CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance

Yongchao Chen · Yilun Hao · Yueying Liu · Yang Zhang · Chuchu Fan

Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark SymBench comprising 37 symbolic tasks with adjustable complexity and also synthesize datasets of 12k multi-turn guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-turn supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the existing best LLM OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8 performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, Datasets, and Codes are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98.


Poster
#E-2605
Modularized Self-Reflected Video Reasoner for Multimodal LLM with Application to Video Question Answering

Zihan Song · Xin Wang · Zi Qian · Hong Chen · Longtao Huang · Hui Xue' · Wenwu Zhu

Multimodal Large Language Models (Multimodal LLMs) have shown their strength in Video Question Answering (VideoQA). However, due to the black-box nature of end-to-end training strategies, existing approaches based on Multimodal LLMs suffer from the lack of interpretability for VideoQA: they can neither present reasoning paths nor indicate where the answers are derived from the video. To address this issue, we propose MSR-ViR (Modularized Self-Reflected Video Reasoner), which for the first time integrates modular networks to Multimodal LLMs, capable of providing VideoQA with explicit reasoning paths for more interpretability. Specifically, a MoST-Grounding (Modularized Spatial-Temporal Grounding) network is proposed to decompose complex questions via tree-structured policies, localizing relevant temporal and spatial segments within videos through step-by-step reasoning. The proposed MoST-Grounding network provides explicit visually grounded information for Multimodal LLMs with clear reasoning paths, thus enhancing interpretability for the predicted answers. To further improve the reasoning quality, we design an Alternate Self-reflection Training Strategy to jointly optimize policy generation and Multimodal LLMs. Experiments on real-world datasets demonstrate the superiority of our proposed MSR-ViR framework in video understanding, reasoning transparency, and providing explicit localization evidence for answers.


Poster
#E-2606
Product of Experts with LLMs: Boosting Performance on ARC Is a Matter of Perspective

Daniel Franzen · Jan Disselhoff · David Hartmann

The Abstraction and Reasoning Corpus (ARC-AGI) poses a significant challenge for large language models (LLMs), exposing limitations in their abstract reasoning abilities. In this work, we leverage task-specific data augmentations throughout the training, generation, and scoring phases, and employ a depth-first search algorithm to generate diverse, high-probability candidate solutions. Furthermore, we utilize the LLM not only as a generator but also as a scorer, using its output probabilities to select the most promising solutions. Our method achieves a score of 71.6% (286.5/400 solved tasks) on the public ARC-AGI evaluation set, demonstrating state-of-the-art performance among publicly available approaches. While concurrent closed-source work has reported higher scores, our method distinguishes itself through its transparency, reproducibility, and remarkably low inference cost, averaging only around 2 cents per task on readily available hardware.
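
The generator-plus-scorer idea can be boiled down to the following sketch: score each candidate under several augmented views of the task and keep the one with the highest summed log-probability (a product of experts over views). Everything here is a toy stand-in for the actual model and augmentations.

```python
# Product-of-experts candidate selection (toy sketch): sum a model's
# log-probabilities of the same solution across augmented task views.
def poe_score(candidate, views, logprob_fn):
    return sum(logprob_fn(v, candidate) for v in views)

def pick_best(candidates, views, logprob_fn):
    return max(candidates, key=lambda c: poe_score(c, views, logprob_fn))

logprob = lambda view, cand: -len(cand)      # stand-in for model log p
print(pick_best(["ab", "abcd"], ["identity", "rot90", "flip"], logprob))  # ab
```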


Poster
#E-2607
Understanding Synthetic Context Extension via Retrieval Heads

Xinyu Zhao · Fangcong Yin · Greg Durrett

Long-context LLMs are increasingly in demand for applications such as retrieval-augmented generation. To defray the cost of pretraining LLMs over long contexts, recent work takes an approach of synthetic context extension: fine-tuning LLMs with synthetically-generated long-context data. However, it remains unclear how and why this synthetic context extension imparts abilities for downstream long-context tasks. In this paper, we investigate fine-tuning on synthetic data for three long-context tasks that require retrieval and reasoning. We vary the realism of "needle" concepts to be retrieved and the diversity of the surrounding "haystack" context, from using LLMs to construct synthetic documents to using templated relations and creating symbolic datasets. Although models trained on synthetic data underperform models trained on real data, the impacts of both training settings can be understood via a shared feature of the attention computation, retrieval heads (Wu et al., 2024). The retrieval heads learned from synthetic data have high overlap with retrieval heads learned on real data. Furthermore, there is a strong correlation between the recall of heads learned and the downstream performance of a model, allowing us to interpret and predict the performance of models trained in different settings. Our results shed light on how to interpret synthetic data fine-tuning performance and how to approach creating better data for learning real-world LLM capabilities over long contexts.


Poster
#E-2608
Streamline Without Sacrifice - Squeeze out Computation Redundancy in LMM

Penghao Wu · Lewei Lu · Ziwei Liu

Large multimodal models excel in multimodal tasks but face significant computational challenges due to excessive visual tokens. Unlike token reduction methods that focus on token-level redundancy, we identify and study the computation-level redundancy on vision tokens to ensure no information loss. Our key insight is that vision tokens from the pretrained vision encoder do not necessarily require all the heavy operations (e.g., self-attention, FFNs) in decoder-only LMMs and could be processed more lightly with proper designs. We designed a series of experiments to discover and progressively squeeze out the vision-related computation redundancy. Based on our findings, we propose ProxyV, a novel approach that utilizes proxy vision tokens to alleviate the computational burden on original vision tokens. ProxyV enhances efficiency without compromising performance and can even yield notable performance gains in scenarios with more moderate efficiency improvements. Furthermore, the flexibility of ProxyV is demonstrated through its combination with token reduction methods to boost efficiency further.


Poster
#E-2609
Tuning LLM Judge Design Decisions for 1/1000 of the Cost

David Salinas · Omar Swelam · Frank Hutter

Evaluating Large Language Models (LLMs) often requires costly human annotations. To address this, LLM-based judges have been proposed: they compare the outputs of two LLMs, enabling models to be ranked without human intervention. While several approaches have been proposed, many confounding factors are present across different papers; for instance, the model, the prompt, and other hyperparameters are typically changed at the same time, making apples-to-apples comparisons challenging. In this paper, we propose to systematically analyze and tune the hyperparameters of LLM judges. To alleviate the high cost of evaluating a judge, we leverage multi-objective, multi-fidelity optimization, which finds judges that trade off accuracy and cost while also significantly reducing the cost of the search. Our method identifies judges that not only outperform existing benchmarks in accuracy and cost-efficiency but also utilize open-weight models, ensuring greater accessibility and reproducibility.


Poster
#E-2610
EVOLvE: Evaluating and Optimizing LLMs For In-Context Exploration

Allen Nie · Yi Su · Bo Chang · Jonathan Lee · Ed Chi · Quoc Le · Minmin Chen

Despite their success in many domains, large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs' (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs' performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference; and through algorithm distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conducted an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we conduct a rigorous analysis of the LLM's exploration efficiency using the concept of regret, linking its ability to explore to the model size and underlying algorithm.
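
As an example of the kind of algorithmic knowledge that can be provided to an LLM, here is a self-contained UCB1 arm-selection routine. Whether the paper injects exactly this pseudocode is an assumption on our part, but UCB is the canonical optimal-exploration baseline in this bandit setting.

```python
# UCB1 arm selection: exploit the empirical mean, plus an exploration
# bonus that shrinks as an arm is played more often.
import math

def ucb1_choose(counts, means, t, c=2.0):
    for arm, n in enumerate(counts):
        if n == 0:
            return arm                    # play every arm once first
    scores = [m + math.sqrt(c * math.log(t) / n)
              for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)

counts, means = [3, 1, 5], [0.4, 0.9, 0.5]
print(ucb1_choose(counts, means, t=9))    # exploration bonus favors arm 1
```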


Poster
#E-2611
Controlling Large Language Model with Latent Action

Chengxing Jia · Ziniu Li · Pengyuan Wang · Yi-Chen Li · Zhenyu Hou · Yuxiao Dong · Yang Yu

Adapting Large Language Models (LLMs) to downstream tasks using Reinforcement Learning (RL) has proven to be an effective approach. However, LLMs do not inherently define the structure of an agent for RL training, particularly in terms of specifying the action space. This paper studies learning a compact latent action space to enhance the controllability and exploration of RL for LLMs. Inspired by reinforcement learning from observations, we propose Controlling Large Language Models with Latent Actions (CoLA), a framework that integrates a latent action space into pre-trained LLMs. CoLA employs an \emph{inverse dynamics model} to extract latent actions conditioned on future tokens, ensuring that the next token prediction is partially influenced by these actions. Simultaneously, CoLA fine-tunes the pre-trained LLM to function as a \emph{language world model}, capable of incorporating latent actions as inputs. Additionally, CoLA trains a \emph{policy model} to generate actions within this language world model. The policy model can be trained via behavior cloning to mimic a standard language model or through RL to maximize task-specific rewards. In this work, we apply CoLA to the Llama-3.1-8B model. Our experiments demonstrate that, compared to RL with token-level actions, CoLA's latent actions enable greater semantic diversity. For enhancing downstream tasks, we show that CoLA with RL achieves a score of 42.4 on the \emph{math500} benchmark, surpassing the baseline score of 38.2, and reaches 68.2 when augmented with a Monte Carlo Tree Search variant. Furthermore, CoLA with RL consistently improves performance on agent-based tasks without degrading the pre-trained LLM's capabilities, unlike the baseline. Finally, CoLA reduces computation time by half in tasks involving enhanced thinking prompts for LLMs via RL. These results highlight CoLA's potential to advance RL-based adaptation of LLMs for downstream applications. The CoLA model is available at \url{https://huggingface.co/LAMDA-RL/Llama-3.1-CoLA-10B}.
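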


Poster
#E-2612
Language Models as Implicit Tree Search

Ziliang Chen · Zhao-Rong Lai · Yufeng Yang · Liangda Fang · ZHANFU YANG · Liang Lin

Despite advancing language model (LM) alignment, direct preference optimization (DPO) falls short in LM reasoning, even with the free lunch it inherits from avoiding reinforcement learning (RL). As a breakthrough, this work proposes a new RL-free preference optimization method that achieves DPO while learning another LM whose response generation policy is asymptotically equivalent to AlphaZero-like search, the apex of algorithms for complex reasoning missions such as chess and Go. While circumventing explicit value and reward modeling, the neural implicit tree search executed by the extra LM still equips DPO with a reasoning procedure technically akin to AlphaZero's. Our experiments demonstrate that our methodology outperforms both regular DPO variants in human preference alignment and MCTS-based LMs in mathematical reasoning and planning tasks.


Poster
#E-2700
Automated Hypothesis Validation with Agentic Sequential Falsifications

Kexin Huang · Ying Jin · Ryan Li · Michael Li · Emmanuel J Candes · Jure Leskovec

Hypotheses are central to information acquisition, decision-making, and discovery. However, many real-world hypotheses are abstract, high-level statements that are difficult to validate directly. This challenge is further intensified by the rise of hypothesis generation from Large Language Models (LLMs), which are prone to hallucination and produce hypotheses in volumes that make manual validation impractical. Here we propose POPPER, an agentic framework for rigorous automated validation of free-form hypotheses. Guided by Karl Popper's principle of falsification, POPPER validates a hypothesis using LLM agents that design and execute falsification experiments targeting its measurable implications. A novel sequential testing framework ensures strict Type-I error control while actively gathering evidence from diverse observations, whether drawn from existing data or newly conducted procedures. We demonstrate POPPER on six domains including biology, economics, and sociology. POPPER delivers robust error control, high power, and scalability. Furthermore, compared to human scientists, POPPER achieved comparable performance in validating complex biological hypotheses while reducing time ten-fold, providing a scalable, rigorous solution for hypothesis validation. POPPER is freely available at https://github.com/snap-stanford/POPPER.
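
One standard recipe for the strict Type-I control mentioned here is an e-value-based sequential test: multiply per-experiment e-values and reject when the running product exceeds 1/alpha, which bounds the Type-I error by alpha via Ville's inequality. The sketch below shows that generic recipe, not necessarily POPPER's exact procedure.

```python
# Generic anytime-valid sequential falsification via e-values (a standard
# recipe; not necessarily POPPER's implementation): e-values multiply
# under the null, and Ville's inequality bounds the error at level alpha.
def sequential_falsify(evalue_stream, alpha=0.05):
    e_running = 1.0
    for t, e in enumerate(evalue_stream, start=1):
        e_running *= e                    # accumulate evidence
        if e_running >= 1.0 / alpha:
            return t, e_running           # hypothesis falsified at step t
    return None, e_running                # evidence insufficient so far

step, e = sequential_falsify([1.5, 2.0, 1.2, 3.0, 2.5])
print(step, round(e, 2))                  # 5 27.0 (crosses 1/0.05 = 20)
```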


Poster
#E-2701
What Limits Bidirectional Model's Generative Capabilities? A Uni-Bi-Directional Mixture-of-Expert Method For Bidirectional Fine-tuning

Zuchao Li · Yonghua Hei · Qiwei Li · Lefei Zhang · Ping Wang · hai zhao · qi baoyuan · Liu Guoming

Large Language Models (LLMs) excel in generation tasks, yet their causal attention mechanisms limit performance in embedding tasks. While bidirectional modeling may enhance embeddings, naively fine-tuning unidirectional models bidirectionally severely degrades generative performance. To investigate this trade-off, we analyze attention weights as dependence indicators and find that bidirectional fine-tuning increases dependence on subsequent tokens, impairing unidirectional generation. Through systematic evaluations of Transformer modules, we discover that the FFN layer is least affected by such dependence. Leveraging this discovery, we propose UBMoE-LLM, a novel Uni-Bi-directional Mixture-of-Experts LLM, which integrates the original unidirectional FFN with a bidirectionally fine-tuned FFN via unsupervised contrastive learning. This MoE-based approach enhances embedding performance while preserving robust generation. Extensive experiments across diverse datasets and model scales validate our attention dependence metric and demonstrate UBMoE-LLM's superior generative quality and reduced hallucination. Code is available at: https://github.com/heiyonghua/ubmoe_llm.


Poster
#E-2702
Rethinking External Slow-Thinking: From Snowball Errors to Probability of Correct Reasoning

Zeyu Gan · Yun Liao · Yong Liu

Test-time scaling, which is also often referred to as slow-thinking, has been demonstrated to enhance multi-step reasoning in large language models (LLMs). However, despite its widespread utilization, the mechanisms underlying slow-thinking methods remain poorly understood. This paper explores the mechanisms of external slow-thinking from a theoretical standpoint. We begin by examining the snowball error effect within the LLM reasoning process and connect it to the likelihood of correct reasoning using information theory. Building on this, we show that external slow-thinking methods can be interpreted as strategies to mitigate the error probability. We further provide a comparative analysis of popular external slow-thinking approaches, ranging from simple to complex, highlighting their differences and interrelationships. Our findings suggest that the efficacy of these methods is not primarily determined by the specific framework employed, and that expanding the search scope or the model's internal reasoning capacity may yield more sustained improvements in the long term. We open-source our code at https://github.com/ZyGan1999/Snowball-Errors-and-Probability.


Poster
#E-2703
Accelerating Large Language Model Reasoning via Speculative Search

Zhihai Wang · Jie Wang · Jilai Pan · Xilin Xia · Huiling Zhen · Mingxuan Yuan · Jianye Hao · Feng Wu

Tree-search-based reasoning methods have significantly enhanced the reasoning capability of large language models (LLMs) by facilitating the exploration of multiple intermediate reasoning steps, i.e., thoughts. However, these methods suffer from substantial inference latency, as they have to generate numerous reasoning thoughts, severely limiting LLM applicability. To address this challenge, we propose a novel Speculative Search (SpecSearch) framework that significantly accelerates LLM reasoning by optimizing thought generation. Specifically, SpecSearch utilizes a small model to strategically collaborate with a large model at both thought and token levels, efficiently generating high-quality reasoning thoughts. The major pillar of SpecSearch is a novel quality-preserving rejection mechanism, which effectively filters out thoughts whose quality falls below that of the large model's outputs. Moreover, we show that SpecSearch preserves comparable reasoning quality to the large model. Experiments on both the Qwen and Llama models demonstrate that SpecSearch significantly outperforms state-of-the-art approaches, achieving up to 2.12$\times$ speedup with comparable reasoning quality.
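
The quality-preserving rejection step can be caricatured as below. The quality estimator and threshold are stand-ins of ours, since the actual mechanism calibrates against the quality of the large model's own outputs.

```python
# Sketch of quality-preserving rejection (illustrative): accept a
# small-model thought only if its estimated quality is not below a
# large-model baseline; otherwise fall back to the large model.
def next_thought(state, small_gen, large_gen, quality, threshold_fn):
    draft = small_gen(state)
    if quality(state, draft) >= threshold_fn(state):
        return draft, "small"             # cheap thought accepted
    return large_gen(state), "large"      # rejected; regenerate expensively

thought, who = next_thought(
    "2 + 3 * 4 = ?",
    small_gen=lambda s: "First multiply: 3 * 4 = 12.",
    large_gen=lambda s: "Apply precedence: 3 * 4 = 12, then add 2.",
    quality=lambda s, t: 0.8,             # stand-in quality estimate
    threshold_fn=lambda s: 0.7,           # stand-in large-model baseline
)
print(who, "->", thought)
```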


Poster
#E-2704
GANQ: GPU-Adaptive Non-Uniform Quantization for Large Language Models

Pengxiang Zhao · Xiaoming Yuan

Large Language Models (LLMs) face significant deployment challenges due to their substantial resource requirements. While low-bit quantized weights can reduce memory usage and improve inference efficiency, current hardware lacks native support for mixed-precision General Matrix Multiplication (mpGEMM), resulting in inefficient dequantization-based implementations. Moreover, uniform quantization methods often fail to capture weight distributions adequately, leading to performance degradation. We propose GANQ (GPU-Adaptive Non-Uniform Quantization), a layer-wise post-training non-uniform quantization framework optimized for hardware-efficient lookup table-based mpGEMM. GANQ achieves superior quantization performance by utilizing a training-free, GPU-adaptive optimization algorithm to efficiently reduce layer-wise quantization errors. Extensive experiments demonstrate that GANQ narrows the perplexity gap to the FP16 baseline relative to state-of-the-art methods for both 3-bit and 4-bit quantization. Furthermore, when deployed on a single NVIDIA RTX 4090 GPU, GANQ's quantized models achieve up to 2.57$\times$ speedup over the baseline, advancing memory and inference efficiency in LLM deployment.
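
As a toy illustration of lookup-table-based non-uniform dequantization (not GANQ's GPU kernel, which fuses the lookup into the GEMM), each weight stores a low-bit index into a learned per-row codebook of floating-point levels:

    import numpy as np

    rows, cols, bits = 4, 8, 3
    codebook = np.sort(np.random.randn(rows, 2 ** bits)).astype(np.float32)  # per-row non-uniform levels
    indices = np.random.randint(0, 2 ** bits, size=(rows, cols))             # stored low-bit weights

    W_dequant = np.take_along_axis(codebook, indices, axis=1)  # LUT gather
    x = np.random.randn(cols).astype(np.float32)
    print(W_dequant @ x)  # here mpGEMM is simulated as dequantize-then-GEMM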


Poster
#E-2705
Diverging Preferences: When do Annotators Disagree and do Models Know?

Michael Zhang · Zhilin Wang · Jena Hwang · Yi Dong · Olivier Delalleau · Yejin Choi · Eunsol Choi · Xiang Ren · Valentina Pyatkin

We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning ten categories across four high-level classes and find that the majority of disagreements are due to factors such as task underspecification or response style. Our findings challenge a standard assumption in reward modeling methods that annotator disagreements can be attributed to simple noise. We then explore how these findings impact two areas of LLM development: reward modeling training and evaluation. In our experiments, we demonstrate how standard reward modeling (e.g., Bradley-Terry) and LLM-as-Judge evaluation methods fail to account for divergence between annotators. These findings highlight challenges in LLM evaluations, which are greatly influenced by divisive features like response style, and in developing pluralistically aligned LLMs. To address these issues, we develop methods for identifying diverging preferences to mitigate their influence in evaluations and during LLM training.


Poster
#E-2706
On Teacher Hacking in Language Model Distillation

Daniil Tiapkin · Daniele Calandriello · Johan Ferret · Sarah Perrin · Nino Vieillard · Alexandre Rame · Mathieu Blondel

Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model, leading to degraded performance on the true objective, in line with Goodhart's law. In this paper, we investigate whether a similar phenomenon, that we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust LMs.


Poster
#E-2707
MoRAgent: Parameter Efficient Agent Tuning with Mixture-of-Roles

Jing Han · Binwei Yan · Tianyu Guo · Zheyuan Bai · Mengyu Zheng · Hanting Chen · Ying Nie

Despite recent advances in fine-tuning large language models (LLMs) to facilitate agent tasks, parameter-efficient fine-tuning (PEFT) methodologies for agents remain largely unexplored. In this paper, we introduce three key strategies for PEFT in agent tasks: 1) Inspired by the increasingly dominant \textit{Reason+Action} paradigm, we first decompose the capabilities necessary for agent tasks into three distinct roles: reasoner, executor, and summarizer. The reasoner is responsible for comprehending the user's query and determining the next role based on the execution trajectory. The executor is tasked with identifying the appropriate functions and parameters to invoke. The summarizer conveys the distilled information from conversations back to the user. 2) We then propose the Mixture-of-Roles (MoR) framework, which comprises three specialized Low-Rank Adaptation (LoRA) groups, each designated to fulfill a distinct role. By focusing on their respective specialized capabilities and engaging in collaborative interactions, these LoRAs collectively accomplish the agent task. 3) To effectively fine-tune the framework, we develop a multi-role data generation pipeline based on publicly available datasets, incorporating role-specific content completion and reliability verification. We conduct extensive experiments and thorough ablation studies on various LLMs and agent benchmarks, demonstrating the effectiveness of the proposed method. This project is publicly available at https://mor-agent.github.io
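
A minimal control-flow sketch of the three roles (our reading of the abstract, not the MoRAgent code): in the real framework each role would be served by its own LoRA group on a shared base model, whereas here they are stubbed as plain functions.

    def run_agent(query, reasoner, executor, summarizer, max_steps=5):
        trajectory = [("user", query)]
        for _ in range(max_steps):
            if reasoner(trajectory) == "executor":     # decide who acts next
                trajectory.append(("tool", executor(trajectory)))
            else:
                return summarizer(trajectory)          # distill results for the user
        return summarizer(trajectory)

    # Toy usage with hypothetical stubs:
    answer = run_agent(
        "What's the weather in Paris?",
        reasoner=lambda t: "executor" if len(t) < 2 else "summarizer",
        executor=lambda t: "get_weather(city='Paris') -> 18C, cloudy",
        summarizer=lambda t: "It is 18C and cloudy in Paris.",
    )
    print(answer)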


Poster
#E-2708
Reward-Augmented Data Enhances Direct Preference Alignment of LLMs

Shenao Zhang · Zhihan Liu · Boyi Liu · Yufeng Zhang · Yingxiang Yang · Yongfei Liu · Liyu Chen · Tao Sun · Zhaoran Wang

Preference alignment in Large Language Models (LLMs) has significantly improved their ability to adhere to human instructions and intentions. However, existing direct alignment algorithms primarily focus on relative preferences and often overlook the qualitative aspects of responses, despite having access to preference data that includes reward scores from judge models during AI feedback. Striving to maximize the implicit reward gap between the chosen and the slightly inferior rejected responses can cause overfitting and unnecessary unlearning of the high-quality rejected responses. Because they are unaware of the reward scores, these algorithms also drive the LLM to indiscriminately favor low-quality chosen responses and fail to generalize to optimal responses that are sparse in the data. To overcome these shortcomings, our study introduces reward-conditioned LLM policies that discern and learn from the entire spectrum of response quality within the dataset, helping extrapolate to more optimal regions. We propose an effective yet simple data relabeling method that conditions the preference pairs on quality scores to construct a reward-augmented dataset. The experiments across various benchmarks and diverse models demonstrate that our approach consistently boosts DPO by a considerable margin. Through comprehensive ablation studies, we demonstrate that our method not only maximizes the utility of preference data but also mitigates the issue of unlearning, demonstrating its broad effectiveness beyond mere data expansion. Our code is available at https://github.com/shenao-zhang/reward-augmented-preference.
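
A hedged sketch of the relabeling step as we understand it from the abstract (field names and the score format are our assumptions, not the released code): each preference pair becomes reward-conditioned training examples, so the policy sees the full quality spectrum rather than only a binary chosen/rejected signal.

    def augment_with_rewards(prompt, chosen, rejected, r_chosen, r_rejected):
        # Condition each response on its own judge-model reward score.
        return [
            {"prompt": f"[reward={r_chosen:.1f}] {prompt}", "response": chosen},
            {"prompt": f"[reward={r_rejected:.1f}] {prompt}", "response": rejected},
        ]

    print(augment_with_rewards("Explain DPO.", "good answer", "weaker answer", 8.5, 4.0))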


Poster
#E-2709
T1: Advancing Language Model Reasoning through Reinforcement Learning and Inference Scaling

Zhenyu Hou · Xin Lv · Rui Lu · Jiajie Zhang · Yujiang Li · Zijun Yao · Juanzi Li · Jie Tang · Yuxiao Dong

Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning tasks. However, existing approaches mainly rely on imitation learning and struggle to achieve effective test-time scaling. While reinforcement learning (RL) holds promise for enabling self-exploration, recent attempts yield modest improvements in complex reasoning. In this paper, we present T1, which scales RL by encouraging exploration and enables the study of inference scaling. We first initialize the LLM using synthesized chain-of-thought data that integrates trial-and-error and self-verification. To scale RL training, we promote increased sampling diversity through over-sampling. We demonstrate that T1 with open LLMs as its base exhibits inference scaling behavior and achieves superior performance on challenging math reasoning benchmarks. More importantly, we present a simple strategy to examine inference scaling, where increased inference budgets directly lead to T1’s better performance without any additional verification. The model weights and training data are publicly available at https://github.com/THUDM/T1.


Poster
#E-2710
LLMs Can Reason Faster Only If We Let Them

Bilgehan Sel · Lifu Huang · Naren Ramakrishnan · Ruoxi Jia · Ming Jin

Large language models (LLMs) are making inroads into classical AI problems such as automated planning, yet key shortcomings continue to hamper their integration. Chain-of-Thought (CoT) struggles in complex multi-step reasoning, and Tree-of-Thoughts requires multiple queries that increase computational overhead. Recently, Algorithm-of-Thoughts (AoT) has shown promise using in-context examples, at the cost of significantly longer solutions compared to CoT. Aimed at bridging the solution length gap between CoT and AoT, this paper introduces AoT-O3, which combines supervised finetuning on AoT-style plans with a reinforcement learning (RL) framework designed to reduce solution length. The RL component uses a reward model that favors concise, valid solutions while maintaining planning accuracy. Empirical evaluations indicate that AoT-O3 shortens solution length by up to 80\% compared to baseline AoT while maintaining or surpassing prior performance. These findings suggest a promising pathway for more efficient, scalable LLM-based planning.
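
The RL reward described above might look something like the following toy function, which favors valid plans and penalizes length beyond a budget; the coefficients, the budget, and the validity check are illustrative assumptions, not the paper's actual reward model.

    def plan_reward(plan_tokens, is_valid, length_budget=512, alpha=0.5):
        if not is_valid:
            return -1.0                                   # invalid plans are penalized outright
        overshoot = max(0, len(plan_tokens) - length_budget)
        return 1.0 - alpha * overshoot / length_budget    # valid and concise is best

    print(plan_reward(["step"] * 400, True))   # 1.0
    print(plan_reward(["step"] * 800, True))   # 0.71875
    print(plan_reward(["step"] * 100, False))  # -1.0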


Poster
#E-2711
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning

Tianjian Li · Daniel Khashabi

Aligning language models with human preferences relies on pairwise preference datasets. While some studies suggest that on-policy data consistently outperforms off-policy data for preference learning, others indicate that the advantages of on-policy data are task-dependent, highlighting the need for a systematic exploration of their interplay. In this work, we show that on-policy and off-policy data offer complementary strengths: on-policy data is particularly effective for reasoning tasks like math and coding, while off-policy data performs better on subjective tasks such as creative writing and making personal recommendations. Guided by these findings, we introduce SimpleMix, an approach to combine the complementary strengths of on-policy and off-policy preference learning by simply mixing these two data sources. Our empirical results across diverse tasks and benchmarks demonstrate that SimpleMix substantially improves language model alignment. Specifically, SimpleMix improves upon on-policy DPO and off-policy DPO by an average of 6.03 on Alpaca Eval 2.0. Moreover, it surpasses prior approaches that are much more complex in combining on- and off-policy data, such as HyPO and DPO-Mix-P, by an average of 3.05. These findings validate the effectiveness and efficiency of SimpleMix for enhancing preference-based alignment.
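
Because the method is deliberately simple, a faithful sketch fits in a few lines (our rendition; the sampling details and the 50/50 default are assumptions):

    import random

    def simple_mix(on_pairs, off_pairs, on_fraction=0.5, seed=0):
        # Mix on-policy and off-policy preference pairs into one training set.
        rng = random.Random(seed)
        n = 2 * min(len(on_pairs), len(off_pairs))
        n_on = min(int(n * on_fraction), len(on_pairs))
        mixed = rng.sample(on_pairs, n_on) + rng.sample(off_pairs, min(n - n_on, len(off_pairs)))
        rng.shuffle(mixed)
        return mixed

    on = [("prompt", f"on_chosen_{i}", f"on_rejected_{i}") for i in range(4)]
    off = [("prompt", f"off_chosen_{i}", f"off_rejected_{i}") for i in range(4)]
    print(simple_mix(on, off))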


Poster
#E-2712
RULEBREAKERS: Challenging LLMs at the Crossroads between Formal Logic and Human-like Reasoning

Jason Chan · Robert Gaizauskas · Zhixue Zhao

Formal logic enables computers to reason in natural language by representing sentences in symbolic forms and applying rules to derive conclusions. However, in what our study characterizes as "rulebreaker" scenarios, this method can lead to conclusions that are typically not inferred or accepted by humans given their common sense and factual knowledge. Inspired by works in cognitive science, we create RULEBREAKERS, the first dataset for rigorously evaluating the ability of large language models (LLMs) to recognize and respond to rulebreakers (versus non-rulebreakers) in a knowledge-informed and human-like manner. Evaluating seven LLMs, we find that most models achieve mediocre accuracy on RULEBREAKERS and exhibit some tendency to over-rigidly apply logical rules, unlike what is expected from typical human reasoners. Further analysis suggests that this apparent failure is potentially associated with the models' poor utilization of their world knowledge and their attention distribution patterns. Whilst revealing a limitation of current LLMs, our study also provides a timely counterbalance to a growing body of recent works that propose methods relying on formal logic to improve LLMs' general reasoning capabilities, highlighting their risk of further increasing divergence between LLMs and human-like reasoning.


Poster
#E-2800
Statistical Hypothesis Testing for Auditing Robustness in Language Models

Paulius Rauba · Qiyao Wei · Mihaela van der Schaar

Consider the problem of testing whether the outputs of a large language model (LLM) system change under an arbitrary intervention, such as an input perturbation or changing the model variant. We cannot simply compare two LLM outputs since they might differ due to the stochastic nature of the system, nor can we compare the entire output distribution due to computational intractability. While methods for analyzing text-based outputs exist, they focus on fundamentally different problems, such as measuring bias or fairness. To this end, we introduce distribution-based perturbation analysis, a framework that reformulates LLM perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling, enabling tractable inference without restrictive distributional assumptions. The framework is (i) model-agnostic; (ii) supports the evaluation of arbitrary input perturbations on any black-box LLM; (iii) yields interpretable p-values; (iv) supports multiple perturbations via controlled error rates; and (v) provides scalar effect sizes. We demonstrate the usefulness of the framework across multiple case studies, showing how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models. Above all, we see this as a reliable frequentist hypothesis testing framework for LLM auditing.
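
The testing recipe can be illustrated with a basic permutation test on semantic similarity scores (a minimal sketch of the frequentist machinery described above, not the authors' framework; the Gaussian draws stand in for similarities computed in a real embedding space):

    import numpy as np

    def permutation_pvalue(null_scores, alt_scores, n_perm=10000, seed=0):
        rng = np.random.default_rng(seed)
        observed = alt_scores.mean() - null_scores.mean()
        pooled = np.concatenate([null_scores, alt_scores])
        count = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)  # permute the group labels
            stat = pooled[len(null_scores):].mean() - pooled[:len(null_scores)].mean()
            count += stat >= observed
        return (count + 1) / (n_perm + 1)

    null = np.random.default_rng(1).normal(0.0, 1.0, 200)  # similarity shifts without perturbation
    alt = np.random.default_rng(2).normal(0.3, 1.0, 200)   # similarity shifts under the perturbation
    print(permutation_pvalue(null, alt))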


Poster
#E-2801
Thickness-aware E(3)-Equivariant 3D Mesh Neural Networks

Sungwon Kim · Namkyeong Lee · Yunyoung Doh · Seungmin Shin · Guimok Cho · Seung-Won Jeon · Sangkook Kim · Chanyoung Park

Mesh-based 3D static analysis methods have recently emerged as efficient alternatives to traditional computational numerical solvers, significantly reducing computational costs and runtime for various physics-based analyses. However, these methods primarily focus on surface topology and geometry, often overlooking the inherent thickness of real-world 3D objects, which exhibits high correlations and similar behavior between opposing surfaces. This limitation arises from the disconnected nature of these surfaces and the absence of internal edge connections within the mesh. In this work, we propose a novel framework, the Thickness-aware E(3)-Equivariant 3D Mesh Neural Network (T-EMNN), that effectively integrates the thickness of 3D objects while maintaining the computational efficiency of surface meshes. Additionally, we introduce data-driven coordinates that encode spatial information while preserving E(3)-equivariance or invariance properties, ensuring consistent and robust analysis. Evaluations on a real-world industrial dataset demonstrate the superior performance of T-EMNN in accurately predicting node-level 3D deformations, effectively capturing thickness effects while maintaining computational efficiency.


Poster
#E-2802
Adaptive Message Passing: A General Framework to Mitigate Oversmoothing, Oversquashing, and Underreaching

Federico Errica · Henrik Christiansen · Viktor Zaverkin · Takashi Maruyama · Mathias Niepert · Francesco Alesiani

Long-range interactions are essential for the correct description of complex systems in many scientific fields. The price to pay for including them in the calculations, however, is a dramatic increase in the overall computational costs. Recently, deep graph networks have been employed as efficient, data-driven models for predicting properties of complex systems represented as graphs. These models rely on a message passing strategy that should, in principle, capture long-range information without explicitly modeling the corresponding interactions. In practice, most deep graph networks cannot really model long-range dependencies due to the intrinsic limitations of (synchronous) message passing, namely oversmoothing, oversquashing, and underreaching. This work proposes a general framework that \textit{learns to mitigate} these limitations: within a variational inference framework, we endow message passing architectures with the ability to adapt their depth and filter messages along the way. With theoretical and empirical arguments, we show that this strategy better captures long-range interactions, by competing with the state of the art on five node and graph prediction datasets.


Poster
#E-2803
EAGLES: Towards Effective, Efficient, and Economical Federated Graph Learning via Unified Sparsification

Zitong Shi · Guancheng Wan · Wenke Huang · Guibin Zhang · He Li · Carl Yang · Mang Ye

Federated Graph Learning (FGL) has gained significant attention as a privacy-preserving approach to collaborative learning, but the computational demands increase substantially as datasets grow and Graph Neural Network (GNN) layers deepen. To address these challenges, we propose $\textbf{EAGLES}$, a unified sparsification framework. EAGLES applies client-consensus parameter sparsification to generate multiple unbiased subnetworks at varying sparsity levels, reducing the need for iterative adjustments and mitigating performance degradation. In the graph structure domain, we introduce a dual-expert approach: a $\textit{graph sparsification expert}$ uses multi-criteria node-level sparsification, and a $\textit{graph synergy expert}$ integrates contextual node information to produce optimal sparse subgraphs. Furthermore, the framework introduces a novel distance metric that leverages node contextual information to measure structural similarity among clients, fostering effective knowledge sharing. We also introduce the $\textbf{Harmony Sparsification Principle}$, through which EAGLES balances model performance with lightweight graph and model structures. Extensive experiments demonstrate its superiority, achieving competitive performance on various datasets, such as reducing training FLOPS by 82\% $\downarrow$ and communication costs by 80\% $\downarrow$ on the ogbn-proteins dataset, while maintaining high performance.


Poster
#E-2804
GPEN: Global Position Encoding Network for Enhanced Subgraph Representation Learning

Nannan Wu · Yuming Huang · Yiming Zhao · Jie Chen · Wenjun Wang

Subgraph representation learning has attracted growing interest due to its wide applications in various domains. However, existing methods primarily focus on local neighborhood structures while overlooking the significant impact of global structural information, in particular the influence of multi-hop neighbors beyond immediate neighborhoods. This presents two key challenges: how to effectively capture the structural relationships between distant nodes, and how to prevent excessive aggregation of global structural information from weakening the discriminative ability of subgraph representations. To address these challenges, we propose GPEN (Global Position Encoding Network). GPEN leverages a hierarchical tree structure to encode each node's global position based on its path distance to the root node, enabling a systematic way to capture relationships between distant nodes. Furthermore, we introduce a boundary-aware convolution module that selectively integrates global structural information while maintaining the unique structural patterns of each subgraph. Extensive experiments on eight public datasets show that GPEN significantly outperforms state-of-the-art methods in subgraph representation learning.
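
The global position idea can be grounded with a small example: encode each node by its path distance to a root via breadth-first search (our simplified reading; GPEN's actual encoding is hierarchical and learned).

    from collections import deque

    def root_distances(adj, root=0):
        # BFS depth of every node relative to the root of the hierarchy.
        dist = {root: 0}
        queue = deque([root])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist

    adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
    print(root_distances(adj))  # {0: 0, 1: 1, 2: 1, 3: 2}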


Poster
#E-2805
Open Your Eyes: Vision Enhances Message Passing Neural Networks in Link Prediction

Yanbin Wei · Xuehao Wang · Zhan Zhuang · Yang Chen · Shuhao Chen · Yulong Zhang · James Kwok · Yu Zhang

Message-passing graph neural networks (MPNNs) and structural features (SFs) are cornerstones for the link prediction task. However, as a common and intuitive mode of understanding, the potential of visual perception has been overlooked in the MPNN community. For the first time, we equip MPNNs with vision structural awareness by proposing an effective framework called Graph Vision Network (GVN), along with a more efficient variant (E-GVN). Extensive empirical results demonstrate that with the proposed frameworks, GVN consistently benefits from the vision enhancement across seven link prediction datasets, including challenging large-scale graphs. Such improvements are compatible with existing state-of-the-art (SOTA) methods and GVNs achieve new SOTA results, thereby underscoring a promising novel direction for link prediction.


Poster
#E-2806
Positional Encoding meets Persistent Homology on Graphs

Yogesh Verma · Amauri Souza · Vikas Garg

The local inductive bias of message-passing graph neural networks (GNNs) hampers their ability to exploit key structural information (e.g., connectivity and cycles). Positional encoding (PE) and Persistent Homology (PH) have emerged as two promising approaches to mitigate this issue. PE schemes endow GNNs with location-aware features, while PH methods enhance GNNs with multiresolution topological features. However, a rigorous theoretical characterization of the relative merits and shortcomings of PE and PH has remained elusive. We bridge this gap by establishing that neither paradigm is more expressive than the other, providing novel constructions where one approach fails but the other succeeds. Our insights inform the design of a novel learnable method, PiPE (Persistence-informed Positional Encoding), which is provably more expressive than both PH and PE. PiPE demonstrates strong performance across a variety of tasks (e.g., molecule property prediction, graph classification, and out-of-distribution generalization), thereby advancing the frontiers of graph representation learning. Code is available at https://github.com/Aalto-QuML/PIPE


Poster
#E-2807
ProxSparse: Regularized Learning of Semi-Structured Sparsity Masks for Pretrained LLMs

Hongyi Liu · Rajarshi Saha · Zhen Jia · Youngsuk Park · Jiaji Huang · Shoham Sabach · Yu-Xiang Wang · George Karypis

Large Language Models (LLMs) have demonstrated exceptional performance in natural language processing tasks, yet their massive size makes serving them inefficient and costly. Semi-structured pruning has emerged as an effective method for model acceleration, but existing approaches are suboptimal because they focus on local, layer-wise optimizations using heuristic rules, failing to leverage global feedback. We present ProxSparse, a learning-based framework for mask selection enabled by regularized optimization. ProxSparse transforms the rigid, non-differentiable mask selection process into a smoother optimization procedure, allowing gradual mask exploration with flexibility. ProxSparse does not involve additional weight updates once the mask is determined. Our extensive evaluations on 7 widely used models show that ProxSparse consistently outperforms previously proposed semi-structured mask selection methods with significant improvement, demonstrating the effectiveness of our learned approach towards semi-structured pruning.


Poster
#E-2808
STP: Self-play LLM Theorem Provers with Iterative Conjecturing and Proving

Kefan Dong · Tengyu Ma

A fundamental challenge in formal theorem proving by LLMs is the lack of high-quality training data. Although reinforcement learning or expert iteration partially mitigates this issue by alternating between LLM generating proofs and finetuning them on correctly generated ones, performance quickly plateaus due to the scarcity of correct proofs (sparse rewards). To keep improving the models with limited data, we draw inspiration from mathematicians, who continuously develop new results, partly by proposing novel conjectures or exercises (which are often variants of known results) and attempting to solve them. We design the Self-play Theorem Prover (STP) that simultaneously takes on two roles, conjecturer and prover, each providing training signals to the other. The conjecturer is trained iteratively on previously generated conjectures that are barely provable by the current prover, which incentivizes it to generate increasingly challenging conjectures over time. The prover attempts to prove the conjectures with standard expert iteration. We evaluate STP with both Lean and Isabelle formal verifiers. With 51.3 billion tokens generated during training in Lean, STP proves 28.5% of the statements in the LeanWorkbook dataset, doubling the previous best result of 13.1% achieved through expert iteration. The final model achieves state-of-the-art performance among whole-proof generation methods on miniF2F-test (65.0\%), ProofNet-test (23.9\%) and PutnamBench (8/644) with pass@3200.


Poster
#E-2809
Let LLM Tell What to Prune and How Much to Prune

Mingzhe Yang · Sihao Lin · Changlin Li · Xiaojun Chang

Large language models (LLMs) have revolutionized various AI applications. However, their billions of parameters pose significant challenges for practical deployment. Structured pruning is a hardware-friendly compression technique and has received widespread attention. Nonetheless, existing literature typically targets a single structure of LLMs. We observe that the structure units of LLMs differ in terms of inference cost and functionality. Therefore, pruning a single structure unit in isolation often results in an imbalance between performance and efficiency. In addition, previous works mainly employ a prescribed pruning ratio. Since the significance of LLM modules may vary, it is ideal to distribute the pruning load to a specific structure unit according to its role within LLMs. To address the two issues, we propose a pruning method that targets multiple LLM modules with dynamic pruning ratios. Specifically, we find the intrinsic properties of LLMs can guide us to determine the importance of each module and thus distribute the pruning load on demand, i.e., what to prune and how much to prune. This is achieved by quantifying the complex interactions within LLMs. Extensive experiments on multiple benchmarks and LLM variants demonstrate that our method effectively balances the trade-off between efficiency and performance.


Poster
#E-2810
Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

Lutfi Erdogan · Hiroki Furuta · Sehoon Kim · Nicholas Lee · Suhong Moon · Gopala Anumanchipalli · Kurt Keutzer · Amir Gholaminejad

Large language models (LLMs) have shown remarkable advancements in enabling language agents to tackle simple tasks. However, applying them for complex, multi-step, long-horizon tasks remains a challenge. Recent work has found success by separating high-level planning from low-level execution, which enables the model to effectively balance high-level planning objectives and low-level execution details. However, generating accurate plans remains difficult since LLMs are not inherently trained for this task. To address this, we propose Plan-and-Act, a novel framework that incorporates explicit planning into LLM-based agents and enhances plan generation through a scalable synthetic data generation method. Plan-and-Act consists of a Planner model which generates structured, high-level plans to achieve user goals, and an Executor model that translates these plans into environment-specific actions. To train the Planner effectively, we introduce a synthetic data generation method that annotates ground-truth trajectories with feasible plans, augmented with diverse and extensive examples to enhance generalization. We evaluate Plan-and-Act using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager.


Poster
#E-2811
FedPHA: Federated Prompt Learning for Heterogeneous Client Adaptation

Chengying Fang · Wenke Huang · Guancheng Wan · Yihao Yang · Mang Ye

Federated Prompt Learning (FPL) adapts pre-trained Vision-Language Models (VLMs) to federated learning through prompt tuning, leveraging their transferable representations and strong generalization capabilities. Traditional methods often require uniform prompt lengths for federated aggregation, limiting adaptability to clients with diverse prompt lengths and distribution biases. In this paper, we propose Federated Prompt Learning for Heterogeneous Client Adaptation (FedPHA), a novel framework that combines a fixed-length global prompt for efficient aggregation with local prompts of varying lengths to capture client-specific data characteristics. Additionally, FedPHA designs Singular Value Decomposition (SVD) based projection and bidirectional alignment to disentangle global conflicts arising from client heterogeneity, ensuring that personalized client tasks effectively utilize non-harmful global knowledge. This approach ensures that global knowledge improves model generalization while local knowledge preserves local optimization. Experimental results validate the effectiveness of FedPHA in achieving a balance between global and personalized knowledge in federated learning scenarios.


Poster
#E-2812
The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions

Wenbo Pan · Zhichao Liu · Qiguang Chen · Xiangyang Zhou · Yu Haining · Xiaohua Jia

Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights into safety alignment vulnerability from a multi-dimensional perspective.
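
For readers unfamiliar with the single-direction baseline that this work generalizes, here is a common textbook construction (our sketch with random stand-in activations, not the paper's multi-dimensional analysis): estimate a refusal direction as a difference of mean activations, then remove its component from a hidden state.

    import numpy as np

    rng = np.random.default_rng(0)
    harmful_acts = rng.normal(size=(64, 256))    # activations on refused prompts
    harmless_acts = rng.normal(size=(64, 256))   # activations on complied prompts

    direction = harmful_acts.mean(0) - harmless_acts.mean(0)
    direction /= np.linalg.norm(direction)

    h = rng.normal(size=256)                      # one hidden state
    h_ablated = h - (h @ direction) * direction   # project out the refusal direction
    print(abs(h_ablated @ direction) < 1e-9)      # True: the component is removed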


Poster
#E-2900
From Theory to Practice: Rethinking Green and Martin Kernels for Unleashing Graph Transformers

Yoon Hyeok Lee · Jaemin Park · Taejin Paik · Doyun Kim · Bosun Hwang

Graph Transformers (GTs) have emerged as a powerful alternative to message-passing neural networks, yet their performance heavily depends on effectively embedding structural inductive biases. In this work, we introduce novel structural encodings (SEs) grounded in a rigorous analysis of random walks (RWs), leveraging Green and Martin kernels that we have carefully redefined for AI applications while preserving their mathematical essence. These kernels capture the long-term behavior of RWs on graphs and allow for enhanced representation of complex topologies, including non-aperiodic and directed acyclic substructures. Empirical evaluations across eight benchmark datasets demonstrate strong performance across diverse tasks, notably in molecular and circuit domains. We attribute this performance boost to the improved ability of our kernel-based SEs to encode intricate structural information, thereby strengthening the global attention and inductive bias within GTs. This work highlights the effectiveness of theoretically grounded kernel methods in advancing Transformer-based models for graph learning.


Spotlight Poster
#E-2901
AutoGFM: Automated Graph Foundation Model with Adaptive Architecture Customization

Haibo Chen · Xin Wang · Zeyang Zhang · Haoyang Li · Ling Feng · Wenwu Zhu

Graph foundation models (GFMs) aim to share graph knowledge across diverse domains and tasks to boost graph machine learning. However, existing GFMs rely on hand-designed and fixed graph neural network (GNN) architectures, failing to utilize optimal architectures w.r.t. specific domains and tasks, inevitably leading to suboptimal performance in diverse graph domains and tasks. In this paper, we explore graph neural architecture search (GNAS) for GFMs for the first time, which suffers from the problem of architecture inconsistency, i.e., the optimal architectures for different tasks and domains vary. We tackle this problem by discovering an invariant graph-architecture relationship across domains and tasks, which imposes three challenges: i) how to capture invariant and variant patterns; ii) how to customize architectures to adapt to diverse domains and tasks; iii) how to mitigate the data domination phenomenon during the architecture search process. To address these challenges, we propose Automated Graph Foundation Model with Adaptive Architecture Customization (AutoGFM), providing a theoretical analysis to demonstrate the limitations of existing GNAS. Specifically, we first propose a disentangled contrastive graph encoder to learn invariant and variant patterns. Then, we design an invariant-guided architecture customization strategy to customize architectures for data from diverse domains and tasks. Finally, we propose a curriculum architecture customization mechanism to mitigate the phenomenon of particular data dominating the search process. Extensive experiments demonstrate that AutoGFM outperforms baselines, achieving state-of-the-art performance.


Poster
#E-2902
Directed Graph Grammars for Sequence-based Learning

Michael Sun · Orion Foo · Gang Liu · Wojciech Matusik · Jie Chen

Directed acyclic graphs (DAGs) are a class of graphs commonly used in practice, with examples that include electronic circuits, Bayesian networks, and neural architectures. While many effective encoders exist for DAGs, it remains challenging to decode them in a principled manner, because the nodes of a DAG can have many different topological orders. In this work, we propose a grammar-based approach to constructing a principled, compact and equivalent sequential representation of a DAG. Specifically, we view a graph as derivations over an unambiguous grammar, where the DAG corresponds to a unique sequence of production rules. Equivalently, the procedure to construct such a description can be viewed as a lossless compression of the data. Such a representation has many uses, including building a generative model for graph generation, learning a latent space for property prediction, and leveraging the sequence representational continuity for Bayesian Optimization over structured data.


Poster
#E-2903
WILTing Trees: Interpreting the Distance Between MPNN Embeddings

Masahiro Negishi · Thomas Gärtner · Pascal Welke

We investigate the distance function learned by message passing neural networks (MPNNs) in specific tasks, aiming to capture the functional distance between prediction targets that MPNNs implicitly learn. This contrasts with previous work, which links MPNN distances on arbitrary tasks to structural distances on graphs that ignore task-specific information. To address this gap, we distill the distance between MPNN embeddings into an interpretable graph distance. Our method uses optimal transport on the Weisfeiler Leman Labeling Tree (WILT), where the edge weights reveal subgraphs that strongly influence the distance between embeddings. This approach generalizes two well-known graph kernels and can be computed in linear time. Through extensive experiments, we demonstrate that MPNNs define the relative position of embeddings by focusing on a small set of subgraphs that are known to be functionally important in the domain.


Poster
#E-2904
Energy-Based Preference Model Offers Better Offline Alignment than the Bradley-Terry Preference Model

Yuzhong Hong · Hanshan Zhang · Junwei Bao · Hongfei Jiang · yang song

Since the debut of DPO, it has been shown that aligning a target LLM with human preferences via the KL-constrained RLHF loss is mathematically equivalent to a special kind of reward modeling task. Concretely, the task requires: 1) using the target LLM to parameterize the reward model, and 2) tuning the reward model so that it has a 1:1 linear relationship with the true reward. However, we identify a significant issue: the DPO loss might have multiple minimizers, of which only one satisfies the required linearity condition. The problem arises from a well-known issue of the underlying Bradley-Terry preference model: it does not always have a unique maximum likelihood estimator (MLE). Consequently, the minimizer of the RLHF loss might be unattainable because it is merely one among many minimizers of the DPO loss. As a better alternative, we propose an energy-based preference model (EBM) that always has a unique MLE, inherently satisfying the linearity requirement. To showcase the practical utility of replacing the Bradley-Terry model (BTM) with our EBM in the context of offline alignment, we adapt a simple yet scalable objective function from the recent literature on fitting EBMs and name it Energy Preference Alignment (EPA). Empirically, we demonstrate that EPA consistently delivers better performance on open benchmarks compared to DPO, thereby validating the theoretical superiority of our EBM.
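
For context, the Bradley-Terry preference probability at issue has the standard form $p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$ (our addition for exposition, not quoted from the paper). Because this likelihood depends on rewards only through their difference, $r$ and $r + c(x)$ for any per-prompt constant $c(x)$ induce identical likelihoods, which is one concrete way the Bradley-Terry MLE can fail to pin down a unique reward.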


Poster
#E-2905
Expressive Power of Graph Neural Networks for (Mixed-Integer) Quadratic Programs

Ziang Chen · Xiaohan Chen · Jialin Liu · Xinshang Wang · Wotao Yin

Quadratic programming (QP) is the most widely applied category of problems in nonlinear programming. Many applications require real-time/fast solutions, though not necessarily with high precision. Existing methods either involve matrix decomposition or use the preconditioned conjugate gradient method. For relatively large instances, these methods cannot achieve the real-time requirement unless there is an effective preconditioner. Recently, graph neural networks (GNNs) opened new possibilities for QP. Some promising empirical studies of applying GNNs for QP tasks show that GNNs can capture key characteristics of an optimization instance and provide adaptive guidance accordingly to crucial configurations during the solving process, or directly provide an approximate solution. However, the theoretical understanding of GNNs in this context remains limited. Specifically, it is unclear what GNNs can and cannot achieve for QP tasks in theory. This work addresses this gap in the context of linearly constrained QP tasks. In the continuous setting, we prove that message-passing GNNs can universally represent fundamental properties of quadratic programs, including feasibility, optimal objective values, and optimal solutions. In the more challenging mixed-integer setting, while GNNs are not universal approximators, we identify a subclass of QP problems that GNNs can reliably represent.


Poster
#E-2906
Delay-DSGN: A Dynamic Spiking Graph Neural Network with Delay Mechanisms for Evolving Graph

Zhiqiang Wang · Jianghao Wen · Jianqing Liang

Dynamic graph representation learning using Spiking Neural Networks (SNNs) exploits the temporal spiking behavior of neurons, offering advantages in capturing the temporal evolution and sparsity of dynamic graphs. However, existing SNN-based methods often fail to effectively capture the impact of latency in information propagation on node representations. To address this, we propose Delay-DSGN, a dynamic spiking graph neural network incorporating a learnable delay mechanism. By leveraging synaptic plasticity, the model dynamically adjusts connection weights and propagation speeds, enhancing temporal correlations and enabling historical data to influence future representations. Specifically, we introduce a Gaussian delay kernel into the neighborhood aggregation process at each time step, adaptively delaying historical information to future time steps and mitigating information forgetting. Experiments on three large-scale dynamic graph datasets demonstrate that Delay-DSGN outperforms eight state-of-the-art methods, achieving the best results in node classification tasks. We also theoretically derive the constraint conditions between the Gaussian kernel's standard deviation and size, ensuring stable training and preventing gradient explosion and vanishing issues.


Poster
#E-2907
PieClam: A Universal Graph Autoencoder Based on Overlapping Inclusive and Exclusive Communities

Daniel Zilberg · Ron Levie

We propose PieClam (Prior Inclusive Exclusive Cluster Affiliation Model): a graph autoencoder, where nodes are embedded into a code space by an algorithm that maximizes the log-likelihood of the decoded graph. PieClam is a community affiliation model that extends well-known methods like BigClam in two main manners. First, instead of the decoder being defined via pairwise interactions between the nodes in the code space, we also incorporate a learned prior on the distribution of nodes in the code space, turning our method into a graph generative model. Secondly, we generalize the notion of communities by allowing not only sets of nodes with strong connectivity, which we call inclusive communities, but also sets of nodes with strong disconnection, which we call exclusive communities. By introducing a new graph similarity measure, called the log cut distance, we show that PieClam is a universal autoencoder, able to uniformly approximately reconstruct any graph. Our method is shown to obtain competitive performance in graph anomaly detection and link prediction benchmarks.


Spotlight Poster
#E-2908
Covered Forest: Fine-grained generalization analysis of graph neural networks

Antonis Vasileiou · Ben Finkelshtein · Floris Geerts · Ron Levie · Christopher Morris

The expressive power of message-passing graph neural networks (MPNNs) is reasonably well understood, primarily through combinatorial techniques from graph isomorphism testing. However, MPNNs' generalization abilities---making meaningful predictions beyond the training set---remain less explored. Current generalization analyses often overlook graph structure, limit the focus to specific aggregation functions, and assume the impractical, hard-to-optimize $0$-$1$ loss function. Here, we extend recent advances in graph similarity theory to assess the influence of graph structure, aggregation, and loss functions on MPNNs' generalization abilities. Our empirical study supports our theoretical insights, improving our understanding of MPNNs' generalization properties.


Poster
#E-2909
MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

Yuang Zhang · Jiaxi Gu · Li-Wen Wang · Han Wang · JunqiCheng · Yuefeng Zhu · FangYuan Zou

In recent years, while generative AI has advanced significantly in image generation, video generation continues to face challenges in controllability, length, and detail quality, which hinder its application. We present MimicMotion, a framework for generating high-quality human videos of arbitrary length using motion guidance. Our approach has several highlights. Firstly, we introduce confidence-aware pose guidance that ensures high frame quality and temporal smoothness. Secondly, we introduce regional loss amplification based on pose confidence, which reduces image distortion in key regions. Lastly, we propose a progressive latent fusion strategy to generate long and smooth videos. Experiments demonstrate the effectiveness of our approach in producing high-quality human motion videos. Videos and comparisons are available at https://tencent.github.io/MimicMotion.


Poster
#E-2910
SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer

Enze Xie · Junsong Chen · Yuyang Zhao · Jincheng YU · Ligeng Zhu · Yujun Lin · Zhekai Zhang · Muyang Li · Junyu Chen · Han Cai · Bingchen Liu · Zhou Daquan · Song Han

This paper presents SANA-1.5, a linear Diffusion Transformer for efficient scaling in text-to-image generation. Building upon SANA-1.0, we introduce three key innovations: (1) Efficient Training Scaling: A depth-growth paradigm that enables scaling from 1.6B to 4.8B parameters with significantly reduced computational resources, combined with a memory-efficient 8-bit optimizer. (2) Model Depth Pruning: A block importance analysis technique for efficient model compression to arbitrary sizes with minimal quality loss. (3) Inference-time Scaling: A repeated sampling strategy that trades computation for model capacity, enabling smaller models to match larger model quality at inference time. Through these strategies, SANA-1.5 achieves a text-image alignment score of 0.72 on GenEval, which can be further improved to 0.80 through inference scaling, establishing a new SoTA on GenEval benchmark. These innovations enable efficient model scaling across different compute budgets while maintaining high quality, making high-quality image generation more accessible.


Spotlight Poster
#E-2911
Normalizing Flows are Capable Generative Models

Shuangfei Zhai · Ruixiang Zhang · Preetum Nakkiran · David Berthelot · Jiatao Gu · Huangjie Zheng · Tianrong Chen · Miguel Angel Bautista Martin · Navdeep Jaitly · Joshua M Susskind

Normalizing Flows (NFs) are likelihood-based models for continuous inputs. They have demonstrated promising results on both density estimation and generative modeling tasks, but have received relatively little attention in recent years. In this work, we demonstrate that NFs are more powerful than previously believed. We present TarFlow: a simple and scalable architecture that enables highly performant NF models. TarFlow can be thought of as a Transformer-based variant of Masked Autoregressive Flows (MAFs): it consists of a stack of autoregressive Transformer blocks on image patches, alternating the autoregression direction between layers. TarFlow is straightforward to train end-to-end, and capable of directly modeling and generating pixels. We also propose three key techniques to improve sample quality: Gaussian noise augmentation during training, a post training denoising procedure, and an effective guidance method for both class-conditional and unconditional settings. Putting these together, TarFlow sets new state-of-the-art results on likelihood estimation for images, beating the previous best methods by a large margin, and generates samples with quality and diversity comparable to diffusion models, for the first time with a stand-alone NF model. We make our code available at https://github.com/apple/ml-tarflow.


Poster
#E-2912
Efficient Molecular Conformer Generation with SO(3)-Averaged Flow Matching and Reflow

Zhonglin Cao · Mario Geiger · Allan Costa · Danny Reidenbach · Karsten Kreis · Tomas Geffner · Franco Pellegrini · Guoqing Zhou · Emine Kucukbenli

Fast and accurate generation of molecular conformers is desired for downstream computational chemistry and drug discovery tasks. Currently, training and sampling state-of-the-art diffusion or flow-based models for conformer generation require significant computational resources. In this work, we build upon flow-matching and propose two mechanisms for accelerating training and inference of generative models for 3D molecular conformer generation. For fast training, we introduce the SO(3)-Averaged Flow training objective, which leads to faster convergence to better generation quality compared to conditional optimal transport flow or Kabsch-aligned flow. We demonstrate that models trained using SO(3)-Averaged Flow can reach state-of-the-art conformer generation quality. For fast inference, we show that the reflow and distillation methods of flow-based models enable few-steps or even one-step molecular conformer generation with high quality. The training techniques proposed in this work show a path towards highly efficient molecular conformer generation with flow-based models.


Poster
#E-3001
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song Generation

Zihan Liu · Shuangrui Ding · Zhixiong Zhang · Xiaoyi Dong · Pan Zhang · Yuhang Zang · Yuhang Cao · Dahua Lin · Jiaqi Wang

Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, leading to cumbersome training and inference pipelines, as well as suboptimal overall generation quality due to error accumulation across stages. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The code is available at https://github.com/LiuZH-19/SongGen.


Poster
#E-3002
Introducing 3D Representation for Dense Volume-to-Volume Translation via Score Fusion

Xiyue Zhu · Dou Kwark · Ruike Zhu · Kaiwen Hong · Yiqi Tao · Shirui Luo · Yudu Li · Zhi-Pei Liang · Volodymyr Kindratenko

In volume-to-volume translations in medical images, existing models often struggle to capture the inherent volumetric distribution using 3D voxel-space representations, due to high computational and dataset demands. We present Score-Fusion, a novel volumetric translation model that effectively learns 3D representations by ensembling perpendicularly trained 2D diffusion models in score function space. By carefully initializing our model to start with an average of 2D models as in existing models, we reduce 3D training to a fine-tuning process, mitigating computational and data demands. Furthermore, we explicitly design the 3D model's hierarchical layers to learn ensembles of 2D features, further enhancing efficiency and performance. Moreover, Score-Fusion naturally extends to multi-modality settings by fusing diffusion models conditioned on different inputs for flexible, accurate integration. We demonstrate that 3D representation is essential for better performance in downstream recognition tasks, such as tumor segmentation, where most segmentation models are based on 3D representation. Extensive experiments demonstrate that Score-Fusion achieves superior accuracy and volumetric fidelity in 3D medical image super-resolution and modality translation. Additionally, we extend Score-Fusion to video super-resolution by integrating 2D diffusion models on time-space slices with a spatial-temporal video diffusion backbone, highlighting its potential for general-purpose volume translation and providing broader insight into learning-based approaches for score function fusion.
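
A toy sketch of fusing per-axis 2D scores into a 3D score (our illustration of the core idea, not the Score-Fusion architecture; the toy lambda stands in for trained 2D diffusion score models):

    import numpy as np

    def fused_score(volume, score_xy, score_xz, score_yz):
        # Average score estimates from 2D models applied along perpendicular slices.
        s = np.zeros_like(volume)
        for z in range(volume.shape[0]):
            s[z] += score_xy(volume[z])              # axial slices
        for y in range(volume.shape[1]):
            s[:, y] += score_xz(volume[:, y])        # coronal slices
        for x in range(volume.shape[2]):
            s[:, :, x] += score_yz(volume[:, :, x])  # sagittal slices
        return s / 3.0

    vol = np.random.randn(8, 8, 8)
    toy = lambda sl: -sl                              # stand-in 2D score model
    print(fused_score(vol, toy, toy, toy).shape)      # (8, 8, 8)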


Poster
#E-3004
ZipAR: Parallel Autoregressive Image Generation through Spatial Locality

Yefei He · Feng Chen · Yuanyu He · Shaoxuan He · Hong Zhou · Kaipeng Zhang · Bohan Zhuang

In this paper, we propose ZipAR, a training-free, plug-and-play parallel decoding framework for accelerating autoregressive (AR) visual generation. The motivation stems from the observation that images exhibit local structures, and spatially distant regions tend to have minimal interdependence. Given a partially decoded set of visual tokens, in addition to the original next-token prediction scheme in the row dimension, the tokens corresponding to spatially adjacent regions in the column dimension can be decoded in parallel. To ensure alignment with the contextual requirements of each token, we employ an adaptive local window assignment scheme with rejection sampling analogous to speculative decoding. By decoding multiple tokens in a single forward pass, the number of forward passes required to generate an image is significantly reduced, resulting in a substantial improvement in generation efficiency. Experiments demonstrate that ZipAR can reduce the number of model forward passes by up to 91% on the Emu3-Gen model without requiring any additional retraining.
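
The decoding order can be visualized with a fixed wavefront schedule (a simplified stand-in for ZipAR's adaptive local windows and rejection sampling): tokens on the same anti-diagonal of the image token grid have their row-wise context available and can be proposed in one forward pass.

    H, W = 4, 4
    passes = []
    for d in range(H + W - 1):  # wavefront over anti-diagonals
        passes.append([(r, d - r) for r in range(H) if 0 <= d - r < W])

    print(f"{len(passes)} passes instead of {H * W}")  # 7 passes instead of 16
    for i, batch in enumerate(passes):
        print(f"pass {i}: decode tokens at {batch}")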


Poster
#E-3005
MissScore: High-Order Score Estimation in the Presence of Missing Data

Wenqin Liu · Haoze Hou · Erdun Gao · Biwei Huang · Qiuhong Ke · Howard Bondell · Mingming Gong

Score-based generative models are essential in various machine learning applications, with strong capabilities in generation quality. In particular, high-order derivatives (scores) of data density offer deep insights into data distributions, building on the proven effectiveness of first-order scores for modeling and generating synthetic data, unlocking new possibilities for applications. However, learning them typically requires complete data, which is often unavailable in domains such as healthcare and finance due to data corruption, acquisition constraints, or incomplete records. To tackle this challenge, we introduce MissScore, a novel framework for estimating high-order scores in the presence of missing data. We derive objective functions for estimating high-order scores under different missing data mechanisms and propose a new algorithm specifically designed to handle missing data effectively. Our empirical results demonstrate that MissScore accurately and efficiently learns the high-order scores from incomplete data and generates high-quality samples, resulting in strong performance across a range of downstream tasks.


Poster
#E-3006
Training Diffusion-based Generative Models with Limited Data

Zhaoyu Zhang · Yang Hua · Guanxiong Sun · Hui Wang · Seán McLoone

Diffusion-based generative models (diffusion models) often require a large amount of data to train a score-based model that learns the score function of the data distribution through denoising score matching. However, collecting and cleaning such data can be expensive, time-consuming, and even infeasible. In this paper, we present a novel theoretical insight for diffusion models that two factors, i.e., the denoiser function hypothesis space and the number of training samples, can affect the denoising score matching error of all training samples. Based on this theoretical insight, it is evident that minimizing the total denoising score matching error is challenging within the denoiser function hypothesis space in existing methods, when training diffusion models with limited data. To address this, we propose a new diffusion model called Limited Data Diffusion (LD-Diffusion), which consists of two main components: a compressing model and a novel mixed augmentation with fixed probability (MAFP) strategy. Specifically, the compressing model can constrain the complexity of the denoiser function hypothesis space and MAFP can effectively increase the training samples by providing more informative guidance than existing data augmentation methods in the compressed hypothesis space. Extensive experiments on several datasets demonstrate that LD-Diffusion can achieve better performance compared to other diffusion models. Codes are available at https://github.com/zzhang05/LD-Diffusion.


Poster
#E-3007
AdvI2I: Adversarial Image Attack on Image-to-Image Diffusion Models

Yaopei Zeng · Yuanpu Cao · Bochuan Cao · Yurui Chang · Jinghui Chen · Lu Lin

Recent advances in diffusion models have significantly enhanced the quality of image synthesis, yet they have also introduced serious safety concerns, particularly the generation of Not Safe for Work (NSFW) content. Previous research has demonstrated that adversarial prompts can be used to generate NSFW content. However, such adversarial text prompts are often easily detectable by text-based filters, limiting their efficacy. In this paper, we expose a previously overlooked vulnerability: adversarial image attacks targeting Image-to-Image (I2I) diffusion models. We propose AdvI2I, a novel framework that manipulates input images to induce diffusion models to generate NSFW content. By optimizing a generator to craft adversarial images, AdvI2I circumvents existing defense mechanisms, such as Safe Latent Diffusion (SLD), without altering the text prompts. Furthermore, we introduce AdvI2I-Adaptive, an enhanced version that adapts to potential countermeasures and minimizes the resemblance between adversarial images and NSFW concept embeddings, making the attack more resilient against defenses. Through extensive experiments, we demonstrate that both AdvI2I and AdvI2I-Adaptive can effectively bypass current safeguards, highlighting the urgent need for stronger security measures to address the misuse of I2I diffusion models.


Poster
#E-3008
Graph Generative Pre-trained Transformer

Xiaohui Chen · Yinkai Wang · JIAXING HE · Yuanqi Du · Soha Hassoun · Xiaolin Xu · Liping Liu

Graph generation is a critical task in numerous domains, including molecular design and social network analysis, due to its ability to model complex relationships and structured data. While most modern graph generative models utilize adjacency matrix representations, this work revisits an alternative approach that represents graphs as sequences of node and edge sets. We advocate for this approach due to its efficient encoding of graphs and propose a novel representation. Based on this representation, we introduce the Graph Generative Pre-trained Transformer (G2PT), an auto-regressive model that learns graph structures via next-token prediction. To further exploit G2PT's capabilities as a general-purpose foundation model, we explore fine-tuning strategies for two downstream applications: goal-oriented generation and graph property prediction. We conduct extensive experiments across multiple datasets. Results indicate that G2PT achieves superior generative performance on both generic graph and molecule datasets. Furthermore, G2PT exhibits strong adaptability and versatility in downstream tasks from molecular design to property prediction.


Poster
#E-3009
Inverse Flow and Consistency Models

Yuchen Zhang · Jian Zhou

Inverse generation problems, such as denoising without ground truth observations, are a critical challenge in many scientific inquiries and real-world applications. While recent advances in generative models like diffusion models, conditional flow matching, and consistency models have achieved impressive results by casting generation as a denoising problem, they cannot be directly used for inverse generation without access to clean data. Here we introduce Inverse Flow (IF), a novel framework that enables using these generative models for inverse generation problems, including denoising without ground truth. Inverse Flow can be flexibly applied to nearly any continuous noise distribution and allows complex dependencies. We propose two algorithms for learning Inverse Flows: Inverse Flow Matching (IFM) and Inverse Consistency Model (ICM). Notably, to derive the computationally efficient, simulation-free inverse consistency model objective, we generalize consistency training to any forward diffusion process or conditional flow, which has applications beyond denoising. We demonstrate the effectiveness of IF on synthetic and real datasets, outperforming prior approaches while enabling noise distributions that previous methods cannot support. Finally, we showcase applications of our techniques to fluorescence microscopy and single-cell genomics data, highlighting IF's utility in scientific problems. Overall, this work expands the applications of powerful generative models to inverse generation problems.


Poster
#E-3010
CHATS: Combining Human-Aligned Optimization and Test-Time Sampling for Text-to-Image Generation

Minghao Fu · Guo-Hua Wang · Liangfu Cao · Qing-Guo Chen · Zhao Xu · Weihua Luo · Kaifu Zhang

Diffusion models have emerged as a dominant approach for text-to-image generation. Key components such as human preference alignment and classifier-free guidance play a crucial role in ensuring generation quality. However, their independent application in current text-to-image models continues to face significant challenges in achieving strong text-image alignment, high generation quality, and consistency with human aesthetic standards. In this work, we explore, for the first time, facilitating the collaboration of human preference alignment and test-time sampling to unlock the potential of text-to-image models. Consequently, we introduce CHATS (Combining Human-Aligned optimization and Test-time Sampling), a novel generative framework that separately models the preferred and dispreferred distributions and employs a proxy-prompt-based sampling strategy to utilize the useful information contained in both distributions. We observe that CHATS exhibits exceptional data efficiency, achieving strong performance with only a small, high-quality fine-tuning dataset. Extensive experiments demonstrate that CHATS surpasses traditional preference alignment methods, setting a new state of the art across various standard benchmarks. The code is publicly available at github.com/AIDC-AI/CHATS.


Poster
#E-3011
Gaussian Mixture Flow Matching Models

Hansheng Chen · Kai Zhang · Hao Tan · Zexiang Xu · Fujun Luan · Leonidas Guibas · Gordon Wetzstein · Sai Bi

Diffusion models approximate the denoising distribution as a Gaussian and predict its mean, whereas flow matching models reparameterize the Gaussian mean as flow velocity. However, both model families underperform in few-step sampling due to discretization error and tend to produce over-saturated colors under classifier-free guidance (CFG). To address these limitations, we propose a novel Gaussian mixture flow matching (GMFlow) model: instead of predicting the mean, GMFlow predicts dynamic Gaussian mixture (GM) parameters to capture a multi-modal flow velocity distribution, which can be learned with a KL divergence loss. We demonstrate that GMFlow generalizes previous diffusion and flow matching models, where a single Gaussian is learned with an $L_2$ denoising loss. For inference, we derive GM-SDE/ODE solvers that leverage analytic denoising distributions and velocity fields for precise few-step sampling. Furthermore, we introduce a novel probabilistic guidance scheme that mitigates the over-saturation issues of CFG and improves image generation quality. Extensive experiments demonstrate that GMFlow consistently outperforms flow matching baselines in generation quality, achieving a Precision of 0.942 with only 6 sampling steps on ImageNet 256$\times$256.
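
As a rough illustration of the training objective, the sketch below implements a Gaussian-mixture velocity head trained by negative log-likelihood of the conditional velocity, which matches a KL(data || model) objective up to an additive constant. The layer names, shared isotropic variance, and component count are our simplifications and may differ from the paper's parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GMVelocityHead(nn.Module):
    """Hypothetical head predicting a K-component Gaussian mixture over velocity."""
    def __init__(self, feat_dim, vel_dim, n_components=8):
        super().__init__()
        self.K, self.D = n_components, vel_dim
        self.to_logits = nn.Linear(feat_dim, n_components)          # mixture weights
        self.to_means = nn.Linear(feat_dim, n_components * vel_dim) # component means
        self.log_sigma = nn.Parameter(torch.zeros(()))              # shared isotropic scale

    def loss(self, feats, v_target):
        """NLL of the conditional velocity under the predicted mixture;
        minimizing it equals minimizing KL(data || model) up to a constant."""
        B = feats.shape[0]
        log_w = F.log_softmax(self.to_logits(feats), dim=-1)        # (B, K)
        mu = self.to_means(feats).view(B, self.K, self.D)           # (B, K, D)
        var = torch.exp(2.0 * self.log_sigma)
        sq = ((v_target[:, None, :] - mu) ** 2).sum(-1)             # (B, K)
        log_comp = -0.5 * (sq / var + self.D * torch.log(2 * torch.pi * var))
        return -torch.logsumexp(log_w + log_comp, dim=-1).mean()

head = GMVelocityHead(feat_dim=64, vel_dim=16)
feats, v = torch.randn(32, 64), torch.randn(32, 16)
head.loss(feats, v).backward()   # trainable end to end
```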


Poster
#E-3012
OmiAD: One-Step Adaptive Masked Diffusion Model for Multi-class Anomaly Detection via Adversarial Distillation

Yaoxuan Feng · Wenchao Chen · yuxin li · Bo Chen · Yubiao Wang · Zixuan Zhao · Hongwei Liu · Mingyuan Zhou

Diffusion models have demonstrated outstanding performance in industrial anomaly detection. However, their iterative denoising nature results in slow inference speed, limiting their practicality for real-time industrial deployment. To address this challenge, we propose OmiAD, a one-step masked diffusion model for multi-class anomaly detection, derived from a well-designed multi-step Adaptive Masked Diffusion Model (AMDM) and compressed using Adversarial Score Distillation (ASD). OmiAD first introduces AMDM, equipped with an adaptive masking strategy that dynamically adjusts masking patterns based on noise levels and encourages the model to reconstruct anomalies as normal counterparts by leveraging broader context, reducing reliance on pixel-level shortcuts. Then, ASD compresses the multi-step diffusion process into a single-step generator via score distillation, incorporating a shared-weight discriminator that effectively reuses parameters while significantly improving both inference efficiency and detection performance. The effectiveness of OmiAD is validated on four diverse datasets, achieving state-of-the-art performance across seven metrics while delivering a remarkable inference speedup.


Poster
#E-3100
Continuous Visual Autoregressive Generation via Score Maximization

Chenze Shao · Fandong Meng · Jie Zhou

Conventional wisdom suggests that autoregressive models are suited to discrete data. When applied to continuous modalities such as visual data, Visual AutoRegressive modeling (VAR) typically resorts to quantization-based approaches that cast the data into a discrete space, which can introduce significant information loss. To tackle this issue, we introduce a Continuous VAR framework that enables direct visual autoregressive generation without vector quantization. The underlying theoretical foundation is strictly proper scoring rules, which provide powerful statistical tools capable of evaluating how well a generative model approximates the true distribution. Within this framework, all we need is to select a strictly proper score and set it as the training objective to optimize. We primarily explore a class of training objectives based on the energy score, which is likelihood-free and thus overcomes the difficulty of making probabilistic predictions in the continuous space. Previous efforts on continuous autoregressive generation, such as GIVT and diffusion loss, can also be derived from our framework using other strictly proper scores. Source code: \url{https://github.com/shaochenze/EAR}.
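
The energy score referred to here is a standard strictly proper scoring rule, so a hedged sketch of the resulting likelihood-free loss can be written directly; the shapes and the stand-in generator below are illustrative assumptions, not the released EAR code.

```python
import torch

def energy_score_loss(samples, target):
    """Monte Carlo estimate of a negatively oriented energy score.

    samples: (B, m, D) -- m i.i.d. model draws per context
    target:  (B, D)    -- the observed continuous token
    Minimizing E||x - y|| - 0.5 * E||x - x'|| maximizes the strictly proper
    energy score, so the loss is minimized by the true data distribution.
    """
    m = samples.shape[1]
    closeness = (samples - target[:, None, :]).norm(dim=-1).mean(dim=1)
    pair = (samples[:, :, None, :] - samples[:, None, :, :]).norm(dim=-1)
    spread = pair.sum(dim=(1, 2)) / (m * (m - 1))   # unbiased: skips the zero diagonal
    return (closeness - 0.5 * spread).mean()

# toy usage: a stand-in generator mapping context + noise to samples
B, m, D = 8, 4, 32
ctx = torch.randn(B, 1, D)
samples = ctx + 0.1 * torch.randn(B, m, D)
print(energy_score_loss(samples, torch.randn(B, D)))
```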


Poster
#E-3101
FrameBridge: Improving Image-to-Video Generation with Bridge Models

Yuji Wang · Zehua Chen · Chen Xiaoyu · Yixiang Wei · Jun Zhu · Jianfei Chen

Diffusion models have achieved remarkable progress on image-to-video (I2V) generation, while their noise-to-data generation process is inherently mismatched with this task, which may lead to suboptimal synthesis quality. In this work, we present FrameBridge. By modeling the frame-to-frames generation process with a bridge-model-based data-to-data generative process, we are able to fully exploit the information contained in the given image and improve the consistency between the generation process and the I2V task. Moreover, we propose two novel techniques for the two popular settings of training I2V models. Firstly, we propose SNR-Aligned Fine-tuning (SAF), making the first attempt to fine-tune a diffusion model into a bridge model and, therefore, allowing us to utilize pre-trained diffusion-based text-to-video (T2V) models. Secondly, we propose a neural prior, further improving the synthesis quality of FrameBridge when training from scratch. Experiments conducted on WebVid-2M and UCF-101 demonstrate the superior quality of FrameBridge in comparison with its diffusion counterpart (zero-shot FVD 95 vs. 192 on MSR-VTT and non-zero-shot FVD 122 vs. 171 on UCF-101), and the advantages of our proposed SAF and neural prior for bridge-based I2V models. The project page: https://framebridge-icml.github.io/


Poster
#E-3102
Distillation of Discrete Diffusion through Dimensional Correlations

Satoshi Hayakawa · Yuhta Takida · Masaaki Imaizumi · Hiromi Wakaki · Yuki Mitsufuji

Diffusion models have demonstrated exceptional performance in various fields of generative modeling, but suffer from slow sampling speed due to their iterative nature. While this issue is being addressed in continuous domains, discrete diffusion models face unique challenges, particularly in capturing dependencies between elements (e.g., pixel relationships in images, sequential dependencies in language), mainly due to the computational cost of processing high-dimensional joint distributions. In this paper, (i) we propose "mixture" models for discrete diffusion that are capable of treating dimensional correlations while remaining scalable, and (ii) we provide a set of loss functions for distilling the iterations of existing models. Two primary theoretical insights underpin our approach: First, conventional models with element-wise independence can well approximate the data distribution, but essentially require many sampling steps. Second, our loss functions enable the mixture models to distill such many-step conventional models into just a few steps by learning the dimensional correlations. Our experimental results show the effectiveness of the proposed method in distilling pretrained discrete diffusion models across image and language domains. The code used in the paper is available at https://github.com/sony/di4c.


Poster
#E-3103
Simple and Critical Iterative Denoising: A Recasting of Discrete Diffusion in Graph Generation

Yoann Boget

Discrete Diffusion and Flow Matching models have significantly advanced generative modeling for discrete structures, including graphs. However, in these models, the dependencies of the noisy distributions across time steps lead to error accumulation and propagation during the reverse denoising process—a phenomenon known as \emph{compounding denoising errors}. To address this problem, we propose a novel framework called \emph{Simple Iterative Denoising}, which simplifies discrete diffusion and circumvents the issue by removing dependencies on previous intermediate states in the noising process. Additionally, we enhance our model by incorporating a \emph{Critic}, which during generation selectively retains or corrupts elements in an instance based on their likelihood under the data distribution. Our empirical evaluations demonstrate that the proposed method significantly outperforms existing discrete diffusion baselines in graph generation tasks.


Poster
#E-3104
Scalable Non-Equivariant 3D Molecule Generation via Rotational Alignment

Yuhui Ding · Thomas Hofmann

Equivariant diffusion models have achieved impressive performance in 3D molecule generation. These models incorporate Euclidean symmetries of 3D molecules by utilizing an SE(3)-equivariant denoising network. However, specialized equivariant architectures limit the scalability and efficiency of diffusion models. In this paper, we propose an approach that relaxes such equivariance constraints. Specifically, our approach learns a sample-dependent SO(3) transformation for each molecule to construct an aligned latent space. A non-equivariant diffusion model is then trained over the aligned representations. Experimental results demonstrate that our approach performs significantly better than previously reported non-equivariant models. It yields sample quality comparable to state-of-the-art equivariant diffusion models and offers improved training and sampling efficiency. Our code is available at: https://github.com/skeletondyh/RADM


Poster
#E-3105
Differentiable Solver Search for Fast Diffusion Sampling

shuai wang · Zexian Li · Qipeng zhang · Tianhui Song · Xubin Li · Tiezheng Ge · Bo Zheng · Limin Wang

Diffusion models have demonstrated remarkable generation quality but at the cost of numerous function evaluations. Recently, advanced ODE-based solvers have been developed to mitigate the substantial computational demands of reverse-diffusion solving under limited sampling steps. However, these solvers, heavily inspired by Adams-like multistep methods, rely solely on t-related Lagrange interpolation. We show that t-related Lagrange interpolation is suboptimal for diffusion models and reveal a compact search space comprised of time steps and solver coefficients. Building on our analysis, we propose a novel differentiable solver search algorithm to identify a more optimal solver. Equipped with the searched solver, rectified-flow models, e.g., SiT-XL/2 and FlowDCN-XL/2, achieve FID scores of 2.40 and 2.35, respectively, on ImageNet-$256\times256$ with only 10 steps. Meanwhile, the DDPM model DiT-XL/2 reaches an FID score of 2.33 with only 10 steps. Notably, our searched solver outperforms traditional solvers by a significant margin and demonstrates generality across various model architectures, resolutions, and model sizes.


Spotlight Poster
#E-3106
On the Guidance of Flow Matching

Ruiqi Feng · Chenglei Yu · Wenhao Deng · Peiyan Hu · Tailin Wu

Flow matching has shown state-of-the-art performance in various generative tasks, ranging from image generation to decision-making, where generation under energy guidance (abbreviated as guidance in the following) is pivotal. However, the guidance of flow matching is more general than and thus substantially different from that of its predecessor, diffusion models. Therefore, the challenge in guidance for general flow matching remains largely underexplored. In this paper, we propose the first framework of general guidance for flow matching. From this framework, we derive a family of guidance techniques that can be applied to general flow matching. These include a new training-free asymptotically exact guidance, novel training losses for training-based guidance, and two classes of approximate guidance that cover classical gradient guidance methods as special cases. We theoretically investigate these different methods to give a practical guideline for choosing suitable methods in different scenarios. Experiments on synthetic datasets, image inverse problems, and offline reinforcement learning demonstrate the effectiveness of our proposed guidance methods and verify the correctness of our flow matching guidance framework. Code to reproduce the experiments can be found at https://github.com/AI4Science-WestlakeU/flow_guidance.


Poster
#E-3107
Effective and Efficient Masked Image Generation Models

Zebin You · Jingyang Ou · Xiaolu Zhang · Jun Hu · JUN ZHOU · Chongxuan Li

Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as \textbf{eMIGM}. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet $256\times256$, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFEs and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion model REPA while requiring less than 45\% of the NFEs. Additionally, on ImageNet $512\times512$, eMIGM outperforms the strong continuous diffusion model EDM2. Code is available at \url{https://github.com/ML-GSAI/eMIGM}.


Poster
#E-3108
Is Noise Conditioning Necessary for Denoising Generative Models?

Qiao Sun · Zhicheng Jiang · Hanhong Zhao · Kaiming He

It is widely believed that noise conditioning is indispensable for denoising diffusion models to work successfully. This work challenges this belief. Motivated by research on blind image denoising, we investigate a variety of denoising-based generative models in the absence of noise conditioning. To our surprise, most models exhibit graceful degradation, and in some cases, they even perform better without noise conditioning. We provide a mathematical analysis of the error introduced by removing noise conditioning and demonstrate that our analysis aligns with empirical observations. We further introduce a noise-unconditional model that achieves a competitive FID of 2.23 on CIFAR-10, significantly narrowing the gap to leading noise-conditional models. We hope our findings will inspire the community to revisit the foundations and formulations of denoising generative models.


Poster
#E-3109
Towards Rationale-Answer Alignment of LVLMs via Self-Rationale Calibration

Yuanchen Wu · Ke Yan · Shouhong Ding · Ziyin Zhou · Xiaoqiang Li

Large Vision-Language Models (LVLMs) have manifested strong visual question answering capability. However, they still struggle to align the rationale and the generated answer, leading to inconsistent reasoning and incorrect responses. To this end, this paper introduces the Self-Rationale Calibration (SRC) framework to iteratively calibrate the alignment between rationales and answers. SRC begins by employing a lightweight “rationale fine-tuning” approach, which modifies the model’s response format to require a rationale before deriving an answer, without explicit prompts. Next, SRC searches a diverse set of candidate responses from the fine-tuned LVLM for each sample, followed by a proposed pairwise scoring strategy using a tailored scoring model, R-Scorer, to evaluate both the rationale quality and the factual consistency of the candidates. Based on a confidence-weighted preference curation process, SRC decouples the alignment calibration into a preference fine-tuning manner, leading to significant improvements in LVLMs' perception, reasoning, and generalization across multiple benchmarks. Our results emphasize the importance of rationale-oriented alignment in exploring the potential of LVLMs.


Poster
#E-3110
ExLM: Rethinking the Impact of $\texttt{[MASK]}$ Tokens in Masked Language Models

Kangjie Zheng · Junwei Yang · Siyue Liang · Bin Feng · Zequn Liu · Wei Ju · Zhiping Xiao · Ming Zhang

Masked Language Models (MLMs) have achieved remarkable success in many self-supervised representation learning tasks. MLMs are trained by randomly masking portions of the input sequences with $\texttt{[MASK]}$ tokens and learning to reconstruct the original content based on the remaining context. This paper explores the impact of $\texttt{[MASK]}$ tokens on MLMs. Analytical studies show that masking tokens can introduce the ***corrupted semantics*** problem, wherein the corrupted context may convey multiple, ambiguous meanings. This problem is also a key factor affecting the performance of MLMs on downstream tasks. Based on these findings, we propose a novel enhanced-context MLM, ExLM. Our approach expands $\texttt{[MASK]}$ tokens in the input context and models the dependencies between these expanded states. This enhancement increases context capacity and enables the model to capture richer semantic information, effectively mitigating the corrupted semantics problem during pre-training. Experimental results demonstrate that ExLM achieves significant performance improvements in both text modeling and SMILES modeling tasks. Further analysis confirms that ExLM enriches semantic representations through context enhancement, and effectively reduces the semantic multimodality commonly observed in MLMs.


Poster
#E-3111
Physics-informed Temporal Alignment for Auto-regressive PDE Foundation Models

Congcong Zhu · Xiaoyan Xu · Jiayue Han · Jingrun Chen

Auto-regressive partial differential equation (PDE) foundation models have shown great potential in handling time-dependent data. However, these models suffer from error accumulation caused by the shortcut problem deeply rooted in auto-regressive prediction. The challenge becomes particularly evident for out-of-distribution data, as the pretraining performance may approach random model initialization for downstream tasks with long-term dynamics. To deal with this problem, we propose physics-informed temporal alignment (PITA), a self-supervised learning framework inspired by inverse problem solving. Specifically, PITA aligns the physical dynamics discovered at different time steps on each given PDE trajectory by integrating physics-informed constraints into the self-supervision signal. The alignment is derived from observation data without relying on known physics priors, indicating strong generalization ability to out-of-distribution data. Extensive experiments show that PITA significantly enhances the accuracy and robustness of existing foundation models on diverse time-dependent PDE data. The code is available at \url{https://github.com/SCAILab-USTC/PITA}.


Poster
#E-3112
M+: Extending MemoryLLM with Scalable Long-Term Memory

Yu Wang · Dmitry Krotov · Yuanzhe Hu · Yifan Gao · Wangchunshu Zhou · Julian McAuley · Dan Gutfreund · Rogerio Feris · Zexue He

Equipping large language models (LLMs) with latent-space memory has attracted increasing attention as they can extend the context window of existing language models. However, retaining information from the distant past remains a challenge. For example, MemoryLLM (Wang et al., 2024a), as a representative work with latent-space memory, compresses past information into hidden states across all layers, forming a memory pool of 1B parameters. While effective for sequence lengths up to 16k tokens, it struggles to retain knowledge beyond 20k tokens. In this work, we address this limitation by introducing M+, a memory-augmented model based on MemoryLLM that significantly enhances long-term information retention. M+ integrates a long-term memory mechanism with a co-trained retriever, dynamically retrieving relevant information during text generation. We evaluate M+ on diverse benchmarks, including long-context understanding and knowledge retention tasks. Experimental results show that M+ significantly outperforms MemoryLLM and recent strong baselines, extending knowledge retention from under 20k to over 160k tokens with similar GPU memory overhead.


Poster
#E-3200
Fundamental Limits of Visual Autoregressive Transformers: Universal Approximation Abilities

Yifang Chen · Xiaoyu Li · Yingyu Liang · Zhenmei Shi · Zhao Song

We investigate the fundamental limits of transformer-based foundation models, extending our analysis to include Visual Autoregressive (VAR) transformers. VAR represents a big step toward generating images using a novel, scalable, coarse-to-fine ``next-scale prediction'' framework. These models set a new quality bar, outperforming all previous methods, including Diffusion Transformers, on image synthesis tasks. Our primary contribution establishes that a single-head VAR transformer with a single self-attention layer and a single interpolation layer is universal. From the statistical perspective, we prove that such simple VAR transformers are universal approximators for any word-to-image Lipschitz function. Furthermore, we demonstrate that flow-based autoregressive transformers inherit similar approximation capabilities. Our results provide important design principles for effective and computationally efficient VAR transformer strategies that can be used to extend their utility to more sophisticated VAR models in image generation and other related areas.


Poster
#E-3201
Sparse Spectral Training and Inference on Euclidean and Hyperbolic Neural Networks

Jialin Zhao · Yingtao Zhang · Xinghang Li · Huaping Liu · Carlo Cannistraci

The growing demands on GPU memory posed by the increasing number of neural network parameters call for training approaches that are more memory-efficient. Previous memory reduction training techniques, such as Low-Rank Adaptation (LoRA) and ReLoRA, face challenges, with LoRA being constrained by its low-rank structure, particularly during intensive tasks like pre-training, and ReLoRA suffering from saddle point issues. In this paper, we propose Sparse Spectral Training (SST) to optimize memory usage for pre-training. SST updates all singular values and selectively updates singular vectors through a multinomial sampling method weighted by the magnitude of the singular values. Furthermore, SST employs singular value decomposition to initialize and periodically reinitialize low-rank parameters, reducing distortion relative to full-rank training compared to other low-rank methods. Through comprehensive testing on both Euclidean and hyperbolic neural networks across various tasks, SST demonstrates its ability to outperform existing memory reduction training methods and is comparable to full-rank training in various cases. On LLaMA-1.3B, with only 18.7\% of the parameters trainable compared to full-rank training (using a rank equivalent to 6\% of the embedding dimension), SST reduces the perplexity gap between other low-rank methods and full-rank training by 97.4\%. This result highlights SST as an effective parameter-efficient technique for model pre-training.
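
A minimal sketch of the sampling step described above, assuming the stated design (all singular values stay trainable; singular-vector pairs are sampled with probability proportional to their singular values); the names and the reconstruction scheme are our illustration, not the authors' implementation.

```python
import torch

def sst_factors(W, r_active):
    """Minimal sketch of one SST round (names illustrative): keep the full
    spectrum trainable, but give gradients only to singular-vector pairs
    sampled with probability proportional to their singular values."""
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    idx = torch.multinomial(S / S.sum(), r_active, replacement=False)
    mask = torch.zeros_like(S, dtype=torch.bool)
    mask[idx] = True

    S = torch.nn.Parameter(S)                      # all singular values trainable
    U_act = torch.nn.Parameter(U[:, mask])         # sampled left vectors
    Vh_act = torch.nn.Parameter(Vh[mask, :])       # sampled right vectors
    U_frz, Vh_frz = U[:, ~mask], Vh[~mask, :]      # frozen directions

    def weight():
        active = (U_act * S[mask]) @ Vh_act
        frozen = (U_frz * S[~mask]) @ Vh_frz
        return active + frozen

    return weight, [S, U_act, Vh_act]

weight_fn, params = sst_factors(torch.randn(256, 128), r_active=16)
loss = (weight_fn() @ torch.randn(128, 4)).pow(2).mean()
loss.backward()    # gradients reach S and only the sampled vectors
```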


Poster
#E-3202
Oscillation-Reduced MXFP4 Training for Vision Transformers

Yuxiang Chen · Haocheng Xi · Jun Zhu · Jianfei Chen

Pre-training Transformers in FP4 precision is becoming a promising approach to gain substantial speedup, but it comes with a considerable loss of accuracy. The Microscaling (MX) data format provides a fine-grained per-group quantization method to improve the representation ability of the FP4 format and is supported by the next-generation Blackwell GPU architecture. However, training with the MXFP4 data format still results in significant degradation, and there is a lack of systematic research on the reason. In this work, we propose a novel training method, TetraJet, for more accurate FP4 training. We comprehensively evaluate all of the quantizers involved in the training and identify the weight oscillation problem in the forward pass as the main source of degradation in MXFP4 training. Therefore, we introduce two novel methods, the EMA Quantizer (Q-EMA) and the Adaptive Ramping Optimizer (Q-Ramping), to resolve the oscillation problem. Extensive experiments on Vision Transformers demonstrate that TetraJet consistently outperforms existing 4-bit training methods, and Q-EMA \& Q-Ramping provide additional enhancement by effectively reducing oscillation. We decrease accuracy degradation by more than 50% compared to the baseline, and can even achieve performance competitive with full-precision training.


Poster
#E-3203
Sorbet: A Neuromorphic Hardware-Compatible Transformer-Based Spiking Language Model

Kaiwen Tang · Zhanglu Yan · Weng-Fai Wong

For reasons such as privacy, there are use cases for language models at the edge. This has given rise to small language models targeted for deployment in resource-constrained devices where energy efficiency is critical. Spiking neural networks (SNNs) offer a promising solution due to their energy efficiency, and there are already works on realizing transformer-based models on SNNs. However, key operations like softmax and layer normalization (LN) are difficult to implement on neuromorphic hardware, and many of these early works sidestepped them. To address these challenges, we introduce Sorbet, a transformer-based spiking language model that is more neuromorphic hardware-compatible. Sorbet incorporates a novel shifting-based softmax called PTsoftmax and a BitShifting-based PowerNorm (BSPN), both designed to replace the respective energy-intensive operations. By leveraging knowledge distillation and model quantization, Sorbet achieved a highly compressed binary weight model that maintains competitive performance while achieving $27.16\times$ energy savings compared to BERT. We validate Sorbet through extensive testing on the GLUE benchmark and a series of ablation studies, demonstrating its potential as an energy-efficient solution for language model inference. Our code is publicly available at [https://github.com/Kaiwen-Tang/Sorbet](https://github.com/Kaiwen-Tang/Sorbet)


Poster
#E-3204
SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization

Jintao Zhang · Haofeng Huang · Pengle Zhang · Jia wei · Jun Zhu · Jianfei Chen

Although quantization for linear layers has been widely used, its application to accelerate the attention process remains limited. To further enhance the efficiency of attention computation compared to SageAttention while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose to quantize the matrices $(Q, K)$ to INT4 at a hardware-friendly thread-level granularity and quantize the matrices $(\widetilde P, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the accuracy of the INT4 $QK^\top$. Third, we propose a two-level accumulation strategy for $\widetilde PV$ to enhance the accuracy of the FP8 $\widetilde PV$. The operations per second (OPS) of SageAttention2 surpass FlashAttention2 and xformers by about **3x** and **4.5x**, respectively. Moreover, SageAttention2 matches the speed of FlashAttention3 (FP8) on Hopper GPUs, while delivering much higher accuracy. Comprehensive experiments confirm that our approach incurs negligible end-to-end metric loss across diverse models, including those for language, image, and video generation.
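
Below is a simplified, hedged sketch of two of these ideas in plain PyTorch, emulating INT4 in float rather than using real tensor-core kernels. The per-group granularity and the rank-1 correction for the smoothed-out mean of $Q$ are our reading of the abstract; the actual per-thread layout differs.

```python
import torch

def smooth_and_quant_scores(Q, K, group=64):
    """Simplified sketch (not the released kernels): smooth Q by removing its
    per-channel mean, quantize Q and K to INT4 (values in [-7, 7]) with one
    scale per group of tokens, then add back the smoothed-out rank-1 term
    mean(Q) @ K^T in full precision."""
    q_mean = Q.mean(dim=0, keepdim=True)
    Qs = Q - q_mean                         # smaller dynamic range -> less INT4 error

    def quant(x):
        g = x.view(-1, group, x.shape[-1])
        scale = g.abs().amax(dim=(1, 2), keepdim=True) / 7.0 + 1e-8
        q = torch.clamp((g / scale).round(), -7, 7)
        return (q * scale).view_as(x)       # dequantized; real kernels stay in INT4

    scores = quant(Qs) @ quant(K).T
    return scores + q_mean @ K.T            # restore the mean contribution exactly

Q, K = torch.randn(128, 64), torch.randn(128, 64)
err = (smooth_and_quant_scores(Q, K) - Q @ K.T).abs().mean()
print(f"mean abs error vs. full precision: {err:.4f}")
```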


Poster
#E-3205
Hgformer: Hyperbolic Graph Transformer for Collaborative Filtering

Yang Xin · Xingrun Li · Heng Chang · Yang jinze · xihong yang · Shengyu Tao · Maiko Shigeno · Ningkang Chang · Junfeng Wang · Dawei Yin · Erxue Min

Recommender systems are increasingly spreading to different areas like e-commerce or video streaming to alleviate information overload. One of the most fundamental methods for recommendation is Collaborative Filtering (CF), which leverages historical user-item interactions to infer user preferences. In recent years, Graph Neural Networks (GNNs) have been extensively studied to capture graph structures in CF tasks. Despite this remarkable progress, local structure modeling and embedding distortion still remain two notable limitations in the majority of GNN-based CF methods. Therefore, in this paper, we propose a novel Hyperbolic Graph Transformer architecture, to tackle the long-tail problems in CF tasks. Specifically, the proposed framework is comprised of two essential modules: 1) Local Hyperbolic Graph Convolutional Network (LHGCN), which performs graph convolution entirely in the hyperbolic manifold and captures the local structure of each node; 2) Hyperbolic Transformer, which is comprised of hyperbolic cross-attention mechanisms to capture global information. Furthermore, to enable its feasibility on large-scale data, we introduce an unbiased approximation of the cross-attention for linear computational complexity, with a theoretical guarantee in approximation errors. Empirical experiments demonstrate that our proposed model outperforms the leading collaborative filtering methods and significantly mitigates the long-tail issue in CF tasks. Our implementations are available in \url{https://github.com/EnkiXin/Hgformer}.


Poster
#E-3206
RePaViT: Scalable Vision Transformer Acceleration via Structural Reparameterization on Feedforward Network Layers

Xuwei Xu · Yang Li · Yudong Chen · Jiajun LIU · Sen Wang

We reveal that feedforward network (FFN) layers, rather than attention layers, are the primary contributors to Vision Transformer (ViT) inference latency, with their impact intensifying as model size increases. This finding highlights a critical opportunity for optimizing the efficiency of large-scale ViTs by focusing on FFN layers. In this work, we propose a novel channel idle mechanism that facilitates post-training structural reparameterization for efficient FFN layers during testing. Specifically, a set of feature channels remains idle and bypasses the nonlinear activation function in each FFN layer, thereby forming a linear pathway that enables structural reparameterization during inference. This mechanism results in a family of ReParameterizable Vision Transformers (RePaViTs), which achieve remarkable latency reductions with acceptable sacrifices (sometimes gains) in accuracy across various ViTs. The benefits of our method scale consistently with model size, demonstrating greater speed improvements and progressively narrowing accuracy gaps, or even higher accuracies, on larger models. In particular, RePaViT-Large and RePaViT-Huge enjoy 66.8% and 68.7% speed-ups with +1.7% and +1.1% higher top-1 accuracies under the same training strategy, respectively. To the best of our knowledge, RePaViT is the first to employ structural reparameterization on FFN layers to expedite ViTs, and we believe it represents an auspicious direction for efficient ViTs. Source code is available at https://github.com/Ackesnal/RePaViT.
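
A minimal sketch of the channel-idle mechanism and its post-training folding, under our own layer names and idle ratio (not the authors' code): the idle channels form a linear path that collapses into a single dim-by-dim matrix at inference.

```python
import torch
import torch.nn as nn

class ChannelIdleFFN(nn.Module):
    """Sketch of a channel-idle FFN (illustrative; not the authors' exact code).
    The first `idle` hidden channels bypass the nonlinearity, so after training
    that linear pathway can be folded into a single dim x dim matrix."""
    def __init__(self, dim, hidden, idle_ratio=0.75):
        super().__init__()
        self.fc1, self.fc2 = nn.Linear(dim, hidden), nn.Linear(hidden, dim)
        self.idle = int(hidden * idle_ratio)
        self.act = nn.GELU()

    def forward(self, x):
        h = self.fc1(x)
        h = torch.cat([h[..., :self.idle], self.act(h[..., self.idle:])], dim=-1)
        return self.fc2(h)

    @torch.no_grad()
    def reparameterize(self):
        """Fold the idle (linear) path into one matrix; at inference the layer
        becomes y = x @ W_lin.T + b_lin + a nonlinear branch over the remaining
        hidden channels, shrinking the effective FFN width."""
        W1, b1 = self.fc1.weight, self.fc1.bias
        W2, b2 = self.fc2.weight, self.fc2.bias
        W_lin = W2[:, :self.idle] @ W1[:self.idle]
        b_lin = W2[:, :self.idle] @ b1[:self.idle] + b2
        return W_lin, b_lin

ffn = ChannelIdleFFN(dim=64, hidden=256).eval()
x = torch.randn(2, 64)
W_lin, b_lin = ffn.reparameterize()
y_nonlin = (ffn.fc2.weight[:, ffn.idle:] @ ffn.act(ffn.fc1(x)[..., ffn.idle:]).T).T
print(torch.allclose(ffn(x), x @ W_lin.T + b_lin + y_nonlin, atol=1e-5))  # True
```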


Spotlight Poster
#E-3207
In-Context Denoising with One-Layer Transformers: Connections between Attention and Associative Memory Retrieval

Matthew Smart · Alberto Bietti · Anirvan Sengupta

We introduce in-context denoising, a task that refines the connection between attention-based architectures and dense associative memory (DAM) networks, also known as modern Hopfield networks. Using a Bayesian framework, we show theoretically and empirically that certain restricted denoising problems can be solved optimally even by a single-layer transformer. We demonstrate that a trained attention layer processes each denoising prompt by performing a single gradient descent update on a context-aware DAM energy landscape, where context tokens serve as associative memories and the query token acts as an initial state. This one-step update yields better solutions than exact retrieval of either a context token or a spurious local minimum, providing a concrete example of DAM networks extending beyond the standard retrieval paradigm. Overall, this work solidifies the link between associative memory and attention mechanisms first identified by Ramsauer et al., and demonstrates the relevance of associative memory models in the study of in-context learning.
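
The claimed correspondence is easy to verify numerically: with the DAM energy $E(q) = \frac{1}{2}\|q\|^2 - \beta^{-1}\log\sum_i \exp(\beta k_i^\top q)$, one unit-step gradient descent update of the query equals a softmax attention readout with keys and values set to the context tokens. A small NumPy check (our own toy setup, not the paper's experiments):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
K = rng.normal(size=(16, 8))     # 16 context tokens acting as stored memories
q = rng.normal(size=8)           # query token = initial state
beta = 2.0

# DAM energy: E(q) = 0.5 * ||q||^2 - (1/beta) * logsumexp(beta * K @ q)
def grad_E(q):
    return q - K.T @ softmax(beta * (K @ q))

q_gd = q - grad_E(q)                     # one gradient step, unit step size
q_attn = K.T @ softmax(beta * (K @ q))   # softmax attention, keys = values = K

print(np.allclose(q_gd, q_attn))         # True
```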


Poster
#E-3208
In-Context Linear Regression Demystified: Training Dynamics and Mechanistic Interpretability of Multi-Head Softmax Attention

Jianliang He · Xintian Pan · Siyu Chen · Zhuoran Yang

We study how multi-head softmax attention models are trained to perform in-context learning on linear data. Through extensive empirical experiments and rigorous theoretical analysis, we demystify the emergence of elegant attention patterns: a diagonal and homogeneous pattern in the key-query weights, and a last-entry-only and zero-sum pattern in the output-value weights. Remarkably, these patterns consistently appear from gradient-based training starting from random initialization. Our analysis reveals that such emergent structures enable multi-head attention to approximately implement a debiased gradient descent predictor --- one that outperforms single-head attention and nearly achieves Bayesian optimality up to a proportionality factor. We also extend our study to scenarios with anisotropic covariates and multi-task linear regression. Our results reveal that in-context learning ability emerges from the trained transformer as an aggregated effect of its architecture and the underlying data distribution, paving the way for deeper understanding and broader applications of in-context learning.


Poster
#E-3209
Contrastive Localized Language-Image Pre-Training

Hong-You Chen · Zhengfeng Lai · Haotian Zhang · Xinze Wang · Marcin Eichner · Keen You · Meng Cao · Bowen Zhang · Yinfei Yang · Zhe Gan

CLIP has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, it has been widely adopted as the vision backbone of multimodal large language models (MLLMs). The success of CLIP relies on aligning noisy, web-crawled text annotations at the image level. However, such a criterion may be insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanded of MLLMs. We improve the localization capability of CLIP with several advances. Our proposed pre-training method, Contrastive Localized Language-Image Pre-training (CLOC), complements CLIP with a region-text contrastive loss and accompanying modules. We formulate a new concept, promptable embeddings, in which the encoder produces image embeddings that are easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text labels. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for recognition and retrieval tasks, and can be a drop-in replacement for CLIP to enhance MLLMs, especially on referring and grounding tasks.


Poster
#E-3210
FG-CLIP: Fine-Grained Visual and Textual Alignment

Chunyu Xie · Bin Wang · Fanjing Kong · Jincheng Li · Dawei Liang · Gengshen Zhang · Dawei Leng · Yuhui Yin

Contrastive Language-Image Pre-training (CLIP) excels in multimodal tasks such as image-text retrieval and zero-shot classification but struggles with fine-grained understanding due to its focus on coarse-grained short captions. To address this, we propose Fine-Grained CLIP (FG-CLIP), which enhances fine-grained understanding through three key innovations. First, we leverage large multimodal models to generate 1.6 billion long caption-image pairs for capturing global-level semantic details. Second, a high-quality dataset is constructed with 12 million images and 40 million region-specific bounding boxes aligned with detailed captions to ensure precise, context-rich representations. Third, 10 million hard fine-grained negative samples are incorporated to improve the model's ability to distinguish subtle semantic differences. We construct a comprehensive dataset, termed FineHARD, by integrating high-quality region-specific annotations with challenging fine-grained negative samples. Corresponding training methods are meticulously designed for these data. Extensive experiments demonstrate that FG-CLIP outperforms the original CLIP and other state-of-the-art methods across various downstream tasks, including fine-grained understanding, open-vocabulary object detection, image-text retrieval, and general multimodal benchmarks. These results highlight FG-CLIP's effectiveness in capturing fine-grained image details and improving overall model performance. The data, code, and models are available at https://github.com/360CVGroup/FG-CLIP.


Poster
#E-3211
QMamba: On First Exploration of Vision Mamba for Image Quality Assessment

Fengbin Guan · Xin Li · Zihao Yu · Yiting Lu · Zhibo Chen

In this work, we present the first exploration of the recently popular foundation model, i.e., the State Space Model/Mamba, in image quality assessment (IQA), aiming to observe and excavate the perception potential of vision Mamba. A series of works on Mamba has shown its significant potential in various fields, e.g., segmentation and classification. However, the perception capability of Mamba remains under-explored. Consequently, we propose QMamba by revisiting and adapting the Mamba model for three crucial IQA tasks, i.e., task-specific, universal, and transferable IQA, which reveals its clear advantages over existing foundational models, e.g., Swin Transformer, ViT, and CNNs, in terms of perception and computational cost. To improve the transferability of QMamba, we propose the StylePrompt tuning paradigm, where lightweight mean and variance prompts are injected to assist task-adaptive transfer learning of pre-trained QMamba for different downstream IQA tasks. Compared with existing prompt tuning strategies, our StylePrompt enables better perceptual transfer at lower computational cost. Extensive experiments on multiple synthetic and authentic IQA datasets, as well as cross-dataset IQA settings, demonstrate the effectiveness of our proposed QMamba.


Poster
#E-3300
Stream-level Flow Matching with Gaussian Processes

Ganchao Wei · Li Ma

Flow matching (FM) is a family of training algorithms for fitting continuous normalizing flows (CNFs). Conditional flow matching (CFM) exploits the fact that the marginal vector field of a CNF can be learned by fitting least-squares regression to the conditional vector field specified given one or both ends of the flow path. In this paper, we extend the CFM algorithm by defining conditional probability paths along "streams", instances of latent stochastic paths that connect data pairs of source and target, which are modeled with Gaussian process (GP) distributions. The unique distributional properties of GPs help preserve the "simulation-free" nature of CFM training. We show that this generalization of CFM can effectively reduce the variance in the estimated marginal vector field at a moderate computational cost, thereby improving the quality of the generated samples under common metrics. Additionally, adopting the GP on the streams allows for flexibly linking multiple correlated training data points (e.g., time series). We empirically validate our claim through both simulations and applications to image and neural time series data.
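
For intuition, the simplest stochastic stream between a source-target pair is a Brownian bridge; below is a hedged sketch of sampling from that stream and forming the CFM regression target (the paper's general GP construction is richer than this special case).

```python
import torch

def brownian_bridge_cfm(x0, x1, sigma=0.1):
    """Sample a point on a Brownian-bridge stream pinned at (x0, x1) and
    return the CFM regression target (the bridge drift at that point).
    As sigma -> 0 this recovers the usual straight-line CFM target x1 - x0."""
    t = torch.rand(x0.shape[0], 1) * 0.98 + 0.01     # avoid t = 1
    mean = (1 - t) * x0 + t * x1
    std = sigma * torch.sqrt(t * (1 - t))
    xt = mean + std * torch.randn_like(x0)
    v_target = (x1 - xt) / (1 - t)                   # drift of the pinned bridge
    return t, xt, v_target

x0, x1 = torch.randn(64, 2), torch.randn(64, 2) + 3.0
t, xt, v = brownian_bridge_cfm(x0, x1)
# a velocity network v_theta(xt, t) would be fit to v by least squares
```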


Poster
#E-3301
TabPFN Unleashed: A Scalable and Effective Solution to Tabular Classification Problems

Si-Yang Liu · Han-Jia Ye

TabPFN has emerged as a promising in-context learning model for tabular data, capable of directly predicting the labels of test samples given labeled training examples. It has demonstrated competitive performance, particularly on small-scale classification tasks. However, despite its effectiveness, TabPFN still requires further refinement in several areas, including handling high-dimensional features, aligning with downstream datasets, and scaling to larger datasets. In this paper, we revisit existing variants of TabPFN and observe that most approaches focus on reducing either bias or variance, often neglecting the other side, while also increasing inference overhead. To fill this gap, we propose Beta (Bagging and Encoder-based Fine-tuning for TabPFN Adaptation), a novel and effective method designed to minimize both bias and variance. To reduce bias, we introduce a lightweight encoder to better align downstream tasks with the pre-trained TabPFN. By increasing the number of encoders in a lightweight manner, Beta mitigates variance, thereby further improving the model’s performance. Additionally, bootstrapped sampling is employed to further reduce the impact of data perturbations on the model, all while maintaining computational efficiency during inference. Our approach enhances TabPFN’s ability to handle high-dimensional data and scale to larger datasets. Experimental results on over 200 benchmark classification datasets demonstrate that Beta either outperforms or matches state-of-the-art methods.


Poster
#E-3302
GRADEO: Towards Human-Like Evaluation for Text-to-Video Generation via Multi-Step Reasoning

Zhun Mou · Bin Xia · Zhengchao Huang · Wenming Yang · Jiaya Jia

Recent great advances in video generation models have demonstrated their potential to produce high-quality videos, bringing challenges to effective evaluation. Unlike human evaluation, existing automated evaluation metrics lack high-level semantic understanding and reasoning capabilities for video, making them ineffective and unexplainable. To fill this gap, we curate GRADEO-Instruct, a multi-dimensional T2V evaluation instruction-tuning dataset, including 3.3k videos from over 10 existing video generation models and multi-step reasoning assessments derived from 16k human annotations. We then introduce GRADEO, one of the first specifically designed video evaluation models, which grades AI-generated videos with explainable scores and assessments through multi-step reasoning. Experiments show that our method aligns better with human evaluations than existing methods. Furthermore, our benchmarking reveals that current video generation models struggle to produce content that aligns with human reasoning and complex real-world scenarios. The models, datasets, and code will be released soon.


Poster
#E-3303
One-dimensional Path Convolution

Xuanshu Luo · Martin Werner

Two-dimensional (2D) convolutional kernels have dominated convolutional neural networks (CNNs) in image processing. While linearly scaling 1D convolution provides parameter efficiency, its naive integration into CNNs disrupts image locality, thereby degrading performance. This paper presents path convolution (PathConv), a novel CNN design built exclusively from 1D operations, achieving ResNet-level accuracy using only 1/3 of the parameters. To obtain locality-preserving image traversal paths, we analyze Hilbert/Z-order paths and expose a fundamental trade-off: improved proximity for most pixels comes at the cost of excessive distances between the remaining, sacrificed pixels and their neighbors. We resolve this issue by proposing path shifting, a succinct method to reposition sacrificed pixels. Using the randomized rounding algorithm, we show that three shifted paths are sufficient to offer better locality preservation than trivial raster scanning. To mitigate potential convergence issues caused by multiple paths, we design a lightweight path-aware channel attention mechanism to capture local intra-path and global inter-path dependencies. Experimental results further validate the efficacy of our method, establishing the proposed 1D PathConv as a viable backbone for efficient vision models.
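
A minimal sketch of the core idea, using a serpentine raster as a stand-in for the Hilbert/Z-order and shifted paths analyzed in the paper (the path choice and names are our assumptions): pixels are reordered along a path, convolved in 1D, and scattered back to the grid.

```python
import torch
import torch.nn as nn

def snake_path(h, w):
    """Serpentine raster order -- a stand-in for the Hilbert/Z-order and
    shifted paths analyzed in the paper."""
    idx = torch.arange(h * w).view(h, w)
    idx[1::2] = idx[1::2].flip(-1)            # reverse every other row
    return idx.flatten()

class PathConv1d(nn.Module):
    """Reorder pixels along a path, convolve in 1D, scatter back to the grid."""
    def __init__(self, c_in, c_out, h, w, k=9):
        super().__init__()
        path = snake_path(h, w)
        self.register_buffer("path", path)
        self.register_buffer("inv", torch.argsort(path))
        self.conv = nn.Conv1d(c_in, c_out, k, padding=k // 2)
        self.h, self.w = h, w

    def forward(self, x):                      # x: (B, C, H, W)
        seq = x.flatten(2)[:, :, self.path]    # pixels as a 1D sequence
        seq = self.conv(seq)                   # O(k) params vs O(k^2) for a 2D kernel
        return seq[:, :, self.inv].view(x.shape[0], -1, self.h, self.w)

layer = PathConv1d(3, 16, h=32, w=32)
print(layer(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 16, 32, 32])
```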


Poster
#E-3304
Understanding Mode Connectivity via Parameter Space Symmetry

Bo Zhao · Nima Dehmamy · Robin Walters · Rose Yu

Neural network minima are often connected by curves along which train and test loss remain nearly constant, a phenomenon known as mode connectivity. While this property has enabled applications such as model merging and fine-tuning, its theoretical explanation remains unclear. We propose a new approach to exploring the connectedness of minima using parameter space symmetry. By linking the topology of symmetry groups to that of the minima, we derive the number of connected components of the minima of linear networks and show that skip connections reduce this number. We then examine when mode connectivity and linear mode connectivity hold or fail, using parameter symmetries which account for a significant part of the minimum. Finally, we provide explicit expressions for connecting curves in the minima induced by symmetry. Using the curvature of these curves, we derive conditions under which linear mode connectivity approximately holds. Our findings highlight the role of continuous symmetries in understanding the neural network loss landscape.


Poster
#E-3305
Multiobjective distribution matching

Xiaoyuan Zhang · Peijie Li · Ying Ying YU · Yichi Zhang · Han Zhao · Qingfu Zhang

Distribution matching is a key technique in machine learning, with applications in generative models, domain adaptation, and algorithmic fairness. A related but less explored challenge is generating a distribution that aligns with multiple underlying distributions, often with conflicting objectives, known as a Pareto optimal distribution. In this paper, we develop a general theory based on information geometry to construct the Pareto set and front for the entire exponential family under KL and inverse KL divergences. This formulation allows explicit derivation of the Pareto set and front for multivariate normal distributions, enabling applications like multiobjective variational autoencoders (MOVAEs) to generate interpolated image distributions. Experimental results on real-world images demonstrate that both algorithms can generate high-quality interpolated images across multiple distributions.


Poster
#E-3306
Compositional Scene Understanding through Inverse Generative Modeling

Yanbo Wang · Justin Dauwels · Yilun Du

Generative models have demonstrated remarkable abilities in generating high-fidelity visual content. In this work, we explore how generative models can further be used not only to synthesize visual content but also to understand the properties of a scene given a natural image. We formulate scene understanding as an inverse generative modeling problem, where we seek to find conditional parameters of a visual generative model to best fit a given natural image. To enable this procedure to infer scene structure from images substantially different than those seen during training, we further propose to build this visual generative model compositionally from smaller models over pieces of a scene. We illustrate how this procedure enables us to infer the set of objects in a scene, enabling robust generalization to new test scenes with an increased number of objects of new shapes. We further illustrate how this enables us to infer global scene factors, likewise enabling robust generalization to new scenes. Finally, we illustrate how this approach can be directly applied to existing pretrained text-to-image generative models for zero-shot multi-object perception. Code and visualizations are at https://energy-based-model.github.io/compositional-inference.


Poster
#E-3307
Explanatory Instructions: Towards Unified Vision Tasks Understanding and Zero-shot Generalization

Yang Shen · Xiu-Shen Wei · Yifan Sun · YuXin Song · Tao Yuan · Jian Jin · He-Yang Xu · Yazhou Yao · Errui Ding

Computer Vision (CV) has yet to fully achieve the zero-shot task generalization observed in Natural Language Processing (NLP), despite following many of the milestones established in NLP, such as large transformer models, extensive pre-training, and the auto-regression paradigm, among others. In this paper, we rethink the reality that CV adopts discrete and terminological task definitions (e.g., "image segmentation"), and conjecture it is a key barrier that hampers zero-shot task generalization. Our hypothesis is that without truly understanding previously-seen tasks—due to these terminological definitions—deep models struggle to generalize to novel tasks. To verify this, we introduce Explanatory Instructions, which provide an intuitive way to define CV task objectives through detailed linguistic transformations from input images to outputs. We create a large-scale dataset comprising 12 million "image input $\to$ explanatory instruction $\to$ output" triplets, and train an auto-regressive-based vision-language model (AR-based VLM) that takes both images and explanatory instructions as input. By learning to follow these instructions, the AR-based VLM achieves instruction-level zero-shot capabilities for previously-seen tasks and demonstrates strong zero-shot generalization for unseen CV tasks. Code and dataset will be open-sourced.


Poster
#E-3308
Activation by Interval-wise Dropout: A Simple Way to Prevent Neural Networks from Plasticity Loss

Sangyeon Park · Isaac Han · Seungwon Oh · KyungJoong Kim

Plasticity loss, a critical challenge in neural network training, limits a model's ability to adapt to new tasks or shifts in data distribution. While widely used techniques like L2 regularization and Layer Normalization have proven effective in mitigating this issue, Dropout remains notably ineffective. This paper introduces AID (Activation by Interval-wise Dropout), a novel method inspired by Dropout, designed to address plasticity loss. Unlike Dropout, AID generates subnetworks by applying Dropout with different probabilities on each preactivation interval. Theoretical analysis reveals that AID regularizes the network, promoting behavior analogous to that of deep linear networks, which do not suffer from plasticity loss. We validate the effectiveness of AID in maintaining plasticity across various benchmarks, including continual learning tasks on standard image classification datasets such as CIFAR10, CIFAR100, and TinyImageNet. Furthermore, we show that AID enhances reinforcement learning performance in the Arcade Learning Environment benchmark.
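
A minimal sketch of interval-wise dropout as described above; the interval boundaries and per-interval probabilities below are hypothetical placeholders, not the paper's settings.

```python
import torch

def aid(preact, bounds=(-1.0, 0.0, 1.0), probs=(0.9, 0.5, 0.2, 0.05), training=True):
    """Interval-wise dropout: each preactivation is dropped with the probability
    assigned to the interval it falls in, with the usual 1/(1-p) rescaling of
    survivors so activations stay unbiased in expectation."""
    if not training:
        return preact
    edges = torch.tensor(bounds, device=preact.device)
    idx = torch.bucketize(preact, edges)                  # interval index per element
    p = torch.tensor(probs, device=preact.device)[idx]
    keep = torch.bernoulli(1.0 - p)
    return preact * keep / (1.0 - p)

x = torch.randn(4, 8)
print(aid(x))          # stochastic subnetwork; aid(x, training=False) is identity
```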


Spotlight Poster
#E-3309
Where is the Truth? The Risk of Getting Confounded in a Continual World

Florian Peter Busch · Roshni Ramanna Kamath · Rupert Mitchell · Wolfgang Stammer · Kristian Kersting · Martin Mundt

A dataset is confounded if it is most easily solved via a spurious correlation that fails to generalize to new data. In this work, we show that, in a continual learning setting where confounders may vary in time across tasks, the challenge of mitigating the effect of confounders far exceeds the standard forgetting problem normally considered. In particular, we provide a formal description of such continual confounders and identify that, in general, spurious correlations are easily ignored when training for all tasks jointly, but it is harder to avoid confounding when they are considered sequentially. These descriptions serve as a basis for constructing a novel CLEVR-based continually confounded dataset, which we term the ConCon dataset. Our evaluations demonstrate that standard continual learning methods fail to ignore the dataset's confounders. Overall, our work highlights the challenges of confounding factors, particularly in continual learning settings, and demonstrates the need to develop continual learning methods that robustly tackle them.


Poster
#E-3310
How Effective Can Dropout Be in Multiple Instance Learning?

Wenhui Zhu · Peijie Qiu · Xiwen Chen · Zhangsihao Yang · Aristeidis Sotiras · Abolfazl Razi · Yalin Wang

Multiple Instance Learning (MIL) is a popular weakly-supervised method for various applications, with a particular interest in histological whole slide image (WSI) classification. Due to the gigapixel resolution of WSI, applications of MIL in WSI typically necessitate a two-stage training scheme: first, extract features from the pre-trained backbone and then perform MIL aggregation. However, it is well-known that this suboptimal training scheme suffers from "noisy" feature embeddings from the backbone and inherent weak supervision, hindering MIL from learning rich and generalizable features. Yet the most commonly used technique for mitigating this issue (i.e., dropout) has yet to be explored in MIL. In this paper, we empirically explore how effective dropout can be in MIL. Interestingly, we observe that dropping the top-k most important instances within a bag leads to better performance and generalization even under noise attacks. Based on this key observation, we propose a novel MIL-specific dropout method, termed MIL-Dropout, which systematically determines which instances to drop. Experiments on five MIL benchmark datasets and two WSI datasets demonstrate that MIL-Dropout boosts the performance of current MIL methods with a negligible computational cost. The code is available at \url{https://github.com/ChongQingNoSubway/MILDropout}.
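
A minimal sketch of the key observation, with attention weights standing in for instance importance (the actual MIL-Dropout chooses which instances to drop more systematically):

```python
import numpy as np

def mil_topk_dropout(instances, scores, k=2, training=True):
    """Zero out the k most important instances in a bag before aggregation."""
    if not training or k <= 0:
        return instances
    topk = np.argsort(scores)[-k:]            # indices of most important instances
    mask = np.ones(len(instances), dtype=bool)
    mask[topk] = False
    return instances * mask[:, None]

bag = np.random.randn(6, 16)                  # one bag of 6 instance embeddings
attn = np.random.rand(6)                      # importance, e.g. attention weights
pooled = mil_topk_dropout(bag, attn).mean(axis=0)   # mean-pooled bag embedding
print(pooled.shape)
```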


Poster
#E-3311
Larger or Smaller Reward Margins to Select Preferences for LLM Alignment?

Kexin Huang · Junkang Wu · Ziqian Chen · xue wang · Jinyang Gao · Bolin Ding · Jiancan Wu · Xiangnan He · Xiang Wang

Preference learning is critical for aligning large language models (LLMs) with human values, with the quality of preference datasets playing a crucial role in this process. While existing metrics primarily assess data quality based on either *explicit* or *implicit* reward margins, their single-margin focus often leads to contradictory evaluations for the same data. To address this issue, we propose a new metric of *alignment potential*, $M_{AP}$, which integrates both margins to quantify the gap from the model's *current implicit* reward margin to the *target explicit* reward margin, thereby estimating the model's potential to align on the preference data. Empirical results demonstrate that training on the data selected by $M_{AP}$ consistently enhances alignment performance, surpassing existing metrics across different base models and optimization objectives. Furthermore, our method can be extended to self-play data generation frameworks, where we use this metric to identify high-quality data within the self-generated content by LLMs. Under this data generation scenario, our method surpasses current state-of-the-art methods across various training settings and demonstrates continuous improvements with increasing dataset size and training iterations.
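
As a rough illustration of the metric's ingredients, one can compute a DPO-style implicit margin and a reward-model (explicit) margin and rank preference pairs by the gap between them; the paper's exact form of $M_{AP}$ may differ, and all names below are illustrative:

```python
def implicit_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO-style implicit reward margin of the current policy."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))

def alignment_potential(pair, beta=0.1):
    """Illustrative M_AP: gap from the current implicit margin to the
    target explicit (reward-model) margin; larger means more to gain."""
    explicit = pair["reward_w"] - pair["reward_l"]
    implicit = implicit_margin(pair["logp_w"], pair["logp_l"],
                               pair["ref_logp_w"], pair["ref_logp_l"], beta)
    return explicit - implicit

pairs = [
    dict(logp_w=-12.0, logp_l=-15.0, ref_logp_w=-13.0, ref_logp_l=-14.0,
         reward_w=2.1, reward_l=0.4),
    dict(logp_w=-10.0, logp_l=-18.0, ref_logp_w=-11.0, ref_logp_l=-12.0,
         reward_w=1.0, reward_l=0.8),
]
pairs.sort(key=alignment_potential, reverse=True)   # keep high-potential data
```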


Poster
#E-3312
Learning Bayesian Nash Equilibrium in Auction Games via Approximate Best Response

Kexin Huang · Ziqian Chen · xue wang · Chongming Gao · Jinyang Gao · Bolin Ding · Xiang Wang

Auctions play a crucial role in many modern trading environments, including online advertising and public resource allocation. As the number of competing bidders increases, learning Bayesian Nash Equilibrium (BNE) in auctions faces significant scalability challenges. Existing methods often experience slow convergence in large-scale auctions. For example, in a classic symmetric auction setting, the convergence rate depends quadratically on the number of bidders. To address this issue, we propose the Approximate Best Response Gradient method, a new approach for learning BNE efficiently in auction games. We leverage an analytic solution for gradient estimation to enable efficient gradient computation during optimization. Moreover, we introduce the Best Response Distance objective, which serves as an upper bound of approximation quality to BNE. By optimizing the new objective, our method is proven to achieve a local convergence rate independent of bidder numbers and circumvent the traditional quadratic complexity in the classic symmetric setting. Extensive experiments across various auction formats demonstrate that our approach accelerates convergence and enhances learning efficiency in complex auction settings.
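
The analytic-gradient idea can be illustrated on a toy symmetric first-price auction with uniform values, where the win probability, and hence the utility gradient, is available in closed form; this is a textbook setting, not the paper's algorithm:

```python
import numpy as np

def utility_and_grad(b, v, n, c):
    """Expected utility and closed-form gradient in a first-price auction
    with n bidders, values uniform on [0, 1], and opponents bidding
    beta(v) = c * v (c = (n - 1) / n is the symmetric BNE strategy)."""
    p_win = min(b / c, 1.0) ** (n - 1)
    if b < c:
        dp_db = (n - 1) * b ** (n - 2) / c ** (n - 1)
        grad = -p_win + (v - b) * dp_db
    else:
        grad = -1.0   # bidding above every opponent only lowers utility
    return (v - b) * p_win, grad

v, n = 0.8, 5
c = (n - 1) / n
b = 0.1
for _ in range(1000):                        # gradient ascent on the bid
    _, g = utility_and_grad(b, v, n, c)
    b = float(np.clip(b + 0.5 * g, 1e-3, 1.0))
print(round(b, 3), round(c * v, 3))          # converges to the best response c*v
```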


Poster
#E-500
Position: LLMs Need a Bayesian Meta-Reasoning Framework for More Robust and Generalizable Reasoning

Hanqi Yan · Linhai Zhang · Jiazheng Li · Zhenyi Shen · Yulan He

Large language models (LLMs) excel in many reasoning tasks but continue to face significant challenges, such as lack of robustness in reasoning, struggling with cross-task generalization, and inefficiencies in scaling up reasoning capabilities. Current training paradigms, including next-token prediction and reinforcement learning from human feedback, often fall short in adaptability to diverse reasoning tasks. Existing approaches, such as prompt optimization and iterative output refinement, offer performance improvement, but can be inefficient and lack effective generalization. To overcome these limitations, this position paper argues for a transformative shift in how LLMs approach reasoning. Drawing inspiration from cognitive science, particularly meta-reasoning theories such as Dual-Process Theory and Metacognitive Reasoning, we propose a Bayesian meta-reasoning framework for LLMs. Our approach integrates self-awareness, monitoring, evaluation, regulation, and meta-reflection, to enhance LLMs' ability to refine reasoning strategies and generalize across tasks. We revisit existing LLM reasoning methods, identify key challenges, and suggest directions for future research.


Spotlight Poster
#E-501
Position: Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints

Sam Bowyer · Laurence Aitchison · Desi Ivanova

Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e., producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios.
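
The failure mode is easy to reproduce. The sketch below contrasts a CLT interval with two standard small-sample alternatives (the Wilson score interval and a Beta-posterior credible interval) on a hypothetical 30-question benchmark:

```python
import numpy as np
from scipy import stats

def clt_interval(successes, n, level=0.95):
    """Normal-approximation (CLT) interval for benchmark accuracy."""
    p = successes / n
    z = stats.norm.ppf(0.5 + level / 2)
    half = z * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

def wilson_interval(successes, n, level=0.95):
    """Wilson score interval: stays valid at small n and near 0/1 accuracy."""
    z = stats.norm.ppf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def bayes_beta_interval(successes, n, level=0.95):
    """Bayesian credible interval under a uniform Beta(1, 1) prior."""
    lo = (1 - level) / 2
    return (stats.beta.ppf(lo, successes + 1, n - successes + 1),
            stats.beta.ppf(1 - lo, successes + 1, n - successes + 1))

print(clt_interval(28, 30))    # approx (0.84, 1.02): upper limit exceeds 1
print(wilson_interval(28, 30))
print(bayes_beta_interval(28, 30))
```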


Poster
#E-502
Position: You Can't Manufacture a NeRF

Marta An Kimmel · Mueed Rehman · Yonatan Bisk · Gary Fedder

In this paper, we examine the manufacturability gap in state-of-the-art generative models for 3D object representations. Many models for generating 3D assets focus on rendering virtual content and do not consider the constraints of real-world manufacturing, such as milling, casting, or injection molding. We demonstrate that existing generative models for computer-aided design representation do not generalize outside of their training datasets or to unmodified real, human-created objects. We identify limitations with the current approaches, including missing manufacturing-readable semantics, the inability to decompose complex shapes into parameterized segments appropriate for computer-aided manufacturing, and a lack of appropriate scoring metrics to assess the generated output versus the true reconstruction. The academic community could greatly impact real-world manufacturing by rallying around pathways to solve these challenges. We offer revised, more realistic datasets and baseline benchmarks as a step in targeting the challenge. In evaluating these datasets, we find that existing models are severely overfit to simpler data.


Poster
#E-503
Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking)

Yoonsoo Nam · Seok Hyeong Lee · Clémentine Dominé · Yeachan Park · Charles London · Wonyl Choi · Niclas Göring · Seungjai Lee

In physics, complex systems are often simplified into minimal, solvable models that retain only the core principles. In machine learning, layerwise linear models (e.g., linear neural networks) act as simplified representations of neural network dynamics. These models follow the dynamical feedback principle, which describes how layers mutually govern and amplify each other's evolution. This principle extends beyond the simplified models, successfully explaining a wide range of dynamical phenomena in deep neural networks, including neural collapse, emergence, lazy and rich regimes, and grokking. In this position paper, we call for the use of layerwise linear models retaining the core principles of neural dynamical phenomena to accelerate the science of deep learning.


Spotlight Poster
#E-504
Suitability Filter: A Statistical Framework for Classifier Evaluation in Real-World Deployment Settings

Angéline Pouget · Mohammad Yaghini · Stephan Rabanser · Nicolas Papernot

Deploying machine learning models in safety-critical domains poses a key challenge: ensuring reliable model performance on downstream user data without access to ground truth labels for direct validation. We propose the suitability filter, a novel framework designed to detect performance deterioration by utilizing suitability signals—model output features that are sensitive to covariate shifts and indicative of potential prediction errors. The suitability filter evaluates whether classifier accuracy on unlabeled user data shows significant degradation compared to the accuracy measured on the labeled test dataset. Specifically, it ensures that this degradation does not exceed a pre-specified margin, which represents the maximum acceptable drop in accuracy. To achieve reliable performance evaluation, we aggregate suitability signals for both test and user data and compare these empirical distributions using statistical hypothesis testing, thus providing insights into decision uncertainty. Our modular method adapts to various models and domains. Empirical evaluations across different classification tasks demonstrate that the suitability filter reliably detects performance deviations due to covariate shift. This enables proactive mitigation of potential failures in high-stakes applications.
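
A simplified sketch of the pipeline, assuming a single scalar suitability signal, a logistic probe, and a one-sided Welch test as the hypothesis test; the signal set, probe, and test used in the paper may differ:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LogisticRegression

def suitability_filter(test_signals, test_correct, user_signals,
                       margin=0.05, alpha=0.05):
    """Decide whether unlabeled user data is 'suitable' (simplified sketch)."""
    probe = LogisticRegression().fit(test_signals, test_correct)
    est_test = probe.predict_proba(test_signals)[:, 1]   # est. P(correct)
    est_user = probe.predict_proba(user_signals)[:, 1]
    # One-sided Welch test of H0: user accuracy <= test accuracy - margin.
    t, p_two = stats.ttest_ind(est_user, est_test - margin, equal_var=False)
    p_one = p_two / 2 if t > 0 else 1 - p_two / 2
    return {"suitable": p_one < alpha, "p_value": float(p_one),
            "estimated_drop": float(est_test.mean() - est_user.mean())}

rng = np.random.default_rng(0)
test_s = rng.normal(2.0, 1.0, (500, 1))                  # e.g. logit margins
test_y = rng.random(500) < 1 / (1 + np.exp(-test_s[:, 0]))
user_s = rng.normal(1.5, 1.0, (400, 1))                  # mild covariate shift
print(suitability_filter(test_s, test_y, user_s))
```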


Poster
#E-505
De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks

Wei Fan · Kejiang Chen · Chang Liu · Weiming Zhang · Nenghai Yu

The rapid advancement of speech generation models has heightened privacy and security concerns related to voice cloning (VC). Recent studies have investigated disrupting unauthorized voice cloning by introducing adversarial perturbations. However, determined attackers can mitigate these protective perturbations and successfully execute VC. In this study, we conduct the first systematic evaluation of these protective perturbations against VC under realistic threat models that include perturbation purification. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models, which degrades the performance of VC. From this perspective, we propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution. Experimental results demonstrate that our method outperforms state-of-the-art purification methods in disrupting VC defenses. Our study reveals the limitations of adversarial perturbation-based VC defenses and underscores the urgent need for more robust solutions to mitigate the security and privacy risks posed by VC. The code and audio samples are available at https://de-antifake.github.io.


Poster
#E-506
Circumventing Backdoor Space via Weight Symmetry

Jie Peng · Hongwei Yang · Jing Zhao · Hengji Dong · Hui He · Weizhe Zhang · Haoyu He

Deep neural networks are vulnerable to backdoor attacks, where malicious behaviors are implanted during training. While existing defenses can effectively purify compromised models, they typically require labeled data or specific training procedures, making them difficult to apply beyond supervised learning settings. Notably, recent studies have shown successful backdoor attacks across various learning paradigms, highlighting a critical security concern. To address this gap, we propose Two-stage Symmetry Connectivity (TSC), a novel backdoor purification defense that operates independently of data format and requires only a small fraction of clean samples. Through theoretical analysis, we prove that by leveraging permutation invariance in neural networks and quadratic mode connectivity, TSC amplifies the loss on poisoned samples while maintaining bounded clean accuracy. Experiments demonstrate that TSC achieves robust performance comparable to state-of-the-art methods in supervised learning scenarios. Furthermore, TSC generalizes to self-supervised learning frameworks, such as SimCLR and CLIP, maintaining its strong defense capabilities. Our code is available at https://github.com/JiePeng104/TSC.


Poster
#E-600
Assessing Safety Risks and Quantization-aware Safety Patching for Quantized Large Language Models

Kejia Chen · Jiawen Zhang · Jiacong Hu · Yu Wang · Jian Lou · Zunlei Feng · Mingli Song

Quantized large language models (LLMs) have gained increasing attention and significance for enabling deployment in resource-constrained environments. However, emerging studies on a few calibration-dataset-free quantization methods suggest that quantization may compromise the safety capabilities of LLMs, underscoring the urgent need for systematic safety evaluations and effective mitigation strategies. In this paper, we present comprehensive safety evaluations across various mainstream quantization techniques and diverse calibration datasets, utilizing widely accepted safety benchmarks. To address the identified safety vulnerabilities, we propose a quantization-aware safety patching framework, Q-resafe, to efficiently restore the safety capabilities of quantized LLMs while minimizing any adverse impact on utility. Extensive experiment results demonstrate that Q-resafe successfully re-aligns the safety of quantized LLMs with their pre-quantization counterparts, even under challenging evaluation scenarios. Project page: https://github.com/Thecommonirin/Qresafe.


Poster
#E-601
TRUST-VLM: Thorough Red-Teaming for Uncovering Safety Threats in Vision-Language Models

Kangjie Chen · Muyang Li · Guanlin Li · Shudong Zhang · Shangwei Guo · Tianwei Zhang

Vision-Language Models (VLMs) have become a cornerstone in multi-modal artificial intelligence, enabling seamless integration of visual and textual information for tasks such as image captioning, visual question answering, and cross-modal retrieval. Despite their impressive capabilities, these models often exhibit inherent vulnerabilities that can lead to safety failures in critical applications. Red-teaming is an important approach to identify and test a system's vulnerabilities, but how to conduct red-teaming for contemporary VLMs is an unexplored area. In this paper, we propose a novel multi-modal red-teaming approach, TRUST-VLM, to enhance both the attack success rate and the diversity of successful test cases for VLMs. Specifically, TRUST-VLM is built upon in-context learning to adversarially test a VLM on both image and text inputs. Furthermore, we involve feedback from the target VLM to improve the efficiency of test case generation. Extensive experiments show that TRUST-VLM not only outperforms traditional red-teaming techniques in generating diverse and effective adversarial cases but also provides actionable insights for model improvement. These findings highlight the importance of advanced red-teaming strategies in ensuring the reliability of VLMs.


Poster
#E-602
Efficient and Privacy-Preserving Soft Prompt Transfer for LLMs

Xun Wang · Jing Xu · Franziska Boenisch · Michael Backes · Christopher A. Choquette Choo · Adam Dziedzic

Prompting has become a dominant paradigm for adapting large language models (LLMs). While discrete (textual) prompts are widely used for their interpretability, soft (parameter) prompts have recently gained traction in APIs. This is because they can encode information from more training samples while minimizing the user's token usage, leaving more space in the context window for task-specific input. However, soft prompts are tightly coupled to the LLM they are tuned on, limiting their generalization to other LLMs. This constraint is particularly problematic for efficiency and privacy: (1) tuning prompts on each LLM incurs high computational costs, especially as LLMs continue to grow in size. Additionally, (2) when the LLM is hosted externally, soft prompt tuning often requires sharing private data with the LLM provider. For instance, this is the case with the NVIDIA NeMo API. To address these issues, we propose POST (Privacy Of Soft prompt Transfer), a framework that enables private tuning of soft prompts on a small model and subsequently transfers these prompts to a larger LLM. POST uses knowledge distillation to derive a small model directly from the large LLM to improve prompt transferability, tunes the soft prompt locally, optionally with differential privacy guarantees, and transfers it back to the larger LLM using a small public dataset. Our experiments show that POST reduces computational costs, preserves privacy, and effectively transfers high-utility soft prompts.


Poster
#E-603
Improving Out-of-Distribution Detection via Dynamic Covariance Calibration

Kaiyu Guo · Zijian Wang · Tan Pan · Brian Lovell · Mahsa Baktashmotlagh

Out-of-Distribution (OOD) detection is essential for the trustworthiness of AI systems. Methods using prior information (i.e., subspace-based methods) have shown effective performance by extracting information geometry to detect OOD data with a more appropriate distance metric. However, these methods fail to address the geometry distorted by ill-distributed samples, due to the limitation of statically extracting information geometry from the training distribution. In this paper, we argue that the influence of ill-distributed samples can be corrected by dynamically adjusting the prior geometry in response to new data. Based on this insight, we propose a novel approach that dynamically updates the prior covariance matrix using real-time input features, refining its information. Specifically, we reduce the covariance along the direction of real-time input features and constrain adjustments to the residual space, thus preserving essential data characteristics and avoiding effects on unintended directions in the principal space. We evaluate our method on two pre-trained models for the CIFAR dataset and five pre-trained models for ImageNet-1k, including the self-supervised DINO model. Extensive experiments demonstrate that our approach significantly enhances OOD detection across various models. The code is released at https://github.com/workerbcd/ooddcc.
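
A rough sketch of the mechanism, with illustrative choices for the update rule and step size: each incoming feature shrinks the covariance along its residual-space component, leaving the principal subspace of the training distribution untouched.

```python
import numpy as np

class DynamicCovOOD:
    """Mahalanobis OOD scoring with a dynamically updated covariance (sketch)."""

    def __init__(self, train_feats, n_principal=10, eta=0.05):
        self.mu = train_feats.mean(axis=0)
        self.cov = np.cov(train_feats, rowvar=False)
        _, eigvecs = np.linalg.eigh(self.cov)      # ascending eigenvalues
        self.P = eigvecs[:, -n_principal:]         # top principal directions
        self.eta = eta

    def score(self, f):
        d = f - self.mu
        # Restrict the update to the residual space so the principal
        # structure of the training distribution is preserved.
        d_res = d - self.P @ (self.P.T @ d)
        norm = np.linalg.norm(d_res)
        if norm > 1e-8:
            u = d_res / norm
            var_u = u @ self.cov @ u
            self.cov -= self.eta * var_u * np.outer(u, u)   # shrink along u
        return d @ np.linalg.solve(self.cov + 1e-6 * np.eye(len(d)), d)

rng = np.random.default_rng(1)
ood = DynamicCovOOD(rng.normal(size=(2000, 32)))
print(ood.score(rng.normal(size=32)))          # in-distribution: small score
print(ood.score(rng.normal(3.0, 1.0, 32)))     # shifted input: large score
```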


Poster
#E-604
An End-to-End Model for Logits-Based Large Language Models Watermarking

KA HIM WONG · Jicheng Zhou · Jiantao Zhou · Yain-Whar Si

The rise of LLMs has increased concerns over source tracing and copyright protection for AIGC, highlighting the need for advanced detection technologies. Passive detection methods usually face high false-positive rates, while active watermarking techniques using logits or sampling manipulation offer more effective protection. Existing LLM watermarking methods, though effective on unaltered content, suffer significant performance drops when the text is modified and could introduce biases that degrade LLM performance in downstream tasks. These methods fail to achieve an optimal tradeoff between text quality and robustness, particularly due to the lack of end-to-end optimization of the encoder and decoder. In this paper, we introduce a novel end-to-end logits perturbation method for watermarking LLM-generated text. By joint optimization, our approach achieves a better balance between quality and robustness. To address non-differentiable operations in the end-to-end training pipeline, we introduce an online-prompting technique that leverages the on-the-fly LLM as a differentiable surrogate. Our method achieves superior robustness, outperforming distortion-free methods by 37–39% under paraphrasing and 17.2% on average, while maintaining text quality on par with the distortion-free methods in terms of text perplexity and downstream tasks. Our method can be easily generalized to different LLMs. Code is available at https://github.com/KAHIMWONG/E2ELLMWM.


Poster
#E-605
Stealix: Model Stealing via Prompt Evolution

Zhixiong Zhuang · Hui-Po Wang · Irina Nicolae · Mario Fritz

Model stealing poses a significant security risk in machine learning by enabling attackers to replicate a black-box model without access to its training data, thus jeopardizing intellectual property and exposing sensitive information. Recent methods that use pre-trained diffusion models for data synthesis improve efficiency and performance but rely heavily on manually crafted prompts, limiting automation and scalability, especially for attackers with little expertise. To assess the risks posed by open-source pre-trained models, we propose a more realistic threat model that eliminates the need for prompt design skills or knowledge of class names. In this context, we introduce Stealix, the first approach to perform model stealing without predefined prompts. Stealix uses two open-source pre-trained models to infer the victim model’s data distribution, and iteratively refines prompts through a genetic algorithm, progressively improving the precision and diversity of synthetic images. Our experimental results demonstrate that Stealix significantly outperforms other methods, even those with access to class names or fine-grained prompts, while operating under the same query budget. These findings highlight the scalability of our approach and suggest that the risks posed by pre-trained generative models in model stealing may be greater than previously recognized.


Poster
#E-606
AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

Anselm Paulus · Arman Zharmagambetov · Chuan Guo · Brandon Amos · Yuandong Tian

Large Language Models (LLMs) are vulnerable to jailbreaking attacks that lead to generation of inappropriate or harmful content. Manual red-teaming requires a time-consuming search for adversarial prompts, whereas automatic adversarial prompt generation often leads to semantically meaningless attacks that do not scale well. In this paper, we present a novel method that uses another LLM, called AdvPrompter, to generate human-readable adversarial prompts in seconds. AdvPrompter, which is trained using an alternating optimization algorithm, generates suffixes that veil the input instruction without changing its meaning, such that the TargetLLM is lured to give a harmful response. Experimental results on popular open-source TargetLLMs show highly competitive results on the AdvBench and HarmBench datasets that also transfer to closed-source black-box LLMs. We also show that training on adversarial suffixes generated by AdvPrompter is a promising strategy for improving the robustness of LLMs to jailbreaking attacks.


Spotlight Poster
#E-700
Statistical Collusion by Collectives on Learning Platforms

Etienne Gauthier · Francis Bach · Michael Jordan

As platforms increasingly rely on learning algorithms, collectives may form and seek ways to influence these platforms to align with their own interests. This can be achieved by coordinated submission of altered data. To evaluate the potential impact of such behavior, it is essential to understand the computations that collectives must perform to impact platforms in this way. In particular, collectives need to make a priori assessments of the effect of the collective before taking action, as they may face potential risks when modifying their data. Moreover, they need to develop implementable coordination algorithms based on quantities that can be inferred from observed data. We develop a framework that provides a theoretical and algorithmic treatment of these issues and present experimental results in a product evaluation domain.


Poster
#E-701
Provably Cost-Sensitive Adversarial Defense via Randomized Smoothing

Yuan Xin · Dingfan Chen · Michael Backes · Xiao Zhang

As machine learning models are deployed in critical applications, robustness against adversarial perturbations is crucial. While numerous defensive algorithms have been proposed to counter such attacks, they typically assume that all adversarial transformations are equally important, an assumption that rarely aligns with real-world applications. To address this, we study the problem of robust learning against adversarial perturbations under cost-sensitive scenarios, where the potential harm of different types of misclassifications is encoded in a cost matrix. Our solution introduces a provably robust learning algorithm to certify and optimize for cost-sensitive robustness, building on the scalable certification framework of randomized smoothing. Specifically, we formalize the definition of cost-sensitive certified radius and propose our novel adaptation of the standard certification algorithm to generate tight robustness certificates tailored to any cost matrix. In addition, we design a robust training method that improves certified cost-sensitive robustness without compromising model accuracy. Extensive experiments on benchmark datasets, including challenging ones unsolvable by existing methods, demonstrate the effectiveness of our certification algorithm and training method across various cost-sensitive scenarios.
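
The cost-sensitive adaptation can be sketched on top of the standard randomized-smoothing certificate: classes with zero cost are simply excluded from the runner-up probability, which can only enlarge the radius. The sketch below uses plain Monte Carlo estimates and omits the finite-sample confidence bounds a real certificate requires:

```python
import numpy as np
from scipy.stats import norm

def cost_sensitive_radius(f, x, sigma, cost_row, n=10000, rng=None):
    """Monte Carlo certified radius restricted to costly errors (sketch).

    `f` maps a batch of inputs to class labels and `cost_row[c]` is the
    cost of predicting class c for this input's true class.
    """
    rng = rng or np.random.default_rng()
    noisy = x[None, :] + sigma * rng.normal(size=(n, x.size))
    counts = np.bincount(f(noisy), minlength=len(cost_row)) / n
    c_a = int(counts.argmax())                       # smoothed prediction
    p_a = min(counts[c_a], 1 - 1e-12)
    costly = (np.asarray(cost_row) > 0) & (np.arange(len(cost_row)) != c_a)
    p_b = counts[costly].max() if costly.any() else 0.0
    radius = sigma / 2 * (norm.ppf(p_a) - norm.ppf(max(p_b, 1e-12)))
    return c_a, float(radius)

# Toy 2-class base classifier; only predicting class 1 is costly here.
f = lambda X: (X.sum(axis=1) > 0).astype(int)
print(cost_sensitive_radius(f, np.full(10, -0.5), sigma=0.5, cost_row=[0, 1]))
```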


Poster
#E-702
Understanding and Mitigating Miscalibration in Prompt Tuning for Vision-Language Models

Shuoyuan Wang · Sharon Li · Hongxin Wei

Confidence calibration is critical for the safe deployment of machine learning models in the real world. However, such issue in vision-language models like CLIP, particularly after fine-tuning, has not been fully addressed. In this work, we demonstrate that existing prompt tuning methods usually lead to a trade-off of calibration between base and new classes: the cross-entropy loss used in standard fine-tuning (e.g., CoOp) causes overconfidence in new classes by increasing textual label divergence, whereas regularization-based tuning (e.g., KgCoOp) maintains the confidence level but results in underconfidence in base classes due to the improved accuracy. Inspired by the observations, we introduce Dynamic Outlier Regularization (DOR) to ensure the confidence calibration on both base and new classes after fine-tuning. In particular, we propose to minimize the feature deviation of novel textual labels (instead of base classes) sampled from a large vocabulary. In effect, DOR prevents the increase in textual divergence for new labels while easing restrictions on base classes. Extensive experiments demonstrate that DOR can enhance the calibration performance of current fine-tuning methods on base and new classes.


Poster
#E-703
Understanding Model Ensemble in Transferable Adversarial Attack

Wei Yao · Zeliang Zhang · Huayi Tang · Yong Liu

Model ensemble adversarial attack has become a powerful method for generating transferable adversarial examples that can target even unknown models, but its theoretical foundation remains underexplored. To address this gap, we provide early theoretical insights that serve as a roadmap for advancing model ensemble adversarial attack. We first define transferability error to measure the error in adversarial transferability, alongside concepts of diversity and empirical model ensemble Rademacher complexity. We then decompose the transferability error into vulnerability, diversity, and a constant, which rigidly explains the origin of transferability error in model ensemble attack: the vulnerability of an adversarial example to ensemble components, and the diversity of ensemble components. Furthermore, we apply the latest mathematical tools in information theory to bound the transferability error using complexity and generalization terms, validating three practical guidelines for reducing transferability error: (1) incorporating more surrogate models, (2) increasing their diversity, and (3) reducing their complexity in cases of overfitting. Finally, extensive experiments with 54 models validate our theoretical framework, representing a significant step forward in understanding transferable model ensemble adversarial attacks.


Poster
#E-704
Scaling Laws for Differentially Private Language Models

Ryan McKenna · Yangsibo Huang · Amer Sinha · Borja de Balle Pigem · Zachary Charles · Christopher A. Choquette Choo · Badih Ghazi · Georgios Kaissis · Ravi Kumar · Ruibo Liu · Da Yu · Chiyuan Zhang

Scaling laws have emerged as important components of large language model (LLM) training as they can predict performance gains through scale, and provide guidance on important hyper-parameter choices that would otherwise be expensive. LLMs also rely on large, high-quality training datasets, like those sourced from (sometimes sensitive) user data. Training models on this sensitive user data requires careful privacy protections like differential privacy (DP). However, the dynamics of DP training are significantly different, and consequently their scaling laws are not yet fully understood. In this work, we establish scaling laws that accurately model the intricacies of DP LLM training, providing a complete picture of the compute-privacy-utility and the optimal training configurations in many settings.


Poster
#E-705
Variance as a Catalyst: Efficient and Transferable Semantic Erasure Adversarial Attack for Customized Diffusion Models

Jiachen Yang · Yusong Wang · Yanmei Fang · Yunshu Dai · Fangjun Huang

Latent Diffusion Models (LDMs) enable fine-tuning with only a few images and have become widely used on the Internet. However, it can also be misused to generate fake images, leading to privacy violations and social risks. Existing adversarial attack methods primarily introduce noise distortions to generated images but fail to completely erase identity semantics. In this work, we identify the variance of VAE latent code as a key factor that influences image distortion. Specifically, larger variances result in stronger distortions and ultimately erase semantic information. Based on this finding, we propose a Laplace-based (LA) loss function that optimizes along the fastest variance growth direction, ensuring each optimization step is locally optimal. Additionally, we analyze the limitations of existing methods and reveal that their loss functions often fail to align gradient signs with the direction of variance growth. They also struggle to ensure efficient optimization under different variance distributions. To address these issues, we further propose a novel Lagrange Entropy-based (LE) loss function.Experimental results demonstrate that our methods achieve state-of-the-art performance on CelebA-HQ and VGGFace2. Both proposed loss functions effectively lead diffusion models to generate pure-noise images with identity semantics completely erased. Furthermore, our methods exhibit strong transferability across diverse models and efficiently complete attacks with minimal computational resources. Our work provides a practical and efficient solution for privacy protection.


Poster
#E-706
Janus: Dual-Server Multi-Round Secure Aggregation with Verifiability for Federated Learning

Lang Pu · Jingjing Gu · Chao Lin · Xinyi Huang

Secure Aggregation (SA) is a cornerstone of Federated Learning (FL), ensuring that user updates remain hidden from servers. The recent Flamingo protocol (S\&P'23) realizes multi-round aggregation and improves efficiency. However, it still faces several key challenges: scalability issues with dynamic user participation, a lack of verifiability for server-side aggregation results, and vulnerability to Model Inconsistency Attacks (MIA) caused by a malicious server distributing inconsistent models. To address these issues, we propose $\textit{Janus}$, a generic SA scheme based on a dual-server architecture. Janus ensures security against up to $n-2$ colluding clients (where $n$ is the total client count), which prevents privacy breaches for non-colluders. Additionally, Janus is model-independent, ensuring applicability across any FL model without specific adaptations. Furthermore, Janus introduces a new cryptographic primitive, Separable Homomorphic Commitment, which enables clients to efficiently verify the correctness of aggregation. Finally, extensive experiments show that Janus not only significantly enhances security but also reduces per-client communication and computation overhead from logarithmic to constant scale, with a tolerable impact on model performance.
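
The dual-server idea can be sketched with additive secret sharing: each client splits its update into two random-looking shares, one per server, so neither server alone learns anything while the recombined sums reveal only the aggregate. Janus's verifiability layer (the Separable Homomorphic Commitment) is omitted, and real protocols use finite-field arithmetic rather than the real-valued masks below:

```python
import numpy as np

def share_update(update, rng=None):
    """Split a client update into two additive shares, one per server."""
    rng = rng or np.random.default_rng()
    share_a = rng.normal(size=update.shape)   # random mask for server A
    share_b = update - share_a                # complement goes to server B
    return share_a, share_b

# Each server sums only the shares it received; recombining the two
# server-side totals reveals the aggregate, never an individual update.
clients = [np.full(3, float(c)) for c in range(1, 5)]
shares = [share_update(u) for u in clients]
agg_a = sum(s[0] for s in shares)
agg_b = sum(s[1] for s in shares)
print(agg_a + agg_b)   # equals the sum of all client updates: [10. 10. 10.]
```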


Spotlight Poster
#E-800
Outstanding Paper
The Value of Prediction in Identifying the Worst-Off

Unai Fischer Abaigar · Christoph Kern · Juan Perdomo

Machine learning is increasingly used in government programs to identify and support the most vulnerable individuals, prioritizing assistance for those at greatest risk over optimizing aggregate outcomes. This paper examines the welfare impacts of prediction in equity-driven contexts, and how they compare to other policy levers, such as expanding bureaucratic capacity. Through mathematical models and a real-world case study on long-term unemployment amongst German residents, we develop a comprehensive understanding of the relative effectiveness of prediction in surfacing the worst-off. Our findings provide clear analytical frameworks and practical, data-driven tools that empower policymakers to make principled decisions when designing these systems.


Poster
#E-801
Noisy SIGNSGD Is More Differentially Private Than You (Might) Think

Richeng Jin · Huaiyu (David) Dai

The prevalent distributed machine learning paradigm faces two critical challenges: communication efficiency and data privacy. SIGNSGD provides a simple-to-implement approach with improved communication efficiency by requiring workers to share only the signs of the gradients. However, it fails to converge in the presence of data heterogeneity, and a simple fix is to add Gaussian noise before taking the signs, which leads to the Noisy SIGNSGD algorithm that enjoys competitive performance while significantly reducing the communication overhead. Existing results suggest that Noisy SIGNSGD with additive Gaussian noise has the same privacy guarantee as classic DP-SGD due to the post-processing property of differential privacy, and logistic noise may be a good alternative to Gaussian noise when combined with the sign-based compressor. Nonetheless, discarding the magnitudes in Noisy SIGNSGD leads to information loss, which may intuitively amplify privacy. In this paper, we make this intuition rigorous and quantify the privacy amplification of the sign-based compressor. Particularly, we analytically show that Gaussian noise leads to a smaller estimation error than logistic noise when combined with the sign-based compressor and may be more suitable for distributed learning with heterogeneous data. Then, we further establish the convergence of Noisy SIGNSGD. Finally, extensive experiments are conducted to validate the theoretical results.
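
The mechanism itself is a few lines; a NumPy sketch of the worker and server sides:

```python
import numpy as np

def noisy_signsgd_message(grad, noise_std=1.0, rng=None):
    """Worker side: add Gaussian noise to the local gradient, then send
    only the signs (1 bit per coordinate). Discarding magnitudes is what
    the paper shows amplifies the differential-privacy guarantee."""
    rng = rng or np.random.default_rng()
    return np.sign(grad + noise_std * rng.normal(size=grad.shape))

def server_step(params, sign_messages, lr=0.01):
    """Server side: majority vote over worker signs, then a sign step."""
    return params - lr * np.sign(np.sum(sign_messages, axis=0))

rng = np.random.default_rng(0)
params = np.zeros(5)
grads = [rng.normal(loc=1.0, size=5) for _ in range(8)]  # heterogeneous workers
msgs = [noisy_signsgd_message(g, rng=rng) for g in grads]
print(server_step(params, msgs))
```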


Poster
#E-802
SecEmb: Sparsity-Aware Secure Federated Learning of On-Device Recommender System with Large Embedding

Peihua Mai · Youlong Ding · Ziyan Lyu · Minxin Du · Yan (James) Pang

Federated recommender system (FedRec) has emerged as a solution to protect user data through collaborative training techniques. A typical FedRec involves transmitting the full model and entire weight updates between edge devices and the server, causing significant burdens to edge devices with limited bandwidth and computational power. The sparsity of embedding updates provides an opportunity for payload optimization, while existing sparsity-aware federated protocols generally sacrifice privacy for efficiency. A key challenge in designing a secure sparsity-aware efficient protocol is to protect the rated item indices from the server. In this paper, we propose SecEmb, a lossless secure recommender system built on sparse embedding updates. SecEmb reduces user payload while ensuring that the server learns no information about both rated item indices and individual updates except the aggregated model. The protocol consists of two correlated modules: (1) a privacy-preserving embedding retrieval module that allows users to download relevant embeddings from the server, and (2) an update aggregation module that securely aggregates updates at the server. Empirical analysis demonstrates that SecEmb reduces both download and upload communication costs by up to 90x and decreases user-side computation time by up to 70x compared with secure FedRec protocols. Additionally, it offers non-negligible utility advantages compared with lossy message compression methods.


Poster
#E-803
FOCoOp: Enhancing Out-of-Distribution Robustness in Federated Prompt Learning for Vision-Language Models

Xinting Liao · Weiming Liu · Jiaming Qian · Pengyang Zhou · Jiahe Xu · Wenjie Wang · Chaochao Chen · Xiaolin Zheng · Tat-Seng Chua

Federated prompt learning (FPL) for vision-language models is a powerful approach to collaboratively adapt models across distributed clients while preserving data privacy. However, existing FPL approaches suffer from a trade-off between performance and robustness, particularly under out-of-distribution (OOD) shifts, limiting their reliability in real-world scenarios. The inherent in-distribution (ID) data heterogeneity among different clients makes it more challenging to maintain this trade-off. To fill this gap, we introduce a Federated OOD-aware Context Optimization (FOCoOp) framework, which captures diverse distributions among clients using ID global prompts, local prompts, and OOD prompts. Specifically, FOCoOp leverages three sets of prompts to create both class-level and distribution-level separations, which adapt to OOD shifts through bi-level distributionally robust optimization. Additionally, FOCoOp improves the discrimination consistency among clients, i.e., calibrating global prompts, seemingly-OOD prompts, and OOD prompts via semi-unbalanced optimal transport. Extensive experiments on real-world datasets demonstrate that FOCoOp effectively captures decentralized heterogeneous distributions and enhances robustness to different OOD shifts. The project is available at GitHub.


Poster
#E-804
Fast Exact Unlearning for In-Context Learning Data for LLMs

Andrei Muresanu · Anvith Thudi · Michael Zhang · Nicolas Papernot

Modern machine learning models are expensive to train, and there is a growing concern about the challenge of retroactively removing specific training data. Achieving exact unlearning in deep learning pipelines—producing models as if certain data had never been included in training—remains an open problem. In this paper, we revisit exact unlearning in deep learning and show that for large language models (LLMs) we can efficiently exactly unlearn "fine-tuning data" (the data used to adapt a pre-trained model). This follows from two observations. First, we can use in-context learning to adapt the LLM to the fine-tuning dataset instead of SGD based algorithms. Second, we show that accurate in-context learning can be done with quantized k-means, which allows for effectively constant time unlearning operations. Our evaluation shows that this unlearning recipe has similar performance to fine-tuning alternatives, but vastly reduces the unlearning costs. Our study also highlights the need for new measures of unlearning cost when adapting the learning algorithm to have faster unlearn operations.
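
A toy sketch of the recipe, using plain k-means (without the paper's quantization) to pick in-context examples, so that unlearning a point only touches its own cluster:

```python
import numpy as np
from sklearn.cluster import KMeans

class KMeansICL:
    """Prompt-example selection with fast exact unlearning (sketch).

    Fine-tuning data is summarized by k-means; the prompt is built from
    the example nearest each centroid. Exactly unlearning a point never
    requires gradient retraining, only a local cluster update.
    """
    def __init__(self, X, k=8):
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        self.points = [np.asarray(x) for x in X]
        self.assign = km.labels_.tolist()
        self.k = k

    def context_examples(self):
        reps = []
        for c in range(self.k):
            pts = [p for p, a in zip(self.points, self.assign) if a == c]
            if not pts:
                continue
            mu = np.mean(pts, axis=0)
            reps.append(min(pts, key=lambda p: np.linalg.norm(p - mu)))
        return reps

    def unlearn(self, x):
        # Remove the point; only its cluster's representative can change.
        i = next(j for j, p in enumerate(self.points) if np.array_equal(p, x))
        del self.points[i]
        del self.assign[i]

rng = np.random.default_rng(0)
data = rng.normal(size=(200, 16))
icl = KMeansICL(data)
icl.unlearn(data[17])                  # exact unlearning, no retraining
prompt_examples = icl.context_examples()
```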


Poster
#E-805
Adapting to Linear Separable Subsets with Large-Margin in Differentially Private Learning

Erchi Wang · Yuqing Zhu · Yu-Xiang Wang

This paper studies the problem of differentially private empirical risk minimization (DP-ERM) for binary linear classification. We obtain an efficient $(\varepsilon,\delta)$-DP algorithm with an empirical zero-one risk bound of $\tilde{O}\left(\frac{1}{\gamma^2\varepsilon n} + \frac{|S_{\mathrm{out}}|}{\gamma n}\right)$ where $n$ is the number of data points, $S_{\mathrm{out}}$ is an arbitrary subset of data one can remove and $\gamma$ is the margin of linear separation of the remaining data points (after $S_{\mathrm{out}}$ is removed). Here, $\tilde{O}(\cdot)$ hides only logarithmic terms. In the agnostic case, we improve the existing results when the number of outliers is small. Our algorithm is highly adaptive because it does not require knowing the margin parameter $\gamma$ or outlier subset $S_{\mathrm{out}}$. We also derive a utility bound for the advanced private hyperparameter tuning algorithm.


Poster
#E-806
Towards Trustworthy Federated Learning with Untrusted Participants

Youssef Allouah · Rachid Guerraoui · John Stephan

Resilience against malicious participants and data privacy are essential for trustworthy federated learning, yet achieving both with good utility typically requires the strong assumption of a trusted central server. This paper shows that a significantly weaker assumption suffices: each pair of participants shares a randomness seed unknown to others. In a setting where malicious participants may collude with an untrusted server, we propose CafCor, an algorithm that integrates robust gradient aggregation with correlated noise injection, using shared randomness between participants. We prove that CafCor achieves strong privacy-utility trade-offs, significantly outperforming local differential privacy (DP) methods, which do not make any trust assumption, while approaching central DP utility, where the server is fully trusted. Empirical results on standard benchmarks validate CafCor's practicality, showing that privacy and robustness can coexist in distributed systems without sacrificing utility or trusting the server.
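
The shared-randomness building block can be sketched with pairwise correlated noise that cancels exactly in the aggregate; CafCor's actual scheme combines this kind of correlated noise with robust aggregation and does not rely on exact cancellation:

```python
import numpy as np

def masked_update(update, my_id, peer_ids, pair_seeds, dim):
    """Add pairwise correlated noise derived from per-pair shared seeds.

    Each pair (i, j) derives the same noise vector from its shared seed;
    the lower-id participant adds it and the higher-id one subtracts it,
    so the noise cancels in the sum while masking individual updates.
    """
    masked = update.copy()
    for j in peer_ids:
        seed = pair_seeds[tuple(sorted((my_id, j)))]
        noise = np.random.default_rng(seed).normal(size=dim)
        masked += noise if my_id < j else -noise
    return masked

dim, ids = 4, [0, 1, 2]
seeds = {(i, j): hash((i, j)) % 2**32 for i in ids for j in ids if i < j}
updates = {i: np.full(dim, float(i + 1)) for i in ids}
masked = [masked_update(updates[i], i, [j for j in ids if j != i], seeds, dim)
          for i in ids]
print(np.sum(masked, axis=0))   # equals the raw sum: [6. 6. 6. 6.]
```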


Poster
#E-900
Kandinsky Conformal Prediction: Beyond Class- and Covariate-Conditional Coverage

Konstantina Bairaktari · Jiayun Wu · Steven Wu

Conformal prediction is a powerful distribution-free framework for constructing prediction sets with coverage guarantees. Classical methods, such as split conformal prediction, provide marginal coverage, ensuring that the prediction set contains the label of a random test point with a target probability. However, these guarantees may not hold uniformly across different subpopulations, leading to disparities in coverage. Prior work has explored coverage guarantees conditioned on events related to the covariates and label of the test point. We present Kandinsky conformal prediction, a framework that significantly expands the scope of conditional coverage guarantees. In contrast to Mondrian conformal prediction, which restricts its coverage guarantees to disjoint groups—reminiscent of the rigid, structured grids of Piet Mondrian’s art—our framework flexibly handles overlapping and fractional group memberships defined jointly on covariates and labels, reflecting the layered, intersecting forms in Wassily Kandinsky’s compositions. Our algorithm unifies and extends existing methods, encompassing covariate-based group conditional, class conditional, and Mondrian conformal prediction as special cases, while achieving a minimax-optimal high-probability conditional coverage bound. Finally, we demonstrate the practicality of our approach through empirical evaluation on real-world datasets.
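
For reference, the classical split conformal baseline that the framework generalizes, here for regression with absolute-residual scores; Kandinsky replaces the single calibration quantile with group-weighted ones over overlapping, fractional memberships:

```python
import numpy as np

def split_conformal_interval(y_cal, pred_cal, pred_test, alpha=0.1):
    """Marginal split conformal prediction: intervals with 1 - alpha
    marginal coverage, the special case Kandinsky generalizes."""
    scores = np.abs(y_cal - pred_cal)                  # conformity scores
    n = len(scores)
    q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n,
                    method="higher")                   # finite-sample quantile
    return pred_test - q, pred_test + q

rng = np.random.default_rng(0)
y_cal = rng.normal(size=1000)
pred_cal = y_cal + rng.normal(0, 0.5, 1000)            # imperfect predictions
lo, hi = split_conformal_interval(y_cal, pred_cal, np.array([0.0, 1.0]))
print(lo, hi)
```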


Poster
#E-901
Self-Consuming Generative Models with Adversarially Curated Data

Xiukun Wei · Xueru Zhang

Recent advances in generative models have made it increasingly difficult to distinguish real data from model-generated synthetic data. Using synthetic data for successive training of future model generations creates “self-consuming loops,” which may lead to model collapse or training instability. Furthermore, synthetic data is often subject to human feedback and curated by users based on their preferences. Ferbach et al. (2024) recently showed that when data is curated according to user preferences, the self-consuming retraining loop drives the model to converge toward a distribution that optimizes those preferences. However, in practice, data curation is often noisy or adversarially manipulated. For example, competing platforms may recruit malicious users to adversarially curate data and disrupt rival models. In this paper, we study how generative models evolve under self-consuming retraining loops with noisy and adversarially curated data. We theoretically analyze the impact of such noisy data curation on generative models and identify conditions for the robustness and stability of the retraining process. Building on this analysis, we design attack algorithms for competitive adversarial scenarios, where a platform with a limited budget employs malicious users to misalign a rival’s model from actual user preferences. Experiments on both synthetic and real-world datasets demonstrate the effectiveness of the proposed algorithms.
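
A toy simulation of the retraining loop on a one-dimensional Gaussian conveys the dynamics being analyzed: a fraction of preference comparisons is adversarially flipped, and the model still drifts toward the curated preference. Everything below is illustrative, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, 5000)           # real data, mean 0

def curate(samples, pref_mean, flip=0.2):
    """Pairwise preference curation: the sample closer to pref_mean wins,
    but a fraction `flip` of choices is adversarially inverted."""
    a, b = samples[::2], samples[1::2]
    prefer_a = np.abs(a - pref_mean) < np.abs(b - pref_mean)
    prefer_a ^= rng.random(prefer_a.shape) < flip
    return np.where(prefer_a, a, b)

samples = data
for gen in range(10):                       # self-consuming retraining loop
    mu, sd = samples.mean(), samples.std()  # "train" the generative model
    synthetic = rng.normal(mu, sd, 5000)    # generate synthetic data
    samples = curate(synthetic, pref_mean=1.0)
    print(gen, round(samples.mean(), 3))    # drifts toward the curated target
```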


Poster
#E-902
MONA: Myopic Optimization with Non-myopic Approval Can Mitigate Multi-step Reward Hacking

Sebastian Farquhar · Vikrant Varma · David Lindner · David Elson · Caleb Biddulph · Ian Goodfellow · Rohin Shah

Future advanced AI systems may learn sophisticated strategies through reinforcement learning (RL) that humans cannot understand well enough to safely evaluate. We propose a training method which avoids agents learning undesired multi-step plans that receive high reward (multi-step "reward hacks") even if humans are not able to detect that the behavior is undesired. The method, Myopic Optimization with Non-myopic Approval (MONA), works by combining short-sighted optimization with far-sighted reward. We demonstrate that MONA can prevent multi-step reward hacking that ordinary RL causes, even without being able to detect the reward hacking and without any extra information that ordinary RL does not get access to. We study MONA empirically in three settings that model different misalignment failure modes, including 2-step environments with LLMs representing delegated oversight and encoded reasoning, and longer-horizon gridworld environments representing sensor tampering.


Poster
#E-903
A New Approach to Backtracking Counterfactual Explanations: A Unified Causal Framework for Efficient Model Interpretability

Pouria Fatemi · Ehsan Sharifian · Mohammad Hossein Yassaee

Counterfactual explanations enhance interpretability by identifying alternative inputs that produce different outputs, offering localized insights into model decisions. However, traditional methods often neglect causal relationships, leading to unrealistic examples. While newer approaches integrate causality, they are computationally expensive. To address these challenges, we propose an efficient method called BRACE based on backtracking counterfactuals that incorporates causal reasoning to generate actionable explanations. We first examine the limitations of existing methods and then introduce our novel approach and its features. We also explore the relationship between our method and previous techniques, demonstrating that it generalizes them in specific scenarios. Finally, experiments show that our method provides deeper insights into model outputs.


Poster
#E-904
A Lens into Interpretable Transformer Mistakes via Semantic Dependency

Ruo-Jing Dong · Yu Yao · Bo Han · Tongliang Liu

Semantic Dependency refers to the relationship between words in a sentence where the meaning of one word depends on another, which is important for natural language understanding. In this paper, we investigate the role of semantic dependencies in answering questions for transformer models, which is achieved by analyzing how token values shift in response to changes in semantics. Through extensive experiments on models including the BERT series, GPT, and LLaMA, we uncover the following key findings: (1) most tokens primarily retain their original semantic information even as they propagate through multiple layers; (2) models can encode truthful semantic dependencies in tokens in the final layer; and (3) mistakes in model answers often stem from specific tokens encoded with incorrect semantic dependencies. Furthermore, we found that addressing the incorrectness by directly adjusting parameters is challenging because the same parameters can encode both correct and incorrect semantic dependencies depending on the context. Our findings provide insights into the causes of incorrect information generation in transformers and help the future development of robust and reliable models.


Poster
#E-905
Archetypal SAE: Adaptive and Stable Dictionary Learning for Concept Extraction in Large Vision Models

Thomas Fel · Ekdeep Singh Lubana · Jacob Prince · Matthew Kowal · Victor Boutin · Isabel Papadimitriou · Binxu Wang · Martin Wattenberg · Demba Ba · Talia Konkle

Sparse Autoencoders (SAEs) have emerged as a powerful framework for machine learning interpretability, enabling the unsupervised decomposition of model representations into a dictionary of abstract, human-interpretable concepts. However, we reveal a fundamental limitation: SAEs exhibit severe instability, as identical models trained on similar datasets can produce sharply different dictionaries, undermining their reliability as an interpretability tool. To address this issue, we draw inspiration from the Archetypal Analysis framework introduced by Cutler & Breiman (1994) and present Archetypal SAEs (A-SAE), wherein dictionary atoms are constrained to the data’s convex hull. This geometric anchoring significantly enhances the stability and plausibility of inferred dictionaries, and their mildly relaxed variants RA-SAEs further match state-of-the-art reconstruction abilities. To rigorously assess dictionary quality learned by SAEs, we introduce two new benchmarks that test (i) plausibility, if dictionaries recover “true” classification directions and (ii) identifiability, if dictionaries disentangle synthetic concept mixtures. Across all evaluations, RA-SAEs consistently yield more structured representations while uncovering novel, semantically meaningful concepts in large-scale vision models.
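
The convex-hull constraint is simple to state in code: parameterize each dictionary atom as a convex combination of anchor points (e.g., subsampled activations) via a row-wise softmax. A minimal PyTorch sketch, leaving out the SAE training loop:

```python
import torch
import torch.nn as nn

class ArchetypalDictionary(nn.Module):
    """Dictionary whose atoms are convex combinations of data points
    (a sketch of the A-SAE constraint). The row-wise softmax guarantees
    each atom lies inside the convex hull of the anchor set, which is
    what anchors the dictionary and stabilizes it across training runs."""

    def __init__(self, anchors, n_atoms):
        super().__init__()
        self.register_buffer("anchors", anchors)        # (n_anchors, dim)
        self.logits = nn.Parameter(torch.randn(n_atoms, anchors.shape[0]))

    def forward(self):
        weights = torch.softmax(self.logits, dim=-1)    # rows sum to 1, >= 0
        return weights @ self.anchors                   # atoms in conv(anchors)

anchors = torch.randn(512, 64)          # e.g. subsampled model activations
dictionary = ArchetypalDictionary(anchors, n_atoms=128)
atoms = dictionary()                    # (128, 64), each inside the hull
```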


Poster
#E-906
The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)

Zihao Wang · Yibo Jiang · Jiahao Yu · Heqing Huang

Large language models (LLMs) that integrate multiple input roles (e.g., system instructions, user queries, external tool outputs) are increasingly prevalent in practice. Ensuring that the model accurately distinguishes messages from each role—a concept we call role separation—is crucial for consistent multi-role behavior. Although recent work often targets state-of-the-art prompt injection defenses, it remains unclear whether such methods truly teach LLMs to differentiate roles or merely memorize known triggers. In this paper, we examine role-separation learning: the process of teaching LLMs to robustly distinguish system and user tokens. Through a simple, controlled experimental framework, we find that fine-tuned models often rely on two proxies for role identification: (1) task type exploitation, and (2) proximity to begin-of-text. Although data augmentation can partially mitigate these shortcuts, it generally leads to iterative patching rather than a deeper fix. To address this, we propose reinforcing invariant signals that mark role boundaries by adjusting token-wise cues in the model's input encoding. In particular, modifying position IDs helps the model learn clearer distinctions and reduces reliance on superficial proxies. By focusing on this mechanism-centered perspective, our work illuminates how LLMs can more reliably maintain consistent multi-role behavior without merely memorizing known prompts or triggers.
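
One way to picture the proposed fix: give user tokens a fixed positional offset so the role boundary is signaled by position IDs rather than by content. The gap size below is an illustrative choice:

```python
import torch

def role_gapped_position_ids(system_len, user_len, gap=64):
    """Position-ID manipulation in the spirit of the paper's fix (sketch).

    Instead of consecutive positions across the whole prompt, user tokens
    start after a fixed positional gap, giving the model an invariant,
    non-content signal for where the system role ends."""
    sys_ids = torch.arange(system_len)
    usr_ids = torch.arange(user_len) + system_len + gap
    return torch.cat([sys_ids, usr_ids]).unsqueeze(0)   # (1, seq_len)

# Passed to a HF-style model alongside input_ids, e.g.:
# model(input_ids=input_ids, position_ids=role_gapped_position_ids(12, 30))
print(role_gapped_position_ids(4, 3))   # tensor([[ 0, 1, 2, 3, 68, 69, 70]])
```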