Skip to yearly menu bar Skip to main content



2025 Spotlight Posters
Spotlight Poster
Cheng-Yen Hsieh · Xinyou Wang · Daiheng Zhang · Dongyu Xue · Fei YE · Shujian Huang · Zaixiang Zheng · Quanquan Gu

[ West Exhibition Hall B2-B3 ]

Abstract
Multimodal protein language models (PLMs) integrate sequence and token-based structural information, serving as a powerful foundation for protein modeling, generation, and design. However, the reliance on tokenizing 3D structures into discrete tokens causes substantial loss of fidelity about fine-grained structural details and correlations. In this paper, we systematically elucidate the design space of multimodal PLMs to overcome their limitations. We identify tokenization loss and inaccurate structure token predictions by the PLMs as major bottlenecks.To address these, our proposed design space covers improved generative modeling, structure-aware architectures and representation learning, and data exploration. Our advancements approach finer-grained supervision, demonstrating that token-based multimodal PLMs can achieve robust structural modeling.The effective design methods dramatically improve the structure generation diversity, and notably, folding abilities of our 650M model by reducing the RMSD from 5.52 to 2.36 on PDB testset, even outperforming 3B baselines and on par with the specialized folding models.Project page and code: https://bytedance.github.io/dplm/dplm-2.1.
Spotlight Poster
Xingjian Wu · Xiangfei Qiu · Hongfan Gao · Jilin Hu · Bin Yang · Chenjuan Guo

[ East Exhibition Hall A-B ]

Abstract
Probabilistic Time Series Forecasting (PTSF) plays a crucial role in decision-making across various fields, including economics, energy, and transportation. Most existing methods excell at short-term forecasting, while overlooking the hurdles of Long-term Probabilistic Time Series Forecasting (LPTSF). As the forecast horizon extends, the inherent nonlinear dynamics have a significant adverse effect on prediction accuracy, and make generative models inefficient by increasing the cost of each iteration. To overcome these limitations, we introduce $K^2$VAE, an efficient VAE-based generative model that leverages a KoopmanNet to transform nonlinear time series into a linear dynamical system, and devises a KalmanNet to refine predictions and model uncertainty in such linear system, which reduces error accumulation in long-term forecasting. Extensive experiments demonstrate that $K^2$VAE outperforms state-of-the-art methods in both short- and long-term PTSF, providing a more efficient and accurate solution.
Spotlight Poster
Matthew Niedoba · Berend Zwartsenberg · Kevin Murphy · Frank Wood

[ East Exhibition Hall A-B ]

Abstract
We propose a simple, training-free mechanism which explains the generalization behaviour of diffusion models. By comparing pre-trained diffusion models to their theoretically optimal empirical counterparts, we identify a shared local inductive bias across a variety of network architectures. From this observation, we hypothesize that network denoisers generalize through localized denoising operations, as these operations approximate the training objective well over much of the training distribution. To validate our hypothesis, we introduce novel denoising algorithms which aggregate local empirical denoisers to replicate network behaviour. Comparing these algorithms to network denoisers across forward and reverse diffusion processes, our approach exhibits consistent visual similarity to neural network outputs, with lower mean squared error than previously proposed methods.
Spotlight Poster
David Fleischer · David A Stephens · Archer Yang

[ East Exhibition Hall A-B ]

Abstract
We propose a computationally efficient alternative to generalized random forests (GRFs) for estimating heterogeneous effects in large dimensions. While GRFs rely on a gradient-based splitting criterion, which in large dimensions is computationally expensive and unstable, our method introduces a fixed-point approximation that eliminates the need for Jacobian estimation. This gradient-free approach preserves GRF’s theoretical guarantees of consistency and asymptotic normality while significantly improving computational efficiency. We demonstrate that our method achieves a speedup of multiple times over standard GRFs without compromising statistical accuracy. Experiments on both simulated and real-world data validate our approach. Our findings suggest that the proposed method is a scalable alternative for localized effect estimation in machine learning and causal inference applications.
Spotlight Poster
Geigh Zollicoffer · Kenneth Eaton · Jonathan Balloch · Julia Kim · Wei Zhou · Robert Wright · Mark Riedl

[ West Exhibition Hall B2-B3 ]

Abstract
Reinforcement learning (RL) using world models has found significant recent successes.However, when a sudden change to world mechanics or properties occurs then agent performance and reliability can dramatically decline.We refer to the sudden change in visual properties or state transitions as novelties.Implementing novelty detection within generated world model frameworks is a crucialtask for protecting the agent when deployed. In this paper, we propose straightforward bounding approaches to incorporate novelty detection into world model RL agents by utilizing the misalignment of the world model's hallucinated states and the true observed states as a novelty score. We provideeffective approaches to detecting novelties in a distribution of transitions learned by an agent ina world model. Finally, we show the advantage ofour work in a novel environment compared to traditional machine learning novelty detection methods as well as currently accepted RL-focused novelty detection algorithms.
Spotlight Poster
Ashkan Soleymani · Behrooz Tahmasebi · Stefanie Jegelka · Patrick Jaillet

[ East Exhibition Hall A-B ]

Abstract
We study the statistical-computational trade-offs for learning with exact invariances (or symmetries) using kernel regression. Traditional methods, such as data augmentation, group averaging, canonicalization, and frame-averaging, either fail to provide a polynomial-time solution or are not applicable in the kernel setting. However, with oracle access to the geometric properties of the input space, we propose a polynomial-time algorithm that learns a classifier with \emph{exact} invariances. Moreover, our approach achieves the same excess population risk (or generalization error) as the original kernel regression problem. To the best of our knowledge, this is the first polynomial-time algorithm to achieve exact (as opposed to approximate) invariances in this setting, partially addressing a question posed by Diaz (2025) regarding the avoidance of prohibitively large and computationally intensive group averaging methods in kernel regression with exact invariances. Our proof leverages tools from differential geometry, spectral theory, and optimization. A key result in our development is a new reformulation of the problem of learning under invariances as optimizing an infinite number of linearly constrained convex quadratic programs, which may be of independent interest.
Spotlight Poster
Dong Xiao · Guangyao Chen · Peixi Peng · Yangru Huang · Yifan Zhao · Yongxing Dai · Yonghong Tian

[ West Exhibition Hall B2-B3 ]

Abstract
Anomaly detection is essential for the safety and reliability of autonomous driving systems. Current methods often focus on detection accuracy but neglect response time, which is critical in time-sensitive driving scenarios. In this paper, we introduce real-time anomaly detection for autonomous driving, prioritizing both minimal response time and high accuracy. We propose a novel multimodal asynchronous hybrid network that combines event streams from event cameras with image data from RGB cameras. Our network utilizes the high temporal resolution of event cameras through an asynchronous Graph Neural Network and integrates it with spatial features extracted by a CNN from RGB images. This combination effectively captures both the temporal dynamics and spatial details of the driving environment, enabling swift and precise anomaly detection. Extensive experiments on benchmark datasets show that our approach outperforms existing methods in both accuracy and response time, achieving millisecond-level real-time performance.
Spotlight Poster
Pablo Samuel Castro · Nenad Tomasev · Ankit Anand · Navodita Sharma · Rishika Mohanta · Aparna Dev · Kuba Perlin · Siddhant Jain · Kyle Levin · Noemi Elteto · Will Dabney · Alexander Novikov · Glenn Turner · Maria Eckstein · Nathaniel Daw · Kevin Miller · Kimberly Stachenfeld

[ West Exhibition Hall B2-B3 ]

Abstract
Symbolic models play a key role in cognitive science, expressing computationally precise hypotheses about how the brain implements a cognitive process. Identifying an appropriate model typically requires a great deal of effort and ingenuity on the part of a human scientist.Here, we adapt FunSearch (Romera-Paredes et al. 2024), a recently developed tool that uses Large Language Models (LLMs) in an evolutionary algorithm, to automatically discover symbolic cognitive models that accurately capture human and animal behavior.We consider datasets from three species performing a classic reward-learning task that has been the focus of substantial modeling effort, and find that the discovered programs outperform state-of-the-art cognitive models for each.The discovered programs can readily be interpreted as hypotheses about human and animal cognition, instantiating interpretable symbolic learning and decision-making algorithms. Broadly, these results demonstrate the viability of using LLM-powered program synthesis to propose novel scientific hypotheses regarding mechanisms of human and animal cognition.
Spotlight Poster
Abdulrahman Diaa · Toluwani Aremu · Nils Lukas

[ East Exhibition Hall A-B ]

Abstract
Large Language Models (LLMs) can be misused to spread unwanted content at scale. Content watermarking deters misuse by hiding messages in content, enabling its detection using a secret *watermarking key*. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but robustness is tested only against *non-adaptive* attackers who lack knowledge of the watermarking method and can find only suboptimal attacks. We formulate watermark robustness as an objective function and use preference-based optimization to tune *adaptive* attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks evade detection against all surveyed watermarks, (ii) training against *any* watermark succeeds in evading unseen watermarks, and (iii) optimization-based attacks are cost-effective. Our findings underscore the need to test robustness against adaptively tuned attacks. We release our adaptively tuned paraphrasers at <https://github.com/nilslukas/ada-wm-evasion>.
Spotlight Poster
Harikrishna Metta · Venkatesh Babu Radhakrishnan

[ East Exhibition Hall A-B ]

Abstract
In Machine learning, separating data into classes is a very fundamental problem. A mathematical framework around the classes is presented in this work to deepen the understanding of classes. The classes are defined as vectors in a Vector Space, where addition corresponds to the union of classes, and scalar multiplication resembles set complement of classes. The Zero-Vector in the vector space corresponds to a class referred to as the Metta-Class. This discovery enables numerous applications. One such application, termed 'clear learning' in this work, focuses on learning the true nature (manifold) of the data instead of merely learning a boundary sufficient for classification. Another application, called 'unary class learning', involves learning a single class in isolation rather than learning by comparing two or more classes. Additionally, 'set operations on classes' is another application highlighted in this work. Furthermore, Continual Learning of classes is facilitated by smaller networks. The Metta-Class enables neural networks to learn only the data manifold; therefore, it can also be used for generation of new data. Results for the key applications are shown using the MNIST dataset. To further strengthen the claims, some results are also produced using the CIFAR-10 and ImageNet-1k embeddings. The code supporting these …
Spotlight Poster
Shayan Kiyani · George Pappas · Aaron Roth · Hamed Hassani

[ West Exhibition Hall B2-B3 ]

Abstract
A fundamental question in data-driven decision making is how to quantify the uncertainty of predictions to inform risk-sensitive downstream actions, as often required in domains such as medicine. We develop a decision-theoretic foundation linking prediction sets to risk-averse decision-making, addressing three questions: (1) What is the correct notion of uncertainty quantification for risk-averse decision makers? We prove that prediction sets are optimal for decision makers who wish to optimize their value at risk. (2) What is the optimal policy that a risk averse decision maker should use to map prediction sets to actions? We show that a simple max-min decision policy is optimal for risk-averse decision makers. Finally, (3) How can we derive prediction sets that are optimal for such decision makers? We provide an exact characterization in the population regime and a distribution free finite-sample construction. These insights leads to *Risk-Averse Calibration (RAC)*, a principled algorithm that is both *practical*—exploiting black-box predictions to enhance downstream utility—and *safe*—adhering to user-defined risk thresholds. We experimentally demonstrate RAC's advantages in medical diagnosis and recommendation systems, showing that it substantially improves the trade-off between safety and utility, delivering higher utility than existing methods while avoiding critical errors.
Spotlight Poster
Wenjun Zhang · Liangxiao Jiang · Chaoqun Li

[ East Exhibition Hall A-B ]

Abstract
Label completion serves as a preprocessing approach to handling the sparse crowdsourced label matrix problem, significantly boosting the effectiveness of the downstream label aggregation. In recent advances, worker modeling has been proved to be a powerful strategy to further improve the performance of label completion. However, in real-world scenarios, workers typically annotate only a few instances, leading to insufficient worker modeling and thus limiting the improvement of label completion. To address this issue, we propose a novel transfer learning-based label completion (TLLC) method. Specifically, we first identify all high-confidence instances from the whole crowdsourced data as a source domain and use it to pretrain a Siamese network. The abundant annotated instances in the source domain provide essential knowledge for worker modeling. Then, we transfer the pretrained network to the target domain with the instances annotated by each worker separately, ensuring worker modeling captures unique characteristics of each worker. Finally, we leverage the new embeddings learned by the transferred network to complete each worker’s missing labels. Extensive experiments on several widely used real-world datasets demonstrate the effectiveness of TLLC. Our codes and datasets are available at https://github.com/jiangliangxiao/TLLC.
Spotlight Poster
Qinglin Zhu · Runcong Zhao · Hanqi Yan · Yulan He · Yudong Chen · Lin Gui

[ East Exhibition Hall A-B ]

Abstract
Large Language Models (LLMs) struggle with complex reasoning due to limited diversity and inefficient search. We propose Soft Reasoning, an embedding-based search framework that optimises the embedding of the first token to guide generation. It combines (1) embedding perturbation for controlled exploration and (2) Bayesian optimisation to refine embeddings via a verifier-guided objective, balancing exploration and exploitation. This approach improves reasoning accuracy and coherence while avoiding reliance on heuristic search. Experiments demonstrate superior correctness with minimal computation, making it a scalable, model-agnostic solution.
Spotlight Poster
Tinglin Huang · Tianyu Liu · Mehrtash Babadi · Wengong Jin · ZHITAO YING

[ West Exhibition Hall B2-B3 ]

Abstract
Spatial transcriptomics (ST) has emerged as a powerful technology for bridging histology imaging with gene expression profiling. However, its application has been limited by low throughput and the need for specialized experimental facilities. Prior works sought to predict ST from whole-slide histology images to accelerate this process, but they suffer from two major limitations. First, they do not explicitly model cell-cell interaction as they factorize the joint distribution of whole-slide ST data and predict the gene expression of each spot independently. Second, their encoders struggle with memory constraints due to the large number of spots (often exceeding 10,000) in typical ST datasets. Herein, we propose STFlow, a flow matching generative model that considers cell-cell interaction by modeling the joint distribution of gene expression of an entire slide. It also employs an efficient slide-level encoder with local spatial attention, enabling whole-slide processing without excessive memory overhead. On the recently curated HEST-1k and STImage-1K4M benchmarks, STFlow substantially outperforms state-of-the-art baselines and achieves over 18% relative improvements over the pathology foundation models.
Spotlight Poster
Jinyu Cai · Yunhe Zhang · Fusheng Liu · See-Kiong Ng

[ East Exhibition Hall A-B ]

Abstract
A fundamental challenge in graph-level anomaly detection (GLAD) is the scarcity of anomalous graph data, as the training dataset typically contains only normal graphs or very few anomalies. This imbalance hinders the development of robust detection models. In this paper, we propose **A**nomalous **G**raph **Diff**usion (AGDiff), a framework that explores the potential of diffusion models in generating pseudo-anomalous graphs for GLAD. Unlike existing diffusion-based methods that focus on modeling data normality, AGDiff leverages the latent diffusion framework to incorporate subtle perturbations into graph representations, thereby generating pseudo-anomalous graphs that closely resemble normal ones. By jointly training a classifier to distinguish these generated graph anomalies from normal graphs, AGDiff learns more discriminative decision boundaries. The shift from solely modeling normality to explicitly generating and learning from pseudo graph anomalies enables AGDiff to effectively identify complex anomalous patterns that other approaches might overlook. Comprehensive experimental results demonstrate that the proposed AGDiff significantly outperforms several state-of-the-art GLAD baselines.
Spotlight Poster
Hanshi Sun · Li-Wen Chang · Wenlei Bao · Size Zheng · Ningxin Zheng · Xin Liu · Harry Dong · Yuejie Chi · Beidi Chen

[ East Exhibition Hall A-B ]

Abstract
With the widespread deployment of long-context large language models (LLMs), there has been a growing demand for efficient support of high-throughput inference. However, as the key-value (KV) cache expands with the sequence length, the increasing memory footprint and the need to access it for decoding both result in low throughput when serving long-context LLMs. While various dynamic sparse attention methods have been proposed to accelerate inference while maintaining generation quality, they either fail to sufficiently reduce GPU memory usage or introduce significant decoding latency by offloading the KV cache to the CPU. We present ShadowKV, a high-throughput long-context LLM inference system that stores the low-rank key cache and offloads the value cache to reduce the memory footprint for larger batch sizes and longer sequences. To minimize decoding latency, ShadowKV employs an accurate KV selection strategy that reconstructs minimal sparse KV pairs on-the-fly. By evaluating ShadowKV on benchmarks like RULER, LongBench, and models such as Llama-3.1-8B and GLM-4-9B-1M, we demonstrate that it achieves up to 6$\times$ larger batch sizes and 3.04$\times$ higher throughput on an A100 GPU without sacrificing accuracy, even surpassing the performance achievable with infinite batch size under the assumption of infinite GPU memory.
Spotlight Poster
Saketh Bachu · Erfan Shayegani · Rohit Lal · Trishna Chakraborty · Arindam Dutta · Chengyu Song · Yue Dong · Nael Abu-Ghazaleh · Amit Roy-Chowdhury

[ East Exhibition Hall A-B ]

Abstract
Vision-language models (VLMs) have improved significantly in their capabilities, but their complex architecture makes their safety alignment challenging. In this paper, we reveal an uneven distribution of harmful information across the intermediate layers of the image encoder and show that skipping a certain set of layers and exiting early can increase the chance of the VLM generating harmful responses. We call it as “Image enCoder Early-exiT” based vulnerability (ICET). Our experiments across three VLMs: LLaVA-1.5, LLaVA-NeXT, and Llama 3.2 show that performing early exits from the image encoder significantly increases the likelihood of generating harmful outputs. To tackle this, we propose a simple yet effective modification of the Clipped-Proximal Policy Optimization (Clip-PPO) algorithm for performing layer-wise multi-modal RLHF for VLMs. We term this as Layer-Wise PPO (L-PPO). We evaluate our L-PPO algorithm across three multi-modal datasets and show that it consistently reduces the harmfulness caused by early exits.
Spotlight Poster
Diyuan Wu · Marco Mondelli

[ East Exhibition Hall A-B ]

Abstract
Neural Collapse is a phenomenon where the last-layer representations of a well-trained neural network converge to a highly structured geometry. In this paper, we focus on its first (and most basic) property, known as NC1: the within-class variability vanishes. While prior theoretical studies establish the occurrence of NC1 via the data-agnostic unconstrained features model, our work adopts a data-specific perspective, analyzing NC1 in a three-layer neural network, with the first two layers operating in the mean-field regime and followed by a linear layer. In particular, we establish a fundamental connection between NC1 and the loss landscape: we prove that points with small empirical loss and gradient norm (thus, close to being stationary) approximately satisfy NC1, and the closeness to NC1 is controlled by the residual loss and gradient norm. We then show that (i) gradient flow on the mean squared error converges to NC1 solutions with small empirical loss, and (ii) for well-separated data distributions, both NC1 and vanishing test loss are achieved simultaneously. This aligns with the empirical observation that NC1 emerges during training while models attain near-zero test error. Overall, our results demonstrate that NC1 arises from gradient training due to the properties of the loss landscape, and …
Spotlight Poster
Xingyu Wu · Jibin Wu · Yu Zhou · Liang Feng · KC Tan

[ East Exhibition Hall A-B ]

Abstract
Algorithm selection aims to identify the optimal performing algorithm before execution. Existing techniques typically focus on the observed correlations between algorithm performance and meta-features. However, little research has explored the underlying mechanisms of algorithm selection, specifically what characteristics an algorithm must possess to effectively tackle problems with certain feature values. This gap not only limits the explainability but also makes existing models vulnerable to data bias and distribution shift. This paper introduces directed acyclic graph (DAG) to describe this mechanism, proposing a novel modeling paradigm that aligns more closely with the fundamental logic of algorithm selection. By leveraging DAG to characterize the algorithm feature distribution conditioned on problem features, our approach enhances robustness against marginal distribution changes and allows for finer-grained predictions through the reconstruction of optimal algorithm features, with the final decision relying on differences between reconstructed and rejected algorithm features. Furthermore, we demonstrate that, the learned DAG and the proposed counterfactual calculations offer our approach with both model-level and instance-level explainability.
Spotlight Poster
Gaotang Li · Yuzhong Chen · Hanghang Tong

[ East Exhibition Hall A-B ]

Abstract
Language Models (LMs) often encounter knowledge conflicts when parametric memory contradicts contextual knowledge. Previous works attribute this conflict to the interplay between "memory heads" and "context heads", attention heads assumed to promote either memory or context exclusively. In this study, we go beyond this fundamental assumption by uncovering a critical phenomenon we term the *superposition of contextual information and parametric memory*, where highly influential attention heads simultaneously contribute to both memory and context. Building upon this insight, we propose Just Run Twice (JuICE), a test-time attention intervention method that steers LMs toward either parametric beliefs or contextual knowledge without requiring fine-tuning. JuICE identifies a set of reliable attention heads and leverages a dual-run approach to mitigate the superposition effects. Extensive experiments across 11 datasets and 6 model architectures demonstrate that JuICE sets the new state-of-the-art performance and robust generalization, achieving significant and consistent improvement across different domains under various conflict types. Finally, we theoretically analyze knowledge conflict and the superposition of contextual information and parametric memory in attention heads, which further elucidates the effectiveness of JuICE in these settings. Our code is available at https://github.com/GaotangLi/JUICE.
Spotlight Poster
Damian Smith · Harvey Mannering · Antonia Marcu

[ East Exhibition Hall A-B ]

Abstract
A common belief in the representational comparison literature is that if two representations can be functionally aligned, they must capture similar information. In this paper we focus on model stitching and show that models can be functionally aligned, but represent very different information. Firstly, we show that discriminative models with very different biases can be stitched together. We then show that models trained to solve entirely different tasks on different data modalities, and even clustered random noise, can be successfully stitched into MNIST or ImageNet-trained models. We end with a discussion of the wider impact of our results on the community's current beliefs. Overall, our paper draws attention to the need to correctly interpret the results of such functional similarity measures and highlights the need for approaches that capture informational similarity.
Spotlight Poster
Marvin F, da Silva · Felix Dangel · Sageev Oore

[ East Exhibition Hall A-B ]

Abstract
The concept of sharpness has been successfully applied to traditional architectures like MLPs and CNNs to predict their generalization. For transformers, however, recent work reported weak correlation between flatness and generalization. We argue that existing sharpness measures fail for transformers, because they have much richer symmetries in their attention mechanism that induce directions in parameter space along which the network or its loss remain identical. We posit that sharpness must account fully for these symmetries, and thus we redefine it on a quotient manifold that results from quotienting out the transformer symmetries, thereby removing their ambiguities. Leveraging tools from Riemannian geometry, we propose a fully general notion of sharpness, in terms of a geodesic ball on the symmetry-corrected quotient manifold. In practice, we need to resort to approximating the geodesics. Doing so up to first order yields existing adaptive sharpness measures, and we demonstrate that including higher-order terms is crucial to recover correlation with generalization. We present results on diagonal networks with synthetic data, and show that our geodesic sharpness reveals strong correlation for real-world transformers on both text and image classification tasks.
Spotlight Poster
Etowah Adams · Liam Bai · Minji Lee · Yiyang Yu · Mohammed AlQuraishi

[ West Exhibition Hall B2-B3 ]

Abstract
Protein language models (pLMs) are powerful predictors of protein structure and function, learning through unsupervised training on millions of protein sequences. pLMs are thought to capture common motifs in protein sequences, but the specifics of pLM features are not well understood. Identifying these features would not only shed light on how pLMs work, but potentially uncover novel protein biology––studying the model to study the biology. Motivated by this, we train sparse autoencoders (SAEs) on the residual stream of a pLM, ESM-2. By characterizing SAE features, we determine that pLMs use a combination of generic features and family-specific features to represent a protein. In addition, we demonstrate how known sequence determinants of properties such as thermostability and subcellular localization can be identified by linear probing of SAE features. For predictive features without known functional associations, we hypothesize their role in unknown mechanisms and provide visualization tools to aid their interpretation. Our study gives a better understanding of the limitations of pLMs, and demonstrates how SAE features can be used to help generate hypotheses for biological mechanisms. We release our code, model weights, and feature visualizer.
Spotlight Poster
Fan Li · Xuan Wang · Min Qi · Zhaoxiang Zhang · yuelei xu

[ West Exhibition Hall B2-B3 ]

Abstract
Domain Generalized Semantic Segmentation (DGSS) trains a model on a labeled source domain to generalize to unseen target domains with consistent contextual distribution and varying visual appearance.Most existing methods rely on domain randomization or data generation but struggle to capture the underlying scene distribution, resulting in the loss of useful semantic information. Inspired by the diffusion model's capability to generate diverse variations within a given scene context, we consider harnessing its rich prior knowledge of scene distribution to tackle the challenging DGSS task.In this paper, we propose a novel agent \textbf{Query}-driven learning framework based on \textbf{Diff}usion model guidance for DGSS, named QueryDiff. Our recipe comprises three key ingredients: (1) generating agent queries from segmentation features to aggregate semantic information about instances within the scene; (2) learning the inherent semantic distribution of the scene through agent queries guided by diffusion features; (3) refining segmentation features using optimized agent queries for robust mask predictions.Extensive experiments across various settings demonstrate that our method significantly outperforms previous state-of-the-art methods. Notably, it enhances the model's ability to generalize effectively to extreme domains, such as cubist art styles. Code is available at https://github.com/FanLiHub/QueryDiff.
Spotlight Poster
Yuanhong Zhang · Muyao Yuan · Weizhan Zhang · Tieliang Gong · Wen Wen · Jiangyong Ying · Weijie Shi

[ West Exhibition Hall B2-B3 ]

Abstract
The Segment Anything Model (SAM), a vision foundation model, exhibits impressive zero-shot capabilities in general tasks but struggles in specialized domains. Parameter-efficient fine-tuning (PEFT) is a promising approach to unleash the potential of SAM in novel scenarios. However, existing PEFT methods for SAM neglect the domain-invariant relations encoded in the pre-trained model. To bridge this gap, we propose InfoSAM, an information-theoretic approach that enhances SAM fine-tuning by distilling and preserving its pre-trained segmentation knowledge. Specifically, we formulate the knowledge transfer process as two novel mutual information-based objectives: (i) to compress the domain-invariant relation extracted from pre-trained SAM, excluding pseudo-invariant information as possible, and (ii) to maximize mutual information between the relational knowledge learned by the teacher (pre-trained SAM) and the student (fine-tuned model). The proposed InfoSAM establishes a robust distillation framework for PEFT of SAM. Extensive experiments across diverse benchmarks validate InfoSAM's effectiveness in improving SAM family's performance on real-world tasks, demonstrating its adaptability and superiority in handling specialized scenarios. The code and models are available at https://muyaoyuan.github.io/InfoSAM_Page.
Spotlight Poster
Elias Kempf · Simon Schrodi · Max Argus · Thomas Brox

[ East Exhibition Hall A-B ]

Abstract
The remarkable generalization performance of contrastive vision-language models like CLIP is often attributed to the diversity of their training distributions. However, key questions remain unanswered: Can CLIP generalize to an entirely unseen domain when trained on a diverse mixture of domains (domain generalization)? Can it generalize to unseen classes within partially seen domains (compositional generalization)? What factors affect such generalization? To answer these questions, we trained CLIP models on systematically constructed training distributions with controlled domain diversity and object class exposure. Our experiments show that domain diversity is essential for both domain and compositional generalization, yet compositional generalization can be surprisingly weaker than domain generalization when the training distribution contains a suboptimal subset of the test domain. Through data-centric *and* mechanistic analyses, we find that successful generalization requires the learning of sufficiently shared representations in intermediate layers and circuits.
Spotlight Poster
Antonis Vasileiou · Ben Finkelshtein · Floris Geerts · Ron Levie · Christopher Morris

[ East Exhibition Hall A-B ]

Abstract
The expressive power of message-passing graph neural networks (MPNNs) is reasonably well understood, primarily through combinatorial techniques from graph isomorphism testing. However, MPNNs' generalization abilities---making meaningful predictions beyond the training set---remain less explored. Current generalization analyses often overlook graph structure, limit the focus to specific aggregation functions, and assume the impractical, hard-to-optimize $0$-$1$ loss function. Here, we extend recent advances in graph similarity theory to assess the influence of graph structure, aggregation, and loss functions on MPNNs' generalization abilities. Our empirical study supports our theoretical insights, improving our understanding of MPNNs' generalization properties.
Spotlight Poster
Hong-Ming Chiu · Hao Chen · Huan Zhang · Richard Zhang

[ East Exhibition Hall A-B ]

Abstract
Neural network verifiers based on linear bound propagation scale impressively to massive models but can be surprisingly loose when neuron coupling is crucial. Conversely, semidefinite programming (SDP) verifiers capture inter-neuron coupling naturally, but their cubic complexity restricts them to only small models. In this paper, we propose SDP-CROWN, a novel hybrid verification framework that combines the tightness of SDP relaxations with the scalability of bound-propagation verifiers. At the core of SDP-CROWN is a new linear bound---derived via SDP principles---that explicitly captures $\ell_{2}$-norm-based inter-neuron coupling while adding only one extra parameter per layer. This bound can be integrated seamlessly into any linear bound-propagation pipeline, preserving the inherent scalability of such methods yet significantly improving tightness. In theory, we prove that our inter-neuron bound can be up to a factor of $\sqrt{n}$ tighter than traditional per-neuron bounds. In practice, when incorporated into the state-of-the-art $\alpha$-CROWN verifier, we observe markedly improved verification performance on large models with up to 65 thousand neurons and 2.47 million parameters, achieving tightness that approaches that of costly SDP-based methods.
Spotlight Poster
Cosimo Gregucci · Bo Xiong · Daniel Hernández · Lorenzo Loconte · Pasquale Minervini · Steffen Staab · Antonio Vergari

[ East Exhibition Hall A-B ]

Abstract
Complex query answering (CQA) on knowledge graphs (KGs) is gaining momentum as a challenging reasoning task.In this paper, we show that the current benchmarks for CQA might not be as *complex* as we think, as the way they are built distorts our perception of progress in this field.For example, we find that in these benchmarks most queries (up to 98% for some query types) can be reduced to simpler problems, e.g., link prediction, where only one link needs to be predicted.The performance of state-of-the-art CQA models decreses significantly when such models are evaluated on queries that cannot be reduced to easier types.Thus, we propose a set of more challenging benchmarks composed of queries that *require* models to reason over multiple hops and better reflect the construction of real-world KGs.In a systematic empirical investigation, the new benchmarks show that current methods leave much to be desired from current CQA methods.
Spotlight Poster
Hao Chen · Yujin Han · Fangyi Chen · Xiang Li · Yidong Wang · Jindong Wang · Ze Wang · Zicheng Liu · Difan Zou · Bhiksha Raj

[ East Exhibition Hall A-B ]

Abstract
Recent advances in latent diffusion models have demonstrated their effectiveness for high-resolution image synthesis. However, the properties of the latent space from tokenizer for better learning and generation of diffusion models remain under-explored. Theoretically and empirically, we find that improved generation quality is closely tied to the latent distributions with better structure, such as the ones with fewer Gaussian Mixture modes and more discriminative features. Motivated by these insights, we propose MAETok, an autoencoder (AE) leveraging mask modeling to learn semantically rich latent space while maintaining reconstruction fidelity. Extensive experiments validate our analysis, demonstrating that the variational form of autoencoders is not necessary, and a discriminative latent space from AE alone enables state-of-the-art performance on ImageNet generation using only 128 tokens. MAETok achieves significant practical improvements, enabling a gFID of 1.69 with 76× faster training and 31× higher inference throughput for 512×512 generation. Our findings show that the structure of the latent space, rather than variational constraints, is crucial for effective diffusion models. Code and trained models will be released.
Spotlight Poster
John Schultz · Jakub Adamek · Matej Jusup · Marc Lanctot · Michael Kaisers · Sarah Perrin · Daniel Hennes · Jeremy Shar · Cannada Lewis · Anian Ruoss · Tom Zahavy · Petar Veličković · Laurel Prince · Satinder Singh · Eric Malmi · Nenad Tomasev

[ East Exhibition Hall A-B ]

Abstract
Advancing planning and reasoning capabilities of Large Language Models (LLMs) is one of the key prerequisites towards unlocking their potential for performing reliably in complex and impactful domains. In this paper, we aim to demonstrate this across board games (Chess, Fischer Random / Chess960, Connect Four, and Hex), and we show that search-based planning can yield significant improvements in LLM game-playing strength. We introduce, compare and contrast two major approaches: In *external search*, the model guides Monte Carlo Tree Search (MCTS) rollouts and evaluations without calls to an external game engine, and in *internal search*, the model is trained to generate in-context a linearized tree of search and a resulting final choice. Both build on a language model pre-trained on relevant domain knowledge, reliably capturing the transition and value functions in the respective environments, with minimal hallucinations. We evaluate our LLM search implementations against game-specific state-of-the-art engines, showcasing substantial improvements in strength over the base model, and reaching Grandmaster-level performance in chess while operating closer to the human search budget. Our proposed approach, combining search with domain knowledge, is not specific to board games, hinting at more general future applications.
Spotlight Poster
Guibin Zhang · Yanwei Yue · Xiangguo Sun · Guancheng Wan · Miao Yu · Junfeng Fang · Kun Wang · Tianlong Chen · Dawei Cheng

[ East Exhibition Hall A-B ]

Abstract
Recent advancements in large language model (LLM)-based agents have demonstrated that collective intelligence can significantly surpass the capabilities of individual agents, primarily due to well-crafted inter-agent communication topologies. Despite the diverse and high-performing designs available, practitioners often face confusion when selecting the most effective pipeline for their specific task: \textit{Which topology is the best choice for my task, avoiding unnecessary communication token overhead while ensuring high-quality solution?} In response to this dilemma, we introduce G-Designer, an adaptive, efficient, and robust solution for multi-agent deployment, which dynamically designs task-aware, customized communication topologies. Specifically, G-Designer models the multi-agent system as a multi-agent network, leveraging a variational graph auto-encoder to encode both the nodes (agents) and a task-specific virtual node, and decodes a task-adaptive and high-performing communication topology. Extensive experiments on six benchmarks showcase that G-Designer is: \textbf{(1) high-performing}, achieving superior results on MMLU with accuracy at $84.50\\%$ and on HumanEval with pass@1 at $89.90\\%$; \textbf{(2) task-adaptive}, architecting communication protocols tailored to task difficulty, reducing token consumption by up to $95.33\\%$ on HumanEval; and \textbf{(3) adversarially robust}, defending against agent adversarial attacks with merely $0.3\\%$ accuracy drop.
Spotlight Poster
Zheyang Xiong · Jack Cai · John Cooper · Albert Ge · Vasilis Papageorgiou · Zack Sifakis · Angeliki Giannou · Ziqian Lin · Liu Yang · Saurabh Agarwal · Grigorios Chrysos · Samet Oymak · Kangwook Lee · Dimitris Papailiopoulos

[ East Exhibition Hall A-B ]

Abstract
Large Language Models (LLMs) have demonstrated remarkable in-context learning (ICL) capabilities. In this study, we explore a surprising phenomenon related to ICL: LLMs can perform multiple, computationally distinct ICL tasks simultaneously, during a single inference call, a capability we term task superposition". We provide empirical evidence of this phenomenon across various LLM families and scales and show that this phenomenon emerges even if we train the model to in-context learn one task at a time. We offer theoretical explanations that this capability is well within the expressive power of transformers. We also explore how LLMs internally compose task vectors during superposition. Furthermore, we show that larger models can solve more ICL tasks in parallel, and better calibrate their output distribution. Our findings offer insights into the latent capabilities of LLMs, further substantiate the perspective of "LLMs as superposition of simulators", and raise questions about the mechanisms enabling simultaneous task execution.
Spotlight Poster
Anders Aamand · Fabrizio Boninsegna · Abigail Gentle · Jacob Imola · Rasmus Pagh

[ East Exhibition Hall A-B ]

Abstract
Distributed data analysis is a large and growing field driven by a massive proliferation of user devices, and by privacy concerns surrounding the centralised storage of data. We consider two \emph{adaptive} algorithms for estimating one quantile (e.g.~the median) when each user holds a single data point lying in a domain $[B]$ that can be queried once through a private mechanism; one under local differential privacy (LDP) and another for shuffle differential privacy (shuffle-DP). In the adaptive setting we present an $\varepsilon$-LDP algorithm which can estimate any quantile within error $\alpha$ only requiring $O(\frac{\log B}{\varepsilon^2\alpha^2})$ users, and an $(\varepsilon,\delta)$-shuffle DP algorithm requiring only $\widetilde{O}((\frac{1}{\varepsilon^2}+\frac{1}{\alpha^2})\log B)$ users. Prior (nonadaptive) algorithms require more users by several logarithmic factors in $B$. We further provide a matching lower bound for adaptive protocols, showing that our LDP algorithm is optimal in the low-$\varepsilon$ regime. Additionally, we establish lower bounds against non-adaptive protocols which paired with our understanding of the adaptive case, proves a fundamental separation between these models.
Spotlight Poster
Attila Szász · Balázs Bánhelyi · Mark Jelasity

[ East Exhibition Hall A-B ]

Abstract
The ultimate goal of verification is to guarantee the safety of deployed neural networks. Here, we claim that all the state-of-the-art verifiers we are aware of fail to reach this goal. Our key insight is that theoretical soundness (bounding the full-precision output while computing with floating point) does not imply practical soundness (bounding the floating point output in a potentially stochastic environment). We prove this observation for the approaches that are currently used to achieve provable theoretical soundness, such as interval analysis and its variants. We also argue that achieving practical soundness is significantly harder computationally. We support our claims empirically as well by evaluating several well-known verification methods. To mislead the verifiers, we create adversarial networks that detect and exploit features of the deployment environment, such as the order and precision of floating point operations. We demonstrate that all the tested verifiers are vulnerable to our new deployment-specific attacks, which proves that they are not practically sound.
Spotlight Poster
Seung Lee · Hojoon Kim · Yutack Park · Dawoon Jeong · Seungwu Han · Yeonhong Park · Jae W. Lee

[ West Exhibition Hall B2-B3 ]

Abstract
Machine Learning Interatomic Potentials (MLIPs) enable efficient molecular dynamics (MD) simulations with high accuracy. While equivariant MLIPs achieve state-of-the-art accuracy, they face significant computational bottlenecks centered around their Tensor-Product layer, which account for up to 75\% of training time and cause substantial memory overhead. We present FlashTP, a highly optimized tensor-product library that addresses these inefficiencies through kernel fusion, sparse computation, and path-aggregated execution. FlashTP achieves up to 41.6$\times$ and 60.8$\times$ kernel speedups over _e3nn_ and NVIDIA cuEquivariance, respectively. For SevenNet-l3i5, it delivers 4.2$\times$ and 3.5$\times$ speedup while reducing peak memory usage by 6.3$\times$ and 6.2$\times$ for inference and training, respectively. The code is available at https://github.com/SNU-ARC/flashTP.
Spotlight Poster
Ken Ziyu Liu · Christopher A. Choquette Choo · Matthew Jagielski · Peter Kairouz · Sanmi Koyejo · Percy Liang · Nicolas Papernot

[ East Exhibition Hall A-B ]

Abstract
An important question today is whether a given text was used to train a large language model (LLM). A completion test is often employed: check if the LLM completes a sufficiently complex text. This, however, requires a ground-truth definition of membership; most commonly, it is defined as a member based on the n-gram overlap between the target text and any text in the dataset. In this work, we demonstrate that this n-gram based membership definition can be effectively gamed. We study scenarios where sequences are non-members for a given n and we find that completion tests still succeed. We find many natural cases of this phenomenon by retraining LLMs from scratch after removing all training samples that were completed; these cases include exact duplicates, near-duplicates, and even short overlaps. They showcase that it is difficult to find a single viable choice of n for membership definitions. Using these insights, we design adversarial datasets that can cause a given target sequence to be completed without containing it, for any reasonable choice of n. Our findings highlight the inadequacy of n-gram membership, suggesting membership definitions fail to account for auxiliary information available to the training algorithm.
Spotlight Poster
Thibaut Issenhuth · Sangchul Lee · Ludovic Dos Santos · Jean-Yves Franceschi · Chansoo Kim · alain rakotomamonjy

[ East Exhibition Hall A-B ]

Abstract
Consistency models imitate the multi-step sampling of score-based diffusion in a single forward pass of a neural network.They can be learned in two ways: consistency distillation and consistency training. The former relies on the true velocity field of the corresponding differential equation, approximated by a pre-trained neural network.In contrast, the latter uses a single-sample Monte Carlo estimate of this velocity field.The related estimation error induces a discrepancy between consistency distillation and training that, we show, still holds in the continuous-time limit.To alleviate this issue, we propose a novel flow that transports noisy data towards their corresponding outputs derived from a consistency model.We prove that this flow reduces the previously identified discrepancy and the noise-data transport cost.Consequently, our method not only accelerates consistency training convergence but also enhances its overall performance. The code is available at https://github.com/thibautissenhuth/consistency_GC.
Spotlight Poster
Giovanni Luca Marchetti · Vahid Shahverdi · Stefano Mereta · Matthew Trager · Kathlén Kohn

[ East Exhibition Hall A-B ]

Abstract
In this position paper, we promote the study of function spaces parameterized by machine learning models through the lens of algebraic geometry. To this end, we focus on algebraic models, such as neural networks with polynomial activations, whose associated function spaces are semi-algebraic varieties. We outline a dictionary between algebro-geometric invariants of these varieties, such as dimension, degree, and singularities, and fundamental aspects of machine learning, such as sample complexity, expressivity, training dynamics, and implicit bias. Along the way, we review the literature and discuss ideas beyond the algebraic domain. This work lays the foundations of a research direction bridging algebraic geometry and deep learning, that we refer to as neuroalgebraic geometry.
Spotlight Poster
Hojoon Lee · Youngdo Lee · Takuma Seno · Donghu Kim · Peter Stone · Jaegul Choo

[ West Exhibition Hall B2-B3 ]

Abstract
Scaling up the model size and computation has brought consistent performance improvements in supervised learning. However, this lesson often fails to apply to reinforcement learning (RL) because training the model on non-stationary data easily leads to overfitting and unstable optimization.In response, we introduce SimbaV2, a novel RL architecture designed to stabilize optimization by (i) constraining the growth of weight and feature norm by hyperspherical normalization; and (ii) using a distributional value estimation with reward scaling to maintain stable gradients under varying reward magnitudes. Using the soft actor-critic as a base algorithm, SimbaV2 scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks across 4 domains.
Spotlight Poster
Huigen Ye · Hua Xu · An Yan · Yaoyang Cheng

[ West Exhibition Hall B2-B3 ]

Abstract
Large Neighborhood Search (LNS) is a widely used method for solving large-scale Mixed Integer Linear Programming (MILP) problems. The effectiveness of LNS crucially depends on the choice of the search neighborhood. However, existing strategies either rely on expert knowledge or computationally expensive Machine Learning (ML) approaches, both of which struggle to scale effectively for large problems. To address this, we propose LLM-LNS, a novel Large Language Model (LLM)-driven LNS framework for large-scale MILP problems. Our approach introduces a dual-layer self-evolutionary LLM agent to automate neighborhood selection, discovering effective strategies with scant small-scale training data that generalize well to large-scale MILPs. The inner layer evolves heuristic strategies to ensure convergence, while the outer layer evolves evolutionary prompt strategies to maintain diversity. Experimental results demonstrate that the proposed dual-layer agent outperforms state-of-the-art agents such as FunSearch and EOH. Furthermore, the full LLM-LNS framework surpasses manually designed LNS algorithms like ACP, ML-based LNS methods like CL-LNS, and large-scale solvers such as Gurobi and SCIP. It also achieves superior performance compared to advanced ML-based MILP optimization frameworks like GNN&GBDT and Light-MILPopt, further validating the effectiveness of our approach.
Spotlight Poster
Hao Li · Qi Lv · Rui Shao · Xiang Deng · Yinchuan Li · Jianye Hao · Liqiang Nie

[ East Exhibition Hall A-B ]

Abstract
Transforming complex actions into discrete skill abstractions has demonstrated strong potential for robotic manipulation.Existing approaches mainly leverage latent variable models, e.g., VQ-VAE, to learn skill abstractions through learned vectors (codebooks), while they suffer from codebook collapse and modeling the causal relationship between learned skills. To address these limitations, we present **S**kill **T**raining with **A**ugmented **R**otation (**STAR**), a framework that advances both skill learning and composition to complete complex behaviors. Specifically, to prevent codebook collapse, we devise rotation-augmented residual skill quantization (RaRSQ).It encodes relative angles between encoder outputs into the gradient flow by rotation-based gradient mechanism. Points within the same skill code are forced to be either pushed apart or pulled closer together depending on gradient directions.Further, to capture the casual relationship between skills, we present causal skill transformer (CST) which explicitly models dependencies between skill representations through an autoregressive mechanism for coherent action generation.Extensive experiments demonstrate the superiority of STAR on both LIBERO benchmark and realworld tasks, with around 12% improvement over the baselines.
Spotlight Poster
Unique Subedi · Ambuj Tewari

[ West Exhibition Hall B2-B3 ]

Abstract
We study active data collection strategies for operator learning when the target operator is linear and the input functions are drawn from a mean-zero stochastic process with continuous covariance kernels. With an active data collection strategy, we establish an error convergence rate in terms of the decay rate of the eigenvalues of the covariance kernel. We can achieve arbitrarily fast error convergence rates with sufficiently rapid eigenvalue decay of the covariance kernels. This contrasts with thepassive (i.i.d.) data collection strategies, where the convergence rate is never faster than linear decay ($\sim n^{-1}$). In fact, for our setting, we show a \emph{non-vanishing} lower bound for any passive data collection strategy, regardless of the eigenvalues decay rate of the covariance kernel. Overall, our results show the benefit of active data collection strategies in operator learning over their passive counterparts.
Spotlight Poster
Charita Dellaporta · Patrick O'Hara · Theodoros Damoulas

[ East Exhibition Hall A-B ]

Abstract
Decision making under uncertainty is challenging as the data-generating process (DGP) is often unknown. Bayesian inference proceeds by estimating the DGP through posterior beliefs on the model’s parameters. However, minimising the expected risk under these beliefs can lead to suboptimal decisions due to model uncertainty or limited, noisy observations. To address this, we introduce Distributionally Robust Optimisation with Bayesian Ambiguity Sets (DRO-BAS) which hedges against model uncertainty by optimising the worst-case risk over a posterior-informed ambiguity set. We provide two such sets, based on the posterior expectation (DRO-BAS(PE)) or the posterior predictive (DRO-BAS(PP)) and prove that both admit, under conditions, strong dual formulations leading to efficient single-stage stochastic programs which are solved with a sample average approximation. For DRO-BAS(PE), this covers all conjugate exponential family members while for DRO-BAS(PP) this is shown under conditions on the predictive's moment generating function. Our DRO-BAS formulations outperform existing Bayesian DRO on the Newsvendor problem and achieve faster solve times with comparable robustness on the Portfolio problem.
Spotlight Poster
Ruichuan Huang · Jiawei Zhang · Ahmet Alacaoglu

[ West Exhibition Hall B2-B3 ]

Abstract
We propose smoothed primal-dual algorithms for solving stochastic nonconvex optimization problems with linear \emph{inequality} constraints. Our algorithms are single-loop and only require a single (or two) samples of stochastic gradients at each iteration. A defining feature of our algorithm is that it is based on an inexact gradient descent framework for the Moreau envelope, where the gradient of the Moreau envelope is estimated using one step of a stochastic primal-dual (linearized) augmented Lagrangian algorithm. To handle inequality constraints and stochasticity, we combine the recently established global error bounds in constrained optimization with a Moreau envelope-based analysis of stochastic proximal algorithms. We establish the optimal (in their respective cases) $O(\varepsilon^{-4})$ and $O(\varepsilon^{-3})$ sample complexity guarantees for our algorithms and provide extensions to stochastic linear constraints. Unlike existing methods, iterations of our algorithms are free of subproblems, large batch sizes or increasing penalty parameters in their iterations and they use dual variable updates to ensure feasibility.
Spotlight Poster
Taj Jones-McCormick · Aukosh Jagannath · Subhabrata Sen

[ West Exhibition Hall B2-B3 ]

Abstract
Unsupervised pre-training and transfer learning are commonly used techniques to initialize training algorithms for neural networks, particularly in settings with limited labeled data. In this paper, we study the effects of unsupervised pre-training and transfer learning on the sample complexity of high-dimensional supervised learning. Specifically, we consider the problem of training a single-layer neural network via online stochastic gradient descent. We establish that pre-training and transfer learning (under concept shift) reduce sample complexity by polynomial factors (in the dimension) under very general assumptions. We also uncover some surprising settings where pre-training grants exponential improvement over random initialization in terms of sample complexity.
Spotlight Poster
Ali Ebrahimpour-Boroojeny · Hari Sundaram · Varun Chandrasekaran

[ East Exhibition Hall A-B ]

Abstract
Machine unlearning, where users can request the deletion of a forget dataset, is becoming increasingly important because of numerous privacy regulations. Initial works on "exact'' unlearning (e.g., retraining) incur large computational overheads. However, while computationally inexpensive, "approximate'' methods have fallen short of reaching the effectiveness of exact unlearning: models produced fail to obtain comparable accuracy and prediction confidence on both the forget and test (i.e., unseen) dataset. Exploiting this observation, we propose a new unlearning method, Adversarial Machine UNlearning (AMUN), that outperforms prior state-of-the-art (SOTA) methods for image classification. AMUN lowers the confidence of the model on the forget samples by fine-tuning the model on their corresponding adversarial examples. Adversarial examples naturally belong to the distribution imposed by the model on the input space; fine-tuning the model on the adversarial examples closest to the corresponding forget samples (a) localizes the changes to the decision boundary of the model around each forget sample and (b) avoids drastic changes to the global behavior of the model, thereby preserving the model's accuracy on test samples. Using AMUN for unlearning a random 10% of CIFAR-10 samples, we observe that even SOTA membership inference attacks cannot do better than random guessing.
Spotlight Poster
Miriam Doh · Benedikt Höltgen · Piera Riccio · Nuria Oliver

[ East Exhibition Hall A-B ]

Abstract
This position paper critiques the reliance on rigid racial taxonomies in machine learning, exposing their U.S.-centric nature and lack of global applicability—particularly in Europe, where race categories are not commonly used. These classifications oversimplify racial identity, erasing the experiences of mixed-race individuals and reinforcing outdated essentialist views that contradict the social construction of race. We suggest research agendas in machine learning that move beyond categorical variables to better address discrimination and social inequality.
Spotlight Poster
Junchao Zhou · Yao Lu · Jie Wen · Guangming Lu

[ West Exhibition Hall B2-B3 ]

Abstract
Image steganography hides multiple images for multiple recipients into a single cover image. All secret images are usually revealed without authentication, which reduces security among multiple recipients. It is elegant to design an authentication mechanism for isolated reception. We explore such mechanism through sufficient experiments, and uncover that additional authentication information will affect the distribution of hidden information and occupy more hiding space of the cover image. This severely decreases effectiveness and efficiency in large-capacity hiding. To overcome such a challenge, we first prove the authentication feasibility within image steganography. Then, this paper proposes an image steganography network collaborating with separate authentication and efficient scheme. Specifically, multiple pairs of lock-key are generated during hiding and revealing. Unlike traditional methods, our method has two stages to make appropriate distribution adaptation between locks and secret images, simultaneously extracting more reasonable primary information from secret images, which can release hiding space of the cover image to some extent. Furthermore, due to separate authentication, fused information can be hidden in parallel with a single network rather than traditional serial hiding with multiple networks, which can largely decrease the model size. Extensive experiments demonstrate that the proposed method achieves more secure, effective, and efficient image …
Spotlight Poster
Chengyuan Li · Liangxiao Jiang · Wenjun Zhang · Liangjun Yu · Huan Zhang

[ East Exhibition Hall A-B ]

Abstract
Due to its simplicity, effectiveness and robustness, naive Bayes (NB) has continued to be one of the top 10 data mining algorithms. To improve its performance, a large number of improved algorithms have been proposed in the last few decades. However, in addition to Gaussian naive Bayes (GNB), there is little work on numerical attributes. At the same time, none of them takes into account the correlations among instances. To fill this gap, we propose a novel algorithm called instance correlation graph-based naive Bayes (ICGNB). Specifically, it first uses original attributes to construct an instance correlation graph (ICG) to represent the correlations among instances. Then, it employs a variational graph auto-encoder (VGAE) to generate new attributes from the constructed ICG and uses them to augment original attributes.Finally, it weights each augmented attribute to alleviate the attribute redundancy and builds GNB on the weighted attributes. The experimental results on tens of datasets show that ICGNB significantly outperforms its deserved competitors.Our codes and datasets are available at https://github.com/jiangliangxiao/ICGNB.
Spotlight Poster
Jade Garcia Bourrée · Augustin Godinot · Sayan Biswas · Anne-Marie Kermarrec · Erwan Le Merrer · Gilles Tredan · Martijn de Vos · Milos Vujasinovic

[ East Exhibition Hall A-B ]

Abstract
Among the many technical challenges to enforcing AI regulations, one crucial yet underexplored problem is the risk of audit manipulation.This manipulation occurs when a platform deliberately alters its answers to a regulator to pass an audit without modifying its answers to other users.In this paper, we introduce a novel approach to manipulation-proof auditing by taking into account the auditor's prior knowledge of the task solved by the platform. We first demonstrate that regulators must not rely on public priors (e.g. a public dataset), as platforms could easily fool the auditor in such cases. We then formally establish the conditions under which an auditor can prevent audit manipulations using prior knowledge about the ground truth. Finally, our experiments with two standard datasets illustrate the maximum level of unfairness a platform can hide before being detected as malicious.Our formalization and generalization of manipulation-proof auditing with a prior opens up new research directions for more robust fairness audits.
Spotlight Poster
Chenlu Ye · Yujia Jin · Alekh Agarwal · Tong Zhang

[ West Exhibition Hall B2-B3 ]

Abstract
Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni's estimator from robust statistics, and apply it to contextual bandits with general function approximation. When the variance of the reward at each round is known, we use a variance-weighted regression approach and establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range $R$ as well as the number of rounds $T$. For the unknown-variance case, we further propose a careful peeling-based algorithm and remove the need for cumbersome variance estimation. With additional dependence on the fourth moment, our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence. Moreover, we demonstrate the optimality of the leading-order term in our regret bound through a matching lower bound.
Spotlight Poster
John Hewitt · Robert Geirhos · Been Kim

[ East Exhibition Hall A-B ]

Abstract
This position paper argues that, in order to understand AI, we cannot rely on our existing vocabulary of human words. Instead, we shouldstrive to develop neologisms: new words thatrepresent precise human concepts that we wantto teach machines, or machine concepts that weneed to learn. We start from the premise thathumans and machines have differing concepts.This means interpretability can be framed as acommunication problem: humans must be able toreference and control machine concepts, and communicate human concepts to machines. Creatinga shared human-machine language through developing neologisms, we believe, could solve thiscommunication problem. Successful neologismsachieve a useful amount of abstraction: not toodetailed, so they’re reusable in many contexts, andnot too high-level, so they convey precise information. As a proof of concept, we demonstrate howa “length neologism” enables controlling LLMresponse length, while a “diversity neologism” allows sampling more variable responses. Takentogether, we argue that we cannot understand AIusing our existing vocabulary, and expanding itthrough neologisms creates opportunities for bothcontrolling and understanding machines better.
Spotlight Poster
Shentong Mo · Sukmin Yun

[ East Exhibition Hall A-B ]

Abstract
Generative models have made it possible to synthesize highly realistic images, potentially providing an abundant data source for training machine learning models. Despite the advantages of these synthesizable data sources, the indiscriminate use of generated images as real images for training can even cause mode collapse due to modality discrepancies between real and synthetic domains. In this paper, we propose a novel framework for discriminative use of generated images, coined \textit{GMAIL}, that explicitly treats generated images as a separate modality from real images. Instead of indiscriminately replacing real images with generated ones in the pixel space, our approach bridges the two distinct modalities in the same latent space through a multi-modal learning approach. To be specific, we first fine-tune a model exclusively on generated images using a cross-modality alignment loss and then employ this aligned model to further train various vision-language models with generated images. By aligning the two modalities, our approach effectively leverages the benefits of recent advances in generative models, thereby boosting the effectiveness of generated image learning across a range of vision-language tasks. Our framework can be easily incorporated with various vision-language models, and we demonstrate its efficacy throughout extensive experiments. For example, our framework significantly improves …
Spotlight Poster
Lena Stempfle · Anton Matsson · Newton Mwai · Fredrik Johansson

[ East Exhibition Hall A-B ]

Abstract
Handling missing values at test time is challenging for machine learning models, especially when aiming for both high accuracy and interpretability. Established approaches often add bias through imputation or excessive model complexity via missingness indicators. Moreover, either method can obscure interpretability, making it harder to understand how the model utilizes the observed variables in predictions. We propose *missingness-avoiding* (MA) machine learning, a general framework for training models to rarely require the values of missing (or imputed) features at test time. We create tailored MA learning algorithms for decision trees, tree ensembles, and sparse linear models by incorporating classifier-specific regularization terms in their learning objectives. The tree-based models leverage contextual missingness by reducing reliance on missing values based on the observed context. Experiments on real-world datasets demonstrate that **MA-DT, MA-LASSO, MA-RF**, and **MA-GBT** effectively reduce the reliance on features with missing values while maintaining predictive performance competitive with their unregularized counterparts. This shows that our framework gives practitioners a powerful tool to maintain interpretability in predictions with test-time missing values.
Spotlight Poster
Kevin Wei · Patricia Paskov · Sunishchal Dev · Michael Byun · Anka Reuel · Xavier Roberts-Gaal · Rachel Calcott · Evie Coxon · Chinmay Deshpande

[ East Exhibition Hall A-B ]

Abstract
**In this position paper, we argue that human baselines in foundation model evaluations must be more rigorous and more transparent to enable meaningful comparisons of human vs. AI performance, and we provide recommendations and a reporting checklist towards this end.** Human performance baselines are vital for the machine learning community, downstream users, and policymakers to interpret AI evaluations. Models are often claimed to achieve "super-human" performance, but existing baselining methods are neither sufficiently rigorous nor sufficiently well-documented to robustly measure and assess performance differences. Based on a meta-review of the measurement theory and AI evaluation literatures, we derive a framework with recommendations for designing, executing, and reporting human baselines. We synthesize our recommendations into a checklist that we use to systematically review 115 human baselines (studies) in foundation model evaluations and thus identify shortcomings in existing baselining methods; our checklist can also assist researchers in conducting human baselines and reporting results. We hope our work can advance more rigorous AI evaluation practices that can better serve both the research community and policymakers. Data is available at: [https://github.com/kevinlwei/human-baselines](https://github.com/kevinlwei/human-baselines).
Spotlight Poster
Chunhui Zhang · Zhongyu Ouyang · Kwonjoon Lee · Nakul Agarwal · Sean Houlihan · Soroush Vosoughi · Shao-Yuan Lo

[ East Exhibition Hall A-B ]

Abstract
Theory-of-mind (ToM) enables humans to infer mental states—such as beliefs, desires, and intentions—forming the foundation of social cognition. Existing computational ToM methods rely on structured workflows with ToM-specific priors or deep model fine-tuning but struggle with scalability in multimodal environments. They remain trapped within the gravitational pull of multi-step planning complexity, failing to generalize as task demands increase. To overcome these limitations, we propose a scalable Bayesian ToM planner. It breaks down ToM complexity into stepwise Bayesian updates. Meanwhile, weak-to-strong control specializes smaller LMs to refine ToM-specific likelihood estimation, transferring their ToM reasoning behavior to larger LMs (7B to 405B) for social and world knowledge integration. This synergistic approach enables scalability, aligning large-model inference with human mental states with Bayesian principles. Extensive experiments demonstrate a 4.6% improvement in accuracy over state-of-the-art methods on multimodal ToM benchmarks, including unseen scenarios, establishing a new standard for modeling human mental states in complex environments.
Spotlight Poster
Huanjian Zhou · Masashi Sugiyama

[ East Exhibition Hall A-B ]

Abstract
Sampling from high-dimensional probability distributions is fundamental in machine learning and statistics. As datasets grow larger, computational efficiency becomes increasingly important, particularly in reducing *adaptive complexity*, namely the number of sequential rounds required for sampling algorithms. While recent works have introduced several parallelizable techniques, they often exhibit suboptimal convergence rates and remain significantly weaker than the latest lower bounds for log-concave sampling.To address this, we propose a novel parallel sampling method that improves adaptive complexity dependence on dimension $d$ reducing it from $\widetilde{\mathcal{O}}(\log^2 d)$ to $\widetilde{\mathcal{O}}(\log d)$. Our approach builds on parallel simulation techniques from scientific computing.
Spotlight Poster
Paul McVay · Sergio Arnaud · Ada Martin · Arjun Majumdar · Krishna Murthy Jatavallabhula · Phillip Thomas · Ruslan Partsey · Daniel Dugas · Abha Gejji · Alexander Sax · Vincent-Pierre Berges · Mikael Henaff · Ayush Jain · Ang Cao · Ishita Prasad · Mrinal Kalakrishnan · Michael Rabbat · Nicolas Ballas · Mahmoud Assran · Oleksandr Maksymets · Aravind Rajeswaran · Franziska Meier

[ West Exhibition Hall B2-B3 ]

Abstract
We present LOCATE 3D, a model for localizing objects in 3D scenes from referring expressions like "the small coffee table between the sofa and the lamp." LOCATE 3D sets a new state-of-the-art on standard referential grounding benchmarks and showcases robust generalization capabilities. Notably, LOCATE 3D operates directly on sensor observation streams (posed RGB-D frames), enabling real-world deployment on robots and AR devices. Key to our approach is 3D-JEPA, a novel self-supervised learning (SSL) algorithm applicable to sensor point clouds. It takes as input a 3D pointcloud featurized using 2D foundation models (CLIP, DINO). Subsequently, masked prediction in latent space is employed as a pretext task to aid the self-supervised learning of contextualized pointcloud features. Once trained, the 3D-JEPA encoder is finetuned alongside a language-conditioned decoder to jointly predict 3D masks and bounding boxes. Additionally, we introduce LOCATE 3D DATASET, a new dataset for 3D referential grounding, spanning multiple capture setups with over 130K annotations. This enables a systematic study of generalization capabilities as well as a stronger model. Code, models and dataset can be found at the project website: locate3d.atmeta.com
Spotlight Poster
Zirui Liu · Jiatong Li · Yan Zhuang · Qi Liu · Shuanghong Shen · Jie Ouyang · Mingyue Cheng · Shijin Wang

[ East Exhibition Hall A-B ]

Abstract
Arena-based evaluation is a fundamental yet significant evaluation paradigm for modern AI models, especially large language models (LLMs). Existing framework based on ELO rating system suffers from the inevitable instability problem due to ranking inconsistency and the lack of attention to the varying abilities of annotators. In this paper, we introduce a novel stable arena framework to address these issues by enhancing the ELO Rating System. Specifically, we replace the iterative update method with a Maximum Likelihood Estimation (MLE) approach, m-ELO, and provide theoretical proof of the consistency and stability of the MLE approach for model ranking. Additionally, we proposed the am-ELO, which modify the Elo Rating’s probability function to incorporate annotator abilities, enabling the simultaneous estimation of model scores and annotator reliability. Experiments demonstrate that this method ensures stability, proving that this framework offers a more robust, accurate, and stable evaluation method for LLMs.
Spotlight Poster
Thomas Pethick · Wanyun Xie · Kimon Antonakopoulos · Zhenyu Zhu · Antonio Silveti-Falls · Volkan Cevher

[ East Exhibition Hall A-B ]

Abstract
In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision.
Spotlight Poster
Connor Schenck · Isaac Reid · Mithun Jacob · Alex Bewley · Joshua Ainslie · David Rendleman · Deepali Jain · Mohit Sharma · Kumar Avinava Dubey · Ayzaan Wahid · Sumeet Singh · René Wagner · Tianli Ding · Chuyuan Fu · Arunkumar Byravan · Jacob J Varley · Alexey Gritsenko · Matthias Minderer · Dmitry Kalashnikov · Jonathan Tompson · Vikas Sindhwani · Krzysztof Choromanski

[ East Exhibition Hall A-B ]

Abstract
We introduce $\textbf{STRING}$: Separable Translationally Invariant Position Encodings. STRING extends Rotary Position Encodings, a recently proposed and widely used algorithm in large language models, via a unifying theoretical framework. Importantly, STRING still provides $\textbf{exact}$ translation invariance, including token coordinates of arbitrary dimensionality, whilst maintaining a low computational footprint. These properties are especially important in robotics, where efficient 3D token representation is key. We integrate STRING into Vision Transformers with RGB(-D) inputs (color plus optional depth), showing substantial gains, e.g. in open-vocabulary object detection and for robotics controllers.We complement our experiments with a rigorous mathematical analysis, proving the universality of our methods. Videos of STRING-based robotics controllers can be found here: https://sites.google.com/view/string-robotics.
Spotlight Poster
Tejas Jayashankar · Jongha (Jon) Ryu · Gregory Wornell

[ East Exhibition Hall A-B ]

Abstract
We propose *Score-of-Mixture Training* (SMT), a novel framework for training one-step generative models by minimizing a class of divergences called the$\alpha$-skew Jensen–Shannon divergence. At its core, SMT estimates the score of mixture distributions between real and fake samples across multiple noise levels.Similar to consistency models, our approach supports both training from scratch (SMT) and distillation using a pretrained diffusion model, which we call *Score-of-Mixture Distillation* (SMD).It is simple to implement, requires minimal hyperparameter tuning, and ensures stable training. Experiments on CIFAR-10 and ImageNet 64×64 show that SMT/SMD are competitive with and can even outperform existing methods.
Spotlight Poster
Dongyeop Lee · Kwanhee Lee · Jinseok Chung · Namhoon Lee

[ East Exhibition Hall A-B ]

Abstract
Sparsifying neural networks often suffers from seemingly inevitable performance degradation, and it remains challenging to restore the original performance despite much recent progress.Motivated by recent studies in robust optimization, we aim to tackle this problem by finding subnetworks that are both sparse and flat at the same time.Specifically, we formulate pruning as a sparsity-constrained optimization problem where flatness is encouraged as an objective.We solve it explicitly via an augmented Lagrange dual approach and extend it further by proposing a generalized projection operation, resulting in novel pruning methods called SAFE and its extension, SAFE$^+$.Extensive evaluations on standard image classification and language modeling tasks reveal that SAFE consistently yields sparse networks with improved generalization performance, which compares competitively to well-established baselines.In addition, SAFE demonstrates resilience to noisy data, making it well-suited for real-world conditions.
Spotlight Poster
Fabiola Ricci · Lorenzo Bardone · Sebastian Goldt

[ West Exhibition Hall B2-B3 ]

Abstract
Deep neural networks learn structured features from complex, non-Gaussian inputs, but the mechanisms behind this process remain poorly understood. Our work is motivated by the observation that the first-layer filters learnt by deep convolutional neural networks from natural images resemble those learnt by independent component analysis (ICA), a simple unsupervised method that seeks the most non-Gaussian projections of its inputs. This similarity suggests that ICA provides a simple, yet principled model for studying feature learning. Here, we leverage this connection to investigate the interplay between data structure and optimisation in feature learning for the most popular ICA algorithm, FastICA, and stochastic gradient descent (SGD), which is used to train deep networks. We rigorously establish that FastICA requires at least $n\gtrsim d^4$ samples to recover a single non-Gaussian direction from $d$-dimensional inputs on a simple synthetic data model. We show that vanilla online SGD outperforms FastICA, and prove that the optimal sample complexity $n\gtrsim d^2$ can be reached by smoothing the loss, albeit in a data-dependent way. We finally demonstrate the existence of a search phase for FastICA on ImageNet, and discuss how the strong non-Gaussianity of said images compensates for the poor sample complexity of FastICA.
Spotlight Poster
Armin Behnamnia · Gholamali Aminian · Alireza Aghaei · Chengchun Shi · Vincent Tan · Hamid R Rabiee

[ West Exhibition Hall B2-B3 ]

Abstract
Off-policy learning and evaluation leverage logged bandit feedback datasets, which contain context, action, propensity score, and feedback for each data point. These scenarios face significant challenges due to high variance and poor performance with low-quality propensity scores and heavy-tailed reward distributions. We address these issues by introducing a novel estimator based on the log-sum-exponential (LSE) operator, which outperforms traditional inverse propensity score estimators. Our LSE estimator demonstrates variance reduction and robustness under heavy-tailed conditions. For off-policy evaluation, we derive upper bounds on the estimator's bias and variance. In the off-policy learning scenario, we establish bounds on the regret—the performance gap between our LSE estimator and the optimal policy—assuming bounded $(1+\epsilon)$-th moment of weighted reward. Notably, we achieve a convergence rate of $O(n^{-\epsilon/(1+\epsilon)})$ for the regret bounds, where $\epsilon\in[0,1]$ and $n$ is the size of logged bandit feedback dataset. Theoretical analysis is complemented by comprehensive empirical evaluations in both off-policy learning and evaluation scenarios, confirming the practical advantages of our approach. The code for our estimator is available at the following link: https://github.com/armin-behnamnia/lse-offpolicy-learning .
Spotlight Poster
Chen Zhang · Weixin Bu · Zeyi Ren · Zhengwu Liu · Yik-Chung WU · Ngai Wong

[ East Exhibition Hall A-B ]

Abstract
Inferring properties of graph-structured data, *e.g.*, the solubility of molecules, essentially involves learning the implicit mapping from graphs to their properties. This learning process is often costly for graph property learners like Graph Convolutional Networks (GCNs). To address this, we propose a paradigm called Graph Nonparametric Teaching (GraNT) that reinterprets the learning process through a novel nonparametric teaching perspective. Specifically, the latter offers a theoretical framework for teaching implicitly defined (*i.e.*, nonparametric) mappings via example selection. Such an implicit mapping is realized by a dense set of graph-property pairs, with the GraNT teacher selecting a subset of them to promote faster convergence in GCN training. By analytically examining the impact of graph structure on parameter-based gradient descent during training, and recasting the evolution of GCNs—shaped by parameter updates—through functional gradient descent in nonparametric teaching, we show *for the first time* that teaching graph property learners (*i.e.*, GCNs) is consistent with teaching structure-aware nonparametric learners. These new findings readily commit GraNT to enhancing learning efficiency of the graph property learner, showing significant reductions in training time for graph-level regression (-36.62\%), graph-level classification (-38.19\%), node-level regression (-30.97\%) and node-level classification (-47.30\%), all while maintaining its generalization performance.
Spotlight Poster
Bo-Han Lai · Pin-Han Huang · Bo-Han Kung · Shang-Tse Chen

[ East Exhibition Hall A-B ]

Abstract
Lipschitz neural networks are well-known for providing certified robustness in deep learning. In this paper, we present a novel, efficient Block Reflector Orthogonal (BRO) layer that enhances the capability of orthogonal layers on constructing more expressive Lipschitz neural architectures. In addition, by theoretically analyzing the nature of Lipschitz neural networks, we introduce a new loss function that employs an annealing mechanism to increase margin for most data points. This enables Lipschitz models to provide better certified robustness. By employing our BRO layer and loss function, we design BRONet — a simple yet effective Lipschitz neural network that achieves state-of-the-art certified robustness. Extensive experiments and empirical analysis on CIFAR-10/100, Tiny-ImageNet, and ImageNet validate that our method outperforms existing baselines. The implementation is available at [GitHub Link](https://github.com/ntuaislab/BRONet).
Spotlight Poster
Artem Moskalev · Mangal Prakash · Junjie Xu · Tianyu Cui · Rui Liao · Tommaso Mansi

[ East Exhibition Hall A-B ]

Abstract
Processing global geometric context while preserving equivariance is crucial when modeling biological, chemical, and physical systems. Yet, this is challenging due to the computational demands of equivariance and global context at scale. Standard methods such as equivariant self-attention suffer from quadratic complexity, while local methods such as distance-based message passing sacrifice global information. Inspired by the recent success of state-space and long-convolutional models, we introduce Geometric Hyena, the first equivariant long-convolutional model for geometric systems. Geometric Hyena captures global geometric context at sub-quadratic complexity while maintaining equivariance to rotations and translations. Evaluated on all-atom property prediction of large RNA molecules and full protein molecular dynamics, Geometric Hyena outperforms existing equivariant models while requiring significantly less memory and compute that equivariant self-attention. Notably, our model processes the geometric context of $30k$ tokens $20 \times$ faster than the equivariant transformer and allows $72 \times$ longer context within the same budget.
Spotlight Poster
Junyi Lu · Lili Jiang · Xiaojia Li · Jianbing Fang · Fengjun Zhang · Li Yang · Chun Zuo

[ West Exhibition Hall B2-B3 ]

Abstract
The complexity of code reviews has driven efforts to automate review comments, but prior approaches oversimplify this task by treating it as snippet-level code-to-text generation and relying on text similarity metrics like BLEU for evaluation. These methods overlook repository context, real-world merge request evaluation, and defect detection, limiting their practicality. To address these issues, we explore the full automation pipeline within the online recommendation service of a company with nearly 400 million daily active users, analyzing industry-grade C++ codebases comprising hundreds of thousands of lines of code. We identify four key challenges: 1) capturing relevant context, 2) improving key bug inclusion (KBI), 3) reducing false alarm rates (FAR), and 4) integrating human workflows. To tackle these, we propose 1) code slicing algorithms for context extraction, 2) a multi-role LLM framework for KBI, 3) a filtering mechanism for FAR reduction, and 4) a novel prompt design for better human interaction. Our approach, validated on real-world merge requests from historical fault reports, achieves a 2× improvement over standard LLMs and a 10× gain over previous baselines. While the presented results focus on C++, the underlying framework design leverages language-agnostic principles (e.g., AST-based analysis), suggesting potential for broader applicability.
Spotlight Poster
Shayne Longpre · Kevin Klyman · Ruth Elisabeth Appel · Sayash Kapoor · Rishi Bommasani · Michelle Sahar · Sean McGregor · Avijit Ghosh · Borhane Blili-Hamelin · Nathan Butters · Alondra Nelson · Amit Elazari · Andrew Sellars · Casey Ellis · Dane Sherrets · Dawn Song · Harley Geiger · Ilona Cohen · Lauren McIlvenny · Madhulika Srikumar · Mark Jaycox · Markus Anderljung · Nadine Johnson · Nicholas Carlini · Nicolas Miailhe · Nik Marda · Peter Henderson · Rebecca Portnoff · Rebecca Weiss · Victoria Westerhoff · Yacine Jernite · Rumman Chowdhury · Percy Liang · Arvind Narayanan

[ East Exhibition Hall A-B ]

Abstract
The widespread deployment of general-purpose AI (GPAI) systems introduces significant new risks. Yet the infrastructure, practices, and norms for reporting flaws in GPAI systems remain seriously underdeveloped, lagging far behind more established fields like software security. Based on a collaboration between experts from the fields of software security, machine learning, law, social science, and policy, we identify key gaps in the evaluation and reporting of flaws in GPAI systems. We call for three interventions to advance system safety. First, we propose using standardized AI flaw reports and rules of engagement for researchers in order to ease the process of submitting, reproducing, and triaging flaws in GPAI systems. Second, we propose GPAI system providers adopt broadly-scoped flaw disclosure programs, borrowing from bug bounties, with legal safe harbors to protect researchers. Third, we advocate for the development of improved infrastructure to coordinate distribution of flaw reports across the many stakeholders who may be impacted. These interventions are increasingly urgent, as evidenced by the prevalence of jailbreaks and other flaws that can transfer across different providers' GPAI systems. By promoting robust reporting and coordination in the AI ecosystem, these proposals could significantly improve the safety, security, and accountability of GPAI systems.
Spotlight Poster
Yu He · Ellen Vitercik

[ East Exhibition Hall A-B ]

Abstract
Neural Algorithmic Reasoning (NAR) trains neural networks to simulate classical algorithms, enabling structured and interpretable reasoning over complex data. While prior research has predominantly focused on learning exact algorithms for polynomial-time-solvable problems, extending NAR to harder problems remains an open challenge. In this work, we introduce a general NAR framework grounded in the primal-dual paradigm, a classical method for designing efficient approximation algorithms. By leveraging a bipartite representation between primal and dual variables, we establish an alignment between primal-dual algorithms and Graph Neural Networks. Furthermore, we incorporate optimal solutions from small instances to greatly enhance the model’s reasoning capabilities. Our empirical results demonstrate that our model not only simulates but also outperforms approximation algorithms for multiple tasks, exhibiting robust generalization to larger and out-of-distribution graphs. Moreover, we highlight the framework’s practical utility by integrating it with commercial solvers and applying it to real-world datasets.
Spotlight Poster
Jintao Tong · Ran Ma · Yixiong Zou · Guangyao Chen · Yuhua Li · Ruixuan Li

[ East Exhibition Hall A-B ]

Abstract
Cross-domain few-shot segmentation (CD-FSS) is proposed to first pre-train the model on a source-domain dataset with sufficient samples, and then transfer the model to target-domain datasets where only a few training samples are available for efficient finetuning. There are majorly two challenges in this task: (1) the domain gap and (2) finetuning with scarce data. To solve these challenges, we revisit the adapter-based methods, and discover an intriguing insight not explored in previous works: the adapter not only helps the fine-tuning of downstream tasks but also naturally serves as a domain information decoupler. Then, we delve into this finding for an interpretation, and we find the model's inherent structure could lead to a natural decoupling of domain information. Building upon this insight, we propose the Domain Feature Navigator (DFN), which is a structure-based decoupler instead of loss-based ones like current works, to capture domain-specific information, thereby directing the model's attention towards domain-agnostic knowledge. Moreover, to prevent the potential excessive overfitting of DFN during the source-domain training, we further design the SAM-SVN method to constrain DFN from learning sample-specific knowledge. On target domains, we freeze the model and fine-tune the DFN to learn knowledge specific to target domains. Extensive experiments demonstrate …
Spotlight Poster
Andy Zhang · Kevin Klyman · Yifan Mai · Yoav Levine · Yian Zhang · Rishi Bommasani · Percy Liang

[ East Exhibition Hall A-B ]

Abstract
Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap, which refers to the extent to which the language model is trained on the very data it is being tested on. The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot directly measure train-test overlap since they do not have access to the training data. To make this clear, we document the practices of 30 models, finding that just 9 models report train-test overlap: 4 models release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 models publish their train-test overlap methodology and statistics. By engaging with language model developers, we provide novel information about train-test overlap for three additional models. Overall, this position paper argues that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets. We hope our work increases transparency into train-test overlap to increase the community-wide trust in model evaluations.
Spotlight Poster
Thomas Pouplin · Katarzyna Kobalczyk · Hao Sun · Mihaela van der Schaar

[ East Exhibition Hall A-B ]

Abstract
Developing autonomous agents capable of performing complex, multi-step decision-making tasks specified in natural language remains a significant challenge, particularly in realistic settings where labeled data is scarce and real-time experimentation is impractical. Existing reinforcement learning (RL) approaches often struggle to generalize to unseen goals and states, limiting their applicability. In this paper, we introduce $\textit{TEDUO}$, a novel training pipeline for offline language-conditioned policy learning in symbolic environments. Unlike conventional methods, $\textit{TEDUO}$ operates on readily available, unlabeled datasets and addresses the challenge of generalization to previously unseen goals and states. Our approach harnesses large language models (LLMs) in a dual capacity: first, as automatization tools augmenting offline datasets with richer annotations, and second, as generalizable instruction-following agents. Empirical results demonstrate that $\textit{TEDUO}$ achieves data-efficient learning of robust language-conditioned policies, accomplishing tasks beyond the reach of conventional RL frameworks or out-of-the-box LLMs alone.
Spotlight Poster
Zhengxuan Wu · Aryaman Arora · Atticus Geiger · Zheng Wang · Jing Huang · Dan Jurafsky · Christopher Manning · Christopher Potts

[ East Exhibition Hall A-B ]

Abstract
Fine-grained steering of language model outputs is essential for safety and reliability. Prompting and finetuning are widely used to achieve these goals, but interpretability researchers have proposed a variety of representation-based techniques as well, including sparse autoencoders (SAEs), linear artificial tomography, supervised steering vectors, linear probes, and representation finetuning. At present, there is no benchmark for making direct comparisons between these proposals. Therefore, we introduce AxBench, a large-scale benchmark for steering and concept detection, and report experiments on Gemma-2-2B and 9B. For steering, we find that prompting outperforms all existing methods, followed by finetuning. For concept detection, representation-based methods such as difference-in-means, perform the best. On both evaluations, SAEs are not competitive. We introduce a novel weakly-supervised representational method (Rank-1 Representation Finetuning; ReFT-r1), which is competitive on both tasks while providing the interpretability advantages that prompting lacks. Along with AxBench, we train and publicly release SAE-scale feature dictionaries for ReFT-r1 and DiffMean.
Spotlight Poster
Yuan Deng · Amin Karbasi · Vahab Mirrokni · Renato Leme · Grigorios Velegkas · Song Zuo

[ West Exhibition Hall B2-B3 ]

Abstract
We study the problem of procurement auctions, in which an auctioneer seeks to acquire services from a group of strategic sellers with private costs. The quality of the services is measured through some submodular function that is known to the auctioneer. Our goal is to design computationally efficient procurement auctions that (approximately) maximize the difference between the quality of the acquired services and the total cost of the sellers, in a way that is incentive compatible (IC) and individual rational (IR) for the sellers, and generates non-negative surplus (NAS) for the auctioneer. {Our contribution is twofold: \textbf{i)} we provide an improved analysis of existing algorithms for non-positive submodular function maximization and \textbf{ii)} we design computationally efficient frameworks that transform submodular function optimization algorithms to mechanisms that are IC and IR for the sellers, NAS for the auctioneer, and approximation-preserving.} Our frameworks are general and work both in the offline setting where the auctioneer can observe the bids and the services of all the sellers simultaneously, and in the online setting where the sellers arrive in an adversarial order and the auctioneer has to make an irrevocable decision whether to purchase their service or not. We further investigate whether it is …
Spotlight Poster
Moshe Eliasof · Alessio Gravina · Andrea Ceni · Claudio Gallicchio · Davide Bacciu · Carola-Bibiane Schönlieb

[ East Exhibition Hall A-B ]

Abstract
Graph State Space Models (SSMs) have recently been introduced to enhance Graph Neural Networks (GNNs) in modeling long-range interactions. Despite their success, existing methods either compromise on permutation equivariance or limit their focus to pairwise interactions rather than sequences. Building on the connection between Autoregressive Moving Average (ARMA) and SSM, in this paper, we introduce GRAMA, a Graph Adaptive method based on a learnable ARMA framework that addresses these limitations. By transforming from static to sequential graph data, GRAMA leverages the strengths of the ARMA framework, while preserving permutation equivariance. Moreover, GRAMA incorporates a selective attention mechanism for dynamic learning of ARMA coefficients, enabling efficient and flexible long-range information propagation. We also establish theoretical connections between GRAMA and Selective SSMs, providing insights into its ability to capture long-range dependencies. Experiments on 26 synthetic and real-world datasets demonstrate that GRAMA consistently outperforms backbone models and performs competitively with state-of-the-art methods.
Spotlight Poster
Xujie Song · Liangfa Chen · Tong Liu · Wenxuan Wang · Yinuo Wang · Shentao Qin · Yinsong Ma · Jingliang Duan · Shengbo Li

[ West Exhibition Hall B2-B3 ]

Abstract
Deep reinforcement learning (RL) is effective for decision-making and control tasks like autonomous driving and embodied AI. However, RL policies often suffer from the action fluctuation problem in real-world applications, resulting in severe actuator wear, safety risk, and performance degradation. This paper identifies the two fundamental causes of action fluctuation: observation noise and policy non-smoothness. We propose LipsNet++, a novel policy network with Fourier filter layer and Lipschitz controller layer to separately address both causes. The filter layer incorporates a trainable filter matrix that automatically extracts important frequencies while suppressing noise frequencies in the observations. The controller layer introduces a Jacobian regularization technique to achieve a low Lipschitz constant, ensuring smooth fitting of a policy function. These two layers function analogously to the filter and controller in classical control theory, suggesting that filtering and control capabilities can be seamlessly integrated into a single policy network. Both simulated and real-world experiments demonstrate that LipsNet++ achieves the state-of-the-art noise robustness and action smoothness. The code and videos are publicly available at https://xjsong99.github.io/LipsNet_v2.
Spotlight Poster
Mark Schoene · Babak Rahmani · Heiner Kremer · Fabian Falck · Hitesh Ballani · Jannes Gladrow

[ East Exhibition Hall A-B ]

Abstract
State-space models (SSMs) and transformers dominate the language modeling landscape. However, they are constrained to a lower computational complexity than classical recurrent neural networks (RNNs), limiting their expressivity. In contrast, RNNs lack parallelization during training, raising fundamental questions about the trade off between parallelization and expressivity. We propose implicit SSMs, which iterate a transformation until convergence to a fixed point. Theoretically, we show that implicit SSMs implement the non-linear state-transitions of RNNs. Empirically, we find that only approximate fixed-point convergence suffices, enabling the design of a scalable training curriculum that largely retains parallelization, with full convergence required only for a small subset of tokens. Our approach demonstrates superior state-tracking capabilities on regular languages, surpassing transformers and SSMs. We further scale implicit SSMs to natural language reasoning tasks and pretraining of large-scale language models up to 1.3B parameters on 207B tokens - representing, to our knowledge, the largest implicit model trained to date. Notably, our implicit models outperform their explicit counterparts on standard benchmarks. Our code is publicly available at github.com/microsoft/implicit_languagemodels
Spotlight Poster
Yupeng Hou · Jianmo Ni · Zhankui He · Noveen Sachdeva · Wang-Cheng Kang · Ed Chi · Julian McAuley · Derek Cheng

[ East Exhibition Hall A-B ]

Abstract
Generative recommendation (GR) is an emerging paradigm where user actions are tokenized into discrete token patterns and autoregressively generated as predictions. However, existing GR models tokenize each action independently, assigning the same fixed tokens to identical actions across all sequences without considering contextual relationships. This lack of context-awareness can lead to suboptimal performance, as the same action may hold different meanings depending on its surrounding context. To address this issue, we propose ActionPiece to explicitly incorporate context when tokenizing action sequences. In ActionPiece, each action is represented as a *set* of item features. Given the action sequence corpora, we construct the vocabulary by merging feature patterns as new tokens, based on their co-occurrence frequency both within individual sets and across adjacent sets. Considering the unordered nature of feature sets, we further introduce set permutation regularization, which produces multiple segmentations of action sequences with the same semantics. Our code is available at: https://github.com/google-deepmind/action_piece.
Spotlight Poster
Georgy Noarov · Riccardo Fogliato · Martin A Bertran · Aaron Roth

[ East Exhibition Hall A-B ]

Abstract
We study the design of adaptive, sequential experiments for unbiased average treatment effect (ATE) estimation in the design-based potential outcomes setting. Our goal is to develop adaptive designs offering *sublinear Neyman regret*, meaning their efficiency must approach that of the hindsight-optimal nonadaptive design.Recent work [Dai et al, 2023] introduced ClipOGD, the first method achieving $\widetilde{O}(\sqrt{T})$ expected Neyman regret under mild conditions. In this work, we propose adaptive designs with substantially stronger Neyman regret guarantees. In particular, we modify ClipOGD to obtain anytime $\widetilde{O}(\log T)$ Neyman regret under natural boundedness assumptions. Further, in the setting where experimental units have pre-treatment covariates, we introduce and study a class of contextual ``multigroup'' Neyman regret guarantees: Given a set of possibly overlapping groups based on the covariates, the adaptive design outperforms each group's best non-adaptive designs. In particular, we develop a contextual adaptive design with $\widetilde{O}(\sqrt{T})$ anytime multigroup Neyman regret. We empirically validate the proposed designs through an array of experiments.
Spotlight Poster
Ilias Diakonikolas · Daniel Kane · Sushrut Karmalkar · Jasper Lee · Thanasis Pittas

[ West Exhibition Hall B2-B3 ]

Abstract
We study the complexity of learning $k$-mixtures of Gaussians ($k$-GMMs) on $\mathbb R^d$. This task is known to have complexity $d^{\Omega(k)}$ in full generality. To circumvent this exponential lower bound on the number of components, research has focused on learning families of GMMs satisfying additional structural properties. A natural assumption posits that the component weights are not exponentially small and that the components have the same unknown covariance. Recent work gave a $d^{O(\log(1/w_{\min}))}$-time algorithm for this class of GMMs, where $w_{\min}$ is the minimum weight. Our first main result is a Statistical Query (SQ) lower bound showing that this quasi-polynomial upper bound is essentially best possible, even for the special case of uniform weights. Specifically, we show that it is SQ-hard to distinguish between such a mixture and the standard Gaussian. We further explore how the distribution of weights affects the complexity of this task. Our second main result is a quasi-polynomial upper bound for the aforementioned testing task when most of the weights are uniform while a small fraction of the weights are potentially arbitrary.
Spotlight Poster
Seungwook Han · Jinyeop Song · Jeff Gore · Pulkit Agrawal

[ East Exhibition Hall A-B ]

Abstract
Autoregressive transformers exhibit adaptive learning through in-context learning (ICL), which begs the question of how. Prior works have shown that transformers represent the ICL tasks as vectors in their representations. In this paper, we leverage the encoding-decoding framework to study how transformers form task vectors during pretraining and how their task encoding quality predicts ICL task performance. On synthetic ICL tasks, we analyze the training dynamics of a small transformer and report the coupled emergence of task encoding and decoding. As the model learns to encode different latent tasks (e.g., "Finding the first noun in a sentence.") into distinct, separable representations, it concurrently builds conditional decoding algorithms and improves its ICL performance. We validate this phenomenon across pretrained models of varying scales (Gemma-2 2B/9B/27B, Llama-3.1 8B/70B) and over the course of pretraining in OLMo-7B. Further, we demonstrate that the quality of task encoding inferred from representations predicts ICL performance, and that, surprisingly, finetuning the earlier layers can improve the task encoding and performance more than finetuning the latter layers. Our empirical insights shed light into better understanding the success and failure modes of large language models via their representations.
Spotlight Poster
Rina Bao · Shilong Dong · Zhenfang Chen · Sheng He · Patricia Ellen Grant · Yangming Ou

[ West Exhibition Hall B2-B3 ]

Abstract
Medical Visual Question Answering (MVQA) requires AI models to answer questions related to medical images, offering significant potential to assist medical professionals in evaluating and diagnosing diseases, thereby improving early interventions. However, existing MVQA datasets primarily focus on basic questions regarding visual perception and pattern recognition, without addressing the more complex questions that are critical in clinical diagnosis and decision-making. This paper introduces a new benchmark designed for professional-level medical reasoning, simulating the decision-making process. We achieve this by collecting MRI and clinical data related to Hypoxic-Ischemic Encephalopathy, enriched with expert annotations and insights. Building on this data, we generate clinical question-answer pairs and MRI interpretations to enable comprehensive diagnosis, interpretation, and prediction of neurocognitive outcomes. Our evaluation of current large vision-language models (LVLMs) shows limited performance on this benchmark, highlighting both the challenges of the task and the importance of this benchmark for advancing medical AI. Furthermore, we propose a novel ``Clinical Graph of Thoughts" model, which integrates domain-specific medical knowledge and clinical reasoning processes with the interpretive abilities of LVLMs. The model demonstrates promising results, achieving around 15\% absolute gain on the most important neurocognitive outcome task, while the benchmark still reveals substantial opportunities for further research innovation.
Spotlight Poster
Yi Xu · Laura Ruis · Tim Rocktäschel · Robert Kirk

[ East Exhibition Hall A-B ]

Abstract
Automatic evaluation methods based on large language models (LLMs) are emerging as the standard tool for assessing the instruction-following abilities of LLM-based agents. The most common method in this paradigm, pairwise comparisons with a baseline model, critically depends on the assumption of transitive preferences. However, the validity of this assumption remains largely unexplored. In this study, we investigate the presence of non-transitivity within the AlpacaEval framework and analyze its effects on model rankings. We find that LLM judges exhibit non-transitive preferences, leading to rankings that are sensitive to the choice of the baseline model. To mitigate this issue, we show that round-robin tournaments combined with Bradley-Terry models of preference can produce more reliable rankings. Notably, our method increases both the Spearman correlation and the Kendall correlation with Chatbot Arena (95.0\% $\rightarrow$ 96.4\% and 82.1\% $\rightarrow$ 86.3\% respectively). To address the computational cost of round-robin tournaments, we propose Swiss-Wise Iterative Matchmaking (Swim) tournaments, using a dynamic matching strategy to capture the benefits of round-robin tournaments while maintaining computational efficiency.
Spotlight Poster
Jaesik Yoon · Hyeonseo Cho · Doojin Baek · Yoshua Bengio · Sungjin Ahn

[ West Exhibition Hall B2-B3 ]

Abstract
Diffusion models have recently emerged as a powerful tool for planning. However, unlike Monte Carlo Tree Search (MCTS)—whose performance naturally improves with inference-time computation scaling—standard diffusion‐based planners offer only limited avenues for the scalability. In this paper, we introduce Monte Carlo Tree Diffusion (MCTD), a novel framework that integrates the generative strength of diffusion models with the adaptive search capabilities of MCTS. Our method reconceptualizes denoising as a tree‐structured process, allowing partially denoised plans to be iteratively evaluated, pruned, and refined. By selectively expanding promising trajectories while retaining the flexibility to revisit and improve suboptimal branches, MCTD achieves the benefits of MCTS such as controlling exploration-exploitation trade-offs within the diffusion framework. Empirical results on challenging long‐horizon tasks show that MCTD outperforms diffusion baselines, yielding higher‐quality solutions as inference-time computation increases.
Spotlight Poster
Abdulkadir Gokce · Martin Schrimpf

[ West Exhibition Hall B2-B3 ]

Abstract
When trained on large-scale object classification datasets, certain artificial neural network models begin to approximate core object recognition behaviors and neural response patterns in the primate brain. While recent machine learning advances suggest that scaling compute, model size, and dataset size improves task performance, the impact of scaling on brain alignment remains unclear. In this study, we explore scaling laws for modeling the primate visual ventral stream by systematically evaluating over 600 models trained under controlled conditions on benchmarks spanning V1, V2, V4, IT and behavior. We find that while behavioral alignment continues to scale with larger models, neural alignment saturates. This observation remains true across model architectures and training datasets, even though models with stronger inductive biases and datasets with higher-quality images are more compute-efficient. Increased scaling is especially beneficial for higher-level visual areas, where small models trained on few samples exhibit only poor alignment. Our results suggest that while scaling current architectures and datasets might suffice for alignment with human core object recognition behavior, it will not yield improved models of the brain's visual ventral stream, highlighting the need for novel strategies in building brain models.
Spotlight Poster
Weiwei Ye · Zhuopeng Xu · Ning Gui

[ East Exhibition Hall A-B ]

Abstract
Due to the dynamics of underlying physics and external influences, the uncertainty of time series often varies over time. However, existing Denoising Diffusion Probabilistic Models (DDPMs) often fail to capture this non-stationary nature, constrained by their constant variance assumption from the additive noise model (ANM). In this paper, we innovatively utilize the Location-Scale Noise Model (LSNM) to relax the fixed uncertainty assumption of ANM. A diffusion-based probabilistic forecasting framework, termed Non-stationary Diffusion (NsDiff), is designed based on LSNM that is capable of modeling the changing pattern of uncertainty. Specifically, NsDiff combines a denoising diffusion-based conditional generative model with a pre-trained conditional mean and variance estimator, enabling adaptive endpoint distribution modeling. Furthermore, we propose an uncertainty-aware noise schedule, which dynamically adjusts the noise levels to accurately reflect the data uncertainty at each step and integrates the time-varying variances into the diffusion process. Extensive experiments conducted on nine real-world and synthetic datasets demonstrate the superior performance of NsDiff compared to existing approaches. Code is available at https://github.com/wwy155/NsDiff.
Spotlight Poster
John Stewart Fabila-Carrasco · He Sun

[ East Exhibition Hall A-B ]

Abstract
Given two weighted graphs $G = (V, E, w_G)$ and $H = (V, F, w_H)$ defined on the same vertex set, the constrained clustering problem seeks to find a subset $S \subset V$ that minimises the cut ratio between $w_G(S, V \setminus S)$ and $w_H(S, V \setminus S)$. In this work, we establish a Cheeger-type inequality that relates the solution of the constrained clustering problem to the spectral properties of $ G$ and $H$. To reduce computational complexity, we utilise the signed Laplacian of $H$, streamlining calculations while maintaining accuracy. By solving a generalised eigenvalue problem, our proposed algorithm achieves notable performance improvements, particularly in challenging scenarios where traditional spectral clustering methods struggle. We demonstrate its practical effectiveness through experiments on both synthetic and real-world datasets.
Spotlight Poster
Haoyang Li · Xin Wang · Zeyang Zhang · Zongyuan Wu · Linxin Xiao · Wenwu Zhu

[ East Exhibition Hall A-B ]

Abstract
Self-supervised learning (SSL) on graph-structured data has attracted considerable attention recently. Masked graph autoencoder, as one promising generative graph SSL approach that aims to recover masked parts of the input graph data, has shown great success in various downstream graph tasks. However, existing masked graph autoencoders fail to consider the degree of difficulty of recovering the masked edges that often have different impacts on the model performance, resulting in suboptimal node representations. To tackle this challenge, in this paper, we propose a novel curriculum based self-supervised masked graph autoencoder that is able to capture and leverage the underlying degree of difficulty of data dependencies hidden in edges, and design better mask-reconstruction pretext tasks for learning informative node representations. Specifically, we first design a difficulty measurer to identify the underlying structural degree of difficulty of edges during the masking step. Then, we adopt a self-paced scheduler to determine the order of masking edges, which encourages the graph encoder to learn from easy to difficult parts. Finally, the masked edges are gradually incorporated into the reconstruction pretext task, leading to high-quality node representations. Experiments on several real-world node classification and link prediction datasets demonstrate the superiority of our proposed method over state-of-the-art …
Spotlight Poster
Xinyan Liang · Ruijie Sang · Yuhua Qian · Qian Guo · Feijiang Li · Liang Du

[ East Exhibition Hall A-B ]

Abstract
Automatic Modulation Classification (AMC) serves as a foundational pillar for cognitive radio systems, enabling critical functionalities including dynamic spectrum allocation, non-cooperative signal surveillance, and adaptive waveform optimization. However, practical deployment of AMC faces a fundamental challenge: prediction ambiguity arising from intrinsic similarity among modulation schemes and exacerbated under low signal-to-noise ratio (SNR) conditions. This phenomenon manifests as near-identical probability distributions across confusable modulation types, significantly degrading classification reliability. To address this, we propose Fuzzy Regularization-enhanced AMC (FR-AMC), a novel framework that integrates uncertainty quantification into the classification pipeline. The proposed FR has three features: (1) Explicitly model prediction ambiguity during backpropagation, (2) dynamic sample reweighting through adaptive loss scaling, (3) encourage margin maximization between confusable modulation clusters. Experimental results on benchmark datasets demonstrate that the FR achieves superior classification accuracy and robustness compared to compared methods, making it a promising solution for real-world spectrum management and communication applications.
Spotlight Poster
Florian Peter Busch · Roshni Ramanna Kamath · Rupert Mitchell · Wolfgang Stammer · Kristian Kersting · Martin Mundt

[ East Exhibition Hall A-B ]

Abstract
A dataset is confounded if it is most easily solved via a spurious correlation which fails to generalize to new data. In this work, we show that, in a continual learning setting where confounders may vary in time across tasks, the challenge of mitigating the effect of confounders far exceeds the standard forgetting problem normally considered. In particular, we provide a formal description of such continual confounders and identify that, in general, spurious correlations are easily ignored when training for all tasks jointly, but it is harder to avoid confounding when they are considered sequentially. These descriptions serve as a basis for constructing a novel CLEVR-based continually confounded dataset, which we term the ConCon dataset. Our evaluations demonstrate that standard continual learning methods fail to ignore the dataset's confounders. Overall, our work highlights the challenges of confounding factors, particularly in continual learning settings, and demonstrates the need for developing continual learning methods to robustly tackle these.
Spotlight Poster
Quang-Duy Tran · Bao Duong · Phuoc Nguyen · Thin Nguyen

[ East Exhibition Hall A-B ]

Abstract
Telling apart the cause and effect between two random variables with purely observational data is a challenging problem that finds applications in various scientific disciplines. A key principle utilized in this task is the algorithmic Markov condition, which postulates that the joint distribution, when factorized according to the causal direction, yields a more succinct codelength compared to the anti-causal direction. Previous approaches approximate these codelengths by relying on simple functions or Gaussian processes (GPs) with easily evaluable complexity, compromising between model fitness and computational complexity. To overcome these limitations, we propose leveraging the variational Bayesian learning of neural networks as an interpretation of the codelengths. Consequently, we can enhance the model fitness while promoting the succinctness of the codelengths, while avoiding the significant computational complexity of the GP-based approaches. Extensive experiments on both synthetic and real-world benchmarks in cause-effect identification demonstrate the effectiveness of our proposed method, surpassing the overall performance of related complexity-based and structural causal model regression-based approaches.
Spotlight Poster
Yujin Oh · Pengfei Jin · Sangjoon Park · Sekeun Kim · Siyeop yoon · Jin Kim · Kyungsang Kim · Xiang Li · Quanzheng Li

[ West Exhibition Hall B2-B3 ]

Abstract
Ensuring fairness in medical image segmentation is critical due to biases in imbalanced clinical data acquisition caused by demographic attributes (e.g., age, sex, race) and clinical factors (e.g., disease severity). To address these challenges, we introduce Distribution-aware Mixture of Experts (dMoE), inspired by optimal control theory. We provide a comprehensive analysis of its underlying mechanisms and clarify dMoE's role in adapting to heterogeneous distributions in medical image segmentation. Furthermore, we integrate dMoE into multiple network architectures, demonstrating its broad applicability across diverse medical image analysis tasks. By incorporating demographic and clinical factors, dMoE achieves state-of-the-art performance on two 2D benchmark datasets and a 3D in-house dataset. Our results highlight the effectiveness of dMoE in mitigating biases from imbalanced distributions, offering a promising approach to bridging control theory and medical image segmentation within fairness learning paradigms. The source code is available at https://github.com/tvseg/dMoE.
Spotlight Poster
Zian Li · Cai Zhou · Xiyuan Wang · Xingang Peng · Muhan Zhang

[ East Exhibition Hall A-B ]

Abstract
Recent advances in molecular generative models have demonstrated great promise for accelerating scientific discovery, particularly in drug design. However, these models often struggle to generate high-quality molecules, especially in conditional scenarios where specific molecular properties must be satisfied. In this work, we introduce GeoRCG, a general framework to improve molecular generative models by integrating geometric representation conditions with provable theoretical guarantees. We decompose the generation process into two stages: first, generating an informative geometric representation; second, generating a molecule conditioned on the representation. Compared with single-stage generation, the easy-to-generate representation in the first stage guides the second stage generation toward a high-quality molecule in a goal-oriented way. Leveraging EDM and SemlaFlow as base generators, we observe significant quality improvements in unconditional molecule generation on the widely used QM9 and GEOM-DRUG datasets. More notably, in the challenging conditional molecular generation task, our framework achieves an average 50\% performance improvement over state-of-the-art approaches, highlighting the superiority of conditioning on semantically rich geometric representations. Furthermore, with such representation guidance, the number of diffusion steps can be reduced to as small as 100 while largely preserving the generation quality achieved with 1,000 steps, thereby significantly reducing the generation iterations needed.
Spotlight Poster
Wei Qu · Jiawei Guan · Rui Ma · kezhai · weikun wu · haobo Wang

[ West Exhibition Hall B2-B3 ]

Abstract
We introduce Pallatom, an innovative protein generation model capable of producing protein structures with all-atom coordinates. Pallatom directly learns and models the joint distribution $P(\textit{structure}, \textit{seq})$ by focusing on $P(\textit{all-atom})$, effectively addressing the interdependence between sequence and structure in protein generation. To achieve this, we propose a novel network architecture specifically designed for all-atom protein generation. Our model employs a dual-track framework that tokenizes proteins into token-level and atomic-level representations, integrating them through a multi-layer decoding process with "traversing" representations and recycling mechanism. We also introduce the $\texttt{atom14}$ representation method, which unifies the description of unknown side-chain coordinates, ensuring high fidelity between the generated all-atom conformation and its physical structure. Experimental results demonstrate that Pallatom excels in key metrics of protein design, including designability, diversity, and novelty, showing significant improvements across the board. Our model not only enhances the accuracy of protein generation but also exhibits excellent sampling efficiency, paving the way for future applications in larger and more complex systems.
Spotlight Poster
Nikolaus Howe · Ian McKenzie · Oskar Hollinsworth · Michał Zając · Tom Tseng · Aaron Tucker · Pierre-Luc Bacon · Adam Gleave

[ East Exhibition Hall A-B ]

Abstract
Increasing model size has unlocked a dazzling array of capabilities in language models.At the same time, even frontier models remain vulnerable to jailbreaks and prompt injections, despite concerted efforts to make them robust.As both attackers and defenders gain access to more compute, and as models become larger, what will be the effect on robustness?We argue that to answer this question requires a *scaling lens*, which we adopt in an extensive study of language model robustness across several classification tasks, model families, and adversarial attacks.We find that in the absence of explicit safety training, larger models are not consistently more robust; however, scale improves sample efficiency in adversarial training, though it worsens compute efficiency.Further, we find that increasing attack compute smoothly improves attack success rate against both undefended and adversarially trained models.Finally, after exploring robustness transfer across attacks and threat models, we combine attack and defense scaling rates to study the offense-defense balance.We find that while attack scaling outpaces adversarial training across all models studied, larger adversarially trained models might give defense the advantage in the long run.These results underscore the utility of the scaling lens, and provide a paradigm for evaluating future attacks and defenses on frontier models.Code for this …
Spotlight Poster
Liyuan Liang · Xinyi Chen · Gan Luo · Kun Yuan

[ West Exhibition Hall B2-B3 ]

Abstract
A key challenge in decentralized optimization is determining the optimal convergence rate and designing algorithms to achieve it. While this problem has been extensively addressed for doubly-stochastic and column-stochastic mixing matrices, the row-stochastic scenario remains unexplored. This paper bridges this gap by introducing effective metrics to capture the influence of row-stochastic mixing matrices and establishing the first convergence lower bound for decentralized learning over row-stochastic networks. However, existing algorithms fail to attain this lower bound due to two key issues: deviation in the descent direction caused by the adapted gradient tracking (GT) and instability introduced by the Pull-Diag protocol. To address descent deviation, we propose a novel analysis framework demonstrating that Pull-Diag-GT achieves linear speedup—the first such result for row-stochastic decentralized optimization. Moreover, by incorporating a multi-step gossip (MG) protocol, we resolve the instability issue and attain the lower bound, achieving near-optimal complexity for decentralized optimization over row-stochastic networks.
Spotlight Poster
Yuchen Zeng · Tuan Dinh · Wonjun Kang · Andreas Mueller

[ East Exhibition Hall A-B ]

Abstract
Leveraging the in-context learning (ICL) capability of Large Language Models (LLMs) for tabular classification has gained significant attention for its training-free adaptability across diverse datasets. Recent advancements, like TabPFN, excel in small-scale tabular datasets but struggle to scale for large and complex datasets. Our work enhances the efficiency and scalability of TabPFN for larger datasets by incorporating linear attention mechanisms as a scalable alternative to complexity-quadratic self-attention. Our model, TabFlex, efficiently handles tabular datasets with thousands of features and hundreds of classes, scaling seamlessly to millions of samples. For instance, TabFlex processes the poker-hand dataset with over a million samples in just 5 seconds. Our extensive evaluations demonstrate that TabFlex can achieve over a 2× speedup compared to TabPFN and a 1.5× speedup over XGBoost, outperforming 25 tested baselines in terms of efficiency across a diverse range of datasets. Furthermore, TabFlex remains highly effective on large-scale datasets, delivering strong performance with significantly reduced computational costs, especially when combined with data-efficient techniques such as dimensionality reduction and data sampling.
Spotlight Poster
Andreas Kalavas · Evangelos Kipouridis · Nithin Varma

[ West Exhibition Hall B2-B3 ]

Abstract
In the Correlation Clustering problem, we are given an undirected graph and are tasked with computing a clustering (partition of the nodes) that minimizes the sum of the number of edges across different clusters and the number of non-edges within clusters. In the constrained version of this problem, the goal is to compute a clustering that satisfies additional hard constraints mandating certain pairs to be in the same cluster and certain pairs to be in different clusters. Constrained Correlation Clustering is APX-Hard, and the best known approximation factor is 3 (van Zuylen et al. [SODA '07]). In this work, we show that in order to obtain a better-than-2 approximation, solving the (exponentially large) Constrained Cluster LP would be sufficient.[The peer-reviewed version of this article claimed an efficient algorithm for solving the Constrained Cluster LP. An error in the proof, that the authors discovered after the review process, led them to revise the results to be conditional on the existence of a valid LP solution.]
Spotlight Poster
Divya Shyamal · Jiaqi Zhang · Caroline Uhler

[ West Exhibition Hall B2-B3 ]

Abstract
A _combinatorial intervention_, consisting of multiple treatments applied to a single unit with potential interactive effects, has substantial applications in fields such as biomedicine, engineering, and beyond. Given $p$ possible treatments, conducting all possible $2^p$ combinatorial interventions can be laborious and quickly becomes infeasible as $p$ increases. Here we introduce the _probabilistic factorial experimental design_, formalized from how scientists perform lab experiments. In this framework, the experimenter selects a dosage for each possible treatment and applies it to a group of units. Each unit independently receives a random combination of treatments, sampled from a product Bernoulli distribution determined by the dosages. Additionally, the experimenter can carry out such experiments over multiple rounds, adapting the design in an active manner. We address the optimal experimental design problem within a novel intervention model that imposes bounded-degree interactions between treatments. In the passive setting, we provide a closed-form solution for the near-optimal design. Our results prove that a dosage of $\frac{1}{2}$ for each treatment is optimal up to a factor of $1+O(\frac{\ln(n)}{n})$ for estimating any $k$-way interaction model, regardless of $k$, and imply that $O\big(kp^{3k}\ln(p)\big)$ observations are required to accurately estimate this model. For the multi-round setting, we provide a near-optimal acquisition function …
Spotlight Poster
Sally Zhu · Ahmed Ahmed · Rohith Kuditipudi · Percy Liang

[ East Exhibition Hall A-B ]

Abstract
Motivated by liability and intellectual property concerns over open-weight models we consider the following problem: given the weights of two models, can we test whether they were trained independently---i.e., from independent random initializations? We consider two settings: *constrained* and *unconstrained*. In the constrained setting, we make assumptions about model architecture and training and propose statistical tests that yield exact p-values with respect to the null hypothesis that the models are trained from independent random initializations. We compute the p-values by simulating exchangeable copies of each model under our assumptions and comparing various similarity measures between the original two models versus these copies. We report p-values on pairs of 21 open-weight models (210 total pairs) and find we correctly identify all pairs of non-independent models. In the unconstrained setting we make none of the prior assumptions and allow for adversarial evasion attacks that do not change model output. We thus propose a new test which matches hidden activations between two models, which is robust to these transformations and to changes in model architecture and can also identify specific non-independent components of models. Though we no longer obtain exact p-values from this test, empirically we find it reliably distinguishes non-independent models like …
Spotlight Poster
Hjalmar Wijk · Tao Lin · Joel Becker · Sami Jawhar · Neev Parikh · Thomas Broadley · Lawrence Chan · Michael Chen · Joshua Clymer · Jai Dhyani · Elena Ericheva · Katharyn Garcia · Brian Goodrich · Nikola Jurkovic · Megan Kinniment · Aron Lajko · Seraphina Nix · Lucas Jun Koba Sato · William Saunders · Maksym Taran · Ben West · Elizabeth Barnes

[ East Exhibition Hall A-B ]

Abstract
Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a direct comparison to human performance. We introduce RE-Bench (Research Engineering Benchmark, V1), which consists of 7 challenging, open-ended ML research engineering environments and data from 71 8-hour attempts by 61 distinct human experts. We confirm that our experts make progress in the environments given 8 hours, with 82% of expert attempts achieving a non-zero score and 24% matching or exceeding our strong reference solutions. We compare humans to several public frontier models through best-of-$k$ with varying time budgets and agent designs, and find that the best AI agents achieve a score 4× higher than human experts when both are given a total time budget of 2 hours per environment. However, humans currently display better returns to increasing time budgets, narrowly exceeding the top AI agent scores given an 8-hour budget, and achieving 2× the score of the top AI agent when both are given 32 total hours (across different attempts).
Spotlight Poster
Yinhong Liu · Zhijiang Guo · Tianya Liang · Ehsan Shareghi · Ivan Vulić · Nigel Collier

[ East Exhibition Hall A-B ]

Abstract
Large Language Models (LLMs) are expected to be predictable and trustworthy to support reliable decision-making systems. Yet current LLMs often show inconsistencies in their judgments. In this work, we examine \textit{logical preference consistency} as a foundational requirement for building more dependable LLM systems, ensuring stable and coherent decision-making while minimizing erratic or contradictory outputs.To quantify the logical preference consistency, we propose a universal evaluation framework based on three fundamental properties: *transitivity*, *commutativity* and *negation invariance*.Through extensive experimentation across diverse LLMs, we demonstrate that these properties serve as strong indicators of judgment robustness.Furthermore, we introduce a data refinement and augmentation technique, REPAIR, that enhances logical consistency while maintaining alignment with human preferences. Finally, we show that improving consistency leads to better performance in LLM-driven logic-based algorithms, reinforcing stability and coherence in decision-making systems.
Spotlight Poster
Xiang Fang · Arvind Easwaran · Blaise Genest

[ West Exhibition Hall B2-B3 ]

Abstract
Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unavailable outputs. Most OOD detection methods require many ID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where only a few labeled ID samples are available. Therefore, few-shot OOD detection is much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network: Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distribution. To compensate for the absence of OOD and scarcity of ID image samples, we leverage CLIP, connecting text with images, engineering learnable ID and OOD textual prompts. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts, and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. Experimental results show that AMCN outperforms other state-of-the-art works.
Spotlight Poster
Phillip Guo · Aaquib Syed · Abhay Sheshadri · Aidan Ewart · Gintare Karolina Dziugaite

[ East Exhibition Hall A-B ]

Abstract
Methods for knowledge editing and unlearning in large language models seek to edit or remove undesirable knowledge or capabilities without compromising general language modeling performance. This work investigates how mechanistic interpretability---which, in part, aims to identify model components (circuits) associated to specific interpretable mechanisms that make up a model capability---can improve the precision and effectiveness of editing and unlearning. We find a stark difference in unlearning and edit robustness when training components localized by different methods. We highlight an important distinction between methods that localize components based primarily on preserving outputs, and those finding high level mechanisms with predictable intermediate states. In particular, localizing edits/unlearning to components associated with the *lookup-table mechanism* for factual recall 1) leads to more robust edits/unlearning across different input/output formats, and 2) resists attempts to relearn the unwanted information, while also reducing unintended side effects compared to baselines, on both a sports facts dataset and the CounterFact dataset across multiple models.We also find that certain localized edits disrupt the latent knowledge in the model more than any other baselines, making unlearning more robust to various attacks.
Spotlight Poster
YuXin Li · Felix Dangel · Derek Tam · Colin Raffel

[ East Exhibition Hall A-B ]

Abstract
The diagonal of a model's Fisher Information Matrix (the "Fisher") has frequently been used as a way to measure parameter sensitivity.Typically, the Fisher is estimated by computing the squared gradient of the model's outputs with respect to its parameters, averaged over a few hundred or thousand examples — a process which incurs nontrivial computational costs.At the same time, adaptive gradient methods like the ubiquitous Adam optimizer compute a moving average of the squared gradient over the course of training.This paper therefore explores whether an approximation of the Fisher can be obtained "for free" by recycling the squared gradient accumulator that has already been computed over the course of training.Through a comprehensive set of experiments covering five applications of the Fisher, we demonstrate that the "Squisher" (**Squ**ared gradient accumulator as an approximation of the F**isher**) consistently performs similarly to the Fisher while outperforming baseline methods.Additionally, we clarify the exact differences between the Squisher and the Fisher and provide empirical quantification of their respective impact.
Spotlight Poster
Taehyun Cho · Seokhun Ju · Seungyub Han · Dohyeong Kim · Kyungjae Lee · Jungwoo Lee

[ West Exhibition Hall B2-B3 ]

Abstract
To design reward that align with human goals, Reinforcement Learning from Human Feedback (RLHF) has emerged as a prominent technique for learning reward functions from human preferences and optimizing models using reinforcement learning algorithms. However, existing RLHF methods often misinterpret trajectories as being generated by an optimal policy, causing inaccurate likelihood estimation and suboptimal learning. To address this, we propose Policy-labeled Preference Learning (PPL) within the Direct Preference Optimization (DPO) framework, which resolves these likelihood mismatch problems by modeling human preferences with regret, reflecting the efficiency of executed policies. Additionally, we introduce a contrastive KL regularization term derived from regret-based principles to enhance sequential contrastive learning. Experiments in high-dimensional continuous control environments demonstrate PPL's significant improvements in offline RLHF performance and its effectiveness in online settings.
Spotlight Poster
Ulzee An · Moonseong Jeong · Simon Lee · Aditya Gorla · Yuzhe Yang · Sriram Sankararaman

[ West Exhibition Hall B2-B3 ]

Abstract
Current challenges in developing foundational models for volumetric imaging data, such as magnetic resonance imaging (MRI), stem from the computational complexity of state-of-the-art architectures in high dimensions and curating sufficiently large datasets of volumes.To address these challenges, we introduce Raptor (Random Planar Tensor Reduction), a train-free method for generating semantically rich embeddings for volumetric data. Raptor leverages a frozen 2D foundation model, pretrained on natural images, to extract visual tokens from individual cross-sections of medical volumes. These tokens are then spatially compressed using random projections, significantly reducing computational complexity while retaining rich semantic information. Extensive experiments on 10 diverse medical volume tasks verify the superior performance of Raptor over state-of-the-art methods, including those pretrained exclusively on medical volumes (+3 SuPreM, +6 MISFM, +10 Merlin, +13 VoCo, and +14 SLIViT), while entirely bypassing the need for costly training. Our results highlight Raptor's effectiveness and versatility as a foundation for advancing deep learning-based methods for medical volumes (code: github.com/sriramlab/raptor).
Spotlight Poster
Hugh Dance · Pierre Glaser · Peter Orbanz · Ryan P. Adams

[ East Exhibition Hall A-B ]

Abstract
With the advent of automatic vectorization tools (e.g., JAX's vmap), writing multi-chain MCMC algorithms is often now as simple as invoking those tools on single-chain code. Whilst convenient, for various MCMC algorithms this results in a synchronization problem---loosely speaking, at each iteration all chains running in parallel must wait until the last chain has finished drawing its sample. In this work, we show how to design single-chain MCMC algorithms in a way that avoids synchronization overheads when vectorizing with tools like vmap, by using the framework of finite state machines (FSMs). Using a simplified model, we derive an exact theoretical form of the obtainable speed-ups using our approach, and use it to make principled recommendations for optimal algorithm design. We implement several popular MCMC algorithms as FSMs, including Elliptical Slice Sampling, HMC-NUTS, and Delayed Rejection, demonstrating speed-ups of up to an order of magnitude in experiments.
Spotlight Poster
Matteo Sesia · vladimir svetnik

[ East Exhibition Hall A-B ]

Abstract
We present a conformal inference method for constructing lower prediction bounds for survival times from right-censored data, extending recent approaches designed for more restrictive type-I censoring scenarios. The proposed method imputes unobserved censoring times using a machine learning model, and then analyzes the imputed data using a survival model calibrated via weighted conformal inference. This approach is theoretically supported by an asymptotic double robustness property. Empirical studies on simulated and real data demonstrate that our method leads to relatively informative predictive inferences and is especially robust in challenging settings where the survival model may be inaccurate.
Spotlight Poster
Zhenghai Xue · Lang Feng · Jiacheng Xu · Kang Kang · xiang wen · Bo An · Shuicheng YAN

[ West Exhibition Hall B2-B3 ]

Abstract
To learn from data collected in diverse dynamics, Imitation from Observation (IfO) methods leverage expert state trajectories based on thepremise that recovering expert state distributions in other dynamics facilitates policy learning in the current one. However, Imitation Learning inherently imposes a performance upper bound of learned policies. Additionally, as the environment dynamics change, certain expert states may become inaccessible, rendering their distributions less valuable for imitation. To address this, we propose a novel framework that integrates reward maximization with IfO, employing F-distance regularized policy optimization. This framework enforces constraints on globally accessible states—those with nonzero visitation frequency across all considered dynamics—mitigating the challenge posed by inaccessible states. By instantiating F-distance in different ways, we derive two theoretical analysis and develop a practical algorithm called Accessible State Oriented Policy Regularization (ASOR). ASOR serves as a general-purpose module that can be incorporated into various RL approaches, including offline RL and off-policy RL. Extensive experiments across multiple benchmarks demonstrate ASOR’s effectiveness in enhancing state-of-the-art cross-domain policy transfer algorithms, significantly improving their performance.
Spotlight Poster
Xiyuan Wei · Ming Lin · Fanjiang Ye · Fengguang Song · Liangliang Cao · My T. Thai · Tianbao Yang

[ East Exhibition Hall A-B ]

Abstract
This paper formalizes an emerging learning paradigm that uses a trained model as a reference to guide and enhance the training of a target model through strategic data selection or weighting, named **model steering**. While ad-hoc methods have been used in various contexts, including the training of large foundation models, its underlying principles remain insufficiently understood, leading to sub-optimal performance. In this work, we propose a theory-driven framework for model steering called **DRRho risk minimization**, which is rooted in Distributionally Robust Optimization (DRO). Through a generalization analysis, we provide theoretical insights into why this approach improves generalization and data efficiency compared to training without a reference model. To the best of our knowledge, this is the first time such theoretical insights are provided for the new learning paradigm, which significantly enhance our understanding and practice of model steering. Building on these insights and the connection between contrastive learning and DRO, we introduce a novel method for Contrastive Language-Image Pretraining (CLIP) with a reference model, termed DRRho-CLIP. Extensive experiments validate the theoretical insights, reveal a superior scaling law compared to CLIP without a reference model, and demonstrate its strength over existing heuristic approaches. Code is released at [github.com/Optimization-AI/DRRho-CLIP](https://github.com/Optimization-AI/DRRho-CLIP)
Spotlight Poster
Gwanhyeong Koo · Sunjae Yoon · Younghwan Lee · Ji Woo Hong · Chang Yoo

[ West Exhibition Hall B2-B3 ]

Abstract
Drag-based editing allows precise object manipulation through point-based control, offering user convenience. However, current methods often suffer from a geometric inconsistency problem by focusing exclusively on matching user-defined points, neglecting the broader geometry and leading to artifacts or unstable edits. We propose FlowDrag, which leverages geometric information for more accurate and coherent transformations. Our approach constructs a 3D mesh from the image, using an energy function to guide mesh deformation based on user-defined drag points. The resulting mesh displacements are projected into 2D and incorporated into a UNet denoising process, enabling precise handle-to-target point alignment while preserving structural integrity. Additionally, existing drag-editing benchmarks provide no ground truth, making it difficult to assess how accurately the edits match the intended transformations. To address this, we present VFD (VidFrameDrag) benchmark dataset, which provides ground-truth frames using consecutive shots in a video dataset. FlowDrag outperforms existing drag-based editing methods on both VFD Bench and DragBench.
Spotlight Poster
Zhicheng Zhang · Wuyou Xia · Chenxi Zhao · Zhou Yan · Xiaoqiang Liu · Yongjie Zhu · Wenyu Qin · Pengfei Wan · Di ZHANG · Jufeng Yang

[ East Exhibition Hall A-B ]

Abstract
Multimodal large language models (MLLMs) recently showed strong capacity in integrating data among multiple modalities, empowered by generalizable attention architecture. Advanced methods predominantly focus on language-centric tuning while less exploring multimodal tokens mixed through attention, posing challenges in high-level tasks that require fine-grained cognition and emotion understanding. In this work, we identify the attention deficit disorder problem in multimodal learning, caused by inconsistent cross-modal attention and layer-by-layer decayed attention activation. To address this, we propose a novel attention mechanism, termed MOdular Duplex Attention (MODA), simultaneously conducting the inner-modal refinement and inter-modal interaction. MODA employs a correct-after-align strategy to effectively decouple modality alignment from cross-layer token mixing. In the alignment phase, tokens are mapped to duplex modality spaces based on the basis vectors, enabling the interaction between visual and language modality. Further, the correctness of attention scores is ensured through adaptive masked attention, which enhances the model's flexibility by allowing customizable masking patterns for different modalities. Extensive experiments on 21 benchmark datasets verify the effectiveness of MODA in perception, cognition, and emotion tasks.
Spotlight Poster
Sam Bowyer · Laurence Aitchison · Desi Ivanova

[ East Exhibition Hall A-B ]

Abstract
Rigorous statistical evaluations of large language models (LLMs), including valid error bars and significance testing, are essential for meaningful and reliable performance assessment. Currently, when such statistical measures are reported, they typically rely on the Central Limit Theorem (CLT). In this position paper, we argue that while CLT-based methods for uncertainty quantification are appropriate when benchmarks consist of thousands of examples, they fail to provide adequate uncertainty estimates for LLM evaluations that rely on smaller, highly specialized benchmarks. In these small-data settings, we demonstrate that CLT-based methods perform very poorly, usually dramatically underestimating uncertainty (i.e. producing error bars that are too small). We give recommendations for alternative frequentist and Bayesian methods that are both easy to implement and more appropriate in these increasingly common scenarios.
Spotlight Poster
Jonas Gehring · Kunhao Zheng · Jade Copet · Vegard Mella · Taco Cohen · Gabriel Synnaeve

[ East Exhibition Hall A-B ]

Abstract
Large language models (LLMs) deployed as agents solve user-specified tasks over multiple steps while keeping the required manual engagement to a minimum. Crucially, such LLMs need to ground their generations in any feedback obtained to reliably achieve the desired outcomes. We propose an end-to-end reinforcement learning method for teaching models to leverage execution feedback in the realm of code synthesis, where state-of-the-art LLMs struggle to improve code iteratively compared to independent sampling. We benchmark on competitive programming tasks and achieve large performance gains with both small (8B parameters) and large (70B) models, outperforming previous work while reducing the number of samples required by an order of magnitude. Our analysis of inference-time behavior demonstrates that our method produces LLMs that effectively leverage automatic feedback over multiple steps.
Spotlight Poster
Ruiqi Feng · Chenglei Yu · Wenhao Deng · Peiyan Hu · Tailin Wu

[ East Exhibition Hall A-B ]

Abstract
Flow matching has shown state-of-the-art performance in various generative tasks, ranging from image generation to decision-making, where generation under energy guidance (abbreviated as guidance in the following) is pivotal. However, the guidance of flow matching is more general than and thus substantially different from that of its predecessor, diffusion models. Therefore, the challenge in guidance for general flow matching remains largely underexplored. In this paper, we propose the first framework of general guidance for flow matching. From this framework, we derive a family of guidance techniques that can be applied to general flow matching. These include a new training-free asymptotically exact guidance, novel training losses for training-based guidance, and two classes of approximate guidance that cover classical gradient guidance methods as special cases. We theoretically investigate these different methods to give a practical guideline for choosing suitable methods in different scenarios. Experiments on synthetic datasets, image inverse problems, and offline reinforcement learning demonstrate the effectiveness of our proposed guidance methods and verify the correctness of our flow matching guidance framework. Code to reproduce the experiments can be found at https://github.com/AI4Science-WestlakeU/flow_guidance.
Spotlight Poster
Shashwat Goel · Joschka Strüber · Ilze Amanda Auzina · Karuna Chandra · Ponnurangam Kumaraguru · Douwe Kiela · Ameya Pandurang Prabhu · Matthias Bethge · Jonas Geiping

[ East Exhibition Hall A-B ]

Abstract
As Language Model (LM) capabilities advance, evaluating and supervising them at scale is getting harder for humans. There is hope that other language models can automate both these tasks, which we refer to as *AI Oversight*. We study how model similarity affects both aspects of AI oversight by proposing *Chance Adjusted Probabilistic Agreement (CAPA)*--a metric for LM similarity based on overlap in model mistakes. Using CAPA, we first show that *LLM-as-a-judge* scores favor models similar to the judge, generalizing recent self-preference results. Then, we study training on LM annotations, and find complementary knowledge between the weak supervisor and strong student model plays a crucial role in gains from *weak-to-strong generalization*. As model capabilities increase, it becomes harder to find their mistakes, and we might defer more to AI oversight. However, we observe a concerning trend--model mistakes are becoming more similar with increasing capabilities, pointing to risks from correlated failures. Our work underscores the importance of reporting and correcting for model similarity, especially in the emerging paradigm of AI oversight.
Spotlight Poster
Qi Yang · Chenghao Zhang · Lubin Fan · Kun Ding · Jieping Ye · Shiming Xiang

[ East Exhibition Hall A-B ]

Abstract
Recent advancements in Large Vision Language Models (LVLMs) have significantly improved performance in Visual Question Answering (VQA) tasks through multimodal Retrieval-Augmented Generation (RAG). However, existing methods still face challenges, such as the scarcity of knowledge with reasoning examples and erratic responses from retrieved knowledge. To address these issues, in this study, we propose a multimodal RAG framework, termed RCTS, which enhances LVLMs by constructing a Reasoning Context-enriched knowledge base and a Tree Search re-ranking method. Specifically, we introduce a self-consistent evaluation mechanism to enrich the knowledge base with intrinsic reasoning patterns. We further propose a Monte Carlo Tree Search with Heuristic Rewards (MCTS-HR) to prioritize the most relevant examples. This ensures that LVLMs can leverage high-quality contextual reasoning for better and more consistent responses. Extensive experiments demonstrate that our framework achieves state-of-the-art performance on multiple VQA datasets, significantly outperforming In-Context Learning (ICL) and Vanilla-RAG methods. It highlights the effectiveness of our knowledge base and re-ranking method in improving LVLMs.
Spotlight Poster
Henry Moss · Sebastian Ober · Tom Diethe

[ East Exhibition Hall A-B ]

Abstract
Bayesian optimisation in the latent space of a VAE is a powerful framework for optimisation tasks over complex structured domains, such as the space of valid molecules. However, existing approaches tightly couple the surrogate and generative models, which can lead to suboptimal performance when the latent space is not tailored to specific tasks, which in turn has led to the proposal of increasingly sophisticated algorithms. In this work, we explore a new direction, instead proposing a decoupled approach that trains a generative model and a GP surrogate separately, then combines them via a simple yet principled Bayesian update rule. This separation allows each component to focus on its strengths— structure generation from the VAE and predictive modelling by the GP. We show that our decoupled approach improves our ability to identify high-potential candidates in molecular optimisation problems under constrained evaluation budgets.
Spotlight Poster
Zihan Guan · Mengxuan Hu · Ronghang Zhu · Sheng Li · Anil Vullikanti

[ East Exhibition Hall A-B ]

Abstract
Recent studies have uncovered a troubling vulnerability in the fine-tuning stage of large language models (LLMs): even fine-tuning on entirely benign datasets can lead to a significant increase in the harmfulness of LLM outputs. Building on this finding, our red teaming study takes this threat one step further by developing a more effective attack. Specifically, we analyze and identify samples within benign datasets that contribute most to safety degradation, then fine-tune LLMs exclusively on these samples. We approach this problem from an outlier detection perspective and propose Self-Inf-N, to detect and extract outliers for fine-tuning. Our findings reveal that fine-tuning LLMs on 100 outlier samples selected by Self-Inf-N in the benign datasets severely compromises LLM safety alignment. Extensive experiments across seven mainstream LLMs demonstrate that our attack exhibits high transferability across different architectures and remains effective in practical scenarios. Alarmingly, our results indicate that most existing mitigation strategies fail to defend against this attack, underscoring the urgent need for more robust alignment safeguards. Codes are available at https://github.com/GuanZihan/Benign-Samples-Matter.
Spotlight Poster
Mingjun Wang · Yihan Wen · Bin Sun · Jianan Mu · Juan Li · Xiaoyi Wang · Jing Ye · Bei Yu · Huawei Li

[ West Exhibition Hall B2-B3 ]

Abstract
Accurate and efficient timing prediction at the register-transfer level (RTL) remains a fundamental challenge in electronic design automation (EDA), particularly in striking a balance between accuracy and computational efficiency. While static timing analysis (STA) provides high-fidelity results through comprehensive physical parameters, its computational overhead makes it impractical for rapid design iterations. Conversely, existing RTL-level approaches sacrifice accuracy due to the limited physical information available. We propose RTLDistil, a novel cross-stage knowledge distillation framework that bridges this gap by transferring precise physical characteristics from a layout-aware teacher model (Teacher GNN) to an efficient RTL-level student model (Student GNN), both implemented as graph neural networks (GNNs). RTLDistil efficiently predicts key timing metrics, such as arrival time (AT), and employs a multi-granularity distillation strategy that captures timing-critical features at node, subgraph, and global levels. Experimental results demonstrate that RTLDistil achieves significant improvement in RTL-level timing prediction error reduction, compared to state-of-the-art prediction models. This framework enables accurate early-stage timing prediction, advancing EDA's ``left-shift'' paradigm while maintaining computational efficiency. Our code and dataset will be publicly available at https://github.com/sklp-eda-lab/RTLDistil.
Spotlight Poster
Yu Sun · Xinhao Li · Karan Dalal · Jiarui Xu · Arjun Vikram · Genghan Zhang · Yann Dubois · Xinlei Chen · Xiaolong Wang · Sanmi Koyejo · Tatsunori Hashimoto · Carlos Guestrin

[ East Exhibition Hall A-B ]

Abstract
Self-attention performs well in long context but has quadratic complexity. Existing RNN layers have linear complexity, but their performance in long context is limited by the expressive power of their hidden states. We present a practical framework for instantiating sequence modeling layers with linear complexity and expressive hidden states. The key idea is to make the hidden state a machine learning model itself, and the update rule a step of self-supervised learning. Since the hidden state is updated by training even on test sequences, our layers are called Test-Time Training (TTT) layers. We consider two instantiations: TTT-Linear and TTT-MLP, whose hidden state is a linear model and a two-layer MLP respectively. We evaluate our instantiations at the scale of 125M to 1.3B parameters, comparing with a strong Transformer and Mamba, a modern RNN. Similar to Transformer, TTT-Linear and TTT-MLP can keep reducing perplexity by conditioning on more tokens, while Mamba cannot after 16k context. TTT-MLP still faces challenges in memory I/O, but shows larger potential in long context, pointing to a promising direction for future research.
Spotlight Poster
Xingyu Zhou · Yulian Wu · Francesco Orabona

[ East Exhibition Hall A-B ]

Abstract
In this paper, we theoretically investigate the effects of noisy labels in offline alignment, with a focus on the interplay between privacy and robustness against adversarial corruption. Specifically, under linear modeling assumptions, we present a unified analysis covering both reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) under different privacy-corruption scenarios, such as Local differential privacy-then-Corruption (LTC), where human preference labels are privatized before being corrupted by an adversary, and Corruption-then-Local differential privacy (CTL), where labels are corrupted before privacy protection. Our analysis leverages a reduction framework that reduces the offline alignment problem under linear modeling assumptions to parameter estimation in logistic regression. This framework allows us to establish an interesting separation result between LTC and CTL, demonstrating that LTC presents a greater challenge than CTL in offline alignment, even under linear models. As important by-products, our findings also advance the state-of-the-art theoretical results in offline alignment under privacy-only or corruption-only scenarios.
Spotlight Poster
Emilia Magnani · Marvin Pförtner · Tobias Weber · Philipp Hennig

[ East Exhibition Hall A-B ]

Abstract
Neural operators generalize neural networks to learn mappings between function spaces from data. They are commonly used to learn solution operators of parametric partial differential equations (PDEs) or propagators of time-dependent PDEs. However, to make them useful in high-stakes simulation scenarios, their inherent predictive error must be quantified reliably. We introduce LUNO, a novel framework for approximate Bayesian uncertainty quantification in trained neural operators. Our approach leverages model linearization to push (Gaussian) weight-space uncertainty forward to the neural operator's predictions.We show that this can be interpreted as a probabilistic version of the concept of currying from functional programming, yielding a function-valued (Gaussian) random process belief. Our framework provides a practical yet theoretically sound way to apply existing Bayesian deep learning methods such as the linearized Laplace approximation to neural operators. Just as the underlying neural operator, our approach is resolution-agnostic by design.The method adds minimal prediction overhead, can be applied post-hoc without retraining the network, and scales to large models and datasets.We evaluate these aspects in a case study on Fourier neural operators.
Spotlight Poster
Guanzheng Chen · Qilong Feng · Jinjie Ni · Xin Li · Michael Shieh

[ East Exhibition Hall A-B ]

Abstract
The emergence of long-context large language models (LLMs) offers a promising alternative to traditional retrieval-augmented generation (RAG) for processing extensive documents. However, the computational overhead of long-context inference presents significant efficiency challenges. While Speculative Decoding (SD) traditionally accelerates inference using smaller draft models, its effectiveness diminishes substantially in long-context scenarios due to memory-bound KV cache operations. We introduce Retrieval-Augmented Speculative Decoding (RAPID), which leverages RAG for both accelerating and enhancing generation quality in long-context inference. RAPID introduces the RAG drafter—a draft LLM operating on shortened retrieval contexts—to speculate on the generation of long-context target LLMs. Our approach enables a new paradigm where same-scale or even larger LLMs can serve as RAG drafters while maintaining computational efficiency. To fully leverage the potentially superior capabilities from stronger RAG drafters, we develop an inference-time knowledge transfer that enriches the target distribution by RAG. Extensive experiments on the LLaMA-3.1 and Qwen2.5 backbones demonstrate that RAPID effectively integrates the strengths of both RAG and long-context LLMs, achieving significant performance improvements (e.g., from 39.33 to 42.83 on InfiniteBench for LLaMA-3.1-8B) with more than 2$\times$ speedups for long-context inference. Our analyses also reveal the robustness of RAPID across various context lengths and retrieval quality.
Spotlight Poster
Matthew Ashman · Cristiana Diaconu · Eric Langezaal · Adrian Weller · Richard E Turner

[ East Exhibition Hall A-B ]

Abstract
Effective modelling of large-scale spatio-temporal datasets is essential for many domains, yet existing approaches often impose rigid constraints on the input data, such as requiring them to lie on fixed-resolution grids. With the rise of foundation models, the ability to process diverse, heterogeneous data structures is becoming increasingly important. Neural processes (NPs), particularly transformer neural processes (TNPs), offer a promising framework for such tasks, but struggle to scale to large spatio-temporal datasets due to the lack of an efficient attention mechanism. To address this, we introduce gridded pseudo-token TNPs which employ specialised encoders and decoders to handle unstructured data and utilise a processor comprising gridded pseudo-tokens with efficient attention mechanisms. Furthermore, we develop equivariant gridded TNPs for applications where exact or approximate translation equivariance is a useful inductive bias, improving accuracy and training efficiency. Our method consistently outperforms a range of strong baselines in various synthetic and real-world regression tasks involving large-scale data, while maintaining competitive computational efficiency. Experiments with weather data highlight the potential of gridded TNPs and serve as just one example of a domain where they can have a significant impact.
Spotlight Poster
Yedi Zhang · Aaditya Singh · Peter Latham · Andrew Saxe

[ West Exhibition Hall B2-B3 ]

Abstract
While attention-based models have demonstrated the remarkable ability of in-context learning (ICL), the theoretical understanding of how these models acquired this ability through gradient descent training is still preliminary. Towards answering this question, we study the gradient descent dynamics of multi-head linear self-attention trained for in-context linear regression. We examine two parametrizations of linear self-attention: one with the key and query weights merged as a single matrix (common in theoretical studies), and one with separate key and query matrices (closer to practical settings). For the merged parametrization, we show that the training dynamics has two fixed points and the loss trajectory exhibits a single, abrupt drop. We derive an analytical time-course solution for a certain class of datasets and initialization. For the separate parametrization, we show that the training dynamics has exponentially many fixed points and the loss exhibits saddle-to-saddle dynamics, which we reduce to scalar ordinary differential equations. During training, the model implements principal component regression in context with the number of principal components increasing over training time. Overall, we provide a theoretical description of how ICL abilities evolve during gradient descent training of linear attention, revealing abrupt acquisition or progressive improvements depending on how the key and query …
Spotlight Poster
Olga Ovcharenko · Florian Barkmann · Philip Toma · Imant Daunhawer · Julia Vogt · Sebastian Schelter · Valentina Boeva

[ West Exhibition Hall B2-B3 ]

Abstract
Self-supervised learning (SSL) has proven to be a powerful approach for extracting biologically meaningful representations from single-cell data. To advance our understanding of SSL methods applied to single-cell data, we present scSSL-Bench, a comprehensive benchmark that evaluates nineteen SSL methods. Our evaluation spans nine datasets and focuses on three common downstream tasks: batch correction, cell type annotation, and missing modality prediction. Furthermore, we systematically assess various data augmentation strategies. Our analysis reveals task-specific trade-offs: the specialized single-cell frameworks, scVI, CLAIRE, and the finetuned scGPT excel at uni-modal batch correction, while generic SSL methods, such as VICReg and SimCLR, demonstrate superior performance in cell typing and multi-modal data integration. Random masking emerges as the most effective augmentation technique across all tasks, surpassing domain-specific augmentations. Notably, our results indicate the need for a specialized single-cell multi-modal data integration framework. scSSL-Bench provides a standardized evaluation platform and concrete recommendations for applying SSL to single-cell analysis, advancing the convergence of deep learning and single-cell genomics.
Spotlight Poster
Judy Hanwen Shen · Ellen Vitercik · Anders Wikum

[ West Exhibition Hall B2-B3 ]

Abstract
The field of *algorithms with predictions* incorporates machine learning advice in the design of online algorithms to improve real-world performance. A central consideration is the extent to which predictions can be trusted—while existing approaches often require users to specify an aggregate trust level, modern machine learning models can provide estimates of prediction-level uncertainty. In this paper, we propose *calibration* as a principled and practical tool to bridge this gap, demonstrating the benefits of calibrated advice through two case studies: the *ski rental* and *online job scheduling* problems. For ski rental, we design an algorithm that achieves near-optimal prediction-dependent performance and prove that, in high-variance settings, calibrated advice offers more effective guidance than alternative methods for uncertainty quantification. For job scheduling, we demonstrate that using a calibrated predictor leads to significant performance improvements over existing methods. Evaluations on real-world data validate our theoretical findings, highlighting the practical impact of calibration for algorithms with predictions.
Spotlight Poster
Yizi Zhang · Yanchen Wang · Mehdi Azabou · Alexandre Andre · Zixuan Wang · Hanrui Lyu · International Brain Laboratory · Eva Dyer · Department of Statistics Liam Paninski · Cole Hurwitz

[ West Exhibition Hall B2-B3 ]

Abstract
Recent work has demonstrated that large-scale, multi-animal models are powerful tools for characterizing the relationship between neural activity and behavior. Current large-scale approaches, however, focus exclusively on either predicting neural activity from behavior (encoding) or predicting behavior from neural activity (decoding), limiting their ability to capture the bidirectional relationship between neural activity and behavior. To bridge this gap, we introduce a multimodal, multi-task model that enables simultaneous Neural Encoding and Decoding at Scale (NEDS). Central to our approach is a novel multi-task-masking strategy, which alternates between neural, behavioral, within-modality, and cross-modality masking. We pretrain our method on the International Brain Laboratory (IBL) repeated site dataset, which includes recordings from 83 animals performing the visual decision-making task. In comparison to other large-scale modeling approaches, we demonstrate that NEDS achieves state-of-the-art performance for both encoding and decoding when pretrained on multi-animal data and then fine-tuned on new animals. Surprisingly, NEDS's learned embeddings exhibit emergent properties: even without explicit training, they are highly predictive of the brain regions in each recording. Altogether, our approach is a step towards a foundation model of the brain that enables seamless translation between neural activity and behavior.
Spotlight Poster
Dongzhe Zheng · Wenjie Mei

[ East Exhibition Hall A-B ]

Abstract
Learning unknown dynamics under environmental (or external) constraints is fundamental to many fields (e.g., modern robotics), particularly challenging when constraint information is only locally available and uncertain. Existing approaches requiring global constraints or using probabilistic filtering fail to fully exploit the geometric structure inherent in local measurements (by using, e.g., sensors) and constraints. This paper presents a geometric framework unifying measurements, constraints, and dynamics learning through a fiber bundle structure over the state space. This naturally induced geometric structure enables measurement-aware Control Barrier Functions that adapt to local sensing (or measurement) conditions. By integrating Neural ODEs, our framework learns continuous-time dynamics while preserving geometric constraints, with theoretical guarantees of learning convergence and constraint satisfaction dependent on sensing quality. The geometric framework not only enables efficient dynamics learning but also suggests promising directions for integration with reinforcement learning approaches. Extensive simulations demonstrate significant improvements in both learning efficiency and constraint satisfaction over traditional methods, especially under limited and uncertain sensing conditions.
Spotlight Poster
Daoyuan Chen · Haibin Wang · Yilun Huang · Ce Ge · Yaliang Li · Bolin Ding · Jingren Zhou

[ East Exhibition Hall A-B ]

Abstract
The emergence of multimodal large models has advanced artificial intelligence, introducing unprecedented levels of performance and functionality. However, optimizing these models remains challenging due to historically isolated paths of model-centric and data-centric developments, leading to suboptimal outcomes and inefficient resource utilization. In response, we present a new sandbox suite tailored for integrated data-model co-development. This sandbox provides a feedback-driven experimental platform, enabling cost-effective iteration and guided refinement of both data and models. Our proposed ``Probe-Analyze-Refine'' workflow, validated through practical use cases on multimodal tasks such as image-text pre-training with CLIP, image-to-text generation with LLaVA-like models, and text-to-video generation with DiT-based models, yields transferable and notable performance boosts, such as topping the VBench leaderboard. A comprehensive set of over 100 experiments demonstrated the suite's usability and extensibility, while also uncovering insights into the interplay between data quality, diversity, model behavior, and computational costs. All codes, datasets, and models are open-sourced to foster future research and applications that would otherwise be infeasible due to the lack of a dedicated co-development infrastructure.
Spotlight Poster
Binchi Zhang · Zaiyi Zheng · Zhengzhang Chen · Jundong Li

[ West Exhibition Hall B2-B3 ]

Abstract
Symmetry in the parameter space of deep neural networks (DNNs) has proven beneficial for various deep learning applications. A well-known example is the permutation symmetry in Multi-Layer Perceptrons (MLPs), where permuting the rows of weight matrices in one layer and applying the inverse permutation to adjacent layers yields a functionally equivalent model. While permutation symmetry fully characterizes the equivalence set for MLPs, its discrete nature limits its utility for transformers. In this paper, we introduce rotation symmetry, a novel form of parameter space symmetry for transformers that generalizes permutation symmetry by rotating parameter matrices in self-attention layers. Unlike permutation symmetry, rotation symmetry operates in a continuous domain, thereby significantly expanding the equivalence set for transformers. Based on this property, we propose a theoretically optimal parameter matching algorithm as a plug-and-play module to enhance model fusion. We evaluate our approach using pre-trained transformers across diverse natural language and vision tasks. Experimental results demonstrate that our rotation symmetry-based matching algorithm substantially improves model fusion, highlighting the potential of parameter space symmetry to facilitate model fusion. Our code is available on https://github.com/zhengzaiyi/RotationSymmetry
Spotlight Poster
Georg Bökman · David Nordström · Fredrik Kahl

[ West Exhibition Hall B2-B3 ]

Abstract
Incorporating geometric invariance into neural networks enhances parameter efficiency but typically increases computational costs.This paper introduces new equivariant neural networksthat preserve symmetry while maintaining a comparable number of floating-point operations (FLOPs) per parameter to standard non-equivariant networks. We focus on horizontal mirroring (flopping) invariance, common in many computer vision tasks.The main idea is to parametrize the feature spaces in terms of mirror-symmetric and mirror-antisymmetric features, i.e., irreps of the flopping group.This decomposes the linear layers to be block-diagonal, requiring half the number of FLOPs.Our approach reduces both FLOPs and wall-clock time,providing a practical solution for efficient, scalable symmetry-aware architectures.
Spotlight Poster
Kaiwen Zheng · Yongxin Chen · Huayu Chen · Guande He · Ming-Yu Liu · Jun Zhu · Qinsheng Zhang

[ East Exhibition Hall A-B ]

Abstract
While likelihood-based generative models, particularly diffusion and autoregressive models, have achieved remarkable fidelity in visual generation, the maximum likelihood estimation (MLE) objective, which minimizes the forward KL divergence, inherently suffers from a mode-covering tendency that limits the generation quality under limited model capacity. In this work, we propose Direct Discriminative Optimization (DDO) as a unified framework that integrates likelihood-based generative training and GAN-type discrimination to bypass this fundamental constraint by exploiting reverse KL and self-generated negative signals. Our key insight is to parameterize a discriminator implicitly using the likelihood ratio between a learnable target model and a fixed reference model, drawing parallels with the philosophy of Direct Preference Optimization (DPO). Unlike GANs, this parameterization eliminates the need for joint training of generator and discriminator networks, allowing for direct, efficient, and effective finetuning of a well-trained model to its full potential beyond the limits of MLE. DDO can be performed iteratively in a self-play manner for progressive model refinement, with each round requiring less than 1\% of pretraining epochs. Our experiments demonstrate the effectiveness of DDO by significantly advancing the previous SOTA diffusion model EDM, reducing FID scores from 1.79/1.58/1.96 to new records of 1.30/0.97/1.26 on CIFAR-10/ImageNet-64/ImageNet 512$\times$512 datasets without any …
Spotlight Poster
Xiuyuan Wang · Chaochao Chen · Weiming Liu · Xinting Liao · Fan Wang · Xiaolin Zheng

[ East Exhibition Hall A-B ]

Abstract
With growing privacy concerns and the enforcement of data protection regulations, machine unlearning has emerged as a promising approach for removing the influence of forget data while maintaining model performance on retain data. However, most existing unlearning methods require access to the original training data, which is often impractical due to privacy policies, storage constraints, and other limitations. This gives rise to the challenging task of source-free unlearning, where unlearning must be accomplished without accessing the original training data. Few existing source-free unlearning methods rely on knowledge distillation and model retraining, which impose substantial computational costs. In this work, we propose the Data Synthesis-based Discrimination-Aware (DSDA) unlearning framework, which enables efficient source-free unlearning in two stages: (1) Accelerated Energy-Guided Data Synthesis (AEGDS), which employs Langevin dynamics to model the training data distribution while integrating Runge–Kutta methods and momentum to enhance efficiency. (2) Discrimination-Aware Multitask Optimization (DAMO), which refines the feature distribution of retain data and mitigates the gradient conflicts among multiple unlearning objectives. Extensive experiments on three benchmark datasets demonstrate that DSDA outperforms existing unlearning methods, validating its effectiveness and efficiency in source-free unlearning.
Spotlight Poster
Guancheng Wan · Zijie Huang · Wanjia Zhao · Xiao Luo · Yizhou Sun · Wei Wang

[ West Exhibition Hall B2-B3 ]

Abstract
Coupled dynamical systems govern essential phenomena across physics, biology, and engineering, where components interact through complex dependencies. While Graph Ordinary Differential Equations (GraphODE) offer a powerful framework to model these systems, their **generalization** capabilities degrade severely under limited observational training data due to two fundamental flaws: (i) the entanglement of static attributes and dynamic states in the initialization process, and (ii) the reliance on context-specific coupling patterns during training, which hinders performance in unseen scenarios. In this paper, we propose a Generalizable GraphODE with disentanglement and regularization (GREAT) to address these challenges. Through systematic analysis via the Structural Causal Model, we identify backdoor paths that undermine generalization and design two key modules to mitigate their effects. The *Dynamic-Static Equilibrium Decoupler (DyStaED)* disentangles static and dynamic states via orthogonal subspace projections, ensuring robust initialization. Furthermore, the *Causal Mediation for Coupled Dynamics (CMCD)* employs variational inference to estimate latent causal factors, reducing spurious correlations and enhancing universal coupling dynamics. Extensive experiments across diverse dynamical systems demonstrate that ours outperforms state-of-the-art methods within both in-distribution and out-of-distribution.
Spotlight Poster
Zi-Hao Zhou · Jun-Jie Wang · Tong Wei · Min-Ling Zhang

[ East Exhibition Hall A-B ]

Abstract
Contrastive learning has achieved remarkable success in learning effective representations, with supervised contrastive learning often outperforming self-supervised approaches. However, in real-world scenarios, data annotations are often ambiguous or inaccurate, meaning that class labels may not reliably indicate whether two examples belong to the same class. This limitation restricts the applicability of supervised contrastive learning. To address this challenge, we introduce the concept of ``continuous semantic similarity'' to define positive and negative pairs. Instead of directly relying on imprecise class labels, we measure the semantic similarity between example pairs, which quantifies how closely they belong to the same category by iteratively refining weak supervisory signals. Based on this concept, we propose a graph-theoretic framework for weakly-supervised contrastive learning, where semantic similarity serves as the graph weights. Our framework is highly versatile and can be applied to many weakly-supervised learning scenarios. We demonstrate its effectiveness through experiments in two common settings, i.e., noisy label and partial label learning, where existing methods can be easily integrated to significantly improve performance. Theoretically, we establish an error bound for our approach, showing that it can approximate supervised contrastive learning under mild conditions. The implementation code is available at [https://github.com/Speechless-10308/WSC](https://github.com/Speechless-10308/WSC).
Spotlight Poster
Seungwook Han · Jyothish Pari · Samuel Gershman · Pulkit Agrawal

[ East Exhibition Hall A-B ]

Abstract
Large Language Models (LLMs) have demonstrated impressive real-world utility, exemplifying artificial useful intelligence (AUI). However, their ability to reason adaptively and robustly -- the hallmarks of artificial general intelligence (AGI) -- remains fragile. While LLMs seemingly succeed in commonsense reasoning, programming, and mathematics, they struggle to generalize algorithmic understanding across novel contexts. Our experiments with algorithmic tasks in esoteric programming languages reveal that LLM's reasoning overfits to the training data and is limited in its transferability. We hypothesize that the core issue underlying such limited transferability is the coupling of reasoning and knowledge in LLMs. To transition from AUI to AGI, we propose disentangling knowledge and reasoning through three key directions: (1) pretaining to reason using RL from scratch as an alternative to the widely used next-token prediction pretraining, (2) using a curriculum of synthetic tasks to ease the learning of a \textit{reasoning prior} for RL that can then be transferred to natural language tasks, and (3) learning more generalizable reasoning functions using a small context window to reduce exploiting spurious correlations between tokens. Such a reasoning system coupled with a trained retrieval system and a large external memory bank as a knowledge store can overcome several limitations of existing …
Spotlight Poster
Xingyu Zhu · Abhishek Panigrahi · Sanjeev Arora

[ East Exhibition Hall A-B ]

Abstract
We formalize a new concept for LLMs, **context-enhanced learning**. It involves standard gradient-based learning on text except that the context is enhanced with additional data on which no auto-regressive gradients are computed. This setting is a gradient-based analog of usual in-context learning (ICL) and appears in some recent works.Using a multi-step reasoning task, we prove in a simplified setting that context-enhanced learning can be **exponentially more sample-efficient** than standard learning when the model is capable of ICL. At a mechanistic level, we find that the benefit of context-enhancement arises from a more accurate gradient learning signal.We also experimentally demonstrate that it appears hard to detect or recover learning materials that were used in the context during training. This may have implications for data security as well as copyright.
Spotlight Poster
Yi-Rui Yang · Chang-Wei Shi · Wu-Jun Li

[ West Exhibition Hall B2-B3 ]

Abstract
Byzantine-robust distributed learning (BRDL), which refers to distributed learning that can work with potential faulty or malicious workers (also known as Byzantine workers), has recently attracted much research attention. Robust aggregators are widely used in existing BRDL methods to obtain robustness against Byzantine workers. However, Byzantine workers do not always exist in applications. As far as we know, there is almost no existing work theoretically investigating the effect of using robust aggregators when there are no Byzantine workers. To bridge this knowledge gap, we theoretically analyze the aggregation error for robust aggregators when there are no Byzantine workers. Specifically, we show that the worst-case aggregation error without Byzantine workers increases with the increase of the number of Byzantine workers that a robust aggregator can tolerate. The theoretical result reveals the tension between Byzantine robustness and no-attack accuracy, which refers to accuracy without faulty workers and malicious workers in this paper. Furthermore, we provide lower bounds for the convergence rate of gradient descent with robust aggregators for non-convex objective functions and objective functions that satisfy the Polyak-Lojasiewicz (PL) condition, respectively. We also prove the tightness of the lower bounds. The lower bounds for convergence rate reveal similar tension between Byzantine robustness …
Spotlight Poster
Haotian Wu · Gongpu Chen · Pier Luigi Dragotti · Deniz Gunduz

[ West Exhibition Hall B2-B3 ]

Abstract
We introduce and validate the lottery codec hypothesis, which states that untrained subnetworks within randomly initialized networks can serve as synthesis networks for overfitted image compression, achieving rate-distortion (RD) performance comparable to trained networks. This hypothesis leads to a new paradigm for image compression by encoding image statistics into the network substructure. Building on this hypothesis, we propose LotteryCodec, which overfits a binary mask to an individual image, leveraging an over-parameterized and randomly initialized network shared by the encoder and the decoder. To address over-parameterization challenges and streamline subnetwork search, we develop a rewind modulation mechanism that improves the RD performance. LotteryCodec outperforms VTM and sets a new state-of-the-art in single-image compression. LotteryCodec also enables adaptive decoding complexity through adjustable mask ratios, offering flexible compression solutions for diverse device constraints and application requirements.
Spotlight Poster
Zhangyi Liu · Feng Liu · Rui Gao · Shuang Li

[ West Exhibition Hall B2-B3 ]

Abstract
We study convergence properties of the discrete-time Mean-Field Langevin Stochastic Descent-Ascent (MFL-SDA) algorithm for solving distributional minimax optimization. These problems arise in various applications, such as zero-sum games, generative adversarial networks and distributionally robust learning. Despite the significance of MFL-SDA in these contexts, the discrete-time convergence rate remains underexplored.To address this gap, we establish a last-iterate convergence rate of $O(\frac{1}{\epsilon}\log\frac{1}{\epsilon})$ for MFL-SDA. This rate is nearly optimal when compared to the complexity lower bound of its Euclidean counterpart. This rate also matches the complexity of mean-field Langevin stochastic gradient descent for distributional minimization and the outer-loop iteration complexity of an existing double-loop algorithm for distributional minimax problems.By leveraging an elementary analysis framework that avoids PDE-based techniques, we overcome previous limitations and achieve a faster convergence rate.
Spotlight Poster
Yucheng Hu · Yanjiang Guo · Pengchao Wang · Xiaoyu Chen · Yen-Jen Wang · Jianke Zhang · Koushil Sreenath · Chaochao Lu · Jianyu Chen

[ West Exhibition Hall B2-B3 ]

Abstract
Visual representations play a crucial role in developing generalist robotic policies. Previous vision encoders, typically pre-trained with single-image reconstruction or two-image contrastive learning, tend to capture static information, often neglecting the dynamic aspects vital for embodied tasks. Recently, video diffusion models (VDMs) demonstrate the ability to predict future frames and showcase a strong understanding of physical world. We hypothesize that VDMs inherently produce visual representations that encompass both current static information and predicted future dynamics, thereby providing valuable guidance for robot action learning. Based on this hypothesis, we propose the Video Prediction Policy (VPP), which learns implicit inverse dynamics model conditioned on predicted future representations inside VDMs. To predict more precise future, we fine-tune pre-trained video foundation model on robot datasets along with internet human manipulation data.In experiments, VPP achieves a 18.6\% relative improvement on the Calvin ABC-D generalization benchmark compared to the previous state-of-the-art, and demonstrates a 31.6\% increase in success rates for complex real-world dexterous manipulation tasks. For your convenience, videos can be found at https://video-prediction-policy.github.io/
Spotlight Poster
Yichen Li · Yuying Wang · Haozhao Wang · Yining Qi · Tianzhe Xiao · Ruixuan Li

[ East Exhibition Hall A-B ]

Abstract
Continual Federated Learning (CFL) allows distributed devices to collaboratively learn novel concepts from continuously shifting training data while avoiding \textit{knowledge forgetting} of previously seen tasks. To tackle this challenge, most current CFL approaches rely on extensive rehearsal of previous data. Despite effectiveness, rehearsal comes at a cost to memory, and it may also violate data privacy. Considering these, we seek to apply regularization techniques to CFL by considering their cost-efficient properties that do not require sample caching or rehearsal. Specifically, we first apply traditional regularization techniques to CFL and observe that existing regularization techniques, especially synaptic intelligence, can achieve promising results under homogeneous data distribution but fail when the data is heterogeneous. Based on this observation, we propose a simple yet effective regularization algorithm for CFL named \textbf{FedSSI}, which tailors the synaptic intelligence for the CFL with heterogeneous data settings. FedSSI can not only reduce computational overhead without rehearsal but also address the data heterogeneity issue. Extensive experiments show that FedSSI achieves superior performance compared to state-of-the-art methods.
Spotlight Poster
Marc Jourdan · Gizem Yüce · Nicolas Flammarion

[ West Exhibition Hall B2-B3 ]

Abstract
Recent advances in language modeling have underscored the role of preference feedback in enhancing model performance. This paper investigates the conditions under which preference feedback improves parameter estimation in classes of continuous parametric distributions. In our framework, the learner observes pairs of samples from an unknown distribution along with their relative preferences depending on the same unknown parameter. We show that preferences-based M-estimators achieve a better asymptotic variance than sample-only M-estimators, further improved by deterministic preferences. Leveraging the hard constraints revealed by deterministic preferences, we propose an estimator achieving an estimation error scaling of $\mathcal{O}(1/n)$---a significant improvement over the $\Theta(1/\sqrt{n})$ rate attainable with samples alone. Next, we establish a lower bound that matches this accelerated rate; up to problem-dependent constants. While the assumptions underpinning our analysis are restrictive, they are satisfied by notable cases such as Gaussian or Laplace distributions for preferences based on the log-probability reward.
Spotlight Poster
Jianqing Zhang · Yang Liu · Jie Fu · Yang Hua · Tianyuan Zou · Jian Cao · Qiang Yang

[ East Exhibition Hall A-B ]

Abstract
The rise of generative APIs has fueled interest in privacy-preserving synthetic data generation. While the Private Evolution (PE) algorithm generates Differential Privacy (DP) synthetic images using diffusion model APIs, it struggles with few-shot private data due to the limitations of its DP-protected similarity voting approach. In practice, the few-shot private data challenge is particularly prevalent in specialized domains like healthcare and industry. To address this challenge, we propose a novel API-assisted algorithm, Private Contrastive Evolution (PCEvolve), which iteratively mines inherent inter-class contrastive relationships in few-shot private data beyond individual data points and seamlessly integrates them into an adapted Exponential Mechanism (EM) to optimize DP’s utility in an evolution loop. We conduct extensive experiments on four specialized datasets, demonstrating that PCEvolve outperforms PE and other API-assisted baselines. These results highlight the potential of leveraging API access with private data for quality evaluation, enabling the generation of high-quality DP synthetic images and paving the way for more accessible and effective privacy-preserving generative API applications. Our code is available at https://github.com/TsingZ0/PCEvolve.
Spotlight Poster
Oliver Eberle · Thomas McGee · Hamza Giaffar · Taylor Webb · Ida Momennejad

[ East Exhibition Hall A-B ]

Abstract
What algorithms do LLMs actually learn and use to solve problems? Studies addressing this question are sparse, as research priorities are focused on improving performance through scale, leaving a theoretical and empirical gap in understanding emergent algorithms. This position paper proposes AlgEval: a framework for systematic research into the algorithms that LLMs learn and use. AlgEval aims to uncover algorithmic primitives, reflected in latent representations, attention, and inference-time compute, and their algorithmic composition to solve task-specific problems. We highlight potential methodological paths and a case study toward this goal, focusing on emergent search algorithms. Our case study illustrates both the formation of top-down hypotheses about candidate algorithms, and bottom-up tests of these hypotheses via circuit-level analysis of attention patterns and hidden states. The rigorous, systematic evaluation of how LLMs actually solve tasks provides an alternative to resource-intensive scaling, reorienting the field toward a principled understanding of underlying computations. Such algorithmic explanations offer a pathway to human-understandable interpretability, enabling comprehension of the model's internal reasoning performance measures. This can in turn lead to more sample-efficient methods for training and improving performance, as well as novel architectures for end-to-end and multi-agent systems.
Spotlight Poster
Tim Vieira · Benjamin LeBrun · Mario Giulianelli · Juan Luis Gastaldi · Brian DuSell · John Terilla · Timothy O'Donnell · Ryan Cotterell

[ East Exhibition Hall A-B ]

Abstract
Modern language models are internally—and mathematically—distributions over *token* strings rather than *character* strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent processing are very sensitive to the specification of the prompt (e.g., whether the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that—even with a small computation budget—our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model's compression rate (bits/byte) is achieved.
Spotlight Poster
Linjian Meng · Wubing Chen · Wenbin Li · Tianpei Yang · Youzhi Zhang · Yang Gao

[ West Exhibition Hall B2-B3 ]

Abstract
Nash equilibrium (NE) plays an important role in game theory. How to efficiently compute an NE in NFGs is challenging due to its complexity and non-convex optimization property. Machine Learning (ML), the cornerstone of modern artificial intelligence, has demonstrated remarkable empirical performance across various applications including non-convex optimization. To leverage non-convex stochastic optimization techniques from ML for approximating an NE, various loss functions have been proposed. Among these, only one loss function is unbiased, allowing for unbiased estimation under the sampled play. Unfortunately, this loss function suffers from high variance, which degrades the convergence rate. To improve the convergence rate by mitigating the high variance associated with the existing unbiased loss function, we propose a novel surrogate loss function named Nash Advantage Loss (NAL). NAL is theoretically proved unbiased and exhibits significantly lower variance than the existing unbiased loss function. Experimental results demonstrate that the algorithm minimizing NAL achieves a significantly faster empirical convergence rates compared to other algorithms, while also reducing the variance of estimated loss value by several orders of magnitude.
Spotlight Poster
Utkarsh Saxena · Sayeh Sharify · Kaushik Roy · Xin Wang

[ East Exhibition Hall A-B ]

Abstract
Post-training quantization (PTQ) of large language models (LLMs) holds the promise in reducing the prohibitive computational cost at inference time. Quantization of all weight, activation and key-value (KV) cache tensors to 4-bit without significantly degrading generalizability is challenging, due to the high quantization error caused by extreme outliers in activations. To tackle this problem, we propose ResQ, a PTQ method that pushes further the state-of-the-art. By means of principal component analysis (PCA), it identifies a low-rank subspace (in practice 1/8 of the hidden dimension) in which activation variances are highest, and keep the coefficients within this subspace in high precision, e.g.~8-bit, while quantizing the rest to 4-bit. Within each subspace, invariant random rotation is applied to further suppress outliers. We show that this is a provably optimal mixed precision quantization scheme that minimizes error. With the Llama and Qwen2.5 families of models, we demonstrate that ResQ outperforms recent uniform and mixed precision PTQ methods on a variety of benchmarks, achieving up to 33\% lower perplexity on Wikitext than the next best method SpinQuant, and upto 3X speedup over 16-bit baseline. Anonymous code repository available at https://anonymous.4open.science/r/project-resq-2142.
Spotlight Poster
Zichen Liu · Guoji Fu · Chao Du · Wee Sun Lee · Min Lin

[ West Exhibition Hall B2-B3 ]

Abstract
Continual reinforcement learning (CRL) refers to a naturalistic setting where an agent needs to endlessly evolve, by trial and error, to solve multiple tasks that are presented sequentially. One of the largest obstacles to CRL is that the agent may forget how to solve previous tasks when learning a new task, known as catastrophic forgetting. In this paper, we propose to address this challenge by planning with online world models. Specifically, we learn a Follow-The-Leader shallow model online to capture the world dynamics, in which we plan using model predictive control to solve a set of tasks specified by any reward functions. The online world model is immune to forgetting by construction with a proven regret bound of $\mathcal{O}(\sqrt{K^2D\log(T)})$ under mild assumptions. The planner searches actions solely based on the latest online model, thus forming a FTL Online Agent (OA) that updates incrementally. To assess OA, we further design Continual Bench, a dedicated environment for CRL, and compare with several strong baselines under the same model-planning algorithmic framework. The empirical results show that OA learns continuously to solve new tasks while not forgetting old skills, outperforming agents built on deep world models with various continual learning techniques.
Spotlight Poster
Junjie Yao · zhongwang zhang · Zhi-Qin John Xu

[ East Exhibition Hall A-B ]

Abstract
Transformer-based Large Language Models (LLMs) have revolutionized Natural Language Processing by demonstrating exceptional performance across diverse tasks. This study investigates the impact of the parameter initialization scale on the training behavior and task preferences of LLMs. We discover that smaller initialization scales encourage models to favor reasoning tasks, whereas larger initialization scales lead to a preference for memorization tasks. We validate this reasoning bias via real datasets and meticulously designed anchor functions. Further analysis of initial training dynamics suggests that specific model components, particularly the embedding space and self-attention mechanisms, play pivotal roles in shaping these learning biases. We provide a theoretical framework from the perspective of model training dynamics to explain these phenomena. Additionally, experiments on real-world language tasks corroborate our theoretical insights. This work enhances our understanding of how initialization strategies influence LLM performance on reasoning tasks and offers valuable guidelines for training models.
Spotlight Poster
Xinyue Zheng · Haowei Lin · Kaichen He · Zihao Wang · Qiang Fu · Haobo Fu · Zilong Zheng · Yitao Liang

[ East Exhibition Hall A-B ]

Abstract
Developing AI agents capable of interacting with open-world environments to solve diverse tasks is a compelling challenge. However, evaluating such open-ended agents remains difficult, with current benchmarks facing scalability limitations. To address this, we introduce \textit{Minecraft Universe} (MCU), a comprehensive evaluation framework set within the open-world video game Minecraft. MCU incorporates three key components: (1) an expanding collection of 3,452 composable atomic tasks that encompasses 11 major categories and 41 subcategories of challenges; (2) a task composition mechanism capable of generating infinite diverse tasks with varying difficulty; and (3) a general evaluation framework that achieves 91.5\% alignment with human ratings for open-ended task assessment. Empirical results reveal that even state-of-the-art foundation agents struggle with the increasing diversity and complexity of tasks. These findings highlight the necessity of MCU as a robust benchmark to drive progress in AI agent development within open-ended environments. Our evaluation code and scripts are available at https://github.com/CraftJarvis/MCU.
Spotlight Poster
Tin Dizdarevic · Ravi Hammond · Tobias Gessler · Anisoara Calinescu · Jonathan Cook · Matteo Gallici · Andrei Lupu · Jakob Foerster

[ West Exhibition Hall B2-B3 ]

Abstract
Achieving seamless coordination between AI agents and humans is crucial for real-world applications, yet it remains a significant open challenge. Hanabi is a cooperative card game featuring imperfect information, constrained communication, theory of mind requirements, and coordinated action -- making it an ideal testbed for human-AI coordination. However, its use for human-AI interaction has been limited by the challenges of human evaluation. In this work, we introduce the Ad-Hoc Human-AI Coordination Challenge (AH2AC2) to overcome the constraints of costly and difficult-to-reproduce human evaluations. We develop \textit{human proxy agents} on a large-scale human dataset that serve as robust, cheap, and reproducible human-like evaluation partners in AH2AC2. To encourage the development of data-efficient methods, we open-source a dataset of 3,079 games, deliberately limiting the amount of available human gameplay data. We present baseline results for both two- and three- player Hanabi scenarios. To ensure fair evaluation, we host the proxy agents through a controlled evaluation system rather than releasing them publicly. The code is available at \href{https://github.com/FLAIROx/ah2ac2}{https://github.com/FLAIROx/ah2ac2}.
Spotlight Poster
jiayue Liu · Zhongchao Yi · Zhengyang Zhou · Qihe Huang · Kuo Yang · Xu Wang · Yang Wang

[ West Exhibition Hall B2-B3 ]

Abstract
Discovering regularities from spatiotemporal systems can benefit various scientific and social planning. Current spatiotemporal learners usually train an independent model from a specific source data that leads to limited transferability among sources, where even correlated tasks requires new design and training. The key towards increasing cross-domain knowledge is to enable collective intelligence and model evolution. In this paper, inspired by neuroscience theories, we theoretically derive the increased information boundary via learning cross-domain collective intelligence and propose a Synaptic EVOlutional spatiotemporal network, SynEVO, where SynEVO breaks the model independence and enables cross-domain knowledge to be shared and aggregated. Specifically, we first re-order the sample groups to imitate the human curriculum learning, and devise two complementary learners, elastic common container and task-independent extractor to allow model growth and task-wise commonality and personality disentanglement. Then an adaptive dynamic coupler with a new difference metric determines whether the new sample group should be incorporated into common container to achieve model evolution under various domains. Experiments show that SynEVO improves the generalization capacity by at most 42\% under cross-domain scenarios and SynEVO provides a paradigm of NeuroAI for knowledge transfer and adaptation.Code available at [https://github.com/Rodger-Lau/SynEVO](https://github.com/Rodger-Lau/SynEVO).
Spotlight Poster
James Rowbottom · Georg Maierhofer · Teo Deveney · Eike Müller · Alberto Paganini · Katharina Schratz · Pietro Lió · Carola-Bibiane Schönlieb · Chris Budd

[ East Exhibition Hall A-B ]

Abstract
We present a novel, and effective, approach to achieve optimal mesh relocation in finite element methods (FEMs). The cost and accuracy of FEMs is critically dependent on the choice of mesh points. Mesh relocation (r-adaptivity) seeks to optimise the mesh geometry to obtain the best solution accuracy at given computational budget. Classical r-adaptivity relies on the solution of a separate nonlinear ``meshing'' PDE to determine mesh point locations. This incurs significant cost at remeshing, and relies on estimates that relate interpolation- and FEM-error. Recent machine learning approaches have focused on the construction of fast surrogates for such classical methods. Instead, our new approach trains a graph neural network (GNN) to determine mesh point locations by directly minimising the FE solution error from the PDE system Firedrake to achieve higher solution accuracy. Our GNN architecture closely aligns the mesh solution space to that of classical meshing methodologies, thus replacing classical estimates for optimality with a learnable strategy. This allows for rapid and robust training and results in an extremely efficient and effective GNN approach to online r-adaptivity. Our method outperforms both classical, and prior ML, approaches to r-adaptive meshing. In particular, it achieves lower FE solution error, whilst retaining the significant …
Spotlight Poster
xihong yang · Siwei Wang · Fangdi Wang · Jiaqi Jin · Suyuan Liu · Yue Liu · En Zhu · Xinwang Liu · Yueming Jin

[ East Exhibition Hall A-B ]

Abstract
Leveraging the powerful representation learning capabilities, deep multi-view clustering methods have demonstrated reliable performance by effectively integrating multi-source information from diverse views in recent years. Most existing methods rely on the assumption of clean views. However, noise is pervasive in real-world scenarios, leading to a significant degradation in performance. To tackle this problem, we propose a novel multi-view clustering framework for the automatic identification and rectification of noisy data, termed AIRMVC. Specifically, we reformulate noisy identification as an anomaly identification problem using GMM. We then design a hybrid rectification strategy to mitigate the adverse effects of noisy data based on the identification results. Furthermore, we introduce a noise-robust contrastive mechanism to generate reliable representations. Additionally, we provide a theoretical proof demonstrating that these representations can discard noisy information, thereby improving the performance of downstream tasks. Extensive experiments on six benchmark datasets demonstrate that AIRMVC outperforms state-of-the-art algorithms in terms of robustness in noisy scenarios. The code of AIRMVC are available at https://github.com/xihongyang1999/AIRMVC on Github.
Spotlight Poster
Kaizhen Zhu · Mokai Pan · Yuexin Ma · Yanwei Fu · Jingyi Yu · Jingya Wang · Ye Shi

[ East Exhibition Hall A-B ]

Abstract
Recent advances in diffusion bridge models leverage Doob’s $h$-transform to establish fixed endpoints between distributions, demonstrating promising results in image translation and restoration tasks. However, these approaches frequently produce blurred or excessively smoothed image details and lack a comprehensive theoretical foundation to explain these shortcomings. To address these limitations, we propose UniDB, a unified framework for diffusion bridges based on Stochastic Optimal Control (SOC). UniDB formulates the problem through an SOC-based optimization and derives a closed-form solution for the optimal controller, thereby unifying and generalizing existing diffusion bridge models. We demonstrate that existing diffusion bridges employing Doob’s $h$-transform constitute a special case of our framework, emerging when the terminal penalty coefficient in the SOC cost function tends to infinity. By incorporating a tunable terminal penalty coefficient, UniDB achieves an optimal balance between control costs and terminal penalties, substantially improving detail preservation and output quality. Notably, UniDB seamlessly integrates with existing diffusion bridge models, requiring only minimal code modifications. Extensive experiments across diverse image restoration tasks validate the superiority and adaptability of the proposed framework. Our code is available at https://github.com/UniDB-SOC/UniDB/.
Spotlight Poster
Zhengyu Zhou · Weiwei Liu

[ West Exhibition Hall B2-B3 ]

Abstract
Continuous Normalizing Flows (CNFs) have proven to be a highly efficient technique for generative modeling of complex data since the introduction of Flow Matching (FM). The core of FM is to learn the constructed velocity fields of CNFs through deep least squares regression. Despite its empirical effectiveness, theoretical investigations of FM remain limited. In this paper, we present the first end-to-end error analysis of CNFs built upon FM. Our analysis shows that for general target distributions with bounded support, the generated distribution of FM is guaranteed to converge to the target distribution in the sense of the Wasserstein-2 distance. Furthermore, the convergence rate is significantly improved under an additional mild Lipschitz condition of the target score function.
Spotlight Poster
Junhao Dong · Piotr Koniusz · Yifei Zhang · Hao Zhu · Weiming Liu · Xinghua Qu · Yew Soon ONG

[ East Exhibition Hall A-B ]

Abstract
Vision-Language Models (VLMs) such as CLIP excel at zero-shot classification due to large-scale pre-training but are vulnerable to adversarial examples. Adversarial fine-tuning robustifies zero-shot models by aligning prediction scores of individual adversaries with their clean counterparts, which typically overlooks intermediate adversarial samples along the adversarial trajectory crossing the decision boundary. Such intermediate adversaries and their vicinity produce informative representations capturing the decision boundary in detail. They can be improved by sampling adversarial candidates from simplices formed by joining two consecutive vertices on the adversarial trajectory and their clean counterpart. However, sampling simplices for adversaries is very costly. To train robust VLM, we overcome these limitations by Taylor expansion and formulating an upper-bound of alignment loss that depends on the Jacobian/Hessian obtained at clean samples. As regions between clean and intermediate adversarial samples capture a larger decision landscape, we robustify VLM by plausible adversaries from simplices by our closed-form formulation equivalent to infinite uniform sampling of the simplex. We obtain state-of-the-art robustness across 15 datasets and diverse vision-language tasks.
Spotlight Poster
Guiomar Pescador-Barrios · Sarah Filippi · Mark van der Wilk

[ East Exhibition Hall A-B ]

Abstract
Many machine learning models require setting a parameter that controls their size before training, e.g. number of neurons in DNNs, or inducing points in GPs. Increasing capacity typically improves performance until all the information from the dataset is captured. After this point, computational cost keeps increasing, without improved performance. This leads to the question "How big is big enough?" We investigate this problem for Gaussian processes (single-layer neural networks) in continual learning. Here, data becomes available incrementally, and the final dataset size will therefore not be known before training, preventing the use of heuristics for setting a fixed model size. We develop a method to automatically adjust model size while maintaining near-optimal performance. Our experimental procedure follows the constraint that any hyperparameters must be set without seeing dataset properties, and we show that our method performs well across diverse datasets without the need to adjust its hyperparameter, showing it requires less tuning than others.
Spotlight Poster
Zheng Li · Zeyu Liu · Feng Xie · Hao Zhang · Chunchen LIU · zhi geng

[ East Exhibition Hall A-B ]

Abstract
We tackle the problem of identifying whether a variable is the cause of a specified target using observational data. State-of-the-art causal learning algorithms that handle latent variables typically rely on identifying the global causal structure, often represented as a partial ancestral graph (PAG), to infer causal relationships. Although effective, these approaches are often redundant and computationally expensive when the focus is limited to a specific causal relationship. In this work, we introduce novel local characterizations that are necessary and sufficient for various types of causal relationships between two variables, enabling us to bypass the need for global structure learning. Leveraging these local insights, we develop efficient and fully localized algorithms that accurately identify causal relationships from observational data. We theoretically demonstrate the soundness and completeness of our approach. Extensive experiments on benchmark networks and real-world datasets further validate the effectiveness and efficiency of our method.
Spotlight Poster
Abhra Chaudhuri · Anjan Dutta · Tu Bui · Serban Georgescu

[ West Exhibition Hall B2-B3 ]

Abstract
We aim to develop a fundamental understanding of modality collapse, a recently observed empirical phenomenon wherein models trained for multimodal fusion tend to rely only on a subset of the modalities, ignoring the rest. We show that modality collapse happens when noisy features from one modality are entangled, via a shared set of neurons in the fusion head, with predictive features from another, effectively masking out positive contributions from the predictive features of the former modality and leading to its collapse. We further prove that cross-modal knowledge distillation implicitly disentangles such representations by freeing up rank bottlenecks in the student encoder, denoising the fusion-head outputs without negatively impacting the predictive features from either modality. Based on the above findings, we propose an algorithm that prevents modality collapse through explicit basis reallocation, with applications in dealing with missing modalities. Extensive experiments on multiple multimodal benchmarks validate our theoretical claims. Project page: https://abhrac.github.io/mmcollapse/.
Spotlight Poster
Aditya Gorla · Ryan Wang · Zhengtong Liu · Ulzee An · Sriram Sankararaman

[ East Exhibition Hall A-B ]

Abstract
We present CACTI, a masked autoencoding approach for imputing tabular data that leverages the structure in missingness patterns and contextual information. Our approach employs a novel median truncated copy masking training strategy that encourages the model to learn from empirical patterns of missingness while incorporating semantic relationships between features — captured by column names and text descriptions — to better represent feature dependence. These dual sources of inductive bias enable CACTIto outperform state-of-the-art methods — an average $R^2$ gain of 7.8\% over the next best method (13.4%, 6.1%, and 5.3% under missing not at random, at random and completely at random, respectively) — across a diverse range of datasets and missingness conditions. Our results highlight the value of leveraging dataset-specific contextual information and missingness patterns to enhance imputation performance.
Spotlight Poster
Yeyun Chen

[ West Exhibition Hall B2-B3 ]

Abstract
Predicting the solvation free energy of molecules using graph neural networks holds significant potential for advancing drug discovery and the design of novel materials. While previous methods have demonstrated success on independent and identically distributed (IID) datasets, their performance in out-of-distribution (OOD) scenarios remains largely unexplored. We propose a novel Relational Invariant Learning framework (RILOOD) to enhance OOD generalization in solvation free energy prediction. RILOOD comprises three key components: (i) a mixup-based conditional modeling module that integrates diverse environments, (ii) a novel multi-granularity refinement strategy that extends beyond core substructures to enable context-aware representation learning for capturing multi-level interactions, and (iii) an invariant learning mechanism that identifies robust patterns generalizable to unseen environments. Extensive experiments demonstrate that RILOOD significantly outperforms state-of-the-art methods across various distribution shifts, highlighting its effectiveness in improving solvation free energy prediction under diverse conditions.
Spotlight Poster
Yingzhen Yang

[ East Exhibition Hall A-B ]

Abstract
Sharp generalization bound for neural networks trained by gradient descent (GD) is of central interest in statistical learning theory and deep learning. In this paper, we consider nonparametric regressionby an over-parameterized two-layer NN trained by GD. We show that, if the neural network is trained by GD with early stopping, then the trained network renders a sharp rate of the nonparametric regression risk of $O(\epsilon_n^2)$, which is the same rate as that for the classical kernel regression trained by GD with early stopping, where $\epsilon_n$ is the critical population rate of the Neural Tangent Kernel (NTK) associated with the network and $n$ is the size of the training data. It is remarked that our result does not require distributional assumptions on the covariate as long as the covariate lies on the unit sphere, in a strong contrast with many existing results which rely on specific distributions such as the spherical uniform data distribution or distributions satisfying certain restrictive conditions.As a special case of our general result, when the eigenvalues of the associated NTKdecay at a rate of $\lambda_j \asymp j^{-\frac{d}{d-1}}$ for $j \ge 1$ which happens under certain distributional assumption such as the training features follow the spherical uniform distribution, …
Spotlight Poster
Kirsten Morehouse · Siddharth Swaroop · Weiwei Pan

[ East Exhibition Hall A-B ]

Abstract
The proliferation of LLM bias probes introduces three challenges: we lack (1) principled criteria for selecting appropriate probes, (2) a system for reconciling conflicting results across probes, and (3) formal frameworks for reasoning about when and why experimental findings will generalize to real user behavior. In response, we propose a systematic approach to LLM social bias probing, drawing on insights from the social sciences. Central to this approach is EcoLevels—a novel framework that helps (a) identify appropriate bias probes (b) reconcile conflicting results, and (c) generate predictions about bias generalization. We ground our framework in the social sciences, as many LLM probes are adapted from human studies, and these fields have faced similar challenges when studying bias in humans. Finally, we outline five lessons that demonstrate how LLM bias probing can (and should) benefit from decades of social science research
Spotlight Poster
Marta Skreta · Tara Akhound-Sadegh · Viktor Ohanesian · Roberto Bondesan · Alan Aspuru-Guzik · Arnaud Doucet · Rob Brekelmans · Alexander Tong · Kirill Neklyudov

[ East Exhibition Hall A-B ]

Abstract
While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional `corrector' steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation.
Spotlight Poster
Lukas Braun · Erin Grant · Andrew Saxe

[ West Exhibition Hall B2-B3 ]

Abstract
A foundational principle of connectionism is that perception, action, and cognition emerge from parallel computations among simple, interconnected units that generate and rely on neural representations. Accordingly, researchers employ multivariate pattern analysis to decode and compare the neural codes of artificial and biological networks, aiming to uncover their functions. However, there is limited analytical understanding of how a network’s representation and function relate, despite this being essential to any quantitative notion of underlying function or functional similarity. We address this question using fully analysable two-layer linear networks and numerical simulations in nonlinear networks. We find that function and representation are dissociated, allowing representational similarity without functional similarity and vice versa. Further, we show that neither robustness to input noise nor the level of generalisation error constrain representations to the task. In contrast, networks robust to parameter noise have limited representational flexibility and must employ task-specific representations. Our findings suggest that representational alignment reflects computational advantages beyond functional alignment alone, with significant implications for interpreting and comparing the representations of connectionist systems
Spotlight Poster
Gaozheng Pei · Ke Ma · Yingfei Sun · Qianqian Xu · Qingming Huang

[ East Exhibition Hall A-B ]

Abstract
The diffusion-based adversarial purification methods attempt to drown adversarial perturbations into a part of isotropic noise through the forward process, and then recover the clean images through the reverse process. Due to the lack of distribution information about adversarial perturbations in the pixel domain, it is often unavoidable to damage normal semantics. We turn to the frequency domain perspective, decomposing the image into amplitude spectrum and phase spectrum. We find that for both spectra, the damage caused by adversarial perturbations tends to increase monotonically with frequency. This means that we can extract the content and structural information of the original clean sample from the frequency components that are less damaged. Meanwhile, theoretical analysis indicates that existing purification methods indiscriminately damage all frequency components, leading to excessive damage to the image. Therefore, we propose a purification method that can eliminate adversarial perturbations while maximizing the preservation of the content and structure of the original image. Specifically, at each time step during the reverse process, for the amplitude spectrum, we replace the low-frequency components of the estimated image's amplitude spectrum with the corresponding parts of the adversarial image.For the phase spectrum, we project the phase of the estimated image into a designated …
Spotlight Poster
Yuxuan Zhu · Antony Kellermann · Dylan Bowman · Philip Li · Akul Gupta · Adarsh Danda · Richard Fang · Conner Jensen · Eric Ihli · Jason Benn · Jet Geronimo · Avi Dhir · Sudhit Rao · Kaicheng Yu · Twm Stone · Daniel Kang

[ West Exhibition Hall B2-B3 ]

Abstract
Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture-the-Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized exper-tise to reproduce exploits and a systematic approach to evaluating unpredictable attacks. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our experiments show that the state-of-the-art agent framework can exploit up to 13% of the vulnerabilities.
Spotlight Poster
Andrew Wilson

[ East Exhibition Hall A-B ]

Abstract
Deep neural networks are often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behaviour include benign overfitting, double descent, and the success of overparametrization.This position paper argues that these phenomena are not distinct to neural networks, or particularly mysterious. Moreover, this generalization behaviour can be intuitively understood, and rigorously characterized, using long-standing generalization frameworks such as PAC-Bayes and countable hypothesis bounds. We present soft inductive biases as a key unifying principle in explaining these phenomena: rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem. However, we also highlight how deep learning is relatively distinct in other ways, such as its ability for representation learning, phenomena such as mode connectivity, and its relative universality.
Spotlight Poster
Michalis Titsias

[ East Exhibition Hall A-B ]

Abstract
Sparse variational Gaussian processes (GPs) construct tractable posterior approximations to GP models. At the core of these methods is the assumption that the true posterior distribution over training function values ${\bf f}$ and inducing variables ${\bf u}$ is approximated by a variational distribution that incorporates the conditional GP prior $p({\bf f} | {\bf u})$ in its factorization. While this assumption is considered as fundamental, we show that for model training we can relax it through the use of a more general variational distribution $q({\bf f} | {\bf u} )$ that depends on $N$ extra parameters, where $N$ is the number of training examples. In GP regression, we can analytically optimize the evidence lower bound over the extra parameters and express a tractable collapsed bound that is tighter than the previous bound. The new bound is also amenable to stochastic optimization and its implementation requires minor modifications to existing sparse GP code. Further, we also describe extensions to non-Gaussian likelihoods. On several datasets we demonstrate that our method can reduce bias when learning the hyperparameters and can lead to better predictive performance.
Spotlight Poster
Terje Mildner · Oliver Hamelijnck · Paris Giampouras · Theodoros Damoulas

[ East Exhibition Hall A-B ]

Abstract
We introduce FedGVI, a probabilistic Federated Learning (FL) framework that is robust to both prior and likelihood misspecification. FedGVI addresses limitations in both frequentist and Bayesian FL by providing unbiased predictions under model misspecification, with calibrated uncertainty quantification. Our approach generalises previous FL approaches, specifically Partitioned Variational Inference (Ashman et al., 2022), by allowing robust and conjugate updates, decreasing computational complexity at the clients. We offer theoretical analysis in terms of fixed-point convergence, optimality of the cavity distribution, and provable robustness to likelihood misspecification. Further, we empirically demonstrate the effectiveness of FedGVI in terms of improved robustness and predictive performance on multiple synthetic and real world classification data sets.
Spotlight Poster
Mark Vero · Niels Mündler · Viktor Chibotaru · Veselin Raychev · Maximilian Baader · Nikola Jovanović · Jingxuan He · Martin Vechev

[ East Exhibition Hall A-B ]

Abstract
Automatic program generation has long been a fundamental challenge in computer science. Recent benchmarks have shown that large language models (LLMs) can effectively generate code at the function level, make code edits, and solve algorithmic coding tasks. However, to achieve full automation, LLMs should be able to generate production-quality, self-contained application modules. To evaluate the capabilities of LLMs in solving this challenge, we introduce BaxBench, a novel evaluation benchmark consisting of 392 tasks for the generation of backend applications. We focus on backends for three critical reasons: (i) they are practically relevant, building the core components of most modern web and cloud software, (ii) they are difficult to get right, requiring multiple functions and files to achieve the desired functionality, and (iii) they are security-critical, as they are exposed to untrusted third-parties, making secure solutions that prevent deployment-time attacks an imperative. BaxBench validates the functionality of the generated applications with comprehensive test cases, and assesses their security exposure by executing end-to-end exploits. Our experiments reveal key limitations of current LLMs in both functionality and security: (i) even the best model, OpenAI o1, achieves a mere 62% on code correctness; (ii) on average, we could successfully execute security exploits on around …
Spotlight Poster
Yikang Chen · Dehui du

[ East Exhibition Hall A-B ]

Abstract
This paper investigates $\sim_{\mathcal{L}\_3}$-identifiability, a form of complete counterfactual identifiability within the Pearl Causal Hierarchy (PCH) framework, ensuring that all Structural Causal Models (SCMs) satisfying the given assumptions provide consistent answers to all causal questions. To simplify this problem, we introduce exogenous isomorphism and propose $\sim_{\mathrm{EI}}$-identifiability, reflecting the strength of model identifiability required for $\sim_{\mathcal{L}\_3}$-identifiability. We explore sufficient assumptions for achieving $\sim_{\mathrm{EI}}$-identifiability in two special classes of SCMs: Bijective SCMs (BSCMs), based on counterfactual transport, and Triangular Monotonic SCMs (TM-SCMs), which extend $\sim_{\mathcal{L}\_2}$-identifiability. Our results unify and generalize existing theories, providing theoretical guarantees for practical applications. Finally, we leverage neural TM-SCMs to address the consistency problem in counterfactual reasoning, with experiments validating both the effectiveness of our method and the correctness of the theory.
Spotlight Poster
Liang Yang · Yuwei Liu · Jiaming Zhuo · Di Jin · Chuan Wang · Zhen Wang · Xiaochun Cao

[ East Exhibition Hall A-B ]

Abstract
Brain network analysis plays a critical role in brain disease prediction and diagnosis. Graph mining tools have made remarkable progress. Graph neural networks (GNNs) and Transformers, which rely on the message-passing scheme, recently dominated this field due to their powerful expressive ability on graph data. Unfortunately, by considering brain network construction using pairwise Pearson’s coefficients between any pairs of ROIs, model analysis and experimental verification reveal that *the message-passing under both GNNs and Transformers can NOT be fully explored and exploited*. Surprisingly, this paper observes the significant performance and efficiency enhancements of the Hadamard product compared to the matrix product, which is the matrix form of message passing, in processing the brain network. Inspired by this finding, a novel Brain Quadratic Network (BQN) is proposed by incorporating quadratic networks, which possess better universal approximation properties. Moreover, theoretical analysis demonstrates that BQN implicitly performs community detection along with representation learning. Extensive evaluations verify the superiority of the proposed BQN compared to the message-passing-based brain network modeling. Source code is available at [https://github.com/LYWJUN/BQN-demo](https://github.com/LYWJUN/BQN-demo).
Spotlight Poster
Shanshan Luo · Yu yixuan · Chunchen LIU · Feng Xie · zhi geng

[ East Exhibition Hall A-B ]

Abstract
Previous studies have extensively addressed the attribution problem for binary outcome variables. However, in many practical scenarios, the outcome variable is continuous, and simply binarizing it may result in information loss or biased conclusions. To address this issue, we propose a series of posterior causal estimands for retrospectively evaluating multiple correlated causes from a continuous outcome. These estimands include posterior intervention effects, posterior total causal effects, and posterior natural direct effects. Under assumptions of sequential ignorability, monotonicity, and perfect positive rank, we show that the posterior causal estimands of interest are identifiable and present the corresponding identification equations. We also provide a simple but effective estimation procedure and establish asymptotic properties of the proposed estimators. An artificial hypertension example and a real developmental toxicity dataset are employed to illustrate our method.
Spotlight Poster
Qihe Huang · Zhengyang Zhou · Kuo Yang · Zhongchao Yi · Xu Wang · Yang Wang

[ West Exhibition Hall B2-B3 ]

Abstract
Long-term time series forecasting (LTSF) has traditionally relied on large parameters to capture extended temporal dependencies, resulting in substantial computational costs and inefficiencies in both memory usage and processing time. However, time series data, unlike high-dimensional images or text, often exhibit temporal pattern similarity and low-rank structures, especially in long-term horizons. By leveraging this structure, models can be guided to focus on more essential, concise temporal data, improving both accuracy and computational efficiency. In this paper, we introduce TimeBase, an ultra-lightweight network to harness the power of minimalism in LTSF. TimeBase 1) extracts core basis temporal components and 2) transforms traditional point-level forecasting into efficient segment-level forecasting, achieving optimal utilization of both data and parameters. Extensive experiments on diverse real-world datasets show that TimeBase achieves remarkable efficiency and secures competitive forecasting performance. Additionally, TimeBase can also serve as a very effective plug-and-play complexity reducer for any patch-based forecasting models. Code is available at \url{https://github.com/hqh0728/TimeBase}.
Spotlight Poster
Jan Schuchardt · Mina Dalirrooyfard · Jed Guzelkabaagac · Anderson Schneider · Yuriy Nevmyvaka · Stephan Günnemann

[ East Exhibition Hall A-B ]

Abstract
Many forms of sensitive data, such as web traffic, mobility data, or hospital occupancy, are inherently sequential. The standard method for training machine learning models while ensuring privacy for units of sensitive information, such as individual hospital visits, is differentially private stochastic gradient descent (DP-SGD). However, we observe in this work that the formal guarantees of DP-SGD are incompatible with time series specific tasks like forecasting, since they rely on the *privacy amplification* attained by training on small, unstructured batches sampled from an unstructured dataset. In contrast, batches for forecasting are generated by (1) sampling sequentially structured time series from a dataset, (2) sampling contiguous subsequences from these series, and (3) partitioning them into context and ground-truth forecast windows. We theoretically analyze the privacy amplification attained by this *structured subsampling* to enable the training of forecasting models with sound and tight event- and user-level privacy guarantees. Towards more private models, we additionally prove how data augmentation amplifies privacy in self-supervised training of sequence models. Our empirical evaluation demonstrates that amplification by structured subsampling enables the training of forecasting models with strong formal privacy guarantees.
Spotlight Poster
Han Zhong · Zikang Shan · Guhao Feng · Wei Xiong · Xinle Cheng · Li Zhao · Di He · Jiang Bian · Liwei Wang

[ East Exhibition Hall A-B ]

Abstract
In the classical Reinforcement Learning from Human Feedback (RLHF) framework, Proximal Policy Optimization (PPO) is employed to learn from sparse, sentence-level rewards---a challenging scenario in traditional deep reinforcement learning. Despite the great successes of PPO in the alignment of state-of-the-art closed-source large language models (LLMs), its open-source implementation is still largely sub-optimal, as widely reported by numerous research studies. To address these issues, we introduce a framework that models RLHF problems as a Markov decision process (MDP), enabling the capture of fine-grained token-wise information. Furthermore, we provide theoretical insights that demonstrate the superiority of our MDP framework over the previous sentence-level bandit formulation. Under this framework, we introduce an algorithm, dubbed as Reinforced Token Optimization (\texttt{RTO}), which learns the token-wise reward function from preference data and performs policy optimization based on this learned token-wise reward signal. Theoretically, \texttt{RTO} is proven to have the capability of finding the near-optimal policy sample-efficiently. For its practical implementation, \texttt{RTO} innovatively integrates Direct Preference Optimization (DPO) and PPO. DPO, originally derived from sparse sentence rewards, surprisingly provides us with a token-wise characterization of response quality, which is seamlessly incorporated into our subsequent PPO training stage. We conduct extensive experiments to evaluate \texttt{RTO} against PPO and …
Spotlight Poster
Alessandro Palma · Sergei Rybakov · Leon Hetzel · Stephan Günnemann · Fabian Theis

[ West Exhibition Hall B2-B3 ]

Abstract
Latent space interpolations are a powerful tool for navigating deep generative models in applied settings. An example is single-cell RNA sequencing, where existing methods model cellular state transitions as latent space interpolations with variational autoencoders, often assuming linear shifts and Euclidean geometry. However, unless explicitly enforced, linear interpolations in the latent space may not correspond to geodesic paths on the data manifold, limiting methods that assume Euclidean geometry in the data representations. We introduce FlatVI, a novel training framework that regularises the latent manifold of discrete-likelihood variational autoencoders towards Euclidean geometry, specifically tailored for modelling single-cell count data. By encouraging straight lines in the latent space to approximate geodesic interpolations on the decoded single-cell manifold, FlatVI enhances compatibility with downstream approaches that assume Euclidean latent geometry. Experiments on synthetic data support the theoretical soundness of our approach, while applications to time-resolved single-cell RNA sequencing data demonstrate improved trajectory reconstruction and manifold interpolation.
Spotlight Poster
Mina Dalirrooyfard · Konstantin Makarychev · Slobodan Mitrovic

[ West Exhibition Hall B2-B3 ]

Abstract
We present a new Correlation Clustering algorithm for a dynamic setting where nodes are added one at a time. In this model, proposed by Cohen-Addad, Lattanzi, Maggiori, and Parotsidis (ICML 2024), the algorithm uses database queries to access the input graph and updates the clustering as each new node is added.Our algorithm has the amortized update time of $\log^{O(1)}(n)$. Its approximation factor is $20+\varepsilon$, which is a substantial improvement over the approximation factor of the algorithm by Cohen-Addad et al. We complement our theoretical findings by empirically evaluating the approximation guarantee of our algorithm. The results show that it outperforms the algorithm by Cohen-Addad et al.~in practice.
Spotlight Poster
Daniel Shao · Richard Chen · Andrew Song · Joel Runevic · Ming Y. Lu · Tong Ding · Faisal Mahmood

[ West Exhibition Hall B2-B3 ]

Abstract
Multiple Instance Learning (MIL) is a cornerstone approach in computational pathology for distilling embeddings from gigapixel tissue images into patient-level representations to predict clinical outcomes. However, MIL is frequently challenged by the constraints of working with small, weakly-supervised clinical datasets. Unlike fields such as natural language processing and computer vision, which effectively use transfer learning to improve model quality in data-scarce environments, the transferability of MIL models remains largely unexplored. We conduct the first comprehensive investigation into transfer learning capabilities of pretrained MIL models, evaluating 11 MIL models across 19 pretraining tasks spanning tissue subtyping, cancer grading, and molecular subtype prediction. We observe a substantial performance boost with finetuning pretrained models over training from randomly initialized weights, even with domain differences between pretraining and target tasks. Pretraining on pan-cancer datasets enables consistent generalization across organs and task types compared to single-disease pretraining. Remarkably, this pan-cancer pretraining leads to better transfer than that of a state-of-the-art slide-level foundation model, while using only 6.5\% of the training data. These findings indicate that MIL architectures exhibit robust adaptability, offering insights into the benefits of leveraging pretrained models to enhance performance in computational pathology.
Spotlight Poster
Xianghua Zeng · Hang Su · Zhengyi Wang · Zhiyuan LIN

[ West Exhibition Hall B2-B3 ]

Abstract
Offline multi-agent reinforcement learning (MARL) struggles to estimate out-of-distribution states and actions due to the absence of real-time environmental feedback. While diffusion models show promise in addressing these challenges, their application primarily focuses on independently diffusing the historical trajectories of individual agents, neglecting crucial multi-agent coordination dynamics and reducing policy robustness in dynamic environments. In this paper, we propose MCGD, a novel Multi-agent Coordination framework based on Graph Diffusion models to improve the effectiveness and robustness of collaborative policies. Specifically, we begin by constructing a sparse coordination graph that includes continuous node attributes and discrete edge attributes to effectively identify the underlying dynamics of multi-agent interactions. Next, we derive transition probabilities between edge categories and present adaptive categorical diffusion to capture the structure diversity of multi-agent coordination. Leveraging this coordination structure, we define neighbor-dependent forward noise and develop anisotropic diffusion to enhance the action diversity of each agent. Extensive experiments across various multi-agent environments demonstrate that MCGD significantly outperforms existing state-of-the-art baselines in coordination performance and policy robustness in dynamic environments.
Spotlight Poster
Shuai Yi · Yixiong Zou · Yuhua Li · Ruixuan Li

[ East Exhibition Hall A-B ]

Abstract
Vision Transformer (ViT) has achieved remarkable success due to its large-scale pretraining on general domains, but it still faces challenges when applying it to downstream distant domains that have only scarce training data, which gives rise to the Cross-Domain Few-Shot Learning (CDFSL) task. Inspired by Self-Attention's insensitivity to token orders, we find an interesting phenomenon neglected in current works: disrupting the continuity of image tokens (i.e., making pixels not smoothly transited across patches) in ViT leads to a noticeable performance decline in the general (source) domain but only a marginal decrease in downstream target domains. This questions the role of image tokens' continuity in ViT's generalization under large domain gaps. In this paper, we delve into this phenomenon for an interpretation. We find continuity aids ViT in learning larger spatial patterns, which are harder to transfer than smaller ones, enlarging domain distances. Meanwhile, it implies that only smaller patterns within each patch could be transferred under extreme domain gaps. Based on this interpretation, we further propose a simple yet effective method for CDFSL that better disrupts the continuity of image tokens, encouraging the model to rely less on large patterns and more on smaller ones. Extensive experiments show the effectiveness …
Spotlight Poster
Chi-Ning Chou · Hang Le · Yichen Wang · SueYeon Chung

[ West Exhibition Hall B2-B3 ]

Abstract
Integrating task-relevant information into neural representations is a fundamental ability of both biological and artificial intelligence systems. Recent theories have categorized learning into two regimes: the rich regime, where neural networks actively learn task-relevant features, and the lazy regime, where networks behave like random feature models. Yet this simple lazy–rich dichotomy overlooks a diverse underlying taxonomy of feature learning, shaped by differences in learning algorithms, network architectures, and data properties. To address this gap, we introduce an analysis framework to study feature learning via the geometry of neural representations. Rather than inspecting individual learned features, we characterize how task-relevant representational manifolds evolve throughout the learning process. We show, in both theoretical and empirical settings, that as networks learn features, task-relevant manifolds untangle, with changes in manifold geometry revealing distinct learning stages and strategies beyond the lazy–rich dichotomy. This framework provides novel insights into feature learning across neuroscience and machine learning, shedding light on structural inductive biases in neural circuits and the mechanisms underlying out-of-distribution generalization.
Spotlight Poster
Tianwei Lin · Wenqiao Zhang · Sijing Li · Yuqian Yuan · Binhe Yu · Haoyuan Li · Wanggui He · Hao Jiang · Mengze Li · Song xiaohui · Siliang Tang · Jun Xiao · Hui Lin · Yueting Zhuang · Beng Chin Ooi

[ West Exhibition Hall B2-B3 ]

Abstract
We present **HealthGPT**, a powerful Medical Large Vision-Language Model (Med-LVLM) that integrates medical visual comprehension and generation capabilities within a unified autoregressive paradigm. Our bootstrapping philosophy is to progressively adapt heterogeneous comprehension and generation knowledge to pre-trained Large Language Models (LLMs). This is achieved through a novel heterogeneous low-rank adaptation **(H-LoRA)** technique, which is complemented by a tailored hierarchical visual perception **(HVP)** approach and a three-stage learning strategy **(TLS)**. To effectively learn the HealthGPT, we devise a comprehensive medical domain-specific comprehension and generation dataset called **VL-Health**. Experimental results demonstrate exceptional performance and scalability of HealthGPT in medical visual unified tasks. Our project can be accessed at https://github.com/DCDmllm/HealthGPT.
Spotlight Poster
Sunny Sanyal · Hayden Prairie · Rudrajit Das · Ali Kavis · Sujay Sanghavi

[ East Exhibition Hall A-B ]

Abstract
Fine-tuning a pre-trained model on a downstream task often degrades its original capabilities, a phenomenon known as "catastrophic forgetting". This is especially an issue when one does not have access to the data and recipe used to develop the pre-trained model. Under this constraint, most existing methods for mitigating forgetting are inapplicable. To address this challenge, we propose a *sample weighting scheme for the fine-tuning data* solely based on the pre-trained model's losses. Specifically, we upweight the easy samples on which the pre-trained model's loss is low and vice versa to limit the drift from the pre-trained model. Our approach is orthogonal and yet complementary to existing methods; while such methods mostly operate on parameter or gradient space, we concentrate on the sample space. We theoretically analyze the impact of fine-tuning with our method in a linear setting, showing that it stalls learning in a certain subspace, which inhibits overfitting to the target task. We empirically demonstrate the efficacy of our method on both language and vision tasks. As an example, when fine-tuning Gemma 2 2B on MetaMathQA, our method results in only a $0.8$% drop in accuracy on GSM8K (another math dataset) compared to standard fine-tuning, while preserving $5.4$% …
Spotlight Poster
Juan Correa · Elias Bareinboim

[ East Exhibition Hall A-B ]

Abstract
Graphical models have been widely used as parsimonious encoders of constraints of the underlying probability models. When organized in a structured way, these models can facilitate the derivation of non-trivial constraints, the inference of quantities of interest, and the optimization of their estimands. In particular, causal diagrams allow for the efficient representation of structural constraints of the underlying causal system. In this paper, we introduce an efficient graphical construction called Ancestral Multi-world Networks that is sound and complete for reading counterfactual independences from a causal diagram using d-separation. Moreover, we introduce the counterfactual (ctf-) calculus, which can be used to transform counterfactual quantities using three rules licensed by the constraints encoded in the diagram. This result generalizes Pearl’s celebrated do-calculus from interventional to counterfactual reasoning.
Spotlight Poster
Tuan Dam · Pascal Stenger · Lukas Schneider · Joni Pajarinen · Carlo D'Eramo · Odalric-Ambrym Maillard

[ West Exhibition Hall B2-B3 ]

Abstract
This paper introduces a novel backup strategy for Monte-Carlo Tree Search (MCTS) tailored for highly stochastic and partially observable Markov decision processes. We adopt a probabilistic approach, modeling both value and action-value nodes as Gaussian distributions, to introduce a novel backup operator that computes value nodes as the Wasserstein barycenter of their action-value children nodes; thus, propagating the uncertainty of the estimate across the tree to the root node. We study our novel backup operator when using a novel combination of $L^1$-Wasserstein barycenter with $\alpha$-divergence, by drawing a crucial connection to the generalized mean backup operator. We complement our probabilistic backup operator with two sampling strategies, based on optimistic selection and Thompson sampling, obtaining our Wasserstein MCTS algorithm. We provide theoretical guarantees of asymptotic convergence of $\mathcal{O}(n^{-1/2})$, with $n$ as the number of visited trajectories, to the optimal policy and an empirical evaluation on several stochastic and partially observable environments, where our approach outperforms well-known related baselines.
Spotlight Poster
Amrith Setlur · Nived Rajaraman · Sergey Levine · Aviral Kumar

[ East Exhibition Hall A-B ]

Abstract
Despite substantial advances in scaling test-time compute, an ongoing debate in the community is how it should be scaled up to enable continued and efficient improvements with scaling. There are largely two approaches: (i) distilling successful search or thinking traces; and (ii), using verification (e.g., 0/1 outcome rewards, or verifiers) to guide reinforcement learning (RL) and search algorithms. In this paper, we prove that finetuning LLMs with verifier-based (VB) methods based on RL or search is far superior to verifier-free (VF) approaches based on distilling or cloning search traces, given a fixed amount of compute/data budget. Further, we show that as we scale test-time compute (measured as the output token length) and training data, suboptimality of VF methods scales poorly compared to VB when the base pre-trained LLM presents a heterogeneous distribution over correct solution traces (e.g., different lengths, styles, etc.) and admits a non-sharp distribution over rewards on traces sampled from it. We formalize this condition using anti-concentration [Erdős 1945], implying a stronger result that VB methods scale better asymptotically, with the performance gap between VB and VF widening as test-time budget grows.We corroborate our theory empirically on didactic and math reasoning problems with 3/8/32B-sized pre-trained LLMs, where we …
Spotlight Poster
Xinyan Liang · Shijie Wang · Yuhua Qian · Qian Guo · Liang Du · Bingbing Jiang · Tingjin Luo · Feijiang Li

[ East Exhibition Hall A-B ]

Abstract
Multi-view classification (MVC) based on the Dempster-Shafer theory has gained significant recognition for its reliability in safety-critical applications. However, existing methods predominantly focus on providing confidence levels for decision outcomes without explaining the reasoning behind these decisions. Moreover, the reliance on first-order statistical magnitudes of belief masses often inadequately capture the intrinsic uncertainty within the evidence. To address these limitations, we propose a novel framework termed Trusted Multi-view Classification Constrained with Expert Knowledge (TMCEK). TMCEK integrates expert knowledge to enhance feature-level interpretability and introduces a distribution-aware subjective opinion mechanism to derive more reliable and realistic confidence estimates. The theoretical superiority of the proposed uncertainty measure over conventional approaches is rigorously established. Extensive experiments conducted on three multi-view datasets for sleep stage classification demonstrate that TMCEK achieves state-of-the-art performance while offering interpretability at both the feature and decision levels. These results position TMCEK as a robust and interpretable solution for MVC in safety-critical domains. The code is available at https://github.com/jie019/TMCEK_ICML2025.
Spotlight Poster
Lisha Chen · Quan Xiao · Ellen Fukuda · Xinyi Chen · Kun Yuan · Tianyi Chen

[ West Exhibition Hall B2-B3 ]

Abstract
Multi-objective learning under user-specified preference is common in real-world problems such as multi-lingual speech recognition under fairness. In this work, we frame such a problem as a semivectorial bilevel optimization problem, whose goal is to optimize a pre-defined preference function, subject to the constraint that the model parameters are weakly Pareto optimal. To solve this problem, we convert the multi-objective constraints to a single-objective constraint through a merit function with an easy-to-evaluate gradient, and then, we use a penalty-based reformulation of the bilevel optimization problem. We theoretically establish the properties of the merit function, and the relations of solutions for the penalty reformulation and the constrained formulation. Then we propose algorithms to solve the reformulated single-level problem, and establish its convergence guarantees. We test the method on various synthetic and real-world problems. The results demonstrate the effectiveness of the proposed method in finding preference-guided optimal solutions to the multi-objective problem.
Spotlight Poster
Seth Karten · Andy Nguyen · Chi Jin

[ West Exhibition Hall B2-B3 ]

Abstract
We introduce PokéChamp, a minimax agent powered by Large Language Models (LLMs) for Pokémon battles. Built on a general framework for two-player competitive games, PokéChamp leverages the generalist capabilities of LLMs to enhance minimax tree search. Specifically, LLMs replace three key modules: (1) player action sampling, (2) opponent modeling, and (3) value function estimation, enabling the agent to effectively utilize gameplay history and human knowledge to reduce the search space and address partial observability. Notably, our framework requires no additional LLM training. We evaluate PokéChamp in the popular Gen 9 OU format. When powered by GPT-4o, it achieves a win rate of 76\% against the best existing LLM-based bot and 84\% against the strongest rule-based bot, demonstrating its superior performance. Even with an open-source 8-billion-parameter Llama 3.1 model, PokéChamp consistently outperforms the previous best LLM-based bot, Pokéllmon powered by GPT-4o, with a 64\% win rate. PokéChamp attains a projected Elo of 1300-1500 on the Pokémon Showdown online ladder, placing it among the top 30\%-10\% of human players. In addition, this work compiles the largest real-player Pokémon battle dataset, featuring over 3 million games, including more than 500k high-Elo matches. Based on this dataset, we establish a series of battle benchmarks …
Spotlight Poster
Matteo Zecchin · Sangwoo Park · Osvaldo Simeone

[ East Exhibition Hall A-B ]

Abstract
We introduce adaptive learn-then-test (aLTT), an efficient hyperparameter selection procedure that provides finite-sample statistical guarantees on the population risk of AI models. Unlike the existing learn-then-test (LTT) technique, which relies on conventional p-value-based multiple hypothesis testing (MHT), aLTT implements sequential data-dependent MHT with early termination by leveraging e-processes. As a result, aLTT can reduce the number of testing rounds, making it particularly well-suited for scenarios in which testing is costly or presents safety risks. Apart from maintaining statistical validity, in applications such as online policy selection for offline reinforcement learning and prompt engineering, aLTT is shown to achieve the same performance as LTT while requiring only a fraction of the testing rounds.
Spotlight Poster
Xin Chen · Yarden As · Andreas Krause

[ East Exhibition Hall A-B ]

Abstract
Large language models (LLMs) have emerged as powerful tools but pose significant safety risks through harmful outputs and vulnerability to adversarial attacks. We propose SaP–short for Safety Polytope–a geometric approach to LLM safety, that learns and enforces multiple safety constraints directly in the model's representation space. We develop a framework that identifies safe and unsafe regions via the polytope's facets, enabling both detection and correction of unsafe outputs through geometric steering. Unlike existing approaches that modify model weights, SaP operates post-hoc in the representation space, preserving model capabilities while enforcing safety constraints. Experiments across multiple LLMs demonstrate that our method can effectively detect unethical inputs, reduce adversarial attack success rates while maintaining performance on standard tasks, thus highlighting the importance of having an explicit geometric model for safety. Analysis of the learned polytope facets reveals emergence of specialization in detecting different semantic notions of safety, providing interpretable insights into how safety is captured in LLMs' representation space.
Spotlight Poster
Weizhong Huang · Yuxin Zhang · Xiawu Zheng · Fei Chao · Rongrong Ji

[ East Exhibition Hall A-B ]

Abstract
In this paper, we address the challenge of determining the layer-wise sparsity rates of large language models (LLMs) through a theoretical perspective. Specifically, we identify a critical issue of **"reconstruction error explosion"** in existing LLMs sparsification methods. This refers to the cumulative effect of reconstruction errors throughout the sparsification process, where errors from earlier layers propagate and amplify in subsequent layers. As a result, the overall reconstruction error increases significantly, leading to a substantial degradation in model performance. Through theoretical analysis, we derive a simple yet effective approach to layer-wise sparsity allocation that mitigates this issue. Our method uses a monotonically increasing arithmetic progression, reducing the process of determining sparsity rates for multiple layers to the determination of a single common difference hyperparameter. Remarkably, this allows for the optimal layer-wise sparsity rates to be identified with just a few trials. Both our theoretical analysis and experimental results demonstrate that this sparsity allocation scheme is near optimal. Extensive experiments show that our method significantly improves the performance of sparse LLMs across various architectures, outperforming existing layer-wise sparsity methods. Furthermore, it enhances the performance of various compression techniques and is applicable to vision and multimodal models. Notably, our method achieves a reduction …
Spotlight Poster
Yizhuo Chen · Chun-Fu (Richard) Chen · Hsiang Hsu · Shaohan Hu · Tarek Abdelzaher

[ East Exhibition Hall A-B ]

Abstract
The growing Machine Learning (ML) services require extensive collections of user data, which may inadvertently include people's private information irrelevant to the services. Various studies have been proposed to protect private attributes by removing them from the data while maintaining the utilities of the data for downstream tasks. Nevertheless, as we theoretically and empirically show in the paper, these methods reveal severe vulnerability because of a common weakness rooted in their adversarial training based strategies. To overcome this limitation, we propose a novel approach, PASS, designed to stochastically substitute the original sample with another one according to certain probabilities, which is trained with a novel loss function soundly derived from information-theoretic objective defined for utility-preserving private attributes protection. The comprehensive evaluation of PASS on various datasets of different modalities, including facial images, human activity sensory signals, and voice recording datasets, substantiates PASS's effectiveness and generalizability.
Spotlight Poster
Zhen Liu · Yicheng Luo · Boyuan Li · Emadeldeen Eldele · Min Wu · Qianli Ma

[ East Exhibition Hall A-B ]

Abstract
Shapelets are discriminative subsequences (or shapes) with high interpretability in time series classification. Due to the time-intensive nature of shapelet discovery, existing shapelet-based methods mainly focus on selecting discriminative shapes while discarding others to achieve candidate subsequence sparsification. However, this approach may exclude beneficial shapes and overlook the varying contributions of shapelets to classification performance. To this end, we propose a Soft sparse Shapes (SoftShape) model for efficient time series classification. Our approach mainly introduces soft shape sparsification and soft shape learning blocks. The former transforms shapes into soft representations based on classification contribution scores, merging lower-scored ones into a single shape to retain and differentiate all subsequence information. The latter facilitates intra- and inter-shape temporal pattern learning, improving model efficiency by using sparsified soft shapes as inputs. Specifically, we employ a learnable router to activate a subset of class-specific expert networks for intra-shape pattern learning. Meanwhile, a shared expert network learns inter-shape patterns by converting sparsified shapes into sequences. Extensive experiments show that SoftShape outperforms state-of-the-art methods and produces interpretable results.
Spotlight Poster
Shaokun Zhang · Ming Yin · Jieyu Zhang · Jiale Liu · Zhiguang Han · Jingyang Zhang · Beibin Li · Chi Wang · Huazheng Wang · Yiran Chen · Qingyun Wu

[ East Exhibition Hall A-B ]

Abstract
Failure attribution in LLM multi-agent systems—identifying the agent and step responsible for task failures—provides crucial clues for systems debugging but remains underexplored and labor-intensive. In this paper, we propose and formulate a new research area: automated failure attribution for LLM multi-agent systems.To support this initiative, we introduce the Who\&When dataset, comprising extensive failure logs from 127 LLM multi-agent systems with fine-grained annotations linking failures to specific agents and decisive error steps.Using the Who\&When, we develop and evaluate three automated failure attribution methods, summarizing their corresponding pros and cons. The best method achieves 53.5\% accuracy in identifying failure-responsible agents but only 14.2\% in pinpointing failure steps, with some methods performing below random. Even SOTA reasoning models, such as OpenAI o1 and DeepSeek R1, fail to achieve practical usability. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available in https://github.com/mingyin1/Agents_Failure_Attribution.
Spotlight Poster
Xingxing Yang · Jie Chen · Zaifeng Yang

[ West Exhibition Hall B2-B3 ]

Abstract
This paper presents a novel approach to hyperspectral image (HSI) reconstruction from RGB images, addressing fundamental limitations in existing learning-based methods from a physical perspective. We discuss and aim to address the ``colorimetric dilemma": failure to consistently reproduce ground-truth RGB from predicted HSI, thereby compromising physical integrity and reliability in practical applications. To tackle this issue, we propose PhySpec, a physically consistent framework for robust HSI reconstruction. Our approach fundamentally exploits the intrinsic physical relationship between HSIs and corresponding RGBs by employing orthogonal subspace decomposition, which enables explicit estimation of camera spectral sensitivity (CSS). This ensures that our reconstructed spectra align with well-established physical principles, enhancing their reliability and fidelity. Moreover, to efficiently use internal information from test samples, we propose a self-supervised meta-auxiliary learning (MAXL) strategy that rapidly adapts the trained parameters to unseen samples using only a few gradient descent steps at test time, while simultaneously constraining the generated HSIs to accurately recover ground-truth RGB values. Thus, MAXL reinforces the physical integrity of the reconstruction process. Extensive qualitative and quantitative evaluations validate the efficacy of our proposed framework, showing superior performance compared to SOTA methods.
Spotlight Poster
Pedro Santos · Alberto Sardinha · Francisco S. Melo

[ West Exhibition Hall B2-B3 ]

Abstract
The general-utility Markov decision processes (GUMDPs) framework generalizes the MDPs framework by considering objective functions that depend on the frequency of visitation of state-action pairs induced by a given policy. In this work, we contribute with the first analysis on the impact of the number of trials, i.e., the number of randomly sampled trajectories, in infinite-horizon GUMDPs. We show that, as opposed to standard MDPs, the number of trials plays a key-role in infinite-horizon GUMDPs and the expected performance of a given policy depends, in general, on the number of trials. We consider both discounted and average GUMDPs, where the objective function depends, respectively, on discounted and average frequencies of visitation of state-action pairs. First, we study policy evaluation under discounted GUMDPs, proving lower and upper bounds on the mismatch between the finite and infinite trials formulations for GUMDPs. Second, we address average GUMDPs, studying how different classes of GUMDPs impact the mismatch between the finite and infinite trials formulations. Third, we provide a set of empirical results to support our claims, highlighting how the number of trajectories and the structure of the underlying GUMDP influence policy evaluation.
Spotlight Poster
Linxi Zhao · Yihe Deng · Weitong Zhang · Quanquan Gu

[ East Exhibition Hall A-B ]

Abstract
The advancement of Large Vision-Language Models (LVLMs) has increasingly highlighted the critical issue of their tendency to hallucinate non-existing objects in the images. To address this issue, previous works focused on using specially curated datasets or powerful LLMs to rectify the outputs of LVLMs. However, these approaches require either costly training or fine-tuning, or API access to proprietary LLMs for post-generation correction. In response to these limitations, we propose Mitigating hallucinAtion via image-gRounded guIdaNcE (MARINE), a framework that is both training-free and API-free. MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs. This is achieved by leveraging open-source vision models to extract object-level information, thereby enhancing the precision of LVLM-generated content. Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance. Through comprehensive evaluations across 5 popular LVLMs with diverse evaluation metrics and benchmarks, we demonstrate the effectiveness of MARINE, which even outperforms existing fine-tuning-based methods. Remarkably, it reduces hallucinations consistently in GPT-4V-assisted evaluation while maintaining the detailedness of LVLMs' generations. We release our code at https://github.com/Linxi-ZHAO/MARINE.
Spotlight Poster
Yiyang Fang · Jian Liang · Wenke Huang · He Li · Kehua Su · Mang Ye

[ East Exhibition Hall A-B ]

Abstract
Multimodal large language models (MLLMs) have achieved impressive progress in tasks such as visual question answering and visual understanding, but they still face significant challenges in emotional reasoning. Current methods to enhance emotional understanding typically rely on fine-tuning or manual annotations, which are resource-intensive and limit scalability. In this work, we focus on improving the ability of MLLMs to capture emotions during the inference phase. Specifically, MLLMs encounter two main issues: they struggle to distinguish between semantically similar emotions, leading to misclassification, and they are overwhelmed by redundant or irrelevant visual information, which distracts from key emotional cues. To address these, we propose Sharpening Emotion Perception in MLLMs (SEPM), which incorporates a Confidence-Guided Coarse-to-Fine Inference framework to refine emotion classification by guiding the model through simpler tasks. Additionally, SEPM employs Focus-on-Emotion Visual Augmentation to reduce visual redundancy by directing the attention of models to relevant emotional cues in images. Experimental results demonstrate that SEPM significantly improves MLLM performance on emotion-related tasks, providing a resource-efficient and scalable solution for emotion recognition.
Spotlight Poster
Kristina Nikolić · Luze Sun · Jie Zhang · Florian Tramer

[ East Exhibition Hall A-B ]

Abstract
Jailbreak attacks bypass the guardrails of large language models to produce harmful outputs.In this paper, we ask whether the model outputs produced by existing jailbreaks are actually *useful*. For example, when jailbreaking a model to give instructions for building a bomb, does the jailbreak yield good instructions?Since the utility of most unsafe answers (e.g., bomb instructions) is hard to evaluate rigorously, we build new jailbreak evaluation sets with known ground truth answers, by aligning models to refuse questions related to benign and easy-to-evaluate topics (e.g., biology or math).Our evaluation of eight representative jailbreaks across five utility benchmarks reveals a consistent drop in model utility in jailbroken responses, which we term the *jailbreak tax*. For example, while all jailbreaks we tested bypass guardrails in models aligned to refuse to answer math, this comes at the expense of a drop of up to 92% in accuracy.Overall, our work proposes jailbreak utility as a new important metric in AI safety, and introduces benchmarks to evaluate existing and future jailbreaks. We make the benchmark available at https://github.com/ethz-spylab/jailbreak-tax
Spotlight Poster
Jeonghye Kim · Yongjae Shin · Whiyoung Jung · Sunghoon Hong · Deunsol Yoon · Youngchul Sung · Kanghoon Lee · Woohyung Lim

[ West Exhibition Hall B2-B3 ]

Abstract
Reinforcement learning with offline data suffers from Q-value extrapolation errors. To address this issue, we first demonstrate that linear extrapolation of the Q-function beyond the data range is particularly problematic. To mitigate this, we propose guiding the gradual decrease of Q-values outside the data range, which is achieved through reward scaling with layer normalization (RS-LN) and a penalization mechanism for infeasible actions (PA). By combining RS-LN and PA, we develop a new algorithm called PARS. We evaluate PARS across a range of tasks, demonstrating superior performance compared to state-of-the-art algorithms in both offline training and online fine-tuning on the D4RL benchmark, with notable success in the challenging AntMaze Ultra task.
Spotlight Poster
Arnav Kumar Jain · Gonzalo Gonzalez-Pumariega · Wayne Chen · Alexander Rush · Wenting Zhao · Sanjiban Choudhury

[ East Exhibition Hall A-B ]

Abstract
We address the problem of code generation from multi-turn execution feedback. Existing methods either generate code without feedback or use complex, hierarchical reinforcement learning to optimize multi-turn rewards.We propose a simple yet scalable approach, $\mu$CODE, that solves multi-turn code generation using only single-step rewards.Our key insight is that code generation is a one-step recoverable MDP, where the correct code can be recovered from any intermediate code state in a single turn.$\mu$CODE iteratively trains both a generator to provide code solutions conditioned on multi-turn execution feedback and a verifier to score the newly generated code.Experimental evaluations show that our approach achieves significant improvements over state-of-the-art baselines. We provide analysis of the design choices of the reward models and policy, and show the efficacy of $\mu$CODE at utilizing the execution feedback.
Spotlight Poster
Max Hopkins · Shay Moran

[ West Exhibition Hall B2-B3 ]

Abstract
Stability is a central property in learning and statistics promising the output of an algorithm $\mathcal{A}$ does not change substantially when applied to similar datasets $S$ and $S'$. It is an elementary fact that any sufficiently stable algorithm (e.g.\ one returning the same result with high probability, satisfying privacy guarantees, etc.) must be randomized. This raises a natural question: can we quantify \textit{how much} randomness is needed for algorithmic stability? We study the randomness complexity of two influential notions of stability in learning: \textit{replicability} (which promises $\mathcal{A}$ usually outputs the same result when run over samples from the same distribution), and \textit{differential privacy} (which promises the output distribution of $\mathcal{A}$ remains similar under neighboring datasets). In particular, building on the ideas of (Dixon, Pavan, Vander Woude, and Vinodchandran ICML 2024) and (Cannone, Su, and Vadhan ITCS 2024), we prove a "weak-to-strong" boosting theorem for stability in these settings: the randomness complexity of a task $\mathcal{M}$ is tightly controlled by the best replication probability of any \textit{deterministic} algorithm solving $\mathcal{M}$, a parameter known as $\mathcal{M}$'s "global stability" (Chase, Moran, Yehudayoff FOCS 2023). Finally, we use this connection to characterize the randomness complexity of PAC Learning: a class has bounded randomness complexity …
Spotlight Poster
Yichuan Deng · Xiaoyu Li · Zhao Song · OMRI WEINSTEIN

[ West Exhibition Hall B2-B3 ]

Abstract
A recent work by [Larsen, SODA 2023] introduced a faster combinatorial alternative to Bansal's SDP algorithm for finding a coloring $x \in \\{-1, 1\\}^n$ that approximately minimizes the discrepancy $\mathrm{disc}(A, x) := \\| A x \\|_{\infty}$ of a real-valued $m \times n$ matrix $A$. Larsen's algorithm runs in $\widetilde{O}(mn^2)$ time compared to Bansal's $\widetilde{O}(mn^{4.5})$-time algorithm, with a slightly weaker logarithmic approximation ratio in terms of the hereditary discrepancy of $A$ [Bansal, FOCS 2010]. We present a combinatorial $\widetilde{O}(\mathrm{nnz}(A) + n^3)$-time algorithm with the same approximation guarantee as Larsen's, optimal for tall matrices where $m = \mathrm{poly}(n)$. Using a more intricate analysis and fast matrix multiplication, we further achieve a runtime of $\widetilde{O}(\mathrm{nnz}(A) + n^{2.53})$, breaking the cubic barrier for square matrices and surpassing the limitations of linear-programming approaches [Eldan and Singh, RS\&A 2018]. Our algorithm relies on two key ideas: (i) a new sketching technique for finding a projection matrix with a short $\ell_2$-basis using implicit leverage-score sampling, and (ii) a data structure for efficiently implementing the iterative Edge-Walk partial-coloring algorithm [Lovett and Meka, SICOMP 2015], and using an alternative analysis to enable ``lazy'' batch updates with low-rank corrections. Our results nearly close the computational gap between real-valued and binary …
Spotlight Poster
Greyson Brothers

[ East Exhibition Hall A-B ]

Abstract
We investigate the design of pooling methods used to summarize the outputs of transformer embedding models, primarily motivated by reinforcement learning and vision applications. This work considers problems where a subset of the input vectors contains requisite information for a downstream task (signal) while the rest are distractors (noise). By framing pooling as vector quantization with the goal of minimizing signal loss, we demonstrate that the standard methods used to aggregate transformer outputs, AvgPool, MaxPool, and ClsToken, are vulnerable to performance collapse as the signal-to-noise ratio (SNR) of inputs fluctuates. We then show that an attention-based *adaptive pooling* method can approximate the signal-optimal vector quantizer within derived error bounds for any SNR. Our theoretical results are first validated by supervised experiments on a synthetic dataset designed to isolate the SNR problem, then generalized to standard relational reasoning, multi-agent reinforcement learning, and vision benchmarks with noisy observations, where transformers with adaptive pooling display superior robustness across tasks.
Spotlight Poster
Zexu Sun · Qiyu Han · Hao Yang · Anpeng Wu · Minqin Zhu · Dugang Liu · Chen Ma · Yunpeng Weng · Xing Tang · xiuqiang He

[ West Exhibition Hall B2-B3 ]

Abstract
In online platforms, incentives (\textit{e.g}., discounts, coupons) are used to boost user engagement and revenue. Uplift modeling methods are developed to estimate user responses from observational data, often incorporating distribution balancing to address selection bias. However, these methods are limited by in-distribution testing data, which mirrors the training data distribution. In reality, user features change continuously due to time, geography, and other factors, especially on complex online marketing platforms. Thus, effective uplift modeling method for out-of-distribution data is crucial. To address this, we propose a novel uplift modeling method \textbf{I}nvariant \textbf{D}eep \textbf{U}plift \textbf{M}odeling, namely \textbf{IDUM}, which uses invariant learning to enhance out-of-distribution generalization by identifying causal factors that remain consistent across domains. IDUM further refines these features into necessary and sufficient factors and employs a masking component to reduce computational costs by selecting the most informative invariant features. A balancing discrepancy component is also introduced to mitigate selection bias in observational data. We conduct extensive experiments on public and real-world datasets to demonstrate IDUM's effectiveness in both in-distribution and out-of-distribution scenarios in online marketing. Furthermore, we also provide theoretical analysis and related proofs to support our IDUM's generalizability.
Spotlight Poster
Siyuan Duan · Wenyuan Wu · Peng Hu · Zhenwen Ren · Dezhong Peng · Yuan Sun

[ East Exhibition Hall A-B ]

Abstract
Physics-informed neural networks (PINNs) aim to constrain the outputs and gradients of deep learning models to satisfy specified governing physics equations, which have demonstrated significant potential for solving partial differential equations (PDEs). Although existing PINN methods have achieved pleasing performance, they always treat both easy and hard sample points indiscriminately, especially ones in the physical boundaries. This easily causes the PINN model to fall into undesirable local minima and unstable learning, thereby resulting in an Unbalanced Prediction Problem (UPP). To deal with this daunting problem, we propose a novel framework named Cognitive Physical Informed Neural Network (CoPINN) that imitates the human cognitive learning manner from easy to hard. Specifically, we first employ separable subnetworks to encode independent one-dimensional coordinates and apply an aggregation scheme to generate multi-dimensional predicted physical variables. Then, during the training phase, we dynamically evaluate the difficulty of each sample according to the gradient of the PDE residuals. Finally, we propose a cognitive training scheduler to progressively optimize the entire sampling regions from easy to hard, thereby embracing robustness and generalization against predicting physical boundary regions. Extensive experiments demonstrate that our CoPINN achieves state-of-the-art performance, particularly significantly reducing prediction errors in stubborn regions.
Spotlight Poster
Mahir Labib Dihan · Tanvir Hassan · Md Tanvir Parvez · Hasebul Hasan · Almash Alam · Muhammad Aamir Cheema · Mohammed Eunus Ali · Md Rizwan Parvez

[ West Exhibition Hall B2-B3 ]

Abstract
Recent advancements in foundation models have improved autonomous tool usage and reasoning, but their capabilities in map-based reasoning remain underexplored. To address this, we introduce MapEval, a benchmark designed to assess foundation models across three distinct tasks—textual, API-based, and visual reasoning—through 700 multiple-choice questions spanning 180 cities and 54 countries, covering spatial relationships, navigation, travel planning, and real-world map interactions. Unlike prior benchmarks that focus on simple location queries, MapEval requires models to handle long-context reasoning, API interactions and visual map analysis, making it the most comprehensive evaluation framework for geospatial AI. On evaluation of 30 foundation models, including Claude-3.5-Sonnet, GPT-4o, Gemini-1.5-Pro, none surpasses 67% accuracy, with open-source models performing significantly worse and all models lagging over 20% behind human performance. These results expose critical gaps in spatial inference, as models struggle with distances, directions, route planning, and place-specific reasoning, highlighting the need for better geospatial AI to bridge the gap between foundation models and real-world navigation.
Spotlight Poster
Dongjae Jeon · Dueun Kim · Albert No

[ East Exhibition Hall A-B ]

Abstract
In this paper, we introduce a geometric framework to analyze memorization in diffusion models through the sharpness of the log probability density. We mathematically justify a previously proposed score-difference-based memorization metric by demonstrating its effectiveness in quantifying sharpness. Additionally, we propose a novel memorization metric that captures sharpness at the initial stage of image generation in latent diffusion models, offering early insights into potential memorization. Leveraging this metric, we develop a mitigation strategy that optimizes the initial noise of the generation process using a sharpness-aware regularization term.
Spotlight Poster
Shira Vansover-Hager · Tomer Koren · Roi Livni

[ West Exhibition Hall B2-B3 ]

Abstract
We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $\Theta(1/\sqrt{n})$ excess population loss given a sample of size $n$, much less is understood about the multi-pass version of the algorithm which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size $\eta = \Theta(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to population loss as large as $\Omega(1)$ after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order $\Theta(1/(\eta T) + \eta \sqrt{T})$, where $T$ is the total number of steps. These results reveal a certain phase-transition in the out-of-sample behavior of SGD after the first epoch, as well as a sharp separation between the rates of overfitting in the smooth and non-smooth cases of SCO. Additionally, we extend our results to with-replacement SGD, proving that the same asymptotic bounds hold after $O(n \log n)$ steps. …
Spotlight Poster
Kaiyu Yang · Gabriel Poesia · Jingxuan He · Wenda Li · Kristin Lauter · Swarat Chaudhuri · Dawn Song

[ East Exhibition Hall A-B ]

Abstract
AI for Mathematics (AI4Math) is intellectually intriguing and is crucial for AI-driven system design and verification. Extensive efforts on AI4Math have mirrored techniques in NLP, in particular, training large language models on carefully curated math datasets in text form. As a complementary yet less explored avenue, formal mathematical reasoning is grounded in formal systems such as proof assistants, which can verify the correctness of reasoning and provide automatic feedback. This position paper advocates formal mathematical reasoning as an indispensable component in future AI for math, formal verification, and verifiable generation. We summarize existing progress, discuss open challenges, and envision critical milestones to measure future success.
Spotlight Poster
Hyeongwon Jang · Changhun Kim · Eunho Yang

[ West Exhibition Hall B2-B3 ]

Abstract
Recent explainable artificial intelligence (XAI) methods for time series primarily estimate point-wise attribution magnitudes, while overlooking the directional impact on predictions, leading to suboptimal identification of significant points. Our analysis shows that conventional Integrated Gradients (IG) effectively capture critical points with both positive and negative impacts on predictions. However, current evaluation metrics fail to assess this capability, as they inadvertently cancel out opposing feature contributions. To address this limitation, we propose novel evaluation metrics—Cumulative Prediction Difference (CPD) and Cumulative Prediction Preservation(CPP)—to systematically assess whether attribution methods accurately identify significant positive and negative points in time series XAI. Under these metrics, conventional IG outperforms recent counterparts. However, directly applying IG to time series data may lead to suboptimal outcomes, as generated paths ignore temporal relationships and introduce out-of-distribution samples. To overcome these challenges, we introduce TIMING, which enhances IG by incorporating temporal awareness while maintaining its theoretical properties. Extensive experiments on synthetic and real-world time series benchmarks demonstrate that TIMING outperforms existing timeseries XAI baselines. Our code is available at https://github.com/drumpt/TIMING.
Spotlight Poster
Avery Ma · Yangchen Pan · Amir-massoud Farahmand

[ East Exhibition Hall A-B ]

Abstract
Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges between the user and the model. These exchanges are randomly sampled from a pool of unsafe question-answer pairs, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt's topic. We also introduce ManyHarm, a dataset of harmful question–answer pairs, and demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
Spotlight Poster
Amber Xie · Oleh Rybkin · Dorsa Sadigh · Chelsea Finn

[ West Exhibition Hall B2-B3 ]

Abstract
Recent progress in imitation learning has been enabled by policy architectures that scale to complex visuomotor tasks, multimodal distributions, and large datasets. However, these methods often rely on learning from large amount of expert demonstrations.To address these shortcomings, we propose Latent Diffusion Planning (LDP), a modular approach consisting of a planner which can leverage action-free demonstrations, and an inverse dynamics model which can leverage suboptimal data, that both operate over a learned latent space. First, we learn a compact latent space through a variational autoencoder, enabling effective forecasting of future states in image-based domains. Then, we train a planner and an inverse dynamics model with diffusion objectives. By separating planning from action prediction, LDP can benefit from the denser supervision signals of suboptimal and action-free data.On simulated visual robotic manipulation tasks, LDP outperforms state-of-the-art imitation learning approaches, as they cannot leverage such additional data.