Track: Poster Session 5 West

Poster

#W-100

Disentangling and Integrating Relational and Sensory Information in Transformer Architectures

Awni Altabaa · John Lafferty

Relational reasoning is a central component of generally intelligent systems, enabling robust and data-efficient inductive generalization. Recent empirical evidence shows that many existing neural architectures, including Transformers, struggle with tasks requiring relational reasoning. In this work, we distinguish between two types of information: sensory information about the properties of individual objects, and relational information about the relationships between objects. While neural attention provides a powerful mechanism for controlling the flow of sensory information between objects, the Transformer lacks an explicit computational mechanism for routing and processing relational information. To address this limitation, we propose an architectural extension of the Transformer framework that we call the Dual Attention Transformer (DAT), featuring two distinct attention mechanisms: sensory attention for directing the flow of sensory information, and a novel relational attention mechanism for directing the flow of relational information. We empirically evaluate DAT on a diverse set of tasks ranging from synthetic relational benchmarks to complex real-world tasks such as language modeling and visual processing. Our results demonstrate that integrating explicit relational computational mechanisms into the Transformer architecture leads to significant performance gains in terms of data efficiency and parameter efficiency.

Poster

#W-1005

Predicting mutational effects on protein binding from folding energy

Arthur Deng · Karsten Householder · Fang Wu · K. Garcia · Brian Trippe

Accurate estimation of mutational effects on protein-protein binding energies is an open problem with applications in structural biology and therapeutic design. Several deep learning predictors for this task have been proposed but, presumably due to the scarcity of binding data, these methods under-perform computationally expensive estimates based on empirical force-fields. In response, we propose a transfer-learning approach that leverages advances in protein sequence modeling and folding stability prediction for this task. The key idea is to parameterize the binding energy as the difference between the folding energy of the protein complex and the sum of the folding energies of its binding partners. We show that using a pre-trained inverse-folding model as a proxy for folding energy provides strong zero-shot performance, and can be fine-tuned with (1) copious folding energy measurements and (2) more limited binding energy measurements.The resulting predictor, StaB-ddG, is the first deep learning predictor to match the accuracy of the state-of-the-art empirical force-field method Flex ddG, while offering an over 10,000x speed-up.

Poster

#W-1006

SpikF: Spiking Fourier Network for Efficient Long-term Prediction

Wenjie Wu · Dexuan Huo · Hong Chen

Spiking Neural Networks (SNNs) have demonstrated remarkable potential across many domains, including computer vision and natural language processing, owing to their energy efficiency and biological plausibility. However, their application in long-term prediction tasks remains underexplored, which is primarily due to two critical challenges: (1) current SNN encoding methods are unable to effectively encode long temporal information, leading to increased computational complexity and energy consumption; (2) though Transformer-based models have achieved state-of-the-art accuracy in temporal prediction tasks, the absence of proper positional encoding for spiking self-attention restricts Spiking Transformer from effectively utilizing positional information, resulting in performance degradation. To address these challenges, we introduce an attention-free framework, **Spik**ing **F**ourier Network (**SpikF**), that encodes input sequences in patches and employs an innovative frequency domain selection mechanism to effectively utilize the sequential properties of time-series data. Extensive evaluations on eight well-established long-term prediction datasets demonstrate that SpikF achieves an averaged $1.9\\%$ reduction in Mean Absolute Error (MAE) compared to state-of-the-art models, while lowering total energy consumption by $3.16\times$. Our code is available at https://github.com/WWJ-creator/SpikF.

Poster

#W-1007

Local Pan-privacy for Federated Analytics

Vitaly Feldman · Audra McMillan · Guy Rothblum · Kunal Talwar

Pan-privacy was proposed by Dwork et al. (2010) as an approach to designing a private analytics system that retains its privacy properties in the face of intrusions that expose the system's internal state. Motivated by Federated telemetry applications, we study {\em local pan-privacy}, where privacy should be retained under repeated unannounced intrusions {\em on the local state}. We consider the problem of monitoring the count of an event in a federated system, where event occurrences on a local device should be hidden even from an intruder on that device. We show that under reasonable constraints, the goal of providing information-theoretic differential privacy under intrusion is incompatible with collecting telemetry information. We then show that this problem can be solved in a scalable way using standard cryptographic primitives.

Poster

#W-1008

The Lock-in Hypothesis: Stagnation by Algorithm

Tianyi Qiu · Zhonghao He · Tejasveer Chugh · Max Kleiman-Weiner

The training and deployment of large language models (LLMs) create a feedback loop with human users: models learn human beliefs from data, reinforce these beliefs with generated content, reabsorb the reinforced beliefs, and feed them back to users again and again. This dynamic resembles an echo chamber.We hypothesize that this feedback loop entrenches the existing values and beliefs of users, leading to a loss of diversity in human ideas and potentially the lock-in of false beliefs.We formalize this hypothesis and test it empirically with agent-based LLM simulations and real-world GPT usage data. Analysis reveals sudden but sustained drops in diversity after the release of new GPT iterations, consistent with the hypothesized human-AI feedback loop.Website: https://thelockinhypothesis.com

Poster

#W-1009

EgoPrivacy: What Your First-Person Camera Says About You?

Yijiang Li · Genpei Zhang · Jiacheng Cheng · Yi Li · Xiaojun Shan · Dashan Gao · Jiancheng Lyu · Yuan Li · Ning Bi · Nuno Vasconcelos

While the rapid proliferation of wearable cameras has raised significant concerns about egocentric video privacy, prior work has largely overlooked the unique privacy threats posed to the camera wearer. This work investigates the core question: How much privacy information about the camera wearer can be inferred from their first-person view videos? We introduce EgoPrivacy, the first large-scale benchmark for the comprehensive evaluation of privacy risks in egocentric vision. EgoPrivacy covers three types of privacy (demographic, individual, and situational), defining seven tasks that aim to recover private information ranging from fine-grained (e.g., wearer's identity) to coarse-grained (e.g., age group). To further emphasize the privacy threats inherent to egocentric vision, we propose Retrieval-Augmented Attack, a novel attack strategy that leverages ego-to-exo retrieval from an external pool of exocentric videos to boost the effectiveness of demographic privacy attacks. An extensive comparison of the different attacks possible under all threat models is presented, showing that private information of the wearer is highly susceptible to leakage. For instance, our findings indicate that foundation models can effectively compromise wearer privacy even in zero-shot settings by recovering attributes such as identity, scene, gender, and race with 70–80% accuracy. Our code and data are available at https://github.com/williamium3000/ego-privacy.

Poster

#W-101

EnIGMA: Interactive Tools Substantially Assist LM Agents in Finding Security Vulnerabilities

Talor Abramovich · Meet Udeshi · Minghao Shao · Kilian Lieret · Haoran Xi · Kimberly Milner · Sofija Jancheska · John Yang · Carlos Jimenez · Farshad Khorrami · Prashanth Krishnamurthy · Brendan Dolan-Gavitt · Muhammad Shafique · Karthik Narasimhan · Ramesh Karri · Ofir Press

Although language model (LM) agents have demonstrated increased performance in multiple domains, including coding and web-browsing, their success in cybersecurity has been limited. We present EnIGMA, an LM agent for autonomously solving Capture The Flag (CTF) challenges. We introduce new tools and interfaces to improve the agent's ability to find and exploit security vulnerabilities, focusing on interactive terminal programs. These novel Interactive Agent Tools enable LM agents, for the first time, to run interactive utilities, such as a debugger and a server connection tool, which are essential for solving these challenges.Empirical analysis on 390 CTF challenges across four benchmarks demonstrate that these new tools and interfaces substantially improve our agent's performance, achieving state-of-the-art results on NYU CTF, Intercode-CTF, and CyBench. Finally, we analyze data leakage, developing new methods to quantify it and identifying a new phenomenon we term soliloquizing, where the model self-generates hallucinated observations without interacting with the environment.

Poster

#W-1010

Improving Generalization with Flat Hilbert Bayesian Inference

Tuan Truong · Quyen Tran · Ngoc Quan Pham · Nhat Ho · Dinh Phung · Trung Le

We introduce Flat Hilbert Bayesian Inference (FHBI), an algorithm designed to enhance generalization in Bayesian inference. Our approach involves an iterative two-step procedure with an adversarial functional perturbation step and a functional descent step within the reproducing kernel Hilbert spaces. This methodology is supported by a theoretical analysis that extends previous findings on generalization ability from finite-dimensional Euclidean spaces to infinite-dimensional functional spaces. To evaluate the effectiveness of FHBI, we conduct comprehensive comparisons against nine baseline methods on the VTAB-1K benchmark, which encompasses 19 diverse datasets across various domains with diverse semantics. Empirical results demonstrate that FHBI consistently outperforms the baselines by notable margins, highlighting its practical efficacy.

Poster

#W-1011

Ranked from Within: Ranking Large Multimodal Models Without Labels

Weijie Tu · Weijian Deng · Dylan Campbell · Yu Yao · Jiyang Zheng · Tom Gedeon · Tongliang Liu

Can the relative performance of a pre-trained large multimodal model (LMM) be predicted without access to labels? As LMMs proliferate, it becomes increasingly important to develop efficient ways to choose between them when faced with new data or tasks. The usual approach does the equivalent of giving the models an exam and marking them. We opt to avoid marking and the associated labor of determining the ground-truth answers. Instead, we explore other signals elicited and ascertain how well the models know their own limits, evaluating the effectiveness of these signals at unsupervised model ranking. We evaluate 47 state-of-the-art LMMs (e.g., LLaVA) across 9 visual question answering benchmarks, analyzing how well uncertainty-based metrics can predict relative model performance. Our findings show that uncertainty scores derived from softmax distributions provide a robust and consistent basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.

Poster

#W-1012

Eliciting Language Model Behaviors with Investigator Agents

Xiang Li · Neil Chowdhury · Daniel Johnson · Tatsunori Hashimoto · Percy Liang · Sarah Schwettmann · Jacob Steinhardt

Language models exhibit complex, diverse behaviors when prompted with free-form text, making it hard to characterize the space of possible outputs. We study the problem of behavioral elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations, harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train amortized investigator models to emulate the posterior distribution over the prompts, conditioned on the target behavior. Specifically, we first fit a reverse model and then use reinforcement learning to optimize likelihood of generating the target behavior. To improve the diversity of the prompt distribution, we further propose a novel iterative training objective based on the Frank-Wolfe algorithm that encourages each iteration to discover different sets of prompts not captured by previous iterations. Our investigator models produce prompts that exhibit a variety of effective and human-interpretable strategies for behavior elicitation, obtaining a 100% attack success rate on AdvBench (Harmful Behaviors) and an 85% hallucination rate.

Poster

#W-1013

Monte Carlo Tree Search for Comprehensive Exploration in LLM-Based Automatic Heuristic Design

Zhi Zheng · Zhuoliang Xie · Zhenkun Wang · Bryan Hooi

Handcrafting heuristics for solving complex optimization tasks (e.g., route planning and task allocation) is a common practice but requires extensive domain knowledge. Recently, Large Language Model (LLM)-based automatic heuristic design (AHD) methods have shown promise in generating high-quality heuristics without manual interventions. Existing LLM-based AHD methods employ a population to maintain a fixed number of top-performing LLM-generated heuristics and introduce evolutionary computation (EC) to iteratively enhance the population. However, these population-based procedures cannot fully develop the potential of each heuristic and are prone to converge into local optima. To more comprehensively explore the space of heuristics, this paper proposes to use Monte Carlo Tree Search (MCTS) for LLM-based heuristic evolution. The proposed MCTS-AHD method organizes all LLM-generated heuristics in a tree structure and can better develop the potential of temporarily underperforming heuristics. In experiments, MCTS-AHD delivers significantly higher-quality heuristics on various complex tasks. Our code is available.

Poster

#W-1014

Exponential Family Variational Flow Matching for Tabular Data Generation

Andres Guzman Cordero · Floor Eijkelboom · Jan-Willem van de Meent

While denoising diffusion and flow matching have driven major advances in generative modeling, their application to tabular data remains limited, despite its ubiquity in real-world applications. To this end, we develop TabbyFlow, a variational Flow Matching (VFM) method for tabular data generation. To apply VFM to data with mixed continuous and discrete features, we introduce Exponential Family Variational Flow Matching (EF-VFM), which represents heterogeneous data types using a general exponential family distribution. We hereby obtain an efficient, data-driven objective based on moment matching, enabling principled learning of probability paths over mixed continuous and discrete variables. We also establish a connection between variational flow matching and generalized flow matching objectives based on Bregman divergences. Evaluation on tabular data benchmarks demonstrates state-of-the-art performance compared to baselines.

Poster

#W-1015

Quantum Speedup for Hypergraph Sparsification

Chenghua Liu · Minbo Gao · Zhengfeng Ji · Ying

Graph sparsification serves as a foundation for many algorithms, such as approximation algorithms for graph cuts and Laplacian system solvers. As its natural generalization, hypergraph sparsification has recently gained increasing attention, with broad applications in graph machine learning and other areas. In this work, we propose the first quantum algorithm for hypergraph sparsification, addressing an open problem proposed by Apers and de Wolf (FOCS'20). For a weighted hypergraph with $n$ vertices, $m$ hyperedges, and rank $r$, our algorithm outputs a near-linear size $\varepsilon$-spectral sparsifier in time $\widetilde O(r\sqrt{mn}/\varepsilon)$. This algorithm matches the quantum lower bound for constant $r$ and demonstrates quantum speedup when compared with the state-of-the-art $\widetilde O(mr)$-time classical algorithm. As applications, our algorithm implies quantum speedups for computing hypergraph cut sparsifiers, approximating hypergraph mincuts and hypergraph $s$-$t$ mincuts.

Poster

#W-1016

Phase transitions for the existence of unregularized M-estimators in single index models

Takuya Koriyama · Pierre C Bellec

This paper studies phase transitions for the existence of unregularized M-estimators under proportional asymptotics where the sample size $n$ and feature dimension $p$ grow proportionally with $n/p \to \delta \in (1, \infty)$. We study the existence of M-estimators in single-index models where the response $y_i$ depends on covariates $x_i \sim N(0, I_p)$ through an unknown index ${w} \in \mathbb{R}^p$ and an unknown link function. An explicit expression is derived for the critical threshold $\delta_\infty$ that determines the phase transition for the existence of the M-estimator, generalizing the results of Candés & Sur (2020) for binary logistic regression to other single-index models.Furthermore, we investigate the existence of a solution to the nonlinear system of equations governing the asymptotic behavior of the M-estimator when it exists. The existence of solution to this system for $\delta > \delta_\infty$ remains largely unproven outside the global null in binary logistic regression. We address this gap with a proof that the system admits a solution if and only if $\delta > \delta_\infty$, providing a comprehensive theoretical foundation for proportional asymptotic results that require as a prerequisite the existence of a solution to the system.

Poster

#W-1017

Learning With Multi-Group Guarantees For Clusterable Subpopulations

Jessica Dai · Nika Haghtalab · Eric Zhao

A canonical desideratum for prediction problems is that performance guarantees should hold not just on average over the population, but also for meaningful subpopulations within the overall population. But what constitutes a meaningful subpopulation? In this work, we take the perspective that relevant subpopulations should be defined with respect to the clusters that naturally emerge from the distribution of individuals for which predictions are being made. In this view, a population refers to a mixture model whose components constitute the relevant subpopulations. We suggest two formalisms for capturing per-subgroup guarantees: first, by attributing each individual to the component from which they were most likely drawn, given their features; and second, by attributing each individual to all components in proportion to their relative likelihood of having been drawn from each component. Using online calibration as a case study, we study a multi-objective algorithm that provides guarantees for each of these formalisms by handling all plausible underlying subpopulation structures simultaneously, and achieve an $O(T^{1/2})$ rate even when the subpopulations are not well-separated. In comparison, the more natural cluster-then-predict approach that first recovers the structure of the subpopulations and then makes predictions suffers from a $O(T^{2/3})$ rate and requires the subpopulations to be separable. Along the way, we prove that providing per-subgroup calibration guarantees for underlying clusters can be easier than learning the clusters: separation between median subgroup features is required for the latter but not the former.

Poster

#W-1018

Minimax Optimal Regret Bound for Reinforcement Learning with Trajectory Feedback

Zihan Zhang · Yuxin Chen · Jason Lee · Simon Du · Ruosong Wang

In this work, we study reinforcement learning (RL) with trajectory feedback. Compared to the standard RL setting, in RL with trajectory feedback, the agent only observes the accumulative reward along the trajectory, and therefore, this model is particularly suitable for scenarios where querying the reward in each single step incurs prohibitive cost. For a finite-horizon Markov Decision Process (MDP) with $S$ states, $A$ actions and a horizon length of $H$, we develop an algorithm that enjoys an asymptotically nearly optimal regret of $\tilde{O}\left(\sqrt{SAH^3K}\right)$ in $K$ episodes.To achieve this result, our new technical ingredients include(i) constructing a tighter confidence region for the reward function by incorporating the RL with trajectory feedback setting with techniques in linear bandits and (ii) constructing a reference transition model to better guide the exploration process.

Poster

#W-1019

General agents need world models

Jonathan Richens · Tom Everitt · David Abel

Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient?We provide a formal answer to this question, showing that any agent capable of generalizing to multi-step goal-directed tasks must have learned a predictive model of its environment.We show that this model can be extracted from the agent's policy, and that increasing the agents performance or the complexity of the goals it can achieve requires learning increasingly accurate world models.This has a number of consequences: from developing safe and general agents, to bounding agent capabilities in complex environments, and providing new algorithms for eliciting world models from agents.

Poster

#W-102

De-coupled NeuroGF for Shortest Path Distance Approximations on Large Terrain Graphs

Samantha Chen · Pankaj Agarwal · Yusu Wang

The ability to acquire high-resolution, large-scale geospatial data at an unprecedented using LiDAR and other related technologies has intensified the need for scalable algorithms for terrain analysis, including shortest-path-distance (SPD) queries on large-scale terrain digital elevation models (DEMs). In this paper, we present a neural data structure for efficiently answering SPD queries approximately on a large terrain DEM, which is based on the recently proposed neural geodesic field (NeuroGF) framework (Zhang et al., 2023)---the state-of-the-art neural data structure for estimating geodesic distance.In particular, we propose a decoupled-NeuroGF data structure combined with an efficient two-stage mixed-training strategy, which significantly reduces computational bottlenecks and enables efficient training on terrain DEMs at a scale not feasible before. We demonstrate the efficacy of our approach by performing detailed experiments on both synthetic and real data sets.For instance, we can train a small model with around 70000 parameters on a terrain DEM with 16 million nodes in a matter of hours that can answer SPD queries with 1\% relative error in at most 10ms per query.

Poster

#W-1020

Reinforcement Learning with Random Time Horizons

Enric Borrell · Lorenz Richter · Christof Schuette

We extend the standard reinforcement learning framework to random time horizons. While the classical setting typically assumes finite and deterministic or infinite runtimes of trajectories, we argue that multiple real-world applications naturally exhibit random (potentially trajectory-dependent) stopping times. Since those stopping times typically depend on the policy, their randomness has an effect on policy gradient formulas, which we (mostly for the first time) derive rigorously in this work both for stochastic and deterministic policies. We present two complementary perspectives, trajectory or state-space based, and establish connections to optimal control theory. Our numerical experiments demonstrate that using the proposed formulas can significantly improve optimization convergence compared to traditional approaches.

Poster

#W-103

RepoAudit: An Autonomous LLM-Agent for Repository-Level Code Auditing

Jinyao Guo · Chengpeng Wang · Xiangzhe Xu · Zian Su · Xiangyu Zhang

Code auditing is the process of reviewing code with the aim of identifying bugs. Large Language Models (LLMs) have demonstrated promising capabilities for this task without requiring compilation, while also supporting user-friendly customization. However, auditing a code repository with LLMs poses significant challenges: limited context windows and hallucinations can degrade the quality of bug reports, and analyzing large-scale repositories incurs substantial time and token costs, hindering efficiency and scalability.This work introduces an LLM-based agent, RepoAudit, designed to perform autonomous repository-level code auditing. Equipped with agent memory, RepoAudit explores the codebase on demand by analyzing data-flow facts along feasible program paths within individual functions. It further incorporates a validator module to mitigate hallucinations by verifying data-flow facts and checking the satisfiability of path conditions associated with potential bugs, thereby reducing false positives. RepoAudit detects 40 true bugs across 15 real-world benchmark projects with a precision of 78.43%, requiring on average only 0.44 hours and $2.54 per project. Also, it detects 185 new bugs in high-profile projects, among which 174 have been confirmed or fixed. We have open-sourced RepoAudit at https://github.com/PurCL/RepoAudit.

Poster

#W-104

OV-MER: Towards Open-Vocabulary Multimodal Emotion Recognition

Zheng Lian · Haiyang Sun · Licai Sun · Haoyu Chen · Lan Chen · Hao Gu · Zhuofan Wen · Shun Chen · Zhang Siyuan · Hailiang Yao · Bin Liu · Rui Liu · Shan Liang · Ya Li · Jiangyan Yi · Jianhua Tao

Multimodal Emotion Recognition (MER) is a critical research area that seeks to decode human emotions from diverse data modalities. However, existing machine learning methods predominantly rely on predefined emotion taxonomies, which fail to capture the inherent complexity, subtlety, and multi-appraisal nature of human emotional experiences, as demonstrated by studies in psychology and cognitive science. To overcome this limitation, we advocate for introducing the concept of open vocabulary into MER. This paradigm shift aims to enable models to predict emotions beyond a fixed label space, accommodating a flexible set of categories to better reflect the nuanced spectrum of human emotions. To achieve this, we propose a novel paradigm: Open-Vocabulary MER (OV-MER), which enables emotion prediction without being confined to predefined spaces. However, constructing a dataset that encompasses the full range of emotions for OV-MER is practically infeasible; hence, we present a comprehensive solution including a newly curated database, novel evaluation metrics, and a preliminary benchmark. By advancing MER from basic emotions to more nuanced and diverse emotional states, we hope this work can inspire the next generation of MER, enhancing its generalizability and applicability in real-world scenarios. Code and dataset are available at: https://github.com/zeroQiaoba/AffectGPT.

Poster

#W-105

Adaptive Flow Matching for Resolving Small-Scale Physics

Stathi Fotiadis · Noah Brenowitz · Tomas Geffner · Yair Cohen · Michael Pritchard · Arash Vahdat · Morteza Mardani

Conditional diffusion and flow models are effective for super-resolving small-scale details in natural images. However, in physical sciences such as weather, three major challenges arise: (i) spatially misaligned input-output distributions (PDEs at different resolutions lead to divergent trajectories), (ii) misaligned and distinct input-output channels (channel synthesis), (iii) several channels with diverse stochasticity scales (multiscale). To address these, we propose to first encode inputs into a latent base distribution that is closer to the target, then apply Flow Matching to generate small-scale physics. The encoder captures deterministic components, while Flow Matching adds stochastic details. To handle uncertainty in the deterministic part, we inject noise via an adaptive noise scaling mechanism, dynamically adjusted by maximum-likelihood estimates of the encoder’s predictions. Experiments on real-world weather data (including super-resolution from 25 km to 2 km scales in Taiwan) and in synthetic Kolmogorov flow datasets show that our proposed Adaptive Flow Matching (AFM) framework outperforms existing methods and produces better-calibrated ensembles.

Poster

#W-106

Zebra: In-Context Generative Pretraining for Solving Parametric PDEs

Louis Serrano · Armand Kassaï Koupaï · Thomas Wang · Pierre ERBACHER · patrick gallinari

Solving time-dependent parametric partial differential equations (PDEs) is challenging for data-driven methods, as these models must adapt to variations in parameters such as coefficients, forcing terms, and initial conditions. State-of-the-art neural surrogates perform adaptation through gradient-based optimization and meta-learning to implicitly encode the variety of dynamics from observations. This often comes with increased inference complexity. Inspired by the in-context learning capabilities of large language models (LLMs), we introduce Zebra, a novel generative auto-regressive transformer designed to solve parametric PDEs without requiring gradient adaptation at inference. By leveraging in-context information during both pre-training and inference, Zebra dynamically adapts to new tasks by conditioning on input sequences that incorporate context example trajectories. As a generative model, Zebra can be used to generate new trajectories and allows quantifying the uncertainty of the predictions. We evaluate Zebra across a variety of challenging PDE scenarios, demonstrating its adaptability, robustness, and superior performance compared to existing approaches.

Poster

#W-107

Compositional Flows for 3D Molecule and Synthesis Pathway Co-design

Tony Shen · Seonghwan Seo · Ross Irwin · Kieran Didi · Simon Olsson · Woo Youn Kim · Martin Ester

Many generative applications, such as synthesis-based 3D molecular design, involve constructing compositional objects with continuous features.Here, we introduce Compositional Generative Flows (CGFlow), a novel framework that extends flow matching to generate objects in compositional steps while modeling continuous states. Our key insight is that modeling compositional state transitions can be formulated as a straightforward extension of the flow matching interpolation process.We further build upon the theoretical foundations of generative flow networks (GFlowNets), enabling reward-guided sampling of compositional structures. We apply CGFlow to synthesizable drug design by jointly designing the molecule's synthetic pathway with its 3D binding pose.Our approach achieves state-of-the-art binding affinity and synthesizability on all 15 targets from the LIT-PCBA benchmark, and 4.2x improvement in sampling efficiency compared to 2D synthesis-based baseline.To our best knowledge, our method is also the first to achieve state of-art-performance in both Vina Dock (-9.42) and AiZynth success rate (36.1\%) on the CrossDocked2020 benchmark.

Poster

#W-108

Geometric and Physical Constraints Synergistically Enhance Neural PDE Surrogates

Yunfei Huang · David S. Greenberg

Neural PDE surrogates can improve the cost-accuracy tradeoff of classical solvers, but often generalize poorly to new initial conditions and accumulate errors over time. Physical and symmetry constraints have shown promise in closing this performance gap, but existing techniques for imposing these inductive biases are incompatible with the staggered grids commonly used in computational fluid dynamics. Here we introduce novel input and output layers that respect physical laws and symmetries on the staggered grids, and for the first time systematically investigate how these constraints, individually and in combination, affect the accuracy of PDE surrogates. We focus on two challenging problems: shallow water equations with closed boundaries and decaying incompressible turbulence. Compared to strong baselines, symmetries and physical constraints consistently improve performance across tasks, architectures, autoregressive prediction steps, accuracy measures, and network sizes. Symmetries are more effective than physical constraints, but surrogates with both performed best, even compared to baselines with data augmentation or pushforward training, while themselves benefiting from the pushforward trick. Doubly-constrained surrogates also generalize better to initial conditions and durations beyond the range of the training data, and more accurately predict real-world ocean currents.

Poster

#W-109

Flexibility-conditioned protein structure design with flow matching

Vsevolod Viliuga · Leif Seute · Nicolas Wolf · Simon Wagner · Arne Elofsson · Jan Stuehmer · Frauke Gräter

Recent advances in geometric deep learning and generative modeling have enabled the design of novel proteins with a wide range of desired properties. However, current state-of-the-art approaches are typically restricted to generating proteins with only static target properties, such as motifs and symmetries. In this work, we take a step towards overcoming this limitation by proposing a framework to condition structure generation on flexibility, which is crucial for key functionalities such as catalysis or molecular recognition. We first introduce BackFlip, an equivariant neural network for predicting per-residue flexibility from an input backbone structure. Relying on BackFlip, we propose FliPS, an SE(3)-equivariant conditional flow matching model that solves the inverse problem, that is, generating backbones that display a target flexibility profile. In our experiments, we show that FliPS is able to generate novel and diverse protein backbones with the desired flexibility, verified by Molecular Dynamics (MD) simulations.

Poster

#W-110

All-atom Diffusion Transformers: Unified generative modelling of molecules and materials

Chaitanya Joshi · Xiang Fu · Yi-Lun Liao · Vahe Gharakhanyan · Benjamin Kurt Miller · Anuroop Sriram · Zachary Ulissi

Diffusion models are the standard toolkit for generative modelling of 3D atomic systems. However, for different types of atomic systems -- such as molecules and materials -- the generative processes are usually highly specific to the target system despite the underlying physics being the same. We introduce the All-atom Diffusion Transformer (ADiT), a unified latent diffusion framework for jointly generating both periodic materials and non-periodic molecular systems using the same model: (1) An autoencoder maps a unified, all-atom representations of molecules and materials to a shared latent embedding space; and (2) A diffusion model is trained to generate new latent embeddings that the autoencoder can decode to sample new molecules or materials. Experiments on MP20, QM9 and GEOM-DRUGS datasets demonstrate that jointly trained ADiT generates realistic and valid molecules as well as materials, obtaining state-of-the-art results on par with molecule and crystal-specific models. ADiT uses standard Transformers with minimal inductive biases for both the autoencoder and diffusion model, resulting in significant speedups during training and inference compared to equivariant diffusion models. Scaling ADiT up to half a billion parameters predictably improves performance, representing a step towards broadly generalizable foundation models for generative chemistry. Open source code: https://github.com/facebookresearch/all-atom-diffusion-transformer

Poster

#W-111

Closed-form Solutions: A New Perspective on Solving Differential Equations

Shu Wei · Yanjie Li · Lina Yu · Weijun Li · Min Wu · Linjun Sun · Jingyi Liu · Hong Qin · Deng Yusong · Jufeng Han · Yan Pang

The quest for analytical solutions to differential equations has traditionally been constrained by the need for extensive mathematical expertise.Machine learning methods like genetic algorithms have shown promise in this domain, but are hindered by significant computational time and the complexity of their derived solutions. This paper introduces SSDE (Symbolic Solver for Differential Equations), a novel reinforcement learning-based approach that derives symbolic closed-form solutions for various differential equations. Evaluations across a diverse set of ordinary and partial differential equations demonstrate that SSDE outperforms existing machine learning methods, delivering superior accuracy and efficiency in obtaining analytical solutions.

Poster

#W-112

Reinforcement Learning for Quantum Control under Physical Constraints

Jan Ole Ernst · Aniket Chatterjee · Tim Franzmeyer · Axel Kuhn

Quantum control is concerned with the realisation of desired dynamics in quantum systems, serving as a linchpin for advancing quantum technologies and fundamental research. Analytic approaches and standard optimisation algorithms do not yield satisfactory solutions for more complex quantum systems, and especially not for real world quantum systems which are open and noisy. We devise a physics-constrained Reinforcement Learning (RL) algorithm that restricts the space of possible solutions.We incorporate priors about the desired time scales of the quantum state dynamics - as well as realistic control signal limitations - as constraints to the RL algorithm. These constraints improve solution quality and enhance computational scaleability. We evaluate our method on three broadly relevant quantum systems and incorporate real-world complications, arising from dissipation and control signal perturbations. We achieve both higher fidelities - which exceed 0.999 across all systems - and better robustness to time-dependent perturbations and experimental imperfections than previous methods. Lastly, we demonstrate that incorporating multi-step feedback can yield solutions robust even to strong perturbations. Our implementation can be found at: https://github.com/jan-o-e/RL4qcWpc.

Poster

#W-113

GenMol: A Drug Discovery Generalist with Discrete Diffusion

Seul Lee · Karsten Kreis · Srimukh Veccham · Meng Liu · Danny Reidenbach · Yuxing Peng · Saee Paliwal · Weili Nie · Arash Vahdat

Drug discovery is a complex process that involves multiple stages and tasks. However, existing molecular generative models can only tackle some of these tasks. We present Generalist Molecular generative model (GenMol), a versatile framework that uses only a single discrete diffusion model to handle diverse drug discovery scenarios. GenMol generates Sequential Attachment-based Fragment Embedding (SAFE) sequences through non-autoregressive bidirectional parallel decoding, thereby allowing the utilization of a molecular context that does not rely on the specific token ordering while having better sampling efficiency. GenMol uses fragments as basic building blocks for molecules and introduces fragment remasking, a strategy that optimizes molecules by regenerating masked fragments, enabling effective exploration of chemical space. We further propose molecular context guidance (MCG), a guidance method tailored for masked discrete diffusion of GenMol. GenMol significantly outperforms the previous GPT-based model in de novo generation and fragment-constrained generation, and achieves state-of-the-art performance in goal-directed hit generation and lead optimization. These results demonstrate that GenMol can tackle a wide range of drug discovery tasks, providing a unified and versatile approach for molecular design.

Spotlight Poster

#W-114

The dark side of the forces: assessing non-conservative force models for atomistic machine learning

Filippo Bigi · Marcel Langer · Michele Ceriotti

The use of machine learning to estimate the energy of a group of atoms, and the forces that drive them to more stable configurations, have revolutionized the fields of computational chemistry and materials discovery.In this domain, rigorous enforcement of symmetry and conservation laws has traditionally been considered essential. For this reason, interatomic forces are usually computed as the derivatives of the potential energy, ensuring energy conservation. Several recent works have questioned this physically constrained approach, suggesting that directly predicting the forces yields a better trade-off between accuracy and computational efficiency -- and that energy conservation can be learned during training.This work investigates the applicability of such non-conservative models in microscopic simulations. We identify and demonstrate several fundamental issues, from ill-defined convergence of geometry optimization to instability in various types of molecular dynamics.Contrary to the case of rotational symmetry, energy conservation is hard to learn, monitor, and correct for.The best approach to exploit the acceleration afforded by direct force prediction might be to use it in tandem with a conservative model, reducing -- rather than eliminating -- the additional cost of backpropagation, but avoiding the pathological behavior associated with non-conservative forces.

Spotlight Poster

#W-115

LLM-SRBench: A New Benchmark for Scientific Equation Discovery with Large Language Models

Parshin Shojaee · Ngoc Hieu Nguyen · Kazem Meidani · Amir Barati Farimani · Khoa Doan · Chandan Reddy

Scientific equation discovery is a fundamental task in the history of scientific progress, enabling the derivation of laws governing natural phenomena. Recently, Large Language Models (LLMs) have gained interest for this task due to their potential to leverage embedded scientific knowledge for hypothesis generation. However, evaluating the true discovery capabilities of these methods remains challenging, as existing benchmarks often rely on common equations that are susceptible to memorization by LLMs, leading to inflated performance metrics that do not reflect actual discovery. In this paper, we introduce LLM-SRBench, a comprehensive benchmark with 239 challenging problems across four scientific domains specifically designed to evaluate LLM-based scientific equation discovery methods while preventing trivial memorization. Our benchmark comprises two main categories: LSR-Transform, which transforms common physical models into less common mathematical representations to test reasoning beyond memorization, and LSR-Synth, which introduces synthetic, discovery-driven problems requiring data-driven reasoning. Through extensive evaluation of several state-of-the-art methods on LLM-SRBench, using both open and closed LLMs, we find that the best-performing system so far achieves only 31.5% symbolic accuracy.These findings highlight the challenges of scientific equation discovery, positioning LLM-SRBench as a valuable resource for future research.

Poster

#W-116

Fixing the Double Penalty in Data-Driven Weather Forecasting Through a Modified Spherical Harmonic Loss Function

Christopher Subich · Syed Husain · Leo Separovic · Jing Yang

Recent advancements in data-driven weather forecasting models have delivered deterministic models that outperform the leading operational forecast systems based on traditional, physics-based models. However, these data-driven models are typically trained with a mean squared error loss function, which causes smoothing of fine scales through a ``double penalty'' effect. We develop a simple, parameter-free modification to this loss function that avoids this problem by separating the loss attributable to decorrelation from the loss attributable to spectral amplitude errors. Fine-tuning the GraphCast model with this new loss function results in sharp deterministic weather forecasts, an increase of the model's effective resolution from 1,250km to 160km, improvements to ensemble spread, and improvements to predictions of tropical cyclone strength and surface wind extremes.

Poster

#W-117

Riemann Tensor Neural Networks: Learning Conservative Systems with Physics-Constrained Networks

Anas Jnini · Lorenzo Breschi · Flavio Vella

Divergence-free symmetric tensors (DFSTs) are fundamental in continuum mechanics, encoding conservation laws such as mass and momentum conservation. We introduce Riemann Tensor Neural Networks (RTNNs), a novel neural architecture that inherently satisfies the DFST condition to machine precision, providing a strong inductive bias for enforcing these conservation laws. We prove that RTNNs can approximate any sufficiently smooth DFST with arbitrary precision and demonstrate their effectiveness as surrogates for conservative PDEs, achieving improved accuracy across benchmarks. This work is the first to use DFSTs as an inductive bias in neural PDE surrogates and to explicitly enforce the conservation of both mass and momentum within a physics-constrained neural architecture.

Poster

#W-118

DragSolver: A Multi-Scale Transformer for Real-World Automotive Drag Coefficient Estimation

Ye Liu · Yuntian Chen

Automotive drag coefficient ($C_d$) is pivotal to energy efficiency, fuel consumption, and aerodynamic performance. However, costly computational fluid dynamics (CFD) simulations and wind tunnel tests struggle to meet the rapid-iteration demands of automotive design. We present DragSolver, a Transformer-based framework for rapid and accurate $C_d$ estimation from large-scale, diverse 3D vehicle models.DragSolver tackles four key real-world challenges: (1) multi-scale feature extraction to capture both global shape and fine local geometry; (2) heterogeneous scale normalization to handle meshes with varying sizes and densities;(3) surface-guided gating to suppress internal structures irrelevant to external aerodynamics;and (4) epistemic uncertainty estimation via Monte Carlo dropout for risk-aware design. Extensive evaluations on three industrial-scale datasets (DrivaerNet, DrivaerNet++, and DrivaerML) show that DragSolver outperforms existing approaches in accuracy and generalization, achieving an average reduction of relative $L_2$ error by 58.7% across real-world datasets. Crucially, DragSolver is the first to achieve reliable, real-time $C_d$ inference on production-level automotive geometries.

Poster

#W-119

PPDiff: Diffusing in Hybrid Sequence-Structure Space for Protein-Protein Complex Design

Zhenqiao Song · Tianxiao Li · Lei Li · Martin Min

Designing protein-binding proteins with high affinity is critical in biomedical research and biotechnology. Despite recent advancements targeting specific proteins, the ability to create high-affinity binders for arbitrary protein targets on demand, without extensive rounds of wet-lab testing, remains a significant challenge. Here, we introduce PPDiff, a diffusion model to jointly design the sequence and structure of binders for arbitrary protein targets in a non-autoregressive manner. PPDiff builds upon our developed Sequence Structure Interleaving Network with Causal attention layers (SSINC), which integrates interleaved self-attention layers to capture global amino acid correlations, $k$-nearest neighbor ($k$NN) equivariant graph convolutional layers to model local interactions in three-dimensional (3D) space, and causal attention layers to simplify the intricate interdependencies within the protein sequence. To assess PPDiff, we curate PPBench, a general protein-protein complex dataset comprising 706,360 complexes from the Protein Data Bank (PDB). The model is pretrained on PPBench and finetuned on two real-world applications: target-protein mini-binder complex design and antigen-antibody complex design. PPDiff consistently surpasses baseline methods, achieving success rates of 50.00\%, 23.16\%, and 16.89\% for the pretraining task and the two downstream applications, respectively.

Poster

#W-120

CoastalBench: A Decade-Long High-Resolution Dataset to Emulate Complex Coastal Processes

Zelin Xu · Yupu Zhang · Tingsong Xiao · Maitane Lizaso · Jose Gonzalez-Ondina · Zibo Liu · Shigang Chen · Zhe Jiang

Over 40\% of the global population lives within 100 kilometers of the coast, which contributes more than \$8 trillion annually to the global economy. Unfortunately, coastal ecosystems are increasingly vulnerable to more frequent and intense extreme weather events and rising sea levels. Coastal scientists use numerical models to simulate complex physical processes, but these models are often slow and expensive. In recent years, deep learning has become a promising alternative to reduce the cost of numerical models. However, progress has been hindered by the lack of a large-scale, high-resolution coastal simulation dataset to train and validate deep learning models. Existing studies often focus on relatively small datasets and simple processes. To fill this gap, we introduce a decade-long, high-resolution (<100m) coastal circulation modeling dataset on a real-world 3D mesh in southwest Florida with around 6 million cells. The dataset contains key oceanography variables (e.g., current velocities, free surface level, temperature, salinity) alongside external atmospheric and river forcings. We evaluated a customized Vision Transformer model that takes initial and boundary conditions and external forcings and predicts ocean variables at varying lead times. The dataset provides an opportunity to benchmark novel deep learning models for high-resolution coastal simulations (e.g., physics-informed machine learning, neural operator learning).The code and dataset can be accessed at https://github.com/spatialdatasciencegroup/CoastalBench.

Poster

#W-121

Energy-Based Flow Matching for Generating 3D Molecular Structure

Wenyin Zhou · Christopher I Sprague · Vsevolod Viliuga · Matteo Tadiello · Arne Elofsson · Hossein Azizpour

Molecular structure generation is a fundamental problem that involves determining the 3D positions of molecules' constituents. It has crucial biological applications, such as molecular docking, protein folding, and molecular design.Recent advances in generative modeling, such as diffusion models and flow matching, have made great progress on these tasks by modeling molecular conformations as a distribution.In this work, we focus on flow matching and adopt an energy-based perspective to improve training and inference of structure generation models. Our view results in a mapping function, represented by a deep network, that is directly learned to \textit{iteratively} map random configurations, i.e. samples from the source distribution, to target structures, i.e. points in the data manifold. This yields a conceptually simple and empirically effective flow matching setup that is theoretically justified and has interesting connections to fundamental properties such as idempotency and stability, as well as the empirically useful techniques such as structure refinement in AlphaFold. Experiments on protein docking as well as protein backbone generation consistently demonstrate the method's effectiveness, where it outperforms recent baselines of task-associated flow matching and diffusion models, using a similar computational budget.

Poster

#W-122

Deep Electromagnetic Structure Design Under Limited Evaluation Budgets

Shijian Zheng · Fangxiao Jin · Shuhai Zhang · Quan Xue · Mingkui Tan

Electromagnetic structure (EMS) design plays a critical role in developing advanced antennas and materials, but remains challenging due to high-dimensional design spaces and expensive evaluations. While existing methods commonly employ high-quality predictors or generators to alleviate evaluations, they are often data-intensive and struggle with real-world scale and budget constraints. To address this, we propose a novel method called Progressive Quadtree-based Search (PQS). Rather than exhaustively exploring the high-dimensional space, PQS converts the conventional image-like layout into a quadtree-based hierarchical representation, enabling a progressive search from global patterns to local details. Furthermore, to lessen reliance on highly accurate predictors, we introduce a consistency-driven sample selection mechanism. This mechanism quantifies the reliability of predictions, balancing exploitation and exploration when selecting candidate designs. We evaluate PQS on two real-world engineering tasks, i.e., Dual-layer Frequency Selective Surface and High-gain Antenna. Experimental results show that our method can achieve satisfactory designs under limited computational budgets, outperforming baseline methods. In particular, compared to generative approaches, it cuts evaluation costs by 75∼85%, effectively saving 20.27∼38.80 days of product designing cycle.

Poster

#W-123

Learning Condensed Graph via Differentiable Atom Mapping for Reaction Yield Prediction

Ankit Ghosh · Gargee Kashyap · Sarthak Mittal · Nupur Jain · Raghavan B Sunoj · Abir De

Yield of chemical reactions generally depends on the activation barrier, i.e., the energy difference between the reactant and the transition state. Computing the transition state from the reactant and product graphs requires prior knowledge of the correct node alignment (i.e., atom mapping), which is not available in yield prediction datasets. In this work, we propose YieldNet, a neural yield prediction model, which tackles these challenges. Here, we first approximate the atom mapping between the reactants and products using a differentiable node alignment network. We then use this approximate atom mapping to obtain a noisy realization of the condensed graph of reaction (CGR), which is a supergraph encompassing both the reactants and products. This CGR serves as a surrogate for the transition state graph structure. The CGR embeddings of different steps in a multi-step reaction are then passed into a transformer-guided reaction path encoder.Our experiments show that YieldNet can predict the yield more accurately than the baselines. Furthermore, the model is trained only under the distant supervision of yield values, without requiring fine-grained supervision of atom mapping.

Poster

#W-124

Counting atoms faster: policy-based nuclear magnetic resonance pulse sequencing for atomic abundance measurement

Rohan Shenoy · Evan Coleman · Hans Gaensbauer · Elsa Olivetti

Quantifying the elemental composition of a material is a general scientific challenge with broad relevance to environmental sustainability. Existing techniques for the measurement of atomic abundances generally require laboratory conditions and expensive equipment. As a result, they cannot be deployed in situ without significant capital investment, limiting their proliferation. Measurement techniques based on nuclear magnetic resonance (NMR) hold promise in this setting due to their applicability across the periodic table, their non-destructive manipulation of samples, and their amenability to in silico optimization. In this work, we learn policies to modulate NMR pulses for rapid atomic abundance quantification. Our approach involves three inter-operating agents which (1) rapidly align nuclear spins for measurement, (2) quickly force relaxation to equilibrium, and (3) toggle control between agents (1) and (2) to minimize overall measurement time. To demonstrate this technique, we consider a specific use case of low-magnetic-field carbon-13 quantification for low-cost, portable analysis of foodstuffs and soils. We find significant performance improvements relative to traditional NMR pulse sequencing, and discuss limitations on the applicability of this approach.

Poster

#W-200

TextCenGen: Attention-Guided Text-Centric Background Adaptation for Text-to-Image Generation

Tianyi Liang · Jiangqi Liu · Yifei Huang · Shiqi Jiang · Jianshen Shi · Changbo Wang · Chenhui Li

Text-to-image (T2I) generation has made remarkable progress in producing high-quality images, but a fundamental challenge remains: creating backgrounds that naturally accommodate text placement without compromising image quality. This capability is non-trivial for real-world applications like graphic design, where clear visual hierarchy between content and text is essential.Prior work has primarily focused on arranging layouts within existing static images, leaving unexplored the potential of T2I models for generating text-friendly backgrounds.We present TextCenGen, a training-free approach that actively relocates objects before optimizing text regions, rather than directly reducing cross-attention which degrades image quality. Our method introduces: (1) a force-directed graph approach that detects conflicting objects and guides them relocation using cross-attention maps, and (2) a spatial attention constraint that ensures smooth background generation in text regions. Our method is plug-and-play, requiring no additional training while well balancing both semantic fidelity and visual quality.Evaluated on our proposed text-friendly T2I benchmark of 27,000 images across three seed datasets, TextCenGen outperforms existing methods by achieving 23\% lower saliency overlap in text regions while maintaining 98\% of the original semantic fidelity measured by CLIP score and our proposed Visual-Textual Concordance Metric (VTCM).

Spotlight Poster

#W-201

Better to Teach than to Give: Domain Generalized Semantic Segmentation via Agent Queries with Diffusion Model Guidance

Fan Li · Xuan Wang · Min Qi · Zhaoxiang Zhang · yuelei xu

Domain Generalized Semantic Segmentation (DGSS) trains a model on a labeled source domain to generalize to unseen target domains with consistent contextual distribution and varying visual appearance.Most existing methods rely on domain randomization or data generation but struggle to capture the underlying scene distribution, resulting in the loss of useful semantic information. Inspired by the diffusion model's capability to generate diverse variations within a given scene context, we consider harnessing its rich prior knowledge of scene distribution to tackle the challenging DGSS task.In this paper, we propose a novel agent \textbf{Query}-driven learning framework based on \textbf{Diff}usion model guidance for DGSS, named QueryDiff. Our recipe comprises three key ingredients: (1) generating agent queries from segmentation features to aggregate semantic information about instances within the scene; (2) learning the inherent semantic distribution of the scene through agent queries guided by diffusion features; (3) refining segmentation features using optimized agent queries for robust mask predictions.Extensive experiments across various settings demonstrate that our method significantly outperforms previous state-of-the-art methods. Notably, it enhances the model's ability to generalize effectively to extreme domains, such as cubist art styles. Code is available at https://github.com/FanLiHub/QueryDiff.

Poster

#W-202

ExtPose: Robust and Coherent Pose Estimation by Extending ViTs

Glory Rongyu CHEN · Li'an Zhuo · Linlin Yang · Qi WANG · Liefeng Bo · Bang Zhang · Angela Yao

Vision Transformers (ViT) are remarkable at 3D pose estimation, yet they still encounter certain challenges. One issue is that the popular ViT architecture for pose estimation is limited to images and lacks temporal information. Another challenge is that the prediction often fails to maintain pixel alignment with the original images. To address these issues, we propose a systematic framework for 3D pose estimation, called ExtPose. ExtPose extends image ViT to the challenging scenario and video setting by taking in additional 2D pose evidence and capturing temporal information in a full attention-based manner. We use 2D human skeleton images to integrate structured 2D pose information. By sharing parameters and attending across modalities and frames, we enhance the consistency between 3D poses and 2D videos without introducing additional parameters. We achieve state-of-the-art (SOTA) performance on multiple human and hand pose estimation benchmarks with substantial improvements to 34.0mm (-23%) on 3DPW and 4.9mm (-18%) on FreiHAND in PA-MPJPE over the other ViT-based methods respectively.

Poster

#W-203

Robust Secure Swap: Responsible Face Swap With Persons of Interest Redaction and Provenance Traceability

Yunshu Dai · Jianwei Fei · Fangjun Huang · Chip Hong Chang

As AI generative models evolve, face swap technology has become increasingly accessible, raising concerns over potential misuse. Celebrities may be manipulated without consent, and ordinary individuals may fall victim to identity fraud. To address these threats, we propose Secure Swap, a method that protects persons of interest (POI) from face-swapping abuse and embeds a unique, invisible watermark into nonPOI swapped images for traceability. By introducing an ID Passport layer, Secure Swap redacts POI faces and generates watermarked outputs for nonPOI. A detachable watermark encoder and decoder are trained with the model to ensure provenance tracing. Experimental results demonstrate that Secure Swap not only preserves face swap functionality but also effectively prevents unauthorized swaps of POI and detects different embedded model's watermarks with high accuracy. Specifically, our method achieves a 100% success rate in protecting POI and over 99% watermark extraction accuracy for nonPOI. Besides fidelity and effectiveness, the robustness of protected models against image-level and model-level attacks in both online and offline application scenarios is also experimentally demonstrated.

Poster

#W-204

Predicting High-precision Depth on Low-Precision Devices Using 2D Hilbert Curves

Mykhailo Uss · Ruslan Yermolenko · Oleksii Shashko · Olena Kolodiazhna · Ivan Safonov · Volodymyr Savin · Yoonjae Yeo · Seowon Ji · Jaeyun Jeong

Dense depth prediction deep neural networks (DNN) have achieved impressive results for both monocular and binocular data but they are limited by high computational complexity, restricting their use on low-end devices. For better on-device efficiency and hardware utilization, weights and activations of the DNN should be converted to low-bit precision. However, this precision is not sufficient for representing high dynamic range depth. In this paper, we aim to overcome this limitation and restore high-precision depth from low-bit precision predictions. To achieve this, we propose to represent high dynamic range depth as two low dynamic range components of a Hilbert curve, and to train the full precision DNN to directly predict the latter. For on-device deployment, we use standard quantization methods and add a post-processing step that reconstructs depth from the Hilbert curve components predicted in low-bit precision. Extensive experiments demonstrate that our method increases bit precision of predicted depth by up to three bits with little computational overhead. We also observe a positive side effect of quantization error reduction by up to five times. Our method enables effective and accurate depth prediction with DNN weights and activations quantized to eight bit precision.

Poster

#W-205

LLaVA-ReID: Selective Multi-image Questioner for Interactive Person Re-Identification

Yiding Lu · Mouxing Yang · Dezhong Peng · Peng Hu · Yijie Lin · Xi Peng

Traditional text-based person ReID assumes that person descriptions from witnesses are complete and provided at once. However, in real-world scenarios, such descriptions are often partial or vague. To address this limitation, we introduce a new task called interactive person re-identification (Inter-ReID). Inter-ReID is a dialogue-based retrieval task that iteratively refines initial descriptions through ongoing interactions with the witnesses. To facilitate the study of this new task, we construct a dialogue dataset that incorporates multiple types of questions by decomposing fine-grained attributes of individuals. We further propose LLaVA-ReID, a question model that generates targeted questions based on visual and textual contexts to elicit additional details about the target person. Leveraging a looking-forward strategy, we prioritize the most informative questions as supervision during training. Experimental results on both Inter-ReID and text-based ReID benchmarks demonstrate that LLaVA-ReID significantly outperforms baselines.

Poster

#W-206

DragLoRA: Online Optimization of LoRA Adapters for Drag-based Image Editing in Diffusion Model

Siwei Xia · Li Sun · Tiantian Sun · Qingli Li

Drag-based editing within pretrained diffusion model provides a precise and flexible way to manipulate foreground objects. Traditional methods optimize the input feature obtained from DDIM inversion directly, adjusting them iteratively to guide handle points towards target locations. However, these approaches often suffer from limited accuracy due to the low representation ability of the feature in motion supervision, as well as inefficiencies caused by the large search space required for point tracking. To address these limitations, we present DragLoRA, a novel framework that integrates LoRA (Low-Rank Adaptation) adapters into the drag-based editing pipeline. To enhance the training of LoRA adapters, we introduce an additional denoising score distillation loss which regularizes the online model by aligning its output with that of the original model. Additionally, we improve the consistency of motion supervision by adapting the input features using the updated LoRA, giving a more stable and accurate input feature for subsequent operations. Building on this, we design an adaptive optimization scheme that dynamically toggles between two modes, prioritizing efficiency without compromising precision. Extensive experiments demonstrate that DragLoRA significantly enhances the control precision and computational efficiency for drag-based image editing. The Codes of DragLoRA are available at: https://github.com/Sylvie-X/DragLoRA.

Poster

#W-207

Improving Compositional Generation with Diffusion Models Using Lift Scores

Chenning Yu · Sicun Gao

We introduce a novel resampling criterion using lift scores, for improving compositional generation in diffusion models. By leveraging the lift scores, we evaluate whether generated samples align with each single condition and then compose the results to determine whether the composed prompt is satisfied. Our key insight is that lift scores can be efficiently approximated using only the original diffusion model, requiring no additional training or external modules. We develop an optimized variant that achieves relatively lower computational overhead during inference while maintaining effectiveness. Through extensive experiments, we demonstrate that lift scores significantly improved the condition alignment for compositional generation across 2D synthetic data, CLEVR position tasks, and text-to-image synthesis. Our code is available at github.com/rainorangelemon/complift.

Poster

#W-208

PF3plat: Pose-Free Feed-Forward 3D Gaussian Splatting for Novel View Synthesis

Sunghwan Hong · Jaewoo Jung · Heeseong Shin · Jisang Han · Jiaolong Yang · Chong Luo · Seungryong Kim

We consider the problem of novel view synthesis from unposed images in a single feed-forward. Our framework capitalizes on fast speed, scalability, and high-quality 3D reconstruction and view synthesis capabilities of 3DGS, where we further extend it to offer a practical solution that relaxes common assumptions such as dense image views, accurate camera poses, and substantial image overlaps. We achieve this through identifying and addressing unique challenges arising from the use of pixel-aligned 3DGS: misaligned 3D Gaussians across different views induce noisy or sparse gradients that destabilize training and hinder convergence, especially when above assumptions are not met. To mitigate this, we employ pre-trained monocular depth estimation and visual correspondence models to achieve coarse alignments of 3D Gaussians. We then introduce lightweight, learnable modules to refine depth and pose estimates from the coarse alignments, improving the quality of 3D reconstruction and novel view synthesis. Furthermore, the refined estimates are leveraged to estimate geometry confidence scores, which assess the reliability of 3D Gaussian centers and condition the prediction of Gaussian parameters accordingly. Extensive evaluations on large-scale real-world datasets demonstrate that PF3plat sets a new state-of-the-art across all benchmarks, supported by comprehensive ablation studies validating our design choices. We will make the code and weights publicly available.

Poster

#W-209

The Four Color Theorem for Cell Instance Segmentation

Ye Zhang · Yu Zhou · Yifeng Wang · Jun Xiao · Ziyue Wang · Yongbing Zhang · Jianxu Chen

Cell instance segmentation is critical to analyzing biomedical images, yet accurately distinguishing tightly touching cells remains a persistent challenge. Existing instance segmentation frameworks, including detection-based, contour-based, and distance mapping-based approaches, have made significant progress, but balancing model performance with computational efficiency remains an open problem. In this paper, we propose a novel cell instance segmentation method inspired by the four-color theorem. By conceptualizing cells as countries and tissues as oceans, we introduce a four-color encoding scheme that ensures adjacent instances receive distinct labels. This reformulation transforms instance segmentation into a constrained semantic segmentation problem with only four predicted classes, substantially simplifying the instance differentiation process. To solve the training instability caused by the non-uniqueness of four-color encoding, we design an asymptotic training strategy and encoding transformation method. Extensive experiments on various modes demonstrate our approach achieves state-of-the-art performance. The code is available at https://github.com/zhangye-zoe/FCIS.

Poster

#W-210

Instruct2See: Learning to Remove Any Obstructions Across Distributions

Junhang Li · Yu Guo · Xian · Shengfeng He

Images are often obstructed by various obstacles due to capture limitations, hindering the observation of objects of interest. Most existing methods address occlusions from specific elements like fences or raindrops, but are constrained by the wide range of real-world obstructions, making comprehensive data collection impractical. To overcome these challenges, we propose Instruct2See, a novel zero-shot framework capable of handling both seen and unseen obstacles. The core idea of our approach is to unify obstruction removal by treating it as a soft-hard mask restoration problem, where any obstruction can be represented using multi-modal prompts, such as visual semantics and textual instructions, processed through a cross-attention unit to enhance contextual understanding and improve mode control. Additionally, a tunable mask adapter allows for dynamic soft masking, enabling real-time adjustment of inaccurate masks. Extensive experiments on both in-distribution and out-of-distribution obstacles show that Instruct2See consistently achieves strong performance and generalization in obstruction removal, regardless of whether the obstacles were present during the training phase. Code and dataset are available at https://jhscut.github.io/Instruct2See.

Poster

#W-211

MiraGe: Editable 2D Images using Gaussian Splatting

Joanna Waczyńska · Tomasz Szczepanik · Piotr Borycki · Slawomir Tadeja · Thomas Bohné · Przemysław Spurek

Implicit Neural Representations (INRs) approximate discrete data through continuous functions and are commonly used for encoding 2D images. Traditional image-based INRs employ neural networks to map pixel coordinates to RGB values, capturing shapes, colors, and textures within the network’s weights. Recently, GaussianImage has been proposed as an alternative, using Gaussian functions instead of neural networks to achieve comparable quality and compression. Such a solution obtains a quality and compression ratio similar to classical INR models but does not allow image modification. In contrast, our work introduces a novel method, MiraGe, which uses mirror reflections to perceive 2D images in 3D space and employs flat-controlled Gaussians for precise 2D image editing. Our approach improves the rendering quality and allows realistic image modifications, including human-inspired perception of photos in the 3D world. Thanks to modeling images in 3D space, we obtain the illusion of 3D-based modification in 2D images. We also show that our Gaussian representation can be easily combined with a physics engine to produce physics-based modification of 2D images. Consequently, MiraGe allows for better quality than the standard approach and natural modification of 2D images.

Poster

#W-212

SSHR: More Secure Generative Steganography with High-Quality Revealed Secret Images

Jiannian Wang · Yao Lu · Guangming Lu

Image steganography ensures secure information transmission and storage by concealing secret messages within images. Recently, the diffusion model has been incorporated into the generative image steganography task, with text prompts being employed to guide the entire process. However, existing methods are plagued by three problems: (1) the restricted control exerted by text prompts causes generated stego images resemble the secret images and seem unnatural, raising the severe detection risk; (2) inconsistent intermediate states between Denoising Diffusion Implicit Models and its inversion, coupled with limited control of text prompts degrade the revealed secret images; (3) the descriptive text of images(i.e. text prompts) are also deployed as the keys, but this incurs significant security risks for both the keys and the secret images.To tackle these drawbacks, we systematically propose the SSHR, which joints the Reference Images with the adaptive keys to govern the entire process, enhancing the naturalness and imperceptibility of stego images. Additionally, we methodically construct an Exact Reveal Process to improve the quality of the revealed secret images. Furthermore, adaptive Reference-Secret Image Related Symmetric Keys are generated to enhance the security of both the keys and the concealed secret images. Various experiments indicate that our model outperforms existing methods in terms of recovery quality and secret image security.

Poster

#W-213

Not All Tokens Matter All The Time: Dynamic Token Aggregation Towards Efficient Detection Transformers

Jiacheng Cheng · Xiwen Yao · Xiang Yuan · Junwei Han

The substantial computational demands of detection transformers (DETRs) hinder their deployment in resource-constrained scenarios, with the encoder consistently emerging as a critical bottleneck. A promising solution lies in reducing token redundancy within the encoder. However, existing methods perform static sparsification while ignoring the varying importance of tokens across different levels and encoder blocks for object detection, leading to suboptimal sparsification and performance degradation. In this paper, we propose Dynamic DETR (Dynamic token aggregation for DEtection TRansformers), a novel strategy that leverages inherent importance distribution to control token density and performs multi-level token sparsification. Within each stage, we apply a proximal aggregation paradigm for low-level tokens to maintain spatial integrity, and a holistic strategy for high-level tokens to capture broader contextual information. Furthermore, we propose center-distance regularization to align the distribution of tokens throughout the sparsification process, thereby facilitating the representation consistency and effectively preserving critical object-specific patterns. Extensive experiments on canonical DETR models demonstrate that Dynamic DETR is broadly applicable across various models and consistently outperforms existing token sparsification methods.

Poster

#W-214

DCBM: Data-Efficient Visual Concept Bottleneck Models

Katharina Prasse · Patrick Knab · Sascha Marton · Christian Bartelt · Margret Keuper

Concept Bottleneck Models (CBMs) enhance the interpretability of neural networks by basing predictions on human-understandable concepts. However, current CBMs typically rely on concept sets extracted from large language models or extensive image corpora, limiting their effectiveness in data-sparse scenarios. We propose Data-efficient CBMs (DCBMs), which reduce the need for large sample sizes during concept generation while preserving interpretability. DCBMs define concepts as image regions detected by segmentation or detection foundation models, allowing each image to generate multiple concepts across different granularities. Exclusively containing dataset-specific concepts, DCBMs are well suited for fine-grained classification and out-of-distribution tasks. Attribution analysis using Grad-CAM demonstrates that DCBMs deliver visual concepts that can be localized in test images. By leveraging dataset-specific concepts insteadof predefined or general ones, DCBMs enhance adaptability to new domains. The code is available at: https://github.com/KathPra/DCBM.

Spotlight Poster

#W-215

LotteryCodec: Searching the Implicit Representation in a Random Network for Low-Complexity Image Compression

Haotian Wu · Gongpu Chen · Pier Luigi Dragotti · Deniz Gunduz

We introduce and validate the lottery codec hypothesis, which states that untrained subnetworks within randomly initialized networks can serve as synthesis networks for overfitted image compression, achieving rate-distortion (RD) performance comparable to trained networks. This hypothesis leads to a new paradigm for image compression by encoding image statistics into the network substructure. Building on this hypothesis, we propose LotteryCodec, which overfits a binary mask to an individual image, leveraging an over-parameterized and randomly initialized network shared by the encoder and the decoder. To address over-parameterization challenges and streamline subnetwork search, we develop a rewind modulation mechanism that improves the RD performance. LotteryCodec outperforms VTM and sets a new state-of-the-art in single-image compression. LotteryCodec also enables adaptive decoding complexity through adjustable mask ratios, offering flexible compression solutions for diverse device constraints and application requirements.

Poster

#W-216

Flex3D: Feed-Forward 3D Generation with Flexible Reconstruction Model and Input View Curation

Junlin Han · Jianyuan Wang · Andrea Vedaldi · Phil Torr · Filippos Kokkinos

Generating high-quality 3D content from text, single images, or sparse view images remains a challenging task with broad applications.Existing methods typically employ multi-view diffusion models to synthesize multi-view images, followed by a feed-forward process for 3D reconstruction. However, these approaches are often constrained by a small and fixed number of input views, limiting their ability to capture diverse viewpoints and, even worse, leading to suboptimal generation results if the synthesized views are of poor quality.To address these limitations, we propose Flex3D, a novel two-stage framework capable of leveraging an arbitrary number of high-quality input views. The first stage consists of a candidate view generation and curation pipeline. In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs. Through extensive exploration of design and training strategies, we optimize FlexRM to achieve superior performance in both reconstruction and generation tasks. Our results demonstrate that Flex3D achieves state-of-the-art performance, with a user study winning rate of over 92% in 3D generation tasks when compared to several of the latest feed-forward 3D generative models.

Poster

#W-217

Origin Identification for Text-Guided Image-to-Image Diffusion Models

Wenhao Wang · Yifan Sun · Zongxin Yang · Zhentao Tan · Zhengdong Hu · Yi Yang

Text-guided image-to-image diffusion models excel in translating images based on textual prompts, allowing for precise and creative visual modifications. However, such a powerful technique can be misused for *spreading misinformation*, *infringing on copyrights*, and *evading content tracing*. This motivates us to introduce the task of origin **ID**entification for text-guided **I**mage-to-image **D**iffusion models (**ID$\mathbf{^2}$**), aiming to retrieve the original image of a given translated query. A straightforward solution to ID$^2$ involves training a specialized deep embedding model to extract and compare features from both query and reference images. However, due to *visual discrepancy* across generations produced by different diffusion models, this similarity-based approach fails when training on images from one model and testing on those from another, limiting its effectiveness in real-world applications. To solve this challenge of the proposed ID$^2$ task, we contribute the first dataset and a theoretically guaranteed method, both emphasizing generalizability. The curated dataset, **OriPID**, contains abundant **Ori**gins and guided **P**rompts, which can be used to train and test potential **ID**entification models across various diffusion models. In the method section, we first prove the *existence* of a linear transformation that minimizes the distance between the pre-trained Variational Autoencoder embeddings of generated samples and their origins. Subsequently, it is demonstrated that such a simple linear transformation can be *generalized* across different diffusion models. Experimental results show that the proposed method achieves satisfying generalization performance, significantly surpassing similarity-based methods (+31.6% mAP), even those with generalization designs. The project is available at https://id2icml.github.io.

Poster

#W-218

Mutual Learning for SAM Adaptation: A Dual Collaborative Network Framework for Source-Free Domain Transfer

Yabo Liu · Waikeung Wong · Chengliang Liu · Xiaoling Luo · Yong Xu · Jinghua Wang

Segment Anything Model (SAM) has demonstrated remarkable zero-shot segmentation capabilities across various visual tasks. However, its performance degrades significantly when deployed in new target domains with substantial distribution shifts. While existing self-training methods based on fixed teacher-student architectures have shown improvements, they struggle to ensure that the teacher network consistently outperforms the student under severe domain shifts. To address this limitation, we propose a novel Collaborative Mutual Learning Framework for source-free SAM adaptation, leveraging dual-networks in a dynamic and cooperative manner. Unlike fixed teacher-student paradigms, our method dynamically assigns the teacher and student roles by evaluating the reliability of each collaborative network in each training iteration. Our framework incorporates a dynamic mutual learning mechanism with three key components: a direct alignment loss for knowledge transfer, a reverse distillation loss to encourage diversity, and a triplet relationship loss to refine feature representations. These components enhance the adaptation capabilities of the collaborative networks, enabling them to generalize effectively to target domains while preserving their pre-trained knowledge. Extensive experiments on diverse target domains demonstrate that our proposed framework achieves state-of-the-art adaptation performance.

Poster

#W-219

Rethinking Score Distilling Sampling for 3D Editing and Generation

Xingyu Miao · Haoran Duan · Yang Long · Jungong Han

Score Distillation Sampling (SDS) has emerged as a prominent method for text-to-3D generation by leveraging the strengths of 2D diffusion models. However, SDS is limited to generation tasks and lacks the capability to edit existing 3D assets. Conversely, variants of SDS that introduce editing capabilities often can not generate new 3D assets effectively. In this work, we observe that the processes of generation and editing within SDS and its variants have unified underlying gradient terms. Building on this insight, we propose Unified Distillation Sampling (UDS), a method that seamlessly integrates both the generation and editing of 3D assets. Essentially, UDS refines the gradient terms used in vanilla SDS methods, unifying them to support both tasks. Extensive experiments demonstrate that UDS not only outperforms baseline methods in generating 3D assets with richer details but also excels in editing tasks, thereby bridging the gap between 3D generation and editing.

Poster

#W-220

Diffusion on Language Model Encodings for Protein Sequence Generation

Viacheslav Meshchaninov · Pavel Strashnov · Andrey Shevtsov · Fedor Nikolaev · Nikita Ivanisenko · Olga Kardymon · Dmitry Vetrov

Protein sequence design has seen significant advances through discrete diffusion and autoregressive approaches, yet the potential of continuous diffusion remains underexplored. Here, we present DiMA, a latent diffusion framework that operates on protein language model representations. Through systematic exploration of architectural choices and diffusion components, we develop a robust methodology that generalizes across multiple protein encoders ranging from 8M to 3B parameters. We demonstrate that our framework achieves consistently high performance across sequence-only (ESM-2, ESMc), dual-decodable (CHEAP), and multimodal (SaProt) representations using the same architecture and training approach. We conduct extensive evaluation of existing methods alongside DiMA using multiple metrics across two protein modalities, covering quality, diversity, novelty, and distribution matching of generated proteins. DiMA consistently produces novel, high-quality and diverse protein sequences and achieves strong results compared to baselines such as autoregressive, discrete diffusion and flow matching language models. The model demonstrates versatile functionality, supporting conditional generation tasks including protein family-generation, motif scaffolding and infilling, and fold-specific sequence design, despite being trained solely on sequence data. This work provides a universal continuous diffusion framework for protein sequence generation, offering both architectural insights and practical applicability across various protein design scenarios. Code is released at GitHub.

Spotlight Poster

#W-221

FlashTP: Fused, Sparsity-Aware Tensor Product for Machine Learning Interatomic Potentials

Seung Lee · Hojoon Kim · Yutack Park · Dawoon Jeong · Seungwu Han · Yeonhong Park · Jae W. Lee

Machine Learning Interatomic Potentials (MLIPs) enable efficient molecular dynamics (MD) simulations with high accuracy. While equivariant MLIPs achieve state-of-the-art accuracy, they face significant computational bottlenecks centered around their Tensor-Product layer, which account for up to 75\% of training time and cause substantial memory overhead. We present FlashTP, a highly optimized tensor-product library that addresses these inefficiencies through kernel fusion, sparse computation, and path-aggregated execution. FlashTP achieves up to 41.6$\times$ and 60.8$\times$ kernel speedups over _e3nn_ and NVIDIA cuEquivariance, respectively. For SevenNet-l3i5, it delivers 4.2$\times$ and 3.5$\times$ speedup while reducing peak memory usage by 6.3$\times$ and 6.2$\times$ for inference and training, respectively. The code is available at https://github.com/SNU-ARC/flashTP.

Poster

#W-300

Unifying Specialized Visual Encoders for Video Language Models

Jihoon Chung · Tyler Zhu · Max Gonzalez Saez-Diez · Juan Carlos Niebles · Honglu Zhou · Olga Russakovsky

Recent advances in vision backbones have yielded powerful and diverse visual and video encoders. Yet, current Video Large Language Models encode visual inputs using an encoder from a single backbone family, limiting the amount and type of visual information they can process. We propose MERV, a Multi-Encoder Video Representation, which utilizes multiple encoders for a comprehensive video representation. To optimize heterogeneous features from a broad spectrum of encoders and ensure efficient and coherent feature integration, MERV first aligns encoder features spatio-temporally, then projects them into a unified structure, and finally fuses them through cross-attention. Under fair comparison, MERV achieves up to 4.62% higher accuracy than its base model, while introducing minimal extra parameters and training faster than equivalent single-encoder methods after parallelizing visual processing. Qualitative analysis shows MERV successfully captures and integrates domain knowledge from each encoder, opening new possibilities for scaling enhanced video understanding.

Poster

#W-301

Symmetry-Robust 3D Orientation Estimation

Christopher Scarvelis · David Benhaim · Paul Zhang

Orientation estimation is a fundamental task in 3D shape analysis which consists of estimating a shape's orientation axes: its side-, up-, and front-axes. Using this data, one can rotate a shape into canonical orientation, where its orientation axes are aligned with the coordinate axes. Developing an orientation algorithm that reliably estimates complete orientations of general shapes remains an open problem. We introduce a two-stage orientation pipeline that achieves state of the art performance on up-axis estimation and further demonstrate its efficacy on full-orientation estimation, where one seeks all three orientation axes. Unlike previous work, we train and evaluate our method on all of Shapenet rather than a subset of classes. We motivate our engineering contributions by theory describing fundamental obstacles to orientation estimation for rotationally-symmetric shapes, and show how our method avoids these obstacles.

Spotlight Poster

#W-302

FlowDrag: 3D-aware Drag-based Image Editing with Mesh-guided Deformation Vector Flow Fields

Gwanhyeong Koo · Sunjae Yoon · Younghwan Lee · Ji Woo Hong · Chang Yoo

Drag-based editing allows precise object manipulation through point-based control, offering user convenience. However, current methods often suffer from a geometric inconsistency problem by focusing exclusively on matching user-defined points, neglecting the broader geometry and leading to artifacts or unstable edits. We propose FlowDrag, which leverages geometric information for more accurate and coherent transformations. Our approach constructs a 3D mesh from the image, using an energy function to guide mesh deformation based on user-defined drag points. The resulting mesh displacements are projected into 2D and incorporated into a UNet denoising process, enabling precise handle-to-target point alignment while preserving structural integrity. Additionally, existing drag-editing benchmarks provide no ground truth, making it difficult to assess how accurately the edits match the intended transformations. To address this, we present VFD (VidFrameDrag) benchmark dataset, which provides ground-truth frames using consecutive shots in a video dataset. FlowDrag outperforms existing drag-based editing methods on both VFD Bench and DragBench.

Poster

#W-303

RUN: Reversible Unfolding Network for Concealed Object Segmentation

Chunming He · Rihan Zhang · Fengyang Xiao · Chengyu Fang · Longxiang Tang · Yulun Zhang · Linghe Kong · Deng-Ping Fan · Kai Li · Sina Farsiu

Concealed object segmentation (COS) is a challenging problem that focuses on identifying objects that are visually blended into their background. Existing methods often employ reversible strategies to concentrate on uncertain regions but only focus on the mask level, overlooking the valuable of the RGB domain. To address this, we propose a Reversible Unfolding Network (RUN) in this paper. RUN formulates the COS task as a foreground-background separation process and incorporates an extra residual sparsity constraint to minimize segmentation uncertainties. The optimization solution of the proposed model is unfolded into a multistage network, allowing the original fixed parameters to become learnable. Each stage of RUN consists of two reversible modules: the Segmentation-Oriented Foreground Separation (SOFS) module and the Reconstruction-Oriented Background Extraction (ROBE) module. SOFS applies the reversible strategy at the mask level and introduces Reversible State Space to capture non-local information. ROBE extends this to the RGB domain, employing a reconstruction network to address conflicting foreground and background regions identified as distortion-prone areas, which arise from their separate estimation by independent modules. As the stages progress, RUN gradually facilitates reversible modeling of foreground and background in both the mask and RGB domains, reducing false-positive and false-negative regions. Extensive experiments demonstrate the superior performance of RUN and underscore the promise of unfolding-based frameworks for COS and other high-level vision tasks. Code is available at https://github.com/ChunmingHe/RUN.

Poster

#W-304

The Limits of Predicting Agents from Behaviour

Alexis Bellot · Jonathan Richens · Tom Everitt

As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour is important for safely deploying AI. For agents, the most natural abstractions for predicting behaviour attribute beliefs, intentions and goals to the system. If an agent behaves as if it has a certain goal or belief, then we can make reasonable predictions about how it will behave in novel situations, including those where comprehensive safety evaluations are untenable. How well can we infer an agent’s beliefs from their behaviour, and how reliably can these inferred beliefs predict the agent’s behaviour in novel situations? We provide a precise answer to this question under the assumption that the agent’s behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments, which represent a theoretical limit for predicting intentional agents from behavioural data alone. We discuss the implications of these results for several research areas including fairness and safety.

Poster

#W-305

S2-Track: A Simple yet Strong Approach for End-to-End 3D Multi-Object Tracking

Tao Tang · Lijun Zhou · Pengkun Hao · Zihang He · Kalok Ho · Shuo Gu · Zhihui Hao · Haiyang Sun · Kun Zhan · Peng Jia · XianPeng Lang · Xiaodan Liang

3D multiple object tracking (MOT) plays a crucial role in autonomous driving perception. Recent end-to-end query-based trackers simultaneously detect and track objects, which have shown promising potential for the 3D MOT task. However, existing methods are still in the early stages of development and lack systematic improvements, failing to track objects in certain complex scenarios, like occlusions and the small size of target object’s situations. In this paper, we first summarize the current end-to-end 3D MOT framework by decomposing it into three constituent parts: query initialization, query propagation, and query matching. Then we propose corresponding improvements, which lead to a strong yet simple tracker: S2-Track. Specifically, for query initialization, we present 2D-Prompted Query Initialization, which leverages predicted 2D object and depth information to prompt an initial estimate of the object’s 3D location. For query propagation, we introduce an Uncertainty-aware Probabilistic Decoder to capture the uncertainty of complex environment in object prediction with probabilistic attention. For query matching, we propose a Hierarchical Query Denoising strategy to enhance training robustness and convergence. As a result, our S2-Track achieves state-of-the-art performance on nuScenes benchmark, i.e., 66.3% AMOTA on test split, surpassing the previous best end-to-end solution by a significant margin of 8.9% AMOTA. We achieve 1st place on the nuScenes tracking task leaderboard.

Poster

#W-306

Adaptive Sensitivity Analysis for Robust Augmentation against Natural Corruptions in Image Segmentation

Laura Zheng · Wenjie Wei · Tony Wu · Jacob Clements · Shreelekha Revankar · Andre Harrison · Yu Shen · Ming Lin

Achieving robustness in image segmentation models is challenging due to the fine-grained nature of pixel-level classification. These models, which are crucial for many real-time perception applications, particularly struggle when faced with natural corruptions in the wild for autonomous systems. While sensitivity analysis can help us understand how input variables influence model outputs, its application to natural and uncontrollable corruptions in training data is computationally expensive. In this work, we present an adaptive, sensitivity-guided augmentation method to enhance robustness against natural corruptions. Our sensitivity analysis on average runs 10 times faster and requires about 200 times less storage than previous sensitivity analysis, enabling practical, on-the-fly estimation during training for a model-free augmentation policy. With minimal fine-tuning, our sensitivity-guided augmentation method achieves improved robustness on both real-world and synthetic datasets compared to state-of-the-art data augmentation techniques in image segmentation.

Poster

#W-307

GoIRL: Graph-Oriented Inverse Reinforcement Learning for Multimodal Trajectory Prediction

Muleilan Pei · Shaoshuai Shi · Lu Zhang · Peiliang Li · Shaojie Shen

Trajectory prediction for surrounding agents is a challenging task in autonomous driving due to its inherent uncertainty and underlying multimodality. Unlike prevailing data-driven methods that primarily rely on supervised learning, in this paper, we introduce a novel Graph-oriented Inverse Reinforcement Learning (GoIRL) framework, which is an IRL-based predictor equipped with vectorized context representations. We develop a feature adaptor to effectively aggregate lane-graph features into grid space, enabling seamless integration with the maximum entropy IRL paradigm to infer the reward distribution and obtain the policy that can be sampled to induce multiple plausible plans. Furthermore, conditioned on the sampled plans, we implement a hierarchical parameterized trajectory generator with a refinement module to enhance prediction accuracy and a probability fusion strategy to boost prediction confidence. Extensive experimental results showcase our approach not only achieves state-of-the-art performance on the large-scale Argoverse & nuScenes motion forecasting benchmarks but also exhibits superior generalization abilities compared to existing supervised models.

Poster

#W-308

Ultra Lowrate Image Compression with Semantic Residual Coding and Compression-aware Diffusion

Anle Ke · Xu Zhang · Tong Chen · Ming Lu · Chao Zhou · Jiawen Gu · Zhan Ma

Existing multimodal large model-based image compression frameworks often rely on a fragmented integration of semantic retrieval, latent compression, and generative models, resulting in suboptimal performance in both reconstruction fidelity and coding efficiency. To address these challenges, we propose a residual-guided ultra lowrate image compression named ResULIC, which incorporates residual signals into both semantic retrieval and the diffusion-based generation process. Specifically, we introduce Semantic Residual Coding (SRC) to capture the semantic disparity between the original image and its compressed latent representation. A perceptual fidelity optimizer is further applied for superior reconstruction quality. Additionally, we present the Compression-aware Diffusion Model (CDM), which establishes an optimal alignment between bitrates and diffusion time steps, improving compression-reconstruction synergy. Extensive experiments demonstrate the effectiveness of ResULIC, achieving superior objective and subjective performance compared to state-of-the-art diffusion-based methods with -80.7\%, -66.3\% BD-rate saving in terms of LPIPS and FID.

Poster

#W-309

Divide and Conquer: Exploring Language-centric Tree Reasoning for Video Question-Answering

Zhaohe Liao · Jiangtong Li · Siyu Sun · Qingyang Liu · Fengshun Xiao · Tianjiao Li · Qiang Zhang · Guang Chen · Li Niu · Changjun Jiang · Liqing Zhang

Video Question-Answering (VideoQA) remains challenging in achieving advanced cognitive reasoning due to the uncontrollable and opaque reasoning processes in existing Multimodal Large Language Models (MLLMs). To address this issue, we propose a novel Language-centric Tree Reasoning (LTR) framework that targets on enhancing the reasoning ability of models. In detail, it recursively divides the original question into logically manageable parts and conquers them piece by piece, enhancing the reasoning capabilities and interpretability of existing MLLMs. Specifically, in the first stage, the LTR focuses on language to recursively generate a language-centric logical tree, which gradually breaks down the complex cognitive question into simple perceptual ones and plans the reasoning path through a RAG-based few-shot approach. In the second stage, with the aid of video content, the LTR performs bottom-up logical reasoning within the tree to derive the final answer along with the traceable reasoning path. Experiments across 11 VideoQA benchmarks demonstrate that our LTR framework significantly improves both accuracy and interpretability compared to state-of-the-art MLLMs. To our knowledge, this is the first work to implement a language-centric logical tree to guide MLLM reasoning in VideoQA, paving the way for language-centric video understanding from perception to cognition.

Poster

#W-310

NeuralCohort: Cohort-aware Neural Representation Learning for Healthcare Analytics

Changshuo Liu · Lingze Zeng · Kaiping Zheng · Shaofeng Cai · Beng Chin Ooi · James Yip

Electronic health records (EHR) aggregate extensive data critical for advancing patient care and refining intervention strategies. EHR data is essential for epidemiological study, more commonly referred to as cohort study, where patients with shared characteristics or similar diseases are analyzed over time. Unfortunately, existing studies on cohort modeling are limited, struggling to derive fine-grained cohorts or effectively utilize cohort information, which hinders their ability to uncover intrinsic relationships between cohorts. To this end, we propose NeuralCohort, a cohort-aware neural representation learning method that precisely segments patients into finer-grained cohorts via an innovative cohort contextualization mechanism and captures both intra- and inter-cohort information using a Biscale Cohort Learning Module. Designed as a plug-in, NeuralCohort integrates seamlessly with existing backbone models, enhancing their cohort analysis capabilities by infusing deep cohort insights into the representation learning processes. The effectiveness and generalizability of NeuralCohort are validated across extensive real-world EHR datasets. Experimental results demonstrate that NeuralCohort consistently improves the performance of various backbone models, achieving up to an 8.1% increase in AUROC.

Spotlight Poster

#W-311

scSSL-Bench: Benchmarking Self-Supervised Learning for Single-Cell Data

Olga Ovcharenko · Florian Barkmann · Philip Toma · Imant Daunhawer · Julia Vogt · Sebastian Schelter · Valentina Boeva

Self-supervised learning (SSL) has proven to be a powerful approach for extracting biologically meaningful representations from single-cell data. To advance our understanding of SSL methods applied to single-cell data, we present scSSL-Bench, a comprehensive benchmark that evaluates nineteen SSL methods. Our evaluation spans nine datasets and focuses on three common downstream tasks: batch correction, cell type annotation, and missing modality prediction. Furthermore, we systematically assess various data augmentation strategies. Our analysis reveals task-specific trade-offs: the specialized single-cell frameworks, scVI, CLAIRE, and the finetuned scGPT excel at uni-modal batch correction, while generic SSL methods, such as VICReg and SimCLR, demonstrate superior performance in cell typing and multi-modal data integration. Random masking emerges as the most effective augmentation technique across all tasks, surpassing domain-specific augmentations. Notably, our results indicate the need for a specialized single-cell multi-modal data integration framework. scSSL-Bench provides a standardized evaluation platform and concrete recommendations for applying SSL to single-cell analysis, advancing the convergence of deep learning and single-cell genomics.

Poster

#W-312

NTPP: Generative Speech Language Modeling for Dual-Channel Spoken Dialogue via Next-Token-Pair Prediction

Qichao Wang · Ziqiao Meng · Wenqian Cui · Yifei Zhang · Pengcheng Wu · Bingzhe Wu · Irwin King · Liang Chen · Peilin Zhao

Inspired by the impressive capabilities of GPT-4o, there is growing interest in enabling speech language models (SLMs) to engage in natural, fluid spoken interactions with humans. Recent advancements have led to the development of several SLMs that demonstrate promising results in this area. However, current approaches have yet to fully exploit dual-channel speech data, which inherently captures the structure and dynamics of human conversation. In this work, we systematically explore the use of dual-channel speech data in the context of modern large language models, and introduce a novel generative modeling paradigm—Next-Token-Pair Prediction (NTPP)—to enable speaker-independent dual-channel spoken dialogue learning using decoder-only architectures for the first time. We evaluate our approach on standard benchmarks, and empirical results show that our proposed method, NTPP, significantly improves the conversational abilities of SLMs in terms of turn-taking prediction, response coherence, and naturalness. Moreover, compared to existing methods, NTPP achieves substantially lower inference latency, highlighting its practical efficiency for real-time applications. Demo and code can be found at https://audio-3059.pages.dev.

Poster

#W-313

Flexible and Efficient Grammar-Constrained Decoding

Kanghee Park · Timothy Zhou · Loris D'Antoni

Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs matches such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can ``align'' with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.

Poster

#W-314

Otter: Generating Tests from Issues to Validate SWE Patches

Toufique Ahmed · Jatin Ganhotra · Rangeet Pan · Avraham Shinnar · Saurabh Sinha · Martin Hirzel

While there has been plenty of work on generating tests from existing code, there has been limited work on generating tests from issues. A correct test must validate the code patch that resolves the issue. This paper focuses on the scenario where that code patch does not yet exist. Doing so supports two major use-cases. First, it supports TDD (test-driven development), the discipline of "test first, write code later" that has well-documented benefits for human software engineers. Second, it also validates SWE (software engineering) agents, which generate code patches for resolving issues. This paper introduces TDD-Bench-Verified, a benchmark for generating tests from issues, and Otter, an LLM-based solution for this task. Otter augments LLMs with rule-based analysis to check and repair their outputs, and introduces a novel self-reflective action planner. Experiments show Otter outperforming state-of-the-art systems for generating tests from issues, in addition to enhancing systems that generate patches from issues. We hope that Otter helps make developers more productive at resolving issues and leads to more robust, well-tested code.

Poster

#W-315

Are Large Language Models Ready for Multi-Turn Tabular Data Analysis?

Jinyang Li · Nan Huo · Yan Gao · Jiayi Shi · Yingxiu Zhao · Qu Ge · Bowen Qin · Yurong Wu · Xiaodong Li · Chenhao Ma · Jian-Guang Lou · Reynold Cheng

Conversational Tabular Data Analysis, a collaboration between humans and machines, enables real-time data exploration for informed decision-making. The challenges and costs of collecting realistic conversational logs for tabular data analysis hinder comprehensive quantitative evaluation of Large Language Models (LLMs) in this task. To mitigate this issue, we introduce CoTA, a new benchmark to evaluate LLMs on conversational tabular data analysis. CoTA contains 1013 conversations, covering 4 practical scenarios: Normal, Action, Private, and Private Action. Notably, CoTA is constructed by an economical multi-agent environment, Decision Company, with few human efforts. This environment ensures efficiency and scalability of generating new conversational data. Our comprehensive study, conducted by data analysis experts, demonstrates that Decision Company is capable of producing diverse and high-quality data, laying the groundwork for efficient data annotation. We evaluate popular and advanced LLMs in CoTA, which highlights the challenges of conversational tabular data analysis. Furthermore, we propose Adaptive Conversation Reflection (ACR), a self-generated reflection strategy that guides LLMs to learn from successful histories. Experiments demonstrate that ACR can evolve LLMs into effective conversational data analysis agents, achieving a relative performance improvement of up to 35.14%.

Poster

#W-316

Aligning Spoken Dialogue Models from User Interactions

Anne Wu · Laurent Mazaré · Neil Zeghidour · Alexandre Défossez

We propose a novel preference alignment framework for improving spoken dialogue models on real-time conversations from user interactions. Current preference learning methods primarily focus on text-based language models, and are not directly suited to the complexities of real-time speech interactions, with richer dynamics (e.g. interruption, interjection) and no explicit segmentation between speaker turns.We create a large-scale dataset of more than 150,000 preference pairs from raw multi-turn speech conversations, annotated with AI feedback, to cover preferences over both linguistic content and temporal context variations. We leverage offline alignment methods to finetune a full-duplex autoregressive speech-to-speech model. Extensive experiments demonstrate that feedback on generic conversations can be consistently effective in improving spoken dialogue models to produce more factual, safer and more contextually aligned interactions. We deploy the finetuned model and conduct holistic human evaluations to assess the impact beyond single-turn conversations. Our findings shed light on the importance of a well-calibrated balance among various dynamics, crucial for natural real-time speech dialogue systems.

Poster

#W-317

SING: Spatial Context in Large Language Model for Next-Gen Wearables

Ayushi Mishra · Yang Bai · Priyadarshan Narayanasamy · Nakul Garg · Nirupam Roy

Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of existing dataset for microstructure-assisted speech recordings, we synthetically create a dataset by using the LibriSpeech dataset. This spatial information is fused with linguistic embeddings from OpenAI’s Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of LLaMA-3.2 3B model and fine-tuned with lightweight adaptation technique LoRA to optimize for on-device processing. SING supports spatially-aware automatic speech recognition (ASR), achieving a mean error of 25.72°—a substantial improvement compared to the 88.52° median error in existing work—with a word error rate (WER) of 5.3. SING also supports soundscaping, for example, inference how many people were talking and their directions, with up to 5 people and a median DoA error of 16°. Our system demonstrates superior performance in spatial speech understanding while addressing the challenges of power efficiency, privacy, and hardware constraints, paving the way for advanced applications in augmented reality, accessibility, and immersive experiences.

Poster

#W-318

OmniAudio: Generating Spatial Audio from 360-Degree Video

Huadai Liu · Tianyi Luo · Kaicheng Luo · Qikai Jiang · Peiwen Sun · Jialei Wang · Rongjie Huang · Qian Chen · Wen Wang · Xiangtai Li · ShiLiang Zhang · Zhijie Yan · Zhou Zhao · Wei Xue

Traditional video-to-audio generation techniques primarily focus on perspective video and non-spatial audio, often missing the spatial cues necessary for accurately representing sound sources in 3D environments. To address this limitation, we introduce a novel task, \textbf{360V2SA}, to generate spatial audio from 360-degree videos, specifically producing First-order Ambisonics (FOA) audio - a standard format for representing 3D spatial audio that captures sound directionality and enables realistic 3D audio reproduction. We first create \textbf{Sphere360}, a novel dataset tailored for this task that is curated from real-world data. We also design an efficient semi-automated pipeline for collecting and cleaning paired video-audio data. To generate spatial audio from 360-degree video, we propose a novel framework \textbf{OmniAudio}, which leverages self-supervised pre-training using both spatial audio data (in FOA format) and large-scale non-spatial data. Furthermore, OmniAudio features a dual-branch framework that utilizes both panoramic and perspective video inputs to capture comprehensive local and global information from 360-degree videos. Experimental results demonstrate that OmniAudio achieves state-of-the-art performance across both objective and subjective metrics on Sphere360. Code and datasets are available at~\href{https://github.com/liuhuadai/OmniAudio}{\texttt{github.com/liuhuadai/OmniAudio}}. The project website is available at \href{https://OmniAudio-360V2SA.github.io}{\texttt{OmniAudio-360V2SA.github.io}}.

Poster

#W-319

MindCustomer: Multi-Context Image Generation Blended with Brain Signal

Muzhou Yu · Shuyun Lin · Lei Ma · Bo Lei · Kaisheng Ma

Advancements in generative models have promoted text- and image-based multi-context image generation. Brain signals, offering a direct representation of user intent, present new opportunities for image customization. However, it faces challenges in brain interpretation, cross-modal context fusion and retention. In this paper, we present MindCustomer to explore the blending of visual brain signals in multi-context image generation. We first design shared neural data augmentation for stable cross-subject brain embedding by introducing the Image-Brain Translator (IBT) to generate brain responses from visual images. Then, we propose an effective cross-modal information fusion pipeline that mask-freely adapts distinct semantics from image and brain contexts within a diffusion model. It resolves semantic conflicts for context preservation and enables harmonious context integration. During the fusion pipeline, we further utilize the IBT to transfer image context to the brain representation to mitigate the cross-modal disparity. MindCustomer enables cross-subject generation, delivering unified, high-quality, and natural image outputs. Moreover, it exhibits strong generalization for new subjects via few-shot learning, indicating the potential for practical application. As the first work for multi-context blending with brain signal, MindCustomer lays a foundational exploration and inspiration for future brain-controlled generative technologies.

Poster

#W-320

Mind Your Step (by Step): Chain-of-Thought can Reduce Performance on Tasks where Thinking Makes Humans Worse

Ryan Liu · Jiayi Geng · Addison J. Wu · Ilia Sucholutsky · Tania Lombrozo · Thomas Griffiths

Chain-of-thought (CoT) prompting has become a widely used strategy for improving large language and multimodal model performance. However, it is still an open question under which settings CoT systematically reduces performance. In this paper, we seek to identify the characteristics of tasks where CoT reduces performance by drawing inspiration from cognitive psychology, focusing on six representative tasks from the psychological literature where deliberation hurts performance in humans. In three of these tasks, state-of-the-art models exhibit significant performance drop-offs with CoT (up to 36.3\% absolute accuracy for OpenAI o1-preview compared to GPT-4o), while in others, CoT effects are mixed, with positive, neutral, and negative changes. While models and humans do not exhibit perfectly parallel cognitive processes, considering cases where thinking has negative consequences for humans helps identify settings where it negatively impacts models. By connecting the literature on human verbal thinking and deliberation with evaluations of CoT, we offer a perspective for understanding the impact of inference-time reasoning.

Poster

#W-321

Dendritic Localized Learning: Toward Biologically Plausible Algorithm

Changze Lv · Jingwen Xu · Yiyang Lu · Xiaohua Wang · Zhenghua Wang · Zhibo Xu · Di Yu · Xin Du · Xiaoqing Zheng · Xuanjing Huang

Backpropagation is the foundational algorithm for training neural networks and a key driver of deep learning's success.However, its biological plausibility has been challenged due to three primary limitations: weight symmetry, reliance on global error signals, and the dual-phase nature of training, as highlighted by the existing literature. Although various alternative learning approaches have been proposed to address these issues, most either fail to satisfy all three criteria simultaneously or yield suboptimal results.Inspired by the dynamics and plasticity of pyramidal neurons, we propose Dendritic Localized Learning (DLL), a novel learning algorithm designed to overcome these challenges.Extensive empirical experiments demonstrate that DLL satisfies all three criteria of biological plausibility while achieving state-of-the-art performance among algorithms that meet these requirements.Furthermore, DLL exhibits strong generalization across a range of architectures, including MLPs, CNNs, and RNNs.These results, benchmarked against existing biologically plausible learning algorithms, offer valuable empirical insights for future research.We hope this study can inspire the development of new biologically plausible algorithms for training multilayer networks and advancing progress in both neuroscience and machine learning.Our code is available at https://github.com/Lvchangze/Dendritic-Localized-Learning.

Poster

#W-400

Galileo: Learning Global & Local Features of Many Remote Sensing Modalities

Gabriel Tseng · Anthony Fuller · Marlena Reil · Henry Herzog · Patrick Beukema · Favyen Bastani · James Green · Evan Shelhamer · Hannah Kerner · David Rolnick

We introduce a highly multimodal transformer to represent many remote sensing modalities - multispectral optical, synthetic aperture radar, elevation, weather, pseudo-labels, and more - across space and time. These inputs are useful for diverse remote sensing tasks, such as crop mapping and flood detection. However, learning shared representations of remote sensing data is challenging, given the diversity of relevant data modalities, and because objects of interest vary massively in scale, from small boats (1-2 pixels and fast) to glaciers (thousands of pixels and slow). We present a novel self-supervised learning algorithm that extracts multi-scale features across a flexible set of input modalities through masked modeling. Our dual global and local contrastive losses differ in their targets (deep representations vs. shallow input projections) and masking strategies (structured vs. not). Our Galileo is a single generalist model that outperforms SoTA specialist models for satellite images and pixel time series across eleven benchmarks and multiple tasks.

Spotlight Poster

#W-401

Neural Discovery in Mathematics: Do Machines Dream of Colored Planes?

Konrad Mundinger · Max Zimmer · Aldo Kiem · Christoph Spiegel · Sebastian Pokutta

We demonstrate how neural networks can drive mathematical discovery through a case study of the Hadwiger-Nelson problem, a long-standing open problem at the intersection of discrete geometry and extremal combinatorics that is concerned with coloring the plane while avoiding monochromatic unit-distance pairs. Using neural networks as approximators, we reformulate this mixed discrete-continuous geometric coloring problem with hard constraints as an optimization task with a probabilistic, differentiable loss function. This enables gradient-based exploration of admissible configurations that most significantly led to the discovery of two novel six-colorings, providing the first improvement in thirty years to the off-diagonal variant of the original problem (Mundinger et al., 2024a). Here, we establish the underlying machine learning approach used to obtain these results and demonstrate its broader applicability through additional numerical insights.

Poster

#W-402

AlphaQCM: Alpha Discovery in Finance with Distributional Reinforcement Learning

Zhoufan Zhu · Ke Zhu

For researchers and practitioners in finance, finding synergistic formulaic alphas is very important but challenging. In this paper, we reconsider the discovery of synergistic formulaic alphas from the viewpoint of sequential decision-making, and conceptualize the entire alpha discovery process as a non-stationary and reward-sparse Markov decision process. To overcome the challenges of non-stationarity and reward-sparsity, we propose the AlphaQCM method, a novel distributional reinforcement learning method designed to search for synergistic formulaic alphas efficiently. The AlphaQCM method first learns the Q function and quantiles via a Q network and a quantile network, respectively. Then, the AlphaQCM method applies the quantiled conditional moment method to learn unbiased variance from the potentially biased quantiles. Guided by the learned Q function and variance, the AlphaQCM method navigates the non-stationarity and reward-sparsity to explore the vast search space of formulaic alphas with high efficacy. Empirical applications to real-world datasets demonstrate that our AlphaQCM method significantly outperforms its competitors, particularly when dealing with large datasets comprising numerous stocks.

Poster

#W-403

Unbiased Recommender Learning from Implicit Feedback via Weakly Supervised Learning

Eric Wang · Zhichao Chen · Haotian Wang · Yanchao Tan · Licheng Pan · Tianqiao Liu · Xu Chen · Haoxuan Li · Zhouchen Lin

Implicit feedback recommendation is challenged by the missing negative feedback essential for effective model training. Existing methods often resort to negative sampling, a technique that assumes unlabeled interactions as negative samples. This assumption risks misclassifying potential positive samples within the unlabeled data, thereby undermining model performance. To address this issue, we introduce PURL, a model-agnostic framework that reframes implicit feedback recommendation as a weakly supervised learning task, eliminating the need for negative samples. However, its unbiasedness hinges on the accurate estimation of the class prior. To address this challenge, we propose Progressive Proximal Transport (PPT), which estimates the class prior by minimizing the proximal transport cost between positive and unlabeled samples. Experiments on three real-world datasets validate the efficacy of PURL in terms of improved recommendation quality. Code is available at https://github.com/HowardZJU/weakrec.

Poster

#W-404

AdaPTS: Adapting Univariate Foundation Models to Probabilistic Multivariate Time Series Forecasting

Abdelhakim Benechehab · Vasilii Feofanov · Giuseppe Paolo · Albert Thomas · Maurizio Filippone · Balázs Kégl

Pre-trained foundation models (FMs) have shown exceptional performance in univariate time series forecasting tasks. However, several practical challenges persist, including managing intricate dependencies among features and quantifying uncertainty in predictions. This study aims to tackle these critical limitations by introducing adapters—feature-space transformations that facilitate the effective use of pre-trained univariate time series FMs for multivariate tasks. Adapters operate by projecting multivariate inputs into a suitable latent space and applying the FM independently to each dimension. Inspired by the literature on representation learning and partially stochastic Bayesian neural networks, we present a range of adapters and optimization/inference strategies. Experiments conducted on both synthetic and real-world datasets confirm the efficacy of adapters, demonstrating substantial enhancements in forecasting accuracy and uncertainty quantification compared to baseline methods. Our framework, AdaPTS, positions adapters as a modular, scalable, and effective solution for leveraging time series FMs in multivariate contexts, thereby promoting their wider adoption in real-world applications. We release the code at https://github.com/abenechehab/AdaPTS.

Poster

#W-405

Cross-City Latent Space Alignment for Consistency Region Embedding

Meng Chen · Hongwei Jia · Zechen Li · Wenzhen Jia · Kai Zhao · Hongjun Dai · Weiming Huang

Learning urban region embeddings has substantially advanced urban analysis, but their typical focus on individual cities leads to disparate embedding spaces, hindering cross-city knowledge transfer and the reuse of downstream task predictors. To tackle this issue, we present Consistent Region Embedding (CoRE), a unified framework integrating region embedding learning with cross-city latent space alignment. CoRE first embeds regions from two cities into separate latent spaces, followed by the alignment of latent space manifolds and fine-grained individual regions from both cities. This ensures compatible and comparable embeddings within aligned latent spaces, enabling predictions of various socioeconomic indicators without ground truth labels by migrating knowledge from label-rich cities. Extensive experiments show CoRE outperforms competitive baselines, confirming its effectiveness for cross-city knowledge transfer via aligned latent spaces.

Poster

#W-406

Counterfactual Voting Adjustment for Quality Assessment and Fairer Voting in Online Platforms with Helpfulness Evaluation

Chang Liu · Yixin Wang · Moontae Lee

Efficient access to high-quality information is vital for online platforms. To promote more useful information, users not only create new content but also evaluate existing content, often through helpfulness voting. Although aggregated votes help service providers rank their user content, these votes are often biased by disparate accessibility per position and the cascaded influence of prior votes. For a fairer assessment of information quality, we propose the Counterfactual Voting Adjustment (CVA), a causal framework that accounts for the context in which individual votes are cast. Through preliminary and semi-synthetic experiments, we show that CVA effectively models the position and herding biases, accurately recovering the predefined content quality. In a real experiment, we demonstrate that reranking content based on the learned quality by CVA exhibits stronger alignment with both user sentiment and quality evaluation assessed by GPT-4o, outperforming system rankings based on aggregated votes and model-based rerankings without causal inference. Beyond the individual quality inference, our embeddings offer comparative insights into the behavioral dynamics of expert user groups across 120 major StackExchange communities.

Poster

#W-407

DynaMind: Reasoning over Abstract Video Dynamics for Embodied Decision-Making

Ziru Wang · Mengmeng Wang · Jade Dai · Teli Ma · Guo-Jun Qi · Yong Liu · Guang Dai · Jingdong Wang

Integrating natural language instructions and visual perception with decision-making is a critical challenge for embodied agents. Existing methods often struggle to balance the conciseness of language commands with the richness of video content. To bridge the gap between modalities, we propose extracting key spatiotemporal patterns from video that capture visual saliency and temporal evolution, referred to as dynamic representation. Building on this, we introduce DynaMind, a framework that enhances decision-making through dynamic reasoning. Specifically, we design an adaptive FrameScorer to evaluate video frames based on semantic consistency and visual saliency, assigning each frame an importance score. These scores are used to filter redundant video content and synthesize compact dynamic representations. Leveraging these representations, we predict critical future dynamics and apply a dynamic-guided policy to generate coherent and context-aware actions. Extensive results demonstrate that DynaMind significantly outperforms the baselines across several simulation benchmarks and real-world scenarios.

Poster

#W-408

ELEMENTAL: Interactive Learning from Demonstrations and Vision-Language Models for Reward Design in Robotics

Letian Chen · Nina Moorman · Matthew Gombolay

Reinforcement learning (RL) has demonstrated compelling performance in robotic tasks, but its success often hinges on the design of complex, ad hoc reward functions. Researchers have explored how Large Language Models (LLMs) could enable non-expert users to specify reward functions more easily. However, LLMs struggle to balance the importance of different features, generalize poorly to out-of-distribution robotic tasks, and cannot represent the problem properly with only text-based descriptions. To address these challenges, we propose ELEMENTAL (intEractive LEarning froM dEmoNstraTion And Language), a novel framework that combines natural language guidance with visual user demonstrations to align robot behavior with user intentions better. By incorporating visual inputs, ELEMENTAL overcomes the limitations of text-only task specifications, while leveraging inverse reinforcement learning (IRL) to balance feature weights and match the demonstrated behaviors optimally. ELEMENTAL also introduces an iterative feedback-loop through self-reflection to improve feature, reward, and policy learning. Our experiment results demonstrate that ELEMENTAL outperforms prior work by 42.3% on task success, and achieves 41.3% better generalization in out-of-distribution tasks, highlighting its robustness in LfD.

Poster

#W-409

Pre-training Auto-regressive Robotic Models with 4D Representations

Dantong Niu · Yuvan Sharma · Haoru Xue · Giscard Biamby · Junyi Zhang · Ziteng Ji · Trevor Darrell · Roi Herzig

Foundation models pre-trained on massive unlabeled datasets have revolutionized natural language and computer vision, exhibiting remarkable generalization capabilities, thus highlighting the importance of pre-training. Yet, efforts in robotics have struggled to achieve similar success, limited by either the need for costly robotic annotations or the lack of representations that effectively model the physical world. In this paper, we introduce ARM4R, an Auto-regressive Robotic Model that leverages low-level 4D Representations learned from human video data to yield a better pre-trained robotic model. Specifically, we focus on utilizing 3D point tracking representations from videos derived by lifting 2D representations into 3D space via monocular depth estimation across time. These 4D representations maintain a shared geometric structure between the points and robot state representations up to a linear transformation, enabling efficient transfer learning from human video data to low-level robotic control. Our experiments show that ARM4R can transfer efficiently from human video data to robotics and consistently improves performance on tasks across various robot environments and configurations.

Poster

#W-410

Dynamical phases of short-term memory mechanisms in RNNs

Bariscan Kurtkaya · Fatih Dinc · Mert Yuksekgonul · Marta Blanco-Pozo · Ege Cirakman · Mark Schnitzer · Yucel Yemez · Hidenori Tanaka · Yuan · Nina Miolane

Short-term memory is essential for cognitive processing, yet our understanding of its neural mechanisms remains unclear. Neuroscience has long focused on how sequential activity patterns, where neurons fire one after another within large networks, can explain how information is maintained. While recurrent connections were shown to drive sequential dynamics, a mechanistic understanding of this process still remains unknown. In this work, we introduce two unique mechanisms that can support this form of short-term memory: slow-point manifolds generating direct sequences or limit cycles providing temporally localized approximations. Using analytical models, we identify fundamental properties that govern the selection of each mechanism. Precisely, on short-term memory tasks (delayed cue-discrimination tasks), we derive theoretical scaling laws for critical learning rates as a function of the delay period length, beyond which no learning is possible. We empirically verify these results by training and evaluating approximately 80,000 recurrent neural networks (RNNs), which are publicly available for further analysis. Overall, our work provides new insights into short-term memory mechanisms and proposes experimentally testable predictions for systems neuroscience.

Poster

#W-411

EEG-Language Pretraining for Highly Label-Efficient Clinical Phenotyping

Sam Gijsen · Kerstin Ritter

Multimodal language modeling has enabled breakthroughs for representation learning, yet remains unexplored in the realm of functional brain data for clinical phenotyping. This paper pioneers EEG-language models (ELMs) trained on clinical reports and 15000 EEGs. We propose to combine multimodal alignment in this novel domain with timeseries cropping and text segmentation, enabling an extension based on multiple instance learning to alleviate misalignment between irrelevant EEG or text segments. Our multimodal models significantly improve over EEG-only models across four clinical evaluations and for the first time enable zero-shot classification as well as retrieval of both neural signals and reports. In sum, these results highlight the potential of ELMs, representing significant progress for clinical applications.

Poster

#W-412

Inverse Reinforcement Learning with Switching Rewards and History Dependency for Characterizing Animal Behaviors

Jingyang Ke · Feiyang Wu · Jiyi Wang · Jeffrey Markowitz · Anqi Wu

Traditional approaches to studying decision-making in neuroscience focus on simplified behavioral tasks where animals perform repetitive, stereotyped actions to receive explicit rewards. While informative, these methods constrain our understanding of decision-making to short timescale behaviors driven by explicit goals. In natural environments, animals exhibit more complex, long-term behaviors driven by intrinsic motivations that are often unobservable. Recent works in time-varying inverse reinforcement learning (IRL) aim to capture shifting motivations in long-term, freely moving behaviors. However, a crucial challenge remains: animals make decisions based on their history, not just their current state. To address this, we introduce SWIRL (SWitching IRL), a novel framework that extends traditional IRL by incorporating time-varying, history-dependent reward functions. SWIRL models long behavioral sequences as transitions between short-term decision-making processes, each governed by a unique reward function. SWIRL incorporates biologically plausible history dependency to capture how past decisions and environmental contexts shape behavior, offering a more accurate description of animal decision-making. We apply SWIRL to simulated and real-world animal behavior datasets and show that it outperforms models lacking history dependency, both quantitatively and qualitatively. This work presents the first IRL model to incorporate history-dependent policies and rewards to advance our understanding of complex, naturalistic decision-making in animals.

Poster

#W-413

ReverB-SNN: Reversing Bit of the Weight and Activation for Spiking Neural Networks

Yufei Guo · Yuhan Zhang · Zhou Jie · Xiaode Liu · Xin Tong · Yuanpei Chen · Weihang Peng · Zhe Ma

The Spiking Neural Network (SNN), a biologically inspired neural network infrastructure, has garnered significant attention recently. SNNs utilize binary spike activations for efficient information transmission, replacing multiplications with additions, thereby enhancing energy efficiency. However, binary spike activation maps often fail to capture sufficient data information, resulting in reduced accuracy.To address this challenge, we advocate reversing the bit of the weight and activation, called \textbf{ReverB}, inspired by recent findings that highlight greater accuracy degradation from quantizing activations compared to weights. Specifically, our method employs real-valued spike activations alongside binary weights in SNNs. This preserves the event-driven and multiplication-free advantages of standard SNNs while enhancing the information capacity of activations.Additionally, we introduce a trainable factor within binary weights to adaptively learn suitable weight amplitudes during training, thereby increasing network capacity. To maintain efficiency akin to vanilla \textbf{ReverB}, our trainable binary weight SNNs are converted back to standard form using a re-parameterization technique during inference.Extensive experiments across various network architectures and datasets, both static and dynamic, demonstrate that our approach consistently outperforms state-of-the-art methods.

Poster

#W-414

A Multi-Region Brain Model to Elucidate the Role of Hippocampus in Spatially Embedded Decision-Making

Yi Xie · Jaedong Hwang · Carlos Brody · David Tank · Ila R. Fiete

Brains excel at robust decision-making and data-efficient learning. Understanding the architectures and dynamics underlying these capabilities can inform inductive biases for deep learning. We present a multi-region brain model that explores the normative role of structured memory circuits in a spatially embedded binary decision-making task from neuroscience.We counterfactually compare the learning performance and neural representations of reinforcement learning (RL) agents with brain models of different interaction architectures between grid and place cells in the entorhinal cortex and hippocampus, coupled with an action-selection cortical recurrent neural network. We demonstrate that a specific architecture--where grid cells receive and jointly encode self-movement velocity signals and decision evidence increments--optimizes learning efficiency while best reproducing experimental observations relative to alternative architectures.Our findings thus suggest brain-inspired structured architectures for efficient RL. Importantly, the models make novel, testable predictions about organization and information flow within the entorhinal-hippocampal-neocortical circuit: we predict that grid cells must conjunctively encode position and evidence for effective spatial decision-making, directly motivating new neurophysiological experiments.

Poster

#W-415

The Brain's Bitter Lesson: Scaling Speech Decoding With Self-Supervised Learning

Dulhan Jayalath · Gilad Landau · Brendan Shillingford · Mark Woolrich · ʻŌiwi Parker Jones

The past few years have seen remarkable progress in the decoding of speech from brain activity, primarily driven by large single-subject datasets. However, due to individual variation, such as anatomy, and differences in task design and scanning hardware, leveraging data across subjects and datasets remains challenging. In turn, the field has not benefited from the growing number of open neural data repositories to exploit large-scale deep learning. To address this, we develop neuroscience-informed self-supervised objectives, together with an architecture, for learning from heterogeneous brain recordings. Scaling to nearly 400 hours of MEG data and 900 subjects, our approach shows generalisation across participants, datasets, tasks, and even to novel subjects. It achieves improvements of 15-27% over state-of-the-art models and matches surgical decoding performance with non-invasive data. These advances unlock the potential for scaling speech decoding models beyond the current frontier.

Poster

#W-416

Flow Matching for Few-Trial Neural Adaptation with Stable Latent Dynamics

Puli Wang · Yu Qi · Yueming Wang · Gang Pan

The primary goal of brain-computer interfaces (BCIs) is to establish a direct linkage between neural activities and behavioral actions via neural decoders. Due to the nonstationary property of neural signals, BCIs trained on one day usually obtain degraded performance on other days, hindering the user experience. Existing studies attempted to address this problem by aligning neural signals across different days. However, these neural adaptation methods may exhibit instability and poor performance when only a few trials are available for alignment, limiting their practicality in real-world BCI deployment. To achieve efficient and stable neural adaptation with few trials, we propose Flow-Based Distribution Alignment (FDA), a novel framework that utilizes flow matching to learn flexible neural representations with stable latent dynamics, thereby facilitating source-free domain alignment through likelihood maximization. The latent dynamics of FDA framework is theoretically proven to be stable using Lyapunov exponents, allowing for robust adaptation. Further experiments across multiple motor cortex datasets demonstrate the superior performance of FDA, achieving reliable results with fewer than five trials. Our FDA approach offers a novel and efficient solution for few-trial neural data adaptation, offering significant potential for improving the long-term viability of real-world BCI applications.

Poster

#W-417

Identifying Neural Dynamics Using Interventional State Space Models

Amin Nejatbakhsh · Yixin Wang

Neural circuits produce signals that are complex and nonlinear. To facilitate the understanding of neural dynamics, a popular approach is to fit state space models (SSM) to the data and analyze the dynamics of the low-dimensional latent variables. Despite the power of SSM to explain the dynamics of neural circuits, these models have been shown to merely capture statistical associations in the data and cannot be causally interpreted. Therefore, an important research problem is to build models that can predict neural dynamics under causal manipulations. Here, we propose interventional state-space models (iSSM), a class of causal models that can predict neural responses to novel perturbations. We draw on recent advances in causal dynamical systems and present theoretical results for the identifiability of iSSM. In simulations of the motor cortex, we show that iSSM can recover the true latents and the underlying dynamics. In addition, we illustrate two applications of iSSM in biological datasets. First, we applied iSSM to a dataset of calcium recordings from ALM neurons in mice during photostimulation. Second, we applied iSSM to a dataset of electrophysiological recordings from macaque dlPFC during micro-stimulation. In both cases, we show that iSSM outperforms SSM and results in identifiable parameters. The code is available at https://github.com/amin-nejat/issm.

Poster

#W-418

Human-Aligned Image Models Improve Visual Decoding from the Brain

Nona Rajabi · Antonio Ribeiro · Miguel Vasco · Farzaneh Taleb · Mårten Björkman · Danica Kragic

Decoding visual images from brain activity has significant potential for advancing brain-computer interaction and enhancing the understanding of human perception. Recent approaches align the representation spaces of images and brain activity to enable visual decoding. In this paper, we introduce the use of human-aligned image encoders to map brain signals to images. We hypothesize that these models more effectively capture perceptual attributes associated with the rapid visual stimuli presentations commonly used in visual brain data recording experiments. Our empirical results support this hypothesis, demonstrating that this simple modification improves image retrieval accuracy by up to 21\% compared to state-of-the-art methods. Comprehensive experiments confirm consistent performance improvements across diverse EEG architectures, image encoders, alignment methods, participants, and brain imaging modalities.

Spotlight Poster

#W-419

Feature Learning beyond the Lazy-Rich Dichotomy: Insights from Representational Geometry

Chi-Ning Chou · Hang Le · Yichen Wang · SueYeon Chung

Integrating task-relevant information into neural representations is a fundamental ability of both biological and artificial intelligence systems. Recent theories have categorized learning into two regimes: the rich regime, where neural networks actively learn task-relevant features, and the lazy regime, where networks behave like random feature models. Yet this simple lazy–rich dichotomy overlooks a diverse underlying taxonomy of feature learning, shaped by differences in learning algorithms, network architectures, and data properties. To address this gap, we introduce an analysis framework to study feature learning via the geometry of neural representations. Rather than inspecting individual learned features, we characterize how task-relevant representational manifolds evolve throughout the learning process. We show, in both theoretical and empirical settings, that as networks learn features, task-relevant manifolds untangle, with changes in manifold geometry revealing distinct learning stages and strategies beyond the lazy–rich dichotomy. This framework provides novel insights into feature learning across neuroscience and machine learning, shedding light on structural inductive biases in neural circuits and the mechanisms underlying out-of-distribution generalization.

Spotlight Poster

#W-420

Not all solutions are created equal: An analytical dissociation of functional and representational similarity in deep linear neural networks

Lukas Braun · Erin Grant · Andrew Saxe

A foundational principle of connectionism is that perception, action, and cognition emerge from parallel computations among simple, interconnected units that generate and rely on neural representations. Accordingly, researchers employ multivariate pattern analysis to decode and compare the neural codes of artificial and biological networks, aiming to uncover their functions. However, there is limited analytical understanding of how a network’s representation and function relate, despite this being essential to any quantitative notion of underlying function or functional similarity. We address this question using fully analysable two-layer linear networks and numerical simulations in nonlinear networks. We find that function and representation are dissociated, allowing representational similarity without functional similarity and vice versa. Further, we show that neither robustness to input noise nor the level of generalisation error constrain representations to the task. In contrast, networks robust to parameter noise have limited representational flexibility and must employ task-specific representations. Our findings suggest that representational alignment reflects computational advantages beyond functional alignment alone, with significant implications for interpreting and comparing the representations of connectionist systems

Poster

#W-421

Synthesizing Images on Perceptual Boundaries of ANNs for Uncovering and Manipulating Human Perceptual Variability

Chen Wei · Chi Zhang · Jiachen Zou · Haotian Deng · Dietmar Heinke · Quanying Liu

Human decision-making in cognitive tasks and daily life exhibits considerable variability, shaped by factors such as task difficulty, individual preferences, and personal experiences. Understanding this variability across individuals is essential for uncovering the perceptual and decision-making mechanisms that humans rely on when faced with uncertainty and ambiguity. We propose a systematic Boundary Alignment Manipulation (BAM) framework for studying human perceptual variability through image generation. BAM combines perceptual boundary sampling in ANNs and human behavioral experiments to systematically investigate this phenomenon. Our perceptual boundary sampling algorithm generates stimuli along ANN perceptual boundaries that intrinsically induce significant perceptual variability. The efficacy of these stimuli is empirically validated through large-scale behavioral experiments involving 246 participants across 116,715 trials, culminating in the variMNIST dataset containing 19,943 systematically annotated images.Through personalized model alignment and adversarial generation, we establish a reliable method for simultaneously predicting and manipulating the divergent perceptual decisions of pairs of participants.This work bridges the gap between computational models and human individual difference research, providing new tools for personalized perception analysis. Code and data for this work are publicly available.

Spotlight Poster

#W-500

MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

Mahir Labib Dihan · Tanvir Hassan · Md Tanvir Parvez · Hasebul Hasan · Almash Alam · Muhammad Aamir Cheema · Mohammed Eunus Ali · Md Rizwan Parvez

Recent advancements in foundation models have improved autonomous tool usage and reasoning, but their capabilities in map-based reasoning remain underexplored. To address this, we introduce MapEval, a benchmark designed to assess foundation models across three distinct tasks—textual, API-based, and visual reasoning—through 700 multiple-choice questions spanning 180 cities and 54 countries, covering spatial relationships, navigation, travel planning, and real-world map interactions. Unlike prior benchmarks that focus on simple location queries, MapEval requires models to handle long-context reasoning, API interactions and visual map analysis, making it the most comprehensive evaluation framework for geospatial AI. On evaluation of 30 foundation models, including Claude-3.5-Sonnet, GPT-4o, Gemini-1.5-Pro, none surpasses 67% accuracy, with open-source models performing significantly worse and all models lagging over 20% behind human performance. These results expose critical gaps in spatial inference, as models struggle with distances, directions, route planning, and place-specific reasoning, highlighting the need for better geospatial AI to bridge the gap between foundation models and real-world navigation.

Spotlight Poster

#W-501

Bridging Layout and RTL: Knowledge Distillation based Timing Prediction

Mingjun Wang · Yihan Wen · Bin Sun · Jianan Mu · Juan Li · Xiaoyi Wang · Jing Ye · Bei Yu · Huawei Li

Accurate and efficient timing prediction at the register-transfer level (RTL) remains a fundamental challenge in electronic design automation (EDA), particularly in striking a balance between accuracy and computational efficiency. While static timing analysis (STA) provides high-fidelity results through comprehensive physical parameters, its computational overhead makes it impractical for rapid design iterations. Conversely, existing RTL-level approaches sacrifice accuracy due to the limited physical information available. We propose RTLDistil, a novel cross-stage knowledge distillation framework that bridges this gap by transferring precise physical characteristics from a layout-aware teacher model (Teacher GNN) to an efficient RTL-level student model (Student GNN), both implemented as graph neural networks (GNNs). RTLDistil efficiently predicts key timing metrics, such as arrival time (AT), and employs a multi-granularity distillation strategy that captures timing-critical features at node, subgraph, and global levels. Experimental results demonstrate that RTLDistil achieves significant improvement in RTL-level timing prediction error reduction, compared to state-of-the-art prediction models. This framework enables accurate early-stage timing prediction, advancing EDA's ``left-shift'' paradigm while maintaining computational efficiency. Our code and dataset will be publicly available at https://github.com/sklp-eda-lab/RTLDistil.

Poster

#W-502

Bridging Protein Sequences and Microscopy Images with Unified Diffusion Models

Dihan Zheng · Bo Huang

Fluorescence microscopy is ubiquitously used in cell biology research to characterize the cellular role of a protein. To help elucidate the relationship between the amino acid sequence of a protein and its cellular function, we introduce CELL-Diff, a unified diffusion model facilitating bidirectional transformations between protein sequences and their corresponding microscopy images. Utilizing reference cell morphology images and a protein sequence, CELL-Diff efficiently generates corresponding protein images. Conversely, given a protein image, the model outputs protein sequences. CELL-Diff integrates continuous and diffusion models within a unified framework and is implemented using a transformer-based network. We train CELL-Diff on the Human Protein Atlas (HPA) dataset and fine-tune it on the OpenCell dataset. Experimental results demonstrate that CELL-Diff outperforms existing methods in generating high-fidelity protein images, making it a practical tool for investigating subcellular protein localization and interactions.

Spotlight Poster

#W-503

Machine Learning meets Algebraic Combinatorics: A Suite of Datasets Capturing Research-level Conjecturing Ability in Pure Mathematics

Herman Chau · Helen Jenne · Davis Brown · Jesse He · Mark Raugas · Sara Billey · Henry Kvinge

With recent dramatic increases in AI system capabilities, there has been growing interest in utilizing machine learning for reasoning-heavy, quantitative tasks, particularly mathematics. While there are many resources capturing mathematics at the high-school, undergraduate, and graduate level, there are far fewer resources available that align with the level of difficulty and open endedness encountered by professional mathematicians working on open problems. To address this, we introduce a new collection of datasets, the Algebraic Combinatorics Dataset Repository (ACD Repo), representing either foundational results or open problems in algebraic combinatorics, a subfield of mathematics that studies discrete structures arising from abstract algebra. Further differentiating our dataset collection is the fact that it aims at the conjecturing process. Each dataset includes an open-ended research level question and a large collection of examples (up to 10M in some cases) from which conjectures should be generated. We describe all nine datasets, the different ways machine learning models can be applied to them (e.g., training with narrow models followed by interpretability analysis or program synthesis with LLMs), and discuss some of the challenges involved in designing datasets like these.

Poster

#W-504

Feedforward Few-shot Species Range Estimation

Christian Lange · Max Hamilton · Elijah Cole · Alexander Shepard · Samuel Heinrich · Angela Zhu · Subhransu Maji · Grant Horn · Oisin Mac Aodha

Knowing where a particular species can or cannot be found on Earth is crucial for ecological research and conservation efforts. By mapping the spatial ranges of all species, we would obtain deeper insights into how global biodiversity is affected by climate change and habitat loss. However, accurate range estimates are only available for a relatively small proportion of all known species. For the majority of the remaining species, we typically only have a small number of records denoting the spatial locations where they have previously been observed. We outline a new approach for few-shot species range estimation to address the challenge of accurately estimating the range of a species from limited data. During inference, our model takes a set of spatial locations as input, along with optional metadata such as text or an image, and outputs a species encoding that can be used to predict the range of a previously unseen species in a feedforward manner. We evaluate our approach on two challenging benchmarks, where we obtain state-of-the-art range estimation performance, in a fraction of the compute time, compared to recent alternative approaches.

Poster

#W-505

In-Context Adaptation to Concept Drift for Learned Database Operations

Jiaqi Zhu · Shaofeng Cai · Shen · Gang Chen · Fang Deng · Beng Chin Ooi

Machine learning has demonstrated transformative potential for database operations, such as query optimization and in-database data analytics. However, dynamic database environments, characterized by frequent updates and evolving data distributions, introduce concept drift, which leads to performance degradation for learned models and limits their practical applicability. Addressing this challenge requires efficient frameworks capable of adapting to shifting concepts while minimizing the overhead of retraining or fine-tuning.In this paper, we propose FLAIR, an online adaptation framework that introduces a new paradigm called \textit{in-context adaptation} for learned database operations. FLAIR leverages the inherent property of data systems, i.e., immediate availability of execution results for predictions, to enable dynamic context construction. By formalizing adaptation as $f:(\mathbf{x} | \mathcal{C}_t) \to \mathbf{y}$, with $\mathcal{C}_t$ representing a dynamic context memory, FLAIR delivers predictions aligned with the current concept, eliminating the need for runtime parameter optimization. To achieve this, FLAIR integrates two key modules: a Task Featurization Module for encoding task-specific features into standardized representations, and a Dynamic Decision Engine, pre-trained via Bayesian meta-training, to adapt seamlessly using contextual information at runtime. Extensive experiments across key database tasks demonstrate that FLAIR outperforms state-of-the-art baselines, achieving up to $5.2\times$ faster adaptation and reducing error by 22.5\% for cardinality estimation.

Poster

#W-506

DEFAME: Dynamic Evidence-based FAct-checking with Multimodal Experts

Tobias Braun · Mark Rothermel · Marcus Rohrbach · Anna Rohrbach

The proliferation of disinformation demands reliable and scalable fact-checking solutions. We present Dynamic Evidence-based FAct-checking with Multimodal Experts (DEFAME), a modular, zero-shot MLLM pipeline for open-domain, text-image claim verification. DEFAME operates in a six-stage process, dynamically selecting the tools and search depth to extract and evaluate textual and visual evidence. Unlike prior approaches that are text-only, lack explainability, or rely solely on parametric knowledge, DEFAME performs end-to-end verification, accounting for images in claims and evidence while generating structured, multimodal reports. Evaluation on the popular benchmarks VERITE, AVeriTeC, and MOCHEG shows that DEFAME surpasses all previous methods, establishing itself as the new general state-of-the-art fact-checking system for uni- and multimodal fact-checking. Moreover, we introduce a new multimodal benchmark, ClaimReview2024+, featuring claims after the knowledge cutoff of GPT-4o, avoiding data leakage. Here, DEFAME drastically outperforms the GPT-4o baselines, showing temporal generalizability and the potential for real-time fact-checking.

Poster

#W-507

SPD: Sync-Point Drop for Efficient Tensor Parallelism of Large Language Models

Han-Byul Kim · Duc Hoang · Arnav Kundu · Mohammad Samragh · Minsik Cho

With the rapid expansion in the scale of large language models (LLMs), enabling efficient distributed inference across multiple computing units has become increasingly critical. However, communication overheads from popular distributed inference techniques such as Tensor Parallelism pose a significant challenge to achieve scalability and low latency. Therefore, we introduce a novel optimization technique, Sync-Point Drop (SPD), to reduce communication overheads in tensor parallelism by selectively dropping synchronization on attention outputs. In detail, we first propose a block design that allows execution to proceed without communication through SPD. Second, we apply different SPD strategies to attention blocks based on their sensitivity to the model accuracy. The proposed methods effectively alleviate communication bottlenecks while minimizing accuracy degradation during LLM inference, offering a scalable solution for diverse distributed environments: SPD offered about 20\% overall inference latency reduction with < 1\% accuracy regression for LLaMA2-70B inference over 8 GPUs.

Poster

#W-508

Loss Functions and Operators Generated by f-Divergences

Vincent Roulet · Tianlin Liu · Nino Vieillard · Michael Sander · Mathieu Blondel

The logistic loss (a.k.a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling. It is associated with the Kullback-Leibler (KL) divergence and the softargmax operator. In this work, we propose to construct new convex loss functions based on $f$-divergences. Our loss functions generalize the logistic loss in two directions: i) by replacing the KL divergence with $f$-divergences and ii) by allowing non-uniform reference measures. We instantiate our framework for numerous $f$-divergences, recovering existing losses and creating new ones.By analogy with the logistic loss, the loss function generated by an $f$-divergence is associated with an operator, that we dub $f$-softargmax. We derive a novel parallelizable bisection algorithm for computing the $f$-softargmax associated with any $f$-divergence.On the empirical side, one of the goals of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in a language model setting, including on pre-training, post-training (SFT) and distillation. We show that the loss function generated by the $\alpha$-divergence (which is equivalent to Tsallis $\alpha$-negentropy in the case of unit reference measures) with $\alpha=1.5$ performs well across several tasks.

Poster

#W-509

Enhancing Performance of Explainable AI Models with Constrained Concept Refinement

Geyu Liang · Senne Michielssen · Salar Fattahi

The trade-off between accuracy and interpretability has long been a challenge in machine learning (ML). This tension is particularly significant for emerging interpretable-by-design methods, which aim to redesign ML algorithms for trustworthy interpretability but often sacrifice accuracy in the process. In this paper, we address this gap by investigating the impact of deviations in concept representations—an essential component of interpretable models—on prediction performance and propose a novel framework to mitigate these effects. The framework builds on the principle of optimizing concept embeddings under constraints that preserve interpretability. Using a generative model as a test-bed, we rigorously prove that our algorithm achieves zero loss while progressively enhancing the interpretability of the resulting model. Additionally, we evaluate the practical performance of our proposed framework in generating explainable predictions for image classification tasks across various benchmarks. Compared to existing explainable methods, our approach not only improves prediction accuracy while preserving model interpretability across various large-scale benchmarks but also achieves this with significantly lower computational cost.

Poster

#W-510

The Surprising Agreement Between Convex Optimization Theory and Learning-Rate Scheduling for Large Model Training

Fabian Schaipp · Alexander Hägele · Adrien Taylor · Umut Simsekli · Francis Bach

We show that learning-rate schedules for large model training behave surprisingly similar to a performance bound from non-smooth convex optimization theory. We provide a bound for the constant schedule with linear cooldown; in particular, the practical benefit of cooldown is reflected in the bound due to the absence of logarithmic terms.Further, we show that this surprisingly close match between optimization theory and practice can be exploited for learning-rate tuning: we achieve noticeable improvements for training 124M and 210M Llama-type models by (i) extending the schedule for continued training with optimal learning-rate, and (ii) transferring the optimal learning-rate across schedules.

Poster

#W-511

FedECADO: A Dynamical System Model of Federated Learning

Aayushya Agarwal · Gauri Joshi · Lawrence Pileggi

Federated learning harnesses the power of distributed optimization to train a unified machine learning model across separate clients. However, heterogeneous data distributions and computational workloads can lead to inconsistent updates and limit model performance. This work tackles these challenges by proposing FedECADO, a new algorithm inspired by a dynamical system representation of the federated learning process. FedECADO addresses non-IID data distribution through an aggregate sensitivity model that reflects the amount of data processed by each client. To tackle heterogeneous computing, we design a multi-rate integration method with adaptive step-size selections that synchronizes active client updates in continuous time. Compared to prominent techniques, including FedProx, FedExp, and FedNova, FedECADO achieves higher classification accuracies in numerous heterogeneous scenarios.

Poster

#W-512

Distributed Event-Based Learning via ADMM

Guner Dilsad ER · Sebastian Trimpe · Michael Muehlebach

We consider a distributed learning problem, where agents minimize a global objective function by exchanging information over a network. Our approach has two distinct features: (i) It substantially reduces communication by triggering communication only when necessary, and (ii) it is agnostic to the data-distribution among the different agents. We can therefore guarantee convergence even if the local data-distributions of the agents are arbitrarily distinct. We analyze the convergence rate of the algorithm both in convex and nonconvex settings and derive accelerated convergence rates in a convex setting. We also characterize the effect of communication failures and demonstrate that our algorithm is robust to communication failures. The article concludes by presenting numerical results from distributed learning tasks on the MNIST and CIFAR-10 datasets. The experiments underline communication savings of 35\% or more due to the event-based communication strategy, show resilience towards heterogeneous data-distributions, and highlight that our approach outperforms common baselines such as FedAvg, FedProx, SCAFFOLD and FedADMM.

Poster

#W-513

Joint Learning of Energy-based Models and their Partition Function

Michael Sander · Vincent Roulet · Tianlin Liu · Mathieu Blondel

Energy-based models (EBMs) offer a flexible framework for parameterizing probability distributions using neural networks.However, learning EBMs by exact maximum likelihood estimation (MLE) is generally intractable, due to the need to compute the partition function.In this paper, we propose a novel min-min formulation for approximately learning probabilistic EBMs in combinatorially-large discrete spaces, such as sets or permutations. Our key idea is to jointly learn both an energy model and its log-partition, parameterized as a neural network. Our approach not only provides a novel tractable objective criterion to learn EBMs by stochastic gradient descent (without relying on MCMC), but also a novel means to estimate the log-partition function on unseen data points.On the theoretical side, we show that our approach recovers the optimal MLE solution when optimizing in the space of continuous functions.Furthermore, we show that our approach naturally extends to the broader family of Fenchel-Young losses, allowing us to obtainthe first tractable method for optimizing the sparsemax loss in combinatorially-large spaces.We demonstrate our approach on multilabel classification and label ranking.

Poster

#W-514

A Bregman Proximal Viewpoint on Neural Operators

Abdel-Rahim Mezidi · Jordan Patracone · Saverio Salzo · Amaury Habrard · Massimiliano Pontil · Rémi Emonet · Marc Sebban

We present several advances on neural operators by viewing the action of operator layers as the minimizers of Bregman regularized optimization problems over Banach function spaces. The proposed framework allows interpreting the activation operators as Bregman proximity operators from dual to primal space. This novel viewpoint is general enough to recover classical neural operators as well as a new variant, coined Bregman neural operators, which includes the inverse activation operator and features the same expressivity of standard neural operators. Numerical experiments support the added benefits of the Bregman variant of Fourier neural operators for training deeper and more accurate models.

Poster

#W-515

Multinoulli Extension: A Lossless Yet Effective Probabilistic Framework for Subset Selection over Partition Constraints

Qixin Zhang · Wei Huang · Can Jin · Puning Zhao · Yao Shu · Li Shen · Dacheng Tao

Identifying the most representative subset for a close-to-submodular objective while satisfying the predefined partition constraint is a fundamental task with numerous applications in machine learning. However, the existing distorted local-search methods are often hindered by their prohibitive query complexities and the rigid requirement for prior knowledge of difficult-to-obtain structural parameters. To overcome these limitations, we introduce a novel algorithm titled Multinoulli-SCG, which not only is parameter-free, but also can achieve the same approximation guarantees as the distorted local-search methods with significantly fewer function evaluations. The core of our Multinoulli-SCG algorithm is an innovative continuous-relaxation framework named Multinoulli Extension(ME), which can effectively convert the discrete subset selection problem subject to partition constraints into a solvable continuous maximization focused on learning the optimal multinoulli priors across the considered partition. In sharp contrast with the well-established multi-linear extension for submodular subset selection, a notable advantage of our proposed ME is its intrinsic capacity to provide a lossless rounding scheme for any set function. Finally, we validate the practical efficacy of our proposed algorithms by applying them to video summarization, bayesian A-optimal design and coverage maximization.

Poster

#W-516

Hybrid Quantum-Classical Multi-Agent Pathfinding

Thore Gerlach · Loong Kuan Lee · Frederic BARBARESCO · Nico Piatkowski

Multi-Agent Path Finding (MAPF) focuses on determining conflict-free paths for multiple agents navigating through a shared space to reach specified goal locations. This problem becomes computationally challenging, particularly when handling large numbers of agents, as frequently encountered in practical applications like coordinating autonomous vehicles. Quantum Computing (QC) is a promising candidate in overcoming such limits. However, current quantum hardware is still in its infancy and thus limited in terms of computing power and error robustness. In this work, we present the first optimal hybrid quantum-classical MAPF algorithms which are based on branch-and-cut-and-prize. QC is integrated by iteratively solving QUBO problems, based on conflict graphs. Experiments on actual quantum hardware and results on benchmark data suggest that our approach dominates previous QUBO formulations and state-of-the-art MAPF solvers.

Poster

#W-517

Improved Theoretically-Grounded Evolutionary Algorithms for Subset Selection with a Linear Cost Constraint

Dan-Xuan Liu · Chao Qian

The subset selection problem with a monotone and submodular objective function under a linear cost constraint has wide applications, such as maximum coverage, influence maximization, and feature selection, just to name a few. Various greedy algorithms have been proposed with good performance both theoretically and empirically. Recently, evolutionary algorithms (EAs), inspired by Darwin's evolution theory, have emerged as a prominent methodology, offering both empirical advantages and theoretical guarantees. Among these, the multi-objective EA, POMC, has demonstrated the best empirical performance to date, achieving an approximation guarantee of $(1/2)(1-1/e)$. However, there remains a gap in the approximation bounds of EAs compared to greedy algorithms, and their full theoretical potential is yet to be realized. In this paper, we re-analyze the approximation performance of POMC theoretically, and derive an improved guarantee of $1/2$, which thus provides theoretical justification for its encouraging empirical performance. Furthermore, we propose a novel multi-objective EA, EPOL, which not only achieves the best-known practical approximation guarantee of $0.6174$, but also delivers superior empirical performance in applications of maximum coverage and influence maximization. We hope this work can help better solving the subset selection problem, but also enhance our theoretical understanding of EAs.

Poster

#W-518

Simple Randomized Rounding for Max-Min Eigenvalue Augmentation

Jourdain Lamperski · Haeseong Yang · Oleg Prokopyev

We consider the *max-min eigenvalue augmentation* problem: given $n \times n$ symmetric positive semidefinite matrices $M,A_1,\ldots, A_m$ and a positive integer $k < m$, the goal is to choose a subset $I \subset \{1,\ldots,m\}$ of cardinality at most $k$ that maximizes the minimum eigenvalue of the matrix $M + \sum_{i \in I} A_i$. The problem captures both the *Bayesian E-optimal design* and *maximum algebraic connectivity augmentation* problems. In contrast to the existing work, we do not assume that the *augmentation matrices* are rank-one matrices, and we focus on the setting in which $k < n$. We show that a *simple* randomized rounding method provides a constant-factor approximation if the *optimal increase* is sufficiently large, specifically, if $\mathrm{OPT} - \lambda_{\mathrm{min}}(M) = \Omega(R \ln k)$, where $\mathrm{OPT}$ is the optimal value, and $R$ is the maximum trace of an augmentation matrix. To establish the guarantee, we derive a matrix concentration inequality that is of independent interest. The inequality can be interpreted as an *intrinsic dimension* analog of the matrix Chernoff inequality for the minimum eigenvalue of a sum of independent random positive semidefinite matrices; such an inequality has already been established for the maximum eigenvalue, but not for the minimum eigenvalue.

Poster

#W-519

Aequa: Fair Model Rewards in Collaborative Learning via Slimmable Networks

Nurbek Tastan · Samuel Horváth · Karthik Nandakumar

Collaborative learning enables multiple participants to learn a single global model by exchanging focused updates instead of sharing data. One of the core challenges in collaborative learning is ensuring that participants are rewarded fairly for their contributions, which entails two key sub-problems: contribution assessment and reward allocation. This work focuses on fair reward allocation, where the participants are incentivized through model rewards - differentiated final models whose performance is commensurate with the contribution. In this work, we leverage the concept of slimmable neural networks to collaboratively learn a shared global model whose performance degrades gracefully with a reduction in model width. We also propose a post-training fair allocation algorithm that determines the model width for each participant based on their contributions. We theoretically study the convergence of our proposed approach and empirically validate it using extensive experiments on different datasets and architectures. We also extend our approach to enable training-time model reward allocation.

Poster

#W-520

BSemiFL: Semi-supervised Federated Learning via a Bayesian Approach

Haozhao Wang · Shengyu Wang · Jiaming Li · Hao Ren · Xingshuo Han · Wenchao Xu · Shangwei Guo · Tianwei Zhang · Ruixuan Li

Semi-supervised Federated Learning (SSFL) is a promising approach that allows clients to collaboratively train a global model in the absence of their local data labels. The key step of SSFL is the re-labeling where each client adopts two types of available models, namely global and local models, to re-label the local data. While various technologies such as using the global model or the average of two models have been proposed to conduct the re-labeling step, little literature delves deeply into the performance dominance and limitations of the two models. In this paper, we first theoretically and empirically demonstrate that the local model achieves higher re-labeling accuracy over local data while the global model can progressively improve the re-labeling performance by introducing the extra data knowledge of other clients. Based on these findings, we propose BSemiFL which re-labels the local data through the collaboration between the local and global model in a Bayesian approach. Specifically, to re-label any given local sample, BSemiFL first uses Bayesian inference to assess the closeness of the local/global model to the sample. Then, it applies a weighted combination of their pseudo labels, using the closeness as the weights. Theoretical analysis shows that the labeling error of our method is smaller than that of simply using the global model, the local model, or their simple average. Experimental results show that BSemiFL improves the performance by up to $9.8\%$ as compared to state-of-the-art methods.

Spotlight Poster

#W-521

On the Tension between Byzantine Robustness and No-Attack Accuracy in Distributed Learning

Yi-Rui Yang · Chang-Wei Shi · Wu-Jun Li

Byzantine-robust distributed learning (BRDL), which refers to distributed learning that can work with potential faulty or malicious workers (also known as Byzantine workers), has recently attracted much research attention. Robust aggregators are widely used in existing BRDL methods to obtain robustness against Byzantine workers. However, Byzantine workers do not always exist in applications. As far as we know, there is almost no existing work theoretically investigating the effect of using robust aggregators when there are no Byzantine workers. To bridge this knowledge gap, we theoretically analyze the aggregation error for robust aggregators when there are no Byzantine workers. Specifically, we show that the worst-case aggregation error without Byzantine workers increases with the increase of the number of Byzantine workers that a robust aggregator can tolerate. The theoretical result reveals the tension between Byzantine robustness and no-attack accuracy, which refers to accuracy without faulty workers and malicious workers in this paper. Furthermore, we provide lower bounds for the convergence rate of gradient descent with robust aggregators for non-convex objective functions and objective functions that satisfy the Polyak-Lojasiewicz (PL) condition, respectively. We also prove the tightness of the lower bounds. The lower bounds for convergence rate reveal similar tension between Byzantine robustness and no-attack accuracy. Empirical results further support our theoretical findings.

Poster

#W-600

Improving Value Estimation Critically Enhances Vanilla Policy Gradient

Tao Wang · Ruipeng Zhang · Sicun Gao

Modern policy gradient algorithms, such as TRPO and PPO, outperform vanilla policy gradient in many RL tasks. Questioning the common belief that enforcing approximate trust regions leads to steady policy improvement in practice, we show that the more critical factor is the enhanced value estimation accuracy from more value update steps in each iteration. To demonstrate, we show that by simply increasing the number of value update steps per iteration, vanilla policy gradient itself can achieve performance comparable to or better than PPO in all the standard continuous control benchmark environments. Importantly, this simple change to vanilla policy gradient is significantly more robust to hyperparameter choices, opening up the possibility that RL algorithms may still become more effective and easier to use.

Poster

#W-601

Can We Predict Performance of Large Models across Vision-Language Tasks?

Qinyu Zhao · Ming Xu · Kartik Gupta · Akshay Asthana · Liang Zheng · Stephen Gould

Evaluating large vision-language models (LVLMs) is very expensive, due to high computational cost and the wide variety of tasks. The good news is that if we already have some observed performance scores, we may be able to infer unknown ones. In this study, we propose a new framework for predicting unknown performance scores based on observed ones from other LVLMs or tasks. We first formulate the performance prediction as a matrix completion task. Specifically, we construct a sparse performance matrix $\boldsymbol{R}$, where each entry $R_{mn}$ represents the performance score of the $m$-th model on the $n$-th dataset. By applying probabilistic matrix factorization (PMF) with Markov chain Monte Carlo (MCMC), we can complete the performance matrix, i.e., predict unknown scores. Additionally, we estimate the uncertainty of performance prediction based on MCMC. Practitioners can evaluate their models on untested tasks with higher uncertainty first, which quickly reduces the prediction errors. We further introduce several improvements to enhance PMF for scenarios with sparse observed performance scores. Our experiments demonstrate the accuracy of PMF in predicting unknown scores, the reliability of uncertainty estimates in ordering evaluations, and the effectiveness of our enhancements for handling sparse data. Our code is available at https://github.com/Qinyu-Allen-Zhao/CrossPred-LVLM.

Poster

#W-602

Efficient Online Reinforcement Learning for Diffusion Policy

Haitong Ma · Tianyi Chen · Kai Wang · Na Li · Bo Dai

Diffusion policies have achieved superior performance in imitation learning and offline reinforcement learning (RL) due to their rich expressiveness. However, the conventional diffusion training procedure requires samples from target distribution, which is impossible in online RL since we cannot sample from the optimal policy. Backpropagating policy gradient through the diffusion process incurs huge computational costs and instability, thus being expensive and not scalable. To enable efficient training of diffusion policies in online RL, we generalize the conventional denoising score matching by reweighting the loss function. The resulting Reweighted Score Matching (RSM) preserves the optimal solution and low computational cost of denoising score matching, while eliminating the need to sample from the target distribution and allowing learning to optimize value functions. We introduce two tractable reweighted loss functions to solve two commonly used policy optimization problems, policy mirror descent and max-entropy policy, resulting in two practical algorithms named Diffusion Policy Mirror Descent (DPMD) and Soft Diffusion Actor-Critic (SDAC). We conducted comprehensive comparisons on MuJoCo benchmarks. The empirical results show that the proposed algorithms outperform recent diffusion-policy online RLs on most tasks, and the DPMD improves more than 120% over soft actor-critic on Humanoid and Ant.

Poster

#W-603

Vintix: Action Model via In-Context Reinforcement Learning

Andrei Polubarov · Nikita Lyubaykin · Alexander Derevyagin · Ilya Zisman · Denis Tarasov · Alexander Nikulin · Vladislav Kurenkov

In-Context Reinforcement Learning (ICRL) represents a promising paradigm for developing generalist agents that learn at inference time through trial-and-error interactions, analogous to how large language models adapt contextually, but with a focus on reward maximization. However, the scalability of ICRL beyond toy tasks and single-domain settings remains an open challenge. In this work, we present the first steps toward scaling ICRL by introducing a fixed, cross-domain model capable of learning behaviors through in-context reinforcement learning. Our results demonstrate that Algorithm Distillation, a framework designed to facilitate ICRL, offers a compelling and competitive alternative to expert distillation to construct versatile action models. These findings highlight the potential of ICRL as a scalable approach for generalist decision-making systems.

Poster

#W-604

Policy Gradient with Tree Expansion

Gal Dalal · Assaf Hallak · Gugan Chandrashekhar Mallika Thoppe · Shie Mannor · Gal Chechik

Policy gradient methods are notorious for having a large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax---a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze SoftTreeMax and explain how tree expansion helps to reduce its gradient variance. We prove that the variance depends on the chosen tree-expansion policy. Specifically, we show that the closer the induced transitions are to being state-independent, the stronger the variance decay. With approximate forward models, we prove that the resulting gradient bias diminishes with the approximation error while retaining the same variance reduction. Ours is the first result to bound the gradient bias for an approximate model. In a practical implementation of SoftTreeMax we utilize a parallel GPU-based simulator for fast and efficient tree expansion. Using this implementation in Atari, we show that SoftTreeMax reduces the gradient variance by three orders of magnitude. This leads to better sample complexity and improved performance compared to distributed PPO.

Poster

#W-605

Behavioral Exploration: Learning to Explore via In-Context Adaptation

Andrew Wagenmaker · Zhiyuan Zhou · Sergey Levine

Developing autonomous agents that quickly explore an environment and adapt their behavior online is a canonical challenge in robotics and machine learning. While humans are able to achieve such fast online exploration and adaptation, often acquiring new information and skills in only a handful of interactions, existing algorithmic approaches tend to rely on random exploration and slow, gradient-based behavior updates. How can we endow autonomous agents with such capabilities on par with humans? Taking inspiration from recent progress on both in-context learning and large-scale behavioral cloning, in this work we propose behavioral exploration: training agents to internalize what it means to explore and adapt in-context over the space of ''expert'' behaviors. To achieve this, given access to a dataset of expert demonstrations, we train a long-context generative model to predict expert actions conditioned on a context of past observations and a measure of how ''exploratory'' the expert's behaviors are relative to this context. This enables the model to not only mimic the behavior of an expert, but also, by feeding its past history of interactions into its context, to select different expert behaviors than what have been previously selected, thereby allowing for fast online adaptation and targeted, ''expert-like'' exploration. We demonstrate the effectiveness of our method in both simulated locomotion and manipulation settings, as well as on real-world robotic manipulation tasks, illustrating its ability to learn adaptive, exploratory behavior.

Poster

#W-606

Craftium: Bridging Flexibility and Efficiency for Rich 3D Single- and Multi-Agent Environments

Mikel Malagón · Josu Ceberio · Jose A Lozano

Advances in large models, reinforcement learning, and open-endedness have accelerated progress toward autonomous agents that can learn and interact in the real world. To achieve this, flexible tools are needed to create rich, yet computationally efficient, environments. While scalable 2D environments fail to address key real-world challenges like 3D navigation and spatial reasoning, more complex 3D environments are computationally expensive and lack features like customizability and multi-agent support. This paper introduces Craftium, a highly customizable and easy-to-use platform for building rich 3D single- and multi-agent environments. We showcase environments of different complexity and nature: from single- and multi-agent tasks to vast worlds with many creatures and biomes, and customizable procedural task generators. Benchmarking shows that Craftium significantly reduces the computational cost of alternatives of similar richness, achieving +2K steps per second more than Minecraft-based frameworks.

Poster

#W-607

Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics

Tyler Kastner · Mark Rowland · Yunhao Tang · Murat Erdogdu · Amir-massoud Farahmand

We study the problem of distributional reinforcement learning using categorical parametrisations and a KL divergence loss. Previous work analyzing categorical distributional RL has done so using a Cramér distance-based loss, simplifying the analysis but creating a theory-practice gap. We introduce a preconditioned version of the algorithm, and prove that it is guaranteed to converge. We further derive the asymptotic variance of the categorical estimates under different learning rate regimes, and compare to that of classical reinforcement learning. We finally empirically validate our theoretical results and perform an empirical investigation into the relative strengths of using KL losses, and derive a number of actionable insights for practitioners.

Poster

#W-608

Beyond The Rainbow: High Performance Deep Reinforcement Learning on a Desktop PC

Tyler Clark · Mark Towers · Christine Evers · Jonathon Hare

Rainbow Deep Q-Network (DQN) demonstrated combining multiple independent enhancements could significantly boost a reinforcement learning (RL) agent’s performance. In this paper, we present “Beyond The Rainbow” (BTR), a novel algorithm that integrates six improvements from across the RL literature to Rainbow DQN, establishing a new state-of-the-art for RL using a desktop PC, with a human-normalized interquartile mean (IQM) of 7.6 on Atari-60. Beyond Atari, we demonstrate BTR’s capability to handle complex 3D games, successfully training agents to play Super Mario Galaxy, Mario Kart, and Mortal Kombat with minimal algorithmic changes. Designing BTR with computational efficiency in mind, agents can be trained using a high-end desktop PC on 200 million Atari frames within 12 hours. Additionally, we conduct detailed ablation studies of each component, analyzing the performance and impact using numerous measures.

Poster

#W-609

Directly Forecasting Belief for Reinforcement Learning with Delays

Qingyuan Wu · Yuhui Wang · Simon Zhan · Yixuan Wang · Chung-Wei Lin · Chen Lv · Qi Zhu · Jürgen Schmidhuber · Chao Huang

Reinforcement learning (RL) with delays is challenging as sensory perceptions lag behind the actual events: the RL agent needs to estimate the real state of its environment based on past observations. State-of-the-art (SOTA) methods typically employ recursive, step-by-step forecasting of states. This can cause the accumulation of compounding errors. To tackle this problem, our novel belief estimation method, named Directly Forecasting Belief Transformer (DFBT), directly forecasts states from observations without incrementally estimating intermediate states step-by-step. We theoretically demonstrate that DFBT greatly reduces compounding errors of existing recursively forecasting methods, yielding stronger performance guarantees. In experiments with D4RL offline datasets, DFBT reduces compounding errors with remarkable prediction accuracy. DFBT's capability to forecast state sequences also facilitates multi-step bootstrapping, thus greatly improving learning efficiency. On the MuJoCo benchmark, our DFBT-based method substantially outperforms SOTA baselines. Code is available at \href{https://github.com/QingyuanWuNothing/DFBT}{https://github.com/QingyuanWuNothing/DFBT}.

Poster

#W-610

Online Learning in Risk Sensitive constrained MDP

Arnob Ghosh · Mehrdad Moharrami

We consider a setting in which the agent aims to maximize the expected cumulative reward, subject to a constraint that the entropic risk of the total utility exceeds a given threshold. Unlike the risk-neutral case, standard primal-dual approaches fail to directly yield regret and violation bounds, as value iteration with respect to a combined state-action value function is not applicable in the risk-sensitive setting. To address this, we adopt the Optimized Certainty Equivalent (OCE) representation of the entropic risk measure and reformulate the problem by augmenting the state space with a continuous budget variable. We then propose a primal-dual algorithm tailored to this augmented formulation. In contrast to the standard approach for risk-neutral CMDPs, our method incorporates a truncated dual update to account for the possible absence of strong duality. We show that the proposed algorithm achieves regret of $\tilde{\mathcal{O}}\big(V_{g,\max}K^{3/4} + \sqrt{H^4 S^2 A \log(1/\delta)}K^{3/4}\big)$ and constraint violation of $\tilde{\mathcal{O}}\big(V_{g,\max} \sqrt{ {H^3 S^2 A \log(1/\delta)}}K^{3/4} \big)$ with probability at least $1-\delta$, where $S$ and $A$ denote the cardinalities of the state and action spaces, respectively, $H$ is the episode length, $K$ is the number of episodes, $\alpha < 0$ is the risk-aversion parameter, and $V_{g,\max} = \frac{1}{|\alpha|}(\exp(|\alpha|H) - 1)$. *To the best of our knowledge, this is the first result establishing sublinear regret and violation bounds for the risk-sensitive CMDP problem.*

Poster

#W-611

Rank-One Modified Value Iteration

Arman Sharifi Kolarijani · Tolga Ok · Peyman Mohajerin Esfahani · Mohamad Amin Sharifi Kolarijani

In this paper, we provide a novel algorithm for solving planning and learning problems of Markov decision processes. The proposed algorithm follows a policy iteration-type update by using a rank-one approximation of the transition probability matrix in the policy evaluation step. This rank-one approximation is closely related to the stationary distribution of the corresponding transition probability matrix, which is approximated using the power method. We provide theoretical guarantees for the convergence of the proposed algorithm to optimal (action-)value function with the same rate and computational complexity as the value iteration algorithm in the planning problem and as the Q-learning algorithm in the learning problem. Through our extensive numerical simulations, however, we show that the proposed algorithm consistently outperforms first-order algorithms and their accelerated versions for both planning and learning problems.

Poster

#W-612

Improved Off-policy Reinforcement Learning in Biological Sequence Design

Hyeonah Kim · Minsu Kim · Taeyoung Yun · Sanghyeok Choi · Emmanuel Bengio · Alex Hernandez-Garcia · Jinkyoo Park

Designing biological sequences with desired properties is challenging due to vast search spaces and limited evaluation budgets. Although reinforcement learning methods use proxy models for rapid reward evaluation, insufficient training data can cause proxy misspecification on out-of-distribution inputs. To address this, we propose a novel off-policy search, $\delta$-Conservative Search, that enhances robustness by restricting policy exploration to reliable regions. Starting from high-score offline sequences, we inject noise by randomly masking tokens with probability $\delta$, then denoise them using our policy. We further adapt $\delta$ based on proxy uncertainty on each data point, aligning the level of conservativeness with model confidence. Experimental results show that our conservative search consistently enhances the off-policy training, outperforming existing machine learning methods in discovering high-score sequences across diverse tasks, including DNA, RNA, protein, and peptide design.

Poster

#W-613

DEALing with Image Reconstruction: Deep Attentive Least Squares

Mehrsa Pourya · Erich Kobler · Michael Unser · Sebastian Neumayer

State-of-the-art image reconstruction often relies on complex, abundantly parameterized deep architectures. We propose an alternative: a data-driven reconstruction method inspired by the classic Tikhonov regularization. Our approach iteratively refines intermediate reconstructions by solving a sequence of quadratic problems. These updates have two key components: (i) learned filters to extract salient image features; and (ii) an attention mechanism that locally adjusts the penalty of the filter responses. Our method matches leading plug-and-play and learned regularizer approaches in performance while offering interpretability, robustness, and convergent behavior. In effect, we bridge traditional regularization and deep learning with a principled reconstruction approach.

Poster

#W-614

Density Ratio Estimation-based Bayesian Optimization with Semi-Supervised Learning

Jungtaek Kim

Bayesian optimization has attracted huge attention from diverse research areas in science and engineering, since it is capable of efficiently finding a global optimum of an expensive-to-evaluate black-box function. In general, a probabilistic regression model is widely used as a surrogate function to model an explicit distribution over function evaluations given an input to estimate and a training dataset. Beyond the probabilistic regression-based methods, density ratio estimation-based Bayesian optimization has been suggested in order to estimate a density ratio of the groups relatively close and relatively far to a global optimum. Developing this line of research further, supervised classifiers are employed to estimate a class probability for the two groups instead of a density ratio. However, the supervised classifiers used in this strategy are prone to be overconfident for known knowledge on global solution candidates. Supposing that we have access to unlabeled points, e.g., predefined fixed-size pools, we propose density ratio estimation-based Bayesian optimization with semi-supervised learning to solve this challenge. Finally, we show the empirical results of our methods and several baseline methods in two distinct scenarios with unlabeled point sampling and a fixed-size pool, and analyze the validity of our methods in diverse experiments.

Poster

#W-615

Adaptive Partitioning Schemes for Optimistic Optimization

Raja Sunkara · Ardhendu Tripathy

Applications such as engineering design often require us to optimize a black-box function, i.e., a system whose inner processing is not analytically known and whose gradients are not available. Practitioners often have a fixed budget for the number of function evaluations and the performance of an optimization algorithm is measured by its simple regret. In this paper, we study the class of ``Optimistic Optimization'' algorithms for black-box optimization that use a partitioning scheme for the domain. We develop algorithms that learn a good partitioning scheme and use flexible surrogate models such as neural networks in the optimization procedure. For multi-index functions on an $m$-dimensional subspace within $d$ dimensions, our algorithm attains $\tilde{O}(n^{-\beta / d})$ regret, where $\beta = 1 + \frac{d-m}{2m-1}$, as opposed to $\tilde{O}(n^{-1/d})$ for SequOOL, a state-of-the-art optimistic optimization algorithm. Our approach is competitive across a wide range of numerical benchmarks. Additionally, we introduce weight quantization in a large language model as a novel task for black-box optimization. Our approach improves the quality of Activation-aware Weight Quantization (AWQ) of the OPT-1.3B model, achieving an approximate 10\% improvement in performance relative to the best possible unquantized model.

Poster

#W-616

Diversity By Design: Leveraging Distribution Matching for Offline Model-Based Optimization

Michael S Yao · James Gee · Osbert Bastani

The goal of offline model-based optimization (MBO) is to propose new designs that maximize a reward function given only an offline dataset. However, an important desiderata is to also propose a diverse set of final candidates that capture many optimal and near-optimal design configurations. We propose Diversity In Adversarial Model-based Optimization (DynAMO) as a novel method to introduce design diversity as an explicit objective into any MBO problem. Our key insight is to formulate diversity as a distribution matching problem where the distribution of generated designs captures the inherent diversity contained within the offline dataset. Extensive experiments spanning multiple scientific domains show that DynAMO can be used with common optimization methods to significantly improve the diversity of proposed designs while still discovering high-quality candidates.

Poster

#W-617

Meta-Black-Box-Optimization through Offline Q-function Learning

Zeyuan Ma · Zhiguang Cao · Zhou Jiang · Hongshu Guo · Yue-Jiao Gong

Recent progress in Meta-Black-Box-Optimization (MetaBBO) has demonstrated that using RL to learn a meta-level policy for dynamic algorithm configuration (DAC) over an optimization task distribution could significantly enhance the performance of the low-level BBO algorithm. However, the online learning paradigms in existing works makes the efficiency of MetaBBO problematic.To address this, we propose an offline learning-based MetaBBO framework in this paper, termed Q-Mamba, to attain both effectiveness and efficiency in MetaBBO. Specifically, we first transform DAC task into long-sequence decision process. This allows us further introduce an effective Q-function decomposition mechanism to reduce the learning difficulty within the intricate algorithm configuration space. Under this setting, we propose three novel designs to meta-learn DAC policy from offline data: we first propose a novel collection strategy for constructing offline DAC experiences dataset with balanced exploration and exploitation. We then establish a decomposition-based Q-loss that incorporates conservative Q-learning to promote stable offline learning from the offline dataset. To further improve the offline learning efficiency, we equip our work with a Mamba architecture which helps long-sequence learning effectiveness and efficiency by selective state model and hardware-aware parallel scan respectively. Through extensive benchmarking, we observe that Q-Mamba achieves competitive or even superior performance to prior online/offline baselines, while significantly improving the training efficiency of existing online baselines. We provide sourcecodes of Q-Mamba \href{https://github.com/MetaEvo/Q-Mamba}{online}.

Poster

#W-618

A Near-Optimal Single-Loop Stochastic Algorithm for Convex Finite-Sum Coupled Compositional Optimization

Bokun Wang · Tianbao Yang

This paper studies a class of convex Finite-sum Coupled Compositional Optimization (cFCCO) problems with applications including group distributionally robust optimization (GDRO) and learning with imbalanced data. To better address these problems, we introduce an efficient single-loop primal-dual block-coordinate stochastic algorithm called ALEXR. The algorithm employs block-coordinate stochastic mirror ascent with extrapolation for the dual variable and stochastic proximal gradient descent updates for the primal variable. We establish the convergence rates of ALEXR in both convex and strongly convex cases under smoothness and non-smoothness conditions of involved functions, which not only improve the best rates in previous works on smooth cFCCO problems but also expand the realm of cFCCO for solving more challenging non-smooth problems such as the dual form of GDRO. Finally, we derive lower complexity bounds, demonstrating the (near-)optimality of ALEXR within a broad class of stochastic algorithms for cFCCO. Experimental results on GDRO and partial Area Under the ROC Curve (pAUC) maximization demonstrate the promising performance of our algorithm.

Poster

#W-619

Learning Curves of Stochastic Gradient Descent in Kernel Regression

Haihan Zhang · Weicheng Lin · Yuanshi Liu · Cong Fang

This paper considers a canonical problem in kernel regression: how good are the model performances when it is trained by the popular online first-order algorithms, compared to the offline ones, such as ridge and ridgeless regression? In this paper, we analyze the foundational single-pass Stochastic Gradient Descent (SGD) in kernel regression under source condition where the optimal predictor can even not belong to the RKHS, i.e. the model is misspecified. Specifically, we focus on the inner product kernel over the sphere and characterize the exact orders of the excess risk curves under different scales of sample sizes $n$ concerning the input dimension $d$. Surprisingly, we show that SGD achieves min-max optimal rates up to constants among all the scales, $without$ suffering the saturation, a prevalent phenomenon observed in (ridge) regression, except when the model is highly misspecified and the learning is in a final stage where $n\gg d^\gamma$ with any constant $\gamma >0$. The main reason for SGD to overcome the curse of saturation is the exponentially decaying step size schedule, a common practice in deep neural network training. As a byproduct, we provide the $first$ provable advantage of the scheme over the iterative averaging method in the common setting.

Poster

#W-620

Exact risk curves of signSGD in High-Dimensions: quantifying preconditioning and noise-compression effects

Kevin Xiao · Noah Marshall · Atish Agarwala · Elliot Paquette

In recent years, SignSGD has garnered interest as both a practical optimizer as well as a simple model to understand adaptive optimizers like Adam. Though there is a general consensus that SignSGD acts to precondition optimization and reshapes noise, quantitatively understanding these effects in theoretically solvable settings remains difficult. We present an analysis of SignSGD in a high dimensional limit, and derive a limiting SDE and ODE to describe the risk. Using this framework we quantify four effects of SignSGD: effective learning rate, noise compression, diagonal preconditioning, and gradient noise reshaping. Our analysis is consistent with experimental observations but moves beyond that by quantifying the dependence of these effects on the data and noise distributions. We conclude with a conjecture on how these results might be extended to Adam.

Poster

#W-621

Cover learning for large-scale topology representation

Luis Scoccola · Uzu Lim · Heather Harrington

Classical unsupervised learning methods like clustering and linear dimensionality reduction parametrize large-scale geometrywhen it is discrete or linear, while more modern methods from manifold learning find low dimensional representation or infer local geometry by constructing a graph on the input data. More recently, topological data analysis popularized the use of simplicial complexes to represent data topology with two main methodologies: topological inference with geometric complexes and large-scale topology representation with Mapper graphs -- central to these is the nerve construction from topology, which builds a simplicial complex given any cover of a space by subsets. While successful, these have limitations: geometric complexes scale poorly with data size, and Mapper graphs can be hard to tune and only contain low dimensional information. In this paper, we propose to study the problem of learning covers in its own right, and from the perspective of optimization. We describe a method to learn topologically-faithful covers of geometric datasets, and show that the simplicial complexes thus obtained can outperform standard topological inference approaches in terms of size, and Mapper-type algorithms in terms of representation of large-scale topology.

Poster

#W-700

Task-Aware Virtual Training: Enhancing Generalization in Meta-Reinforcement Learning for Out-of-Distribution Tasks

Jeongmo Kim · Yisak Park · Minung Kim · Seungyul Han

Meta reinforcement learning aims to develop policies that generalize to unseen tasks sampled from a task distribution. While context-based meta-RL methods improve task representation using task latents, they often struggle with out-of-distribution (OOD) tasks. To address this, we propose Task-Aware Virtual Training (TAVT), a novel algorithm that accurately captures task characteristics for both training and OOD scenarios using metric-based representation learning. Our method successfully preserves task characteristics in virtual tasks and employs a state regularization technique to mitigate overestimation errors in state-varying environments. Numerical results demonstrate that TAVT significantly enhances generalization to OOD tasks across various MuJoCo and MetaWorld environments. Our code is available at https://github.com/JM-Kim-94/tavt.git.

Poster

#W-701

KEA: Keeping Exploration Alive by Proactively Coordinating Exploration Strategies

Shih-Min Yang · Martin Magnusson · Johannes Stork · Todor Stoyanov

Soft Actor-Critic (SAC) has achieved notable success in continuous control tasks but struggles in sparse reward settings, where infrequent rewards make efficient exploration challenging. While novelty-based exploration methods address this issue by encouraging the agent to explore novel states, they are not trivial to apply to SAC. In particular, managing the interaction between novelty-based exploration and SAC’s stochastic policy can lead to inefficient exploration and redundant sample collection. In this paper, we propose KEA (Keeping Exploration Alive) which tackles the inefficiencies in balancing exploration strategies when combining SAC with novelty-based exploration. KEA integrates a novelty-augmented SAC with a standard SAC agent, proactively coordinated via a switching mechanism. This coordination allows the agent to maintain stochasticity in high-novelty regions, enhancing exploration efficiency and reducing repeated sample collection. We first analyze this potential issue in a 2D navigation task, and then evaluate KEA on the DeepSea hard-exploration benchmark as well as sparse reward control tasks from the DeepMind Control Suite. Compared to state-of-the-art novelty-based exploration baselines, our experiments show that KEA significantly improves learning efficiency and robustness in sparse reward setups.

Poster

#W-702

PIGDreamer: Privileged Information Guided World Models for Safe Partially Observable Reinforcement Learning

Dongchi Huang · Jiaqi WANG · Yang Li · Chunhe Xia · Tianle Zhang · Kaige Zhang

Partial observability presents a significant challenge for safe reinforcement learning, as it impedes the identification of potential risks and rewards. Leveraging specific types of privileged information during training to mitigate the effects of partial observability has yielded notable empirical successes. In this paper, we propose Asymmetric Constrained Partially Observable Markov Decision Processes (ACPOMDPs) to theoretically examine the advantages of incorporating privileged information. Building upon ACPOMDPs, we propose the Privileged Information Guided Dreamer, a model-based safe reinforcement learning approach that leverages privileged information to enhance the agent's safety and performance through privileged representation alignment and an asymmetric actor-critic structure. Our empirical results demonstrate that our approach significantly outperforms existing methods in terms of safety and task-centric performance. Meanwhile, compared to alternative privileged model-based reinforcement learning methods, our approach exhibits superior performance and ease of training.

Poster

#W-703

Learning Utilities from Demonstrations in Markov Decision Processes

Filippo Lazzati · Alberto Maria Metelli

Although it is well-known that humans commonly engage in *risk-sensitive* behaviors in the presence of stochasticity, most Inverse Reinforcement Learning (IRL) models assume a *risk-neutral* agent. As such, beyond $(i)$ introducing model misspecification, $(ii)$ they do not permit direct inference of the risk attitude of the observed agent, which can be useful in many applications. In this paper, we propose a novel model of behavior to cope with these issues. By allowing for risk sensitivity, our model alleviates $(i)$, and by explicitly representing risk attitudes through (learnable) *utility* functions, it solves $(ii)$. Then, we characterize the partial identifiability of an agent’s utility under the new model and note that demonstrations from multiple environments mitigate the problem. We devise two provably-efficient algorithms for learning utilities in a finite-data regime, and we conclude with some proof-of-concept experiments to validate *both* our model and our algorithms.

Poster

#W-704

O-MAPL: Offline Multi-agent Preference Learning

The Viet Bui · Tien Mai · Thanh Nguyen

Inferring reward functions from demonstrations is a key challenge in reinforcement learning (RL), particularly in multi-agent RL (MARL). The large joint state-action spaces and intricate inter-agent interactions in MARL make inferring the joint reward function especially challenging. While prior studies in single-agent settings have explored ways to recover reward functions and expert policies from human preference feedback, such studies in MARL remain limited. Existing methods typically combine two separate stages, supervised reward learning, and standard MARL algorithms, leading to unstable training processes. In this work, we exploit the inherent connection between reward functions and Q functions in cooperative MARL to introduce a novel end-to-end preference-based learning framework.Our framework is supported by a carefully designed multi-agent value decomposition strategy that enhances training efficiency. Extensive experiments on two state-of-the-art benchmarks, SMAC and MAMuJoCo, using preference data generated by both rule-based and large language model approaches demonstrate that our algorithm consistently outperforms existing methods across various tasks.

Poster

#W-705

Learning Mean Field Control on Sparse Graphs

Christian Fabian · Kai Cui · Heinz Koeppl

Large agent networks are abundant in applications and nature and pose difficult challenges in the field of multi-agent reinforcement learning (MARL) due to their computational and theoretical complexity. While graphon mean field games and their extensions provide efficient learning algorithms for dense and moderately sparse agent networks, the case of realistic sparser graphs remains largely unsolved. Thus, we propose a novel mean field control model inspired by local weak convergence to include sparse graphs such as power law networks with coefficients above two. Besides a theoretical analysis, we design scalable learning algorithms which apply to the challenging class of graph sequences with finite first moment. We compare our model and algorithms for various examples on synthetic and real world networks with mean field algorithms based on Lp graphons and graphexes. As it turns out, our approach outperforms existing methods in many examples and on various networks due to the special design aiming at an important, but so far hard to solve class of MARL problems.

Poster

#W-706

Learning Imperfect Information Extensive-form Games with Last-iterate Convergence under Bandit Feedback

Canzhe Zhao · Yutian Cheng · Jing Dong · Baoxiang Wang · Shuai Li

We investigate learning approximate Nash equilibrium (NE) policy profiles in two-player zero-sum imperfect information extensive-form games (IIEFGs) with last-iterate convergence guarantees. Existing algorithms either rely on full-information feedback or provide only asymptotic convergence rates. In contrast, we focus on the bandit feedback setting, where players receive feedback solely from the rewards associated with the experienced information set and action pairs in each episode. Our proposed algorithm employs a negentropy regularizer weighted by a "virtual transition" over the information set-action space to facilitate an efficient approximate policy update. Through a carefully designed virtual transition and leveraging the entropy regularization technique, we demonstrate finite-time last-iterate convergence to the NE with a rate of $\widetilde{\mathcal{O}}(k^{-\frac{1}{8}})$ under bandit feedback in each episode $k$. Empirical evaluations across various IIEFG instances show its competitive performance compared to baseline methods.

Poster

#W-707

Sleeping Reinforcement Learning

Simone Drago · Marco Mussi · Alberto Maria Metelli

In the standard Reinforcement Learning (RL) paradigm, the action space is assumed to be fixed and immutable throughout the learning process. However, in many real-world scenarios, not all actions are available at every decision stage. The available action set may depend on the current environment state, domain-specific constraints, or other (potentially stochastic) factors outside the agent's control. To address these realistic scenarios, we introduce a novel paradigm called *Sleeping Reinforcement Learning*, where the available action set varies during the interaction with the environment. We start with the simpler scenario in which the available action sets are revealed at the beginning of each episode. We show that a modification of UCBVI achieves regret of order $\widetilde{\mathcal{O}}(H\sqrt{SAT})$, where $H$ is the horizon, $S$ and $A$ are the cardinalities of the state and action spaces, respectively, and $T$ is the learning horizon. Next, we address the more challenging and realistic scenario in which the available actions are disclosed only at each decision stage. By leveraging a novel construction, we establish a minimax lower bound of order $\Omega(\sqrt{T 2^{A/2}})$ when the availability of actions is governed by a Markovian process, establishing a statistical barrier of the problem. Focusing on the statistically tractable case where action availability depends only on the current state and stage, we propose a new optimistic algorithm that achieves regret guarantees of order $\widetilde{\mathcal{O}}(H\sqrt{SAT})$, showing that the problem shares the same complexity of standard RL.

Poster

#W-708

Beyond CVaR: Leveraging Static Spectral Risk Measures for Enhanced Decision-Making in Distributional Reinforcement Learning

Mehrdad Moghimi · Hyejin Ku

In domains such as finance, healthcare, and robotics, managing worst-case scenarios is critical, as failure to do so can lead to catastrophic outcomes. Distributional Reinforcement Learning (DRL) provides a natural framework to incorporate risk sensitivity into decision-making processes. However, existing approaches face two key limitations: (1) the use of fixed risk measures at each decision step often results in overly conservative policies, and (2) the interpretation and theoretical properties of the learned policies remain unclear. While optimizing a static risk measure addresses these issues, its use in the DRL framework has been limited to the simple static CVaR risk measure. In this paper, we present a novel DRL algorithm with convergence guarantees that optimizes for a broader class of static Spectral Risk Measures (SRM). Additionally, we provide a clear interpretation of the learned policy by leveraging the distribution of returns in DRL and the decomposition of static coherent risk measures. Extensive experiments demonstrate that our model learns policies aligned with the SRM objective, and outperforms existing risk-neutral and risk-sensitive DRL models in various settings.

Poster

#W-709

Unveiling Markov heads in Pretrained Language Models for Offline Reinforcement Learning

Wenhao Zhao · Qiushui Xu · Linjie Xu · Lei Song · Jinyu Wang · Chunlai Zhou · Jiang Bian

Recently, incorporating knowledge from pretrained language models (PLMs) into decision transformers (DTs) has generated significant attention in offline reinforcement learning (RL). These PLMs perform well in RL tasks, raising an intriguing question: what kind of knowledge from PLMs has been transferred to RL to achieve such good results? This work first dives into this problem by analyzing each head quantitatively and points out Markov head, a crucial component that exists in the attention heads of PLMs. It leads to extreme attention on the last-input token and performs well only in short-term environments. Furthermore, we prove that this extreme attention cannot be changed by re-training embedding layer or fine-tuning. Inspired by our analysis, we propose a general method GPT-DTMA, which equips a pretrained DT with Mixture of Attention (MoA), to enable adaptive learning and accommodate diverse attention requirements during fine-tuning. Extensive experiments demonstrate the effectiveness of GPT-DTMA: it achieves superior performance in short-term environments compared to baselines, significantly reduces the performance gap of PLMs in long-term scenarios, and the experimental results also validate our theorems.

Poster

#W-710

Trust-Region Twisted Policy Improvement

Joery de Vries · Jinke He · Yaniv Oren · Matthijs T. J. Spaan

Monte-Carlo tree search (MCTS) has driven many recent breakthroughs in deep reinforcement learning (RL).However, scaling MCTS to parallel compute has proven challenging in practice which has motivated alternative planners like sequential Monte-Carlo (SMC). Many of these SMC methods adopt particle filters for smoothing through a reformulation of RL as a policy inference problem.Yet, persisting design choices of these particle filters often conflict with the aim of online planning in RL, which is to obtain a policy improvement at the start of planning.Drawing inspiration from MCTS, we tailor SMC planners specifically to RL by improving data generation within the planner through constrained action sampling and explicit terminal state handling, as well as improving policy and value target estimation.This leads to our Trust-Region Twisted SMC (TRT-SMC), which shows improved runtime and sample-efficiency over baseline MCTS and SMC methods in both discrete and continuous domains.

Poster

#W-711

LARM: Large Auto-Regressive Model for Long-Horizon Embodied Intelligence

Zhuoling Li · Xiaogang Xu · Zhenhua Xu · Ser-Nam Lim · Hengshuang Zhao

Recent embodied agents are primarily built based on reinforcement learning (RL) or large language models (LLMs). Among them, RL agents are efficient for deployment but only perform very few tasks. By contrast, giant LLM agents (often more than 1000B parameters) present strong generalization while demanding enormous computing resources. In this work, we combine their advantages while avoiding the drawbacks by conducting the proposed referee RL on our developed large auto-regressive model (LARM). Specifically, LARM is built upon a lightweight LLM (fewer than 5B parameters) and directly outputs the next action to execute rather than text. We mathematically reveal that classic RL feedbacks vanish in long-horizon embodied exploration and introduce a giant LLM based referee to handle this reward vanishment during training LARM. In this way, LARM learns to complete diverse open-world tasks without human intervention. Especially, LARM successfully harvests enchanted diamond equipment in Minecraft, which demands significantly longer decision-making chains than the highest achievements of prior best methods.

Poster

#W-712

Latent Imputation before Prediction: A New Computational Paradigm for De Novo Peptide Sequencing

Ye DU · Chen Yang · Nanxi Yu · Wanyu LIN · Qian Zhao · Shujun Wang

*De novo* peptide sequencing is a fundamental computational technique for ascertaining amino acid sequences of peptides directly from tandem mass spectrometry data, eliminating the need for reference databases. Cutting-edge models encode the observed mass spectra into latent representations from which peptides are predicted auto-regressively. However, the issue of missing fragmentation, attributable to factors such as suboptimal fragmentation efficiency and instrumental constraints, presents a formidable challenge in practical applications. To tackle this obstacle, we propose a novel computational paradigm called $\underline{\textbf{L}}$atent $\underline{\textbf{I}}$mputation before $\underline{\textbf{P}}$rediction (LIPNovo). LIPNovo is devised to compensate for missing fragmentation information within observed spectra before executing the final peptide prediction. Rather than generating raw missing data, LIPNovo performs imputation in the latent space, guided by the theoretical peak profile of the target peptide sequence. The imputation process is conceptualized as a set-prediction problem, utilizing a set of learnable peak queries to reason about the relationships among observed peaks and directly generate the latent representations of theoretical peaks through optimal bipartite matching. In this way, LIPNovo manages to supplement missing information during inference and thus boosts performance. Despite its simplicity, experiments on three benchmark datasets demonstrate that LIPNovo outperforms state-of-the-art methods by large margins. Code is available at https://github.com/usr922/LIPNovo.

Poster

#W-713

Algorithms and Hardness for Active Learning on Graphs

Vincent Cohen-Addad · Silvio Lattanzi · Simon Meierhans

We study the offline active learning problem on graphs. In this problem, one seeks to select k vertices whose labels are best suited for predicting the labels of all the other vertices in the graph.Guillory and Bilmes (Guillory & Bilmes, 2009) introduced a natural theoretical model motivated by a label smoothness assumption. Prior to our work, algorithms with theoretical guarantees were only known for restricted graph types such as trees (Cesa-Bianchi et al., 2010) despite the models simplicity. We present the first O(log n)-resource augmented algorithm for general weighted graphs. To complement our algorithm, we show constant hardness of approximation.

Poster

#W-714

Curse of High Dimensionality Issue in Transformer for Long Context Modeling

Shuhai Zhang · Zeng You · Yaofo Chen · Zhiquan Wen · Qianyue Wang · Zhijie Qiu · Yuanqing Li · Mingkui Tan

Transformer-based large language models (LLMs) excel in natural language processing tasks by capturing long-range dependencies through self-attention mechanisms. However, long-context modeling faces significant computational inefficiencies due to redundant attention computations: while attention weights are often sparse, all tokens consume equal computational resources. In this paper, we reformulate traditional probabilistic sequence modeling as a supervised learning task, enabling the separation of relevant and irrelevant tokens and providing a clearer understanding of redundancy. Based on this reformulation, we theoretically analyze attention sparsity, revealing that only a few tokens significantly contribute to predictions. Building on this, we formulate attention optimization as a linear coding problem and propose a group coding strategy, theoretically showing its ability to improve robustness against random noise and enhance learning efficiency. Motivated by this, we propose Dynamic Group Attention (DGA), which leverages the group coding to explicitly reduce redundancy by aggregating less important tokens during attention computation. Empirical results show that our DGA significantly reduces computational costs while maintaining competitive performance.

Poster

#W-715

Mirror, Mirror of the Flow: How Does Regularization Shape Implicit Bias?

Tom Jacobs · Chao Zhou · Rebekka Burkholz

Implicit bias plays an important role in explaining how overparameterized models generalize well. Explicit regularization like weight decay is often employed in addition to prevent overfitting. While both concepts have been studied separately, in practice, they often act in tandem. Understanding their interplay is key to controlling the shape and strength of implicit bias, as it can be modified by explicit regularization. To this end, we incorporate explicit regularization into the mirror flow framework and analyze its lasting effects on the geometry of the training dynamics, covering three distinct effects: positional bias, type of bias, and range shrinking. Our analytical approach encompasses a broad class of problems, including sparse coding, matrix sensing, single-layer attention, and LoRA, for which we demonstrate the utility of our insights. To exploit the lasting effect of regularization and highlight the potential benefit of dynamic weight decay schedules, we propose to switch off weight decay during training, which can improve generalization, as we demonstrate in experiments.

Poster

#W-716

Consensus Is All You Get: The Role of Attention in Transformers

Alvaro Rodriguez Abella · João Pedro Silvestre · Paulo Tabuada

A key component of transformers is the attention mechanism orchestrating how each token influences the propagation of every other token along the layers of a transformer. In this paper we provide a rigorous, mathematical analysis of the asymptotic properties of attention in transformers. Although we present several results based on different assumptions, all of them point to the same conclusion, all tokens asymptotically converge to each other, a phenomenon that has been empirically reported in the literature. Our findings are carefully compared with existing theoretical results and illustrated by simulations and experimental studies using the GPT-2 model.

Poster

#W-717

From Kernels to Features: A Multi-Scale Adaptive Theory of Feature Learning

Noa Rubin · Kirsten Fischer · Javed Lindner · Inbar Seroussi · Zohar Ringel · Michael Krämer · Moritz Helias

Feature learning in neural networks is crucial for their expressive power and inductive biases, motivating various theoretical approaches. Some approaches describe network behavior after training through a change in kernel scale from initialization, resulting in a generalization power comparable to a Gaussian process. Conversely, in other approaches training results in the adaptation of the kernel to the data, involving directional changes to the kernel. The relationship and respective strengths of these two views have so far remained unresolved. This work presents a theoretical framework of multi-scale adaptive feature learning bridging these two views. Using methods from statistical mechanics, we derive analytical expressions for network output statistics which are valid across scaling regimes and in the continuum between them. A systematic expansion of the network's probability distribution reveals that mean-field scaling requires only a saddle-point approximation, while standard scaling necessitates additional correction terms. Remarkably, we find across regimes that kernel adaptation can be reduced to an effective kernel rescaling when predicting the mean network output in the special case of a linear network. However, for linear and non-linear networks, the multi-scale adaptive approach captures directional feature learning effects, providing richer insights than what could be recovered from a rescaling of the kernel alone.

Poster

#W-718

Equivariant Neural Tangent Kernels

Philipp Misof · Pan Kessel · Jan Gerken

Little is known about the training dynamics of equivariant neural networks, in particular how it compares to data augmented training of their non-equivariant counterparts. Recently, neural tangent kernels (NTKs) have emerged as a powerful tool to analytically study the training dynamics of wide neural networks. In this work, we take an important step towards a theoretical understanding of training dynamics of equivariant models by deriving neural tangent kernels for a broad class of equivariant architectures based on group convolutions. As a demonstration of the capabilities of our framework, we show an interesting relationship between data augmentation and group convolutional networks. Specifically, we prove that they share the same expected prediction over initializations at all training times and even off the data manifold. In this sense, they have the same training dynamics. We demonstrate in numerical experiments that this still holds approximately for finite-width ensembles. By implementing equivariant NTKs for roto-translations in the plane ($G=C_{n}\ltimes\mathbb{R}^{2}$) and 3d rotations ($G=\mathrm{SO}(3)$), we show that equivariant NTKs outperform their non-equivariant counterparts as kernel predictors for histological image classification and quantum mechanical property prediction.

Poster

#W-719

Compact Matrix Quantum Group Equivariant Neural Networks

Edward Pearce-Crump

Group equivariant neural networks have proven effective in modelling a wide range of tasks where the data lives in a classical geometric space and exhibits well-defined group symmetries. However, these networks are not suitable for learning from data that lives in a non-commutative geometry, described formally by non-commutative $\mathcal{C}^{\ast}$-algebras, since the $\mathcal{C}^{\ast}$-algebra of continuous functions on a compact matrix group is commutative. To address this limitation, we derive the existence of a new type of equivariant neural network, called compact matrix quantum group equivariant neural networks, which encode symmetries that are described by compact matrix quantum groups. We characterise the weight matrices that appear in these neural networks for the easy compact matrix quantum groups, which are defined by set partitions. As a result, we obtain new characterisations of equivariant weight matrices for some compact matrix groups that have not appeared previously in the machine learning literature.

Poster

#W-720

Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Attention Layers

Thiziri Nait Saada · Alireza Naderi · Jared Tanner

Attention layers are the core component of transformers, the current state-of-the-art neural network architecture. Alternatives to softmax-based attention are being explored due to its tendency to hinder effective information flow. Even at initialisation, it remains poorly understood why the propagation of signals and gradients through these random networks can be pathological, resulting in issues known as (i) vanishing/exploding gradients and (ii) rank collapse in depth, i.e. when all tokens converge to a single representation along layers. While rank collapse in depth naturally arises from repeated matrix multiplications---a common pattern across various architectures---we identify an additional and previously unknown challenge unique to softmax attention layers: (iii) rank collapse in width, which occurs as the context length increases. Using Random Matrix Theory, we conduct a rigorous analysis that uncovers a spectral gap between the two largest singular values of the attention matrix as the cause of (iii), which in turn exacerbates (i) and (ii).Building on this insight, we propose a novel yet simple practical solution to mitigate rank collapse in width by removing the outlier eigenvalue(s). Our theoretical framework offers a fresh perspective on recent practical studies, such as (Ye et al., 2024; Ali et al., 2023), whose ad hoc solutions can now be interpreted as implicit efforts to address the spectral gap issue. This work provides valuable theoretical support for ongoing large-scale empirical research, bringing theory and practice one step closer in the understanding of transformers.

Poster

#W-721

Gradient Flow Provably Learns Robust Classifiers for Orthonormal GMMs

Hancheng Min · Rene Vidal

Deep learning-based classifiers are known to be vulnerable to adversarial attacks. Existing methods for defending against such attacks require adding a defense mechanism or modifying the learning procedure (e.g., by adding adversarial examples). This paper shows that for certain data distributions one can learn a provably robust classifier using standard learning methods and without adding a defense mechanism. More specifically, this paper addresses the problem of finding a robust classifier for a binary classification problem in which the data comes from an isotropic mixture of Gaussians with orthonormal cluster centers. First, we characterize the largest $\ell_2$-attack any classifier can defend against while maintaining high accuracy, and show the existence of optimal robust classifiers achieving this maximum $\ell_2$-robustness. Next, we show that given data from the orthonormal Gaussian mixture model, gradient flow on a two-layer network with a polynomial ReLU activation and without adversarial examples provably finds an optimal robust classifier.

Poster

#W-800

Nearly Optimal Sample Complexity for Learning with Label Proportions

Robert Busa-Fekete · Travis Dick · Claudio Gentile · Haim Kaplan · Tomer Koren · Uri Stemmer

We investigate Learning from Label Proportions (LLP), a partial information setting where examples in a training set are grouped into bags, and only aggregate label values in each bag are available. Despite the partial observability, the goal is still to achieve small regret at the level of individual examples. We give results on the sample complexity of LLP under square loss, showing that our sample complexity is essentially optimal. From an algorithmic viewpoint, we rely on carefully designed variants of Empirical Risk Minimization, and Stochastic Gradient Descent algorithms, combined with ad hoc variance reduction techniques. On one hand, our theoretical results improve in important ways on the existing literature on LLP, specifically in the way the sample complexity depends on the bag size. On the other hand, we validate our algorithmic solutions on several datasets, demonstrating improved empirical performance (better accuracy for less samples) against recent baselines.

Poster

#W-801

A Two-Stage Learning-to-Defer Approach for Multi-Task Learning

Yannis Montreuil · Shu Heng Yeo · Axel Carlier · Lai Xing Ng · Wei Tsang Ooi

The Two-Stage Learning-to-Defer (L2D) framework has been extensively studied for classification and, more recently, regression tasks. However, many real-world applications require solving both tasks jointly in a multi-task setting. We introduce a novel Two-Stage L2D framework for multi-task learning that integrates classification and regression through a unified deferral mechanism. Our method leverages a two-stage surrogate loss family, which we prove to be both Bayes-consistent and $(\mathcal{G}, \mathcal{R})$-consistent, ensuring convergence to the Bayes-optimal rejector. We derive explicit consistency bounds tied to the cross-entropy surrogate and the $L_1$-norm of agent-specific costs, and extend minimizability gap analysis to the multi-expert two-stage regime. We also make explicit how shared representation learning—commonly used in multi-task models—affects these consistency guarantees. Experiments on object detection and electronic health record analysis demonstrate the effectiveness of our approach and highlight the limitations of existing L2D methods in multi-task scenarios.

Spotlight Poster

#W-802

All-Purpose Mean Estimation over R: Optimal Sub-Gaussianity with Outlier Robustness and Low Moments Performance

Jasper Lee · Walter McKelvie · Maoyuan Song · Paul Valiant

We consider the basic statistical challenge of designing an "all-purpose" mean estimation algorithm that is recommendable across a variety of settings and models.Recent work by [Lee and Valiant 2022] introduced the first 1-d mean estimator whose error in the standard finite-variance+i.i.d. setting is optimal even in its constant factors; experimental demonstration of its good performance was shown by [Gobet et al. 2022].Yet, unlike for classic (but not necessarily practical) estimators such as median-of-means and trimmed mean, this new algorithm lacked proven robustness guarantees in other settings, including the settings of adversarial data corruption and heavy-tailed distributions with infinite variance.Such robustness is important for practical use cases.This raises a research question: is it possible to have a mean estimator that is robust, *without* sacrificing provably optimal performance in the standard i.i.d. setting?In this work, we show that Lee and Valiant's estimator is in fact an "all-purpose" mean estimator by proving:(A) It is robust to an $\eta$-fraction of data corruption, even in the strong contamination model; it has optimal estimation error $O(\sigma\sqrt{\eta})$ for distributions with variance $\sigma^2$.(B) For distributions with finite $z^\text{th}$ moment, for $z \in (1,2)$, it has optimal estimation error, matching the lower bounds of [Devroye et al. 2016] up to constants.We further show (C) that outlier robustness for 1-d mean estimators in fact implies neighborhood optimality, a notion of beyond worst-case and distribution-dependent optimality recently introduced by [Dang et al. 2023].Previously, such an optimality guarantee was only known for median-of-means, but now it holds also for all estimators that are simultaneously *robust* and *sub-Gaussian*, including Lee and Valiant's, resolving a question raised by Dang et al.Lastly, we show (D) the asymptotic normality and efficiency of Lee and Valiant's estimator, as further evidence for its performance across many settings.

Poster

#W-803

Generation from Noisy Examples

Ananth Raman · Vinod Raman

We continue to study the learning-theoretic foundations of generation by extending the results from Kleinberg and Mullainathan [2024] and Li et al. [2024] to account for noisy example streams. In the noiseless setting of Kleinberg and Mullainathan [2024] and Li et al. [2024], an adversary picks a hypothesis from a binary hypothesis class and provides a generator with a sequence of its positive examples. The goal of the generator is to eventually output new, unseen positive examples. In the noisy setting, an adversary still picks a hypothesis and a sequence of its positive examples. But, before presenting the stream to the generator, the adversary inserts a finite number of negative examples. Unaware of which examples are noisy, the goal of the generator is to still eventually output new, unseen positive examples. In this paper, we provide necessary and sufficient conditions for when a binary hypothesis class can be noisily generatable. We provide such conditions with respect to various constraints on the number of distinct examples that need to be seen before perfect generation of positive examples. Interestingly, for finite and countable classes we show that generatability is largely unaffected by the presence of a finite number of noisy examples.

Poster

#W-804

When do neural networks learn world models?

Tianren Zhang · Guanyu Chen · Feng Chen

Humans develop world models that capture the underlying generation process of data. Whether neural networks can learn similar world models remains an open problem. In this work, we present the first theoretical results for this problem, showing that in a multi-task setting, models with a low-degree bias provably recover latent data-generating variables under mild assumptions--even if proxy tasks involve complex, non-linear functions of the latents. However, such recovery is sensitive to model architecture. Our analysis leverages Boolean models of task solutions via the Fourier-Walsh transform and introduces new techniques for analyzing invertible Boolean transforms, which may be of independent interest. We illustrate the algorithmic implications of our results and connect them to related research areas, including self-supervised learning, out-of-distribution generalization, and the linear representation hypothesis in large language models.

Poster

#W-805

Meta Optimality for Demographic Parity Constrained Regression via Post-Processing

Kazuto Fukuchi

We address the regression problem under the constraint of demographic parity, a commonly used fairness definition. Recent studies have revealed fair minimax optimal regression algorithms, the most accurate algorithms that adhere to the fairness constraint. However, these analyses are tightly coupled with specific data generation models. In this paper, we provide meta-theorems that can be applied to various situations to validate the fair minimax optimality of the corresponding regression algorithms. Furthermore, we demonstrate that fair minimax optimal regression can be achieved through post-processing methods, allowing researchers and practitioners to focus on improving conventional regression techniques, which can then be efficiently adapted for fair regression.

Poster

#W-806

Sample-Optimal Agnostic Boosting with Unlabeled Data

Udaya Ghai · Karan Singh

Boosting provides a practical and provably effective framework for constructing accurate learning algorithms from inaccurate rules of thumb. It extends the promise of sample-efficient learning to settings where direct Empirical Risk Minimization (ERM) may not be implementable efficiently. In the realizable setting, boosting is known to offer this computational reprieve without compromising on sample efficiency. However, in the agnostic case, existing boosting algorithms fall short of achieving the optimal sample complexity. We highlight a previously unexplored avenue of improvement: unlabeled samples. We design a computationally efficient agnostic boosting algorithm that matches the sample complexity of ERM, given polynomially many additional unlabeled samples. In fact, we show that the total number of samples needed, unlabeled and labeled inclusive, is never more than that for the best known agnostic boosting algorithm -- so this result is never worse -- while only a vanishing fraction of these need to be labeled for the algorithm to succeed. This is particularly fortuitous for learning-theoretic applications of agnostic boosting, which often take place in the distribution-specific setting, where unlabeled samples can be availed for free. We also prove that the resultant guarantee is resilient against mismatch between the distributions governing the labeled and unlabeled samples. Finally, we detail an application of this result in reinforcement learning.

Poster

#W-807

On the Generalization Ability of Next-Token-Prediction Pretraining

Zhihao Li · Xue JIANG · Liyuan Liu · xuelin zhang · Hong Chen · Feng Zheng

Large language models (LLMs) have demonstrated remarkable potential in handling natural language processing (NLP) tasks and beyond. LLMs usually can be categorized as transformer decoder-only models (DOMs), utilizing Next-Token-Prediction (NTP) as their pre-training methodology. Despite their tremendous empirical successes, the theoretical understanding of how NTP pre-training affects the model's generalization behavior is lacking. To fill this gap, we establish the fine-grained generalization analysis for NTP pre-training based on Rademacher complexity, where the dependence between tokens is also addressed. Technically, a novel decomposition of Rademacher complexity is developed to study DOMs from the representation learner and the token predictor, respectively. Furthermore, the upper bounds of covering number are established for multi-layer and multi-head transformer-decoder models under the Frobenius norm, which theoretically pioneers the incorporation of mask matrix within the self-attention mechanism. Our results reveal that the generalization ability of NTP pre-training is affected quantitively by the number of token sequences $N$, the maximum length of sequence $m$, and the count of parameters in the transformer model $\Theta$. Additionally, experiments on public datasets verify our theoretical findings.

Poster

#W-808

Adversarial Robustness in Two-Stage Learning-to-Defer: Algorithms and Guarantees

Yannis Montreuil · Axel Carlier · Lai Xing Ng · Wei Tsang Ooi

Two-stage Learning-to-Defer (L2D) enables optimal task delegation by assigning each input to either a fixed main model or one of several offline experts, supporting reliable decision-making in complex, multi-agent environments. However, existing L2D frameworks assume clean inputs and are vulnerable to adversarial perturbations that can manipulate query allocation—causing costly misrouting or expert overload. We present the first comprehensive study of adversarial robustness in two-stage L2D systems. We introduce two novel attack strategies—untargeted and targeted—which respectively disrupt optimal allocations or force queries to specific agents. To defend against such threats, we propose SARD, a convex learning algorithm built on a family of surrogate losses that are provably Bayes-consistent and $(\mathcal{R}, \mathcal{G})$-consistent. These guarantees hold across classification, regression, and multi-task settings. Empirical results demonstrate that SARD significantly improves robustness under adversarial attacks while maintaining strong clean performance, marking a critical step toward secure and trustworthy L2D deployment.

Spotlight Poster

#W-809

Provable Benefits of Unsupervised Pre-training and Transfer Learning via Single-Index Models

Taj Jones-McCormick · Aukosh Jagannath · Subhabrata Sen

Unsupervised pre-training and transfer learning are commonly used techniques to initialize training algorithms for neural networks, particularly in settings with limited labeled data. In this paper, we study the effects of unsupervised pre-training and transfer learning on the sample complexity of high-dimensional supervised learning. Specifically, we consider the problem of training a single-layer neural network via online stochastic gradient descent. We establish that pre-training and transfer learning (under concept shift) reduce sample complexity by polynomial factors (in the dimension) under very general assumptions. We also uncover some surprising settings where pre-training grants exponential improvement over random initialization in terms of sample complexity.

Spotlight Poster

#W-810

Rapid Overfitting of Multi-Pass SGD in Stochastic Convex Optimization

Shira Vansover-Hager · Tomer Koren · Roi Livni

We study the out-of-sample performance of multi-pass stochastic gradient descent (SGD) in the fundamental stochastic convex optimization (SCO) model. While one-pass SGD is known to achieve an optimal $\Theta(1/\sqrt{n})$ excess population loss given a sample of size $n$, much less is understood about the multi-pass version of the algorithm which is widely used in practice. Somewhat surprisingly, we show that in the general non-smooth case of SCO, just a few epochs of SGD can already hurt its out-of-sample performance significantly and lead to overfitting. In particular, using a step size $\eta = \Theta(1/\sqrt{n})$, which gives the optimal rate after one pass, can lead to population loss as large as $\Omega(1)$ after just one additional pass. More generally, we show that the population loss from the second pass onward is of the order $\Theta(1/(\eta T) + \eta \sqrt{T})$, where $T$ is the total number of steps. These results reveal a certain phase-transition in the out-of-sample behavior of SGD after the first epoch, as well as a sharp separation between the rates of overfitting in the smooth and non-smooth cases of SCO. Additionally, we extend our results to with-replacement SGD, proving that the same asymptotic bounds hold after $O(n \log n)$ steps. Finally, we also prove a lower bound of $\Omega(\eta \sqrt{n})$ on the generalization gap of one-pass SGD in dimension $d = {\widetilde O}(n)$, improving on recent results of Koren et al. (2022) and Schliserman et al. (2024).

Poster

#W-811

Hypothesis Testing for Generalized Thurstone Models

Anuran Makur · Japneet Singh

In this work, we develop a hypothesis testing framework to determine whether pairwise comparison data is generated by an underlying *generalized Thurstone model* $\mathcal{T}_F$ for a given choice function $F$. While prior work has predominantly focused on parameter estimation and uncertainty quantification for such models, we address the fundamental problem of minimax hypothesis testing for $\mathcal{T}_F$ models. We formulate this testing problem by introducing a notion of separation distance between general pairwise comparison models and the class of $\mathcal{T}_F$ models. We then derive upper and lower bounds on the critical threshold for testing that depend on the topology of the observation graph. For the special case of complete observation graphs, this threshold scales as $\Theta((nk)^{-1/2})$, where $n$ is the number of agents and $k$ is the number of comparisons per pair. Furthermore, we propose a hypothesis test based on our separation distance, construct confidence intervals, establish time-uniform bounds on the probabilities of type I and II errors using reverse martingale techniques, and derive minimax lower bounds using information-theoretic methods. Finally, we validate our results through experiments on synthetic and real-world datasets.

Spotlight Poster

#W-812

Statistical Query Hardness of Multiclass Linear Classification with Random Classification Noise

Ilias Diakonikolas · Mingchen Ma · Lisheng Ren · Christos Tzamos

We study the task of Multiclass Linear Classification (MLC) in the distribution-free PAC model with Random Classification Noise (RCN). Specifically, the learner is given a set of labeled examples $(x, y)$, where $x$ is drawn from an unknown distribution on $R^d$ and the labels are generated by a multiclass linear classifier corrupted with RCN. That is, the label $y$ is flipped from $i$ to $j$ with probability $H_{ij}$ according to a known noise matrix $H$ with non-negative separation $\sigma: = \min_{i \neq j} H_{ii}-H_{ij}$. The goal is to compute a hypothesis with small 0-1 error. For the special case of two labels, prior work has given polynomial-time algorithms achieving the optimal error. Surprisingly, little is known about the complexity of this task even for three labels.As our main contribution, we show that the complexity of MLC with RCN becomes drastically different in the presence of three or more labels. Specifically, we prove super-polynomialStatistical Query (SQ) lower bounds for this problem. In more detail, even for three labels and constant separation, we give a super-polynomial lower bound on the complexity of any SQ algorithm achieving optimal error. For a larger number of labels and smaller separation, we show a super-polynomial SQ lower bound even for the weaker goal of achieving any constant factor approximation to the optimal loss or even beating the trivial hypothesis.

Poster

#W-813

System-Aware Unlearning Algorithms: Use Lesser, Forget Faster

Linda Lu · Ayush Sekhari · Karthik Sridharan

Machine unlearning addresses the problem of updating a machine learning model/system trained on a dataset $S$ so that the influence of a set of deletion requests $U \subseteq S$ on the unlearned model is minimized. The gold standard definition of unlearning demands that the updated model, after deletion, be nearly identical to the model obtained by retraining. This definition is designed for a worst-case attacker (one who can recover not only the unlearned model but also the remaining data samples, i.e., $S \setminus U$). Such a stringent definition has made developing efficient unlearning algorithms challenging. However, such strong attackers are also unrealistic. In this work, we propose a new definition, *system-aware unlearning*, which aims to provide unlearning guarantees against an attacker that can at best only gain access to the data stored in the system for learning/unlearning requests and not all of $S\setminus U$. With this new definition, we use the simple intuition that if a system can store less to make its learning/unlearning updates, it can be more secure and update more efficiently against a system-aware attacker. Towards that end, we present an exact system-aware unlearning algorithm for linear classification using a selective sampling-based approach, and we generalize the method for classification with general function classes. We theoretically analyze the tradeoffs between deletion capacity, accuracy, memory, and computation time.

Spotlight Poster

#W-814

Theoretical Limitations of Ensembles in the Age of Overparameterization

Niclas Dern · John Cunningham · Geoff Pleiss

Classic ensembles generalize better than any single component model. In contrast, recent empirical studies find that modern ensembles of (overparameterized) neural networks may not provide any inherent generalization advantage over single but larger neural networks. This paper clarifies how modern overparameterized ensembles differ from their classic underparameterized counterparts, using ensembles of random feature (RF) regressors as a basis for developing theory. In contrast to the underparameterized regime, where ensembling typically induces regularization and increases generalization, we prove with minimal assumptions that infinite ensembles of overparameterized RF regressors become pointwise equivalent to (single) infinite-width RF regressors, and finite width ensembles rapidly converge to single models with the same parameter budget. These results, which are exact for ridgeless models and approximate for small ridge penalties, imply that overparameterized ensembles and single large models exhibit nearly identical generalization. We further characterize the predictive variance amongst ensemble members, demonstrating that it quantifies the expected effects of increasing capacity rather than capturing any conventional notion of uncertainty. Our results challenge common assumptions about the advantages of ensembles in overparameterized settings, prompting a reconsideration of how well intuitions from underparameterized ensembles transfer to deep ensembles and the overparameterized regime.

Poster

#W-815

Mixture of Experts Provably Detect and Learn the Latent Cluster Structure in Gradient-Based Learning

Ryotaro Kawata · Kohsei Matsutani · Yuri Kinoshita · Naoki Nishikawa · Taiji Suzuki

Mixture of Experts (MoE), an ensemble of specialized models equipped with a router that dynamically distributes each input to appropriate experts, has achieved successful results in the field of machine learning. However, theoretical understanding of this architecture is falling behind due to its inherent complexity. In this paper, we theoretically study the sample and runtime complexity of MoE following the stochastic gradient descent when learning a regression task with an underlying cluster structure of single index models. On the one hand, we show that a vanilla neural network fails in detecting such a latent organization as it can only process the problem as a whole. This is intrinsically related to the concept of information exponent which is low for each cluster, but increases when we consider the entire task. On the other hand, with a MoE, we show that it succeeds in dividing the problem into easier subproblems by leveraging the ability of each expert to weakly recover the simpler function corresponding to an individual cluster. To the best of our knowledge, this work is among the first to explore the benefits of the MoE framework by examining its SGD dynamics in the context of nonlinear regression.

Poster

#W-816

A-PSRO: A Unified Strategy Learning Method with Advantage Metric for Normal-form Games

Yudong Hu · Haoran Li · Congying Han · Tiande Guo · Bonan Li · Mingqiang Li

Solving the Nash equilibrium in normal-form games with large-scale strategy spaces presents significant challenges. Open-ended learning frameworks, such as PSRO and its variants, have emerged as effective solutions. However, these methods often lack an efficient metric for evaluating strategy improvement, which limits their effectiveness in approximating equilibria.In this paper, we introduce a novel evaluative metric called Advantage, which possesses desirable properties inherently connected to the Nash equilibrium, ensuring that each strategy update approaches equilibrium. Building upon this, we propose the Advantage Policy Space Response Oracle (A-PSRO), an innovative unified open-ended learning framework applicable to both zero-sum and general-sum games. A-PSRO leverages the Advantage as a refined evaluation metric, leading to a consistent learning objective for agents in normal-form games. Experiments showcase that A-PSRO significantly reduces exploitability in zero-sum games and improves rewards in general-sum games, outperforming existing algorithms and validating its practical effectiveness.

Poster

#W-817

Finite-Time Convergence Rates in Stochastic Stackelberg Games with Smooth Algorithmic Agents

Eric Frankel · Kshitij Kulkarni · Dmitriy Drusvyatskiy · Sewoong Oh · Lillian Ratliff

Decision-makers often adaptively influence downstream competitive agents' behavior to minimize their cost, yet in doing so face critical challenges: $(i)$ decision-makers might not *a priori* know the agents' objectives; $(ii)$ agents might *learn* their responses, introducing stochasticity and non-stationarity into the decision-making process; and $(iii)$ there may be additional non-strategic environmental stochasticity. Characterizing convergence of this complex system is contingent on how the decision-maker controls for the tradeoff between the induced drift and additional noise from the learning agent behavior and environmental stochasticity. To understand how the learning agents' behavior is influenced by the decision-maker’s actions, we first consider a decision-maker that deploys an arbitrary sequence of actions which induces a sequence of games and corresponding equilibria. We characterize how the drift and noise in the agents' stochastic algorithms decouples from their optimization error. Leveraging this decoupling and accompanying finite-time efficiency estimates, we design decision-maker algorithms that control the induced drift relative to the agent noise. This enables efficient finite-time tracking of game theoretic equilibrium concepts that adhere to the incentives of the players' collective learning processes.

Poster

#W-818

Best of Both Worlds: Regret Minimization versus Minimax Play

Adrian Müller · Jon Schneider · EFSTRATIOS PANTELEIMON SKOULAKIS · Luca Viano · Volkan Cevher

In this paper, we investigate the existence of online learning algorithms with bandit feedback that simultaneously guarantee $O(1)$ regret compared to a given comparator strategy, and $\tilde{O}(\sqrt{T})$ regret compared to any fixed strategy, where $T$ is the number of rounds. We provide the first affirmative answer to this question whenever the comparator strategy supports every action. In the context of zero-sum games with min-max value zero, both in normal- and extensive form, we show that our results allow us to guarantee to risk at most $O(1)$ loss while being able to gain $\Omega(T)$ from exploitable opponents, thereby combining the benefits of both no-regret algorithms and minimax play.

Poster

#W-819

Computing Voting Rules with Improvement Feedback

Evi Micha · Vasilis Varsamis

Aggregating preferences under incomplete or constrained feedback is a fundamental problem in social choice and related domains. While prior work has established strong impossibility results for pairwise comparisons, this paper extends the inquiry to improvement feedback, where voters express incremental adjustments rather than complete preferences. We provide a complete characterization of the positional scoring rules that can be computed given improvement feedback. Interestingly, while plurality is learnable under improvement feedback—unlike with pairwise feedback—strong impossibility results persist for many other positional scoring rules. Furthermore, we show that improvement feedback, unlike pairwise feedback, does not suffice for the computation of any Condorcet-consistent rule. We complement our theoretical findings with experimental results, providing further insights into the practical implications of improvement feedback for preference aggregation.

Poster

#W-820

Should Decision-Makers Reveal Classifiers in Online Strategic Classification?

Han Shao · Shuo Xie · Kunhe Yang

Strategic classification addresses a learning problem where a decision-maker implements a classifier over agents who may manipulate their features in order to receive favorable predictions. In the standard model of online strategic classification, in each round, the decision-maker implements and publicly reveals a classifier, after which agents perfectly best respond based on this knowledge. However, in practice, whether to disclose the classifier is often debated---some decision-makers believe that hiding the classifier can prevent misclassification errors caused by manipulation. In this paper, we formally examine how limiting the agents' access to the current classifier affects the decision-maker's performance. Specifically, we consider an extended online strategic classification setting where agents lack direct knowledge about the current classifier and instead manipulate based on a weighted average of historically implemented classifiers. Our main result shows that in this setting, the decision-maker incurs $(1-\gamma)^{-1}$ or $k_{\text{in}}$ times more mistakes compared to the full-knowledge setting, where $k_{\text{in}}$ is the maximum in-degree of the manipulation graph (representing how many distinct feature vectors can be manipulated to appear as a single one), and $\gamma$ is the discount factor indicating agents' memory of past classifiers. Our results demonstrate how withholding access to the classifier can backfire and degrade the decision-maker's performance in online strategic classification.

Poster

#W-821

A General Representation-Based Approach to Multi-Source Domain Adaptation

Ignavier Ng · Yan Li · Zijian Li · Yujia Zheng · Guangyi Chen · Kun Zhang

A central problem in unsupervised domain adaptation is determining what to transfer from labeled source domains to an unlabeled target domain. To handle high-dimensional observations (e.g., images), a line of approaches use deep learning to learn latent representations of the observations, which facilitate knowledge transfer in the latent space. However, existing approaches often rely on restrictive assumptions to establish identifiability of the joint distribution in the target domain, such as independent latent variables or invariant label distributions, limiting their real-world applicability. In this work, we propose a general domain adaptation framework that learns compact latent representations to capture distribution shifts relative to the prediction task and address the fundamental question of what representations should be learned and transferred. Notably, we first demonstrate that learning representations based on all the predictive information, i.e., the label's Markov blanket in terms of the learned representations, is often underspecified in general settings. Instead, we show that, interestingly, general domain adaptation can be achieved by partitioning the representations of Markov blanket into those of the label's parents, children, and spouses. Moreover, its identifiability guarantee can be established. Building on these theoretical insights, we develop a practical, nonparametric approach for domain adaptation in a general setting, which can handle different types of distribution shifts.

Poster

#W-900

Projection Optimization: A General Framework for Multi-Objective and Multi-Group RLHF

Nuoya Xiong · Aarti Singh

Reinforcement Learning with Human Feedback (RLHF) is a widely used fine-tuning approach that aligns machine learning models, particularly Language Models (LMs) with human preferences. There are typically multiple objectives driving the preference, hence humans find it easier to express per-objective comparisons rather than a global preference between two choices, e.g. compare two papers on their novelty, clarity, correctness, etc. Multi-Objective RLHF aims to use per-objective preference feedback and achieve a Pareto optimal tradeoff among these objectives by aggregating them into a single unified objective for optimization. However, nearly all prior works rely on linear aggregation, which rules out policies that favor specific objectives such as the worst one. The only existing approach using non-linear aggregation is computationally expensive due to its reward-based nature and the need for retraining whenever the aggregation parameters change. In this work, we address this limitation by transforming the non-linear aggregation maximization problem into a series of sub-problems. Each sub-problem involves only linear aggregation, making it computationally efficient to solve. We further extend our framework to handle multi-group scenarios, where each group has distinct weights for the objectives. Our method enables achieving consensus or maximizing the aggregated objective across all groups. Theoretically, we demonstrate that our algorithmic framework achieves sublinear regret and can be easily adapted to a reward-free algorithm. Empirically, leveraging our theoretical insights, we propose a nearly training-free algorithm once the optimal policies for individual objectives are obtained.

Poster

#W-901

Fluctuations of the largest eigenvalues of transformed spiked Wigner matrices

Aro Lee · Ji Oon Lee

We consider a spiked random matrix model obtained by applying a function entrywise to a signal-plus-noise symmetric data matrix. We prove that the largest eigenvalue of this model, which we call a transformed spiked Wigner matrix, exhibits Baik-Ben Arous-Péché (BBP) type phase transition. We show that the law of the fluctuation converges to the Gaussian distribution when the effective signal-to-noise ratio (SNR) is above the critical number, and to the GOE Tracy-Widom distribution when the effective SNR is below the critical number. We provide precise formulas for the limiting distributions and also concentration estimates for the largest eigenvalues, both in the supercritical and the subcritical regimes.

Poster

#W-902

Heavy-Tailed Linear Bandits: Huber Regression with One-Pass Update

Jing Wang · Yu-Jie Zhang · Peng Zhao · Zhi-Hua Zhou

We study the stochastic linear bandits with heavy-tailed noise. Two principled strategies for handling heavy-tailed noise, truncation and median-of-means, have been introduced to heavy-tailed bandits. Nonetheless, these methods rely on specific noise assumptions or bandit structures, limiting their applicability to general settings. The recent work [Huang et al.2024] develop a soft truncation method via the adaptive Huber regression to address these limitations. However, their method suffers undesired computational cost: it requires storing all historical data and performing a full pass over these data at each round. In this paper, we propose a \emph{one-pass} algorithm based on the online mirror descent framework. Our method updates using only current data at each round, reducing the per-round computational cost from $\mathcal{O}(t \log T)$ to $\mathcal{O}(1)$ with respect to current round $t$ and the time horizon $T$, and achieves a near-optimal and variance-aware regret of order $\widetilde{\mathcal{O}}\big(d T^{\frac{1-\varepsilon}{2(1+\varepsilon)}} \sqrt{\sum_{t=1}^T \nu_t^2} + d T^{\frac{1-\varepsilon}{2(1+\varepsilon)}}\big)$ where $d$ is the dimension and $\nu_t^{1+\varepsilon}$ is the $(1+\varepsilon)$-th central moment of reward at round $t$.

Poster

#W-903

Linear Bandits with Partially Observable Features

Wonyoung Kim · Sungwoo PARK · Garud Iyengar · Assaf Zeevi · Min-hwan Oh

We study the linear bandit problem that accounts for partially observable features. Without proper handling, unobserved features can lead to linear regret in the decision horizon $T$, as their influence on rewards is unknown.To tackle this challenge, we propose a novel theoretical framework and an algorithm with sublinear regret guarantees.The core of our algorithm consists of: (i) feature augmentation, by appending basis vectors that are orthogonal to the row space of the observed features; and (ii) the introduction of a doubly robust estimator.Our approach achieves a regret bound of $\tilde{O}(\sqrt{(d + d_h)T})$, where $d$ denotes the dimension of the observed features, and $d_h$ represents the number of nonzero coefficients in the parameter associated with the reward component projected onto the subspace orthogonal to the row space spanned by the observed features.Notably, our algorithm requires no prior knowledge of the unobserved feature space, which may expand as more features become hidden.Numerical experiments confirm that our algorithm outperforms both non-contextual multi-armed bandits and linear bandit algorithms depending solely on observed features.

Poster

#W-904

Learning-Augmented Algorithms for MTS with Bandit Access to Multiple Predictors

Matei Gabriel Cosa · Marek Elias

Combining algorithms is one of the key techniques in learning-augmented algorithms.We consider the following problem:We are given $\ell$ heuristicsfor Metrical Task Systems (MTS), where each might be tailored to a different typeof input instances.While processing an input instance received online,we are allowed to query the action of only one of the heuristics at each time step.Our goal is to achieve performance comparable to the best of the given heuristics.The main difficulty of our setting comes from the fact thatthe cost paid by a heuristic at time $t$ cannot be estimatedunless the same heuristic was also queried at time $t-1$.This is related to Bandit Learning against memory boundedadversaries (Arora et al., 2012).We show how to achieve regret of $O(\text{OPT}^{2/3})$and prove a tight lower bound based on the constructionof Dekel et al. (2013).

Poster

#W-905

Ranking with Multiple Oracles: From Weak to Strong Stochastic Transitivity

Tao Jin · Yue Wu · Quanquan Gu · Farzad Farnoud

We study the problem of efficiently aggregating the preferences of items from multiple information sources (oracles) and infer the ranking under both the weak stochastic transitivity (WST) and the strong stochastic transitivity (SST) conditions. When the underlying preference model satisfies the WST condition, we propose an algorithm named RMO-WST, which has a bi-level design: at the higher level, it actively allocates comparison budgets to all undetermined pairs until the full ranking is recovered; at the lower level, it attempts to compare the pair of items and selects the more accurate oracles simultaneously. We prove that the sample complexity of RMO-WST is $ \tilde O( N\sum_{i=2}^{N}H_{\sigma^{-1}(i),{\sigma^{-1}(i-1)}} )$, where $N$ is the number of items to rank, $H$ is a problem-dependent hardness factor, and $\sigma^{-1}(i)$ represents the $i$-th best item. We also provide a tight lower bound that matches the upper bound of approximate ranking under the WST condition, answering a previously open problem. In addition, when the SST condition is satisfied, we propose an algorithm named RMO-SST, which can achieve an $\tilde{O}(\sum_{i=1}^{N} H_i \log(N))$ sample complexity. This outperforms the best-known sample complexity by a factor of $\log(N)$. The theoretical advantages of our algorithms are verified by empirical experiments in a simulated environment.

Poster

#W-906

Instance-Optimal Pure Exploration for Linear Bandits on Continuous Arms

Sho Takemori · Yuhei Umeda · Aditya Gopalan

This paper studies a pure exploration problem with linear bandit feedback on continuous arm sets, aiming to identify an $\epsilon$-optimal arm with high probability. Previous approaches for continuous arm sets have employed instance-independent methods due to technical challenges such as the infinite dimensionality of the space of probability measures and the non-smoothness of the objective function. This paper proposes a novel, tractable algorithm that addresses these challenges by leveraging a reparametrization of the sampling distribution and projected subgradient descent. However, this approach introduces new challenges related to the projection and reconstruction of the distribution from the reparametrization. We address these by focusing on the connection to the approximate Carath\'eodory problem. Compared to the original optimization problem on the infinite-dimensional space, our method is tractable, requiring only the solution of quadratic and fractional quadratic problems on the arm set. We establish an instance-dependent optimality for our method, and empirical results on synthetic data demonstrate its superiority over existing instance-independent baselines.

Poster

#W-907

Near Optimal Non-asymptotic Sample Complexity of 1-Identification

Zitian Li · Wang Chi Cheung

Motivated by an open direction in existing literature, we study the 1-identification problem, a fundamental multi-armed bandit formulation on pure exploration. The goal is to determine whether there exists an arm whose mean reward is at least a known threshold $\mu_0$, or to output \textsf{None} if it believes such an arm does not exist. The agent needs to guarantee its output is correct with probability at least $1-\delta$. Degenne & Koolen 2019 has established the asymptotically tight sample complexity for the 1-identification problem, but they commented that the non-asymptotic analysis remains unclear. We design a new algorithm Sequential-Exploration-Exploitation (SEE), and conduct theoretical analysis from the non-asymptotic perspective. Novel to the literature, we achieve near optimality, in the sense of matching upper and lower bounds on the pulling complexity. The gap between the upper and lower bounds is up to a polynomial logarithmic factor. The numerical result also indicates the effectiveness of our algorithm, compared to existing benchmarks.

Poster

#W-908

A Classification View on Meta Learning Bandits

Mirco Mutti · Jeongyeol Kwon · Shie Mannor · Aviv Tamar

Contextual multi-armed bandits are a popular choice to model sequential decision-making. *E.g.*, in a healthcare application we may perform various tests to asses a patient condition (exploration) and then decide on the best treatment to give (exploitation). When human design strategies, they aim for the exploration to be *fast*, since the patient's health is at stake, and easy to *interpret* for a physician overseeing the process. However, common bandit algorithms are nothing like that: The regret caused by exploration scales with $\sqrt{H}$ over $H$ rounds and decision strategies are based on opaque statistical considerations. In this paper, we use an original *classification view* to meta learn interpretable and fast exploration plans for a fixed collection of bandits $\mathbb{M}$. The plan is prescribed by an interpretable *decision tree* probing decisions' payoff to classify the test bandit. The test regret of the plan in the *stochastic* and *contextual* setting scales with $O (\lambda^{-2} C_{\lambda} (\mathbb{M}) \log^2 (MH))$, being $M$ the size of $\mathbb{M}$, $\lambda$ a separation parameter over the bandits, and $C_\lambda (\mathbb{M})$ a novel *classification-coefficient* that fundamentally links meta learning bandits with classification. Through a nearly matching lower bound, we show that $C_\lambda (\mathbb{M})$ inherently captures the complexity of the setting.

Poster

#W-909

High Probability Bound for Cross-Learning Contextual Bandits with Unknown Context Distributions

Ruiyuan Huang · Zengfeng Huang

Motivated by applications in online bidding and sleeping bandits, we examine the problem of contextual bandits with cross learning, where the learner observes the loss associated with the action across all possible contexts, not just the current round’s context. Our focus is on a setting where losses are chosen adversarially, and contexts are sampled i.i.d. from a specific distribution. This problem was first studied by Balseiro et al. (2019), who proposed an algorithm that achieves near-optimal regret under the assumption that the context distribution is known in advance. However, this assumption is often unrealistic. To address this issue, Schneider & Zimmert (2023) recently proposed a new algorithm that achieves nearly optimal expected regret. It is well-known that expected regret can be significantly weaker than high-probability bounds. In this paper, we present a novel, in-depth analysis of their algorithm and demonstrate that it actually achieves near-optimal regret with high probability. There are steps in the original analysis by Schneider & Zimmert (2023) that lead only to an expected bound by nature. In our analysis, we introduce several new insights. Specifically, we make extensive use of the weak dependency structure between different epochs, which was overlooked in previous analyses. Additionally, standard martingale inequalities are not directly applicable, so we refine martingale inequalities to complete our analysis.

Poster

#W-910

No-Regret is not enough! Bandits with General Constraints through Adaptive Regret Minimization

Martino Bernasconi · Matteo Castiglioni · Andrea Celli

In the bandits with knapsacks framework (BwK) the learner has $m$ resource-consumption (i.e., packing) constraints. We focus on the generalization of BwK in which the learner has a set of general long-term constraints. The goal of the learner is to maximize their cumulative reward, while at the same time achieving small cumulative constraints violations. In this scenario, there exist simple instances where conventional methods for BwK fail to yield sublinear violations of constraints. We show that it is possible to circumvent this issue by requiring the primal and dual algorithm to be weakly adaptive. Indeed, even without any information on the Slater's parameter $\rho$ characterizing the problem, the interaction between weakly adaptive primal and dual regret minimizers leads to a ``self-bounding'' behavior of dual variables. In particular, their norm remains suitably upper bounded across the entire time horizon even without explicit projection steps. By exploiting this property, we provide best-of-both-worlds guarantees for stochastic and adversarial inputs. In the first case, we show that the algorithm guarantees sublinear regret. In the latter case, we establish a tight competitive ratio of $\rho/(1+\rho)$. In both settings, constraints violations are guaranteed to be sublinear in time. Finally, this results allow us to obtain new result for the problem of contextual bandits with linear constraints, providing the first no-$\alpha$-regret guarantees for adversarial contexts.

Poster

#W-911

Bayesian Optimization from Human Feedback: Near-Optimal Regret Bounds

Aya Kayal · Sattar Vakili · Laura Toni · Da-shan Shiu · Alberto Bernacchia

Bayesian optimization (BO) with preference-based feedback has recently garnered significant attention due to its emerging applications. We refer to this problem as Bayesian Optimization from Human Feedback (BOHF), which differs from conventional BO by learning the best actions from a reduced feedback model, where only the preference between two actions is revealed to the learner at each time step. The objective is to identify the best action using a limited number of preference queries, typically obtained through costly human feedback. Existing work, which adopts the Bradley-Terry-Luce (BTL) feedback model, provides regret bounds for the performance of several algorithms. In this work, within the same framework we develop tighter performance guarantees. Specifically, we derive regret bounds of $\tilde{\mathcal{O}}(\sqrt{\Gamma(T)T})$, where $\Gamma(T)$ represents the maximum information gain—a kernel-specific complexity term—and $T$ is the number of queries. Our results significantly improve upon existing bounds. Notably, for common kernels, we show that the order-optimal sample complexities of conventional BO—achieved with richer feedback models—are recovered. In other words, the same number of preferential samples as scalar-valued samples is sufficient to find a nearly optimal solution.

Poster

#W-912

Dimension-Free Adaptive Subgradient Methods with Frequent Directions

Sifan Yang · Yuanyu Wan · Peijia Li · Yibo Wang · Xiao Zhang · Zhewei Wei · Lijun Zhang

In this paper, we investigate the acceleration of adaptive subgradient methods through frequent directions (FD), a widely-used matrix sketching technique. The state-of-the-art regret bound exhibits a _linear_ dependence on the dimensionality $d$, leading to unsatisfactory guarantees for high-dimensional problems. Additionally, it suffers from an $O(\tau^2 d)$ time complexity per round, which scales quadratically with the sketching size $\tau$. To overcome these issues, we first propose an algorithm named FTSL, achieving a tighter regret bound that is independent of the dimensionality. The key idea is to integrate FD with adaptive subgradient methods under _the primal-dual framework_ and add the cumulative discarded information of FD back. To reduce its time complexity, we further utilize fast FD to expedite FTSL, yielding a better complexity of $O(\tau d)$ while maintaining the same regret bound. Moreover, to mitigate the computational cost for optimization problems involving matrix variables (e.g., training neural networks), we adapt FD to Shampoo, a popular optimization algorithm that accounts for the structure of decision, and give a novel analysis under _the primal-dual framework_. Our proposed method obtains an improved dimension-free regret bound. Experimental results have verified the efficiency and effectiveness of our approaches.

Spotlight Poster

#W-913

A Generalization Result for Convergence in Learning-to-Optimize

Michael Sucker · Peter Ochs

Learning-to-optimize leverages machine learning to accelerate optimization algorithms. While empirical results show tremendous improvements compared to classical optimization algorithms, theoretical guarantees are mostly lacking, such that the outcome cannot be reliably assured. Especially, convergence is hardly studied in learning-to-optimize, because conventional convergence guarantees in optimization are based on geometric arguments, which cannot be applied easily to learned algorithms. Thus, we develop a probabilistic framework that resembles classical optimization and allows for transferring geometric arguments into learning-to-optimize. Based on our new proof-strategy, our main theorem is a generalization result for parametric classes of potentially non-smooth, non-convex loss functions and establishes the convergence of learned optimization algorithms to critical points with high probability. This effectively generalizes the results of a worst-case analysis into a probabilistic framework, and frees the design of the learned algorithm from using safeguards.

Poster

#W-914

Graph-Based Algorithms for Diverse Similarity Search

Piyush Anand · Piotr Indyk · Ravishankar Krishnaswamy · Sepideh Mahabadi · Vikas Raykar · Kirankumar Shiragur · Haike Xu

Nearest neighbor search is a fundamental data structure problem with many applications. Although the main objective of the data structure is to quickly report data points that are closest to a given query, it has long been noted that without additional constraints the reported answers can be redundant and/or duplicative. This issue is typically addressed in two stages: in the first stage, the algorithm retrieves a (large) number $r$ of points closest to the query, while in the second stage, the $r$ points are post-processed and a small subset is selected to maximize the desired diversity objective. Although popular, this method suffers from a fundamental efficiency bottleneck, as the set of points retrieved in the first stage often needs to be much larger than the final output. In this paper we present provably efficient algorithms for approximate nearest neighbor search with diversity constraints that bypass this two stage process. Our algorithms are based on popular graph-based methods, which allows us to ``piggy-back'' on the existing efficient implementations. These are the first graph-based algorithms for nearest neighbor search with diversity constraints. For data sets with low intrinsic dimension, our data structures report a diverse set of $k$ points approximately closest to the query, in time that only depends on $k$ and $\log \Delta$, where $\Delta$ is the ratio of the diameter to the closest pair distance in the data set. This bound is qualitatively similar to the best known bounds for standard (non-diverse) graph-based algorithms. Our experiments show that the search time of our algorithms is substantially lower than that using the standard two-stage approach.

Poster

#W-915

Learning Mixtures of Experts with EM: A Mirror Descent Perspective

Quentin Fruytier · Aryan Mokhtari · Sujay Sanghavi

Classical Mixtures of Experts (MoE) are Machine Learning models that involve partitioning the input space, with a separate "expert" model trained on each partition. Recently, MoE-based model architectures have become popular as a means to reduce training and inference costs. There, the partitioning function and the experts are both learnt jointly via gradient descent-type methods on the log-likelihood. In this paper we study theoretical guarantees of the Expectation Maximization (EM) algorithm for the training of MoE models. We first rigorously analyze EM for MoE where the conditional distribution of the target and latent variable conditioned on the feature variable belongs to an exponential family of distributions and show its equivalence to projected Mirror Descent with unit step size and a Kullback-Leibler Divergence regularizer. This perspective allows us to derive new convergence results and identify conditions for local linear convergence; In the special case of mixture of 2 linear or logistic experts, we additionally provide guarantees for linear convergence based on the signal-to-noise ratio. Experiments on synthetic and (small-scale) real-world data supports that EM outperforms the gradient descent algorithm both in terms of convergence rate and the achieved accuracy.

Spotlight Poster

#W-916

Decision Theoretic Foundations for Conformal Prediction: Optimal Uncertainty Quantification for Risk-Averse Agents

Shayan Kiyani · George Pappas · Aaron Roth · Hamed Hassani

A fundamental question in data-driven decision making is how to quantify the uncertainty of predictions to inform risk-sensitive downstream actions, as often required in domains such as medicine. We develop a decision-theoretic foundation linking prediction sets to risk-averse decision-making, addressing three questions: (1) What is the correct notion of uncertainty quantification for risk-averse decision makers? We prove that prediction sets are optimal for decision makers who wish to optimize their value at risk. (2) What is the optimal policy that a risk averse decision maker should use to map prediction sets to actions? We show that a simple max-min decision policy is optimal for risk-averse decision makers. Finally, (3) How can we derive prediction sets that are optimal for such decision makers? We provide an exact characterization in the population regime and a distribution free finite-sample construction. These insights leads to Risk-Averse Calibration (RAC), a principled algorithm that is both practical—exploiting black-box predictions to enhance downstream utility—and safe—adhering to user-defined risk thresholds. We experimentally demonstrate RAC's advantages in medical diagnosis and recommendation systems, showing that it substantially improves the trade-off between safety and utility, delivering higher utility than existing methods while avoiding critical errors.

Poster

#W-917

Actor-Critics Can Achieve Optimal Sample Efficiency

Kevin Tan · Wei Fan · Yuting Wei

Actor-critic algorithms have become a cornerstone in reinforcement learning (RL), leveraging the strengths of both policy-based and value-based methods. Despite recent progress in understanding their statistical efficiency, no existing work has successfully learned an $\epsilon$-optimal policy with a sample complexity of $O(1/\epsilon^2)$ trajectories with general function approximation when strategic exploration is necessary. We address this open problem by introducing a novel actor-critic algorithm that attains a sample-complexity of $O(dH^5 \log|\mathcal{A}|/\epsilon^2 + d H^4 \log|\mathcal{F}|/ \epsilon^2)$ trajectories, and accompanying $\sqrt{T}$ regret when the Bellman eluder dimension $d$ does not increase with $T$ at more than a $\log T$ rate. Here, $\mathcal{F}$ is the critic function class, and $\mathcal{A}$ is the action space. Our algorithm integrates optimism, off-policy critic estimation targeting the optimal Q-function, and rare-switching policy resets. We extend this to the setting of Hybrid RL, where we show that initializing the critic with offline data yields sample efficiency gains, and also provide a \textit{non-optimistic} provably efficient actor-critic algorithm, addressing another open problem in the literature. Numerical experiments support our theoretical findings.

Poster

#W-918

Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective

Zeyu Jia · Alexander Rakhlin · Tengyang Xie

Process and outcome supervision represent two fundamental approaches to reinforcement learning, especially for complex reasoning tasks in large language models. While process supervision offers intuitive advantages for long-term credit assignment, the precise relationship between these paradigms has remained an open question. Conventional wisdom suggests that outcome supervision is fundamentally more challenging due to the trajectory-level coverage problem, leading to significant investment in collecting fine-grained process supervision data.In this paper, we provide a possible theoretical resolution to this debate. Perhaps surprisingly, our main theorem shows that: under standard data coverage assumptions, reinforcement learning through outcome supervision is no more statistically difficult than through process supervision. At the core of this result lies the novel Change of Trajectory Measure Lemma---a powerful technical tool that bridges return-based trajectory measure and step-level distribution shift. Furthermore, for settings with access to a verifier or a rollout capability, we prove that any policy's advantage function can serve as an optimal process reward model, providing a simple yet powerful connection between outcome and process supervision. These findings suggest that the empirically observed performance gap between outcome and process supervision likely stems from algorithmic limitations rather than inherent statistical difficulties, potentially transforming how we approach data and algorithm design for reinforcement learning.

Spotlight Poster

#W-919

Catoni Contextual Bandits are Robust to Heavy-tailed Rewards

Chenlu Ye · Yujia Jin · Alekh Agarwal · Tong Zhang

Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni's estimator from robust statistics, and apply it to contextual bandits with general function approximation. When the variance of the reward at each round is known, we use a variance-weighted regression approach and establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range $R$ as well as the number of rounds $T$. For the unknown-variance case, we further propose a careful peeling-based algorithm and remove the need for cumbersome variance estimation. With additional dependence on the fourth moment, our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence. Moreover, we demonstrate the optimality of the leading-order term in our regret bound through a matching lower bound.

Poster

#W-920

Logarithmic Regret for Online KL-Regularized Reinforcement Learning

Heyang Zhao · Chenlu Ye · Wei Xiong · Quanquan Gu · Tong Zhang

Recent advances in Reinforcement Learning from Human Feedback (RLHF) have shown that KL-regularization plays a pivotal role in improving the efficiency of RL fine-tuning for large language models (LLMs). Despite its empirical advantage, the theoretical difference between KL-regularized RL and standard RL remains largely under-explored. While there is a recent line of work on the theoretical analysis of KL-regularized objective in decision making (Xiong et al., 2024a; Xie et al., 2024; Zhao et al., 2024), these analyses either reduce to the traditional RL setting or rely on strong coverage assumptions. In this paper, we propose an optimism-based KL-regularized online contextual bandit algorithm, and provide a novel analysis of its regret. By carefully leveraging the benign optimization landscape induced by the KL-regularization and the optimistic reward estimation, our algorithm achieves an $\mathcal{O}\big(\eta\log (N_{\mathcal R} T)\cdot d_{\mathcal R}\big)$ logarithmic regret bound, where $\eta, N_{\mathcal R},T,d_{\mathcal R}$ denote the KL-regularization parameter, the cardinality of the reward function class, number of rounds, and the complexity of the reward function class. Furthermore, we extend our algorithm and analysis to reinforcement learning by developing a novel decomposition over transition steps and also obtain a similar logarithmic regret bound.

Poster

#W-921

A Computationally Efficient Algorithm for Infinite-Horizon Average-Reward Linear MDPs

Kihyuk Hong · Ambuj Tewari

We study reinforcement learning in infinite-horizon average-reward settings with linear MDPs. Previous work addresses this problem by approximating the average-reward setting by discounted setting and employing a value iteration-based algorithm that uses clipping to constrain the span of the value function for improved statistical efficiency. However, the clipping procedure requires computing the minimum of the value function over the entire state space, which is prohibitive since the state space in linear MDP setting can be large or even infinite. In this paper, we introduce a value iteration method with efficient clipping operation that only requires computing the minimum of value functions over the set of states visited by the algorithm. Our algorithm enjoys the same regret bound as the previous work while being computationally efficient, with computational complexity that is independent of the size of the state space.

Poster Session

Poster Session 5 West

West Exhibition Hall B2-B3