Sat 8:00 a.m. - 9:00 a.m. | (Optional) Poster Setup 1

Sat 9:00 a.m. - 9:10 a.m. | Opening Remarks (Introduction)

Sat 9:10 a.m. - 9:40 a.m. | Been Kim - Agentic Interpretability and Neologism: What LLMs Can Offer Us (Invited Talk)

Sat 9:40 a.m. - 10:10 a.m. | Sarah Schwettmann - AI Investigators for Understanding AI Systems (Invited Talk)

Sat 10:10 a.m. - 10:25 a.m. | Contributed Talk 1: Detecting High-Stakes Interactions with Activation Probes | Alex McKenzie

Sat 10:25 a.m. - 10:40 a.m. | Contributed Talk 2: Actionable Interpretability with NDIF and NNsight | Emma Bortz

Sat 10:40 a.m. - 11:40 a.m. | Poster Session 1
→ Supernova Event Dataset: Interpreting Large Language Model's Personality through Critical Event Analysis | Pranav Agarwal · Ioana Ciucă
→ Concept Generation through Vision-Language Preference Learning for Understanding Neural Networks' Internal Representations | Aditya Taparia · Som Sagar · Ransalu Senanayake
→ DCBM: Data-Efficient Visual Concept Bottleneck Models | Katharina Prasse · Patrick Knab · Sascha Marton · Christian Bartelt · Margret Keuper
→ Koopman Autoencoders Learn Neural Representation Dynamics | Nishant Suresh Aswani · Saif Jabari
→ Detecting High-Stakes Interactions with Activation Probes | Alex McKenzie · Phil Blandfort · Urja Pawar · William Bankes · David Krueger · Ekdeep Singh Lubana · Dmitrii Krasheninnikov
→ Simplicity isn't as simple as you think | Aman Sinha · Timothee Mickus · Marianne Clausel · Mathieu Constant · Xavier Coubez
→ When Compression Breaks Safety, Interpretability Reveals The Fix | Vishnu Chhabra · Mahdi Khalili
→ Machine Learning from Explanations | Jiashu Tao · Reza Shokri
→ MSE-Break: Steering Internal Representations to Bypass Refusals in Large Language Models | Ashwin Saraswatula · Pranav Balabhadra · Pranav Dhinakar
→ Challenges in Understanding Modality Conflict in Vision-Language Models | Trang Nguyen · Jackson Michaels · Madalina Fiterau · David Jensen
→ Avoiding Leakage Poisoning: Concept Interventions Under Distribution Shifts | Mateo Espinosa Zarlenga · Gabriele Dominici · Pietro Barbiero · Zohreh Shams · Mateja Jamnik
→ Probabilistic Soundness Guarantees in LLM Reasoning Chains | Weiqiu You · Anton Xue · Shreya Havaldar · Delip Rao · Helen Jin · Chris Callison-Burch · Eric Wong
→ MIB: A Mechanistic Interpretability Benchmark | Aaron Mueller · Atticus Geiger · Sarah Wiegreffe · Dana Arad · Iván Arcuschin · Adam Belfki · Yik Siu Chan · Jaden Fiotto-Kaufman · Tal Haklay · Michael Hanna · Jing Huang · Rohan Gupta · Yaniv Nikankin · Hadas Orgad · Nikhil Prakash · Anja Reusch · Aruna Sankaranarayanan · Shun Shao · Alessandro Stolfo · Martin Tutek · Amir Zur · David Bau · Yonatan Belinkov
→ Single Feature Tips the Balance: Reducing Language Model Over-Refusal with Sparse Representations | Ailin Deng · Shaoliang Nie · Lijuan Liu · Xianjun Yang · Ujjwal Karn · Dat Huynh · Fulton Wang · Ying Xu · Madian Khabsa · Saghar Hosseini
→ A Single Direction of Truth: An Observer Model’s Linear Residual Probe Exposes and Steers Contextual Hallucinations | Charles O'Neill · Sviatoslav Chalnev · Rune Chi Zhao · Max Kirkby · Mudith Jayasekara
→ Interpretable Diffusion Models with B-cos Networks | Nicola Bernold · Moritz Vandenhirtz · Alice Bizeul · Julia Vogt
→ Taming Knowledge Conflicts in Language Models | Gaotang Li · Yuzhong Chen · Hanghang Tong
→ Benchmarking Deception Probes via Black-to-White Performance Boosts | Avi Parrack · Carlo Attubato · Stefan Heimersheim
→ Needle in a Patched Haystack: Evaluating Saliency Maps for Vision LLMs | Bastien Zimmermann · Matthieu Boussard
→ Actionable Interpretability via Causal Hypergraphs: Unravelling Batch Size Effects in Deep Learning | Zhongtian Sun · Anoushka Harit · Pietro Lió
→ Steering Language Model Refusal with Sparse Autoencoders | Kyle O'Brien · David Majercak · Xavier Fernandes · Richard Edgar · Blake Bullwinkel · Jingya Chen · Harsha Nori · Dean Carignan · Eric Horvitz · Forough Poursabzi-Sangdeh
→ Classifier Reconstruction Through Counterfactual-Aware Wasserstein Prototypes | Xuan Zhao · Zhuo Cao · Arya Bangun · Hanno Scharr · Ira Assent
→ SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders | Bartosz Cywiński · Kamil Deja
→ Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors | Jing Huang · Junyi Tao · Thomas Icard · Diyi Yang · Christopher Potts
→ Learning interpretable positional encodings in transformers depends on initialization | Takuya Ito · Luca Cocchi · Tim Klinger · Parikshit Ram · Murray Campbell · Luke Hearne
→ One-shot Optimized Steering Vectors Mediate Safety-relevant Behaviors in LLMs | Jacob Dunefsky · Arman Cohan
→ SAEs Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs | Aashiq Muhamed · Jacopo Bonato · Mona Diab · Virginia Smith
→ Probing for Arithmetic Errors in Language Models | Yucheng Sun · Alessandro Stolfo · Mrinmaya Sachan
→ Insights into a radiology-specialised multimodal large language model with sparse autoencoders | Kenza Bouzid · Shruthi Bannur · Felix Meissen · Daniel Coelho de Castro · Anton Schwaighofer · Javier Alvarez-Valle · Stephanie L Hyland
→ Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them | Neel Rajani · Aryo Gema · Seraphina Goldfarb-Tarrant · Ivan Titov
→ Looking Beyond the Top-1: Transformers Determine Top Tokens in Order | Daria Lioubashevski · Tomer Schlank · Gabriel Stanovsky · Ariel Goldstein
→ Spectral Scaling Laws in Language Models: How Effectively Do Feed-Forward Networks Use Their Latent Space? | Nandan Kumar Jha · Brandon Reagen
→ Pruning the Paradox: How CLIP’s Most Informative Heads Enhance Performance While Amplifying Bias | Avinash Madasu · Vasudev Lal · Phillip Howard
→ Reasoning-Finetuning Repurposes Latent Mechanisms in Base Models | Jake Ward · Chuqiao Lin
→ DeltaSHAP: Explaining Prediction Evolutions in Online Patient Monitoring with Shapley Values | Changhun Kim · Yechan Mun · Sangchul Hahn · Eunho Yang
→ Discovering Forbidden Topics in Language Models | Can Rager · Chris Wendler · Rohit Gandikota · David Bau

Sat 11:40 a.m. - 1:00 p.m. | Lunch Break + Poster Setup 2

Sat 1:00 p.m. - 2:00 p.m. | Poster Session 2
→ Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization | Joschka Braun · Carsten Eickhoff · Seyed Ali Bahrainian
→ Neural Concept Verifier: Scaling Prover-Verifier Games via Concept Encodings | Berkant Turan · Suhrab Asadulla · David Steinmann · Wolfgang Stammer · Sebastian Pokutta
→ Steering off Course: Reliability Challenges in Steering Language Models | Patrick Da Silva · Hari Sethuraman · Dheeraj Rajagopal · Hannaneh Hajishirzi · Sachin Kumar
→ Bayesian Influence Functions for Scalable Data Attribution | Philipp Kreer · Wilson Wu · Maxwell Adam · Zach Furman · Jesse Hoogland
→ Activation Steering in Generative Settings via Contrastive Causal Mediation Analysis | Aruna Sankaranarayanan · Amir Zur · Atticus Geiger · Dylan Hadfield-Menell
→ Understanding Synthetic Context Extension via Retrieval Heads | Xinyu Zhao · Fangcong Yin · Greg Durrett
→ Internal states before wait modulate reasoning patterns | Dmitrii Troitskii · Koyena Pal · Chris Wendler · Callum McDougall · Neel Nanda
→ Posthoc Disentanglement of Textual and Acoustic Features in Self-Supervised Speech Encoders | Hosein Mohebbi · Grzegorz Chrupała · Willem Zuidema · Afra Alishahi · Ivan Titov
→ Identifiable Steering via Sparse Autoencoding of Multi-Concept Shifts | Shruti Joshi · Andrea Dittadi · Sebastien Lachapelle · Dhanya Sridhar
→ Interpretable EEG-to-Image Generation with Semantic Prompts | Arshak Rezvani · Ali Akbari · Kosar Arani · Maryam Mirian · Emad Arasteh · Martin McKeown
→ REVIVING YOUR MNEME: Predicting The Side Effects of LLM Unlearning and Fine-Tuning via Sparse Model Diffing | Aly Kassem · Golnoosh Farnadi · Negar Rostamzadeh · Zhuan Shi
→ Evaluating Neuron Explanations: A Unified Framework with Sanity Checks | Tuomas Oikarinen · Ge Yan · Lily Weng
→ Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations | Pedro Ferreira · Wilker Aziz · Ivan Titov
→ Prediction Models That Learn to Avoid Missing Values | Lena Stempfle · Anton Matsson · Newton Mwai · Fredrik Johansson
→ MULTIGUARD: An Efficient Approach for AI Safety Moderation Across Modalities | Sahil Verma · Keegan Hines · Jeff Bilmes · Charlotte Siska · Luke Zettlemoyer · Hila Gonen · Chandan Singh
→ Do LLMs Lie About What They Use? Benchmark for Metacognitive Truthfulness in Large Language Models | Nhi Nguyen · Shauli Ravfogel · Rajesh Ranganath
→ Rethinking Crowd-Sourced Evaluation of Neuron Explanations | Tuomas Oikarinen · Ge Yan · Akshay Kulkarni · Lily Weng
→ Are language models aware of the road not taken? Token-level uncertainty and hidden state dynamics | Amir Zur · Eric Bigelow · Atticus Geiger · Ekdeep Singh Lubana
→ Convergent Linear Representations of Emergent Misalignment | Anna Soligo · Edward Turner · Senthooran Rajamanoharan · Neel Nanda
→ The Geometry of Forgetting: Analyzing Machine Unlearning through Local Learning Coefficients | Aashiq Muhamed · Virginia Smith
→ Interpreting the repeated token phenomenon in LLMs | Itay Yona · Jamie Hayes · Ilia Shumailov · Federico Barbero · Yossi Gandelsman
→ Beyond the ATE: Interpretable Modelling of Treatment Effects over Dose and Time | Julianna Piskorz · Krzysztof Kacprzyk · Harry Amad · Mihaela van der Schaar
→ Diversity-driven Data Selection for Language Model Tuning through Sparse Autoencoder | Xianjun Yang · Shaoliang Nie · Lijuan Liu · Suchin Gururangan · Ujjwal Karn · Rui Hou · Madian Khabsa · Yuning Mao
→ Unifying Image Counterfactuals and Feature Attributions with Latent-Space Adversarial Attacks | Jeremy Goldwasser · Giles Hooker
→ Towards eliciting latent knowledge from LLMs with mechanistic interpretability | Bartosz Cywiński · Emil Ryd · Senthooran Rajamanoharan · Neel Nanda
→ Explanation Design in Strategic Learning: Sufficient Explanations that Induce Non-harmful Responses | Kiet Vo · Siu Lun Chau · Masahiro Kato · Yixin Wang · Krikamol Muandet
→ Probing and Steering Evaluation Awareness of Language Models | Jord Nguyen · Hoang Khiem · Carlo Attubato · Felix Hofstätter
→ Sum-of-Parts: Self-Attributing Neural Networks with End-to-End Learning of Feature Groups | Weiqiu You · Helen Qu · Marco Gatti · Bhuvnesh Jain · Eric Wong
→ Why Do Some Inputs Break Low-Bit LLMs? | Ting-Yun Chang · Muru Zhang · Jesse Thomason · Robin Jia
→ Resilient Multi-Concept Steering in LLMs via Enhanced Sparse "Conditioned" Autoencoders | Saurish Srivastava · Kevin Zhu · Cole Blondin · Sean O'Brien
→ Automated Detection of Visual Attribute Reliance with a Self-Reflective Agent | Christy Li · Josep Camuñas · Jake Touchet · Jacob Andreas · Agata Lapedriza · Antonio Torralba · Tamar Rott Shaham

Sat 2:00 p.m. - 2:30 p.m. | Byron Wallace - What (if anything) can interpretability do for healthcare? (Invited Talk)

Sat 2:30 p.m. - 2:45 p.m. | Contributed Talk 3: Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations | Pedro Ferreira

Sat 2:45 p.m. - 3:00 p.m. | Contributed Talk 4: Internal Causal Mechanisms Robustly Predict Language Model Out-of-Distribution Behaviors | Jing Huang

Sat 3:00 p.m. - 3:30 p.m. | Coffee Break

Sat 3:30 p.m. - 4:00 p.m. | Eric Wong - Explanations for Experts via Guarantees and Domain Knowledge: From Attributions to Reasoning (Invited Talk)

Sat 4:00 p.m. - 4:45 p.m. | Panel on Actionable Interpretability | Fazl Barez · Naomi Saphra · Samuel Marks · Kyle Lo

Sat 4:45 p.m. - 5:00 p.m. | Closing Remarks (Conclusion)

- | No Training Wheels: Steering Vectors for Bias Correction at Inference Time (Poster) | Aviral Gupta · Armaan Sethi · Ameesh Sethi
- | What Kind of User Are You? Uncovering User Models in LLM Chatbots (Poster) | Yida Chen · Aoyu Wu · Trevor DePodesta · Catherine Yeh · Lena Armstrong · Kenneth Li · Nicholas Marin · Oam Patel · Jan Riecke · Shivam Raval · Olivia Seow · Martin Wattenberg · Fernanda Viégas
- | Are We Merely Justifying Results ex Post Facto? Quantifying Explanatory Inversion in Post-Hoc Model Explanations (Poster) | Zhen Tan · Song Wang · Yifan Li · Yu Kong · Jundong Li · Tianlong Chen · Huan Liu
- | Learning-Augmented Robust Algorithmic Recourse (Poster) | Kshitij Kayastha · Shahin Jabbari · Vasilis Gkatzelis
- | ExpProof: Operationalizing Explanations for Confidential Models with ZKPs (Poster) | Chhavi Yadav · Evan Laufer · Dan Boneh · Kamalika Chaudhuri
- | Towards Understanding the Mechanisms of Classifier-Free Guidance (Poster) | Xiang Li · Rongrong Wang · Qing Qu
- | Breaking Bad Tokens: Detoxification of LLMs Using Sparse Autoencoders (Poster) | Agam Goyal · Vedant Rathi · William Yeh · Yian Wang · Yuen Chen · Hari Sundaram
- | Disentangling and Steering Multilingual Representations: Layer-Wise Analysis and Cross-Lingual Control in Language Models (Poster) | Abir HARRASSE · Florent Draye · Bernhard Schölkopf · Zhijing Jin
- | Going Beyond Black-Box Models by Leveraging Behavioral Insights: an Intent-Aware Multi-Stage Recommendation Framework (Poster) | Yuyan Wang · Cheenar Banerjee · Samer Chucri · Minmin Chen
- | Beyond Sparsity: Improving Diversity in Sparse Autoencoders via Denoising Training (Poster) | Xiang Pan · Yifei Wang · Qi Lei
- | Persistent Demographic Information in X-ray Foundation Embeddings: a Risk for a Safe and Fair Deployment in Healthcare (Poster) | Filipe Santos · Aldo Marzullo · Alessandro Quarta · João Sousa · Susana Vieira · Leo Celi · Francesco Calimeri · Laleh Seyyed-Kalantari
- | BlueGlass: A Framework for Composite AI Safety (Poster) | Harshal Sanjay Nandigramwar · Qutub Syed Sha · Kay-Ulrich Scholl
- | Fine-Grained Visual Tokens Align with Localized Semantics (Poster) | Zhuohao Ni · Xiaoxiao Li
- | On the Effect of Uncertainty on Layer-wise Inference Dynamics (Poster) | Sunwoo Kim · Haneul Yoo · Alice Oh
- | The Blessing of Reasoning: LLM-Based Contrastive Explanations in Black-Box Recommender Systems (Poster) | Yuyan Wang · Pan Li · Minmin Chen
- | Joint Localization and Activation Editing for Low-Resource Fine-Tuning (Poster) | Wen Lai · Alexander Fraser · Ivan Titov
- | Multi-Modal Interpretable Graph for Competing Risk Prediction with Electronic Health Records (Poster) | Munib Mesinovic · Peter Watkinson · Tingting Zhu
- | MPF: Aligning and Debiasing Language Models post Deployment via Multi-Perspective Fusion (Poster) | Xin Guan · Pei-Hsin Lin · Zekun Wu · Ze Wang · Ruibo Zhang · Emre Kazim · Adriano Koshiyama
- | Steering Self-Evaluation: Interpreting LLM's Reasoning Across Domains and Languages (Poster) | Praveen Hegde
- | Why Do Metrics Think That? Towards Understanding Large Language Models as Machine Translation Evaluators (Poster) | Runzhe Zhan · Xinyi Yang · Junchao Wu · Lidia Chao · Derek Wong
- | Adversarial Cooperative Rationalization: The Risk of Spurious Correlations in Even Clean Datasets (Poster) | Wei Liu · Zhongyu Niu · Lang Gao · Zhiying Deng · Jun Wang · Haozhao Wang · Ruixuan Li
- | Transferring Features Across Language Models With Model Stitching (Poster) | Alan Chen · Jack Merullo · Alessandro Stolfo · Ellie Pavlick