News
Nine papers accepted at ICML 2026
Our group will present nine papers at ICML 2026, including one position paper. Congratulations!
Excited Pfaffians: Generalized Neural Wave Functions Across Structure and State (Spotlight)
(Nicholas Gao*, Till Grutschus*, Frank Noé, Stephan Günnemann)
A molecule lives on a ladder of energy levels, and most of chemistry — light absorption, photochemistry, spectroscopy — depends on far more than just the bottom rung. Neural-network wave functions have become a powerful tool for computing these levels, but climbing the ladder has been expensive: estimating how different excited states relate to one another demands ever more Monte Carlo samples, and current architectures need separate machinery for each new state. We propose two ideas that break this scaling. Multi-State Importance Sampling (MSIS) reuses samples across all states, keeping the cost of estimating state overlaps roughly constant as more states are added. Excited Pfaffians, inspired by Hartree-Fock, pack many electronic states into a single neural network — and the same network generalizes across different molecules and geometries. On the carbon dimer we train over 200× faster than the strongest comparable method while modeling 50% more states, and we are the first to use a neural network to recover every distinct energy level of the beryllium atom.
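To make the idea of reusing samples across states concrete, here is a minimal sketch, not the paper's MSIS estimator. It assumes samples are drawn from a density proportional to the sum of squared wave functions, so that one sample set serves every pair of states and the unknown normalization cancels. The one-dimensional Gaussian "states", the Metropolis sampler, and the names `sample_mixture` and `overlap_matrix` are all illustrative assumptions.

```python
import math
import random

def sample_mixture(states, num_samples, rng, step=1.0, burn_in=1000):
    """Metropolis sampling from rho(x) proportional to sum_k psi_k(x)^2 (assumed proposal)."""
    def density(x):
        return sum(psi(x) ** 2 for psi in states)

    x = 0.0
    d = density(x)
    samples = []
    for i in range(burn_in + num_samples):
        proposal = x + rng.uniform(-step, step)
        d_new = density(proposal)
        if rng.random() * d < d_new:  # Metropolis acceptance test
            x, d = proposal, d_new
        if i >= burn_in:
            samples.append(x)
    return samples

def overlap_matrix(states, samples):
    """Estimate normalized overlaps for all state pairs from one shared sample set."""
    k = len(states)
    sums = [[0.0] * k for _ in range(k)]
    for x in samples:
        values = [psi(x) for psi in states]
        total = sum(v * v for v in values)  # unnormalized sampling density
        for i in range(k):
            for j in range(k):
                sums[i][j] += values[i] * values[j] / total
    # The unknown normalization of rho cancels in this ratio.
    return [[sums[i][j] / math.sqrt(sums[i][i] * sums[j][j]) for j in range(k)]
            for i in range(k)]

# Toy example: two unnormalized Gaussians whose analytic overlap is exp(-1/4).
states = [lambda x: math.exp(-0.5 * x * x),
          lambda x: math.exp(-0.5 * (x - 1.0) ** 2)]
rng = random.Random(0)
samples = sample_mixture(states, 40000, rng)
S = overlap_matrix(states, samples)
```

In this toy setup, adding a state only adds terms to the sampling density; the number of sample sets stays at one no matter how many pairs are evaluated.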
Derivative Informed Learning of Exchange-Correlation Functionals
(Eike S. Eberhard, Luca A. Thiede, Abdul Aldossary, Andreas Burger, Nicholas Gao, Vignesh Bhethanabotla, Alán Aspuru-Guzik, Stephan Günnemann)
Machine-learned exchange-correlation functionals are usually trained to land at the right answer: right energy, right density, right self-consistent fixed point. We trained ours to know the shape of the energy landscape around it. By supervising gradients and curvatures on the manifold of valid electron densities, a spatially equivariant graph neural network can absorb the behavior of more expensive traditional functionals, and produce cleaner excited-state spectra along the way. Fixed-point matching tells the model where to stand. Derivative matching tells it which way the ground truly tilts.
Task-Awareness Improves LLM Generations and Uncertainty
(Tim Tomov*, Dominik Fuchsgruber*, Stephan Günnemann)
Are LLM outputs just language? When using large language models, we often don’t know what the answer will be, but we usually know its structure: numbers, lists, sets, and more. So why do we still operate purely in text space? What if we decoded in the space that actually matters for the task? We propose to model outputs directly in their underlying structure. This lets us turn model responses into a task-aware distribution and leverage the model’s full belief to compute Bayes-optimal predictions in that space, often synthesizing better answers than possible in text space. Uncertainty becomes structure-aware as well: not just variation in text, but distance-sensitive Bayes risk aligned with the task. Across a wide range of settings, this simple shift leads to consistently better answers and more meaningful uncertainty than standard decoding and UQ methods.
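For numeric answers, decoding in task space can be sketched as follows. The sketch assumes several sampled responses have already been parsed into numbers and treats them as an empirical distribution. The function name `bayes_optimal_numeric` and the two loss choices are illustrative and not taken from the paper.

```python
import statistics

def bayes_optimal_numeric(samples, loss="absolute"):
    """Return (prediction, risk) minimizing expected loss over the sampled answers.

    The median minimizes expected absolute error and the mean minimizes expected
    squared error; the reported risk is the minimized expected loss.
    """
    values = [float(v) for v in samples]
    if loss == "absolute":
        prediction = statistics.median(values)
        risk = statistics.fmean(abs(v - prediction) for v in values)
    elif loss == "squared":
        prediction = statistics.fmean(values)
        risk = statistics.fmean((v - prediction) ** 2 for v in values)
    else:
        raise ValueError(f"unknown loss: {loss}")
    return prediction, risk

# Hypothetical parsed answers from five sampled responses.
answers = [12, 13, 13, 14, 40]
median_answer, abs_risk = bayes_optimal_numeric(answers, "absolute")
mean_answer, sq_risk = bayes_optimal_numeric(answers, "squared")
```

Under squared error the prediction is the mean, a value that none of the sampled responses stated, which is one way an answer can be synthesized rather than selected. The risk is also sensitive to distance: the outlier 40 raises it far more than a disagreement between 13 and 14 would.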
A Coin Flip for Safety: LLM Judges Fail to Reliably Measure Adversarial Robustness
(Leo Schwinn*, Moritz Ladenburger*, Tim Beyer, Mehrnaz Mofakhami, Gauthier Gidel, Stephan Günnemann)
Automated *LLM-as-a-Judge* frameworks have become the de facto standard for scalable evaluation across natural language processing. For instance, in safety evaluation, these judges are relied upon to evaluate harmfulness in order to benchmark the robustness of safety against adversarial attacks. However, we show that existing validation protocols fail to account for substantial distribution shifts inherent to red-teaming: diverse victim models exhibit distinct generation styles, attacks distort output patterns, and semantic ambiguity varies significantly across jailbreak scenarios. Through a comprehensive audit using 6642 human-verified labels, we reveal that the unpredictable interaction of these shifts often causes judge performance to degrade to near random chance. This stands in stark contrast to the high human agreement reported in prior work. Crucially, we find that many attacks inflate their success rates by exploiting judge insufficiencies rather than eliciting genuinely harmful content. To enable more reliable evaluation, we propose ReliableBench, a benchmark of behaviors that remain more consistently judgeable, and JudgeStressTest, a dataset designed to expose judge failures.
Speculative Sampling For Faster Molecular Dynamics
(Arthur Kosmala, Stephan Günnemann, Meng Gao, Brandon M. Wood)
What if an untapped form of parallelism could make your molecular dynamics simulation several times faster – without sacrifices in accuracy? Our method of Langevin Speculative Dynamics (LSD) achieves this by breaking the traditional serial bottleneck of MD. By pairing a fast “draft” simulator with replicas of a slower, high-fidelity target model, LSD proposes a fast stream of time steps and verifies them in parallel – either accepting or overwriting them in a way that preserves the exact target distribution. The result is a substantial parallelism-driven speedup without any distributional bias. Formally, LSD extends the maximal coupling technique of speculative sampling, previously explored for LLMs and diffusion models, to second-order Langevin integrators. We confirm both theoretically and empirically that LSD-drawn trajectories are indeed identically distributed to a standard serial simulation with the target model, all while achieving a 3–9× acceleration across diverse systems.
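The accept-or-overwrite step can be illustrated with a standard maximal-coupling sampler. This is a one-dimensional toy, not the LSD integrator: it assumes the draft and target models each propose a Gaussian step with a shared standard deviation, and the names `normal_pdf` and `speculative_step` are hypothetical.

```python
import math
import random

def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def speculative_step(draft_mean, target_mean, std, rng):
    """Return (sample, accepted); the sample is distributed as N(target_mean, std^2)."""
    proposal = rng.gauss(draft_mean, std)
    q = normal_pdf(proposal, draft_mean, std)   # draft density at the proposal
    p = normal_pdf(proposal, target_mean, std)  # target density at the proposal
    if rng.random() * q <= p:                   # accept with probability min(1, p/q)
        return proposal, True
    # Overwrite: rejection-sample the residual, proportional to max(0, p - q).
    while True:
        candidate = rng.gauss(target_mean, std)
        p_c = normal_pdf(candidate, target_mean, std)
        q_c = normal_pdf(candidate, draft_mean, std)
        if rng.random() * p_c > q_c:
            return candidate, False

rng = random.Random(0)
results = [speculative_step(0.0, 0.3, 1.0, rng) for _ in range(20000)]
mean_sample = sum(x for x, _ in results) / len(results)
accept_rate = sum(ok for _, ok in results) / len(results)
```

The acceptance probability equals one minus the total variation distance between the two proposals, so a draft model that tracks the target closely is rarely overwritten.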
Certifying Graph Neural Networks Against Label and Structure Poisoning
(Lukas Gosch*, Xichuan Chen*, Yan Scholten*, Stephan Günnemann)
Robust machine learning for graph-structured data has made significant progress against test-time attacks, yet certified robustness to poisoning – where adversaries manipulate the training data – remains largely underexplored. For image data, state-of-the-art poisoning certificates rely on partitioning-and-aggregation schemes. However, we show that these methods fail when applied in the graph domain due to the inherent label and structure sparsity found in common graph datasets, making effective graph-partitioning difficult. To address this challenge, we propose a novel semi-supervised learning framework called deep Self-Training Graph Partition Aggregation (ST-GPA), which enriches each graph partition with informative pseudo-labels and synthetic edges, enabling effective certification against node-label and graph-structure poisoning under sparse conditions. Our method is architecture-agnostic, scales to large numbers of partitions, and consistently and significantly improves robustness guarantees against both label and structure poisoning across multiple benchmarks, while maintaining strong clean accuracy. Overall, our results establish a promising direction for certifiably robust learning on graph-structured data against poisoning under sparse conditions.
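The partitioning-and-aggregation idea mentioned above can be sketched with a plain majority-vote certificate. The sketch assumes that each poisoned training element influences at most one base model, that classes are labeled `0..num_classes-1`, and that ties go to the smaller label. It does not implement the self-training, pseudo-labels, or synthetic edges of ST-GPA, and the name `certified_vote` is illustrative.

```python
from collections import Counter

def certified_vote(votes, num_classes):
    """Aggregate base-model votes and return (predicted class, certified radius).

    The radius is the number of changed votes the prediction is guaranteed to
    survive, given that ties are broken toward the smaller class label.
    """
    counts = Counter(votes)
    # Majority class; ties are resolved toward the smaller label.
    top = min(range(num_classes), key=lambda c: (-counts[c], c))
    # Strongest competitor, counting the vote it saves by winning a tie.
    runner_up = max(
        (counts[c] + (1 if c < top else 0) for c in range(num_classes) if c != top),
        default=0,
    )
    radius = (counts[top] - runner_up) // 2
    return top, radius

# Ten hypothetical base models, one per partition, voting on a single node.
prediction, radius = certified_vote([1, 1, 1, 1, 1, 1, 0, 0, 2, 1], num_classes=3)
```

Each changed vote can move the gap by two (one vote lost, one gained), which is where the division by two comes from. When labels and edges are sparse, the base models are weak, the vote gaps shrink, and so does the radius.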
Byte Pair Encoding for Efficient Time Series Forecasting
(Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn)
Why should a flat signal and a highly dynamic one receive the same number of tokens? Existing time series tokenizers usually compress a fixed number of samples into each token, wasting computation on simple patterns such as long constant regions. Inspired by byte pair encoding, we introduce the first pattern-centric tokenizer for time series: a motif-based scheme that adaptively merges samples according to the patterns they contain. This turns frequent temporal motifs into compact tokens, reducing sequence length where possible while preserving detail where needed. We further exploit the discrete motif vocabulary and the continuity of time series through conditional decoding, a gradient-free post-hoc optimization that improves predictions without extra model cost. On recent time series foundation models, our approach delivers large gains in forecasting accuracy and efficiency, and our analysis shows that the resulting tokens capture meaningful temporal structure, from trends to statistical moments.
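A rough sketch of the byte-pair-encoding analogy for time series follows. It assumes uniform quantization into bins followed by greedy merging of the most frequent adjacent token pair. The paper's tokenizer may differ in both steps, conditional decoding is not covered, and the function names are illustrative.

```python
from collections import Counter

def quantize(series, low, high, num_bins):
    """Map each sample to a bin index in [0, num_bins - 1] (assumed uniform bins)."""
    width = (high - low) / num_bins
    return [min(num_bins - 1, max(0, int((x - low) / width))) for x in series]

def apply_merge(tokens, left, right):
    """Replace each non-overlapping occurrence of (left, right) by one longer token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
            merged.append(left + right)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def learn_merges(bin_ids, num_merges):
    """Greedily merge the most frequent adjacent pair; tokens are tuples of bin ids."""
    tokens = [(b,) for b in bin_ids]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:  # no repeated motif left to compress
            break
        merges.append((left, right))
        tokens = apply_merge(tokens, left, right)
    return tokens, merges

# A long flat region followed by a short dynamic one.
series = [0.0] * 8 + [0.1, 0.5, 0.9, 0.5]
bin_ids = quantize(series, low=0.0, high=1.0, num_bins=4)
tokens, merges = learn_merges(bin_ids, num_merges=3)
```

In this example the flat region collapses into a few long tokens while the dynamic tail keeps one token per sample, and the original bin sequence is recovered by concatenating the tuples.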
LLM-Safety Evaluations Lack Robustness (Position paper)
(Tim Beyer, Sophie Xhonneux, Simon Geisler, Gauthier Gidel, Leo Schwinn, Stephan Günnemann)
Current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.
Flow-Based Density Ratio Estimation for Intractable Distributions with Applications in Genomics
(Egor Antipov*, Alessandro Palma*, Lorenzo Consoli*, Stephan Günnemann, Andrea Dittadi, Fabian J. Theis)
Density Ratio Estimation (DRE) between conditional distributions is a fundamental problem in probabilistic modeling, underpinning tasks such as comparing the likelihoods of empirical samples under different covariates. A natural approach to approximating likelihood ratios is to separately estimate the numerator and denominator likelihoods with distinct normalizing flow models over the data points and then compose their ratio. However, this method requires simulating as many integrals as there are likelihoods to compare, so it is costly and compounds discretization errors along the way. In this work, we alleviate this problem by deriving a single ordinary differential equation that directly tracks the ratio as samples travel along generative trajectories from data back to noise, thereby making the ratio dynamics explicit by composing learned velocity fields and score functions from a conditional flow model. One simulation, two conditions, no redundant integral solves. On synthetic benchmarks, this beats both naive likelihood evaluation and score-based ratio estimators, with lower error and faster runtime as dimensionality grows. In single-cell genomics, we release scRatio, a tool grounded in our framework for cellular data analysis. In this applied setting, every cell carries a gene expression profile measured under some experimental condition. The payoff is immediate: scRatio scores individual cells for differential abundance across perturbations, detects batch effects before and after correction, identifies synergistic drug combinations, and stratifies patient-specific cytokine responses, all within a unified likelihood-based framework.
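To see why velocity fields and score functions together are enough to track a ratio along a single trajectory, the following derivation uses only the continuity equation. It is a generic sketch: the notation ($p_t^{(k)}$, $v_t^{(k)}$, $s_t^{(k)}$ for condition $c_k$, and $u_t$ for the field that drives the trajectory) is assumed here, and the paper's exact equation and time convention may differ.

```latex
% Continuity equation for condition c_k, written in log form:
\partial_t \log p_t^{(k)}(x)
  = -\nabla \cdot v_t^{(k)}(x) - v_t^{(k)}(x) \cdot s_t^{(k)}(x),
\qquad s_t^{(k)}(x) = \nabla_x \log p_t^{(k)}(x).

% Total derivative along any trajectory \dot{x}_t = u_t(x_t):
\frac{d}{dt} \log p_t^{(k)}(x_t)
  = -\nabla \cdot v_t^{(k)}(x_t)
    + \bigl(u_t(x_t) - v_t^{(k)}(x_t)\bigr) \cdot s_t^{(k)}(x_t).

% Subtracting the two conditions gives one ODE for the log-ratio
% r_t = p_t^{(1)} / p_t^{(2)}:
\frac{d}{dt} \log r_t(x_t)
  = -\nabla \cdot \bigl(v_t^{(1)} - v_t^{(2)}\bigr)(x_t)
    + \bigl(u_t - v_t^{(1)}\bigr)(x_t) \cdot s_t^{(1)}(x_t)
    - \bigl(u_t - v_t^{(2)}\bigr)(x_t) \cdot s_t^{(2)}(x_t).
```

If both conditions share the same noise distribution at $t = 0$, then $\log r_0 = 0$, and the log-ratio at the data point is the integral of the right-hand side over $t \in [0, 1]$ along that one trajectory, rather than the difference of two separately simulated likelihood integrals.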