Fabian Leinen, M.Sc.
Technische Universität München
Informatik 4 - Lehrstuhl für Software & Systems Engineering (Prof. Pretschner)
Postal address
Boltzmannstr. 3
85748 Garching b. München
- Homepage
- E-Mail: fabian.leinen@tum.de
Research Focus
Regression testing ensures that changes to a software system do not introduce new bugs or reintroduce old ones. It rests on the assumption that tests are deterministic, meaning they consistently pass or fail as long as all factors the tester seeks to control remain the same. Tests that violate this assumption are called flaky tests. In continuous integration (CI), developers integrate their code changes multiple times a day, making comprehensive automated test suites a necessity. Since virtually every test exhibits some level of flakiness, flaky failures inevitably occur, hampering CI and causing high costs.
My research focuses on understanding flaky tests in CI development processes and on supporting developers in addressing them.
Thesis Supervision
Open Topics
Please get in touch if you are interested in guided research or need supervision for your master's or bachelor's thesis. Throughout the entire research period, we will meet weekly to discuss results, issues, and next steps. You will receive detailed feedback on parts of your thesis before submission, giving you the opportunity to further improve your work. Additionally, we will conduct a dry run of your final presentation, where you will receive feedback to refine both your delivery and your content.
When applying, please include your CV and transcript of records to help me get a picture of your background and interests. Please also indicate which topic you are interested in and when you would like to start.
MA/GR: Fix One, Fix Many: Clustering Flaky Tests by Shared Root Causes in CI
Master/Guided Research
Flaky tests are a major cost driver in CI, yet developers typically address them one at a time. In practice, many flaky tests fail for the same underlying reason, such as a shared dependency on an unreliable external service or a common sensitivity to resource constraints. If these groups can be identified automatically, developers can fix the shared root cause once and resolve multiple flaky tests simultaneously. This thesis investigates whether co-failure patterns and temporal correlations in CI test execution data can reliably identify such groups, using a dataset of 8.8 billion test executions from four industrial-scale projects. A central challenge is evaluating whether the resulting clusters are meaningful, which we address at three levels: structural quality, semantic correctness validated against independent signals such as repair commits and error messages, and practical usefulness beyond what developers already know.
What you should bring:
- Programming skills in Python for data processing and analysis
- Strong background in mathematics, particularly distance metrics, clustering algorithms, and statistical testing (minor in Mathematics or strong interest in theoretical/mathematical work for a practical problem)
- Comfort working with large datasets and messy real-world data
- Ability to think critically about what constitutes meaningful evidence, and creativity in designing evaluation strategies for open-ended problems
What you can learn:
- Designing multi-level evaluation frameworks for problems with fuzzy ground truth
- Working with real-world CI data from major open-source projects (Chromium, GitLab, Playwright) at a scale of 8.8 billion test executions
- Contributing to an active research area with direct practical relevance
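To illustrate the kind of analysis this topic involves, here is a minimal sketch of clustering tests by co-failure patterns. All test names, failure sets, and the similarity threshold are invented for illustration; they are not taken from the actual dataset, and the greedy single-link approach is just one possible starting point, not a prescribed method.

```python
# Hypothetical sketch: grouping flaky tests whose failures co-occur in CI.
# All data below is invented for illustration.
from itertools import combinations

# For each test: the set of CI build IDs in which it failed flakily.
flaky_failures = {
    "test_login":    {1, 3, 5, 8},
    "test_checkout": {1, 3, 5, 9},   # fails mostly together with test_login
    "test_search":   {2, 4, 7},
    "test_filter":   {2, 4, 6, 7},   # fails mostly together with test_search
}

def jaccard(a, b):
    """Similarity of two failure sets: |intersection| / |union|."""
    return len(a & b) / len(a | b)

def cluster(failures, threshold=0.5):
    """Greedy single-link clustering: repeatedly merge two clusters that
    contain a pair of tests whose co-failure similarity meets the threshold."""
    clusters = [{name} for name in failures]
    merged = True
    while merged:
        merged = False
        for c1, c2 in combinations(clusters, 2):
            if any(jaccard(failures[a], failures[b]) >= threshold
                   for a in c1 for b in c2):
                clusters.remove(c1)
                clusters.remove(c2)
                clusters.append(c1 | c2)
                merged = True
                break
    return clusters

print(sorted(sorted(c) for c in cluster(flaky_failures)))
# Two clusters emerge: the login/checkout pair and the search/filter pair.
```

The hard part of the topic is not producing such clusters but evaluating them, which is exactly where the three-level evaluation described above comes in.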
MA/GR: Finding Needles in Haystacks: Identifying Flakiness-Relevant Commits at Scale
Master/Guided Research
Flaky tests waste significant developer time in CI, yet when developers want to repair them, they often lack insight into what caused the flakiness. This thesis aims to automatically identify flakiness-relevant commits, i.e., changes that introduced, increased, or otherwise contributed to a test's flaky behavior, using machine learning on readily available CI and VCS data from a large dataset of 8.8 billion test executions across four industrial-scale projects. The practical impact would be substantial: pointing developers directly to the code changes that matter when repairing flaky tests. A central challenge is the evaluation, as ground truth for flakiness-relevant commits is inherently scarce and fuzzy.
What you should bring:
- Programming skills in Python for data processing and feature engineering
- Solid understanding of git (commit graphs, branching, blame, diffs)
- Interest in machine learning, particularly feature engineering and model evaluation
- Comfort working with large datasets and messy real-world data
- Ability to think critically about what constitutes meaningful evidence, and creativity in designing evaluation strategies for open-ended problems
What you can learn:
- Feature engineering on graph-structured data combining heterogeneous sources (CI and VCS)
- How industrial CI/CD pipelines work at scale, with data from major open-source projects (Chromium, GitLab, Playwright)
- Designing evaluations for problems where ground truth is scarce
- Contributing to an active research area with direct practical relevance
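As a rough illustration of one possible signal for this topic, the sketch below scores each commit by how much a test's flaky-failure rate changes after it. The commit IDs, outcomes, and scoring rule are all invented for illustration; a real approach would combine many such features from CI and VCS data.

```python
# Hypothetical sketch: flagging a commit as flakiness-relevant when the
# flaky-failure rate of a test rises sharply after it. Data is invented.

# Chronological CI runs of one test: (commit built, outcome).
runs = [
    ("c1", "pass"), ("c1", "pass"), ("c1", "pass"),
    ("c2", "pass"), ("c2", "pass"),
    ("c3", "flaky_fail"), ("c3", "pass"), ("c3", "flaky_fail"),
    ("c4", "flaky_fail"), ("c4", "pass"),
]

def failure_rate(outcomes):
    """Fraction of flaky failures among the given outcomes."""
    return sum(o == "flaky_fail" for o in outcomes) / len(outcomes) if outcomes else 0.0

def score_commits(runs):
    """Score each commit by the increase in flaky-failure rate from the
    runs before it to the runs at and after it."""
    order = []
    for commit, _ in runs:
        if commit not in order:
            order.append(commit)
    scores = {}
    for commit in order:
        idx = next(i for i, (c, _) in enumerate(runs) if c == commit)
        before = [o for _, o in runs[:idx]]
        after = [o for _, o in runs[idx:]]
        scores[commit] = failure_rate(after) - failure_rate(before)
    return scores

scores = score_commits(runs)
print(max(scores, key=scores.get))  # the commit with the largest rate increase
```

In this toy history the test never fails before commit c3 and fails frequently afterwards, so c3 receives the highest score. The evaluation challenge mentioned above is that, unlike in this constructed example, real data rarely provides ground truth for which commit is actually responsible.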
Ongoing and Assigned Topics
| Topic | Type |
|---|---|
| Evaluation of Test Optimization Models for Detecting Flaky Test Failures | Master |
Finished Topics
| Topic | Type |
|---|---|
| Automatically Detecting Flaky Tests in CI | Master |
| Supporting Developers in Repairing Flaky Tests in CI | Master |
| Supporting Developers in Repairing Flaky Tests in CI | Master |
| Comparing Approaches for Flaky Test Research | Bachelor |
| Enhanced Debugging of Flaky Test Cases: Automating State Preservation for Efficient Failure Analysis | Bachelor |
| A Study on Flaky Tests | Master |
| Automatically Detecting Flaky End-to-End Test Failures Using Code Coverage (with RW) | Master |
| Reducing Effort for Flaky Test Detection Through Resource Limitation | Bachelor |
| Analyzing the Effectiveness of Rerunning Tests for Detecting Flaky UI Tests | Bachelor |
| Reducing Effort for Flaky Test Detection through Dynamic Program Analysis (with DE; Rohde & Schwarz Best Bachelor Award) | Bachelor |
| Determining Root Causes of Flaky Tests Using System Call Analysis (with DE) | Master |
Publications
Leinen, F., Perathoner, A., & Pretschner, A. (2024). On the Impact of Hitting System Resource Limits on Test Flakiness. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering: International Flaky Tests Workshop (FTW).
Leinen, F., Elsner, D., Pretschner, A., Stahlbauer, A., Sailer, M., & Jürgens, E. (2024). Cost of Flaky Tests in Continuous Integration: An Industrial Case Study. In Proceedings of the 17th IEEE International Conference on Software Testing, Verification and Validation (accepted for publication).
Wuersching, R.*, Elsner, D.*, Leinen, F., Pretschner, A., Grueneissl, G., Neumeyr, T., & Vosseler, T. (2023). Severity-Aware Prioritization of System-Level Regression Tests in Automotive Software. In Proceedings of the 16th IEEE International Conference on Software Testing, Verification and Validation.
Teaching
| Semester | Title | Type |
|---|---|---|
| SS 2026 | Seminar: Software Quality | Seminar |
| WS 2025/26 | Seminar: Software Quality | Seminar |
| | Advanced Topics of Software Engineering | |
| SS 2025 | Seminar: Software Quality | Seminar |
| WS 2024/25 | Seminar: Software Quality | Seminar |
| SS 2024 | Seminar: Software Quality | Seminar |
| WS 2023/24 | Advanced Topics of Software Engineering | Exercise + Tutorial |
| | Seminar: Software Quality | Seminar |
| SS 2023 | Robust DevOps: Exploring Stability Factors for UI Tests | Practicum |
| | Seminar: Software Quality | Seminar |
| WS 2022/23 | Seminar: Software Quality | Seminar |
| SS 2022 | Advanced Topics of Software Testing | Tutorial |
| | Seminar: Software Quality | Seminar |
| WS 2021/22 | Introduction to Informatics | Tutorial |