Automating Software Maintainability Evaluations

Project Description

This research project supports the idea that machine-learned models, particularly models based on static code metrics, can contribute to automated software maintainability assessments. 

It is critical for software vendors to establish continuous quality management to avoid cost explosions. However, before the quality of software systems can be controlled, it has to be assessed. In the case of maintainability, this assessment often happens through manual expert reviews. The goal of this project is to establish an automated evaluation process that is based on expert judgment. In contrast to most other work using expert assessments, we investigate in depth which aspects experts take into account during quality assessments. 

Currently, we are focusing on several research questions:
-    Which inputs are suitable predictors for software maintainability?
-    Which algorithms perform best at predicting the experts' judgment? How precisely can ML classifiers identify hard-to-maintain code in comparison to human experts?
-    How can maintainability ratings predicted by ML models help to improve the state of practice? How can the predicted ratings be integrated into industrial assessment processes?

Contributions and Results

This project set out to assist maintainability assessments by human experts with machine-learned classifications of source code and its maintainability. To develop reliable algorithms, we crafted a software maintainability dataset that contains 519 human-labeled data points. In total, 70 professionals assessed code from 9 different projects and rated its readability, comprehensibility, modularity, complexity, and overall maintainability. Every code file was analyzed by at least three experts. Using the consensus of the experts as the ground truth, we developed several approaches to predict the maintainability of source code. 
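To make the structure of such a dataset concrete, the following sketch shows one possible tabular layout with one row per individual expert rating, together with a sanity check that every file was rated by at least three experts. The column names and values are assumptions made only for this illustration; they are not the dataset's actual schema.

    import pandas as pd

    # Hypothetical layout: one row per individual expert rating.
    # Column names and values are assumptions for this sketch, not the dataset's schema.
    ratings = pd.DataFrame({
        "file":              ["A.java"] * 3 + ["B.java"] * 3,
        "expert":            [1, 2, 3, 1, 4, 5],
        "readability":       [1, 2, 1, 4, 3, 4],
        "comprehensibility": [2, 2, 1, 4, 4, 3],
        "modularity":        [1, 1, 2, 3, 4, 4],
        "complexity":        [2, 1, 1, 4, 4, 4],
        "overall":           [1, 2, 1, 4, 4, 3],  # 1 = easy ... 4 = hard to maintain
    })

    # Mirror the setup described above: every code file is rated by at least three experts.
    raters_per_file = ratings.groupby("file")["expert"].nunique()
    assert (raters_per_file >= 3).all()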

Among other inputs, static code metrics were used as features for the algorithms. Simple structural metrics such as the size of a class and its methods were found to yield the highest predictive power for maintainability. Considering a four-part ordinal scale (code is either easy, rather easy, rather hard, or hard to maintain), we achieve a Matthews Correlation Coefficient of 0.525. Analyzing the agreement between the individual expert ratings and the consensus of all experts reveals a Matthews Correlation Coefficient that is only 0.004 higher. In summary, our models achieve the same level of prediction performance with respect to the consensus as an average human expert.
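As a minimal, self-contained sketch of this setup, the snippet below trains a classifier on stand-in structural metrics and scores it with the multiclass Matthews Correlation Coefficient. The synthetic features, the random-forest model, and the single train/test split are assumptions for illustration only; the project's actual feature set, models, and evaluation protocol are documented in the publications listed below.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import matthews_corrcoef
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Synthetic stand-in for static code metrics (e.g., class size, method size,
    # nesting depth) -- purely illustrative, not the project's real data.
    X = rng.normal(size=(519, 5))
    # Four-part ordinal label: 1 = easy ... 4 = hard to maintain.
    y = np.clip((X[:, 0] + 0.5 * X[:, 1]
                 + rng.normal(scale=0.5, size=519)).round().astype(int) + 2, 1, 4)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y)

    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

    # matthews_corrcoef generalizes to the multiclass case, so it can score
    # the four-class ordinal prediction directly.
    print("MCC:", matthews_corrcoef(y_test, clf.predict(X_test)))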

Currently, we are comparing the metric-based approaches with machine learning techniques based on other input types. Furthermore, we have deployed a prototype at our industrial partner. It incorporates the trained models, explanations of the predicted labels, and a web-based frontend. An ongoing study indicates that our models can identify hard-to-maintain code with high precision and can create easy-to-understand overviews of the overall state of a system.
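To give an idea of what such a prototype can look like, the sketch below wraps a previously trained model in a small web endpoint that returns the predicted label together with a very coarse explanation based on global feature importances. The Flask endpoint, the model file name, the metric names, and this explanation technique are assumptions made for this illustration and do not describe the deployed prototype.

    from flask import Flask, jsonify, request
    import joblib

    app = Flask(__name__)

    # Assumed artifacts: a previously trained scikit-learn model and the
    # metric names it was trained on (hypothetical names for this sketch).
    model = joblib.load("maintainability_model.joblib")
    FEATURES = ["class_loc", "method_loc", "nesting_depth", "num_methods", "fan_out"]
    LABELS = ["easy", "rather easy", "rather hard", "hard"]

    @app.post("/predict")
    def predict():
        metrics = request.get_json()                  # e.g. {"class_loc": 812, ...}
        x = [[metrics[name] for name in FEATURES]]
        label = LABELS[int(model.predict(x)[0]) - 1]  # assuming ordinal labels 1..4
        # Very coarse "explanation": the globally most influential metrics.
        importance = sorted(zip(FEATURES, model.feature_importances_),
                            key=lambda p: p[1], reverse=True)[:3]
        return jsonify({"label": label, "top_metrics": [name for name, _ in importance]})

    if __name__ == "__main__":
        app.run()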

A Software Maintainability Dataset

Before controlling the quality of software systems, we need to assess it. Current automatic approaches have received criticism because their results often do not reflect the opinion of experts or are biased towards a small group of experts. We use the judgments of a significantly larger expert group to create a robust maintainability dataset. In a large-scale survey, 70 professionals assessed code from 9 open- and closed-source Java projects with a combined size of 1.4 million source lines of code. The assessment covers an overall judgment as well as several subdimensions of maintainability. Among these subdimensions, we present evidence that understandability is valued most by the experts. Our analysis also reveals that disagreement between evaluators occurs frequently; significant dissent was detected in 17% of the cases. To overcome these differences, we present a method to determine a consensus, i.e. the most probable true label. The resulting dataset contains the consensus of the experts for more than 500 Java classes. This corpus can be used to learn precise and practical classifiers for software maintainability.
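The actual aggregation procedure is described in the ICSME 2020 paper referenced below; as a stand-in, the sketch here shows the simplest conceivable variant, a per-file majority vote with an explicit disagreement flag. The threshold and the tie handling are assumptions made only for this example, not the paper's procedure.

    from collections import Counter

    def aggregate(ratings):
        """Majority-vote consensus for one file's expert ratings (1 = easy ... 4 = hard).

        Returns the most frequent label and a flag indicating notable disagreement.
        This is only an illustrative baseline; the dataset's consensus labels were
        derived with the more elaborate procedure described in the paper.
        """
        counts = Counter(ratings)
        label, votes = counts.most_common(1)[0]
        disagreement = votes / len(ratings) <= 0.5  # no clear majority among the experts
        return label, disagreement

    print(aggregate([1, 1, 2]))   # -> (1, False)
    print(aggregate([1, 3, 4]))   # -> (1, True): the experts dissent noticeably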

For more details about the creation of this dataset, please refer to: M. Schnappinger, A. Fietzke, and A. Pretschner, "Defining a Software Maintainability Dataset: Collecting, Aggregating and Analysing Expert Evaluations of Software Maintainability", International Conference on Software Maintenance and Evolution (ICSME), 2020.

The dataset, i.e. the code plus labels and instructions on how to use the data, is available here.

Coality: Code Quality Labeling Platform

In research related to Software Quality, we are interested in creating a dataset of source code and corresponding quality labels. To this end, we have built an online platform that allows evaluating strategically chosen code snippets from various code bases. With just a small investment of your time, you can actively foster important research. Curious what that looks like? https://coality.sse.in.tum.de/

Related Publications

  • Schnappinger, Markus; Zachau, Simon; Fietzke, Arnaud; Pretschner, Alexander: A Preliminary Study on Using Text- and Image-Based Machine Learning to Predict Software Maintainability. In: Software Quality: The Next Big Thing in Software Engineering and Quality. Springer International Publishing, 2022
  • Schnappinger, Markus; Streit, Jonathan: Efficient Platform Migration of a Mainframe Legacy System Using Custom Transpilation. 2021 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2021
  • Schnappinger, Markus; Fietzke, Arnaud; Pretschner, Alexander: Human-level Ordinal Maintainability Prediction Based on Static Code Metrics. Evaluation and Assessment in Software Engineering, ACM, 2021
  • Schnappinger, Markus; Fietzke, Arnaud; Pretschner, Alexander: Defining a Software Maintainability Dataset: Collecting, Aggregating and Analysing Expert Evaluations of Software Maintainability. 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME), IEEE, 2020
  • Schnappinger, Markus; Fietzke, Arnaud; Pretschner, Alexander: A Software Maintainability Dataset. 2020
  • Schnappinger, Markus; Osman, Mohd Hafeez; Pretschner, Alexander; Fietzke, Arnaud: Learning a Classifier for Prediction of Maintainability Based on Static Analysis Tools. 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), IEEE, 2019
  • Schnappinger, Markus; Osman, Mohd Hafeez; Pretschner, Alexander; Pizka, Markus; Fietzke, Arnaud: Software Quality Assessment in Practice. Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement - ESEM '18, ACM Press, 2018