Software Reliability Engineering: from Software Reliability Models to Software Resilience
Bruno Rossi
brossi@mail.muni.cz
Lasaris (Lab of Software Architectures and Information Systems)
Faculty of Informatics, Masaryk University, Brno, Czech Republic
www.lasaris.cz

2/58 Structure
The presentation covers three parts:
1. Software Reliability Growth Models (SRGMs)
2. Quality Models as Proxies for Failures Detection
3. Software Systems Resilience & Self-* capabilities

3/58 Motivation – C4e Project - Critical Infrastructure
● Critical infrastructures provide mission-critical services, typically implemented as connected cyber-physical systems (CPS)
● P1. Critical infrastructure protection:
  – P1.1 Simulation and predictive analysis of critical infrastructures
  – P1.2 Formal verification of critical infrastructures
  – P1.3 Recommendations for critical infrastructure realization
● Need to get a cohesive view, including cybersecurity and aspects related to cyber-law
ERDF "CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence" (No. CZ.02.1.01/0.0/0.0/16_019/0000822).

4/58 Software Reliability
Probably one of the most important qualities of software systems, as its absence can make a system inoperative

5/58 ISO/IEC 25010 Standard – key terms
● ISO/IEC 25010 places four key terms under reliability:

6/58 Software Reliability Engineering (SRE)
SRE includes:
1. Software reliability measurement: estimation and prediction
2. Attributes and metrics of software design, development process, and architecture, and their impact on reliability
3. Usage of the acquired knowledge to guide the design of software systems and development processes
Lyu, Michael R. Handbook of software reliability engineering. Vol. 222. Los Alamitos: IEEE computer society press, 1996.

7/58 The SRE Process
[Figure: the SRE process, adapted from Lyu (1996), Handbook of software reliability engineering]
8/58 Failure Rate & Hazard Rate
\[ \text{failure rate } (\lambda) = \frac{F(t+\Delta t) - F(t)}{\Delta t \, R(t)} \]
\[ z(t) = \lim_{\Delta t \to 0} \frac{F(t+\Delta t) - F(t)}{\Delta t \, R(t)} = \frac{f(t)}{R(t)} \]
where F(t) is the cumulative distribution function of the time to failure, f(t) is its density, and R(t) = 1 − F(t) is the reliability function
● Reliability is defined as the probability that a software system will not fail during the next x time units in a specific environment

9/58 Main difference Hardware vs Software Reliability
● Software does not have a wear-out region as in the hardware domain (in which hardware becomes obsolete, which can lead to an increase in failures)
[Figure: Typical hazard rate z(t) of a system / component]

10/58 Work of Lehman and Belady (1/3)
● Lehman started with a nine-month study in 1968 to evaluate the IBM programming process, focusing on the OS/360 system
● After that experience, Lehman and Belady collaborated on successive studies: the laws of software evolution were defined over a period from 1974 to 1996
● The aim was to capture different growth trends of software systems and their long-term evolution
● The laws apply to E-type systems: “programs that mechanize a human or social activity” (Lehman, 1980)
Source: Lehman, M. M., & Ramil, J. F. “Software evolution--Background, theory, practice,” Information Processing Letters, vol. 88, Oct. 2003, pp. 33-44.
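As a small illustration of the hazard rate z(t) = f(t)/R(t) from the Failure Rate & Hazard Rate slide, the sketch below compares the finite-difference definition with a closed form. The Weibull distribution and all parameter values are assumptions chosen only for illustration (they are not taken from the slides); the Weibull is convenient because its shape parameter yields a decreasing (shape < 1), constant (shape = 1), or increasing (shape > 1) hazard rate, which roughly mirrors the early-failure, useful-life, and wear-out regions of a typical hazard curve.

```python
import math


def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t / scale) ** shape): probability of no failure up to time t."""
    return math.exp(-((t / scale) ** shape))


def hazard_finite_difference(reliability, t, dt=1e-6):
    """Approximate z(t) = [F(t + dt) - F(t)] / (dt * R(t)), using F = 1 - R."""
    r_t = reliability(t)
    delta_f = r_t - reliability(t + dt)  # equals F(t + dt) - F(t)
    return delta_f / (dt * r_t)


def weibull_hazard(t, shape, scale):
    """Closed form f(t) / R(t) for the Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)


# Illustrative (assumed) parameters: scale = 100 time units, three shapes.
for shape, trend in [(0.5, "decreasing"), (1.0, "constant"), (3.0, "increasing")]:
    rates = [weibull_hazard(t, shape, 100.0) for t in (10.0, 50.0, 200.0)]
    print(f"shape={shape} ({trend} hazard):", rates)
```

With shape = 1 the Weibull reduces to the exponential distribution, whose hazard rate is the constant 1/scale; this is the usual simplifying assumption behind a constant failure rate λ.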
11/58 What Lehman's Laws of Software Evolution tell us

12/58 Some implications for SRE
● A model in Region I will not work well for Region II
● If in software there is no Region III, then the same model as in Region II could be applied; however, according to Lehman’s laws the quality of systems decreases over time
[Figure: Typical hazard rate z(t) of a software system / component, with the label "New software release"]

13/58 Software Reliability Growth Models (SRGMs)

14/58 Software Reliability Growth Modelling

15/58 Software Reliability Growth Modelling
Mean value function m(t): fitting the cumulative failures over time

16/58 Software Reliability Growth Modelling
Mean value function m(t): fitting the cumulative failures over time
Types of models:
● Concave models – assume the total number of faults in software is finite, and that it is possible to achieve fault-free software in finite time
● S-shaped models – also assume that the total number of faults is finite. Early testing is not as effective in fault discovery as testing in the later stages. Therefore, there is a period in which the rate of fault discovery is increasing
● Infinite models – assume that it is not possible to develop fault-free software, because new faults can be introduced during fault removal

17/58 One of the earliest studies...
● One of the first papers* to apply SRGMs to Open Source software projects
● Comparing several models, like Weibull, Hossain-Dahiya (HD), Goel-Okumoto S-shaped (GOS), Gompertz
● Three projects analyzed: Mozilla Firefox, LibreOffice, OpenSuse
● Generally the Weibull model was found to be the best in terms of Goodness of Fit (GoF)
● However, no model was generally good in terms of predictive capability
* Rossi, B., Russo, B., & Succi, G. (2010). Modelling failures occurrences of open source software with reliability growth. In Open Source Software: New Horizons: 6th International IFIP WG 2.13 Conference on Open Source Systems, OSS 2010, Notre Dame, IN, USA, May 30–June 2, 2010 (pp. 268-280).
Springer Berlin Heidelberg.

18/58 ...and how SRGMs have been used so far

19/58 STRAIT Tool (1/3)
Typical process:
1. Getting issue reports from data sources
2. Creation of snapshots and persistent storage
3. Data processing & filtering
4. Building of pluggable SRGM models (trend test, parameter estimation, GoF metrics)
5. Output module
* Chren, S., Micko, R., Buhnova, B., & Rossi, B. (2019, May). STRAIT: a tool for automated software reliability growth analysis. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR) (pp. 105-110). IEEE.

20/58 STRAIT Tool (2/3)
[Figure: Typical process (same five steps as above) and Components]

21/58 STRAIT Tool (3/3)
[Figure: Typical process, Components, and Example Output]

22/58 Experimental Evaluation
1. We adapted and used STRAIT to mine data from GitHub bug tracking repositories: the top ten projects from different topics of the “Topic Lists”, combined with ten more projects from the “Trending List”
2. We ran STRAIT in a Cloud environment (16 threads / 128GB RAM) for increased performance
3.
We fitted 792 SRGMs (88 projects x 9 models) with 383K software defects for RQ1 and RQ2, and additionally 261 SRGMs for software releases (29 releases x 9 models) for RQ3
[Figure: Implemented models]
Mičko, R., Chren, S., & Rossi, B. (2022). Applicability of Software Reliability Growth Models to Open Source Software. In 2022 48th Euromicro Conference on Software Engineering and Advanced Applications (SEAA) (pp. 255-262). IEEE.

23/58 Aims of the experimental evaluation
(Mičko et al., 2022)

24/58 Used Metrics (GoF)
● R² (coefficient of determination): how well the model fits the outputs (range 0-1)
● Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC): indicators of the quality of the models, penalizing models with a higher number of parameters
\[ \mathrm{AIC} = 2K - 2\ln(L), \qquad \mathrm{BIC} = K\ln(n) - 2\ln(L) \]
where: K – the number of estimated parameters in the model, L – the likelihood of the model given the data, n – the size of the dataset.
● Residual Standard Error (RSE): how well the model fits the outputs (in the unit of the dependent variable)
(Mičko et al., 2022)

25/58 RQ1 – Ranking of Models
To answer this RQ, we considered 792 SRGMs fitted on the whole dataset with 383 788 software defects.
[Figure: ranking of models, grouped into Concave, S-shaped, and Infinite models]
(Mičko et al., 2022)
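To make the fitting and GoF steps above more concrete, here is a minimal sketch that fits a single concave SRGM and computes R², RSE, AIC, and BIC. It is not the STRAIT implementation: the Goel-Okumoto mean value function m(t) = a(1 − e^(−bt)), the grid-search least-squares fit, and the synthetic data are all assumptions made for illustration. AIC and BIC are computed in their least-squares form (assuming Gaussian residuals and dropping constant terms), with K counting only the two model parameters a and b.

```python
import math


def go_mean_value(t, a, b):
    """Goel-Okumoto mean value function m(t) = a * (1 - exp(-b * t))."""
    return a * (1.0 - math.exp(-b * t))


def fit_goel_okumoto(times, observed):
    """Least-squares fit: grid search over b, closed-form optimal a for each b."""
    best = None
    for step in range(1, 2001):  # b in (0, 2.0] with resolution 0.001
        b = step / 1000.0
        g = [1.0 - math.exp(-b * t) for t in times]
        a = sum(y * gi for y, gi in zip(observed, g)) / sum(gi * gi for gi in g)
        rss = sum((y - a * gi) ** 2 for y, gi in zip(observed, g))
        if best is None or rss < best[2]:
            best = (a, b, rss)
    return best  # (a, b, residual sum of squares)


def goodness_of_fit(observed, rss, k):
    """R2, RSE, and least-squares AIC/BIC for a model with k fitted parameters."""
    n = len(observed)
    mean_y = sum(observed) / n
    ss_tot = sum((y - mean_y) ** 2 for y in observed)
    return {
        "R2": 1.0 - rss / ss_tot,
        "RSE": math.sqrt(rss / (n - k)),
        # Gaussian-residual assumption: -2 ln(L) = n * ln(RSS / n) + constant
        "AIC": n * math.log(rss / n) + 2 * k,
        "BIC": n * math.log(rss / n) + k * math.log(n),
    }


# Synthetic cumulative failure data (NOT real project data): a = 100, b = 0.15,
# plus a small alternating perturbation so the fit is not exact.
times = list(range(1, 21))
observed = [go_mean_value(t, 100.0, 0.15) + (0.3 if t % 2 else -0.3) for t in times]

a_hat, b_hat, rss = fit_goel_okumoto(times, observed)
metrics = goodness_of_fit(observed, rss, k=2)
print("a =", round(a_hat, 2), "b =", round(b_hat, 3), metrics)
```

When several models are fitted to the same data, the one with the lowest AIC or BIC is preferred; the absolute values carry no meaning on their own. A production tool would more likely use a proper non-linear optimizer instead of the grid search used here.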
26/58 RQ2 – Project Domain
To answer this RQ, we considered 792 SRGMs fitted on the whole dataset with 383 788 software defects and segmented by categories
(Mičko et al., 2022)

27/58 RQ3 – Impact of Project Releases
To answer this RQ, we used 63 SRGMs (7 projects with releases fitted by 9 models) and 261 SRGMs (29 releases, 9 models, 6 800 defects)
[Figure: Rankings of models based on R² considering Releases (R) and whole projects (NR)]
(Mičko et al., 2022)

28/58 Major Challenges for SRGMs

29/58 Alternative methods
They look more at the propagation and chaining of faults and failures
● Bayesian networks
● Fault trees and Markov chains
● Stochastic Petri Nets and Markov chains
Source diagram: Avizienis, A., Laprie, J. C., & Randell, B. (2001). Fundamental concepts of computer system dependability. In Workshop on Robot Dependability: Technological Challenge of Dependable Robots in Human Environments (pp. 1-16).

30/58 Major Takeaways

31/58 Quality Models as Proxies for Failures Detection

32/58 How IEEE maps Failures to Quality
The mapping of internal attributes to external ones is a key aspect in software reliability
[Figure labels: nr. of failures over a period of time; nr. of faults / errors over a period of time]
Source image (adapted): ISO/IEC 9126 Standard; Lyu (1996), Handbook of software reliability engineering.

33/58 …indeed many metrics were used in the prediction models
Rebro, D. A., Chren, S., & Rossi, B. (2023). Source Code Metrics for Software Defects Prediction.
In Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing (pp. 1469-1472).

34/58 Software Defects Prediction Process
[Figure: Example: ranking of metrics in the prediction model (the lower the better)]
(Rebro et al., 2023)

35/58 Generic process of defects prediction
(Rebro et al., 2023)

36/58 Quality Models (example: SQALE)
● Many quality models were developed over time; the assumption is: control the quality and you will control the failures
In short, the model defines a Remediation Cost (RC) to fix all the violations:

37/58 Major Challenges
● Identifying which modules are more defect-prone
● Identifying the importance of features for the prediction
● Considering changes in the history of a project (drifts)
● Dealing with imbalanced data
● Associating defects with implementation activities
● Integration of the models into running systems
● Mining representative datasets (e.g., the NASA dataset has been used for a long time in SE)

38/58 My Research Focus in this Area
● Evaluating the impact of several metrics on the defect prediction of models
  – Rebro, D. A., Chren, S., & Rossi, B. (2023). Source Code Metrics for Software Defects Prediction. In Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing (pp. 1469-1472).
● Evaluating the comparability of different maintainability indexes (SQALE, MI, SIG-TD)
  – Strečanský, P., Chren, S., & Rossi, B. (2020). Comparing maintainability index, SIG method, and SQALE for technical debt identification. In Proceedings of the 35th Annual ACM Symposium on Applied Computing (pp. 121-124).
● Studying bug triaging in both Open Source Software and an industrial context (one company involved)
  – Dedík, V., & Rossi, B. (2016).
Automated bug triaging in an industrial context. In 2016 42th Euromicro conference on software engineering and advanced applications (SEAA) (pp. 363-367). IEEE.
● Evaluating the applicability of Mutation Testing in an industrial context
  – Možucha, J., & Rossi, B. (2016). Is mutation testing ready to be adopted industry-wide? In P. Abrahamsson, A. Jedlitschka, A. Nguyen Duc, M. Felderer, S. Amasaki, & T. Mikkonen (Eds.), Product-Focused Software Process Improvement (pp. 217–232). Cham: Springer International Publishing.

39/58 Major Takeaways

40/58 Software Systems Resilience & Self-* capabilities

41/58 Motivation (1/2)
● In previous work we created a testing management platform for Smart Grids based on the Mosaik framework for co-simulations
● We extended Mosaik with a disconnect method to remove edges from the dataflow graph and the entity graph
  → A simple way to simulate node failures
● This can be useful to understand the patterns of failures
→ Mihal, P., Schvarcbacher, M., Rossi, B., & Pitner, T. (2022). Smart grids co-simulations: Survey & research directions. Sustainable Computing: Informatics and Systems.
→ Schvarcbacher, M., Hrabovská, K., Rossi, B., & Pitner, T. (2018). Smart grid testing management platform (sgtmp). Applied Sciences, 8(11), 2278.
→ Gryga, L., & Rossi, B. (2021). Co-simulation of Smart Grids: Dynamically Changing Topologies in Failure Scenarios. In Complexis.
[Figure: Smart Grids Testing Processes]

42/58 Motivation (2/2)
● In the CERIT-SC Big Data project we looked into anomalies in power consumption data
● We built a Big Data platform based on Apache Flink that could integrate anomaly detection algorithms
Lipčák, P., Macak, M., & Rossi, B. (2019). Big data platform for smart grids power consumption anomaly detection. In 2019 federated conference on computer science and information systems (FedCSIS) (pp. 771-780). IEEE.
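As a rough idea of the kind of anomaly detection algorithm that such a platform could host, the sketch below flags power consumption readings that deviate strongly from a trailing window. This is a hypothetical example, not the algorithm used in the cited platform: the z-score rule, the `window` and `threshold` parameters, and the synthetic readings are all assumptions made for illustration.

```python
import statistics


def detect_anomalies(readings, window=24, threshold=3.0):
    """Return indices of readings more than `threshold` standard deviations
    away from the mean of the preceding `window` readings."""
    anomalies = []
    for i in range(window, len(readings)):
        history = readings[i - window:i]
        mean = statistics.fmean(history)
        stdev = statistics.pstdev(history)
        if stdev == 0:
            continue  # flat history: a z-score is undefined, skip this point
        z_score = (readings[i] - mean) / stdev
        if abs(z_score) > threshold:
            anomalies.append(i)
    return anomalies


# Synthetic hourly consumption in kWh (NOT real data): a repeating pattern
# with a single injected spike at index 50.
readings = [1.0 + 0.1 * (hour % 6) for hour in range(72)]
readings[50] = 5.0

print(detect_anomalies(readings))
```

In a streaming setting the same rule would be applied incrementally over a sliding window rather than over a complete list; the batch form is used here only to keep the sketch short.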
43/58 Software Systems Resilience

44/58 Software Systems Resilience – Self-* Systems
● Software Resilience is often associated with the following concepts (4S)

45/58 Self Healing Research
[Figure adapted from Psaier, H., & Dustdar, S. (2011). A survey on self-healing systems: approaches and systems. Computing, 91, 43-73.]

46/58 Self Healing Research
[Figure adapted from Psaier & Dustdar (2011), with the following annotations:]
● Self-adaptive systems can monitor themselves and correct any deviations from expected behaviour
● Autonomic systems can self-manage and operate with minimum human intervention
● Fault tolerance is often difficult to achieve (e.g., in distributed systems): self-stabilizing systems can converge towards one “correct” state within a certain time period
● Discipline defining the survivability of a system in case of failures (resistance, recognition, recovery from failures, adaptation of services)
● Pioneering work on the theory of redundancy to improve the reliability of software systems

47/58 Typical Aspects of Self-Healing Systems
- Monitor the system
- Identify anomalies from expected behaviour
- Trigger the alerts
- Identify the source of the fault
- Try to identify the component that is the cause of the fault
- Take actions to restore the normal state of the system (e.g., restarting a service)

48/58 Self-healing System Challenges
Adapted from Dreo Rodosek, G., Geihs, K., Schmeck, H., & Burkhard, S. (2009). Self-healing systems: Foundations and challenges. In Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
Major Challenges:
– How to define the expected behaviour? Both in the sense of specifications and of anomalies
– Defining situational / context awareness
– Fault analysis: when and which recovery actions to take? What is “enough” of a recovery action to restore the state?
– Can the system “learn” based on the actions performed?
– Are predictive capabilities needed?
Taking preventive actions based on some signals
– How to deal with the uncertainty of such systems
– Openness of the self-healing system: how open/closed the system is in terms of adaptive actions

49/58 Proposed Software Architecture to reach the “4S”

50/58 Proposed Software Architecture to reach the “4S”
● Actor model implementation, integrated with the Chaos Engineering toolkit
● Microservices simulator that can run simulations based on the architectural representation
● Chaos Engineering toolkit that can inject faults into the system as instructed by the supervision strategies

51/58 Using the Actor Models to reach the “4S” (1/2)
● Our proposal is to use the Actor Model. The actor model is a mathematical model of concurrent computation, introduced by Hewitt et al. in 1973
● A system using the actor model consists of location-transparent actors, seen in the model as the universal primitives of concurrent computation. Each actor receives input and responds by
  – sending a finite number of messages to other actors
  – creating a finite number of child actors
  – modifying its internal state
Mraz, M., Bangui, H., Rossi, B., & Buhnova, B. (2023). Adopting the Actor Model for Antifragile Serverless Architectures. Proceedings of the 18th International Conference on Software Technologies (ICSOFT 2023)

52/58 Using the Actor Models to reach the “4S” (2/2)
● Implementation of the Actor Model with the Akka framework
● Creation of a framework for the integration of the supervision strategies
● Integration in a Spring Boot microservice system
● Integration of resilience patterns like the circuit breaker
Mraz, M., Bangui, H., Rossi, B., & Buhnova, B. (2023). Adopting the Actor Model for Antifragile Serverless Architectures.
Proceedings of the 18th International Conference on Software Technologies (ICSOFT 2023)

53/58 Using Chaos Engineering to reach the “4S”
“Chaos Engineering can help to understand how emergent behavior from component interactions could result in a system drifting into an unsafe, chaotic state”
From Miles, R. (2019). Learning Chaos engineering: discovering and overcoming system weaknesses through experimentation. O'Reilly Media.
[Figure annotations:]
● The consequences of the experiments should be contained
● Define a hypothesis based on the steady state
● What are the “normal” levels of operation of the system?
● Inject faults and failures based on the hypothesis
● Was the hypothesis disproved?
● You can also increase the blast radius once you are confident in the results
● Implement the changes based on the chaos experiments

54/58 Integrating a Simulation Environment to reach the “4S”
We can optimize the parameters for a Circuit Breaker based on the typical workload (either real or simulated)

55/58 Integrating a Simulation Environment to reach the “4S”
We can use the results from the simulator to understand the behaviour under different hypothetical workloads

56/58 Main Challenges
● Modelling expected behaviour and how to verify it
● Modelling unknown unknowns and uncertainty in the models
● Which anomaly detection algorithms to integrate into the system
● Modelling stress functions of components
● Integration of ML models for all the phases of Fault Detection, Isolation, Recovery
● Accuracy of the simulator and capability of transferring the whole architectural representation
● Automation of the design of chaos engineering experiments and integration of ML models

57/58 Main Takeaways

58/58 Thank you a lot! Q&A
Thank you to the many colleagues and students who collaborated on the research: Radoslav Mičko, Dominik Arne Rebro, Stanislav Chren, Michael Schvarcbacher, K. Hrabovská, Barbora Buhnova, Tomas Pitner, Martin Macak, Marcel Mraz, Hind Bangui and many more