Software Reliability Engineering: from Software
Reliability Models to Software Resilience
Bruno Rossi
brossi@mail.muni.cz
Lasaris (Lab of Software Architectures and Information Systems)
Faculty of Informatics
Masaryk University, Brno, Czech Republic
www.lasaris.cz
2/58
Structure
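The presentation covers three parts: Software Reliability Growth Models (SRGMs), quality models as proxies for failure detection, and software systems resilience & self-* capabilities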
3/58
Motivation – C4e Project - Critical Infrastructure
●
Critical infrastructures provide mission-critical services, typically implemented as
connected cyber-physical systems (CPS)
●
P1. Critical infrastructure protection:
– P1.1 Simulation and predictive analysis of critical infrastructures
– P1.2 Formal verification of critical infrastructures
– P1.3 Recommendations for critical infrastructure realization
●
Need to get a cohesive view, including cybersecurity and aspects related to cyber-law
ERDF "CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence" (No. CZ.02.1.01/0.0/0.0/16_019/0000822).
4/58
Software Reliability
Probably one of the most important qualities of software systems, as a lack of reliability can make a system
inoperative
5/58
ISO/IEC 25010 Standard – key terms
●
ISO/IEC 25010 places four key terms under reliability:
6/58
Software Reliability Engineering (SRE)
SRE includes:
1. Software reliability measurement – estimation and prediction
2. attributes and metrics of software design, development process, architecture and their
impact on reliability
3. usage of the acquired knowledge to guide the design of software systems and
development processes
Lyu, Michael R. Handbook of software reliability engineering. Vol. 222. Los Alamitos: IEEE computer society press, 1996.
7/58
The SRE Process
Adapted from Lyu, Michael R. Handbook of software reliability engineering. Vol. 222. Los Alamitos: IEEE computer society press, 1996.
8/58
Failure Rate & Hazard Rate
$$\text{failure rate}\ (\lambda) = \frac{F(t+\Delta t) - F(t)}{\Delta t \, R(t)}$$
$$z(t) = \lim_{\Delta t \to 0} \frac{F(t+\Delta t) - F(t)}{\Delta t \, R(t)} = \frac{f(t)}{R(t)}$$
●
Reliability is defined as the probability that a software system will
not fail during the next x time units in a specific environment
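As a rough numerical illustration of the two formulas above, the sketch below evaluates both the finite-difference failure rate and the hazard rate z(t) = f(t) / R(t). The Weibull distribution and the parameter names `shape` and `scale` are assumptions made only for this example; the slides do not prescribe a distribution here.

```python
import math


def weibull_reliability(t, shape, scale):
    """R(t) = exp(-(t / scale) ** shape): probability of surviving past t."""
    return math.exp(-((t / scale) ** shape))


def weibull_density(t, shape, scale):
    """f(t) = -dR/dt for the Weibull distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1) * weibull_reliability(t, shape, scale)


def hazard_rate(t, shape, scale):
    """z(t) = f(t) / R(t)."""
    return weibull_density(t, shape, scale) / weibull_reliability(t, shape, scale)


def failure_rate(t, dt, shape, scale):
    """(F(t + dt) - F(t)) / (dt * R(t)), using F(t) = 1 - R(t)."""
    r_now = weibull_reliability(t, shape, scale)
    r_next = weibull_reliability(t + dt, shape, scale)
    return (r_now - r_next) / (dt * r_now)
```

With `shape = 1` the Weibull reduces to the exponential distribution and the hazard rate is constant, which corresponds to the flat middle region of the typical hazard curve.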
9/58
Main difference Hardware vs Software Reliability
●
Software does not have a wear-out region as in the hardware domain (in which
hardware becomes obsolete and can lead to an increase in failures)
Typical hazard rate z(t) of a system / component
10/58
Work of Lehman and Belady (1/3)
●
Lehman started with a nine-month study in 1968 to evaluate the IBM programming process,
focusing on the OS/360 system
●
After that experience, Lehman and Belady joined forces in successive studies – the laws of software
evolution were defined between 1974 and 1996
●
The aim was to capture different growth trends of software systems and their long term
evolution
●
The laws apply to E-type systems: “programs that mechanize a human or social activity”
(Lehman, 1980)
Source: Lehman, M. M., Ramil, J. F. “Software evolution – Background, theory, practice,” Information Processing Letters, vol. 88, Oct. 2003, pp. 33-44.
11/58
What Lehman's Laws of Software Evolution tell us
12/58
Some implications for SRE
●
A model in Region I will not work well for Region II
●
If in software there is no Region III, then the same model as in Region II could be
applied – however, according to Lehman’s laws, the quality of the systems decreases
over time
Typical hazard rate z(t) of a software
system / component
New software release
13/58
Software Reliability
Growth Models (SRGMs)
14/58
Software Reliability Growth Modelling
15/58
Software Reliability Growth Modelling
Mean value function – m(t)
Fitting the cumulative failures over time
16/58
Software Reliability Growth Modelling
Mean value function – m(t)
Concave models – assume the total number of faults in software is
finite, and that it is possible to achieve fault-free software in finite time
S-shaped models – they also assume that the total number of faults is
finite. Early testing is not as effective in fault discovery as the testing in
the later stages. Therefore, there is an initial period in which the rate of
fault discovery is increasing
Infinite models – assume that it is not possible to develop fault-free
software because during fault removal we can introduce new ones
Types of models
Fitting the cumulative failures over time
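To make the three model families concrete, here is a minimal sketch of one textbook mean value function per family. The specific choices (Goel-Okumoto for concave, the delayed S-shaped model, and Musa-Okumoto logarithmic for infinite) and the parameter names are assumptions for illustration and are not necessarily the models used in the studies presented here.

```python
import math


def goel_okumoto(t, a, b):
    """Concave: m(t) = a * (1 - exp(-b t)); a is the finite total number of faults."""
    return a * (1.0 - math.exp(-b * t))


def delayed_s_shaped(t, a, b):
    """S-shaped: m(t) = a * (1 - (1 + b t) * exp(-b t)); slow start, then faster discovery."""
    return a * (1.0 - (1.0 + b * t) * math.exp(-b * t))


def musa_okumoto(t, lam0, theta):
    """Infinite: m(t) = ln(1 + lam0 * theta * t) / theta; grows without bound."""
    return math.log(1.0 + lam0 * theta * t) / theta
```

The first two functions level off at `a`, while the third keeps growing, which matches the finite versus infinite distinction above.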
17/58
One of the earliest studies...
●
One of the first papers* to apply SRGMs to Open Source software projects
●
Comparing several models, like Weibull, Hossain Dahiya (HD), Goel Okumoto S-shaped (GOS),
Gompertz
●
Three projects analyzed: Mozilla Firefox, LibreOffice, OpenSuse
●
Generally the Weibull model was found to be the best in terms of Goodness of Fit (GoF)
●
However, no model was generally good for predictive capability
* Rossi, B., Russo, B., & Succi, G. (2010). Modelling failures occurrences of open source software with reliability growth. In Open Source Software: New Horizons: 6th International IFIP WG 2.13
Conference on Open Source Systems, OSS 2010, Notre Dame, IN, USA, May 30–June 2, 2010 (pp. 268-280). Springer Berlin Heidelberg.
18/58
...and how SRGMs have been used so far
19/58
STRAIT Tool (1/3)
1. Getting issue reports from data sources
2. Creation of snapshots and persistence storage
3. Data processing & filtering
4. Building of pluggable SRGM models (trend test, parameters
estimation, GoF metrics)
5. Outputting module
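The trend test in step 4 checks whether the failure data shows reliability growth before a model is fitted. The slides do not say which test STRAIT applies; the sketch below uses the Laplace trend test as an assumed example, with hypothetical argument names.

```python
import math


def laplace_factor(failure_times, observation_end):
    """Laplace trend factor for failure times observed in (0, observation_end].

    Negative values suggest failures cluster early (reliability growth),
    positive values suggest they cluster late (reliability decay).
    """
    n = len(failure_times)
    if n == 0 or observation_end <= 0:
        raise ValueError("need at least one failure and a positive observation window")
    mean_time = sum(failure_times) / n
    return (mean_time - observation_end / 2.0) / (observation_end * math.sqrt(1.0 / (12.0 * n)))
```

A strongly negative factor is usually taken as a sign that fitting a growth model is meaningful for the data set.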
* Chren, S., Micko, R., Buhnova, B., & Rossi, B. (2019, May). STRAIT: a tool for automated software reliability growth analysis. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)
(pp. 105-110). IEEE.
Typical process
20/58
STRAIT Tool (2/3)
Typical process
Components
21/58
STRAIT Tool (3/3)
Example Output
Typical process
Components
22/58
Experimental Evaluation
1. We adapted and used STRAIT to mine data from GitHub bug tracking repositories: the top ten projects
from different topics of the “Topic Lists”, combined with ten more projects from the “Trending List”
2. We ran STRAIT in a cloud environment (16 threads / 128 GB RAM) for increased performance
3. We fitted 792 SRGMs (88 projects x 9 models) with 383K software defects for RQ1 and RQ2, and
additionally 261 SRGMs for software releases (29 releases x 9 models) for RQ3
Mičko, R., Chren, S., & Rossi, B. (2022). Applicability of Software Reliability Growth Models to Open Source Software. In 2022 48th Euromicro Conference on Software Engineering and
Advanced Applications (SEAA) (pp. 255-262). IEEE.
Implemented models
23/58
Aims of the experimental evaluation
24/58
Used Metrics (GoF)
●
R² (coefficient of determination): how well the model fits the outputs (range 0-1)
●
Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC): indicators of the quality of the models, penalizing models with a higher number of parameters
●
Residual Standard Error (RSE): how well the model fits the outputs (in the unit of the dependent variable)
where:
K – the number of estimated parameters in the model,
L – the likelihood of the model given the data,
n – the size of the dataset.
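The four metrics can be computed from the observed and fitted cumulative failure counts. The sketch below uses the standard definitions AIC = 2K − 2 ln(L) and BIC = K ln(n) − 2 ln(L); taking the likelihood L from a Gaussian error model on the residuals is an assumption of this example, and statistical packages may count parameters slightly differently.

```python
import math


def gof_metrics(observed, predicted, k):
    """Return R^2, AIC, BIC and RSE for a fitted model with k estimated parameters."""
    n = len(observed)
    if n != len(predicted) or n <= k:
        raise ValueError("need equally long series with more points than parameters")
    mean_obs = sum(observed) / n
    rss = sum((o - p) ** 2 for o, p in zip(observed, predicted))  # residual sum of squares
    tss = sum((o - mean_obs) ** 2 for o in observed)              # total sum of squares
    if rss == 0 or tss == 0:
        raise ValueError("degenerate fit: metrics are undefined")
    # Maximised Gaussian log-likelihood (assumption: i.i.d. normal residuals).
    log_l = -n / 2.0 * (math.log(2.0 * math.pi) + math.log(rss / n) + 1.0)
    return {
        "r2": 1.0 - rss / tss,
        "aic": 2.0 * k - 2.0 * log_l,
        "bic": k * math.log(n) - 2.0 * log_l,
        "rse": math.sqrt(rss / (n - k)),
    }
```

Lower AIC, BIC and RSE values and a higher R² indicate a better fit; AIC and BIC only make sense when comparing models fitted to the same data.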
25/58
RQ1 – Ranking of Models
To answer this RQ, we considered 792 SRGMs fitted on the whole dataset with 383 788 software defects.
[Figure: ranking of models, grouped into concave, S-shaped, and infinite models]
26/58
RQ2 – Project Domain
To answer this RQ, we considered 792 SRGMs fitted on the whole dataset with 383 788 software defects
and segmented by categories
27/58
RQ3 – Impact of Project Releases
To answer this RQ, we used 63 SRGMs (7 projects with releases fitted by 9 models) and 261 SRGMs (29
releases, 9 models, 6 800 defects)
Rankings of models based on R², considering releases (R) and whole projects (NR)
28/58
Major Challenges for SRGMs
29/58
Alternative methods
They look more at the propagation and chaining of faults and failures
●
Bayesian networks
●
Fault trees and Markov chains
●
Stochastic Petri Nets and Markov chains
Source diagram: Avizienis, A., Laprie, J. C., & Randell, B. (2001). Fundamental concepts of computer system dependability. In Workshop on Robot
Dependability: Technological Challenge of Dependable Robots in Human Environments (pp. 1-16).
30/58
Major Takeaways
31/58
Quality Models as Proxies
for Failures Detection
32/58
How IEEE maps Failures to Quality
The mapping of internal attributes to external ones is a key aspect in
software reliability
Figure labels: nr. of failures over a period of time; nr. of faults / errors over a period of time
Source image (adapted): ISO/IEC 9126 Standard
Lyu, Michael R. Handbook of software reliability engineering. Vol. 222. Los Alamitos: IEEE computer society press, 1996.
33/58
…indeed many metrics were used in the prediction models
Rebro, D. A., Chren, S., & Rossi, B. (2023). Source Code Metrics for Software Defects Prediction. In Proceedings of the 38th ACM/SIGAPP Symposium on Applied
Computing (pp. 1469-1472).
34/58
Software Defects Prediction Process
Example: ranking of metrics in the
prediction model (the lower the better)
35/58
Generic process of defects prediction
36/58
Quality Models (example: SQALE)
●
Many quality models were developed over time – the assumption is: control the quality and
you will control the failures
In short, the model defines a Remediation
Cost (RC) to fix all the violations:
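One way to read the Remediation Cost is as the sum, over all detected violations, of the effort assigned to fixing the violated rule. The sketch below follows that reading; the rule names and per-rule costs (in minutes) are hypothetical, and the exact formula on the original slide is not reproduced here.

```python
from collections import Counter


def remediation_cost(violations, cost_per_rule):
    """Sum the remediation effort of every violation.

    violations: iterable of rule identifiers, one entry per detected violation.
    cost_per_rule: mapping from rule identifier to the effort to fix one violation.
    """
    counts = Counter(violations)
    return sum(cost_per_rule[rule] * count for rule, count in counts.items())


# Hypothetical rule set with costs in minutes.
example_costs = {"long_method": 30, "missing_docs": 5}
example_violations = ["long_method", "long_method", "missing_docs"]
total_minutes = remediation_cost(example_violations, example_costs)
```

Dividing such a cost by an estimate of the development effort gives a ratio that can be compared across projects of different sizes.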
37/58
Major Challenges
●
Identifying which modules are more defect-prone
●
Identifying the importance of features for the prediction
●
Considering changes in the history of a project (drifts)
●
Dealing with imbalanced data
●
Associating defects to implementation activities
●
Integration of the models into running systems
●
Mining representative datasets (e.g., the NASA dataset has been used for a long time in SE)
38/58
My Research Focus in this Area
●
Evaluating the impact of several metrics on defect prediction models
– Rebro, D. A., Chren, S., & Rossi, B. (2023). Source Code Metrics for Software Defects Prediction. In
Proceedings of the 38th ACM/SIGAPP Symposium on Applied Computing (pp. 1469-1472).
●
Evaluating the comparability of different maintainability indexes (SQALE, MI, SIG-TD)
– Strečanský, P., Chren, S., & Rossi, B. (2020). Comparing maintainability index, SIG method, and SQALE for
technical debt identification. In Proceedings of the 35th Annual ACM Symposium on Applied Computing (pp. 121-
124).
●
Studying bug triaging in both Open Source Software and an industrial context (one company involved)
– Dedík, V., & Rossi, B. (2016). Automated bug triaging in an industrial context. In 2016 42th Euromicro
conference on software engineering and advanced applications (SEAA) (pp. 363-367). IEEE.
●
Evaluating the applicability of Mutation Testing in an industrial context
– J. Možucha and B. Rossi. Is mutation testing ready to be adopted industry-wide? In P. Abrahamsson, A.
Jedlitschka, A. Nguyen Duc, M. Felderer, S. Amasaki, and T. Mikkonen, editors, Product-Focused Software
Process Improvement, pages 217–232, Cham, 2016. Springer International Publishing
39/58
Major Takeaways
40/58
Software Systems Resilience &
Self-* capabilities
41/58
Motivation (1/2)
●
In previous work we created a testing management platform for Smart Grids based on the
Mosaik framework for co-simulations
●
We extended Mosaik with the disconnect method to remove edges from the dataflow graph
and the entity graph → A simple way to simulate node failures
●
This can be useful to understand the patterns of failures
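The idea of simulating a node failure by removing edges can be sketched on a plain adjacency map. This is not the Mosaik API: the `disconnect` and `reachable` helpers and the node names below are hypothetical and only illustrate the effect on the dataflow graph.

```python
from collections import deque


def disconnect(dataflow, node):
    """Return a copy of the graph in which every edge touching `node` is removed."""
    return {
        src: set() if src == node else {dst for dst in dsts if dst != node}
        for src, dsts in dataflow.items()
    }


def reachable(dataflow, start):
    """Nodes that can still receive data originating from `start` (breadth-first search)."""
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in dataflow.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical smart grid dataflow: producer -> grid -> consumers.
grid_model = {
    "pv_plant": {"grid"},
    "grid": {"household", "meter"},
    "household": set(),
    "meter": set(),
}
after_failure = disconnect(grid_model, "grid")
```

Comparing `reachable` before and after the disconnect shows which entities are cut off by the simulated failure.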
→ Mihal, P., Schvarcbacher, M., Rossi, B., & Pitner, T. (2022). Smart grids co-simulations: Survey & research directions. Sustainable Computing: Informatics and Systems.
→ Schvarcbacher, M., Hrabovská, K., Rossi, B., & Pitner, T. (2018). Smart grid testing management platform (sgtmp). Applied Sciences, 8(11), 2278.
→ Gryga, L., & Rossi, B. (2021). Co-simulation of Smart Grids: Dynamically Changing Topologies in Failure Scenarios. In Complexis.
Smart Grids Testing Processes
42/58
Motivation (2/2)
●
In the CERIT-SC Big Data project we looked into anomalies for
power consumption data
●
We built a Big Data platform based on Apache Flink that could integrate
anomaly detection algorithms
Lipčák, P., Macak, M., & Rossi, B. (2019). Big data platform for smart grids power consumption anomaly detection. In 2019 federated
conference on computer science and information systems (FedCSIS) (pp. 771-780). IEEE.
43/58
Software Systems Resilience
44/58
Software Systems Resilience – Self-* Systems
●
Software Resilience is often associated with the following concepts (4S)
45/58
Self Healing Research
Adapted from Psaier, H., & Dustdar, S. (2011). A survey on self-healing systems: approaches and systems. Computing, 91, 43-73.
46/58
Self Healing Research
Self-adaptive systems can monitor themselves and correct any deviations from expected behaviour
Autonomic systems can self-manage and operate with minimum human intervention
Fault tolerance is often difficult to achieve (e.g., distributed systems): self-stabilizing systems can converge towards one “correct” state in a certain time period
Discipline defining survivability of a system in case of failures (resistance, recognition, recovery from failures, adaptation of services)
Pioneering work about the theory of redundancy to improve the reliability of software systems
47/58
Typical Aspects of Self-Healing Systems
- Monitor the system
- Identify anomalies from expected behaviour
- Trigger the alerts
- Identify the source of the fault
- Try to identify the component that is the cause of the fault
- Take actions to restore the normal state of the system (e.g., restarting a service)
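The aspects listed above can be read as a detect, diagnose, recover loop. The sketch below is a deliberately naive version: the `Service` class, the latency threshold, and the "restart restores nominal latency" behaviour are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    latency_ms: float
    restarts: int = 0

    def restart(self):
        # Assumption: a restart brings the service back to a nominal latency.
        self.latency_ms = 50.0
        self.restarts += 1


def detect(services, max_latency_ms):
    """Monitoring step: services deviating from the expected behaviour."""
    return [s for s in services if s.latency_ms > max_latency_ms]


def diagnose(anomalous):
    """Naive fault isolation: blame the worst offender."""
    return max(anomalous, key=lambda s: s.latency_ms)


def heal_once(services, max_latency_ms):
    """One pass of the loop; returns the name of the restarted service, if any."""
    anomalous = detect(services, max_latency_ms)
    if not anomalous:
        return None
    culprit = diagnose(anomalous)
    culprit.restart()
    return culprit.name
```

In a real system each step is far harder than shown here, in particular defining the expected behaviour and choosing a recovery action that is sufficient.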
48/58
Self-healing System Challenges
Adapted from Dreo Rodosek, G., Geihs, K., Schmeck, H., & Burkhard, S. (2009). Self-healing systems: Foundations and challenges. In Dagstuhl
Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik.
Major Challenges:
– How to define the expected behaviour? Both in the sense of specifications but also anomalies
– Defining situational / context awareness
– Fault analysis: when and which recovery actions to take? What is “enough” of a recovery action to restore the state?
– Can the system “learn” based on the actions performed?
– Are predictive capabilities needed? Taking preventive actions based on some signals
– How to deal with the uncertainty of such systems?
– Openness of the self-healing system: how open/closed is the system in terms of adaptive actions?
49/58
Proposed Software Architecture to reach the “4S”
50/58
Proposed Software Architecture to reach the “4S”
Actor model implementation – integration with the Chaos Engineering toolkit
The microservices simulator can run simulations based on the architectural representation
Chaos Engineering toolkit that can inject faults into the system as instructed by the supervision strategies
51/58
Using the Actor Models to reach the “4S” (1/2)
●
Our proposal is to use the actor model, a mathematical model of concurrent
computation introduced by Hewitt et al. in 1973
●
The system using an actor model consists of location-transparent actors, seen in the model as
the universal primitives of concurrent computations. Each actor receives input and responds by
– sending a finite number of messages to the other actors
– creating a finite number of child actors
– modifying its internal state
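The three possible reactions of an actor can be shown with a toy, single-threaded actor system. This is only a sketch of the concept: it is not Akka, it has no real concurrency or supervision, and the class names (`ActorSystem`, `Splitter`, `Counter`) are hypothetical.

```python
from collections import deque


class ActorSystem:
    """Toy dispatcher: actors are addressed by name and own a mailbox."""

    def __init__(self):
        self.actors = {}
        self.mailboxes = {}

    def spawn(self, address, actor):
        self.actors[address] = actor
        self.mailboxes[address] = deque()

    def send(self, address, message):
        self.mailboxes[address].append(message)

    def run(self):
        """Deliver one message per actor per round until all mailboxes are empty."""
        progressed = True
        while progressed:
            progressed = False
            for address in list(self.actors):
                mailbox = self.mailboxes[address]
                if mailbox:
                    self.actors[address].receive(mailbox.popleft(), self)
                    progressed = True


class Counter:
    def __init__(self):
        self.total = 0

    def receive(self, message, system):
        self.total += message  # modifies its internal state


class Splitter:
    def __init__(self):
        self.children = 0

    def receive(self, message, system):
        child = f"counter-{self.children}"
        self.children += 1
        system.spawn(child, Counter())  # creates a child actor
        for value in message:
            system.send(child, value)   # sends a finite number of messages
```

Because actors interact only through messages addressed by name, the sender does not need to know where the receiver runs, which is what location transparency refers to.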
Mraz, M., Bangui, H., Rossi, B., & Buhnova, B. (2023). Adopting the Actor Model for Antifragile Serverless Architectures.
Proceedings of the 18th International Conference on Software Technologies (ICSOFT 2023)
52/58
Using the Actor Models to reach the “4S” (2/2)
●
Implementation of the Actor Model with the Akka
framework
●
Creation of a framework for the integration of the
supervision strategies
●
Integration in a Spring Boot microservice system
●
Integration of resilience patterns like the circuit
breaker
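For readers unfamiliar with the circuit breaker pattern mentioned in the last bullet, here is a minimal sketch of its three states (closed, open, half-open). It is written from scratch for illustration and is not the implementation used in the Spring Boot system; the parameter names `failure_threshold` and `reset_timeout` are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.state == "half-open" or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

The threshold and timeout are the kind of parameters that can be tuned against a real or simulated workload.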
53/58
Using Chaos Engineering to reach the “4S”
“Chaos Engineering can help to understand how emergent behavior from component interactions could result
in a system drifting into an unsafe, chaotic state” From Miles, R. (2019). Learning Chaos engineering: discovering and overcoming
system weaknesses through experimentation. O'Reilly Media.
The consequences of the experiments should be contained
Define a hypothesis based on the steady state
What are the “normal” levels of operation of the system?
Inject faults and failures based on the hypothesis
Was the hypothesis disproved?
You can also increase the blast radius once you are confident in the results
Implement the changes based on the chaos experiments
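The steps above can be sketched as a tiny experiment runner on a toy system. Everything here is an assumption for illustration: the system is a set of replicas, the steady-state metric is a crude availability value, and the blast radius is simply the number of replicas switched off. It does not use any real chaos engineering toolkit.

```python
def availability(replicas):
    """Toy steady-state metric: 1.0 if at least one replica is up, else 0.0."""
    return 1.0 if any(replicas.values()) else 0.0


def run_experiment(replicas, blast_radius, threshold=0.99):
    """Check the hypothesis 'availability stays above threshold when replicas fail'."""
    if availability(replicas) < threshold:
        raise RuntimeError("system is not in its steady state; aborting experiment")
    victims = sorted(replicas)[:blast_radius]
    experiment = dict(replicas)  # work on a copy so the consequences stay contained
    for name in victims:
        experiment[name] = False  # inject the fault
    return {
        "victims": victims,
        "hypothesis_holds": availability(experiment) >= threshold,
    }
```

Starting with `blast_radius=1` and increasing it only after the hypothesis holds mirrors the advice to grow the blast radius gradually.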
54/58
Integrating a Simulation Environment to reach the “4S”
We can optimize the parameters for a Circuit Breaker based on the typical workload (either real or simulated)
55/58
Integrating a Simulation Environment to reach the “4S”
We can use the results from the simulator to understand the behaviour based on different hypothetical workloads
56/58
Main Challenges
●
Modelling expected behaviour and how to verify it
●
Modelling unknown unknowns and uncertainty in the models
●
Which anomaly detection algorithms to integrate into the system
●
Modelling stress functions of components
●
Integration of ML models for all the phases of Fault Detection, Isolation,
Recovery
●
Accuracy of the simulator and capability of transferring the whole
architectural representation
●
Automation of the design of chaos engineering experiments and
integration of ML models
57/58
Main Takeaways
58/58
Thank you very much! Q&A
Thank you to the many colleagues and students who collaborated on the research: Radoslav Mičko, Dominik Arne Rebro,
Stanislav Chren, Michael Schvarcbacher, K. Hrabovská, Barbora Buhnova, Tomas Pitner, Martin Macak, Marcel Mraz, Hind
Bangui and many more