QUICK ALGORITHMS TO CALCULATE MEAN TIME TO
FAILURE AND STEADY STATE AVAILABILITY OF SYSTEMS
Moslem Amiri1
, Václav Přenosil2
ABSTRACT
Calculation of reliability and availability is a necessary part of every fault tolerant system
design. Conventional methods, which use calculus of probability, almost fail when there
is dependant failure, repair, or standby operation in the system. Use of Markov models
also almost breaks down when system is composed of many elements or system is
repairable. However, the alternatives for reliability and availability, i.e., mean time to
failure and steady state availability, respectively, are easily measurable using Markov
graphs. In this paper, we introduce an algorithm for the calculation of mean time to
failure, and also develop a method to measure steady state availability with its
corresponding algorithm.
Key Words:
Mean time to failure, steady state availability, Markov graphs, algorithm, measurement
1 INTRODUCTION
Any system which could function correctly while there exist some faults in it is called a
fault tolerant system. Some reasons to build fault tolerant systems are harsh
environments, novice users, high repairing costs, and large systems which should always
be kept up. Adding redundant components or functions is the most common approach to
acquiring fault tolerant systems. When designing a fault tolerant system, several features
need to be evaluated and a trade-off among them is required. These features are cost,
weight, volume, reliability, and availability. Reliability is the probability of no failure in a
given operating period, while availability is the probability that the system is up at any
point in time. Availability is generally measured to find the effect of repair on a system.
Calculation of reliability and availability is a necessary part of any fault tolerant design
process.
Conventionally, combinatorial reliability is used to measure the reliability of a system.
The steps taken in this method are as follows:
1. The system is decomposed into subsystems and units;
                                                            
1
RNDr. Moslem Amiri; Masaryk University, FI, B202; amiri@mail.muni.cz
2
prof. Ing. Václav Přenosil, CSc.; Masaryk University, FI, B406; prenosil@fi.muni.cz
  1
2. Calculus of probability is used to express the reliability of the whole system in
terms of the probability of success of its units;
3. Every unit’s probability of success is calculated using failure rate models;
4. Combining steps 2 and 3, system reliability is found.
In this method, element reliabilities are directly calculated. When there is dependent
failure, repair, or standby operation in the system, this method gets complicated. To
overcome this complexity, an alternative approach is use of Markov models. Markov
model approach works well when failure hazards, , and repair hazards, , are
constant. Throughout this paper, we will only consider constant failure and repair hazards
for simplicity. However, the algorithms and formulations explained here for calculation
of mean time to failure and steady state availability, can be easily expanded to timedependent
hazards.
)(tz )(tw
2 MARKOV MODELS
In this model, each component is assumed to be in either of two conditions at any time;
good or bad. Combination of all components composing the whole system, with each
component being either good or bad, is called a state. First of all, all mutually exclusive
states of the system should be defined. States of the system at time zero are called the
“initial states” and the states representing final or equilibrium states are “final states.”
Hence, in Markov models it is possible to appoint not only the state in which all the
components are operational, but any other state as the initial state. However, practically,
it is almost always the case that all the components are operational at time zero. Starting
from initial state, in time , there could be transition to every possible state with one
more bad component. These chains of transitions spread between states which differ only
in the condition of one component in time
t
t . For every state, there should be as many
transitions as the number of good components in that state in time . Probability of
transition in time from one state to another is
t
t ttz )( , where is the hazard rate
associated with this transition.
)(tz
For a system composed of a single non-repairable component , the two possible states
are (component is good) and
1x
10 xs  11 xs  (component is bad). A state transition table
along with the Markov graph for this system is shown in Fig. 1. In this system,
probability of being in state at time0s )(tt  is:
1. probability that system is in state 0s at time t (i.e., )(0
tPs ) and there is no failure
(transition) in time )(t (i.e., )()(1 ttz  );
2. or probability that system is in state 1s at time t (i.e., )(1
tPs ) and the probability of
repair in time )(t (which is zero in this example).
)(0)())(1()( 100
tPtPttzttP sss  (1)
  2
Final States
Initial
States 0s 1s
0s ttz  )(1 ttz )(
1s 0 1
(a) (b)
Figure 1 State transition table (a) and Markov graph (b) for a single non-repairable
component
Probability of being in state at time1s )(tt  is:
1. probability that system is in state 0s at time t (i.e., )(0
tPs ) and the probability of
failure (transition) in time )(t (i.e., )()( ttz  );
2. or probability that system is in state 1s at time t (i.e., )(1
tPs ) and the probability of
remaining there in time )(t (which is one because there is no repair).
)(1)())(()( 101
tPtPttzttP sss  (2)
Rearranging (1) and (2), we will have:
)()(
)()(
0
00
tPtz
t
tPttP
s
ss



(3)
)()(
)()(
0
11
tPtz
t
tPttP
s
ss



(4)
When approaches its limit (goes to zero), (3) and (4) will yield:)(t
)()(
)(
0
0
tPtz
dt
tdP
s
s
 (5)
)()(
)(
0
1
tPtz
dt
tdP
s
s
 (6)
A simple method to obtain (5) and (6) directly from Markov graph in Fig. 1 (b) is to [1]:
1. change any unity to zero and any t to one in the Markov graph;
  3
2. equate derivative of probability of any node to the sum of branches coming into
that node.
Formulation of Markov models always leads to a set of first order differential equations.
Solving these differential equations becomes complicated if there are a large number of
components in the system, or repair is added to the system. An alternative method is to
use Laplace transforms. Direct Laplace transform is an easy process which is in fact a
simple replacement. However, inverse transform requires partial fraction expansion for
exact measurements which is difficult to calculate, especially when repair factor is also
considered in the design. Although reliability and availability are obtained only if the
inverse transform is done, it is still possible to extract partial information from the
transformed equations in order to dispose of inverse Laplace transform. These easily
obtained information are mean time to failure (MTTF) and steady state availability
(SSA). MTTF and SSA can be used as alternatives to reliability and availability when
several systems should be compared. If failure hazards are constant ( )(tz ), the
transform pairs shown in Tab. 1 are sufficient to develop and calculate MTTF and SSA.
Table 1 Some Laplace transform pairs
No. )(tf )()()}({ *
sfsFtfL 
1
dt
tdf )(
)0()( fssF 
2 
t
dttf
0
)(
s
sF )(
3 )(lim tf
t 
)(lim
0
ssF
s
2.1 MEAN TIME TO FAILURE (MTTF)
The relationship between MTTF and reliability is as follows:
  

dRdttRMTTF
t
t 00
)(lim)( (7)
Applying theorem 2 of Tab. 1 to reliability function )(R will result in:
s
sR
dRL
t )(
)(
*
0







 (8)
And applying theorem 3 of Tab. 1 when will give: dRtf
t

0
)()(
  4
s
sR
sdRsLdRMTTF
s
t
s
t
t
)(
lim)(lim)(lim
*
0000 






   (9)
Hence MTTF can be calculated alternatively by the following equation [2]:
)(lim *
0
sRMTTF
s
 (10)
If )(tz and the system is good at 0t (i.e., 1)0(0
tPs and ), applying
Laplace transform to (5) and (6), MTTF for a single non-repairable system can be found
as follows:
0)0(1
tPs


s
sPsPssPtP
dt
tdP
ssss
s 1
)()(1)()(
)(
0000
0
(11)
)(
)(
1
)()()(
)(
1010
1










ss
sP
s
sPssPtP
dt
tdP
ssss
s
(12)
A one-component system is reliable when there is no failure (i.e., the system is in state
). Hence:0s





1
)(lim
1
)()( *
0
*
0
sRMTTF
s
sPsR
s
s (13)
At this point, we design an algorithm for calculation of MTTF for systems with no repair.
In order to illustrate the algorithms explained in this article, we use a 3-component
system as an aide. The Markov graph of such a system with no repair is shown in Fig. 2.
Laplace transformed probabilities of some sample nodes in Fig. 2 are developed here:
402010
1
)(0
 

s
sPs (14)
5131
10 )(
)( 0
1





s
sP
sP
s
s (15)
73
3231 )()(
)( 21
3





s
sPsP
sP
ss
s (16)
  5
Figure 2 Markov graph of a 3-component system with no repair
The Markov graph in Fig. 2 can be viewed as a unidirectional graph where various
permutations of a fixed number of failed elements out of the whole number of elements
stand on one specific layer. Fig. 3 illustrates such a graph which we call transition graph.
To measure MTTF for a k -out-of- system, transformed probabilities of the nodes on
layers to (in order) should be calculated and summed up while variable
approaches zero. The pseudo-code in Tab. 2 explains this procedure.
n
0 kn  s
Figure 3 Transition graph corresponding to the Markov graph of Fig. 2
  6
Table 2 MTTF measurement of a system with no repair
Input: k, n, transition graph //k-out-of-n configuration
Output: MTTF
 1)0(s0 (failure rates of layer 0’s outgoing branches)
)0(sMTTF 0
if k = n //series configuration
output MTTF
else
for layer to1L  kn  do
for every state on layer L dois
(failure rate of incoming branch from
j
i )L(s
 ))1L(s)1L(s jj (failure rates of outgoing
branches)
endfor

i
i )L(sMTTFMTTF
endfor
output  MTTF
endif
When the components are repairable, the transformed probabilities of the nodes are not
directly found. Instead, solving for probabilities will result in a system of linear
equations. Viewing this linear system as a matrix equation, bAx  , the matrix and the
column vectors
A
x and need to be obtained from the transition graph of the system. For
example, ,
b
A x , and for a 1-out-of-3 system of Fig. 4 are shown in Tab. 3.b
Table 3 Matrices required for a system of linear equations of a 1-out-of-3 system shown
in Fig. 4
)(0000)(
0)(000)(
00)(00)(
0)(00)(
00)(0)(
000)()(
000)()(
)()()()()()()(
46267664626
45157554515
23137332313
4645046454404
2623026232202
1513015131101
0402014020100
6534210






















sP
sP
sP
sP
sP
sP
sP
sPsPsPsPsPsPsP
A
s
s
s
s
s
s
s
sssssss
)(
)(
)(
)(
)(
)(
)(
6
5
3
4
2
1
0
sP
sP
sP
sP
sP
sP
sP
x
s
s
s
s
s
s
s

0)(
0)(
0)(
0)(
0)(
0)(
1)(
6
5
3
4
2
1
0
sP
sP
sP
sP
sP
sP
sP
b
s
s
s
s
s
s
s 

The pseudo-code in Tab. 4 explains the procedure to measure MTTF for a -out-of-n
system with repair.
k
  7
Figure 4 Transition graph of a 3-component repairable system for MTTF measurement
Table 4 MTTF measurement of a system with repair
Input: k, n, transition graph //k-out-of-n configuration
Output: MTTF
Create matrices ,]MM[A  ]1M[x  , and ]1M[b  , where is the
number of nodes on layers 0 to n – k
M
Matrices and are initially filled; with variables in any
order, however once an order is picked, rows and columns of and rows
of should be consistent with that order. is filled with constant
associated with , and 0 with the rest, if any
x b x )s(P is
A
b b
1 )s(P 0s
if 1k 
Remove all the branches from layer 1kn  to kn 
endif
for every node on the layers 0 tois kn 
in the cell associated with[]A  )s(P)s(P ii ss (hazard rates
of the branches leaving the node )is
in the cell associated with[]A  )s(P)s(P ji ss hazard rate of
the branch coming from every node tojs is
endfor
Empty cells of 0[]A 
Solve bAx 
  8

i
]1,i[xMTTF
2.2 STEADY STATE AVAILABILITY (SSA)
An important difference between reliability and availability is their steady state
behavior; as time approaches infinity, approaches zero while approaches some
fixed value. Fig. 5 shows this behavior for a system which is initially good.
)(tR )(tA
(tA)(tR )
Figure 5 Availability function
In this paper, we develop an easy method to calculate SSA without the need to calculate
the inverse Laplace transform for the nodes of Markov graph. is the probability that
an item is up at any point in time, i.e.:
)(tA
)()(
)(
)(
ttimeDownttimeUp
ttimeUp
tA

 (17)
Although it seems that taking integration of from zero to infinity will equal infinity,
the fraction in (17) cancels the common factors which make approach infinity.
can be thought of as a power signal, while is an energy signal. Hence, the
relationship between SSA and can be written as:
)(tA
)(tA )(tA
)(tR
)(tA
  

dAdttASSA
t
t 00
)(lim)( (18)
Applying theorem 2 of Tab. 1 to availability function )(A will result in:
s
sA
dAL
t )(
)(
*
0







 (19)
And applying theorem 3 of Tab. 1 when will give: dAtf
t

0
)()(
  9
s
sA
sdAsLdASSA
s
t
s
t
t
)(
lim)(lim)(lim
*
0000 







  (20)
Hence SSA can be calculated alternatively by the following equation:
)()(
)(
lim)(lim
0
*
0 stimeDownstimeUp
stimeUp
sASSA
ss 


(21)
In the procedure of calculating SSA, solving for probabilities will result in a system of
linear equations, similar to the calculation of MTTF for repairable systems. However, in
this case, all the repair hazard branches of the system should be considered, and the
matrix equation should contain all the states. Hence, the processing of linear system for
SSA calculation is independent of the value for a -out-of-n system. For example, ,k k A
x , and for a 3-component system (whose graph is shown in Fig. 6) are shown in Tab.
5.
b
Table 5 Matrices required for a system of linear equations of a 3-component system
shown in Fig. 6

75737
64626
75451554515
73231332313
45645404404
23623202202
1513513101101
0402014020100
534210
0000)(
0000)(
)(000)(
0)(00)(
0)(00)(
00)(0)(
00)()(
00)()(
)()()()()()(























sP
sP
ssP
ssP
ssP
ssP
ssP
ssP
sPsPsPsPsPsP
A
s
s
s
s
s
s
s
s
ssssss
)(
)(
0
0
0
0
00
00
)()(
67573776
67764626
57
37
46
26
76
s
s
sPsP
A
ss
















)(
)(
)(
)(
)(
)(
)(
)(
7
6
5
3
4
2
1
0
sP
sP
sP
sP
sP
sP
sP
sP
x
s
s
s
s
s
s
s
s

0)(
0)(
0)(
0)(
0)(
0)(
0)(
1)(
7
6
5
3
4
2
1
0
sP
sP
sP
sP
sP
sP
sP
sP
b
s
s
s
s
s
s
s
s 

As seen in matrix A in Tab. 5, the variables are not replaced by zero, otherwise the
probabilities of the nodes will equal to infinity (which is true since SSA is a fixed
nonzero value extending to infinity). If the fraction formula in (21) is used, the common
product factors (which include and make the single probabilities equal infinity) from
the numerator and denominator of the fraction are cancelled. Hence, in calculation of the
probabilities of the nodes in the system of linear equations, the product factors need not
be expanded, since they will later be replaced in Eq. (21) and could be cancelled.
s
s
The pseudo-code in Tab. 6 explains the procedure to measure SSA for a -out-of-
system.
k n
  10
Figure 6 Transition graph of a 3-component system for SSA measurement
Table 6 SSA measurement
Input: k, n, transition graph //k-out-of-n configuration
Output: SSA
Create matrices ,]MM[A  ]1M[x  , and ]1M[b  , where is the
number of components in the system
M
Matrices and are initially filled; with variables in any
order, however once an order is picked, rows and columns of and rows
of should be consistent with that order. is filled with constant
associated with , and 0 with the rest, if any
x b x )s(P is
A
b b
1 )s(P 0s
for every node on the layers 0 to nis
in the cell associated with[]A  s[)s(P)s(P ii ss (hazard
rates of the branches leaving the node )]is
in the cell associated with[]A  )s(P)s(P ji ss hazard rate of
the branch coming from every node tojs is
endfor
Empty cells of 0[]A 
Solve bAx 
)s(timeUp (probabilities of the nodes on layers 0 to )kn 
)s(timeDown (probabilities of the nodes on layers to n )1kn 
  11
)s(timeDown)s(timeUp
)s(timeUp
limSSA
0s 


//common product factors should be
//cancelled before setting s to 0
3 COLLAPSED MARKOV GRAPHS
The general form of a Markov graph contains states for an -component system.
However, it is possible to reduce the number of nodes in a Markov graph and
consequently, reduce the complexity of the graph. Those nodes on one specific layer of
the graph which have equal hazard rates on their outgoing branches could be collapsed
into one node. The hazard rates of the incoming branches to these collapsed nodes need
not be equal.
n
2 n
3.1 ORDINARY PARALLEL SYSTEMS
In an ordinary parallel system, all the components including the primary and back ups,
start operating at time zero and all can fail. If there are independent components in the
system with equal failure rates of
m
 each, the failure rates of the branches in going from
one layer to the next, starting from layer zero, will be ,)2(,)1(,   mmm . This is
simply the addition of the failure rates of the outgoing branches of one node on every
layer (it could be any node on the layer) in this system’s general Markov model
equivalent. However, if the repair rate in the general model is the same for every
component and equal to  , it does not follow the same rule as in the case of failure rate.
The repair rate is relative to the number of repairmen present and whether they work
cooperatively. If  is the repair rate for one repairman, k is the repair rate for
repairmen. Fig. 7 depicts the collapsed transition graph for a 3-component system. The
algorithms explained in section 2 apply to these collapsed graphs as well.
k
Figure 7 Collapsed transition graph of a 3-component system for MTTF and SSA
measurements; the failure rate of every component is  and repairmen are presentk
3.2 STANDBY SYSTEMS
  12
The weak point of an ordinary parallel system is that the backup components can fail
even before they are put on-line. This happens because all the components in such a
system are energized at time zero. An improvement is to energize the primary component
only. Thus, as soon as the primary component fails, one backup is energized and put online.
Contrary to the parallel systems in which every component can fail, in standby
systems only one component (the active one) can fail at a time (of course, this is true only
if the components do not fail when they are not energized). Fig. 8 shows the collapsed
transition graph for a 3-component standby system with components having the failure
rate of  each. As it is seen, the transition probability is t for every branch. For
standby systems, it is not possible to illustrate k -out-of- system for various s in one
graph, and the algorithms explained in section 2 do not directly apply to the collapsed
graphs of these systems. However, the fact that exposure time for standby elements does
not start until on-line component has failed, brings simplicity to measurements of features
of standby systems. For example, for identical non-repairable standby elements with
failure rates
n k
n
 each, the system succeeds if there are 1n or fewer failures. The
reliability of this system can simply be calculated by Poisson distribution [2]:



 

1
0
!
)(
)(
n
i
i
t
i
t
etR (22)
Figure 8 Transition graph of a 3-component standby system for 1-out-of-3 configuration;
the failure rate of every component is  and repairmen are present. The active
component at each node is indicated by an arrow
k
ACKNOWLEDGEMENT
The work presented in this paper has been supported under research project SPECTRUM,
No. TA01011383 by Technology Agency of the Czech Republic.
REFERENCES
[1] Shooman, M. L. Probabilistic Reliability: An Engineering Approach, 2nd ed. Krieger,
Melbourne, FL, 1990.
[2] Shooman, M. L. Reliability of Computer Systems and Networks: Fault Tolerance,
Analysis, and Design, 1st ed. Wiley-Interscience, 2001.
  13