QUICK ALGORITHMS TO CALCULATE MEAN TIME TO FAILURE AND STEADY STATE AVAILABILITY OF SYSTEMS Moslem Amiri1 , Václav Přenosil2 ABSTRACT Calculation of reliability and availability is a necessary part of every fault tolerant system design. Conventional methods, which use calculus of probability, almost fail when there is dependant failure, repair, or standby operation in the system. Use of Markov models also almost breaks down when system is composed of many elements or system is repairable. However, the alternatives for reliability and availability, i.e., mean time to failure and steady state availability, respectively, are easily measurable using Markov graphs. In this paper, we introduce an algorithm for the calculation of mean time to failure, and also develop a method to measure steady state availability with its corresponding algorithm. Key Words: Mean time to failure, steady state availability, Markov graphs, algorithm, measurement 1 INTRODUCTION Any system which could function correctly while there exist some faults in it is called a fault tolerant system. Some reasons to build fault tolerant systems are harsh environments, novice users, high repairing costs, and large systems which should always be kept up. Adding redundant components or functions is the most common approach to acquiring fault tolerant systems. When designing a fault tolerant system, several features need to be evaluated and a trade-off among them is required. These features are cost, weight, volume, reliability, and availability. Reliability is the probability of no failure in a given operating period, while availability is the probability that the system is up at any point in time. Availability is generally measured to find the effect of repair on a system. Calculation of reliability and availability is a necessary part of any fault tolerant design process. Conventionally, combinatorial reliability is used to measure the reliability of a system. The steps taken in this method are as follows: 1. The system is decomposed into subsystems and units;                                                              1 RNDr. Moslem Amiri; Masaryk University, FI, B202; amiri@mail.muni.cz 2 prof. Ing. Václav Přenosil, CSc.; Masaryk University, FI, B406; prenosil@fi.muni.cz   1 2. Calculus of probability is used to express the reliability of the whole system in terms of the probability of success of its units; 3. Every unit’s probability of success is calculated using failure rate models; 4. Combining steps 2 and 3, system reliability is found. In this method, element reliabilities are directly calculated. When there is dependent failure, repair, or standby operation in the system, this method gets complicated. To overcome this complexity, an alternative approach is use of Markov models. Markov model approach works well when failure hazards, , and repair hazards, , are constant. Throughout this paper, we will only consider constant failure and repair hazards for simplicity. However, the algorithms and formulations explained here for calculation of mean time to failure and steady state availability, can be easily expanded to timedependent hazards. )(tz )(tw 2 MARKOV MODELS In this model, each component is assumed to be in either of two conditions at any time; good or bad. Combination of all components composing the whole system, with each component being either good or bad, is called a state. First of all, all mutually exclusive states of the system should be defined. States of the system at time zero are called the “initial states” and the states representing final or equilibrium states are “final states.” Hence, in Markov models it is possible to appoint not only the state in which all the components are operational, but any other state as the initial state. However, practically, it is almost always the case that all the components are operational at time zero. Starting from initial state, in time , there could be transition to every possible state with one more bad component. These chains of transitions spread between states which differ only in the condition of one component in time t t . For every state, there should be as many transitions as the number of good components in that state in time . Probability of transition in time from one state to another is t t ttz )( , where is the hazard rate associated with this transition. )(tz For a system composed of a single non-repairable component , the two possible states are (component is good) and 1x 10 xs  11 xs  (component is bad). A state transition table along with the Markov graph for this system is shown in Fig. 1. In this system, probability of being in state at time0s )(tt  is: 1. probability that system is in state 0s at time t (i.e., )(0 tPs ) and there is no failure (transition) in time )(t (i.e., )()(1 ttz  ); 2. or probability that system is in state 1s at time t (i.e., )(1 tPs ) and the probability of repair in time )(t (which is zero in this example). )(0)())(1()( 100 tPtPttzttP sss  (1)   2 Final States Initial States 0s 1s 0s ttz  )(1 ttz )( 1s 0 1 (a) (b) Figure 1 State transition table (a) and Markov graph (b) for a single non-repairable component Probability of being in state at time1s )(tt  is: 1. probability that system is in state 0s at time t (i.e., )(0 tPs ) and the probability of failure (transition) in time )(t (i.e., )()( ttz  ); 2. or probability that system is in state 1s at time t (i.e., )(1 tPs ) and the probability of remaining there in time )(t (which is one because there is no repair). )(1)())(()( 101 tPtPttzttP sss  (2) Rearranging (1) and (2), we will have: )()( )()( 0 00 tPtz t tPttP s ss    (3) )()( )()( 0 11 tPtz t tPttP s ss    (4) When approaches its limit (goes to zero), (3) and (4) will yield:)(t )()( )( 0 0 tPtz dt tdP s s  (5) )()( )( 0 1 tPtz dt tdP s s  (6) A simple method to obtain (5) and (6) directly from Markov graph in Fig. 1 (b) is to [1]: 1. change any unity to zero and any t to one in the Markov graph;   3 2. equate derivative of probability of any node to the sum of branches coming into that node. Formulation of Markov models always leads to a set of first order differential equations. Solving these differential equations becomes complicated if there are a large number of components in the system, or repair is added to the system. An alternative method is to use Laplace transforms. Direct Laplace transform is an easy process which is in fact a simple replacement. However, inverse transform requires partial fraction expansion for exact measurements which is difficult to calculate, especially when repair factor is also considered in the design. Although reliability and availability are obtained only if the inverse transform is done, it is still possible to extract partial information from the transformed equations in order to dispose of inverse Laplace transform. These easily obtained information are mean time to failure (MTTF) and steady state availability (SSA). MTTF and SSA can be used as alternatives to reliability and availability when several systems should be compared. If failure hazards are constant ( )(tz ), the transform pairs shown in Tab. 1 are sufficient to develop and calculate MTTF and SSA. Table 1 Some Laplace transform pairs No. )(tf )()()}({ * sfsFtfL  1 dt tdf )( )0()( fssF  2  t dttf 0 )( s sF )( 3 )(lim tf t  )(lim 0 ssF s 2.1 MEAN TIME TO FAILURE (MTTF) The relationship between MTTF and reliability is as follows:     dRdttRMTTF t t 00 )(lim)( (7) Applying theorem 2 of Tab. 1 to reliability function )(R will result in: s sR dRL t )( )( * 0         (8) And applying theorem 3 of Tab. 1 when will give: dRtf t  0 )()(   4 s sR sdRsLdRMTTF s t s t t )( lim)(lim)(lim * 0000           (9) Hence MTTF can be calculated alternatively by the following equation [2]: )(lim * 0 sRMTTF s  (10) If )(tz and the system is good at 0t (i.e., 1)0(0 tPs and ), applying Laplace transform to (5) and (6), MTTF for a single non-repairable system can be found as follows: 0)0(1 tPs   s sPsPssPtP dt tdP ssss s 1 )()(1)()( )( 0000 0 (11) )( )( 1 )()()( )( 1010 1           ss sP s sPssPtP dt tdP ssss s (12) A one-component system is reliable when there is no failure (i.e., the system is in state ). Hence:0s      1 )(lim 1 )()( * 0 * 0 sRMTTF s sPsR s s (13) At this point, we design an algorithm for calculation of MTTF for systems with no repair. In order to illustrate the algorithms explained in this article, we use a 3-component system as an aide. The Markov graph of such a system with no repair is shown in Fig. 2. Laplace transformed probabilities of some sample nodes in Fig. 2 are developed here: 402010 1 )(0    s sPs (14) 5131 10 )( )( 0 1      s sP sP s s (15) 73 3231 )()( )( 21 3      s sPsP sP ss s (16)   5 Figure 2 Markov graph of a 3-component system with no repair The Markov graph in Fig. 2 can be viewed as a unidirectional graph where various permutations of a fixed number of failed elements out of the whole number of elements stand on one specific layer. Fig. 3 illustrates such a graph which we call transition graph. To measure MTTF for a k -out-of- system, transformed probabilities of the nodes on layers to (in order) should be calculated and summed up while variable approaches zero. The pseudo-code in Tab. 2 explains this procedure. n 0 kn  s Figure 3 Transition graph corresponding to the Markov graph of Fig. 2   6 Table 2 MTTF measurement of a system with no repair Input: k, n, transition graph //k-out-of-n configuration Output: MTTF  1)0(s0 (failure rates of layer 0’s outgoing branches) )0(sMTTF 0 if k = n //series configuration output MTTF else for layer to1L  kn  do for every state on layer L dois (failure rate of incoming branch from j i )L(s  ))1L(s)1L(s jj (failure rates of outgoing branches) endfor  i i )L(sMTTFMTTF endfor output  MTTF endif When the components are repairable, the transformed probabilities of the nodes are not directly found. Instead, solving for probabilities will result in a system of linear equations. Viewing this linear system as a matrix equation, bAx  , the matrix and the column vectors A x and need to be obtained from the transition graph of the system. For example, , b A x , and for a 1-out-of-3 system of Fig. 4 are shown in Tab. 3.b Table 3 Matrices required for a system of linear equations of a 1-out-of-3 system shown in Fig. 4 )(0000)( 0)(000)( 00)(00)( 0)(00)( 00)(0)( 000)()( 000)()( )()()()()()()( 46267664626 45157554515 23137332313 4645046454404 2623026232202 1513015131101 0402014020100 6534210                       sP sP sP sP sP sP sP sPsPsPsPsPsPsP A s s s s s s s sssssss )( )( )( )( )( )( )( 6 5 3 4 2 1 0 sP sP sP sP sP sP sP x s s s s s s s  0)( 0)( 0)( 0)( 0)( 0)( 1)( 6 5 3 4 2 1 0 sP sP sP sP sP sP sP b s s s s s s s   The pseudo-code in Tab. 4 explains the procedure to measure MTTF for a -out-of-n system with repair. k   7 Figure 4 Transition graph of a 3-component repairable system for MTTF measurement Table 4 MTTF measurement of a system with repair Input: k, n, transition graph //k-out-of-n configuration Output: MTTF Create matrices ,]MM[A  ]1M[x  , and ]1M[b  , where is the number of nodes on layers 0 to n – k M Matrices and are initially filled; with variables in any order, however once an order is picked, rows and columns of and rows of should be consistent with that order. is filled with constant associated with , and 0 with the rest, if any x b x )s(P is A b b 1 )s(P 0s if 1k  Remove all the branches from layer 1kn  to kn  endif for every node on the layers 0 tois kn  in the cell associated with[]A  )s(P)s(P ii ss (hazard rates of the branches leaving the node )is in the cell associated with[]A  )s(P)s(P ji ss hazard rate of the branch coming from every node tojs is endfor Empty cells of 0[]A  Solve bAx    8  i ]1,i[xMTTF 2.2 STEADY STATE AVAILABILITY (SSA) An important difference between reliability and availability is their steady state behavior; as time approaches infinity, approaches zero while approaches some fixed value. Fig. 5 shows this behavior for a system which is initially good. )(tR )(tA (tA)(tR ) Figure 5 Availability function In this paper, we develop an easy method to calculate SSA without the need to calculate the inverse Laplace transform for the nodes of Markov graph. is the probability that an item is up at any point in time, i.e.: )(tA )()( )( )( ttimeDownttimeUp ttimeUp tA   (17) Although it seems that taking integration of from zero to infinity will equal infinity, the fraction in (17) cancels the common factors which make approach infinity. can be thought of as a power signal, while is an energy signal. Hence, the relationship between SSA and can be written as: )(tA )(tA )(tA )(tR )(tA     dAdttASSA t t 00 )(lim)( (18) Applying theorem 2 of Tab. 1 to availability function )(A will result in: s sA dAL t )( )( * 0         (19) And applying theorem 3 of Tab. 1 when will give: dAtf t  0 )()(   9 s sA sdAsLdASSA s t s t t )( lim)(lim)(lim * 0000           (20) Hence SSA can be calculated alternatively by the following equation: )()( )( lim)(lim 0 * 0 stimeDownstimeUp stimeUp sASSA ss    (21) In the procedure of calculating SSA, solving for probabilities will result in a system of linear equations, similar to the calculation of MTTF for repairable systems. However, in this case, all the repair hazard branches of the system should be considered, and the matrix equation should contain all the states. Hence, the processing of linear system for SSA calculation is independent of the value for a -out-of-n system. For example, ,k k A x , and for a 3-component system (whose graph is shown in Fig. 6) are shown in Tab. 5. b Table 5 Matrices required for a system of linear equations of a 3-component system shown in Fig. 6  75737 64626 75451554515 73231332313 45645404404 23623202202 1513513101101 0402014020100 534210 0000)( 0000)( )(000)( 0)(00)( 0)(00)( 00)(0)( 00)()( 00)()( )()()()()()(                        sP sP ssP ssP ssP ssP ssP ssP sPsPsPsPsPsP A s s s s s s s s ssssss )( )( 0 0 0 0 00 00 )()( 67573776 67764626 57 37 46 26 76 s s sPsP A ss                 )( )( )( )( )( )( )( )( 7 6 5 3 4 2 1 0 sP sP sP sP sP sP sP sP x s s s s s s s s  0)( 0)( 0)( 0)( 0)( 0)( 0)( 1)( 7 6 5 3 4 2 1 0 sP sP sP sP sP sP sP sP b s s s s s s s s   As seen in matrix A in Tab. 5, the variables are not replaced by zero, otherwise the probabilities of the nodes will equal to infinity (which is true since SSA is a fixed nonzero value extending to infinity). If the fraction formula in (21) is used, the common product factors (which include and make the single probabilities equal infinity) from the numerator and denominator of the fraction are cancelled. Hence, in calculation of the probabilities of the nodes in the system of linear equations, the product factors need not be expanded, since they will later be replaced in Eq. (21) and could be cancelled. s s The pseudo-code in Tab. 6 explains the procedure to measure SSA for a -out-of- system. k n   10 Figure 6 Transition graph of a 3-component system for SSA measurement Table 6 SSA measurement Input: k, n, transition graph //k-out-of-n configuration Output: SSA Create matrices ,]MM[A  ]1M[x  , and ]1M[b  , where is the number of components in the system M Matrices and are initially filled; with variables in any order, however once an order is picked, rows and columns of and rows of should be consistent with that order. is filled with constant associated with , and 0 with the rest, if any x b x )s(P is A b b 1 )s(P 0s for every node on the layers 0 to nis in the cell associated with[]A  s[)s(P)s(P ii ss (hazard rates of the branches leaving the node )]is in the cell associated with[]A  )s(P)s(P ji ss hazard rate of the branch coming from every node tojs is endfor Empty cells of 0[]A  Solve bAx  )s(timeUp (probabilities of the nodes on layers 0 to )kn  )s(timeDown (probabilities of the nodes on layers to n )1kn    11 )s(timeDown)s(timeUp )s(timeUp limSSA 0s    //common product factors should be //cancelled before setting s to 0 3 COLLAPSED MARKOV GRAPHS The general form of a Markov graph contains states for an -component system. However, it is possible to reduce the number of nodes in a Markov graph and consequently, reduce the complexity of the graph. Those nodes on one specific layer of the graph which have equal hazard rates on their outgoing branches could be collapsed into one node. The hazard rates of the incoming branches to these collapsed nodes need not be equal. n 2 n 3.1 ORDINARY PARALLEL SYSTEMS In an ordinary parallel system, all the components including the primary and back ups, start operating at time zero and all can fail. If there are independent components in the system with equal failure rates of m  each, the failure rates of the branches in going from one layer to the next, starting from layer zero, will be ,)2(,)1(,   mmm . This is simply the addition of the failure rates of the outgoing branches of one node on every layer (it could be any node on the layer) in this system’s general Markov model equivalent. However, if the repair rate in the general model is the same for every component and equal to  , it does not follow the same rule as in the case of failure rate. The repair rate is relative to the number of repairmen present and whether they work cooperatively. If  is the repair rate for one repairman, k is the repair rate for repairmen. Fig. 7 depicts the collapsed transition graph for a 3-component system. The algorithms explained in section 2 apply to these collapsed graphs as well. k Figure 7 Collapsed transition graph of a 3-component system for MTTF and SSA measurements; the failure rate of every component is  and repairmen are presentk 3.2 STANDBY SYSTEMS   12 The weak point of an ordinary parallel system is that the backup components can fail even before they are put on-line. This happens because all the components in such a system are energized at time zero. An improvement is to energize the primary component only. Thus, as soon as the primary component fails, one backup is energized and put online. Contrary to the parallel systems in which every component can fail, in standby systems only one component (the active one) can fail at a time (of course, this is true only if the components do not fail when they are not energized). Fig. 8 shows the collapsed transition graph for a 3-component standby system with components having the failure rate of  each. As it is seen, the transition probability is t for every branch. For standby systems, it is not possible to illustrate k -out-of- system for various s in one graph, and the algorithms explained in section 2 do not directly apply to the collapsed graphs of these systems. However, the fact that exposure time for standby elements does not start until on-line component has failed, brings simplicity to measurements of features of standby systems. For example, for identical non-repairable standby elements with failure rates n k n  each, the system succeeds if there are 1n or fewer failures. The reliability of this system can simply be calculated by Poisson distribution [2]:       1 0 ! )( )( n i i t i t etR (22) Figure 8 Transition graph of a 3-component standby system for 1-out-of-3 configuration; the failure rate of every component is  and repairmen are present. The active component at each node is indicated by an arrow k ACKNOWLEDGEMENT The work presented in this paper has been supported under research project SPECTRUM, No. TA01011383 by Technology Agency of the Czech Republic. REFERENCES [1] Shooman, M. L. Probabilistic Reliability: An Engineering Approach, 2nd ed. Krieger, Melbourne, FL, 1990. [2] Shooman, M. L. Reliability of Computer Systems and Networks: Fault Tolerance, Analysis, and Design, 1st ed. Wiley-Interscience, 2001.   13