General Solutions for MTTF and Steady-State Availability of NMR Systems Moslem Amiri, V´aclav Pˇrenosil Abstract—Voting redundancy is a well-known technique to improve the fault tolerance of digital systems. Calculation of reliability and availability is a necessary part of every fault tolerant system design. Conventional methods, including block diagram method, are unsuccessful when there is dependent failure, repair, or standby operation in the system. Use of Markov models also almost breaks down when system is composed of many elements or is repairable. However, the alternatives to reliability and availability, i.e. mean time to failure and steady state availability, respectively, are easily measurable using Markov graphs. In this paper, novel general form solutions for calculation of mean time to failure, and steady-state availability of N-modular redundancy systems are presented. Keywords—NMR systems, Mean time to failure, Steady state availability, Markov graphs, General solutions. I. INTRODUCTION VOTING redundancy is a well-known technique to improve the fault tolerance of digital systems. Although parallel and standby systems are also used for this purpose, the complexity involved with designing coupler in parallel and switch in standby systems makes the efficient implementation of them difficult. Applied to digital systems, voting redundancy takes advantage of the digital nature of elements’ outputs to remove some of these problems. Implementation of voting redundancy concept, called Nmodular redundancy (NMR), is shown in Fig.1. This system consists of 2n+1 parallel digital circuits, all having equivalent logic (generally, identical digital circuits) with the same input applied to all elements. The outputs of the 2n + 1 circuits are compared by the voter and the majority is given as the system output. The basic NMR system is called triple modular redundancy (TMR) with only three parallel digital elements (n = 1). TMR is the most common implementation of majority voting systems due to its lower cost. However, as the cost of digital circuits are reduced, NMR systems with higher number of replicated digital components gain popularity to increase the fault tolerance of systems further. When designing a fault tolerant system, several features need to be evaluated and a trade-off among them is required. These features include cost, weight, volume, reliability, and availability. Reliability is the probability of no failure in a given operating period, while availability is the probability that the system is up at any point in time. Availability is generally measured to find the effect of repair on a system. Calculation Manuscript received March 31, 2014. This work was supported by Technology Agency of the Czech Republic under contract No. TA01011383/2011. M. Amiri and V. Pˇrenosil are with the Faculty of Informatics, Masaryk University, Brno, Czech Republic (e-mail: amiri@mail.muni.cz; prenosil@fi.muni.cz). Digital circuit 1 Digital circuit 2 Digital circuit 2n+1 Voter (0, 1) (0, 1) Fig. 1. Majority voting. of reliability and availability is a necessary part of any fault tolerant design process. Ideally, we would prefer to find general solutions for reliability and availability of NMR systems, but mathematically this is not achievable. Two alternatives, however, are commonly used in fault-tolerant systems design; mean time to failure and steady state availability. In this paper, we will seek general solutions for these features of repairable and nonrepairable NMR systems. Throughout this article, we assume that the parallel digital components are identical and have the same constant failure rate, λ, and that the voter does not fail. II. RELIABILITY AND MEAN TIME TO FAILURE (MTTF) NMR systems were first introduced and discussed in the 1960s [1] [2]. For a system consisting of 2n+1 parallel digital circuits and a perfect voter, the reliability without repair is R = 2n+1 i=n+1 B(i : 2n + 1) = 2n+1 i=n+1 2n + 1 i pi (1 − p)2n+1−i (1) where p is the success probability (reliability) of any digital circuit. Eq. 1 is in fact the reliability expression for an ”n + 1 out of 2n + 1” system. If the system is repairable, Markov models are generally used to compute various features of the system, including reliability. To compute reliability of a system with n states, formulation of Markov models always leads to n first-order differential equations. Repair rates couple these equations, creating an nth order differential equation. Even if Laplace transform theory is applied, the solution involves finding the roots of an nth-order polynomial. For n ≥ 5, it is proven that there is no closed-form solutions [3]. Although reliability is obtained only if the inverse Laplace transform is done, it is still possible to extract partial information from the transformed equations in order to dispose of inverse transform operations. This easily obtained feature Zero failures 0s t 1s 2s nstn  )12( tn )2( tn  )12( tt tn  )2( t 1nstn  )1( tn  )12(1 tn  )2(1  tn  ))12((1  tn  ))1((1  1 One failure Two failures n failures n + 1 or more failures Fig. 2. A Markov reliability model for a 2n+1-level majority voting system with repair. is MTTF. MTTF can be used as an alternative to reliability when several systems should be compared. The Markov reliability diagram for an NMR system, composed of 2n + 1 digital components, is given in Fig. 2. In this Figure, state sn+1 is an absorbing state; this state represents n + 1 or more failures. Also in moving from one state to the next, from left to right, the coefficients of λ express the number of possible ways to experience a single failure. The repair rate, µ, is always the same, meaning a single repairman is present for repair. The relationship between MTTF and reliability is as fol- lows: MTTF = ∞ 0 R(t)dt = lim t→∞ t 0 R(τ)dτ (2) Knowing L t 0 R(τ)dτ = R (s) s (3) and L lim t→∞ f(t) = lim s→0 sF(s) (4) then MTTF can be calculated as: MTTF = lim s→0 sL t 0 R(τ)dτ = lim s→0 s R (s) s = lim s→0 R (s) (5) For the Markov model of Fig. 2, the following time-domain differential equations are obtained: dPs0 (t) dt = −(2n + 1)λPs0 (t) + µPs1 (t) dPs1 (t) dt = (2n + 1)λPs0 (t) − (2nλ + µ)Ps1 (t) + µPs2 (t) dPs2 (t) dt = (2n)λPs1 (t) − ((2n − 1)λ + µ)Ps2 (t) + µPs3 (t) · · · dPsn (t) dt = (n + 2)λPs(n−1) (t) − ((n + 1)λ + µ)Psn (t) dPs(n+1) (t) dt = (n + 1)λPsn (t) + Ps(n+1) (t) (6) If the system is initially good, then Psk (t = 0) = 1, if k = 0. 0, otherwise. (7) After taking the Laplace transforms of the Eqs. 6, and incorporating the initial conditions (Eq. 7), and letting s approach 0, the transformed equations will be as follows: −(2n + 1)λPs0 (s) + µPs1 (s) = −1 (2n + 1)λPs0 (s) − (2nλ + µ)Ps1 (s) + µPs2 (s) = 0 (2n)λPs1 (s) − ((2n − 1)λ + µ)Ps2 (s) + µPs3 (s) = 0 · · · (n + 2)λPs(n−1) (s) − ((n + 1)λ + µ)Psn (s) = 0 (n + 1)λPs(n) (s) + Ps(n+1) (s) = 0 (8) Solving Eqs. 8 for the state probabilities in Laplace domain will result in the following general equations: Psk (s) = µPs(k+1) (s) + 1 (2n + 1 − k)λ if 0 ≤ k ≤ n − 1 (9) and Psn (s) = 1 (n + 1)λ (10) Manipulating Eqs. 9 and 10, for 0 ≤ k ≤ n − 1: Psk (s) = µn−k + λ n−k−1 i=0 µn−k−1−i λi i j=0(n + 1 + j) λn−k+1 n i=k(2n + 1 − i) (11) Therefore, MTTF with non-zero repair will be equal to MTTFrepairable = 1 (n + 1)λ + n−1 k=0 µn−k + λ n−k−1 i=0 µn−k−1−i λi i j=0(n + 1 + j) λn−k+1 n i=k(2n + 1 − i) (12) If there is no repair in the system, then from Eqs. 9 and 10 we can obtain the MTTF as: MTTFnon-repairable = n k=0 1 (2n + 1 − k)λ (13) For example, for a TMR system (n = 1): MTTFrepairable = 1 2λ + µ + 2λ 6λ2 = 5λ + µ 6λ2 MTTFnon-repairable = 1 3λ + 1 2λ = 5 6λ (14) III. AVAILABILITY AND STEADY STATE AVAILABILITY (SSA) If a system can tolerate brief failures and continue its operation after it is repaired, availability is a useful measure of its performance. Similar to the reliability computation of NMR systems, there is no general solution for the availability computation. However, as MTTF (with general solution) could be used as an alternative to reliability, the SSA could be used as the alternative to availability. In this section, we will seek a general solution for SSA. A(t) is the probability that the system is up at any point in time, i.e. A(t) = Up time (t) Up time (t) + Down time (t) (15) An important difference between reliability R(t) and availability A(t) is their steady state behavior; as time approaches )(tAtyAvailabili SSA tTime 1 Fig. 3. Availability function. Zero failures 0s t 1s 2s nstn  )12( tn )2( tn  )12( tt tn  )2( t 1nstn  )1( tn  )12(1 tn  )2(1  tn  ))12((1  tn  ))1((1  One failure Two failures n failures n + 1 or more failures t t 1 Fig. 4. A Markov availability model for a 2n + 1-level majority voting system with repair. infinity, R(t) approaches zero while A(t) approaches some fixed value. This final value of the availability function is SSA. Fig. 3 shows this behavior for a system which is initially good. A Markov availability model for an NMR system is given in Fig. 4. In this model, after complete failure, the system could be restored to operational status. A set of time-domain differential equations can be written for Markov availability model of Fig. 4. We will use the Laplace transform theorems, assuming that the system is initially good, to simplify the solution. SSA can be calculated through A(s) when s → 0. Laplace transformed equations for availability measurement will be as follows: −(2n + 1)λPs0 (s) + µPs1 (s) = −1 (2n + 1)λPs0 (s) − (2nλ + µ)Ps1 (s) + µPs2 (s) = 0 (2n)λPs1 (s) − ((2n − 1)λ + µ)Ps2 (s) + µPs3 (s) = 0 · · · (n + 2)λPs(n−1) (s) − ((n + 1)λ + µ)Psn (s) + µPs(n+1) (s) = 0 (n + 1)λPsn (s) − µPs(n+1) (s) = 0 (16) This is a system of n + 2 linear equations. If this system is represented in matrix multiplication form as APs(s) = b where A is an n+2 by n+2 matrix with a nonzero determinant, and the vector Ps(s) = (Ps0 (s), Ps1 (s), . . . , Ps(n+1) (s))T is the column vector of the variables, then applying Cramer’s rule will give: Psk (s) = det(Ak) det(A) 0 ≤ k ≤ n + 1 (17) where Ak is the matrix formed by replacing the kth column of A by the column vector b. If Eq. 17 is applied to Laplace transformed form of Eq. 15, we obtain: A(s) = det A0 + · · · + det An det A0 + · · · + det A(n+1) (18) Hence the calculation of det(A) need not be done. The matrix Ak(ij), whose rows and columns are 0 ≤ i ≤ n + 1 and 0 ≤ j ≤ n + 1 respectively, has the following general form (obtained from Eqs. 16): Ak(ij) =    −1, if i = 0, j = k. −(2n + 1)λ, if i = 0, j(= k) = 0. −µ, if i = n + 1, j(= k) = n + 1. µ, if i = j − 1, 1 ≤ j(= k). −((2n + 1 − j)λ + µ), if i = j, 1 ≤ j(= k) ≤ n. (2n + 1 − j)λ, if i = j + 1, j(= k) ≤ n. 0, otherwise. (19) Once this matrix is formed, the SSA is computed as: Asteady state = n k=0 det Ak(ij) n+1 k=0 det Ak(ij) 0 ≤ i and j ≤ n + 1 (20) For example, for a TMR system, we will have: det A0(ij) = −1 µ 0 0 −2λ − µ µ 0 2λ −µ = −µ2 det A1(ij) = −3λ −1 0 3λ 0 µ 0 0 −µ = −3λµ det A2(ij) = −3λ µ −1 3λ −2λ − µ 0 0 2λ 0 = −6λ2 (21) Therefore, SSA of TMR system will be as follows: Asteady state = µ2 + 3λµ µ2 + 3λµ + 6λ2 (22) If a system is not repairable, its SSA will be zero, as Eq. 22 proves this for a TMR system when µ = 0. IV. CONCLUSION A simple calculation of reliability or availability is done by means of Markov models and Laplace transforms. However, if there are a large number of components in the system, or system is repairable, inverse transform requires the nontrivial partial-fraction-expansion calculation. Although reliability and availability are obtained only if the inverse transform is done, it is still possible to extract the partial information MTTF and SSA from the transformed equations. MTTF and SSA can be used as alternatives to reliability and availability when several systems should be compared. The general solutions for MTTF and SSA of NMR systems were presented in this article. Although the computation of MTTF for an NMR system is simpler than reliability, comparing MTTFs of two systems should be done with some considerations. The relation between MTTF and reliability is as follows MTTF = ∞ 0 R(t)dt (23) For an NMR system containing 2n + 1 circuits, as n is increased, the area under the curve of reliability function is increased too but only within the region of primary interest, 0 < λt < 0.69 [1]. Outside this region, the area under reliability curve is decreased. Since in Eq. 23 the integral is taken from 0 to ∞, the MTTF comparisons are valid when a system is significantly superior than another. MTTF is also used when the reliability functions compared have the same shape. The MTTF will decide which system is superior in such cases. REFERENCES [1] J. K. Knox-Seith, “A redundancy technique for improving the reliability of digital systems,” Stanford Electronics Laboratories, Tech. Rep. [2] W. H. Pierce, “Improving reliability of digital systems by redundancy and adaptation,” diploma thesis, Stanford University. [3] S. Iyanaga and Y. Kuwada, Encyclopedic Dictionary of Mathematics. Cambridge, MA: Mit Press, 1980.