Redundant Systems With A Common Threat

 

Suppose a system is equipped with two redundant components, and the failure of both components (at the same time) represents total failure of the system.  Also, each component contains provisions to protect it from an external threat to which both components are subjected, at some unknown rate, in unison.  Furthermore, the failure of the protection is not inherently detectable, except when specifically checked at some periodic inspection interval.  Between inspections, a failure of the protection provisions becomes apparent only if/when the component is subsequently subjected to the external threat and fails.  Given the failure rates of the protection provisions of each component, and the length of the inspection interval, can we define a realistic upper bound on the probability of simultaneous failure of both components?

 

Letting $\lambda$ denote the failure rate of the protection provisions of each component, and $T$ the periodic inspection interval, a very conservative approach would be to assume that failure of the protection of a single component is never detected (and repaired) other than at periodic inspections.  Admittedly we will also detect such failures if we encounter the external threat, but since we do not know the rate at which the system encounters this threat, it may seem that we're forced to assume the worst case, i.e., that we never encounter the threat, and therefore the threat does not provide a means of detecting failures of the protection provisions.  On this basis, the exposure time for each of the protection failures is the full inspection interval $T$, so they each have probability $\lambda T$ of being failed at the end of the interval, and the probability of both being failed is $(\lambda T)^2$.  (Of course, the actual probabilities for exponentially distributed failures are $1 - e^{-\lambda T}$, but since $\lambda T$ is small for the cases of interest, we can use the conventional approximation $\lambda T$.)

 

Now, given that the protection features of both components are failed, the overall system will not fail until/unless it encounters the external threat, but, again, since we do not know the rate of encountering this threat, we need to make a conservative assumption, which, in this case, is that we will certainly encounter the threat.  Moreover, we assume that the threat will definitely be encountered at the end of the inspection period, when the probability of both protections being failed is a maximum.  Combining all these worst-case assumptions, we conclude that the probability of a total system failure during the last incremental unit of operation in the $T$-hour inspection period is simply $(\lambda T)^2$.
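As a quick numerical illustration of this worst-case figure, the sketch below compares the linear approximation with the exact exponential probability.  The values of `lam` and `T` are the ones quoted for the example later in this note; the function names are introduced here only for illustration.

```python
import math

def single_failure_probability(lam, T):
    """Exact probability that one protection has failed by time T (exponential model)."""
    return 1.0 - math.exp(-lam * T)

def conservative_bound(lam, T):
    """Worst-case dual-failure figure (lam*T)**2 based on the linear approximation."""
    return (lam * T) ** 2

lam = 5.0e-7   # protection failure rate per hour (from the example later in the text)
T = 5000.0     # inspection interval in hours (same example)

exact_single = single_failure_probability(lam, T)
approx_single = lam * T

print("single protection failed: exact", exact_single, "approx", approx_single)
print("both protections failed:  exact", exact_single ** 2,
      "approx", conservative_bound(lam, T))
```

For small values of the product of rate and interval, the two columns differ only slightly, which is why the linear form is used throughout.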

 

This is certainly a valid upper bound, but it seems unnecessarily conservative, because it assumes the external threat never strikes the system except at the very end of the inspection interval, at which point we assume it strikes the system with certainty.  This combination of assumptions is extremely unrealistic, because the occurrences of the external threat are presumed to be exponentially distributed with just a single rate, independent of the periodic inspections.  Hence it would be valid to assume certainty for an encounter at the end of the interval only if we assume the rate of encounters is very high (infinite, recalling that the probability of an exponentially distributed event is really of the form $1 - e^{-\lambda T}$), but this grossly conflicts with the assumption that we never encounter the threat at any other time during the interval.

 

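A more realistic representation of the system is given by the Markov model shown below.

[Figure: Markov model with four states.  State 0 (Full-Up) goes to State 1 (protection failed on one component) at rate $2\lambda$; State 1 returns to State 0 at rate $\mu$ and goes to State 2 (protection failed on both components) at rate $\lambda$; State 2 goes to State 3 (total system failure) at rate $\mu$.]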

In this model the rate $\mu$ represents the (unknown) rate of encountering the external threat.  The premise is that if/when the external threat is encountered while only one of the components has failed protection (i.e., State 1), the component will fail and thereby be detected, leading to its repair, returning the system to the Full-Up condition (State 0).  This same rate also represents the transition rate from State 2 (the condition when both components have failed protection) to State 3 (total system failure).  Our strategy will be to determine the explicit time-dependent solution of this model as a function of $\lambda$ and $\mu$, and then, beginning with the system in State 0, evaluate the rate of entry into State 3 at the time $T$, i.e., the end of the periodic inspection interval, which is when the rate reaches a maximum.  This equals $\mu P_2(T)/(1-P_3(T))$, although the denominator is so close to 1 that we can accurately represent the total failure rate at time $T$ simply by $\mu P_2(T)$.

 

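The differential equations corresponding to this model are

$$
\begin{aligned}
\frac{dP_0}{dt} &= -2\lambda P_0 + \mu P_1 \\
\frac{dP_1}{dt} &= 2\lambda P_0 - (\lambda+\mu) P_1 \\
\frac{dP_2}{dt} &= \lambda P_1 - \mu P_2 \\
\frac{dP_3}{dt} &= \mu P_2
\end{aligned}
$$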
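Before working out the closed-form solution, the model can be checked by direct numerical integration.  The following is a minimal sketch, assuming the transition rates described above (State 0 to State 1 at twice the protection failure rate, State 1 back to State 0 and State 2 to State 3 at the threat rate, State 1 to State 2 at the protection failure rate).  The step count and the trial value of `mu` are arbitrary choices made for illustration.

```python
def derivatives(p, lam, mu):
    """Right-hand sides of the four state equations for (P0, P1, P2, P3)."""
    p0, p1, p2, p3 = p
    return (
        -2.0 * lam * p0 + mu * p1,
        2.0 * lam * p0 - (lam + mu) * p1,
        lam * p1 - mu * p2,
        mu * p2,
    )

def integrate(lam, mu, T, steps=5000):
    """Classical fourth-order Runge-Kutta from t = 0 (system in State 0) to t = T."""
    h = T / steps
    p = (1.0, 0.0, 0.0, 0.0)
    for _ in range(steps):
        k1 = derivatives(p, lam, mu)
        k2 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k1)), lam, mu)
        k3 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k2)), lam, mu)
        k4 = derivatives(tuple(x + h * k for x, k in zip(p, k3)), lam, mu)
        p = tuple(
            x + (h / 6.0) * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)
        )
    return p

lam, T = 5.0e-7, 5000.0   # example values used later in the text
mu = 3.2e-4               # one trial value of the unknown threat rate (per hour)

p0, p1, p2, p3 = integrate(lam, mu, T)
rate_at_T = mu * p2 / (1.0 - p3)   # rate of entry into State 3 at time T
print(p0, p1, p2, p3)
print("failure rate at T:", rate_at_T)
```

Because the four derivatives sum to zero, the total probability should remain equal to 1 throughout the integration, which gives a simple sanity check on the result.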

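The characteristic polynomial of the coefficient matrix of the first three equations (the fourth equation merely integrates $\mu P_2$) is

$$(x+\mu)\left[x^2 + (3\lambda+\mu)x + 2\lambda^2\right]$$

so the eigenvalues of the system are

$$x_1 = -\mu, \qquad x_2 = \frac{-(3\lambda+\mu) + \sigma}{2}, \qquad x_3 = \frac{-(3\lambda+\mu) - \sigma}{2}$$

where

$$\sigma = \sqrt{\lambda^2 + 6\lambda\mu + \mu^2}$$

Therefore, each $P_j(t)$ for j = 0, 1, or 2 is of the form

$$P_j(t) = A_j e^{x_1 t} + B_j e^{x_2 t} + C_j e^{x_3 t}$$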
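The eigenvalues can be checked numerically.  The sketch below assumes the 3x3 coefficient matrix for States 0, 1 and 2 with rows (-2λ, μ, 0), (2λ, -(λ+μ), 0) and (0, λ, -μ), and confirms that the determinant of the shifted matrix vanishes at each root.  The function names and the trial value of `mu` are illustrative only.

```python
import math

def eigenvalues(lam, mu):
    """The root -mu together with the two roots of x^2 + (3*lam + mu)*x + 2*lam^2."""
    sigma = math.sqrt(lam * lam + 6.0 * lam * mu + mu * mu)
    return (
        -mu,
        (-(3.0 * lam + mu) + sigma) / 2.0,
        (-(3.0 * lam + mu) - sigma) / 2.0,
    )

def det_shifted(x, lam, mu):
    """det(M - x*I) for the 3x3 coefficient matrix, expanded along its last column."""
    minor = (-2.0 * lam - x) * (-(lam + mu) - x) - mu * (2.0 * lam)
    return (-mu - x) * minor

lam, mu = 5.0e-7, 3.2e-4   # example failure rate and a trial threat rate, per hour
x1, x2, x3 = eigenvalues(lam, mu)

for x in (x1, x2, x3):
    print(x, det_shifted(x, lam, mu))

# The two roots of the quadratic factor satisfy Vieta's relations.
print(x2 + x3, -(3.0 * lam + mu))
print(x2 * x3, 2.0 * lam * lam)
```

All three eigenvalues are real and negative for positive rates, so every term in the solution decays with time.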

where $A_j$, $B_j$, and $C_j$ are constants determined by the initial conditions.  At the time t = 0 we have

$$P_0(0) = 1, \qquad P_1(0) = 0, \qquad P_2(0) = 0$$

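Inserting these into the system equations gives the first derivatives

$$\left.\frac{dP_0}{dt}\right|_{t=0} = -2\lambda, \qquad \left.\frac{dP_1}{dt}\right|_{t=0} = 2\lambda, \qquad \left.\frac{dP_2}{dt}\right|_{t=0} = 0$$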

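Differentiating the system equations and inserting the first derivatives gives the second derivatives

$$\left.\frac{d^2P_0}{dt^2}\right|_{t=0} = 2\lambda(2\lambda+\mu), \qquad \left.\frac{d^2P_1}{dt^2}\right|_{t=0} = -2\lambda(3\lambda+\mu), \qquad \left.\frac{d^2P_2}{dt^2}\right|_{t=0} = 2\lambda^2$$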

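Differentiating the expression for $P_j(t)$ twice and inserting the initial conditions, we have three equations in the unknown coefficients $A_j$, $B_j$, and $C_j$ (with $x_1$, $x_2$, $x_3$ denoting the three eigenvalues and dots denoting time derivatives)

$$
\begin{aligned}
A_j + B_j + C_j &= P_j(0) \\
x_1 A_j + x_2 B_j + x_3 C_j &= \dot{P}_j(0) \\
x_1^2 A_j + x_2^2 B_j + x_3^2 C_j &= \ddot{P}_j(0)
\end{aligned}
$$

Solving this system for the coefficients, we get

$$
\begin{aligned}
A_j &= \frac{\ddot{P}_j(0) - (x_2+x_3)\dot{P}_j(0) + x_2 x_3 P_j(0)}{(x_1-x_2)(x_1-x_3)} \\
B_j &= \frac{\ddot{P}_j(0) - (x_1+x_3)\dot{P}_j(0) + x_1 x_3 P_j(0)}{(x_2-x_1)(x_2-x_3)} \\
C_j &= \frac{\ddot{P}_j(0) - (x_1+x_2)\dot{P}_j(0) + x_1 x_2 P_j(0)}{(x_3-x_1)(x_3-x_2)}
\end{aligned}
$$

In particular, since $P_2$ and its first derivative vanish at t = 0 while its second derivative equals $2\lambda^2$, the coefficients for j = 2 are

$$A_2 = \frac{2\lambda^2}{(x_1-x_2)(x_1-x_3)}, \qquad B_2 = \frac{2\lambda^2}{(x_2-x_1)(x_2-x_3)}, \qquad C_2 = \frac{2\lambda^2}{(x_3-x_1)(x_3-x_2)}$$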

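This gives the explicit expression for $P_2(t)$, which we can multiply by $\mu$ to give the rate of total system failure at any time t

$$\mu P_2(t) = 2\lambda^2\mu\left[\frac{e^{x_1 t}}{(x_1-x_2)(x_1-x_3)} + \frac{e^{x_2 t}}{(x_2-x_1)(x_2-x_3)} + \frac{e^{x_3 t}}{(x_3-x_1)(x_3-x_2)}\right]$$

where $x_1 = -\mu$, $x_2$, and $x_3$ are the eigenvalues of the system.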

Evaluating this at the time t = $T$ gives the rate of total system failure at the end of a periodic inspection interval.  We know the values of $\lambda$ and $T$, so the only unknown is the rate $\mu$ of encountering the external threat.  Thus we can plot $\mu P_2(T)$ versus $\mu$ as shown below for a typical case with $\lambda = 5.0 \times 10^{-7}$/hour and $T$ = 5000 hours.

 

Naturally if $\mu = 0$ the rate of total system failure would also be zero, because the system would never make the transition from State 2 to State 3.  On the other hand, if $\mu$ is infinite, the rate of total system failure is again zero, because the system would never get past State 1.  Therefore, we expect that there is some intermediate value of $\mu$ that maximizes the rate of total system failure, and this is confirmed by the plot above.  The maximum for this example occurs at $\mu = 3.2 \times 10^{-4}$/hr, corresponding to a total system failure rate of $7.4 \times 10^{-10}$/hr.  In contrast, the more simplistic (and unrealistic) approach described previously predicts a rate of $(\lambda T)^2 = 6.25 \times 10^{-6}$/hr for this same case.
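The location and height of the maximum can be explored with a short scan over the threat rate.  This is a sketch under the assumption that the three eigenvalues are -μ and the two roots of x² + (3λ+μ)x + 2λ², and that they are distinct (the formula breaks down at the isolated value μ = 2λ/3, which lies outside the grid used here).  The grid limits and resolution are arbitrary choices.

```python
import math

def failure_rate(lam, mu, t):
    """mu * P2(t) from the three-exponential closed form (distinct eigenvalues assumed)."""
    sigma = math.sqrt(lam * lam + 6.0 * lam * mu + mu * mu)
    x = (-mu, (-(3.0 * lam + mu) + sigma) / 2.0, (-(3.0 * lam + mu) - sigma) / 2.0)
    total = 0.0
    for i in range(3):
        denom = 1.0
        for k in range(3):
            if k != i:
                denom *= x[i] - x[k]
        total += math.exp(x[i] * t) / denom
    return 2.0 * lam * lam * mu * total

lam, T = 5.0e-7, 5000.0   # example values from the text

# Logarithmic grid of threat rates from 1e-6 to 1e-2 per hour (arbitrary limits).
grid = [10.0 ** (-6.0 + 4.0 * i / 2000.0) for i in range(2001)]
best_mu = max(grid, key=lambda mu: failure_rate(lam, mu, T))
best_rate = failure_rate(lam, best_mu, T)

print("maximizing threat rate on the grid:", best_mu)
print("peak total failure rate:", best_rate)
print("simplistic worst-case figure:", (lam * T) ** 2)
```

The curve is quite flat near its peak, so the maximizing value of the threat rate is much less sharply determined than the peak height, and a value read from a plot may differ noticeably from the grid maximizer.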

 

To approximate the results of the Markov model in the form of a fault tree, the top event is total system failure during the last incremental unit of operational time, denoted by $\Delta t$.  This event is generated by {protection failed on both components} AND {encounter with the external threat}.  The probability of encountering the external threat during this increment of time is $\mu \Delta t$.  The probability of both components being without protection is the square of the probability of the loss of protection for a single component.  The latter is simply the failure rate $\lambda$ for the protection multiplied by the appropriate exposure time.  Now, the mean time between occurrences is the reciprocal of the rate, and rates are additive, so the appropriate exposure time is the reciprocal of the sum of the inspection rate $1/T$ and the encounter rate $\mu$.  However, for a fixed interval the mean exposure time is actually half the interval, so we divide $\mu$ by 2 to give the overall exposure time $1/[(\mu/2) + (1/T)]$.  Dividing the top event probability by $\Delta t$ to normalize the probability to a per-hour basis for the last incremental unit of operational time, we get

$$\mu\left[\frac{\lambda}{(\mu/2) + (1/T)}\right]^2$$
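The fault tree estimate is simple enough to evaluate directly.  The sketch below implements the exposure time and the normalized top event rate exactly as described in the preceding paragraph, using the example values of the failure rate and inspection interval; the function name and the list of trial threat rates are illustrative.

```python
def fault_tree_rate(lam, mu, T):
    """Per-hour top event rate: mu times the squared single-component probability."""
    exposure = 1.0 / (mu / 2.0 + 1.0 / T)   # overall exposure time in hours
    return mu * (lam * exposure) ** 2

lam, T = 5.0e-7, 5000.0   # example values from the text

for mu in (1.0e-5, 1.0e-4, 4.0e-4, 1.0e-3, 1.0e-2):
    print(mu, fault_tree_rate(lam, mu, T))
```

The rate vanishes for a zero threat rate and falls off again for large threat rates, reproducing the qualitative shape of the Markov model result.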

A plot of this function is shown in the figure below, superimposed on the exact Markov model result.

 

Thus both methods give comparable results.  One advantage of the simple fault tree representation is that it can easily be differentiated to find the maximum point explicitly.  Not surprisingly, we find that the worst-case value of $\mu$ (i.e., the one that gives the maximum probability of total system failure) is $2/T$.  Substituting this back into the equation, the total system failure rate (per hour for an incremental unit of operation at the end of the inspection interval) is simply $\lambda^2 T/2$.
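The differentiation can be written out as follows, where $F(\mu)$ is introduced here only as a label for the fault tree expression $\mu[\lambda/((\mu/2)+(1/T))]^2$:

$$
F(\mu) = \frac{\mu\lambda^2}{\left(\frac{\mu}{2}+\frac{1}{T}\right)^2},
\qquad
\frac{dF}{d\mu} = \lambda^2\,\frac{\left(\frac{\mu}{2}+\frac{1}{T}\right) - \mu}{\left(\frac{\mu}{2}+\frac{1}{T}\right)^3}
= \lambda^2\,\frac{\frac{1}{T}-\frac{\mu}{2}}{\left(\frac{\mu}{2}+\frac{1}{T}\right)^3}
$$

The derivative vanishes when $\mu/2 = 1/T$, and at that point

$$
F\!\left(\frac{2}{T}\right) = \frac{(2/T)\,\lambda^2}{(2/T)^2} = \frac{\lambda^2 T}{2}
$$

For the example values $\lambda = 5.0 \times 10^{-7}$/hour and $T$ = 5000 hours this gives $6.25 \times 10^{-10}$/hr, which can be compared with the Markov model peak of $7.4 \times 10^{-10}$/hr quoted above.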

 

For another example of this common threat analysis method, consider a system with two redundant units, each of which has an overall failure rate of $\lambda$.  For each unit, a certain set of failures with rate $r$ represents latent failures of the protection against a common external threat.  These failures are undetectable unless the external threat is encountered, in which case the entire unit becomes inoperative and is repaired.  The remaining failure modes of each unit are detectable, and result in the complete repair of the unit at the rate $R$.  (Note that if a unit is repaired for a detectable failure, any latent “undetected” failure present in that unit is also repaired, so it is returned to service with no failures.)  The rate $s$ of encountering the common external threat is unspecified.  Our objective is to determine the rate of dual failures induced by the external threat, i.e., the rate at which the threat will be encountered while the protection features of both units are failed.

The state of the system can be represented by a symbol of the form $[d_1u_1, d_2u_2]$ where $d_i$ and $u_i$ are either 0 or 1, signifying whether or not there is a detected or undetected failure of the ith unit.  For example, if there is an undetected failure in the first unit, the state of the system is denoted as [01,00].  Of the 16 possible states, 12 involve one or more detectable failures, but these states are repaired by definition, so the system spends essentially no time in any of those states; they are really just transitions.  (The time between a detectable failure and when it is repaired is irrelevant, because we are using the repair rate, not the failure rate, and the detectable faults have no direct effect on the undetectable faults.)  Of the four remaining states, the states [01,00] and [00,01] are symmetrical, so they can be combined into a single state.  Thus we can represent the system exactly by the three-state model shown below.

[Figure: three-state Markov model.  State 0 (no latent failures) goes to State 1 (latent failure in one unit) at rate $2r$; State 1 returns to State 0 at rate $R + s$ and goes to State 2 (latent failures in both units) at rate $r$; State 2 goes to State 1 at rate $2R$ and to State 0 at rate $s$.]
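The state count in the preceding paragraph can be confirmed by brute-force enumeration.  In this sketch each state is a tuple (d1, u1, d2, u2) of 0/1 flags, mirroring the bracket notation used above; the variable names are illustrative.

```python
from itertools import product

# Every combination of detected (d) and undetected (u) failure flags for two units.
states = list(product((0, 1), repeat=4))            # tuples (d1, u1, d2, u2)

with_detected = [s for s in states if s[0] == 1 or s[2] == 1]
latent_only = [s for s in states if s[0] == 0 and s[2] == 0]

def label(state):
    """Format a state tuple in the bracket notation of the text."""
    return "[%d%d,%d%d]" % state

# Group the remaining states by the number of latent failures present.
groups = {}
for s in latent_only:
    groups.setdefault(s[1] + s[3], []).append(label(s))

print(len(states), len(with_detected), len(latent_only))
for count in sorted(groups):
    print(count, groups[count])
```

Grouping by the number of latent failures merges the two symmetrical single-failure states, leaving the three states used in the model.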

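Note that we assume a common mode strike with the protection failed on just one unit results in complete restoration of the unit.  Using subscripts 0, 1, and 2 to denote these three states, the system equations are

$$
\begin{aligned}
\frac{dP_0}{dt} &= -2r P_0 + (R+s) P_1 + s P_2 \\
\frac{dP_1}{dt} &= 2r P_0 - (R+r+s) P_1 + 2R P_2 \\
\frac{dP_2}{dt} &= r P_1 - (2R+s) P_2
\end{aligned}
$$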

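along with the conservation equation $P_0 + P_1 + P_2 = 1$.  The steady-state solution (found by setting the derivatives in the first two state equations to zero, and combining them with the conservation equation to give three linear equations in three unknowns) is

$$
P_0 = \frac{(R+s)(2R+s) + rs}{(R+r+s)\left(2(R+r)+s\right)}, \qquad
P_1 = \frac{2r(2R+s)}{(R+r+s)\left(2(R+r)+s\right)}, \qquad
P_2 = \frac{2r^2}{(R+r+s)\left(2(R+r)+s\right)}
$$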
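The steady-state solution can be verified by substituting it back into the state equations.  The sketch below assumes the three-state model in which State 0 goes to State 1 at rate 2r, State 1 goes to State 0 at rate R + s and to State 2 at rate r, and State 2 goes to State 1 at rate 2R and to State 0 at rate s.  The numerical rates are hypothetical values chosen only to exercise the formulas.

```python
def steady_state(r, R, s):
    """Closed-form steady-state probabilities (P0, P1, P2) of the three-state model."""
    denom = (R + r + s) * (2.0 * (R + r) + s)
    p0 = ((R + s) * (2.0 * R + s) + r * s) / denom
    p1 = 2.0 * r * (2.0 * R + s) / denom
    p2 = 2.0 * r * r / denom
    return p0, p1, p2

def residuals(p, r, R, s):
    """Time derivatives of P0, P1, P2 at the given probabilities; all should vanish."""
    p0, p1, p2 = p
    return (
        -2.0 * r * p0 + (R + s) * p1 + s * p2,
        2.0 * r * p0 - (R + r + s) * p1 + 2.0 * R * p2,
        r * p1 - (2.0 * R + s) * p2,
    )

# Hypothetical rates per hour (not taken from the text).
r, R, s = 1.0e-5, 9.0e-5, 2.0e-4

p = steady_state(r, R, s)
print("probabilities:", p, "sum:", sum(p))
print("residuals:", residuals(p, r, R, s))
print("dual-failure rate:", s * p[2])
```

The three probabilities should sum to 1 and each residual should be zero to within rounding error.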

Therefore the rate of dual failures due to the common threat is

$$sP_2 = \frac{2r^2 s}{(R+r+s)\left(2(R+r)+s\right)}$$

Differentiating this with respect to $s$ and setting the result to zero, we find that the maximum rate is achieved by setting

$$s = \sqrt{2}\,(R+r)$$

so we have

$$(sP_2)_{\max} = \frac{2\left(3-2\sqrt{2}\right) r^2}{R+r}$$
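The stationary point can be confirmed numerically.  This sketch assumes the steady-state rate 2r²s/[(R+r+s)(2(R+r)+s)], which rises from zero and then falls as the threat rate grows, so a ternary search applies.  The search interval and the rate values are hypothetical choices.

```python
import math

def dual_failure_rate(r, R, s):
    """Steady-state rate s*P2 of threat encounters with both protections failed."""
    return 2.0 * r * r * s / ((R + r + s) * (2.0 * (R + r) + s))

def argmax_ternary(f, lo, hi, iterations=200):
    """Ternary search for the maximizer of a unimodal function on [lo, hi]."""
    for _ in range(iterations):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)

r, R = 1.0e-5, 9.0e-5      # hypothetical rates per hour
lam = R + r                # overall failure rate of each unit

s_numeric = argmax_ternary(lambda s: dual_failure_rate(r, R, s), 0.0, 100.0 * lam)
s_formula = math.sqrt(2.0) * lam
peak_formula = 2.0 * (3.0 - 2.0 * math.sqrt(2.0)) * r * r / lam

print("maximizing s:", s_numeric, "formula:", s_formula)
print("peak rate:", dual_failure_rate(r, R, s_numeric), "formula:", peak_formula)
```

The numerical factor 2(3 - 2√2) is about 0.343, so the worst-case steady-state rate is roughly a third of r²/(R+r).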

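We can also determine the time-dependent probability and rate by solving the original system of equations.  The eigenvalues are the roots of the determinant

$$
\begin{vmatrix}
-2r - x & R+s & s \\
2r & -(R+r+s) - x & 2R \\
0 & r & -(2R+s) - x
\end{vmatrix} = 0
$$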

so the eigenvalues are 0, $-(R+r+s)$, and $-[2(R+r)+s]$.  It's interesting that only the sum $R+r$ of the rates appears in the eigenvalues.  Letting $\lambda$ denote this sum, the probability of state 2 (i.e., the state with protection features failed in both units) is of the form

$$P_2(t) = A + B e^{-(\lambda+s)t} + C e^{-(2\lambda+s)t}$$

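where A, B, C are constants.  With $P_0(0) = 1$ we have the initial conditions

$$P_2(0) = 0, \qquad \left.\frac{dP_2}{dt}\right|_{t=0} = 0, \qquad \left.\frac{d^2P_2}{dt^2}\right|_{t=0} = 2r^2$$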

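from which we can determine the values of the constants A, B, and C.  Multiplying the result by $s$ gives the rate of dual failures as a function of time

$$sP_2(t) = 2r^2 s\left[\frac{1}{(\lambda+s)(2\lambda+s)} - \frac{e^{-(\lambda+s)t}}{\lambda(\lambda+s)} + \frac{e^{-(2\lambda+s)t}}{\lambda(2\lambda+s)}\right]$$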

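In the special case $s = 0$ (meaning the common threat never occurs) the rate itself is zero, and the probability of state 2 reduces to

$$P_2(t) = \left(\frac{r}{\lambda}\right)^2\left(1 - e^{-\lambda t}\right)^2$$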

For non-zero values of s we can plot the rate of dual failures as a function of s for various values of the time t = T, as shown below.

 

 

This shows how the peak rate of dual failures at the end of any length of time T approaches the peak steady-state rate.
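The approach to the steady-state peak can be reproduced numerically.  The sketch below assumes the time-dependent rate 2r²s[1/((λ+s)(2λ+s)) - e^{-(λ+s)t}/(λ(λ+s)) + e^{-(2λ+s)t}/(λ(2λ+s))] with λ = R + r, and scans the threat rate on a logarithmic grid for several values of the elapsed time.  The rates, the times, and the grid limits are all hypothetical choices.

```python
import math

def dual_failure_rate(r, lam, s, t):
    """s*P2(t) for the three-state model, starting with no failures at t = 0."""
    steady = 1.0 / ((lam + s) * (2.0 * lam + s))
    decay1 = math.exp(-(lam + s) * t) / (lam * (lam + s))
    decay2 = math.exp(-(2.0 * lam + s) * t) / (lam * (2.0 * lam + s))
    return 2.0 * r * r * s * (steady - decay1 + decay2)

def peak_over_s(r, lam, t, points=2001):
    """Largest rate found on a logarithmic grid of s from lam/1000 to 1000*lam."""
    grid = (lam * 10.0 ** (-3.0 + 6.0 * i / (points - 1)) for i in range(points))
    return max(dual_failure_rate(r, lam, s, t) for s in grid)

r, R = 1.0e-5, 9.0e-5      # hypothetical rates per hour
lam = R + r                # overall failure rate of each unit
steady_peak = 2.0 * (3.0 - 2.0 * math.sqrt(2.0)) * r * r / lam

# Elapsed times expressed as multiples of the mean time between unit failures.
times = [k / lam for k in (0.5, 1.0, 2.0, 5.0, 20.0)]
peaks = [peak_over_s(r, lam, t) for t in times]

for t, peak in zip(times, peaks):
    print(t, peak, peak / steady_peak)
```

The printed ratios increase with the elapsed time and should approach 1, since the two exponential terms die away and leave only the steady-state term.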

 

For a related discussion, see the note on Undetected Failures of Threat Protection.

 
