Why Markov Models?

 

Consider a system consisting of six basic components, and let A,B,…,F denote the independent failure states of these six components respectively, and suppose the overall system is specified to be failed just if the logic diagram shown below, called a fault tree, is satisfied.

 

 

In this diagram we have noted the Boolean expressions for each event, using the common notation in which the “plus” symbol signifies logical OR, and the “products” signify logical AND. Making use of idempotence and the absorptive property, the top event can be expressed as

 

 

This is the unique representation of the top event as the union of minimal cutsets. Letting P(x) denote the probability of x at a given time, it can be shown by inclusion-exclusion that the probability of the top event at that time is

 

 

If the probabilities of A, B, …, F are each orders of magnitude smaller than 1 (as is often the case for highly reliable systems), the probabilities of the four mincuts will be much larger than the higher-order terms, so in such cases the top probability can be approximated by just the sum of the probabilities of the four mincuts. More generally, a rigorous upper bound on the top probability for monotonic fault trees (i.e., trees consisting only of AND and OR gates) is given by recursively applying the inequality P(x ∪ y) ≤ P(x) + P(y) − P(x)P(y), noting that P(x ∩ y) ≥ P(x)P(y) for monotonic events.
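As a rough numerical check of these statements, the following Python sketch compares the exact inclusion-exclusion result with the sum of the mincut probabilities and with the bound obtained by applying P(x ∪ y) ≤ P(x) + P(y) − P(x)P(y) repeatedly to a list of cutsets. The four cutsets used here are hypothetical stand-ins (only ABE is named in the text) and are not claimed to be those of the fault tree above. The failure probabilities and all function names are likewise assumptions made for illustration, and the basic events are taken to be independent.

```python
from itertools import combinations
from math import prod

# Hypothetical minimal cutsets, chosen only for illustration.
CUTSETS = [frozenset("ABE"), frozenset("CD"), frozenset("AF"), frozenset("BCF")]

# Assumed probability of each basic event at the time of interest.
P = {name: 1e-3 for name in "ABCDEF"}


def cutset_probability(cutset, p):
    """Probability that every (independent) basic event in the cutset has occurred."""
    return prod(p[event] for event in cutset)


def exact_top_probability(cutsets, p):
    """Probability of the union of the cutsets, by inclusion-exclusion."""
    total = 0.0
    for k in range(1, len(cutsets) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for group in combinations(cutsets, k):
            # The intersection of several cutset events is the event that
            # all basic events in the union of those cutsets have occurred.
            joint = frozenset().union(*group)
            total += sign * cutset_probability(joint, p)
    return total


def rare_event_sum(cutsets, p):
    """First-order approximation: the sum of the cutset probabilities."""
    return sum(cutset_probability(c, p) for c in cutsets)


def product_upper_bound(cutsets, p):
    """1 - prod(1 - P(C_i)), from repeated use of P(x u y) <= P(x) + P(y) - P(x)P(y)."""
    survive = 1.0
    for c in cutsets:
        survive *= 1.0 - cutset_probability(c, p)
    return 1.0 - survive


exact = exact_top_probability(CUTSETS, P)
approx = rare_event_sum(CUTSETS, P)
bound = product_upper_bound(CUTSETS, P)
```

With these assumed values the three results agree to several significant digits, and they are ordered as exact ≤ bound ≤ sum, which is the sense in which the simple sum is a conservative approximation for rare events.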

 

To complete the calculation of the probability of the top event, it remains to determine the probabilities of the individual cutsets, such as P(ABE). In many cases each basic component such as A has some constant failure rate denoted by λ and some exposure time denoted by τ, so beginning from a healthy condition at time t = 0 the probability of this component being failed by the end of the exposure time is 1 − e^(−λτ), which for rare events is approximately just λτ. For independent components, the joint probability of ABE at the end of the least common multiple of the individual exposure times is simply

$$P(ABE) = \left(1 - e^{-\lambda_A \tau_A}\right)\left(1 - e^{-\lambda_B \tau_B}\right)\left(1 - e^{-\lambda_E \tau_E}\right) \approx (\lambda_A \tau_A)(\lambda_B \tau_B)(\lambda_E \tau_E)$$

This represents the worst case (i.e., end of the last flight) probability, but for many purposes we wish to know the average probability for the entire duration, or the average of the probabilities at the end of each mission/flight. Those values can be computed by integrating the time-dependent sawtooth function for the least common multiple of the exposure times, as discussed in Section 8.2.
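To make the sawtooth averaging concrete, here is a minimal Python sketch that integrates the cutset probability numerically over one least common multiple of the exposure times. It assumes that each component fails at a constant rate, is restored to healthy at every multiple of its own exposure time, and is independent of the others. The function names, the midpoint-rule integration, and the numerical values are illustrative assumptions, not the formulas of Section 8.2.

```python
from functools import reduce
from math import exp, gcd


def lcm_of(periods):
    """Least common multiple of a list of positive integer periods (hours)."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def sawtooth_cutset_probability(t, rates, periods):
    """Probability that all components are failed at time t.

    Each component's probability restarts from zero at every multiple of
    its own exposure time, which produces the sawtooth shape.
    """
    p = 1.0
    for lam, tau in zip(rates, periods):
        p *= 1.0 - exp(-lam * (t % tau))
    return p


def worst_case_probability(rates, periods):
    """Value just before the repairs at the end of the least common multiple."""
    p = 1.0
    for lam, tau in zip(rates, periods):
        p *= 1.0 - exp(-lam * tau)
    return p


def average_probability(rates, periods, steps_per_hour=10):
    """Time average over one least common multiple, by the midpoint rule."""
    horizon = lcm_of(periods)
    n = horizon * steps_per_hour
    h = 1.0 / steps_per_hour
    total = 0.0
    for k in range(n):
        # Midpoints never coincide with a repair time, so each sample lies
        # strictly inside one tooth of the sawtooth.
        total += sawtooth_cutset_probability((k + 0.5) * h, rates, periods)
    return total / n


# Assumed failure rates (per hour) and exposure times (hours) for A, B, E.
rates = [1e-5, 1e-5, 1e-5]
periods = [100, 200, 400]
worst = worst_case_probability(rates, periods)
average = average_probability(rates, periods)
```

When all three exposure times are equal and the events are rare, the probability grows roughly as the cube of the elapsed time, so the average comes out close to one quarter of the worst-case value.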

 

However, in actual systems the fault detection and repair strategies may be more complicated, in which case the simple method of computing the probability of a cutset described above is not applicable. This is because, in addition to the exposure times for the individual components, different exposure times may be specified for certain combinations of components, and the repair transitions may be incomplete. For example, an alert may be generated if both A and E are failed, prompting repair of one or both of those components at a shorter period, denoted by τAE. This type of transition is fairly common, and yet it can’t be represented in a fault tree that defines exposure times only for individual basic events. The probabilities for systems involving fully general repair transitions can be computed using a suitable Markov model, as described in Section 8.5.

 

To illustrate, consider the cutset ABE, which represents the state in which all three of the components A, B, and E are failed. To reach that state from the full-up state (no failures), the system can transition through the other states (from left to right) as depicted below.

 

 

If component A is checked and repaired every τA hours, then at that interval we would transfer all the probability from each state containing “A” to the state with “A” deleted. For example, the probability in state “A” would be moved back to the full-up state, the probability in state “A,B” would be moved to state “B”, and so on. The inspection/repair transitions of B and E would be implemented in the same manner. However, if state “A,E” generates a special alert that is checked and repaired every τAE hours, we can account for this in the model by transferring the probability in state “A,E” back to the full-up state at that interval. Alternatively we might specify that only component E is repaired when that alert is found, so we would transfer the probability from state “A,E” to state “A” at that interval (i.e., a partial repair).
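The bookkeeping just described can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: time advances in one-hour steps, failures occur independently at constant rates, all inspection periods are whole numbers of hours, and the alert applies only to the state “A,E” exactly as described above. The names (`step_failures`, `repair_component`, `repair_state`, `simulate`) and the numerical values are hypothetical.

```python
from itertools import combinations
from math import exp

COMPONENTS = ("A", "B", "E")
# The eight states: every subset of failed components, from full-up to A,B,E.
STATES = [frozenset(c) for k in range(4) for c in combinations(COMPONENTS, k)]


def step_failures(prob, rates, dt):
    """Advance the state probabilities by dt hours with failures only."""
    q = {c: 1.0 - exp(-rates[c] * dt) for c in COMPONENTS}
    new = {s: 0.0 for s in prob}
    for state, p_state in prob.items():
        healthy = [c for c in COMPONENTS if c not in state]
        # Enumerate which of the healthy components fail during this step.
        for mask in range(1 << len(healthy)):
            target, weight = set(state), p_state
            for i, c in enumerate(healthy):
                if (mask >> i) & 1:
                    target.add(c)
                    weight *= q[c]
                else:
                    weight *= 1.0 - q[c]
            new[frozenset(target)] += weight
    return new


def repair_component(prob, c):
    """Move the probability of every state containing c to the state without c."""
    new = {s: 0.0 for s in prob}
    for state, p_state in prob.items():
        new[state - {c}] += p_state
    return new


def repair_state(prob, source, target):
    """Move all the probability in one specific state to another state."""
    new = dict(prob)
    moved = new[source]
    new[source] = 0.0
    new[target] += moved
    return new


def simulate(rates, periods, hours, alert_period=None, alert_target=frozenset()):
    """Return the final state probabilities and the peak probability of A,B,E."""
    prob = {s: 0.0 for s in STATES}
    prob[frozenset()] = 1.0  # start in the full-up state
    top = frozenset(COMPONENTS)
    peak = 0.0
    for hour in range(1, hours + 1):
        prob = step_failures(prob, rates, 1.0)
        peak = max(peak, prob[top])  # sampled just before any repair at this hour
        for c in COMPONENTS:
            if hour % periods[c] == 0:
                prob = repair_component(prob, c)
        if alert_period and hour % alert_period == 0:
            prob = repair_state(prob, frozenset("AE"), alert_target)
    return prob, peak


# Assumed failure rates (per hour) and inspection periods (hours).
rates = {"A": 1e-5, "B": 1e-5, "E": 1e-5}
periods = {"A": 100, "B": 100, "E": 100}
final_plain, peak_plain = simulate(rates, periods, hours=100)
final_alert, peak_alert = simulate(rates, periods, hours=100, alert_period=10)
```

Passing `alert_target=frozenset("A")` instead of the default full-up state gives the partial repair in which only E is restored. Without the alert, the peak matches the simple product of the three individual probabilities; with the alert, the peak is lower because the route to “A,B,E” through “A,E” is periodically emptied.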

 

These kinds of composite repair strategies, although realistic and fairly common, can’t be readily evaluated numerically in a fault tree. A fault tree can be used to identify and depict the minimal cutsets, i.e., the combinations of failures sufficient to fail the overall system, but to evaluate the probabilities of those cutsets, a state model such as described above – called a Markov model – or some equivalent calculation must be used.

 

In addition to accommodating compound repair strategies, a Markov model also automatically accounts for any required sequencing of failures, and integrates the “sawtooth” function resulting from combining individual components with different exposure times. Also, although our discussion has focused on periodic repairs, Markov models can also accommodate repairs that are performed at specified rates or “on-condition”. Detailed descriptions of all these calculation methods are provided in the book Markov Models and Reliability.

 
