Timed Markov Models |
|
One of the defining attributes of ordinary Markov models is that the state transitions are exponentially distributed, meaning that the transition rates are constant and therefore the transitions are continuous. However, many actual situations involve state transitions that depend explicitly on time, and that occur discretely. Such transitions can’t be exactly modeled in terms of ordinary Markov models, but it is possible to extend the usual Markov modeling technique to encompass time-dependent transitions. Interestingly, ordinary Markov models involve continuous flow between discrete states, whereas the extended technique (described below) essentially consists of representing discrete flow between continuous states. |
|
For a simple example, consider a dual-redundant system comprised of two identical components, either of which is capable of performing the required function. The component failures are independent and exponentially distributed. The system begins with both components healthy. If one component fails, it will be repaired, but the scheduling of repairs can be handled in different ways. If both components are failed (resulting in total system failure), they are both repaired immediately. We will consider the following three possible repair strategies: |
|
|
|
The case of continuous repairs can be handled as an ordinary Markov model, as illustrated below. |
|
|
|
The steady-state system equations are |
|
|
|
and we can solve these equations for P1 to give the steady-state system failure rate |
|
|
|
For the case of delayed repairs, i.e., if our strategy is to repair each components exactly τ hours after it fails, we cannot represent the repairs as a simple exponential transition. We can, however, approximate the time-delayed transition by a model of the type shown below. |
|
|
|
The four intermediate states 1 through 4 are arranged in a temporal sequence with an exponential transition from each to the next, each with a rate 4/τ, so the mean time to traverse those states and be repaired is τ. The mean time in the intermediate state of the previous model was also τ, but in the present model the variance of the dwell time is much less. By increasing the number of intermediate states, we can make the variance on the total dwell time arbitrarily small, and in the limit we arrive at a perfectly timed repair, precisely τ hours after the component fails. Substituting from the steady-state system equations |
|
|
|
into the conservation equation P0 + P1 + P2 + P3 + P4 = 1 and solving for P0 gives |
|
|
|
where |
|
|
From this we can compute the overall system failure rate |
|
|
|
Noting that α/(1–β) = 2, this can be written as |
|
|
|
Now, recall that β = ν/(ν+λ) where ν = 4/τ, corresponding to the inter-state dwell times between each of the four intermediate states. More generally, for n intermediate states, we have ν = n/τ, and |
|
|
|
In the limit as n goes to infinity this gives |
|
|
|
so the overall system failure rate is |
|
|
|
Next we consider periodic repair. In this case the entire system must be propagated through time, and restored to the fully healthy state every τ hours. We will consider two variations of this case. First, we suppose the periodic checks and repairs for a given system are re-initialized whenever a system completely fails. (This would be the case if, for example, each system carries its own periodic sequence, and a system is simply replaced with a new one if/when it completely fails.) To model this case we again start with a finite approximation, as shown below. |
|
|
|
The steady-state system equations are |
|
|
|
where |
|
|
It follows that the individual sub-state probabilities for the fully-healthy state can be expressed in terms of α, β, γ, and P00 as follows |
|
|
|
and the sub-state probabilities for the partial-failed state can be expressed as |
|
|
|
The sum of the probabilities of the fully healthy sub-states is |
|
|
|
and the sum of the probabilities of the partially-failed sub-states is |
|
|
|
The sum of all eight of these probabilities is 1, so we can solve for P00, and then we can write the overall system failure rate as |
|
|
|
Dividing through by the first term in the denominator, and replacing 4 with n for an arbitrary number of sub-states, separated by exponential transitions with the rate ν = n/τ, this gives |
|
|
|
As n goes to infinity we have the limiting values |
|
|
|
and we also have the limits |
|
|
|
Making these substitutions and simplifying, we find that the overall system failure rate is |
|
|
|
Of course, we could arrive at this same result by just evaluating the mean system failure rate for the no-repair model over a time of τ hours, but the method used here can be applied in more general situations to impose time effects on Markov models. Naturally this method can also be used to make transition rates variable as functions of time. |
|
Equations (1), (2), and (3) give the system failure rates for three different repair strategies, based on continuous, delayed, and periodic repairs, respectively. A comparison of these rates as a function of t is shown in the plot below. |
|
|
|
All of them are asymptotic to 2λ/3, as we would expect, because that is the steady-state system failure rate for no repairs. Naturally the periodic repair strategy has an initial rate of about half the continuous and delayed strategies, because the mean time to repair is only about half as long. (See below for further discussion of this point.) It’s interesting that all three results are of the form |
|
|
|
where “X” is one of the expressions |
|
|
|
Compare the last of these with the well-known bi-linear substitution used in converting continuous transfer functions to discrete transfer functions, where the differential operator D is replaced by a function of the past value operator |
|
|
|
It would be interesting to see if some such general substitution could be used to introduce temporal effects (fixed time delays or periodic actions) into Markov models. The use of time-dependent transition rates could also be explored. |
|
Next we consider the second variation for periodic inspections and repairs. In this variation, the periodic actions occur independently of any failures, so the period is not re-initialized if/when a system completely fails. That system is simply repaired (or replaced) and it remains on the same schedule of periodic inspections and repairs. Of course, it’s easy to evaluate this case by simply computing the average failure rate for the duration of one inspection period, beginning from the fully-healthy state, and with no repairs of partially failed systems. Thus the system can be represented in the simple form shown below |
|
|
|
The time-dependent system equations are |
|
|
|
Substituting for P0 from the conservation equation P0 = 1 – P1 and re-arranging terms gives |
|
|
|
which has the solution |
|
|
|
The average of this over the time interval from t = 0 to t = τ is the average system failure rate with periodic inspections and repairs. Thus we have |
|
|
|
Notice that this differs from each of the previous three cases. Now we will show how this same result can be reached using a timed Markov model. The approximate model for this case is shown below. |
|
|
|
To simplify the schematic the complete failure state has been omitted, since that state is irrelevant for purposes of assessing the failure rate of the population of operational systems. The second component failures are represented by the transitions from the single-fault states back to the fully healthy states. Unlike the previous variation, the system is restored to the contemporaneous fully healthy state, rather than the initial fully healthy state (because the periodic actions are not re-synchronized). |
|
The steady-state system equations are |
|
|
|
where |
|
|
and ν = 4/τ. From these we infer the recurrence relation |
|
|
|
Letting M denote the coefficient matrix on the right side, it follows that the system failure rate is given by |
|
|
|
The geometric series of matrices cannot be summed using the simple Euclidean formula, because M0 – M is singular, but it can be shown that Mj is given explicitly by |
|
|
|
Therefore the geometric sum can be written as |
|
|
|
where Φ = ν/(ν + 3λ). Letting Mij denote the components of M, we can substitute into the previous expression for λsys to give |
|
|
|
Evaluating this in the limit as n goes to infinity (and recalling that ν = n/τ), we get |
|
|
|
which agrees exactly with the previous result for this case. A plot of all four of the repair strategies is shown below. Interestingly, the continuous and the delayed cases are asymptotically matched at the low end, as are the two periodic cases, whereas at the high end the delayed is asymptotic with the periodic-with-reset and the continuous is asymptotic with the periodic-without-reset. |
|
|
|
As mentioned previously, the periodic strategies initially have exactly half the slope of the continuous and delayed strategies, because the mean time to repair with the periodic strategies is really only about half of the periodic time interval τ (for small τ), due to the fact that on average the failure occurs half-way through the period. To give equivalent performance using the continuous strategy, we would need to set the repair transition rate μ to the actual mean time to repair of the periodic strategy, which (as discussed in another note) is given by |
|
|
|
Substituting this in place of τ in the equation for the system failure rate based on the continuous (exponential) repair transition, we can plot the failure rate along with the periodic transition failure rate as shown below. |
|
|
|
This illustrates how closely the periodic repair strategy can be represented by a simple Markov model with a constant repair rate, assuming we make a suitable choice of the rate so as to give equivalent mean times to repair. |
|
We’ve seen that the timed Markov model for periodic repair without reset gave an overall system failure rate that is identical to the rate predicted by the time-dependent solution. Needless to say, this is not coincidental, because in this case the entire Markov model is being “timed” as a coherent whole, so it is strictly equivalent to the time-dependent solution of the overall model. To prove this, recall that the general model of an n-state system is of the form dP/dt = MP where P is the state (column) vector consisting of the probabilities of the n states and the transition matrix has the components |
|
|
|
with the understanding that λii = 0. The time-dependent solution can be expressed as |
|
|
|
Now, the timed Markov model for this entire system over a period τ without resets consists of m complete copies of the entire system, linked by transitions from each state of the system to the corresponding state in the successor system. The rate of each linking transition is m/τ, and we evaluate the limit as m goes to infinity (for any fixed τ). The recurrence equation for the ith state of the timed system is therefore |
|
|
|
where Pi[k] denotes the probability of the ith state in the kth copy of the system. Solving for Pi[k–1] we get |
|
|
|
Consequently the recurrence relations for all n states can be expressed in matrix notation as |
|
|
|
where I is the identity matrix and M is the same as the transition matrix for the time-dependent equations. To span the period τ we must have m applications of this discrete transition, so we have |
|
|
|
Evaluating this in the limit as m goes to infinity, we get P(0) = e–τM P(τ), which agrees exactly with the time-dependent solution. |
|