Timed Markov Models 

One of the defining attributes of ordinary Markov models is that the state transitions are exponentially distributed, meaning that the transition rates are constant and therefore the transitions are continuous. However, many actual situations involve state transitions that depend explicitly on time, and that occur discretely. Such transitions can’t be exactly modeled in terms of ordinary Markov models, but it is possible to extend the usual Markov modeling technique to encompass timedependent transitions. Interestingly, ordinary Markov models involve continuous flow between discrete states, whereas the extended technique (described below) essentially consists of representing discrete flow between continuous states. 

For a simple example, consider a dualredundant system comprised of two identical components, either of which is capable of performing the required function. The component failures are independent and exponentially distributed. The system begins with both components healthy. If one component fails, it will be repaired, but the scheduling of repairs can be handled in different ways. If both components are failed (resulting in total system failure), they are both repaired immediately. We will consider the following three possible repair strategies: 

_{} 

The case of continuous repairs can be handled as an ordinary Markov model, as illustrated below. 


The steadystate system equations are 

_{} 

and we can solve these equations for P_{1} to give the steadystate system failure rate 

_{} 

For the case of delayed repairs, i.e., if our strategy is to repair each components exactly t hours after it fails, we cannot represent the repairs as a simple exponential transition. We can, however, approximate the timedelayed transition by a model of the type shown below. 


The four intermediate states 1 through 4 are arranged in a temporal sequence with an exponential transition from each to the next, each with a rate 4/t, so the mean time to traverse those states and be repaired is t. The mean time in the intermediate state of the previous model was also t, but in the present model the variance of the dwell time is much less. By increasing the number of intermediate states, we can make the variance on the total dwell time arbitrarily small, and in the limit we arrive at a perfectly timed repair, precisely t hours after the component fails. Substituting from the steadystate system equations 

_{} 

into the conservation equation P_{0 }+ P_{1 }+ P_{2 }+ P_{3 }+ P_{4} = 1 and solving for P_{0} gives 

_{} 

where 
_{} 

From this we can compute the overall system failure rate 

_{} 

Noting that a/(1b) = 2, this can be written as 

_{} 

Now, recall that b = n/(n+l) where n = 4/t, corresponding to the interstate dwell times between each of the four intermediate states. More generally, for n intermediate states, we have n = n/t, and 

_{} 

In the limit as n goes to infinity this gives 

_{} 

so the overall system failure rate is 

_{} 

Next we consider periodic repair. In this case the entire system must be propagated through time, and restored to the fully healthy state every t hours. We will consider two variations of this case. First, we suppose the periodic checks and repairs for a given system are reinitialized whenever a system completely fails. (This would be the case if, for example, each system carries its own periodic sequence, and a system is simply replaced with a new one if/when it completely fails.) To model this case we again start with a finite approximation, as shown below. 


The steadystate system equations are 

_{} 

where 
_{} 

It follows that the individual substate probabilities for the fullyhealthy state can be expressed in terms of a, b, g, and P_{00} as follows 

_{} 

and the substate probabilities for the partialfailed state can be expressed as 

_{} 

The sum of the probabilities of the fully healthy substates is 

_{} 

and the sum of the probabilities of the partiallyfailed substates is 

_{} 

The sum of all eight of these probabilities is 1, so we can solve for P_{00}, and then we can write the overall system failure rate as 

_{} 

Dividing through by the first term in the denominator, and replacing 4 with n for an arbitrary number of substates, separated by exponential transitions with the rate n = n/t, this gives 

_{} 

As n goes to infinity we have the limiting values 

_{} 

and we also have the limits 

_{} 

Making these substitutions and simplifying, we find that the overall system failure rate is 

_{} 

Of course, we could arrive at this same result by just evaluating the mean system failure rate for the norepair model over a time of t hours, but the method used here can be applied in more general situations to impose time effects on Markov models. Naturally this method can also be used to make transition rates variable as functions of time. 

Equations (1), (2), and (3) give the system failure rates for three different repair strategies, based on continuous, delayed, and periodic repairs, respectively. A comparison of these rates as a function of t is shown in the plot below. 


All of them are asymptotic to 2l/3, as we would expect, because that is the steadystate system failure rate for no repairs. Naturally the periodic repair strategy has an initial rate of about half the continuous and delayed strategies, because the mean time to repair is only about half as long. (See below for further discussion of this point.) It’s interesting that all three results are of the form 

_{} 

where “X” is one of the expressions 

_{} 

Compare the last of these with the wellknown bilinear substitution used in converting continuous transfer functions to discrete transfer functions, where the differential operator D is replaced by a function of the past value operator 

_{} 

It would be interesting to see if some such general substitution could be used to introduce temporal effects (fixed time delays or periodic actions) into Markov models. The use of timedependent transition rates could also be explored. 

Next we consider the second variation for periodic inspections and repairs. In this variation, the periodic actions occur independently of any failures, so the period is not reinitialized if/when a system completely fails. That system is simply repaired (or replaced) and it remains on the same schedule of periodic inspections and repairs. Of course, it’s easy to evaluate this case by simply computing the average failure rate for the duration of one inspection period, beginning from the fullyhealthy state, and with no repairs of partially failed systems. Thus the system can be represented in the simple form shown below 


The timedependent system equations are 

_{} 

Substituting for P_{0} from the conservation equation P_{0} = 1 – P_{1} and rearranging terms gives 

_{} 

which has the solution 

_{} 

The average of this over the time interval from t = 0 to t = t is the average system failure rate with periodic inspections and repairs. Thus we have 

_{} 

Notice that this differs from each of the previous three cases. Now we will show how this same result can be reached using a timed Markov model. The approximate model for this case is shown below. 


To simplify the schematic the complete failure state has been omitted, since that state is irrelevant for purposes of assessing the failure rate of the population of operational systems. The second component failures are represented by the transitions from the singlefault states back to the fully healthy states. Unlike the previous variation, the system is restored to the contemporaneous fully healthy state, rather than the initial fully healthy state (because the periodic actions are not resynchronized). 

The steadystate system equations are 

_{} 

where 
_{} 

and n = 4/t. From these we infer the recurrence relation 

_{} 

Letting M denote the coefficient matrix on the right side, it follows that the system failure rate is given by 

_{} 

The geometric series of matrices cannot be summed using the simple Euclidean formula, because M^{0} – M is singular, but it can be shown that M^{j} is given explicitly by 

_{} 

Therefore the geometric sum can be written as 

_{} 

where F = n/(n + 3l). Letting M_{ij} denote the components of M, we can substitute into the previous expression for l_{sys} to give 

_{} 

Evaluating this in the limit as n goes to infinity (and recalling that n = n/t), we get 

_{} 

which agrees exactly with the previous result for this case. A plot of all four of the repair strategies is shown below. Interestingly, the continuous and the delayed cases are asymptotically matched at the low end, as are the two periodic cases, whereas at the high end the delayed is asymptotic with the periodicwithreset and the continuous is asymptotic with the periodicwithoutreset. 


As mentioned previously, the periodic strategies initially have exactly half the slope of the continuous and delayed strategies, because the mean time to repair with the periodic strategies is really only about half of the periodic time interval t (for small t), due to the fact that on average the failure occurs halfway through the period. To give equivalent performance using the continuous strategy, we would need to set the repair transition rate m to the actual mean time to repair of the periodic strategy, which (as discussed in another note) is given by 

_{} 

Substituting this in place of t in the equation for the system failure rate based on the continuous (exponential) repair transition, we can plot the failure rate along with the periodic transition failure rate as shown below. 


This illustrates how closely the periodic repair strategy can be represented by a simple Markov model with a constant repair rate, assuming we make a suitable choice of the rate so as to give equivalent mean times to repair. 

We’ve seen that the timed Markov model for periodic repair without reset gave an overall system failure rate that is identical to the rate predicted by the timedependent solution. Needless to say, this is not coincidental, because in this case the entire Markov model is being “timed” as a coherent whole, so it is strictly equivalent to the timedependent solution of the overall model. To prove this, recall that the general model of an nstate system is of the form dP/dt = MP where P is the state (column) vector consisting of the probabilities of the n states and the transition matrix has the components 

_{} 

with the understanding that l_{ii} = 0. The timedependent solution can be expressed as 

_{} 

Now, the timed Markov model for this entire system over a period t without resets consists of m complete copies of the entire system, linked by transitions from each state of the system to the corresponding state in the successor system. The rate of each linking transition is m/t, and we evaluate the limit as m goes to infinity (for any fixed t). The recurrence equation for the ith state of the timed system is therefore 

_{} 

where P_{i}[k] denotes the probability of the ith state in the kth copy of the system. Solving for P_{i}[k1] we get 

_{} 

Consequently the recurrence relations for all n states can be expressed in matrix notation as 

_{} 

where I is the identity matrix and M is the same as the transition matrix for the timedependent equations. To span the period t we must have m applications of this discrete transition, so we have 

_{} 

Evaluating this in the limit as m goes to infinity, we get P(0) = e^{tM} P(t), which agrees exactly with the timedependent solution. 
