Probabilities with Variable Failure Rates

Success is just failure that hasn't happened yet.

Catrell Sprewell

The probability density function δ(t) for the occurrence of a specific event at the time t is defined as δ(t) = dP(t)/dt where P(t) is the cumulative probability, i.e., the probability that the event has occurred by the time t. Letting t_E denote the time when event E occurs, the probability that the event will occur during the time interval from t₁ to t₂ is given by the integral

$$P\{t_1 < t_E \le t_2\} = \int_{t_1}^{t_2} \delta(t)\,dt = P(t_2) - P(t_1)$$

The density function for a specific event generally decreases to zero as time passes, because the event will almost certainly have already occurred after a sufficient amount of time. However, the density function for the event to occur given that it has not already occurred may be fairly constant. For example, if we repeatedly toss a six-sided die, the probability of the first occurrence of ‘4’ is 1/6 for the first toss, but it drops for subsequent tosses, because the first occurrence is likely to have already occurred. Thus it is a priori extremely unlikely that the first ‘4’ will occur on (say) the 20th toss. However, given that the first ‘4’ has not already occurred, the probability of the first ‘4’ occurring on the 20th toss (or any toss) is always 1/6.
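The die example can be checked with a few lines of exact arithmetic. This is only an illustrative sketch; the function names are ours, and the a priori probability of a first ‘4’ on toss n is taken to be (5/6)^(n−1)·(1/6).

```python
from fractions import Fraction

P_FOUR = Fraction(1, 6)  # probability of a '4' on any single toss of a fair die

def first_occurrence_probability(n, p=P_FOUR):
    """A priori probability that the first success occurs exactly on toss n."""
    return (1 - p) ** (n - 1) * p

def conditional_probability(n, p=P_FOUR):
    """Probability of the first success on toss n, given no success on tosses 1..n-1."""
    not_yet_occurred = (1 - p) ** (n - 1)
    return first_occurrence_probability(n, p) / not_yet_occurred

a_priori_20 = first_occurrence_probability(20)
conditional_20 = conditional_probability(20)

print(float(a_priori_20))  # small: roughly half of one percent
print(conditional_20)      # exactly 1/6, independent of n
```

The a priori value shrinks geometrically with n, while the conditional value stays at 1/6 for every toss, which is the discrete analogue of a constant rate.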

This leads us to define the rate of occurrence, denoted by λ(t) for time t, as the probability density in the next increment of time after time t, given that the event has not already occurred by the time t. Thus letting dt denote the increment of time beginning at time t, and stipulating that P(0) = 0, we can express the definitions of the density function δ(t) and the rate function λ(t) as follows

$$\delta(t)\,dt = P\{t < t_E \le t + dt\}, \qquad \lambda(t)\,dt = P\{t < t_E \le t + dt \mid t_E > t\}$$

where the vertical line signifies “given”. Here the expression t_E > t signifies that the event E has not occurred by the time t. Recall that a basic property of probability is that, for any two events X and Y, we have

$$P\{X \text{ and } Y\} = P\{X \mid Y\}\,P\{Y\}$$

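where P{X|Y} signifies the conditional probability of X given Y. Making use of this identity, the expression for λ(t) can be written as

$$\lambda(t)\,dt = \frac{P\{(t < t_E \le t + dt) \text{ and } (t_E > t)\}}{P\{t_E > t\}} = \frac{\delta(t)\,dt}{P\{t_E > t\}}$$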

The probability that event E has not occurred by the time t is simply 1 − P(t), so this shows that the rate of occurrence is related to the probability density function by

$$\lambda(t) = \frac{\delta(t)}{1 - P(t)}$$

In other words, the rate is simply the probability density normalized by the probability that the event has not yet occurred. This is often a useful parameter because, as mentioned above, many naturally-occurring events have a constant (or nearly constant) rate. Remembering that δ(t) = dP/dt, the relation can be written as a linear first-order differential equation

$$\frac{dP}{dt} + \lambda(t)\,P(t) = \lambda(t)$$

If λ(t) is constant, this has the homogeneous solution P(t) = Ae^(−λt) for some constant A, and the particular solution P(t) = 1. The general solution is the sum of these two solutions. Setting A = −[1 − P(0)] to match the initial value, we have

$$P(t) = 1 - [1 - P(0)]\,e^{-\lambda t}$$

If P(0) = 0 this reduces to the familiar formula P(t) = 1 − e^(−λt) for the probability based on the exponential density function δ(t) = λe^(−λt). The above equation can also be written in the form

$$\frac{1 - P(t)}{1 - P(0)} = e^{-\lambda t}$$
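As a quick numerical check of the constant-rate case, the sketch below evaluates P(t) = 1 − [1 − P(0)]e^(−λt) and its derivative, and confirms that the density divided by 1 − P(t) returns the rate we started with. The rate and time values are arbitrary illustrative choices, not taken from any guidance.

```python
import math

def cumulative_probability(t, rate, p0=0.0):
    """P(t) = 1 - [1 - P(0)] exp(-rate * t) for a constant rate."""
    return 1.0 - (1.0 - p0) * math.exp(-rate * t)

def density(t, rate, p0=0.0):
    """dP/dt for the constant-rate solution above."""
    return (1.0 - p0) * rate * math.exp(-rate * t)

rate = 1e-4      # assumed constant rate per hour (illustrative)
t = 1000.0       # hours (illustrative)

# Density normalized by the probability that the event has not yet occurred.
recovered_rate = density(t, rate) / (1.0 - cumulative_probability(t, rate))

print(cumulative_probability(t, rate))
print(recovered_rate)
```

The same check works for a nonzero initial probability, since the factor 1 − P(0) cancels in the ratio.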

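If the rate function λ(t) is not constant, consider a sequence of incremental time intervals, each of duration Δt, during each of which the rate has the virtually constant values λ₀, λ₁, λ₂, ... and so on. In that case we have the sequence of solutions

$$\frac{1 - P(\Delta t)}{1 - P(0)} = e^{-\lambda_0 \Delta t}, \quad \frac{1 - P(2\Delta t)}{1 - P(\Delta t)} = e^{-\lambda_1 \Delta t}, \quad \ldots, \quad \frac{1 - P(N\Delta t)}{1 - P((N-1)\Delta t)} = e^{-\lambda_{N-1} \Delta t}$$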

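Multiplying these together, the probability at the end of these increments is given by

$$\frac{1 - P(N\Delta t)}{1 - P(0)} = e^{-(\lambda_0 + \lambda_1 + \cdots + \lambda_{N-1})\,\Delta t}$$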

Thus in the limit as Δt goes to zero and NΔt goes to T, we have

$$\frac{1 - P(T)}{1 - P(0)} = \exp\left(-\int_0^T \lambda(t)\,dt\right)$$

Consequently we have the result

$$P(T) = 1 - [1 - P(0)]\,\exp\left(-\int_0^T \lambda(t)\,dt\right) \qquad (3)$$

If the probability P(0) at the beginning of the flight is zero (as is typically the case), and if we make use of the approximation e^x ≈ 1 + x, this equation simply says that P(T) is essentially just the integral of λ(t) from t = 0 to T, which is as expected, because λ(t) is nearly identical to the probability density function δ(t) in these conditions. In other words, the above equation is just a convoluted way of expressing the tautological fact that the probability for the flight is the integral of the probability density for the flight.
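To see how close these quantities are in practice, the following sketch takes the variable-rate result to be P(T) = 1 − [1 − P(0)] exp(−∫λ dt), and compares it with the product of incremental factors and with the bare integral of λ(t). The linearly increasing rate function and the ten-hour duration are hypothetical values chosen only for illustration.

```python
import math

def hazard_rate(t):
    """Hypothetical rate per hour, increasing linearly with time (illustrative only)."""
    return 1e-6 + 2e-7 * t

def integrated_rate(rate_fn, T, steps=10_000):
    """Trapezoidal approximation of the integral of rate_fn from 0 to T."""
    dt = T / steps
    total = 0.5 * (rate_fn(0.0) + rate_fn(T))
    for i in range(1, steps):
        total += rate_fn(i * dt)
    return total * dt

def probability_by_time(rate_fn, T, p0=0.0):
    """P(T) = 1 - [1 - P(0)] exp(-integral), written with expm1 for small exponents."""
    exponent = integrated_rate(rate_fn, T)
    return p0 - (1.0 - p0) * math.expm1(-exponent)

def probability_by_increments(rate_fn, T, p0=0.0, steps=10_000):
    """Multiply the factors exp(-lambda_k * dt) over many small increments."""
    dt = T / steps
    survival = 1.0 - p0
    for k in range(steps):
        survival *= math.exp(-rate_fn(k * dt) * dt)
    return 1.0 - survival

T = 10.0  # hours (illustrative)
integral = integrated_rate(hazard_rate, T)
p_exact = probability_by_time(hazard_rate, T)
p_incr = probability_by_increments(hazard_rate, T)

print(integral, p_exact, p_incr)
```

For rates this small the three values agree to several significant digits, and P(T) is slightly less than the integral, as the approximation e^x ≈ 1 + x suggests.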

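Regulatory guidance for safety calculations sometimes splits up the total time interval T into several smaller intervals (also called phases) of duration T₁, T₂, ..., T_n, where the jth phase extends from t_(j−1) to t_j, and considers separate rate functions λ₁(t), λ₂(t), ..., λ_n(t) during these intervals respectively. Then by the above derivation we immediately have the two equivalent expressions

$$P(T) = 1 - [1 - P(0)]\,\exp\left(-\sum_{j=1}^{n} \int_{t_{j-1}}^{t_j} \lambda_j(t)\,dt\right) = 1 - [1 - P(0)] \prod_{j=1}^{n} \exp\left(-\int_{t_{j-1}}^{t_j} \lambda_j(t)\,dt\right)$$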

It’s worth noting that the rate functions are each defined and evaluated only over disjoint time intervals, meaning that the rate function λ₁(t) covers only t = t₀ to t₁, and the rate function λ₂(t) covers only t = t₁ to t₂, and so on. Thus these functions really comprise just a single overall rate function λ(t) covering the entire duration from t = 0 to T, and the probability is given by (3). If we wished to define individual rate functions beginning from time zero for each phase, then we could sum the integrals of λ_i for t = 0 to T_i. On the other hand, it may be convenient in some situations to split the function into these separate terms if each phase has a distinct constant rate.
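For the case where each phase has its own constant rate, the sum-of-integrals and product-of-exponentials forms can be compared directly. In this sketch the phase names, durations, and rates are invented for illustration, and the function names are ours.

```python
import math

# Hypothetical phases: (name, duration in hours, constant rate per hour).
PHASES = [
    ("takeoff", 0.1, 5e-6),
    ("cruise", 3.0, 1e-7),
    ("landing", 0.2, 2e-6),
]

def probability_sum_form(phases, p0=0.0):
    """Single exponential of the summed rate-times-duration terms."""
    exponent = sum(duration * rate for _, duration, rate in phases)
    return p0 - (1.0 - p0) * math.expm1(-exponent)

def probability_product_form(phases, p0=0.0):
    """Product of one exponential survival factor per phase."""
    survival = 1.0 - p0
    for _, duration, rate in phases:
        survival *= math.exp(-rate * duration)
    return 1.0 - survival

p_sum = probability_sum_form(PHASES)
p_product = probability_product_form(PHASES)

print(p_sum, p_product)
```

Both forms give the same value up to rounding, which reflects the fact that the phase rates together make up a single rate function over the whole flight.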

Another notable feature of the treatment of these calculations in the regulatory guidance is that it presents “two cases”, writing the above equation (with the product of exponentials) twice, once with P(0) = 0 and once with P(0) ≠ 0. Furthermore, in the latter “case”, it rewrites the above equation as

$$P(T) = P(0) + [1 - P(0)] \left[ 1 - \prod_{j=1}^{n} \exp\left(-\int_{t_{j-1}}^{t_j} \lambda_j(t)\,dt\right) \right]$$

This is algebraically equivalent to (3), but may have been intended to emphasize that it includes the initial probability for the case of latent failures that persist from one mission to the next. By the way, the guidance contains a misstatement, saying that the above equation with P(0) ≠ 0 gives “the probability that the element fails during one certain mission”, whereas it later says correctly that the latent case represents the probability of the fault being present during that mission, including the probability that it failed prior to the start of this mission. The case P(0) ≠ 0 is applicable to failure conditions that are not detected at the beginning of the mission, so they could already be present at the beginning of the mission. (The probabilities of component failures would contribute to the applicable λ(t) functions in the above equations.)

The published regulatory guidance states that the probability per flight should always be calculated for the average flight length, but it also states that the probability for an average flight may vary, because one or more failed elements in the system can persist for multiple flights (latent, dormant, or hidden failures). The analysis must consider the relevant exposure times, e.g., the time intervals between maintenance and operational checks or inspections. In such cases the probability of the Failure Condition per flight increases with the number of flights during the latency period. To account for this, the guidance states that the probabilities per flight (assumed to be of average duration) for each flight during either the entire life of the aircraft or the least common multiple of the latency periods (exposure times) should be summed and then divided by the number of missions. If the system is verified to be fully healthy at the beginning of each flight, then the latency period is just one flight, so every flight of average duration has the same probability, and hence this step in the calculation is superfluous. In that case the average value of “probability per average flight” would simply be P(T_ave), i.e., the probability of failure in a flight of average duration. The last step in the guidance is to divide this by the average flight duration, giving P(T_ave)/T_ave, and this is the value that is compared with the numerical threshold (e.g., 10⁻⁹/hr for catastrophic failures) to determine compliance.
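The averaging step can be sketched for the simplest possible case: a single element with a constant rate that is checked only once every so many flights, so that the probability at the start of each flight is the probability at the end of the previous one. This is our own simplified reading of the procedure, not the guidance's worked method, and the rate, flight duration, and check interval are hypothetical.

```python
import math

def probability_at_end(p_start, rate, duration):
    """P(T) = P(0) + [1 - P(0)] [1 - exp(-rate * T)] for a constant rate."""
    return p_start - (1.0 - p_start) * math.expm1(-rate * duration)

def average_probability_per_flight(rate, flight_hours, flights_between_checks):
    """Sum the per-flight probabilities over one latency period and divide by the count."""
    p_start = 0.0  # assumed verified healthy immediately after each check
    total = 0.0
    for _ in range(flights_between_checks):
        p_end = probability_at_end(p_start, rate, flight_hours)
        total += p_end
        p_start = p_end  # an undetected fault persists into the next flight
    return total / flights_between_checks

rate = 1e-5          # assumed constant rate per hour (illustrative)
flight_hours = 2.0   # assumed average flight duration (illustrative)

single_flight = probability_at_end(0.0, rate, flight_hours)
latent_average = average_probability_per_flight(rate, flight_hours, 100)

print(single_flight / flight_hours)   # per-hour value when checked every flight
print(latent_average / flight_hours)  # per-hour value with a 100-flight latency period
```

With a check before every flight the average collapses to P(T_ave), while with a 100-flight latency period the average is roughly fifty times larger in this example.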

The guidance does not mention (at least not explicitly) that, even if the system is verified to be fully healthy at the start of each flight, the overall probability of the failure condition occurring during the life of the airplane can also be affected by variations in flight length from the average. The only (tacit) acknowledgement of this is in the stated caveat that if P(T_ave)/T_ave “is likely to be significantly different from the predicted average rate of occurrence of that failure condition during the entire operational life of all airplanes of that type, then a risk model that better reflects the failure condition should be used”. The only way that P(T_ave)/T_ave can differ from the predicted long-term rate of occurrence is due to variations in flight length if the probability per flight is highly dependent on flight length. Recall that some failure modes contribute the same risk per mission, regardless of the duration of the mission, whereas the risk contributed by other failure modes varies in proportion to the mission duration. If we were limited to just these two kinds of failures, the probability of the jth mission with duration T_j could be written as P_j = P_c + kT_j where P_c is the constant contribution and k is the proportionality of the scale-dependent contribution. If we add up the probabilities for N missions and then divide by N (as directed in the guidance), the result is P_ave = P_c + kT_ave, which is exactly equal to the probability for the mission of the average duration. In view of this, one might think that variations in flight length have no effect on the long-term average. However, if the probability per flight varies as the square or cube or, in general, the nth power of the flight length, then it does have an effect on the long-term probability of occurrence in the life of the airplane. In such cases, the actual long-term average probability of occurrence is increased by the factor (Tⁿ)_ave/T_aveⁿ as explained in Regulating Risk. 
In most realistic cases this factor is close to 1, so it is often neglected, but it can be significant in extreme cases.
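The factor (Tⁿ)_ave/T_aveⁿ is easy to evaluate for any list of flight durations. The durations below are made up for illustration, and the function name is ours.

```python
def flight_length_factor(durations, n):
    """Ratio of the mean of T**n to the nth power of the mean of T."""
    count = len(durations)
    mean_duration = sum(durations) / count
    mean_of_powers = sum(t ** n for t in durations) / count
    return mean_of_powers / mean_duration ** n

durations = [1.0, 2.0, 3.0]  # hypothetical flight lengths in hours

factor_linear = flight_length_factor(durations, 1)     # no effect when risk is linear in T
factor_quadratic = flight_length_factor(durations, 2)  # exceeds 1 when risk goes as T squared

print(factor_linear, factor_quadratic)
```

For n = 1 the factor is exactly 1, consistent with the observation that constant and proportional contributions average to the probability for the flight of average duration. For larger n the factor grows with the spread of the flight lengths.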
