Quantifying Latent Risk

 

The risk due to a latent failure condition depends not just on the rate at which that condition arises, but also on how long the condition may persist once it has arisen.  To illustrate, consider the following example:

 

A certain latent condition (in which the airplane is one failure away from catastrophe) is predicted to arise 7 times in the life of the fleet, and each time it arises it is present for 10 flights before being checked and repaired. This means we expect to have 70 flights in the life of the fleet during which the airplane is one failure away from catastrophe.  But now suppose the inspection is only performed every 10,000 flights.  The failure rate is unchanged; only the latency period has changed, and yet in this case we expect 70,000 flights in the life of the fleet during which the airplane is one failure away from catastrophe.  Thus the risk of catastrophe associated with this latent failure condition is 1000 times as great, and yet the “failure rate” (i.e., the rate of arising) is exactly the same.
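
To make the arithmetic explicit, here is a minimal sketch (in Python, using the figures from the example above); the variable names are ours, chosen purely for illustration.

    # Expected number of flights during which the airplane is one failure
    # away from catastrophe, for the two inspection intervals above.
    occurrences = 7                        # times the latent condition arises in the fleet life
    exposed_short = occurrences * 10       # inspection every 10 flights     -> 70 flights
    exposed_long  = occurrences * 10_000   # inspection every 10,000 flights -> 70,000 flights
    print(exposed_short, exposed_long, exposed_long / exposed_short)   # 70 70000 1000.0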

 

Consistent with this, the updated federal regulation 25.1309 in Amendment 25-152 clearly quantifies the risk of latent failure conditions as the product of the failure rate and the exposure time, which (for small values of that product) is essentially the probability of being failed on the last flight of the interval, not the probability of failing on that flight. The result is proportional to the normalized average probability as defined by the “Arsenal” draft Advisory Circular and the EASA AMC.
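
As a hedged numerical illustration of this distinction (the rate and interval below are made-up values, not drawn from any regulation or analysis), compare the probability of an element failing on a single flight with the probability of its being failed on the last flight of a long latency interval:

    import math

    lam = 1.0e-7              # assumed failure rate per flight hour (illustrative)
    t_flight = 1.0            # assumed flight length in hours (illustrative)
    tau = 10_000 * t_flight   # exposure time if the element is checked every 10,000 flights

    p_fails_on_flight  = 1 - math.exp(-lam * t_flight)   # probability of failing on that flight (~1e-7)
    p_failed_on_flight = 1 - math.exp(-lam * tau)         # probability of being failed on the last flight (~1e-3)
    print(p_fails_on_flight, p_failed_on_flight, lam * tau)   # lam*tau closely approximates the latter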

 

Despite this, one still encounters efforts to promote the idea of quantifying latent risk in terms of the rate alone, which can underestimate the safety risk by orders of magnitude for long exposure times. Since it’s obvious that the failure rate, by itself, cannot characterize the safety risk posed by a latent fault, what is the reason for these efforts?  In the following discussion, we review the historical background on this misconception.

 

Numerical probability analyses (NPA) were originally applied almost exclusively to catastrophic or hazardous failure conditions, which are generally not latent (even if they entail some latent elements). For such conditions, the probability of arising on a given flight is the same as the probability of being present at some point during the flight. However, several developments gradually led to the application of numerical probability calculations to totally latent failure conditions. First was the publication of 25.981 in 2001, which imposed very stringent numerical limits on latent failure conditions related to fuel tank ignition sources. (The limits on latent faults imposed by 25.981 are orders of magnitude more stringent than the limits imposed on latent faults for every other system on the airplane in the recent Amendment 25-152.) Second, the regulatory agencies began to push for the application of numerical analyses for Major failure conditions (the category less severe than Hazardous), some of which are totally latent. Third, there was an initiative to extend requirements similar to 25.981 to a wider range of systems, beyond just fuel tank ignition sources.

 

The calculation defined in the Arsenal AC 25.1309 actually applies perfectly well to latent failure conditions, explicitly accounting for Pprior, etc., but some of the verbiage in the body of the text unnecessarily continued to be phrased in terms of active failure conditions, e.g., referring to the number of times a failure condition arises, as distinct from the number of flights during which it is present. This verbal distinction was immaterial when all top-level failure conditions were active, but it becomes significant when addressing completely latent failure conditions. Even when the Arsenal was drafted in 2002, some of the old verbiage remained. For example, when discussing the probability of a latent failure condition on a given flight, the Arsenal at one point refers to this as the probability that the element “fails” during that flight, whereas the quantity being calculated (including the Pprior term) is clearly the probability that the element “is failed” during that flight. This was pointed out in an NPRM comment, and the FAA accepted the comment and changed the statement to “is failed”. Unfortunately, this correction was not made consistently throughout the AC, so a few remnants of the old focus on active faults remain, but aside from the one slip, which has been corrected, the verbiage in Appendix 3 itself covers both active and latent failure conditions.

 

The Arsenal Appendix has 14 instances referring to the probability being calculated, and in 10 of those instances it refers to the “probability of the failure condition on a given flight”, which signifies that the failure condition is present on that flight.  In one instance, as noted above, the text mistakenly referred to the probability that an element “fails” on a certain flight, but this was corrected in the NPRM to “is failed”.  The remaining 3 instances refer to the failure condition occurring on a flight.  Strictly speaking, the word “occur” applied to a condition or persistent entity (as distinct from an event) has the meaning “exist or be found to be present”, whereas when it’s applied to an event, it has the meaning “happen or take place”. So, these 3 instances are consistent with the other 11 instances. Moreover, in every case the Arsenal equations calculate the probability of the failure condition being present on a certain flight, explicitly including the Pprior term (which is zero for active conditions), so there is no ambiguity.

 

As mentioned previously, in the recently amended version of 25.1309 (Amendment 25-152) the FAA has defined explicit requirements on latent fault conditions (that leave the airplane one failure away from a catastrophic condition), and they closely align with what would be required by stating that the latent condition must be Remote (per the actual Arsenal definition, i.e., normalized average probability less than 1E-05/FH).  However, instead of simply defining the requirement this way, the new regulation states that, in any Catastrophic single latent plus one (CSL+1) combination, the active part must be Remote and the latent part must satisfy λτ < 1/1000, where λ is the failure rate and τ is the exposure time. Again, this represents the probability that the condition is present on the last flight in the latency period, which confirms that this is the only meaningful measure of the risk associated with latent faults. Of course, the combination of active plus latent must be extremely improbable (1E-09), so if the active part is at the Remote boundary (1E-05), the latent part will need to be 1E-04, and if we apply averaging and normalization, this typically results in a value near 1E-05, which is Remote. The point is that the only relevant measure of risk for latent fault conditions is the probability of being present, not the rate at which it arises.
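
To see how the numbers quoted above fit together, here is a small illustrative sketch; the particular rate and exposure values are assumptions of ours, chosen only to sit at the stated boundaries:

    # Illustrative CSL+1 arithmetic using the limits quoted above (values are assumed).
    active_prob_per_fh = 1.0e-5      # active part at the Remote boundary
    lam_latent = 1.0e-8              # assumed latent failure rate per flight hour
    tau = 10_000.0                   # assumed exposure time in flight hours

    latent_product = lam_latent * tau            # lambda * tau = 1e-4
    assert latent_product < 1.0 / 1000.0         # satisfies the regulatory limit of 1/1000

    combined = active_prob_per_fh * latent_product
    print(latent_product, combined)              # 1e-4, 1e-9 (extremely improbable boundary)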

 

The Arsenal draft says “The probability of a Failure Condition occurring on an "Average Flight" …  should be determined …  If one or more failed elements in the system can persist for multiple flights … the probability of the Failure Condition increases with the number of flights during the latency period.” For a single latent fault, the probability of failing on a given flight remains constant, so when the AC refers to the probability of occurring increasing with the number of flights, it can only be referring to the probability of being present on the flight.
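
The following sketch illustrates this with made-up numbers: the per-flight probability of failing stays constant, while the probability of being failed on flight n of the latency period grows with n.

    p_fail = 1.0e-6    # assumed constant probability of the element failing on any one flight

    for n in (1, 10, 100, 1000):
        p_present = 1 - (1 - p_fail) ** n   # probability of being failed on flight n of the period
        print(n, p_fail, p_present)         # middle column constant, last column grows roughly as n*p_fail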

 

Unfortunately, along with the new version of 25.1309 in Amendment 25-152, the FAA published a revised version of the Advisory Circular, which (as described in another article) is mangled such that it no longer represents a coherent calculation of anything.  One might argue that the mangling of Appendix F no longer matters, because latent failure conditions are now covered by a separate calculation specified in the regulation itself, which doesn’t invoke the usual probability ranges of 25.1309 and hence no longer depends on Appendix F. However, Major faults still need to be shown to be Remote using the AC criteria, and many such faults are latent. Moreover, the mistakes in the re-write of Appendix F effectively render it inapplicable to any failure condition, even an active one, at least if it contains any latent components.

 

One feature of the Arsenal AC that has contributed to confusion is the statement in subsection c that “the principles of calculating are described below and also in more detail in SAE ARP 4761”. The vague reference to the ARP here was somewhat gratuitous, because the AC itself gives a complete description of the principles of the calculation (see Part I of this related article). Notice that it says the principles are described “below” (i.e., in the AC itself) and in the ARP, so it is not offering the ARP as an alternative, but merely as a redundant source with “more detail”. Now, it doesn’t specify which particular part of the ARP is supposed to contain this redundant description with the extra detail, but the closest thing to it is the ARP 4761A Appendix I on Markov models. Yes, this does provide a little more detail (a textbook explanation) of the core part of the AC calculation, for readers who aren’t familiar with it, but it doesn’t alter or add anything to the definition of the quantity being calculated. And neither that section nor any other section of the ARP describes the 4-step calculation of the quantity called “Average Probability per Flight Hour” given in the AC.

 

The NPRM draft added a sentence to F.3 stating that “The probability of a failure condition occurring on an average flight should be determined by structured methods (see SAE ARP 4761 for example methods)”. We’ve already noted that for a persistent condition the word “occur” means “exist or be present”, and the NPRM version of F.3 itself describes unambiguously how to compute the probability of a failure condition on a given flight. As noted above, the calculation of the basic probabilities is indeed described in a bit more detail in Appendix I (Markov models) of ARP 4761A, but nothing in the ARP supersedes the AC. The Arsenal AC unequivocally states that the method of calculating Average Probability Per Flight Hour (which it capitalizes to highlight that this is a label for a defined quantity) is contained in Appendix 3 (renamed Appendix F in the NPRM), and nowhere else. However, in the published AC the sentence was changed to state that the calculation is found in Appendix F and in ARP 4761. Again, this is not accurate, because the ARP does not contain the 4-step process leading to the normalized average probability described in the AC.

 

It should be remembered that ARP 4761 is just an industry document (I was one of the authors of the recent update), and it includes a variety of generic notions and principles, but it does not (and wasn’t intended to, and legally couldn’t) provide regulatory requirements, definitions, or guidance. We can find “principles of calculating” a wide variety of things in the ARP (as well as many textbooks), but it doesn’t tell us what to calculate to show compliance to FAA regulations. The actual definition of the quantitative compliance criteria for 14 CFR 25.1309 is provided only in the FAA Advisory Circular 25.1309.  The vague references to ARP 4761 should not be taken as an invitation to disregard the definition of “Average Probability per Flight Hour” spelled out unambiguously in the AC itself.

 

Readers sometimes mistakenly point to Section G.11.1.5.5 of the ARP 4761A as a relevant example, but that section actually presents a (primitive and overly specialized) calculation of the approximate rate at which a latent combination of two latent failures is entered (for which it coins the phrase “occurrence probability”, not used anywhere else in the document). This is not the same as the normalized probability in terms of which the FAA’s quantitative criteria for compliance to 25.1309(b) are defined. A detailed discussion of that example is given in “Failure Rates and Normalized Probabilities”, showing that the calculation presented there computes the rate rather than the normalized probability, so it isn’t relevant to 25.1309(b) compliance showings. For the reasons explained above, the rate at which a latent fault condition arises does not represent the safety risk associated with latent failure conditions, which is why the Arsenal/EASA/NPRM AC all define the quantitative measure of risk as the normalized average probability.

 

Having said that, Appendix G (formerly D) of ARP 4761A does contain some relevant information, but nothing that overrules the definition of “Average Probability per Flight Hour” provided by the Arsenal Advisory Circular. For example, section G.10.1 describes how to determine the minimal cutsets for a fault tree. The AC doesn’t describe this step because it is very generic and was assumed to be familiar to everyone, so the ARP’s treatment is just background information for the sake of completeness. Once we have the minimal cutsets, the task is to evaluate the normalized average probability of each of them. This is fully defined in the Arsenal AC itself (as discussed in Part I of this related article). If a reader wants more details on the “principles of calculating” this quantity, the generic aspects are discussed in Appendix I of ARP 4761A, which gives background on the general solution of the Markov state model for each cutset. Of course, this can also be found in numerous textbooks. (For a more focused introduction, see “The Arsenal Companion”.) Again, the ARP doesn’t discuss specialized regulatory concepts like normalized average probability, which isn’t a math textbook concept; it is a regulatory requirement related to fatalities per passenger mile.
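
As a rough indication of what such a cutset evaluation can look like, here is a sketch for a two-element cutset (one latent element checked every n_flights flights, plus one active element). This is our own simplified setup with assumed numbers, not a transcription of the AC’s four-step procedure or of any ARP example, and it neglects second-order effects such as both elements failing during the same flight:

    import math

    # Sketch (assumed values) of evaluating a two-element cutset: a latent element
    # checked every n_flights flights, combined with an active element that must
    # fail on the same flight.
    lam_latent = 1.0e-7    # latent element failure rate per flight hour (assumed)
    lam_active = 1.0e-6    # active element failure rate per flight hour (assumed)
    t_flight   = 2.0       # average flight length in hours (assumed)
    n_flights  = 1000      # flights between checks of the latent element (assumed)

    total = 0.0
    for k in range(1, n_flights + 1):
        p_latent_failed = 1 - math.exp(-lam_latent * (k - 1) * t_flight)  # failed before flight k
        p_active_fails  = 1 - math.exp(-lam_active * t_flight)            # fails during flight k
        total += p_latent_failed * p_active_fails

    avg_per_flight = total / n_flights       # average probability per flight over the check interval
    avg_per_fh = avg_per_flight / t_flight   # normalized by the average flight length
    print(avg_per_flight, avg_per_fh)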

 

Another minor peculiarity (among many) of the published AC is the introduction of the “end of flight” notation for latent elements. This is odd because, even for active elements, the computed probability is for the end of the flight, e.g., if an element is checked good at the start of the flight, we compute the probability of the fault occurring by the end of the flight, taking the full flight length into account. The idea that “end of flight” applies uniquely to latent faults is a misconception, and it provides a strong hint as to where the re-write may have originated.

 

On the subject of normalization, the original versions of AC 25.1309-1 and AC 25.1309-1A, in 1982 and 1988 respectively, each included a single, somewhat labored sentence (different in the two versions), both struggling to articulate some kind of exception to normalization in vaguely and inconsistently specified circumstances. This was based on confusion about the difference between rates and normalized probabilities, and a lack of understanding of the purpose of normalization. In 2002 the assembled experts who developed the Arsenal version of the AC thoroughly discussed the errant sentence, intentionally deleted it, and addressed normalization clearly and correctly in Appendix 3. Unfortunately, misunderstanding of normalization has persisted, and in the 2022 NPRM version of 25.1309-1B the following parenthetical statement was added, re-asserting this misunderstanding:

 

…for failure conditions that are only relevant during a specific flight phase… the probability is calculated as an average probability per flight. To convert to “average probability per flight hour”, divide the per flight probability by one hour.

 

Multiple comments were posted during the NPRM process, questioning this position and explaining why it makes no sense. The purpose of normalization is not to convert a probability to a rate; it is simply to scale the average per-flight probability by the average flight length of the airplane model, to place models of different flight lengths on an equal footing in terms of fatalities per passenger mile, so that a single numerical threshold (e.g., 1E-09/FH for extremely improbable) can be used for all models. (For details, see “Normalized Average Probability”, especially Note 2.) Inserting the above-quoted statement into the AC perpetuated a significant disharmony with EASA. Note that the Arsenal/EASA/NPRM calculation (i.e., the 4-step process described previously) correctly accounts for any distribution of failure rates over the flight phases, including cases when the rate is non-zero only in a single phase. The probability of the failure condition on a given flight depends only on the integral of the probability density over the flight, not on how that density is distributed within the flight. (This is not to be confused with the concept of specific risk, which concerns how risk is distributed among the individual flights over the life of an airplane.) This is why the statement quoted above is misguided. This criticism was not substantively addressed in the NPRM response.
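
As a small illustration of the point about flight phases (with assumed numbers of our own), the per-flight probability is the same whether the failure rate is spread over the whole flight or concentrated in a single phase, and normalization is simply a division by the average flight length:

    import math

    t_flight    = 5.0       # average flight length in hours (assumed)
    lam_uniform = 2.0e-7    # rate spread uniformly over the whole flight (assumed)
    lam_phase   = 1.0e-6    # same integral of rate, concentrated in a 1-hour phase (assumed)
    t_phase     = 1.0

    p_uniform = 1 - math.exp(-lam_uniform * t_flight)   # ~1e-6 per flight
    p_phase   = 1 - math.exp(-lam_phase * t_phase)      # ~1e-6 per flight (same)

    # Normalization divides the average per-flight probability by the average
    # flight length, so models with different flight lengths can be compared
    # against a single per-flight-hour threshold.
    print(p_uniform, p_phase, p_uniform / t_flight)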

 
