Changes in Appendix F of AC 25.1309-1B |
|
In December 2022 the FAA issued a Notice of Proposed Rulemaking (NPRM) that was accompanied by a draft update to Advisory Circular (AC) 25.1309 for public comment. One of the key parts of this AC is Appendix F, which defines the normalized average probability that is to be compared with the specified quantitative thresholds to determine compliance with the regulation. The intent was to harmonize with the calculation in the 2002 Arsenal draft AC and the EASA AMC 25.1309 that have been in use for over 20 years. Indeed, the description of the calculation in the NPRM draft that was circulated for public comment was essentially identical to the corresponding sections of those earlier versions (with the exception of a parenthetical statement in the NPRM draft related to normalization, which we will discuss at the end of Part II of this article). |
|
However, in the published version of AC 25.1309-1B, released in August of 2024, the description of the probability calculation in Appendix F was extensively re-written, particularly Section F.4, rendering it incoherent. If it were a computer program, it wouldn’t compile. For example, a recurrence expression with a free index is treated as if it has some definite value, so it doesn’t even parse mathematically. This and other inconsistencies make it impossible for applicants to determine the “Average Probability per Flight Hour” by following the steps in the published Appendix F as written. |
|
We’ll describe the problems with the published Appendix F below. For context, in Part I we review the original calculation of normalized average probability (also called “Average Probability per Flight Hour”) as described clearly and correctly in the 2002 Arsenal draft, the EASA AMC, and the 2022 NPRM draft, which are all essentially identical (aside from the misconceived statement about normalization in the NPRM draft). In Part II we describe the re-written version of Appendix F in the published AC. Lastly, Part III gives further discussion and general remarks on misunderstandings that may have motivated the problematic alterations in the published AC. |
|
|
I. Review of Arsenal, EASA, and NPRM AC 25.1309 Calculation |
|
For definite citations in this section, we’ll refer to the Arsenal draft. The process of calculating the quantity referred to as "Average Probability per Flight Hour" for a Failure Condition is described in Appendix 3 of the Arsenal as a four-step process, based on the assumption that the life of an aircraft is a sequence of "Average Flights". (It’s worth noting that the name of the quantity being calculated was intentionally presented in the Arsenal in quotation marks, to signal that it is the label of a specially defined quantity, which is an important fact that was lost sight of in the NPRM draft.) These four steps are described below. |
|
Step 1: Determination of the "Average Flight" |
This is described in sub-paragraph “a”. The applicant should estimate the average flight duration and average flight profile for the fleet of aircraft to be certified. The duration of each flight phase (e.g. takeoff, climb, cruise, descent, approach and landing) in the "Average Flight" should be based on the average flight profile. The applicant then determines the failure rate function for each element of the system during each phase (which may be different). |
|
Step 2: Calculation of the probability of a Failure Condition for a certain "Average Flight" |
This is described in sub-paragraph “b”. This section gives the formula for the probability of a given fault being present at some time on the kth flight recursively in terms of the probability at the start of that flight. The probability at the start of the kth flight equals zero for an element that is checked good at the beginning of the flight, and otherwise equals the probability at the end of the (k−1)th flight. The Arsenal AC gives these as two separate formulas, but they are both entailed (for a single element) by the recursive formula |
|
|
|
where Pk is the probability of the failure condition of this element being present by the end of the kth flight (called “a certain flight”), and λi(t) is the failure rate function during the ith of n phases. (For the derivation of this standard formula, see “The Failure Rate Function”.) If the element is checked and (if necessary) repaired prior to the kth flight, then the prior probability Pk−1 is set to 0, so this formula represents both formulas given in this section of the Arsenal description. This formula can be written in vector notation (for the two states, healthy or failed, of a single element) as |
|
|
|
where Pk denotes the kth probability state vector (representing the probabilities of the states of the cutset), M is the mean failure rate matrix, Sk is the kth repair transition matrix, and T is the average flight time. (See “The Arsenal Companion” for a detailed explanation of this correspondence.) The same equation applies to a model with any number of states, representing an entire failure condition (cutset), with each element treated as specified. Note that this Arsenal calculation is nothing but the standard solution of dP/dt = MP for a given combination of failures, as discussed in Sections I.4.10 and I.6 of ARP 4761A, as well as in innumerable textbooks. |
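The displayed equations for this step are not reproduced in this text. As a hedged reconstruction from the definitions given above (not a quotation from the Arsenal), the scalar recurrence for a single element and its matrix counterpart might be written as follows. The "phase i" integration limits and the placement of S_k ahead of the flight (repair applied before flight k) are assumptions made for illustration.

```latex
% Scalar recurrence for one element over the n phases of the kth flight.
% Setting P_{k-1} = 0 (element checked good before the flight) recovers
% the formula for an element that starts the flight healthy.
P_k \;=\; P_{k-1} + \left(1 - P_{k-1}\right)
      \left[\,1 - \exp\!\left(-\sum_{i=1}^{n}\int_{\text{phase } i}\lambda_i(t)\,dt\right)\right]

% Matrix form, using the solution of dP/dt = MP over one average flight
% of duration T, with the repair transition S_k applied before flight k.
\mathbf{P}_k \;=\; e^{MT}\, S_k\, \mathbf{P}_{k-1}
```

For a single element with states (healthy, failed), one consistent choice would be M = [[−λ, 0], [λ, 0]], with S_k equal to the identity when no check is made and [[1, 1], [0, 0]] when the element is checked and repaired, where λ is the mean failure rate over the flight.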
|
Step 3: Calculation of the "Average Probability per Flight" of a Failure Condition |
This is described in sub-paragraph “c”. The average of the probabilities on each of the N flights, which we will denote as Pave, is given (again in vector notation) by |
|
|
|
The Arsenal gives this formula for just the scalar value of the fully-failed component of the state vector, but it’s convenient to compute the probabilities for all the states. |
|
Step 4: Calculation of the "Average Probability Per Flight Hour" of a Failure Condition |
This is described in sub-paragraph “d”. The normalized average probability is defined as the average probability divided by the average flight length T, so |
|
|
|
The relevant component of the state vector is the one representing the joint failure of all the elements in the combination (cutset), i.e., the probability that the cutset is satisfied. The probability of the union of multiple cutsets is given by inclusion-exclusion, or just simple summation for rare events. As stated in the Arsenal, the resulting quantitative value should be used in conjunction with the hazard category/effect established by the hazard analysis to determine if it is compliant for the Failure Condition being analyzed. |
|
In summary, following these four steps, the quantity “Average Probability per Flight Hour” (i.e., the normalized average probability) for any given cutset is defined in the Arsenal AC and the EASA AMC and the FAA’s 2022 NPRM draft AC as |
|
|
|
with M and S denoting the mean failure rate matrix and the inspect/repair transition matrix respectively, T the average flight duration, and N the number of flights in the life of the airplane. This accounts for all scheduled inspection/repair strategies (including latencies), phase dependencies, required sequencing, etc. Example calculations are provided in the articles on “Probability for Regulatory Requirements” and “Latencies and Periodic Repairs”. |
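The four steps can be sketched in a few lines of Python for the simplest case of a single element. This is only an illustrative sketch, not the text of any AC: the function name, the assumption of a constant failure rate within each phase, and the `check_interval` parameter (the element is checked, and repaired if necessary, before every `check_interval`-th flight) are all hypothetical choices made for the example.

```python
import math

def average_probability_per_flight_hour(phase_rates, phase_hours,
                                        n_flights, check_interval):
    """Normalized average probability for one element (illustrative sketch).

    phase_rates    -- assumed constant failure rate in each phase (per hour)
    phase_hours    -- duration of each phase of the "Average Flight" (hours)
    n_flights      -- number of flights N in the life of the airplane
    check_interval -- element is checked good before flights 1, 1+m, 1+2m, ...
    """
    # Step 1: the "Average Flight" duration T.
    T = sum(phase_hours)

    # Integral of the failure rate over one flight (piecewise constant).
    exposure = sum(r * h for r, h in zip(phase_rates, phase_hours))
    q = 1.0 - math.exp(-exposure)  # P(fails during a flight | healthy at start)

    p_prior = 0.0
    total = 0.0
    for k in range(1, n_flights + 1):
        # A check before this flight resets the prior probability to zero.
        if (k - 1) % check_interval == 0:
            p_prior = 0.0
        # Step 2: probability the element is failed by the end of flight k.
        p_k = p_prior + (1.0 - p_prior) * q
        total += p_k
        p_prior = p_k

    # Step 3: average of the per-flight probabilities over N flights.
    p_ave = total / n_flights

    # Step 4: normalize by the average flight duration.
    return p_ave / T
```

With `check_interval=1` the element is treated as active (checked before every flight) and the result reduces to q/T. Larger values of `check_interval` model a latent element, whose average probability grows with the exposure time even though the failure rate is unchanged.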
|
|
II. Review of AC 25.1309-1B Appendix F, Post NPRM |
|
One of the comments on the 2022 NPRM draft AC was a request to include a worked example of the calculation defined in Appendix F. This was not a request to change the calculation, nor to change what was being calculated (which has been stable for decades and is unobjectionable), but merely to carry out the steps of the Arsenal calculation on a simple example, to dispel any possibility of misunderstandings. The FAA declined to provide a worked example of the steps of the Arsenal calculation, but instead substantially re-wrote the section, changing not only how the calculation is to be performed, but what it is attempting to calculate. |
|
The re-write seems to have been intended to change the definition of “Average Probability per Flight Hour” from the normalized average probability to an average failure rate, which is a very different quantity for latent failure conditions with long exposure times. This is ironic, because, in the new version of 25.1309 in the same NPRM, the risk of latent failure conditions is explicitly quantified by the probability of the condition being present on a flight, not just arising on a flight. Thus the motivation for re-writing Appendix F was based on a misconception. However, the problem with the re-written Appendix F in the published AC is not just that the change was apparently motivated by a fundamental misunderstanding (and that the very significant re-write circumvented public comment), but that the implementation of the change was mangled, so that the result isn’t just misguided, it is incoherent, for the reasons described below. |
|
The overview provided in the published F.1 is essentially the same as in the corresponding sections of the Arsenal, EASA, and NPRM versions, describing the calculation in terms of the same simple four-step process described in Part I of this article. However, the description of the third step in F.4 of the published AC has been drastically altered, and the alteration was mangled, such that it no longer parses as a meaningful mathematical calculation. The change is not subtle. For example, the NPRM version and the published version of the most affected section are shown below. |
|
|
The title of the third step presented in F.4 is still given in the overview section (F.1) as |
|
Calculate the average probability per flight of a failure condition. |
|
But, as shown in the extract above, in the published AC this is now inconsistent with the title of this step in F.4, which has been changed to |
|
Calculation of the “Probability per Flight” of a Failure Condition over a period of N flights. |
|
The word “average” has been removed from the title of this step, and from some parameter names as well, even though, in the Arsenal, the entire purpose of this step is simply to compute the average of the probabilities of the failure condition being present on each of the N flights. (See Part III for a discussion of the English meanings of the word “occur”.) The recurrence relation for computing the probability of the failure condition for each flight is provided in Section F.3, and the text of F.4 says “the probability of the failure condition for each flight … should be calculated, summed up, and divided by the number of flights during that period”. As explained in detail below, with the drastic changes introduced (post-NPRM) in the published AC, the section doesn’t even claim to be doing anything like this. |
|
Step 3 in the published AC has been split into two parts, depending on whether “the element is checked operative at the beginning of each flight”. However, the individual elements have already been treated in Step 2 of the Arsenal, whereas Step 3 of the Arsenal is dealing with the overall failure condition (or cutset), which in general consists of multiple elements, some of which may be latent and others active. It is not generally possible to classify a failure condition as either checked operative or not checked operative at the beginning of each flight. Neither the Arsenal/NPRM draft nor the EASA AMC is dealing with individual elements in this step; they simply compute the average of the probabilities of the entire failure condition for the N flights, as explicitly stated in the original title and text of the section. The published AC is doing something completely different in this step, contrary to the original title and purpose of the step. |
|
The next part of Step 3 in the published AC is mathematically incoherent. Recall that the recurrence relation provided in Step 2 for the probability of elements that comprise the overall failure condition gives Pk as a function of Pk−1. The average of these is found by summing the Pk values for k = 1 to N, and dividing by N. The published AC still does this as well -- but only for “active elements”. For the “latent” case (again, overlooking that this is an unintelligible bifurcation for overall failure conditions to be addressed in this step) it replaces the calculation of the average of the Pk values with the following expression (where we have denoted Pprior by Pk−1) |
|
|
|
This is mathematically senseless, because the left side is a definite value whereas the “k” on the right side is a free index, meaning it has no definite value. For any given flight, for which the probability of the failure condition is Pk, the value of Pk−1 is the probability of the failure condition at the end of the prior flight (reset for repairs), but this doesn’t specify which two consecutive flights it is referring to. It is a generic recursive relation, and k represents an unspecified index in the range from 1 to N. The recurrence relation, given in Step 2, enables us to compute all the values P1, P2, P3, …, PN, and we can then compute the average of these values, as prescribed in the Arsenal, NPRM, and EASA versions of the calculation. (Refer to the description in Part I above.) Choosing just one of these probabilities, without specifying which one, and dividing it by N, is both underspecified and nonsensical. Even if the published AC were revised by specifying a particular value of k, say k = N, the result would still not equal the average of the probabilities, which is what the Arsenal, NPRM, and EASA versions are calculating in this Step, and what even the published AC itself claims (in the overview) to be computing in this Step. |
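The gap between the average of the Pk values and a single Pk divided by N can be illustrated numerically. The sketch below does not reproduce the expression in the published AC. It simply applies the Step 2 recurrence to a single latent element that is never checked during the N flights, and compares the true average with the value obtained by picking k = N and dividing by N. The numerical values of `rate_times_flight` and `N` are hypothetical.

```python
import math

rate_times_flight = 1e-6  # hypothetical integral of the failure rate over one flight
N = 1000                  # hypothetical number of flights in the period

# Step 2 recurrence: P_k = P_{k-1} + (1 - P_{k-1}) * (1 - exp(-integral)),
# with no check or repair during the period, so the prior is never reset.
q = 1.0 - math.exp(-rate_times_flight)
p = []
prior = 0.0
for k in range(1, N + 1):
    prior = prior + (1.0 - prior) * q
    p.append(prior)

# What the Arsenal, NPRM, and EASA versions compute in Step 3.
average = sum(p) / N

# One particular P_k (here k = N) divided by N: a different quantity.
single_over_n = p[-1] / N

ratio = average / single_over_n
print(f"average of P_k = {average:.3e}")
print(f"P_N / N        = {single_over_n:.3e}")
print(f"ratio          = {ratio:.1f}")
```

Every choice of k gives a different value of Pk, and for small per-flight probabilities the average exceeds PN/N by a factor of roughly (N + 1)/2, so no single choice of k turns that expression into the average.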
|
The actual safety risk due to latent fault conditions depends not just on the rate at which the condition arises, but on the latency interval during which it persists. As noted above, this is explicitly acknowledged in Amendment 25-152 of 25.1309 by defining quantitative limits on the product of the failure rate and the exposure time for single latent faults. This represents the probability that a latent component is failed on the last flight of the latency interval, which is quite different than the probability that it fails during that flight. So, presumably, everyone agrees that the risk associated with a single latent fault cannot be characterized purely by the failure rate. (See Part III for more discussion of this.) |
|
But it should be emphasized (again) that in general the re-written “calculation” described in Appendix F of the published AC cannot even be carried out, because it isn’t an executable expression. Also, it calls for classifying failure conditions (cutsets) as either active or latent, whereas failure conditions are generally combinations of failures, some active and some latent. Note that this step cannot be dealing just with elements (despite what it says), because the computed quantity from this step, after simply dividing by average flight duration in Step 4, is the final result, compared with the hazard threshold to determine compliance of the overall failure condition. |
|
Hence the published AC is incoherent on multiple levels, drastically different from the previous versions, and can’t be used to make any compliance showings at all. The only way to “fix” it would be to revert to the Arsenal/NPRM/EASA version, which (as described previously) gives a simple, straightforward, and correct calculation of the normalized average probability in all possible circumstances for both active and latent faults with any possible repair strategies, phase dependence, etc. Indeed, harmonizing on this calculation was the ostensible purpose of the new rulemaking activity in the first place. Also, on an administrative level, it seems questionable whether such an extensive re-writing of the key definitions in Appendix F, attempting to introduce such a fundamental change (and disharmony), should be introduced without public comment. |
|
Another minor peculiarity (among many) of the published AC is the introduction of the “end of flight” notation for latent elements. This is odd because, even for active elements, the computed probability is for the end of the flight, i.e., if an element is checked good at the start of the flight, we compute the probability of the fault occurring by the end of the flight, taking the full flight length into account. The idea that “end of flight” applies uniquely to latent faults is a misconception, and it provides a strong hint as to where the re-write may have originated. |
|
On the subject of normalization, the original versions of AC 25.1309-1 and AC 25.1309-1A in 1982 and 1988 respectively each included a single, somewhat labored, sentence, different in the two versions, but both struggling to articulate some kind of exception to normalization in some vague and inconsistently-specified circumstances. This was based on confusion about the difference between rates and normalized probabilities, and lack of understanding of the purpose of normalization. In 2002 the assembled experts who developed the Arsenal version of the AC thoroughly discussed and intentionally deleted the errant sentence, and addressed normalization clearly and correctly in Appendix 3. Unfortunately, misunderstanding of normalization has persisted, and in the 2022 NPRM version of 25.1309-1B the following parenthetical statement was added, re-asserting this misunderstanding: |
|
…for failure conditions that are only relevant during a specific flight phase… the probability is calculated as an average probability per flight. To convert to “average probability per flight hour”, divide the per flight probability by one hour. |
|
Multiple comments were posted during the NPRM process, questioning this position and explaining why it makes no sense. The purpose of normalization is not to convert a probability to a rate, it is simply to scale the average per-flight probability by the average flight length of the airplane model, to place models of different flight lengths on an equal footing in terms of fatalities per passenger mile, so that a single numerical threshold (e.g., 1E-09/FH for extremely improbable) can be used for all models. (For details, see “Normalized Average Probability”, especially Note 2.) Inserting the above-quoted statement into the AC perpetuated a significant disharmony with EASA. The FAA did not substantively address these criticisms during the NPRM process. |
|
Note that the Arsenal calculation (i.e., the 4-step process described previously) correctly accounts for any distribution of failure rates for the flight phases, including cases when the rate is non-zero only in a single phase. The probability of the failure condition on a given flight depends only on the integral of the probability density over the flight, not on how that density is distributed within this flight. (This is not to be confused with the concept of specific risk, by which the distribution of risk between different flights over the life of an airplane is to be considered.) |
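A short sketch can make the point about phase distribution concrete for a single element. The function name, the phase durations, and the rates below are hypothetical, and a constant rate within each phase is assumed. The two rate profiles are chosen to have the same integral over the flight, one spread evenly and one concentrated entirely in the first phase.

```python
import math

def flight_probability(phase_rates, phase_hours, p_prior=0.0):
    """Probability that an element is failed by the end of a flight,
    assuming a constant failure rate within each phase (illustrative)."""
    exposure = sum(r * h for r, h in zip(phase_rates, phase_hours))
    return p_prior + (1.0 - p_prior) * (1.0 - math.exp(-exposure))

# Hypothetical "Average Flight": 0.5 h climb, 4 h cruise, 0.5 h descent.
phase_hours = [0.5, 4.0, 0.5]

# Same integrated rate (1e-5 per flight), distributed two different ways.
uniform_rates = [2e-6, 2e-6, 2e-6]       # failure can occur in any phase
single_phase_rates = [2e-5, 0.0, 0.0]    # failure can occur only in climb

p_uniform = flight_probability(uniform_rates, phase_hours)
p_single = flight_probability(single_phase_rates, phase_hours)

# Step 4 normalization uses the same average flight length in both cases.
T = sum(phase_hours)
print(p_uniform / T, p_single / T)
```

Both profiles give the same per-flight probability, and hence the same normalized value after dividing by the average flight length T. Nothing in the calculation calls for dividing the single-phase case by one hour instead.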
|
Fortunately, an advisory circular is generally regarded as a means, but not necessarily the only means, of showing compliance. Applicants may propose to continue to use the correct mathematical expressions for normalized average probability given in the draft Arsenal (and the NPRM) advisory circular and the EASA AMC until the mistakes introduced in the released version of AC 25.1309-1B are repaired. |
|
|
III. Discussion and General Remarks |
|
As mentioned briefly above, the re-writing of Appendix F seems to have been motivated by a desire to change it from a calculation of normalized average probability to a calculation of average failure rate, which, for latent conditions, is quite different and not representative of the actual safety risk. The updated 25.1309 in Amendment 25-152 itself clearly quantifies the risk of latent failure conditions as the product of rate and exposure time, which is the probability of being failed on the last flight of the interval, not the probability of failing on that flight. Nevertheless, one still encounters efforts to promote the idea of quantifying latent risk in terms of the rate alone, which can underestimate the safety risk by orders of magnitude for long exposure times. Here we provide some historical background on the origin of this misconception. |
|
Numerical probability analyses (NPA) were originally applied almost exclusively to catastrophic or hazardous failure conditions, which are generally not latent (even if they entail some latent elements). For such conditions, the probability of arising on a given flight is the same as the probability of being present at some point during the flight. However, several developments gradually led to the application of numerical probability calculations to totally latent failure conditions. First was the publication of 25.981 in 2001, which imposed very stringent numerical limits on latent failure conditions related to fuel tank ignition sources. (The limits on latent faults imposed by 25.981 are orders of magnitude more stringent than the limits imposed on latent faults for every other system on the airplane in the recent Amendment 25-152.) Second, the regulatory agencies began to push for the application of numerical analyses for Major failure conditions (the category less severe than Hazardous), some of which are totally latent. Third, there was an initiative to extend requirements similar to 25.981 to a wider range of systems, beyond just fuel tank ignition sources. |
|
The calculation defined in the Arsenal AC 25.1309 actually applies perfectly well to latent failure conditions, explicitly accounting for Pprior, etc., but some of the verbiage in the body of the text unnecessarily continued to be phrased in terms of active failure conditions, e.g., referring to the number of times a failure condition arises, as distinct from the number of flights during which it is present. This verbal distinction was immaterial when all top-level failure conditions were active, but it becomes significant when addressing completely latent failure conditions. Even when the Arsenal was drafted in 2002, some of the old verbiage remained. For example, when discussing the probability of a latent failure condition on a given flight, the Arsenal at one point refers to this as the probability that the element “fails” during that flight, whereas the quantity being calculated (including the Pprior term) is clearly the probability that the element “is failed” during that flight. This was pointed out in an NPRM comment, and the FAA accepted this and changed the statement to “is failed”. Unfortunately, this correction was not consistently made throughout the AC, so a few remnants of the old focus on active faults remain, but aside from the one slip which has been corrected, the verbiage in Appendix 3 itself covers both active and latent failure conditions. |
|
The Arsenal Appendix has 14 instances referring to the probability being calculated, and in 10 of those instances it refers to the “probability of the failure condition on a given flight”, which signifies that the failure condition is present on that flight. In one instance, as noted above, the text mistakenly referred to the probability that an element “fails” on a certain flight, but this was corrected in the NPRM to “is failed”. The remaining 3 instances refer to the failure condition occurring on a flight. Strictly speaking, the word “occur” applied to a condition or persistent entity (as distinct from an event) has the meaning “exist or be found to be present”, whereas when it’s applied to an event, it has the meaning “happen or take place”. So, these 3 instances are consistent with the other 11 instances. Moreover, in every case, the Arsenal equations unambiguously calculate the probability of the failure condition being present on a certain flight, explicitly including the Pprior term (which is zero for active conditions). So there is no ambiguity. |
|
Nevertheless, some applicants proposed to disregard the Arsenal and instead use the failure rate rather than the normalized average probability to quantify the risk of latent failure conditions. However, defining the thresholds as failure rates rather than normalized probabilities can grossly underestimate the risk of latent failure conditions. A latent failure condition can be present on a given flight without arising on that flight, so the exposure time is an essential part of evaluating the risk associated with a latent failure condition. |
|
To illustrate, suppose a certain latent condition (in which the airplane is one failure away from catastrophe) is predicted to arise 7 times in the life of the fleet, and each time it arises, it is present for 10 flights before being checked and repaired. This means there are 70 flights in the life of the fleet during which the airplane is one failure away from catastrophe. But now suppose the inspection is only performed every 10,000 flights. This means we expect 70,000 flights in the life of the fleet, during which the airplane is one failure away from catastrophe. The risk of catastrophe associated with this latent failure condition is now 1000 times as great, and yet the “failure rate” (i.e., the rate of arising) is exactly the same. Thus, the “rate” approach grossly underestimates the safety risk of latent faults with long exposure times. |
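The arithmetic of this illustration can be written out as a small sketch. The variable and function names are hypothetical, and the numbers are the ones used in the paragraph above.

```python
def exposed_flights(occurrences, flights_present_per_occurrence):
    """Number of flights in the fleet life on which the latent condition is present."""
    return occurrences * flights_present_per_occurrence

occurrences = 7  # times the latent condition arises in the life of the fleet

short_latency = exposed_flights(occurrences, 10)      # present 10 flights each time
long_latency = exposed_flights(occurrences, 10_000)   # present 10,000 flights each time

# The number of occurrences (and hence the rate of arising) is identical in
# both cases, but the number of flights flown one failure away from
# catastrophe differs by the ratio of the exposure times.
risk_ratio = long_latency // short_latency
print(short_latency, long_latency, risk_ratio)
```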
|
Ironically, in the recently amended version of 25.1309 (Amendment 25-152) the FAA has defined explicit requirements on latent fault conditions (that leave the airplane one failure away from a catastrophic condition), and they closely align with what would be required by stating that the latent condition must be Remote (with the actual Arsenal definition). However, instead of simply defining the requirement this way, the new regulation states that, in any Catastrophic single latent plus one (CSL+1) combination, the active part must be Remote and the latent part must satisfy λτ < 1/1000, where λ is the failure rate and τ is the exposure time. This represents the probability that the condition is present on the last flight in the latency period, which confirms that this is the only meaningful measure of the risk associated with latent faults. Of course, the combination of active plus latent must be extremely improbable (1E-09), so if the active part is at the Remote boundary (1E-05), the latent part will need to be 1E-04, and if we apply averaging and normalization, this typically results in a value near 1E-05, which is Remote. The point is that the only relevant measure of risk for latent fault conditions is the probability of being present, not the rate at which it arises, so even if the post-NPRM re-writing of Appendix F had succeeded in replacing the calculation of normalized average probability with a calculation of the failure rate, it would still have been fundamentally wrong, drastically under-estimating the safety risk associated with latent failure conditions. |
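One way to see how a latent part of 1E-04 could lead to a normalized value near 1E-05 is sketched below. This is an illustrative reading, not a derivation taken from the regulation: it assumes the probability of the latent fault grows roughly linearly over the exposure interval (so its average is about half its final value), and it assumes an average flight length T of about 5 hours, which is a hypothetical figure not given in the text.

```latex
% Probability that the latent element is failed on the last flight of the interval
\lambda\tau = 10^{-4}

% Average over the interval, assuming approximately linear growth from 0 to \lambda\tau
P_{\text{ave}} \approx \frac{\lambda\tau}{2} = 5\times 10^{-5}

% Normalization by an assumed average flight length T = 5 hours
\frac{P_{\text{ave}}}{T} \approx \frac{5\times 10^{-5}}{5\ \text{h}} = 10^{-5}\ \text{per flight hour}
```

Different assumed flight lengths shift the result somewhat, but for typical transport airplane flight lengths it stays in the neighborhood of the Remote threshold.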
|
One might argue that the mangling of Appendix F no longer matters, because latent failure conditions are now covered by a separate calculation that doesn’t invoke the usual probability ranges of 25.1309, and hence no longer depends on Appendix F. However, Major faults still need to be shown to be Remote using the AC criteria, and many such faults are latent. Moreover, the mistakes in the re-write of Appendix F effectively render it inapplicable to any failure condition, even active ones, at least if they contain any latent components. |
|
One feature of the Arsenal AC that has contributed to confusion is the statement in subsection c that “the principles of calculating are described below and also in more detail in SAE ARP 4761”. This was a bit of self-promotion by certain individuals on the ARAC who had also authored one particular section of the ARP, and were proud of it, and didn’t want it to be made superfluous. The situation was worsened in the NPRM draft by inserting the statement in F.3 that “The probability of a failure condition occurring on an average flight should be determined by structured methods (see SAE ARP 4761 for example methods)”. We’ve already noted the historical use of verbiage phrased in terms of active faults, and the meaning of the word “occur” for a persistent condition, and the fact that F.3 itself describes unambiguously how to compute the probability of a failure condition on a given flight. Indeed, this calculation is described in more detail in ARP 4761A, in the section on Markov models (not the section on fault trees), but nothing in the ARP supersedes the AC. It should be remembered that ARP 4761 is just an industry document (I was one of the authors of the recent update), and it includes a variety of generic notions and principles, but it does not (and wasn’t intended to, and legally couldn’t) provide regulatory requirements, definitions, or guidance. We can find “principles of calculating” a wide variety of things in the ARP (as well as many textbooks), but it doesn’t tell us what to calculate to show compliance to FAA regulations. The actual definitions of the quantitative compliance criteria for 14 CFR 25.1309 (used by 25.901c) are provided only in the FAA Advisory Circular 25.1309. The vague references to ARP 4761 should not be taken as an invitation to disregard the definitions spelled out unambiguously in the AC itself. |
|
Readers sometimes mistakenly point to Section G.11.1.5.5 of the ARP 4761A, which presents a (primitive and convoluted) calculation of the approximate rate at which a latent combination of two latent failures is entered (for which it coins the phrase “occurrence [sic] probability”, not used anywhere else in the document). This is not the same as the normalized probability in terms of which the FAA’s quantitative criteria for compliance to 25.1309(b) are defined. For the reasons explained above, the rate at which a latent fault condition arises does not represent the safety risk associated with latent failure conditions. It drastically underestimates that risk. So, that quantity isn’t relevant to 25.1309(b) compliance showings. |
|
Having said that, the ARP 4761A does contain some relevant information, but nothing that over-rules the definition of “Average Probability per Flight Hour” provided by the Arsenal Advisory Circular. For example, in Appendix G (carried over unchanged from Appendix D in the original release), section G.10.1 describes how to determine the minimal cutsets for a fault tree. The AC doesn’t describe this step because it assumes everyone knows it already, so it’s just background information for the sake of completeness. Once we have the minimal cutsets, the task is to evaluate the normalized average probability of each of them. This is fully defined in the Arsenal AC itself (as discussed in Part I above). If a reader wants more details on the “principles of calculating” this quantity, the generic aspects of it are discussed in Appendix I of ARP 4761A, which gives background on the general solution of the Markov state model for each cutset. Of course, this can also be found in numerous textbooks. Again, the ARP doesn’t discuss specialized regulatory concepts like normalized average probability, which isn’t a math textbook concept but a regulatory requirement related to fatalities per passenger mile. |
|
The quantitative criteria for compliance to 25.1309 according to the Arsenal and NPRM (and EASA) versions of the AC are expressed in terms of the normalized average probability. See “The Arsenal Companion” for a succinct introduction. |
|