The Arsenal Companion
Appendix 3 of the ‘Arsenal’ draft Advisory Circular (AC) 25.1309 defines the normalized average probability (also called “Average Probability per Flight Hour”) of a failure condition as

$$\text{Average Probability per Flight Hour} = \frac{1}{N\,T_F}\sum_{k=1}^{N} P_k \tag{1}$$
where $N$ is the number of flights in the life of a single airplane, $T_F$ is the duration of each flight, and $P_k$ is the probability of the failure condition on the $k$th flight. For a failure condition consisting of a single element on a flight with $n$ phases, the probability $P_k$ is given recursively (between detection/repairs) by

$$P_k = P_{k-1} + (1 - P_{k-1})\left[1 - \exp\left(-\sum_{i=1}^{n}\int_{t_{i-1}}^{t_i}\lambda_i(t)\,dt\right)\right] \tag{2}$$
where $P_{k-1}$ is the “prior” probability, i.e., the probability that the element is already failed at the start of the flight in question, and $\lambda_i(t)$ is the hazard rate function for the $i$th phase, extending from $t_{i-1}$ to $t_i$. (For the derivation of this formula, see the note on Hazard Rate Function.) If the element is checked and repaired before that flight, then $P_{k-1}$ is set to zero, which implies that the probability of the failed state at the end of the previous flight is transferred to the un-failed state. For a system that is checked and repaired every $m$ flights, the 2-state Markov model for a single flight in this simple case can be depicted as shown below.

[Figure: two-state Markov model for a single flight, with a transition from the healthy state to the failed state at rate λ(t)]
where $\lambda(t)$ denotes the piecewise aggregate failure rate function for the $n$ phases of the flight. It can be shown (see Note 5 in Normalized Average Probability) that the argument of the exponential function in equation (2) is simply $-\lambda_{\text{mean}} T_F$ if we define $\lambda_{\text{mean}}$ as the mean failure rate of the element during a flight, so equation (2) can be written as

$$P_k = P_{k-1} + (1 - P_{k-1})\left(1 - e^{-\lambda_{\text{mean}} T_F}\right)$$
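As a rough numerical illustration of this reduction, the sketch below assumes piecewise-constant failure rates within each phase (the hazard rate functions in the text are more general) and uses hypothetical phase durations and rates. The function names `mean_failure_rate` and `flight_update` are illustrative and are not taken from the AC.

```python
import math

def mean_failure_rate(phases):
    """Return (mean rate per hour, flight duration in hours).

    `phases` is a list of (duration_hours, rate_per_hour) pairs. Each rate is
    assumed constant over its phase, which is a simplifying assumption.
    """
    t_f = sum(duration for duration, _ in phases)
    # For constant rates, the integral of lambda(t) over the flight is a sum.
    integrated_rate = sum(duration * rate for duration, rate in phases)
    return integrated_rate / t_f, t_f

def flight_update(p_prior, lam_mean, t_f):
    """Probability that the element is failed by the end of the flight."""
    return p_prior + (1.0 - p_prior) * (1.0 - math.exp(-lam_mean * t_f))

# Hypothetical flight: climb, cruise, descent as (hours, failures per hour).
phases = [(0.5, 4.0e-5), (4.0, 1.0e-5), (0.5, 4.0e-5)]
lam_mean, t_f = mean_failure_rate(phases)

# Phase-by-phase survival product, for comparison with the mean-rate form.
survival = 1.0
for duration, rate in phases:
    survival *= math.exp(-rate * duration)

p_mean_form = flight_update(0.0, lam_mean, t_f)
p_phase_form = 1.0 - survival
print(lam_mean, p_mean_form, p_phase_form)
```

Under these assumptions the two forms agree to within floating-point rounding, since the product of the per-phase survival factors is the exponential of the summed exposure.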
Hereafter we will drop the “mean” suffix from $\lambda_{\text{mean}}$ and simply refer to the mean failure rate for a flight as $\lambda$. Letting $(P_k)_0$ and $(P_k)_1$ denote the probabilities of the healthy and failed states respectively for the $k$th flight, and recalling that $(P_k)_0 + (P_k)_1 = 1$, this equation is equivalent to the pair of relations

$$(P_k)_0 = e^{-\lambda T_F}\,(P_{k-1})_0, \qquad (P_k)_1 = \left(1 - e^{-\lambda T_F}\right)(P_{k-1})_0 + (P_{k-1})_1$$
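In matrix form these equations can be written as

$$\begin{bmatrix}(P_k)_0\\ (P_k)_1\end{bmatrix} = \begin{bmatrix} e^{-\lambda T_F} & 0\\ 1-e^{-\lambda T_F} & 1\end{bmatrix}\begin{bmatrix}(P_{k-1})_0\\ (P_{k-1})_1\end{bmatrix}$$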
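The coefficient matrix is given in terms of the mean failure rate matrix $M$ by

$$\begin{bmatrix} e^{-\lambda T_F} & 0\\ 1-e^{-\lambda T_F} & 1\end{bmatrix} = e^{M T_F}, \qquad M = \begin{bmatrix}-\lambda & 0\\ \lambda & 0\end{bmatrix}$$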
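One way to check this relation, assuming the mean failure rate matrix of the two-state model has rows $(-\lambda,\ 0)$ and $(\lambda,\ 0)$, is to note that $M^2 = -\lambda M$, so that $M^j = (-\lambda)^{j-1} M$ for $j \ge 1$ and the exponential series collapses:

$$e^{M T_F} = I + \sum_{j=1}^{\infty}\frac{T_F^{\,j}}{j!}M^j = I + \frac{1 - e^{-\lambda T_F}}{\lambda}\,M = \begin{bmatrix} e^{-\lambda T_F} & 0\\ 1-e^{-\lambda T_F} & 1\end{bmatrix}$$

Each column sums to 1, as it must for the total probability to be conserved over a flight.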
So, letting $P_k$ denote the state vector $[(P_k)_0, (P_k)_1]^T$, equation (2) can be written as

$$P_k = e^{M T_F}\,P_{k-1}$$
which is simply the solution, over one flight of duration $T_F$, of the homogeneous state equation $dP/dt = MP$. This equation applies provided no inspection/repairs occur between the end of the $(k-1)$th flight and the beginning of the $k$th flight. If the element is checked and repaired prior to a given flight, we replace the prior state vector with $[1, 0]^T$. More generally, if the element is checked and repaired every $m$ flights, we replace the prior state vector with $[1, 0]^T$ whenever $k-1$ is a multiple of $m$. In matrix form, this reset is accomplished by multiplying the state vector at the end of the $(k-1)$th flight by

$$\begin{bmatrix}1 & 1-\alpha\\ 0 & \alpha\end{bmatrix}$$
where $\alpha = 0$ if $k-1$ is a multiple of $m$, and otherwise $\alpha = 1$. Therefore, taking the periodic repairs into account, the probability of the failure condition for the $k$th flight is given recursively by

$$P_k = e^{M T_F}\begin{bmatrix}1 & 1-\alpha\\ 0 & \alpha\end{bmatrix}P_{k-1}$$
This covers both active and latent failures. For an element that is checked and repaired prior to each flight, we set $m = 1$, meaning the state vector is reset prior to each flight. Writing the state equations in matrix form makes it immediately apparent how they apply unchanged to failure combinations of any number of elements (not just a single element), and how they account for any specified periodic inspection/repair intervals. The matrix form also accounts for any “required order” effects. An example for a failure combination of three components is shown in another article.
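The recursion with periodic resets can be sketched numerically as follows. This is a minimal illustration for a single element with a constant mean failure rate, not the AC's own procedure. The failure rate, flight length, number of flights and inspection interval are hypothetical values, the function names are illustrative, and the element is assumed to be healthy when new.

```python
import math

def flight_matrix(lam, t_f):
    """Single-flight coefficient matrix exp(M*T_F) for the two-state model."""
    e = math.exp(-lam * t_f)
    return [[e, 0.0], [1.0 - e, 1.0]]

def reset_matrix(alpha):
    """alpha = 0 moves all probability to the healthy state; alpha = 1 is the identity."""
    return [[1.0, 1.0 - alpha], [0.0, alpha]]

def mat_vec(a, v):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [a[0][0] * v[0] + a[0][1] * v[1],
            a[1][0] * v[0] + a[1][1] * v[1]]

def normalized_average_probability(lam, t_f, n_flights, m):
    """Average failed-state probability over n_flights, per flight hour."""
    a = flight_matrix(lam, t_f)
    state = [1.0, 0.0]  # assumption: healthy when new
    total = 0.0
    for k in range(1, n_flights + 1):
        # Reset before flight k whenever k-1 is a multiple of m.
        alpha = 0.0 if (k - 1) % m == 0 else 1.0
        state = mat_vec(a, mat_vec(reset_matrix(alpha), state))
        total += state[1]
    return total / (n_flights * t_f)

# Hypothetical values: rate per hour, flight hours, flights in the airplane life.
lam, t_f, n_flights = 1.0e-5, 5.0, 10000
active = normalized_average_probability(lam, t_f, n_flights, m=1)
latent = normalized_average_probability(lam, t_f, n_flights, m=100)
print(active, latent)
```

With these hypothetical numbers, the per-flight check (`m=1`) gives a result close to the failure rate itself, while checking every 100 flights gives a value roughly fifty times larger, reflecting the flights on which the failure is already present.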
Historically, numerical probability analysis methods for airworthiness certification were applied mainly to catastrophic or hazardous failure conditions involving obvious effects, and it’s clear that the probability of a catastrophic failure condition arising on a given flight is the same as the probability of it being present during that flight. The Arsenal AC quantifies the risk of a failure condition in terms of the (normalized average) probability of a failure condition during a given flight (or, equivalently, the occurrence of a failure condition), as distinct from the probability of occurrence of the failure event. Recall that the English word “occur” has two meanings, depending on the object. When applied to a discrete event, “occur” means “to happen”, whereas when applied to a condition, “occur” means “to exist or be present”.
The regulation and Advisory Material are written in terms of the failure condition (rather than event) because the probability of a latent failure condition arising on a given flight is very different from the probability of it being present on that flight. Indeed, the unique risk posed by latent failure conditions is precisely their continued presence on flights after the flight on which they first arise. A meaningful assessment of the risk associated with a latent failure condition must account not just for the expected number of flights during which the condition arises, but for the expected number of flights during which the condition is present, because this represents the exposure to a subsequent fault that results in a catastrophic condition.
As explained above, the numerical probability criteria described in the referenced Arsenal AC 25.1309 account for the difference between active and latent faults by explicitly specifying that if the state of a failure condition is unknown at the beginning of a certain flight, then the probability of the condition during that flight equals the probability of the condition arising on that flight plus the probability that the condition is already present at the beginning of the flight (having arisen on a previous flight and not been repaired). Thus, the FAA requirement for a Major latent failure to be Remote per the Arsenal AC terminology means that the normalized average probability as computed above must be on the order of 1E-05/FH or less.
In the simple case of a single element with a latent exposure time $T_L$ and constant failure rate $\lambda$, the Arsenal calculation gives a normalized average probability (for rare events) of approximately $[(1/2)/T_F](\lambda T_L)$. The factor in square brackets represents the averaging and normalization. If this latent failure condition were required to be Remote per the Arsenal definition, then for a fleet with average flight length $T_F = 5$ hours the requirement would be $\lambda T_L < $ 1E-04. This would be more restrictive, by a factor of ten, than the EASA requirement and the proposed future FAA requirements (judging from draft NPRM materials) for dealing with latent faults, which impose a “1/1000” requirement defined as $\lambda T_L < $ 1E-03. One might wonder why Amendment 152 of the regulation adds this requirement when it is less restrictive than the pre-existing requirement for a Major fault to be Remote. (It is assumed that a failure that leaves the airplane one failure away from a catastrophic failure represents a significant reduction in safety margin, which is Major by definition.) The likely explanation is that the Remote requirement is a fleet average requirement, whereas the 1/1000 requirement applies to the last flight in the latency period. The ARAC in 2010 was chartered to address “worst case” scenarios rather than fleet average, so this may well have been the motivation. However, in this case the fleet average requirement is more restrictive than the “last flight” requirement, so the usefulness of the latter is questionable.
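The arithmetic in this comparison can be checked with a short sketch. It assumes the latent exposure time is a whole number of flights, $T_L = m T_F$, with the element known to be healthy at the start of each exposure period. The failure rate and inspection interval are hypothetical values chosen so that $\lambda T_L$ sits at the 1E-04 bound.

```python
import math

def average_probability_per_fh(lam, t_f, m):
    """Exact average of 1 - exp(-lam*t_f*j) over flights j = 1..m, per flight hour."""
    total = sum(1.0 - math.exp(-lam * t_f * j) for j in range(1, m + 1))
    return total / (m * t_f)

t_f = 5.0        # average flight length in hours (value used in the text)
m = 200          # hypothetical inspection interval in flights
lam = 1.0e-7     # hypothetical failure rate per hour
t_l = m * t_f    # latent exposure time: 1000 hours

exact = average_probability_per_fh(lam, t_f, m)
approx = 0.5 * lam * t_l / t_f           # rare-event approximation from the text

# Bound on lam*T_L implied by Remote (1E-05/FH) for this flight length.
remote_bound = 2.0 * t_f * 1.0e-5
one_in_thousand_bound = 1.0e-3
print(exact, approx, remote_bound, one_in_thousand_bound)
```

For these values the exact average and the approximation differ by well under one percent, and the Remote criterion bounds $\lambda T_L$ at one tenth of the “1/1000” limit.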
More generally, consider a case of three individually latent components, and suppose that if all three of those components are failed, a catastrophic event may occur. This is obviously not a latent failure condition, even though the three elements are individually latent. Having any one or two of them failed is a latent condition, but the third one to fail is not latent (under the stated assumptions). This is the kind of situation that is treated in sample calculations in the fault tree section of ARP 4761A. In such cases, the overall failure condition obviously cannot be carried over from a prior flight, so, in effect, the calculation includes a per-flight check/repair of the overall failure condition, and the relevant probability is the probability of transitioning into the fully-failed state during a flight.
However, this is not a latent failure condition, even though it consists entirely of individually latent failures. To perform a “latent plus one” calculation on such a system (in addition to showing that the overall condition of all three elements being failed meets 1E-09), we need to show that any combination of two of those failures (which leaves us one failure away from a catastrophic condition) is Remote. Those two-failure combinations are genuinely latent, and can persist for multiple flights, so they must be treated per the Arsenal calculation for latent failures. In contrast, for control system failures involving loss of two parallel alert lights and the loss of (say) overspeed protection, the entire combination of three faults is totally latent, and can persist for many flights. Indeed, this is the entire problem posed by the latency, i.e., it allows dispatch in a non-dispatchable condition. Therefore, the relevant probability is the probability of this condition being present on a given flight, not the probability that it arises on that flight. This is a valid measure of the risk posed by this truly latent failure condition.
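The difference between a condition being present and a condition arising can be illustrated for a two-failure latent combination. The sketch below is a simplified illustration, not the AC's or the ARP's worked example. It assumes two independent elements with constant failure rates, separate periodic checks, no “required order” effects, and both elements healthy when new. All numerical values and function names are hypothetical.

```python
import math

def failed_probabilities(lam, t_f, n_flights, m):
    """Return a list of (start, end) failed-state probabilities, one pair per flight."""
    out = []
    p = 0.0
    for k in range(1, n_flights + 1):
        if (k - 1) % m == 0:
            p = 0.0  # checked and repaired before this flight
        start = p
        p = start + (1.0 - start) * (1.0 - math.exp(-lam * t_f))
        out.append((start, p))
    return out

def pair_average_per_fh(lam_a, lam_b, t_f, n_flights, m_a, m_b):
    """Average per-flight-hour probability of both elements being failed.

    Returns (present, arising): 'present' uses the end-of-flight probability of
    the combination, while 'arising' counts only the probability of entering
    the combined state during the flight.
    """
    a = failed_probabilities(lam_a, t_f, n_flights, m_a)
    b = failed_probabilities(lam_b, t_f, n_flights, m_b)
    present = sum(ea * eb for (_, ea), (_, eb) in zip(a, b))
    arising = sum(ea * eb - sa * sb for (sa, ea), (sb, eb) in zip(a, b))
    scale = n_flights * t_f
    return present / scale, arising / scale

# Hypothetical rates (per hour) and inspection intervals (in flights).
present, arising = pair_average_per_fh(1.0e-6, 2.0e-6, 5.0, 6000, 100, 200)
present_m1, arising_m1 = pair_average_per_fh(1.0e-6, 2.0e-6, 5.0, 6000, 1, 1)
print(present, arising, present_m1, arising_m1)
```

When both elements are checked before every flight the two measures coincide, whereas with longer inspection intervals the probability of the combination being present exceeds the probability of it arising, which is the distinction drawn in the text.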
The Arsenal AC refers to ARP 4761 for additional background on the calculation, but the ARP doesn’t actually define the normalized average probability of occurrence of any failure condition, let alone a latent failure condition. However, Appendix I of ARP 4761A, which covers Markov models, does include some discussion of the Arsenal calculation of the average probability of occurrence of a failure condition. That material is consistent with the Arsenal calculation method described above.