Field Failure Early Warning System

Gregg Kittlesen




A presentation at the recent Avionics Maintenance Conference (AMC) provides an interesting case study to test various early warning systems for the detection of rising field failure issues. Given the large number of organizational interfaces involved in any field failure and the costs associated with any investigation, many field failure issues fester for a significant period of time before the commitment is made to identify the root cause of failure. A defined issue detection process will provide data to decision makers, support the development of organizational knowledge databases, and enhance customer satisfaction. Limited information for this case study is publicly available. The intent of this article is to compare various possible issue detection threshold processes. Some physics of failure considerations will also be presented.

Case Study Description

At the yearly AMC, avionic maintenance and reliability issues are discussed among representatives from more than 40 carriers, several aircraft manufacturers, and more than 100 avionic equipment suppliers. All parties benefit from the opportunity to share information and the presentation of findings.

Three reliability charts were provided on experience with an in-flight entertainment system over several years. Monthly quantities of liquid crystal display (LCD) monitor removals (2 to 75) and mean time between unscheduled replacement (~170,000 to ~10,000 hours) were presented for a three-year period. Monthly quantities of LCD monitor printed circuit board assembly (PCBA) removals (0 to 70) were presented for a five-year period; see Figure 1. Additionally, a “time since installation” (0 to 21,000 hours) failure distribution was presented. Over 90% of the failures were reported to have occurred between 14,000 and 21,000 hours since installation. Over 68% of the failures were reported to have occurred between 17,000 and 21,000 hours since installation. The carrier had been engaged with the avionics equipment supplier on this early wear-out issue for a period of time. The avionics equipment supplier was investigating the root cause of the issue in order to provide a corrective action.

Analysis Approach

Actual LCD monitor PCBA removed quantities by month were listed in the AMC agenda. The total number of LCD monitor PCBA in use, 2508, was calculated from publicly available information. All monitors entered into service within a period of less than 2 years. For the purposes of this study comparing various issue detection threshold processes, and in the absence of more detailed time to failure/suspension data, all monitors were considered to be placed in service in June 2004. This study does not have sufficient information to support conclusions regarding the significant decrease in LCD monitor PCBA removals during the most recent period of Nov. 2010 through Jan. 2011. However, the carrier reported their highest LCD monitor removal quantities uniformly during the most recent reported period of Sep. through Dec. 2010.

Actual vs. Predicted Field Failure Rates

Some organizations employ a “parts count methodology” to obtain a predicted field failure rate for each replaceable unit. The methodology may be based upon a published standard (e.g., MIL-HDBK-217, Telcordia SR-332) or internal data of component field failure rates. Failure rate adjustments can be made for factors such as deployment environment, quality, temperature, voltage, and duty cycle. Generally, steady-state conditions are assumed, i.e., that failure rates are constant in time. Given the breadth of choices available for parts count predictions (standard, issue¹, internal data, correction factors), a wide range of parts count failure predictions is encountered among various organizations for comparable PCBA designs. Each supplier organization develops institutional knowledge for their designs, their parts count practices, and their subsequent replaceable unit field failure rates relative to the predicted field failure rates. It is suggested that user organizations consider setting a threshold to initiate tracking of troubleshooting results for each replaceable unit removal. An appropriate threshold for many applications and technologies may be 50% of the predicted field failure rate.

The LCD monitor PCBA parts count failure rate prediction is likely in the range of 100 to 400 FIT (1.0e07 to 2.5e06 hours MTBF). This estimate would predict 0.09 to 0.35% failures per year, 2 to 9 PCBA failures per year for the deployed population in this case study.
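The arithmetic behind these figures can be reproduced with a short sketch. It assumes a constant failure rate and continuous operation (8,760 powered hours per year), which is the assumption that appears to match the percentages quoted above; the function names are illustrative only.

```python
HOURS_PER_YEAR = 8760  # assumption: units are powered continuously


def fit_to_mtbf_hours(fit):
    """Convert FIT (failures per 1e9 device-hours) to MTBF in hours."""
    return 1e9 / fit


def annual_failure_fraction(fit, hours_per_year=HOURS_PER_YEAR):
    """Approximate fraction of units failing per year at a constant rate."""
    return fit * 1e-9 * hours_per_year


def expected_annual_failures(fit, population, hours_per_year=HOURS_PER_YEAR):
    """Expected number of failures per year for a deployed population."""
    return annual_failure_fraction(fit, hours_per_year) * population


POPULATION = 2508  # LCD monitor PCBA in use, from the case study

for fit in (100, 400):
    print(
        f"{fit} FIT: MTBF = {fit_to_mtbf_hours(fit):.2e} h, "
        f"{100 * annual_failure_fraction(fit):.2f}% per year, "
        f"{expected_annual_failures(fit, POPULATION):.1f} failures per year"
    )
```

A lower duty cycle would reduce the expected counts proportionally, so the hours-per-year parameter should be adjusted if actual operating hours are known.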

The monthly LCD PCBA removal quantities presented in Figure 1 were used to calculate rolling yearly removals in Figure 2.
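A rolling yearly total is a trailing 12-month sum of the monthly counts. The following is a minimal sketch of that calculation; the monthly values are hypothetical placeholders and are not the Figure 1 data.

```python
def rolling_yearly_removals(monthly, window=12):
    """Return trailing-window sums of monthly removal counts.

    The first result covers months 0..window-1; each later result
    slides the window forward by one month.
    """
    totals = []
    running = 0
    for i, count in enumerate(monthly):
        running += count
        if i >= window:
            running -= monthly[i - window]  # drop the month leaving the window
        if i >= window - 1:
            totals.append(running)
    return totals


# Hypothetical monthly removal counts (illustration only)
monthly = [0, 1, 0, 2, 1, 0, 0, 3, 1, 2, 0, 1, 4, 2, 5]
print(rolling_yearly_removals(monthly))
```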

For this relatively small PCBA and limited deployment, a sustained removal rate of 1 unit per month (12 per year, or about 0.48% of the population) exceeds the expected upper limit of 0.35% removals per year. Beginning in May 2006, there were no 12-month historical periods with fewer than the lower estimate of 2 PCBA removals. Beginning in May 2008, there were no 12-month historical periods with fewer than the upper estimate of 9 PCBA removals. Generally, removals of replaceable units are caused by a variety of issues.

However, if a series of long runtime units are removed due to a problem with one component, an emerging component wear-out issue may exist. It is suggested that a risk assessment be initiated when the observed replaceable unit removal rate exceeds 50% of the predicted removal rate. If the LCD Monitor PCBA predicted failure rate is 250 FIT (midpoint of the estimate 100 to 400 FIT), then the 50% threshold would be 2.7 PCBA removals per year. At such a low number of removals per year, this case study of a moderate population of an assembly with a low predicted failure rate illustrates the challenge of selecting an appropriate threshold to escalate actions. User organizations may choose to vary the threshold percentage at which to initiate a risk assessment based on factors such as population size, technology, median deployment time, and risk exposure. The risk assessment should review historical records regarding problems identified and formalize a tracking procedure to capture problems encountered on future removals. The greatest benefit of this process is early detection of an emerging component wear-out issue.
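The threshold arithmetic can be sketched as follows, again assuming a constant failure rate and 8,760 operating hours per year. The helper names and the rolling totals in the usage lines are hypothetical.

```python
def risk_assessment_threshold(predicted_fit, population,
                              fraction=0.5, hours_per_year=8760):
    """Removals per year at which a risk assessment would be initiated.

    fraction is the share of the predicted failure rate used as the
    trigger (50% is the value suggested in the text).
    """
    return fraction * predicted_fit * 1e-9 * hours_per_year * population


def first_exceedance(rolling_totals, threshold):
    """Index of the first rolling yearly total above the threshold, or None."""
    for i, total in enumerate(rolling_totals):
        if total > threshold:
            return i
    return None


threshold = risk_assessment_threshold(predicted_fit=250, population=2508)
print(f"Threshold: {threshold:.1f} removals per year")

# Hypothetical rolling yearly totals (illustration only)
print(first_exceedance([0, 1, 2, 3, 5], threshold))
```

Passing a different fraction lets a user organization tune the trigger for population size or risk exposure, as discussed above.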

Cumulative Failure Distributions

This case study involves a moderate population of devices deployed by one user within a short period of time relative to the intended service lifetime. As stated in the case study description, over 90% of the failures had more than 14,000 hours service time. For the review of field returns from populations deployed over long periods of time, perhaps by multiple users, it is important to estimate the time to failure of each unit removed. In this case study the information available was limited to the number of PCBA removed each calendar month over a period of 5 years (Mar. 2006 to Jan. 2011). The PCBA were deployed beginning in Jul. 2003. Half the population appears to have been deployed by mid-2004. The complete population appears to have been deployed by late 2005 or early 2006. Times to PCBA removal were estimated by assuming a common deployment date of June 2004.

Cumulative removal distributions were computed as of Feb. 2007 and forward in 6 month intervals to Feb. 2011. The data were fitted with Weibull distributions. Each set of Weibull parameters (beta, eta) was used to predict future cumulative removal percentages at 6 month intervals for the following 2 years as presented in Table 1.
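The fitting method used for Table 1 is not stated here. One common approach, shown below as a sketch only, is a least-squares line through the linearized Weibull plot of cumulative removal fraction versus age, followed by extrapolation with the fitted parameters. The input data are synthetic values generated from assumed parameters, not the case study data, and the function names are illustrative.

```python
import math


def fit_weibull(ages_hours, cum_fractions):
    """Fit beta and eta by least squares on the Weibull plot.

    Uses ln(-ln(1 - F)) = beta * ln(t) - beta * ln(eta).
    """
    xs = [math.log(t) for t in ages_hours]
    ys = [math.log(-math.log(1.0 - f)) for f in cum_fractions]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_x  # equals -beta * ln(eta)
    eta = math.exp(-intercept / beta)
    return beta, eta


def weibull_cdf(t, beta, eta):
    """Cumulative fraction removed by age t."""
    return 1.0 - math.exp(-((t / eta) ** beta))


# Synthetic observations from assumed parameters (illustration only)
ages = [10000.0, 15000.0, 20000.0]
observed = [weibull_cdf(t, 3.0, 40000.0) for t in ages]

beta, eta = fit_weibull(ages, observed)
print(f"beta = {beta:.2f}, eta = {eta:.0f} h")

# Predict cumulative removals at 6-month steps (about 4,380 h each) for 2 years
for step in range(1, 5):
    t = ages[-1] + 4380.0 * step
    print(f"{t:.0f} h: {100 * weibull_cdf(t, beta, eta):.1f}% cumulative removals")
```

Because most of the population is still in service at each analysis date, a maximum likelihood fit that treats surviving units as suspensions would be a reasonable alternative.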

The Weibull shape parameter, beta, indicates how the failure rate is changing in time. A beta value less than 1 indicates that the failure rate is decreasing with time. A beta value of 1 indicates that the failure rate is constant, suggesting random failures predominate. A beta value greater than 1 indicates that the failure rate is increasing with time, suggesting a wear-out mechanism. The Weibull scale parameter, eta, indicates the time to 63% cumulative failures.
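For reference, these statements follow from the standard two-parameter Weibull forms, with t the time in service:

```latex
F(t) = 1 - \exp\!\left[-\left(\frac{t}{\eta}\right)^{\beta}\right], \qquad
h(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}, \qquad
F(\eta) = 1 - e^{-1} \approx 0.632
```

The hazard rate h(t) falls with time for beta less than 1, is constant at 1/eta for beta equal to 1, and rises with time for beta greater than 1.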

The simplifying assumption of a common deployment date for all PCBA increases the runtime of some PCBA removals and inflates the value of the Weibull beta parameter. Nevertheless, the rather high beta value of 3 should be sufficient to flag a potential issue and track the reasons for future removals. The beta value was steady over the period Feb. 2007 to Feb. 2008 and thereafter increased considerably.

Physics of Failure Considerations

During the AMC meeting, the equipment supplier informed the airline carrier that the failures were caused by an aluminum electrolytic capacitor failing due to high temperature. The in-flight entertainment application is expected to be affected by static airflow, reduced air density, and power cycling. There is also some information regarding system boot issues, possibly exacerbating the power cycling stresses, during the time period when the failure rate was observed to increase dramatically. Aluminum electrolytic capacitor lifetime is generally decreased at high temperature and by power spikes.
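The temperature sensitivity can be illustrated with the rule of thumb widely quoted for aluminum electrolytic capacitors, in which expected life roughly doubles for each 10 °C reduction in operating temperature below the rated temperature. The sketch below applies that rule only; it ignores ripple current, applied voltage, and power cycling, it is not the supplier's analysis, and the rated life and temperatures are hypothetical.

```python
def estimated_life_hours(rated_life_hours, rated_temp_c, operating_temp_c):
    """Estimate capacitor life with the 10-degree doubling rule of thumb."""
    return rated_life_hours * 2.0 ** ((rated_temp_c - operating_temp_c) / 10.0)


# Hypothetical part: rated for 2,000 h at 105 C
for temp_c in (65, 75, 85):
    life = estimated_life_hours(2000, 105, temp_c)
    print(f"{temp_c} C: about {life:,.0f} h")
```

Under these assumed values, a 10 °C rise in capacitor temperature halves the estimated life, which shows how restricted airflow could move wear-out into the range of service hours discussed above.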


The airline industry is to be commended for supporting a forum in which maintainability issues can be addressed by the affected parties, with the potential to limit the impact on other parties through the sharing of information. Two reactive approaches to early detection of emerging field failure issues have been presented. One strategy is to plot rolling yearly removals of replaceable units from a fixed population, with a predetermined threshold identified to initiate a risk assessment. A threshold of 50% of the predicted failure rate is suggested. For populations that are growing through increased deployment, rolling yearly removals should be tracked as a percentage of the deployed population. A second strategy is to periodically compute failure distributions. The magnitude of the failure distribution shape parameter and changes in the shape parameter can be used to justify the escalation of failure analysis activities.

A proactive approach to field failure risk mitigation through the use of automated design analysis² has been presented previously.

1 Significant updates to component field failure rates were recently published in SR-332 Issue 3 (Jan. 2011) and previously in SR-332 Issue 2 (Sep. 2006).
2 See




