Reliability engineering is a peculiar profession. Within most organizations, it can be viewed as the annoying traffic cop, having too much control over the engineering process. Often insufficiently scientific for academia (can you count the total number of reliability engineering departments? 10? 20?) and excessively pedantic for industry. And the practice/occupation itself can sometimes seem hopelessly divided between number-obsessed statisticians and touch it/taste it/smell it physicists (or something equivalent).
At the same time, every organization that designs, makes, or sells electronics has come to realize that reliability is a critical attribute for market success. North American/European/Japanese OEMs repeatedly justify a higher purchase price based primarily on higher reliability (or, at the very least, the claim of higher reliability). Lower cost newcomers have to demonstrate reliability to get beyond the limited percentage of the business-to-business market that cares only about price. In a surprising number of industries, even consumer, a significant share of time and resources goes into reliability assurance activities (design evaluations, simulation/modeling, prediction, testing, DFMEAs, etc.). By some estimates, over 10% of product costs can be traced directly back to the need to demonstrate reliability. This includes simulations (mechanical/electrical/thermal), FMEAs, FTAs, supplier evaluations, test and measurement, and warranty analysis.
— As a quick side note, I will admit to being a reliability snob and placing quality and safety under reliability. Quality is typically performance at time zero, where reliability is performance over the lifetime of the product. Safety is just reliability with a failure mode. We can agree to disagree. —
So, where does this schizophrenic environment leave us with reliability engineering? The answer can be best framed within the three goals of reliability engineering: risk mitigation, risk prediction, and risk communication.
There are a number of tools within the traditional reliability engineer’s toolbox for risk mitigation. These include failure mode and effects analysis (FMEA), fault tree analysis (FTA), reliability block diagrams, derating, and reliability growth analysis. The advantage of these approaches is that they can be easily applied across different markets. For this reason, these tools tend to be preferred by the ‘number-obsessed statisticians’ side of the discipline. In all the activities listed above, the reliability engineer tends to be a facilitator, and successful execution does not necessarily require in-depth knowledge of the product or industry.
The geek squad side of the reliability house takes one look at this list of risk mitigation activities and doesn’t see the point. They see nebulous activities that provide limited return on investment. As the head of the reliability department at a major consumer electronics company told me: “We tried DFMEA once. By the time it was done, our product had been on the market for three months. What’s the point?” Instead, these physics-types look to gather more fundamental knowledge through activities such as thermal simulation, electromagnetic compatibility (EMC) analysis, and the assessment of test coverage.
But nowhere are the battle lines drawn more starkly than in the practice of risk prediction and communication, and specifically around mean time between failures (MTBF). The very concept of MTBF, and most of our reliability engineering tools, came out of the disastrous performance of military electronics during World War II. As a boy, I can remember reading numerous World War II novels and autobiographies where the main character curses the failure of the torpedo/radar/engine/etc. By requiring some minimum MTBF number, end users could guarantee the performance of their systems.
There are two big complaints regarding MTBF: how it is determined (prediction) and how it is interpreted (communication). In the electronics industry, MTBF has historically been calculated using empirical prediction handbooks. The mother of all of these handbooks is RCA’s TR-1100 Reliability Stress Analysis for Electronic Equipment, published in 1956. This spawned MIL-HDBK-217, which spawned Telcordia SR-332, IEC-62380, FIDES, and numerous other publications claiming to have the secret to reliability prediction. The failure rates within these handbooks are based on ‘best available’ historical field failure rate data with some ‘simplifying assumptions’.
These handbooks have been shown to have several severe flaws. The original purpose of MIL-HDBK-217 was not reliability prediction, but to provide a basis for evaluating competing designs. The predictions tend to be overly conservative (one director of engineering described how reliability predictions using MIL-HDBK-217 or SR-332 are divided by 10 or 3, respectively, before being used for planning or marketing purposes). They cannot respond to the pace of technology (parts become obsolete in the time required to gather sufficient field data to make a prediction). They tend to ignore the mechanical elements of the design (a CPU in a leaded package is viewed as the same as a CPU in an area array package). And they assume that the failure rate is constant over time, which is not always the case (see the two life curves from the same manufacturer).
Other flaws include providing no motivation for failure avoidance other than reducing temperature or removing parts, and producing predictions that are far too easy to manipulate (true story: we once improved the MTBF of a product by 10X through various tweaks in our calculation).
MTBF is also easily misinterpreted by the outside world. An MTBF of 50 years can sound really impressive until you realize that, at a constant failure rate, it means roughly 18% of units fail within 10 years (MTBF is not the minimum time to failure).
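The arithmetic behind that figure can be sketched in a few lines of Python, assuming the constant failure rate (exponential) model that handbook-style MTBF numbers rest on. The function name is mine and purely illustrative.

```python
import math

def fraction_failed(mtbf_years: float, t_years: float) -> float:
    """Expected fraction of units failed by time t.

    Assumes a constant failure rate, so reliability is R(t) = exp(-t / MTBF).
    """
    return 1.0 - math.exp(-t_years / mtbf_years)

# A 50-year MTBF evaluated at 10 years of service
print(f"{fraction_failed(50, 10):.1%}")  # 18.1%

# At t = MTBF, well over half of the population has already failed
print(f"{fraction_failed(50, 50):.1%}")  # 63.2%
```

The second line is the one that tends to surprise people: under this model, MTBF is the point at which about 63% of units have failed, not the point at which failures begin.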
As a result of these drawbacks, there is a big push to eliminate MTBF from the tool set and from the vocabulary (‘No MTBF’). And, nominally, I would tend to support such an effort. However, such a movement should be tempered by some of the realities of the world and the needs of complex systems.
There are a number of alternative approaches for risk prediction, including on-going reliability testing (ORT), warranty analysis, and physics of failure (PoF) [No, HALT is not a way to do reliability prediction; see my column next month]. However, all of these approaches have their own limitations. Most companies do not have the resources to perform ORT. Warranty analysis can sometimes be limited to the first year, when failure rates are still fluctuating. And physics of failure (PoF), as insightful as it is, requires a wearout mechanism. It can’t predict an EMI/EMC event, an EOS event, or even a mishandling event. For small to mid-size companies with a one-year warranty selling into a data center (very controlled environment), empirical prediction may be the only option. At the very least, a blended approach may be most effective, where technologies at risk of wearout are modeled using PoF and other technologies obtain failure rates from other sources, of which one could be a handbook.
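One way such a blended calculation might look is sketched below. Several things here are my assumptions rather than anything prescribed above: the product is treated as a series system (any part failure fails the product), part failures are independent, and a Weibull distribution stands in for the output of a PoF wearout model. All function names and numbers are hypothetical.

```python
import math

def constant_rate_reliability(failure_rate_per_hour: float, t_hours: float) -> float:
    """Reliability of a part with a constant failure rate (e.g. a handbook value)."""
    return math.exp(-failure_rate_per_hour * t_hours)

def weibull_reliability(eta_hours: float, beta: float, t_hours: float) -> float:
    """Reliability of a wearout-prone part; Weibull is an assumed stand-in for a PoF model."""
    return math.exp(-((t_hours / eta_hours) ** beta))

def blended_reliability(t_hours, constant_rates, wearout_parts):
    """Series-system reliability: the product of every part's reliability at time t.

    constant_rates: iterable of failure rates per hour
    wearout_parts:  iterable of (eta_hours, beta) Weibull parameters
    """
    r = 1.0
    for lam in constant_rates:
        r *= constant_rate_reliability(lam, t_hours)
    for eta, beta in wearout_parts:
        r *= weibull_reliability(eta, beta, t_hours)
    return r

# Hypothetical inputs: two parts with handbook-style rates, one wearout-limited part
five_years = 5 * 8760  # hours
r = blended_reliability(five_years, [2e-7, 5e-7], [(150_000, 3.0)])
print(f"Predicted reliability at 5 years: {r:.3f}")
```

The point of the structure is that the wearout term grows with time while the constant-rate terms do not, so the combined result no longer reduces to a single MTBF number.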
And while MTBF is sorely misunderstood at the OEM level, it does help managers of complex systems differentiate critical vs. non-critical failures, capture the influence of time to repair, compare expected life against mission duration, and perform relatively simple arithmetic to compute availability. As a result, systems people tend to like MTBF and therefore force it upon the OEMs who supply products that integrate into these complex structures.
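To show how simple that arithmetic is, here is a minimal sketch using the standard steady-state relation, availability = MTBF / (MTBF + MTTR), along with a mission reliability check under a constant failure rate. The function names and the example numbers are hypothetical.

```python
import math

def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a repairable unit is up: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series_availability(availabilities) -> float:
    """Availability of a system that needs every listed unit up (independence assumed)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def mission_reliability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability of completing a mission without failure, assuming a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)

# Hypothetical subsystem: 10,000 h MTBF, 4 h mean time to repair, 72 h mission
a_unit = steady_state_availability(10_000, 4)
print(f"Unit availability:      {a_unit:.5f}")
print(f"Three units in series:  {series_availability([a_unit] * 3):.5f}")
print(f"72 h mission reliability: {mission_reliability(10_000, 72):.5f}")
```

A systems integrator can run this kind of roll-up across dozens of supplied boxes in a spreadsheet, which goes a long way toward explaining the metric's staying power.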
Long term, there is a path for the eventual demise of MTBF, especially with more sophisticated reliability tools hitting the market. In the meantime, understanding how and why MTBF is used will bring us closer to that goal. To reach ‘No MTBF’, we need to ‘Know MTBF’.
Craig Hillman is CEO and Managing Member for DfR Solutions. Dr. Hillman’s specialties include best practices in Design for Reliability (DfR), strategies for transitioning to Pb-free, supplier qualification (commodity and engineered products), passive component technology (capacitors, resistors, etc.), and printed board failure mechanisms. Dr. Hillman has over 40 publications and has presented on a wide variety of reliability issues to over 250 companies and organizations.