Transitioning to Physics of Failure Reliability Assessments for Electronics

James G. McLeish, CRE with DfR Solutions


Key Words: Reliability, Physics of Failure, Reliability Physics, Assessment, Reliability Prediction, MIL-HDBK-217, Design for Reliability.

ABSTRACT

The U.S. Dept. of Defense (DoD) has initiated multiple efforts to revitalize reliability in defense systems acquisition and development [1, 2]. One of these projects involves a series of revisions to MIL-HDBK-217 Rev F, the often imitated and frequently criticized reliability prediction bible for electronic equipment.

The MIL-HDBK-217 revision team has proposed eventually migrating to Computer Aided Engineering (CAE) tools with science-based Physics of Failure (PoF) reliability modeling, simulations and probabilistic mechanics techniques to expand beyond the current limitations of actuarial reliability prediction methods.

This approach uses Computer Aided Design (CAD) data to provide specific insight into which components are most susceptible to failure, when failures can be expected, the rate of failure growth and why the failures are occurring. This information can then be used to design out susceptibility to failure mechanisms in order to achieve highly reliable and robust systems by means of Virtual Reliability Growth.

The proposal has been submitted and is undergoing review while awaiting funding approval. This paper provides a brief overview of PoF methods and reviews how they can be applied to update the reliability prediction techniques in MIL-HDBK-217, from the point of view of a member of the 217 revision team.

OVERVIEW OF MIL-HDBK-217 - RELIABILITY PREDICTION OF ELECTRONIC EQUIPMENT

The current version of MIL-HDBK-217, revision F, defines two reliability prediction methods (“Part Count” and “Part Stress”) to estimate the average life of electronic equipment in terms of its Mean Time To or Between Failures (MTTF/MTBF). In the “Part Count” method the MTTF/MTBF value is determined by taking the inverse of the sum of the failure rates (from actuarial tables) for each generic component type in an electronic device. The basic failure rates can then be scaled to account for operation in different environmental conditions such as mobile, naval, airborne, missile and space. The “Part Stress” method provides additional generic scaling factors intended to account for the reliability degradation effects of usage stresses such as power, voltage, and temperature.
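To make the arithmetic concrete, the sketch below works a Part Count style calculation. The part types, base failure rates and environmental factor are illustrative placeholders, not values taken from the 217 tables.

```python
# Minimal sketch of the "Part Count" arithmetic. The generic part types,
# base failure rates and environmental factor below are illustrative
# placeholders, not values from the MIL-HDBK-217 tables.

# (generic part type, quantity, base failure rate per 10^6 hours)
parts = [
    ("microcircuit, digital", 12, 0.05),
    ("resistor, film",        40, 0.002),
    ("capacitor, ceramic",    25, 0.004),
    ("connector, PCB edge",    2, 0.10),
]

pi_E = 4.0  # assumed environmental scaling factor (e.g., ground mobile)

# Equipment failure rate: quantity-weighted sum of the generic rates,
# scaled for the application environment.
lambda_total = pi_E * sum(qty * lam for _, qty, lam in parts)  # per 10^6 h

mtbf_hours = 1e6 / lambda_total  # MTBF is the reciprocal of the failure rate

print(f"Total failure rate: {lambda_total:.2f} failures per million hours")
print(f"Predicted MTBF:     {mtbf_hours:,.0f} hours")
```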

MIL-HDBK-217 CONCERNS

There are numerous concerns about MIL-HDBK-217’s actuarial MTTF/MTBF reliability prediction methods. The primary criticisms, which have been covered thoroughly in other publications [3, 4], are summarized below:

1) Its reliability predictions are based on constant failure rates, which model only random failure situations. This simplifies data collection and calculations, a necessity back in the 1950s and 1960s when the method was first developed. But when failure trends are modeled this way via the exponential distribution, infant mortality and wearout related failures are not accounted for (see the sketch following this list).

2) Actuarial models based on industry-wide average failure rates are not vendor, device or event specific. They typically correlate poorly with actual field performance and cannot provide insight for evaluating the susceptibility of new designs to actual failure mechanisms. They also cannot evaluate new technologies that lack a field history on which to base projections.

3) The mean results provide no insight into the starting point, growth rate or distribution range of true failure trends. Also, the MTTF/MTBF concept is often misinterpreted by people without formal reliability training.

4) Overemphasis on the Arrhenius model and steady-state temperature as the primary factor in electronic component failure, while the roles of key stress factors such as temperature cycling, humidity, vibration and shock have not been individually modeled [5, 6, 7] (see the sketch following this list).

5) Overemphasis on component failures, despite RIAC studies showing that at least 78% of electronic failures are due to other issues that are not modeled by 217, such as design errors, PCB assembly defects, solder and wiring interconnect failures, PCB insulation resistance and via failures, software errors, etc. [8]

6) The last 217 update was in 1995; new components, technology advancements and quality improvements since then are not covered. It is grossly out of date, with the majority of the data from the 1980s or older [4, 9].

7) The handbook’s actuarial data would need to be kept up to date with regularly scheduled releases, an enormous task that grows with the creation of each new device and component family that must be tracked.
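To illustrate points 1 and 4 above, the following sketch contrasts the constant hazard rate of the exponential model with Weibull hazard shapes that capture infant mortality and wearout, and then computes an Arrhenius temperature acceleration factor. The shape parameters, characteristic life and activation energy are illustrative assumptions, not handbook values.

```python
import math

# Minimal sketch illustrating points 1 and 4 above. A Weibull hazard with
# shape beta < 1 (infant mortality) or beta > 1 (wearout) is contrasted with
# the constant hazard of the exponential model (beta = 1), and an Arrhenius
# acceleration factor is computed. All parameter values are illustrative.

def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate h(t) = (beta/eta) * (t/eta)**(beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

eta = 10_000.0  # assumed characteristic life, hours
for t in (100.0, 1_000.0, 10_000.0):
    h_infant = weibull_hazard(t, beta=0.5, eta=eta)   # decreasing hazard
    h_const = weibull_hazard(t, beta=1.0, eta=eta)    # constant (exponential)
    h_wearout = weibull_hazard(t, beta=3.0, eta=eta)  # increasing hazard
    print(f"t={t:>8.0f} h  infant={h_infant:.2e}  constant={h_const:.2e}  "
          f"wearout={h_wearout:.2e}")

# Arrhenius acceleration factor between use and stress temperatures (kelvin):
# AF = exp[(Ea / k) * (1/T_use - 1/T_stress)]
Ea = 0.7       # assumed activation energy, eV
k = 8.617e-5   # Boltzmann constant, eV/K
T_use, T_stress = 328.0, 398.0  # 55 C and 125 C
AF = math.exp((Ea / k) * (1.0 / T_use - 1.0 / T_stress))
print(f"Arrhenius acceleration factor, 55 C -> 125 C: {AF:.0f}x")
```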

THE MIL-HDBK-217 REV G UPDATE EFFORT

The first MIL-HDBK-217 update is Rev. G. It was authorized under DoD Acquisition Streamlining and Standardization Information System (ASSIST) Project # SESS-2008-001. The Naval Surface Warfare Center (NSWC) Crane Division is managing the project [10]. The objectives of the G revision were to:

  • Refresh the data for today’s electronic part technologies.
  • Not produce a new reliability prediction approach; the current actuarial models are to remain intact.
  • Continue to look and work the same, so that reliability engineers do not have to learn a new tool.
  • Remain a paper document, despite the obvious need for a web-based, living electronic failure rate database that is essential for keeping up with the rapid, continuous advancement of electronics.
  • Be performed on a shoestring budget relying on volunteers, contrary to past revision efforts in which universities and research organizations were contracted to make the revisions.

The objective of the 217 Rev G update was not to develop a better, more accurate reliability prediction tool or to produce more reliable systems. It was to return to a common and consistent method for estimating the inherent reliability of an eventually mature design during acquisition, so that competing designs could be evaluated by a common process.

Because the 217 Rev F failure rate data and models are a frozen snapshot of conditions from over 15 years ago and are well out of date, many organizations have attempted to improve their reliability estimates by using modified or alternative prediction methods. These varied from using the 217 models with their own component failure rate data, to using alternative empirical models (such as SR-332, the European FIDES method, or the RIAC 217Plus method), to using PoF techniques. These efforts to make better predictions and more reliable systems were encouraged by many reliability professionals. However, the diversity made it difficult for acquisition personnel and program managers to evaluate proposals for new systems.

Concerns were raised at the Rev G kickoff meeting that developing better reliability assessment methods should be considered along with the needs of the acquisition community when defining the goals for updating 217 for the first time in many years.

Therefore, a second phase of the project was started to develop proposals for better reliability prediction methods for a future revision H. The Phase II task was to research and define a proposal for an improved reliability prediction methodology and the best means to implement it. Upon acceptance of the Phase II plan, a Phase III effort would later be created to implement that plan, which would become MIL-HDBK-217 Rev. H.

PHASE II FINDINGS

The 217 work groups applied a QFD (Quality Function Deployment) analysis using data collected by the Aerospace Vehicle Systems Institute (AVSI) AFE 70 Reliability Framework and Roadmap project. This effort compiled and documented the needs of potential users of reliability prediction results and correlated them into functions and tasks for achieving the objective.

The QFD analysis identified the need for a more holistic approach to reliability prediction that could more accurately evaluate the risks of specific issues in addition to overall reliability. Also identified were the needs to evaluate the time to first failure in addition to MTBF, and for a way to deal with the constant emergence of new technologies that does not require years of field performance before reliability predictions can be made.

After considerable evaluation, the Phase II team converged on two basic approaches: 1) improve the empirical reliability prediction approach, and 2) embrace and standardize the science-based Physics of Failure (PoF) approach, in which deterministic cause-and-effect relationships are analyzed using fundamental engineering principles.

Eventually a two-part hybrid approach was developed. First, an updated and improved empirical approach based on the RIAC 217Plus methodology [11] was proposed to support acquisition comparison and program management activities during the early stages of an acquisition program.

The proposed second part would define Physics of Failure (PoF) modeling and simulation methods for use during the actual engineering phases of a program. This method would assess the susceptibility and durability of design alternatives to various failure mechanisms under the intended usage profile and application environment. In this way, items that lack the durability and reliability required for an application can be screened out early and at low cost during the design phase, resulting in more reliable hardware and systems. Since the 217Plus approach has been well defined in other publications [11], the rest of this paper provides an overview of the PoF approach proposed for 217 Rev H.

PHYSICS OF FAILURE BASICS

PoF (also known as Reliability Physics) applies analysis early in the design process to assess the reliability and durability of design alternatives in specific applications. This enables designers to make design and manufacturing choices that minimize failure opportunities, producing reliability-optimized, robust products. PoF focuses on understanding the cause-and-effect physical processes and mechanisms that cause degradation and failure of materials and components. It is based on analyzing the loads and stresses in an application and evaluating the ability of materials to endure them from a strength and mechanics-of-materials point of view. This approach integrates reliability into the design process via a science-based process for evaluating materials, structures and technologies. These techniques, known as load-to-strength interference analysis, are a basic part of mechanical, structural, construction and civil engineering processes.
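As a simple illustration of load-to-strength interference, the sketch below assumes independent, normally distributed stress and strength with illustrative parameters and computes the resulting probability of failure.

```python
import math

# Minimal sketch of load-to-strength (stress-strength) interference analysis,
# assuming independent normal distributions with illustrative parameters.
# Failure occurs when the applied stress exceeds the material strength.

mu_stress, sigma_stress = 60.0, 8.0        # applied stress, MPa (assumed)
mu_strength, sigma_strength = 100.0, 12.0  # material strength, MPa (assumed)

# For independent normals the margin (strength - stress) is also normal,
# so the failure probability follows from the standard normal CDF.
mu_margin = mu_strength - mu_stress
sigma_margin = math.sqrt(sigma_stress**2 + sigma_strength**2)
z = mu_margin / sigma_margin                  # reliability (safety) index
p_fail = 0.5 * math.erfc(z / math.sqrt(2.0))  # P(margin < 0)

print(f"Safety index z = {z:.2f}")
print(f"Probability of overstress failure = {p_fail:.2e}")
```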

Unfortunately, during the early development and evolution of Electrical/Electronics (E/E) technology in the 1950s and 1960s, this approach was not used, since electrical engineers were not trained in or familiar with structural analysis techniques and the miniaturization of electronics had not yet reached the point where structural and strength optimization was required. As with any new technology, the reasons for failures were not initially well understood. Research into E/E failures was slow and difficult because, unlike mechanical and structural items, most E/E failures are not readily apparent to the naked eye: most components are microscopic and electrons are not visible. Due to these difficulties, actuarial reliability methods were adopted instead and became so entrenched that the development of better alternatives was stifled.

Over the last 25 years, great progress has been made in PoF modeling and the characterization of E/E material properties. By adapting the techniques of mechanical and structural engineering, computerized durability simulations of E/E devices using deterministic physics and chemistry models are now possible and are becoming more practical and cost effective every year. Failure analysis research has led PoF methods to be organized around three generic root-cause failure categories: Errors and Excessive Variation, Wearout Mechanisms and Overstress Mechanisms.

Overstress Failures

Overstress failures such as yield, buckling and electrical surge damage occur when the stresses of the application exceed the strength or capabilities of a device’s materials, causing immediate or imminent failure. In items that are well designed for the loads in their application, overstress failures are rare and random. They occur only under conditions that are beyond the design intent of the device, such as acts of God or war, being struck by lightning, or being submerged in a flood. Overstress is the PoF engineering view of random failures from traditional reliability theory. If overstress failures occur frequently, then the device may not be suited for the application or the range of application stresses was underestimated. PoF load-stress analysis is used to determine the strength limits of a design for stresses like shock and electrical transients and to assess whether they are adequate.

Wearout Failures

Wearout in PoF is defined as stress-driven damage accumulation in materials, which covers failure mechanisms like fatigue and corrosion. Numerous methods for stress analysis of structural materials have been developed by mechanical engineers. These techniques are readily adapted to the microstructures of electronics once material properties have been characterized. PoF wearout analysis identifies the components or features in a device most likely to fail 1st, 2nd, 3rd, etc., along with their times to first failure and their related fallout rates afterwards, for the various wearout mechanisms. Designers can then identify items that are prone to various types of wearout during the intended service life of a new product. The design can then be optimized until susceptibility to wearout during the desired service life is designed out.
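The sketch below illustrates the wearout concept with a generic Coffin-Manson style cycles-to-failure relation combined with Miner's linear damage rule. The fatigue coefficient, exponent and thermal-cycle mission profile are illustrative assumptions, not characterized values for any particular solder alloy or package.

```python
# Minimal sketch of stress-driven damage accumulation: a generic
# Coffin-Manson style cycles-to-failure relation combined with Miner's
# linear damage rule. The fatigue coefficient, exponent and thermal-cycle
# mission profile below are illustrative assumptions only.

C, m = 1.0e7, 2.0  # assumed fatigue coefficient and exponent

def cycles_to_failure(delta_T):
    """Cycles to failure for a given temperature swing (Coffin-Manson form)."""
    return C * delta_T ** (-m)

# Mission profile: (temperature swing in deg C, cycles accumulated per year)
mission = [(20.0, 2000), (40.0, 500), (80.0, 50)]

# Miner's rule: damage accumulates as the sum of n_i / N_i; failure near 1.0.
damage_per_year = sum(n / cycles_to_failure(dT) for dT, n in mission)
years_to_wearout = 1.0 / damage_per_year

print(f"Annual damage fraction:  {damage_per_year:.3f}")
print(f"Estimated wearout life:  {years_to_wearout:.1f} years")
```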

Errors and Excessive Variation Related Failures

Errors and excessive variation issues comprise the PoF view of the traditional concept of infant mortality. Opportunities for error and variation touch every aspect of design, supply chain and manufacturing processes. These types of failures are the most diverse and challenging category. Since diverse, random, stochastic events are involved, these failures cannot be modeled or predicted using a deterministic PoF cause-and-effect approach. However, reliability improvements are still possible when PoF knowledge and lessons learned are used to evaluate and select manufacturing processes that are proven to be capable, to ensure robustness and to implement error proofing.

INTEGRATING PoF INTO MIL-HDBK-217 REV H

The 217WG developed a dual approach for integrating PoF overstress and wearout analysis into 217 Rev. H alongside improved empirical prediction methods. One proposed PoF section addresses electronic component issues while the second deals with Circuit Card Assembly (CCA) issues. These sections are meant to serve as a guide to the types of PoF models and methods that exist for reliability assessments.

Physics of Failure Methods for Components

The proposed PoF component section focuses on the failure mechanisms and reliability aspects of semiconductor dies, microcircuit packaging, interconnects and the wearout mechanisms of components such as capacitors. A key current industry concern is the expected reduction in lifetime reliability due to the scaling of IC die features down to nanoscale levels of 90, 65 and 45 nanometers (nm) [12]. Models that evaluate IC failure mechanisms such as Time Dependent Dielectric Breakdown, Electromigration, Hot Carrier Injection and Negative Bias Temperature Instability are being considered to address this concern [13].
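As one example of the type of model under consideration, the sketch below applies Black's equation for electromigration median time to failure. The prefactor, current-density exponent and activation energy are illustrative assumptions rather than characterized values for any process node.

```python
import math

# Minimal sketch of Black's equation for electromigration, one of the IC
# wearout mechanisms named above: MTTF = A * J**(-n) * exp(Ea / (k * T)).
# The prefactor, current-density exponent and activation energy below are
# illustrative assumptions, not characterized values for any process node.

A = 1.0e-8      # assumed process/geometry prefactor
n = 2.0         # assumed current-density exponent
Ea = 0.9        # assumed activation energy, eV
k = 8.617e-5    # Boltzmann constant, eV/K

def em_mttf_hours(J_MA_per_cm2, T_kelvin):
    """Median time to electromigration failure for a given current density and temperature."""
    return A * J_MA_per_cm2 ** (-n) * math.exp(Ea / (k * T_kelvin))

# Shrinking interconnect geometry raises current density J for the same current.
for J in (1.0, 2.0, 4.0):  # MA/cm^2
    print(f"J = {J:.1f} MA/cm^2 at 105 C -> MTTF ~ {em_mttf_hours(J, 378.0):.2e} hours")
```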

Physics of Failure Methods for Circuit Card Assemblies

The proposed PoF circuit card assembly section defines four categories of analysis techniques (see Figure 1) that can be performed with currently available Computer Aided Engineering (CAE) software. A probabilistic mechanics [14] approach is used to account for variation issues. This methodology is aligned with the analysis, modeling and simulation methods recommended in Section 8 of SAE J1211 - Handbook for Robustness Validation of Automotive Electrical/Electronic Modules [15]. The four categories are:

1) E/E Performance and Variation Modeling is used to evaluate whether stable E/E circuit performance objectives are achieved under static and dynamic conditions, including tolerancing and drift concerns.

2) Electromagnetic Compatibility (EMC) and Signal Integrity Analysis evaluates whether a CCA generates, or is susceptible to disruption by, electromagnetic interference and whether the transfer of high-frequency signals is stable.

3) Stress Analysis is used to assess the ability of a CCA’s physical packaging to maintain structural and circuit interconnection integrity, to maintain a suitable environment for E/E circuits to function reliably, and to determine whether the CCA is susceptible to overstress failures [16].

4) Wearout Durability and Reliability Modeling uses the results of the stress analysis to predict the long-term stress aging/stress endurance, gradual degradation and wearout capabilities of a CCA [16]. Results are provided in terms of time to first failure and the expected failure distribution, as an ordered list (1st, 2nd, 3rd, etc.) of the devices, features, mechanisms and sites where failures are most likely (see the sketch below).
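The sketch below illustrates the probabilistic mechanics idea behind category 4: Monte Carlo sampling of part-to-part variation in fatigue life to rank the most likely first-failure sites and estimate time to first failure. The solder-joint sites, median lives and spreads are hypothetical values chosen only for illustration.

```python
import math
import random

# Minimal sketch of the probabilistic mechanics idea behind category 4:
# Monte Carlo sampling of part-to-part variation in fatigue life to rank
# likely first-failure sites and estimate time to first failure. The sites,
# median lives and spreads below are hypothetical, not simulation outputs.

random.seed(1)

# (site name, median cycles-to-failure, lognormal sigma) -- assumed values
sites = [("BGA U1 corner joint", 8_000, 0.4),
         ("QFP U7 corner lead", 15_000, 0.5),
         ("Chip resistor R22",  30_000, 0.6)]

TRIALS, CYCLES_PER_YEAR = 10_000, 1_500
first_failure_years = []
first_site_counts = {name: 0 for name, _, _ in sites}

for _ in range(TRIALS):
    # Sample a fatigue life for each site from its lognormal distribution.
    lives = {name: random.lognormvariate(math.log(median), sigma)
             for name, median, sigma in sites}
    first = min(lives, key=lives.get)   # site that fails first in this trial
    first_site_counts[first] += 1
    first_failure_years.append(lives[first] / CYCLES_PER_YEAR)

first_failure_years.sort()
ranked = sorted(first_site_counts.items(), key=lambda kv: -kv[1])
print("Most likely first-failure sites:", ranked)
print(f"1st percentile time to first failure: "
      f"{first_failure_years[TRIALS // 100]:.1f} years")
```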

IMPLEMENTATION CONCEPTS

Each of the four groups contains analysis tasks that use similar analytical skills and tools. Combined, these techniques provide a multi-discipline virtual engineering prototyping process for finding design weaknesses and susceptibilities to failure mechanisms, and for predicting reliability early in the design when improvements can be implemented at low cost.

Most of these modeling techniques require specialized modeling skills and experience with CAE software. It is not expected that reliability engineers would personally learn and perform these tasks. However, the definition and recognition of PoF methods as integral, accepted reliability methods for creating robust and highly reliable systems is expected to help connect reliability professionals with design engineers and help integrate reliability-by-design concepts into design activities.

The PoF sections are not intended to mandate that every model be applied to every item in every design, or that modeling be limited to only the listed models, since new models are constantly being developed. Furthermore, the list is not all-inclusive, since PoF models for every issue do not yet exist. The goal is to identify existing evaluation methods that can be selected as needed during design and development activities to mitigate reliability risks. This way, more reliability growth can occur faster, at lower cost, in a virtual environment during a project’s design phase.

By establishing a roadmap for merging fundamental engineering analysis and reliability methods, a technology infrastructure can be encouraged to continue to grow (perhaps faster) and provide more tools and methods for reliability engineers and product design teams to use in unison.

Figure 1 – The four categories of PoF analysis techniques for Circuit Card Assemblies

REFERENCES

[1] Report of the Reliability Improvement Working Group, U.S. Dept of Defense, June 2008

[2] Report of the Defense Science Board on Developmental Test & Evaluation, U.S. Dept of Defense, May 2008

[3] F. R. Nash, “Estimating Device Reliability: Assessment of Credibility”. AT&T Bell Labs/Kluwer Publishing, MA, 1993.

[4] M. Pecht, “Why the Traditional Reliability Prediction Models Do Not Work - Is There an Alternative?”, Electronics Cooling, Vol. 2, pp. 10-12, January 1996

[5] M. Osterman, “We Still have a headache with Arrhenius”, Electronics Cooling, Vol. 7, No. 1, pp 53-54, Feb. 2001

[6] M. Pecht, P. Lall, E. Hakim, “Temperature as a Reliability Factor”, 1995 Eurotherm Seminar No. 45: Thermal Management of Electronic Systems, pp. 36.1-22

[7] M. Ohring, “Reliability & Failure of Electronic Materials & Devices”, Ch. 4.5.8 – “Is Arrhenius Erroneous”, Academic Press, San Diego CA, 1998

[8] D.D. Dylis, M.G. Priore, “A Comprehensive Reliability Assessment Tool for Electronic Systems (Prism)”, IIT Research/Reliability Analysis Center, Rome NY, RAMS 2001

[9] “PRISM vs. commercially available prediction tools”, RIAC Admin Posting #558. May 17, 2007 RIAC.ORG, http://www.theriac.org/forum/showthread.php?t=12904

[10] L. Gullo, “The Revitalization of MIL-HDBK-217”, IEEE Reliability Newsletter, Sept 2008, http://www.ieee.org/portal/cms_docs_relsoc/relsoc/Newsletters/Sep2008/Revitalization_MIL-HDBK-217.htm

[11] D. Nicholls, “An Introduction to the RIAC 217Plus Component Failure Rate Models”, The Journal Of The Reliability Information Analysis Center, 1st Quarter - 2007

[12] R. Alderman, “Physics of Failure: Predicting Reliability in Electronic Components”, Embedded Technology, July 2009

[13] S. Salemi, L. Yang, J. Dai, J. Qin, J.B. Bernstein, “Physics-of-Failure Based Handbook of Microelectronic Systems”, Defense Technical Information Center/Air Force Research Lab Report, U. of MD & RIAC, Utica, NY, Mar. 2008

[14] I. Elishakoff “Probabilistic Theory of Structures” 2nd edition, Dover Publications, Feb. 1999

[15] SAE J1211 – “Handbook for Robustness Validation of Automotive E/E Modules”, Section 8 - Analysis, Modeling and Simulations, SAE, April 2009.

[16] S.A. McKeown, Mechanical Analysis of Electronic Packaging Systems, Marcel Dekker, New York 1999.


BIOGRAPHY

James G. McLeish, CRE

DfR (Design for Reliability) Solutions

5110 Roanoke Place, Suite 101

College Park, Maryland 20740 – USA

e-mail: jmcleish@dfrsolutions.com

Mr. McLeish holds a dual EE/ME Masters degree in Vehicle E/E Control Systems. He is a Certified Reliability Engineer and a core member of the Society of Automotive Engineers Reliability Standards Workgroup, with over 32 years of automotive and military Electrical/Electronics experience. He started his career as a practicing electronics design engineer who helped invent the first microprocessor-based engine computer at Chrysler Corp. in the 1970s. He has since worked in systems engineering, design, development, product validation, reliability and quality assurance of both E/E components and vehicle systems at General Motors and GM Military. He is credited with the introduction of Physics-of-Failure methods to GM while serving as an E/E Reliability Manager and E/E QRD (Quality/Reliability/Durability) Technology Expert. Since 2006 Mr. McLeish has been a partner and manager of the Michigan office of DfR Solutions, a quality/reliability engineering consulting and laboratory services firm formed by senior scientists and staffers from the University of Maryland’s CALCE Center for Electronic Products and Systems. DfR Solutions is a leader in providing PoF science and expertise to the global electronics industry.