Modern electronics typically consist of microprocessors and other complex integrated circuits (ICs) such as FPGAs, ADCs, and memory. They are susceptible to electrical, mechanical and thermal modes of failure like other components on a printed circuit board, but due to their materials, complexity and roles within a circuit, accurately predicting a failure rate has become difficult, if not impossible. Development of these critical components has conformed to Moore's Law, where the number of transistors on a die doubles approximately every two years. This trend has been successfully followed over the last four decades through reduction in transistor sizes creating faster, smaller ICs with greatly reduced power dissipation. Although this is great news for developers and users of high performance equipment, including consumer products and analytical instrumentation, a crucial, yet underlying reliability risk has emerged. Semiconductor failure mechanisms, which are far worse at these minute feature sizes (tens of nanometers), result in higher failure rates, shorter device lifetimes and unanticipated early device wearout. This is of special concern to users whose requirements include long service lifetimes and rugged environmental conditions, such as aerospace, defense, and other high performance (ADHP) industries. To that end, the Aerospace Vehicle Systems Institute (AVSI) has conducted research in this area, and DfR Solutions has performed much of the work as a contractor to AVSI.
Physics-of-Failure (PoF) knowledge and an accurate mathematical approach that utilizes semiconductor formulae, industry-accepted failure mechanism models, and device functionality can assess the reliability of those integrated circuits vital to system stability. Currently, four semiconductor failure mechanisms that exist in silicon-based ICs are analyzed: Electromigration (EM), Time Dependent Dielectric Breakdown (TDDB), Hot Carrier Injection (HCI), and Negative Bias Temperature Instability (NBTI). Mitigation of these inherent failure mechanisms, including those considered wearout, is possible only when reliability can be quantified. Algorithms have been folded into a software application not only to calculate a failure rate, but also to produce confidence intervals and lifetime curves, using both steady state and wearout failure rates, for the IC under analysis. The algorithms have been statistically verified through testing and employ data and formulae from semiconductor materials (including technology node parameters), circuit fundamentals, transistor behavior, circuit design and fabrication processes. Initial development has yielded a user-friendly software module with the ability to address silicon-based integrated circuits of the 0.35 µm, 0.25 µm, 0.18 µm, 0.13 µm and 90 nm technology nodes.
In the ADHP industries, there is considerable interest in assessing the long term reliability of electronics whose anticipated lifetimes extend beyond those of consumer "throw away" electronics. Because complex integrated circuits within their designs may face wearout or even failure within the period of useful life, it is necessary to investigate the effects of use and environmental conditions on these components. The main concern is that submicron process technologies drive device wearout into the regions of useful life well before wearout was initially anticipated to occur. The continuous scaling down of semiconductor feature sizes raises challenges in electronic circuit reliability prediction. Smaller and faster circuits cause higher current densities, lower voltage tolerances and higher electric fields, which make the devices more vulnerable to early failure. Emerging new generations of electronic devices require improved tools for reliability prediction in order to investigate new manifestations of existing failure mechanisms, such as NBTI, EM, HCI, and TDDB.
Working with AVSI, DfR Solutions has developed an integrated circuit (IC) reliability calculator using a multiple failure mechanism approach. This approach successfully models the simultaneous degradation behaviors of multiple failure mechanisms on integrated circuit devices. The multiple mechanism model extrapolates independent acceleration factors for each semiconductor mechanism of concern based on the transistor stress states within each distinct functional group. Integrated circuit lifetime is calculated from semiconductor materials and technology node, IC complexity, and operating conditions.
A major input to the tool is integrated circuit complexity. This characteristic has been approached by using specific functionality cells called functional groups. The current set of functional groups covers memory-based devices and analog-to-digital conversion circuitry. Technology node process parameters, functional groups and their functionality, and field/test operating conditions are used to perform the calculations. Prior work verified the statistical assessment of the algorithms for aerospace electronic systems and confirmed that no single semiconductor failure mechanism dominates failures in the field. Two physics-of-failure approaches are used within the tool to determine each of four semiconductor failure mechanisms’ contribution to the overall device failure rate. The tool calculates a failure rate and also produces confidence intervals and a lifetime curve, using both steady state and wearout failure rates, for the part under analysis.
Reliability prediction simulations are the most powerful tools developed over the years to cope with these challenging demands. Simulations may provide a wide range of predictions, starting from the lower-level treatment of physics-of-failure (PoF) mechanisms up to high-level simulations of entire devices [1, 2]. As with all simulation problems, primary questions need to be answered, such as: “How accurate are the simulation results in comparison to in-service behavior?” and “What is the confidence level achieved by the simulations?” Thus the validation and calibration of the simulation tools becomes a most critical task. Reliability data generated from field failures best represents the electronic circuit reliability in the context of the target system/application. Field failure rates represent competing failure mechanisms' effects and include actual stresses, in contrast to standard industry accelerated life tests.
In this paper, the failure rates of recorded field data from 2002 to 2009 were determined for various device process technologies and feature sizes (or “technology nodes”). These failure rates are used to verify the PoF models and a competing failure approach, as implemented in the software. Comparison of the actual and simulated failure rates shows a strong correlation. Furthermore, comparing the field failure rates with those obtained from the standard industry High Temperature Operating Life (HTOL) Test reveals the inadequacy of the HTOL to predict integrated circuit (IC) failure rates. The validation process and its data sources are illustrated in Figure 1.
Semiconductor life calculations were performed using an integrated circuit reliability prediction software tool developed by DfR Solutions in cooperation with AVSI. The software uses component accelerated test data and physics-of-failure (PoF) based die-level failure mechanism models to calculate the failure rate of integrated circuit components during their useful lifetime. Integrated circuit complexity and transistor behavior are contributing factors to the calculation of the failure rate. Four failure mechanisms are modeled in this software using readily available, published models from the semiconductor reliability community and NASA/Jet Propulsion Laboratory (JPL) as well as research from the University of Maryland, College Park. These mechanisms are Electromigration (EM), Time Dependent Dielectric Breakdown (TDDB), Hot Carrier Injection (HCI) and Negative Bias Temperature Instability (NBTI). Taking the reliability bathtub curve (see Figure 2) into consideration, research shows that EM and TDDB are considered steady-state failure modes (constant random failure), whereas HCI and NBTI have wearout behavior (increasing failure rate).
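The difference between a constant and an increasing failure rate can be illustrated with a Weibull hazard function. The following is a minimal sketch, not the model used by the software: the Weibull form, the function name `weibull_hazard`, and the shape and scale values are assumptions chosen only for illustration.

```python
def weibull_hazard(t, beta, eta):
    """Instantaneous failure rate h(t) = (beta / eta) * (t / eta) ** (beta - 1).

    beta is the shape parameter and eta the characteristic life.
    beta == 1 gives a constant rate; beta > 1 gives a rate that rises with time.
    """
    if t <= 0 or beta <= 0 or eta <= 0:
        raise ValueError("t, beta and eta must all be positive")
    return (beta / eta) * (t / eta) ** (beta - 1)


# Hypothetical values, not taken from the paper.
ETA_HOURS = 100_000.0
times = [10_000.0, 50_000.0, 100_000.0]

# Shape 1.0: flat bottom of the bathtub curve (steady-state behavior).
steady_state = [weibull_hazard(t, 1.0, ETA_HOURS) for t in times]

# Shape 3.0: rising right-hand side of the bathtub curve (wearout behavior).
wearout = [weibull_hazard(t, 3.0, ETA_HOURS) for t in times]
```

With a shape parameter of 1 the hazard is the same at every time, while a shape parameter above 1 produces a hazard that grows with age, which is the behavior attributed here to HCI and NBTI.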
Each of these failure mechanisms is driven by a combination of temperature, voltage, current, and frequency. Traditional reliability predictions assume temperature, and sometimes voltage as the only accelerators of failure. Each failure mechanism affects the on-die circuitry in a unique way. Therefore, each is modeled independently and later combined with the others using associated proprietary weighting factors. This software uses circuit complexity, test and field operating conditions, derating values, and transistor behavior as mathematical parameters. Since there is not one dominant parameter set, each mechanism could have the largest contribution to a component's failure rate depending on the use conditions. In general, there is no dominant failure mechanism - and thus for a specific component, any combination of the four mechanisms can affect it.
Failure rates were calculated using a specialized set of time to failure (TTF) equations. Time to failure is approximately the reciprocal of the failure rate. Mean time to failure (MTTF) is the mean, or expected value, of the time-to-failure probability distribution. MTTF is used for non-repairable systems, such as an integrated circuit, which can fail only once. For repairable systems, such as a re-workable printed circuit board or assembly, mean time between failures (MTBF) is the corresponding metric, because repairable systems can fail several times. In general, it takes more time for the first failure to occur than it does for subsequent failures to occur. The mathematics are the same for MTTF and MTBF. Integrated circuits can be replaced on an assembly but cannot themselves be repaired, so this analysis treats them as non-repairable.
Smaller and faster circuits have higher current densities, lower voltage tolerances and higher electric fields, which make integrated circuits more vulnerable to electrically based failure. New generations of electronic devices and circuits demand new means of investigation to check the possibility of introducing new problems or new versions of old issues. New devices with new designs and materials require failure analysis to find new models for both individual failure mechanism and also the possible interaction between them. Understanding these potential interactions is particularly important and requires serious investigation.
In the sub-micrometer region, the demand for higher performance is in conflict with reliability. Proper tradeoffs in the early design stage are a dominating challenge. After performing a quick and effective reliability analysis (like the one performed for this project), both a lifetime estimation for the device and a failure mechanism dominance hierarchy are achieved. Using reliability knowledge and improvement techniques, higher reliability integrated circuits can be developed using two methods: suppression of die-level failure mechanisms and the adjustment of circuit structures. This has been realized for electromigration (through Black's equation) using design techniques; however, adjusting transistor sizes runs counter to the industry-wide aim of device scaling. Redesign of transistor architecture and circuit schematics is too resource intensive, in both time and cost, to be the corrective action for reliability concerns. The end user must decide what reliability goals need to be achieved; moreover, it has become the user’s responsibility to determine how to achieve those goals without any influence on component design, manufacturing, or quality. This type of reliability assessment is crucial for the end user, as adjustments to electrical conditions and thermal management seem to be the only way to improve reliability of modern technology nodes. The tradeoff in performance can be significantly reduced by using devices from larger technology nodes, as they provide larger operating tolerances and the architectures necessary to reduce the effects of multiple mechanism degradation behaviors.
As technology shifts to the smaller nodes, the operating voltage of the device is not reduced proportionally with the gate oxide thickness, which results in a higher electric field; moreover, the increasing density of transistors on a chip causes more power dissipation and in turn increases operating temperature through self-heating. Conversely, introducing nitrogen into the dielectric to aid in gate leakage reduction together with boron penetration control has its own effect - linearly worsening NBTI and other modes of degradation. Because the threshold voltage of new devices is not being reduced proportionally to the operating voltage, there is more degradation for the same threshold voltage.
There has been steady progress over the years in the development of a physics-of-failure understanding of the effects that various stress drivers have on semiconductor structure performance and wearout. This has resulted in better modeling and simulation capabilities. Early investigators sought correlations between the degradation of single device parameters (e.g. Vth, Vdd or Isub) and the degradation of parameters related to circuit performance such as the delay between read and write cycles. It was quickly realized that the degradation of a broad range of parameters describing device performance had to be considered, rather than just a single parameter. Most of the simulation tools tend to simulate a single failure mechanism such as electromigration, TDDB, NBTI or HCI. System-level simulators attempting to integrate several mechanisms into a single model have been developed as well. The latest circuit design tools, such as Cadence Ultrasim and Mentor Graphics Eldo, have integrated reliability simulators. These simulators model the most significant physical failure mechanisms and help designers address the lifetime performance requirements. However, inadequacies, such as complexity in the simulation of large-scale circuits and a lack of prediction of wearout mechanisms, hinder broader adoption of these tools.
Reliability simulations are commonly based on combinations of PoF models, empirical data and statistical models developed over the years by different research groups and industries. The inevitable consequence of a wide range of models and approaches is a lack of confidence in the predictions obtained for any given model. From the point of view of a real-world end-user, single failure mechanism modeling and simulation is less meaningful than system-level reliability.
Validation and calibration of simulations is accomplished by comparing simulation predictions with empirical data obtained from laboratory tests or by analyzing field data (or both). To evaluate the reliability of their devices, semiconductor manufacturers use laboratory tests such as environmental stress screening (ESS), highly accelerated life testing (HALT), HTOL and other accelerated life tests (ALT). Several concerns cast doubt on the prediction accuracy derived from such tests. The assumption of a single failure mechanism is an inaccurate simplification of actual failure dynamics. Furthermore, ALT tests based on sampling a set of devices have the inherent problem of a lack of statistical confidence in the case of zero observed failures. Finally, ALT tests can only mimic actual field conditions to estimate real-world reliability, and extrapolation from test environmental stresses to field stresses can be misleading.
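The zero-failure problem can be made concrete. When a test ends with no failures, it yields only an upper confidence bound on the failure rate, which for a constant-rate (exponential) model is -ln(1 - CL) divided by the accumulated device-hours. The sketch below assumes that exponential model; the sample size, test duration, and confidence level are hypothetical and are not taken from the study.

```python
import math


def zero_failure_upper_bound_fit(devices, hours_each, confidence):
    """Upper confidence bound on a constant failure rate, in FIT, after a
    test with zero observed failures (exponential model assumed).

    The bound applies at the test stress conditions, before any
    acceleration factor is used to extrapolate to field conditions.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    if devices <= 0 or hours_each <= 0:
        raise ValueError("devices and hours_each must be positive")
    device_hours = devices * hours_each
    rate_per_hour = -math.log(1.0 - confidence) / device_hours
    return rate_per_hour * 1e9  # FIT = failures per billion device-hours


# Hypothetical test: 77 devices, 1000 hours each, 60% confidence.
bound_77 = zero_failure_upper_bound_fit(77, 1000.0, 0.60)

# Doubling the sample size only halves the bound.
bound_154 = zero_failure_upper_bound_fit(154, 1000.0, 0.60)
```

Under these assumptions the bound is on the order of ten thousand FIT at stress conditions, so a clean test result on a small sample says little by itself about the true failure rate.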
The dominant failure mechanisms in Si-based microelectronic devices that are most commonly simulated are EM, TDDB, NBTI and HCI. Other degradation models do exist but are less prevalent. These mechanisms can be generally categorized as either steady-state failure modes (EM and TDDB) or wearout failure modes (NBTI and HCI). A brief explanation of each failure mechanism is necessary to understand its contribution to the overall device failure rate.
Electromigration can lead to interconnect failure in an integrated circuit. It is characterized by the migration of metal atoms in a conductor in the direction of the electron flow. Electromigration causes opens or voids in some portions of the conductor and corresponding hillocks in other portions.
Time Dependent Dielectric Breakdown is caused by the formation of a conducting path through the gate oxide to the substrate due to an electron tunneling current. If the tunneling current is sufficient, it will cause permanent damage to the oxide and surrounding material. This damage will result in performance degradation and eventual failure of the device. If the tunneling current remains very low, it will increase the field necessary for the gate to turn on and impede its functionality. The gate dielectric breaks down over a long period of time for devices with larger feature sizes (>90 nm) due to a comparatively low electric field. Although core voltages have been scaled down as feature sizes have shrunk, supply voltages have remained constant. These field strengths are an even greater concern since high fields exacerbate the effects of TDDB.
Negative Bias Temperature Instability occurs only in pMOS devices stressed with a negative gate bias voltage while at elevated temperatures. Degradation occurs in the gate oxide region, allowing electrons and holes to become trapped. NBTI is driven by smaller electric fields than hot carrier injection, which makes it a more significant threat at smaller technology nodes where increased electric fields are used in conjunction with smaller gate lengths. The interface trap density generated by NBTI is found to be more pronounced with thinner oxides.
Hot Carrier Injection occurs in both nMOS and pMOS devices stressed with drain bias voltage. High electric fields energize the carriers (electrons or holes), which are injected into the gate oxide region. As with NBTI, the degraded gate dielectric can then more readily trap electrons or holes, causing a change in threshold voltage, which in turn results in a shift in the subthreshold leakage current. HCI is accelerated by an increase in bias voltage and is the predominant mechanism at lower stress temperatures. Therefore, hot carrier damage, unlike the other failure mechanisms, will not be accelerated by HTOL tests, which are commonly used for accelerated life testing.
A trending study was performed to understand the risk associated with reduction in feature size to facilitate better design decisions and mitigation strategies. Five component types were identified:
One component of each type from each of five technology nodes (0.35 µm, 0.25 µm, 0.18 µm, 0.13 µm and 90 nm) was selected for analysis to show lifetime trends. The components selected are industrial grade. Thermal characteristics were researched and electrical parameters, commonly found on the component's datasheet, were identified for the calculations. Component complexity and electrical characteristics were extracted from corresponding component documentation for use in the calculator. The results of the calculations are used to correlate expected life for each component to technology node for a specified use environment (identified as 65°C).
Research showed that conductor material improvements were made around the 0.18 micron node and later to reduce the effects of electromigration. The resulting trend shows a reduction in failure rate from electromigration. However, as feature sizes decrease, the wearout effects of hot carrier injection and negative bias temperature instability become more prevalent. Two failure models for TDDB are used. Research shows that the applicable electro-chemical models for TDDB follow the dielectric (oxide) thickness at each node. A change occurs when scaling passes 5nm in thickness (corresponding to the 0.13 micron node). TDDB becomes a constant failure rate as oxide thickness approaches 1nm. Above 5nm, however, failure rate increases as the thickness approaches this turn-over point. Differences in trending can be seen for each failure mechanism:
The combined failure rate graph for the microprocessor device type is shown in Figure 3. This graph shows that a technology node dependent trend does exist for failure rates. As feature sizes are scaled down, failure rate does increase. The microprocessor device type is a prime example of this trending as the electrical and thermal conditions of these parts are consistent over each technology node.
The science behind the visible trends of each failure mechanism across the technology nodes is worth discussing. Consider 90nm technology as an appropriate starting-point for future technology node trending. A main differentiation between 0.35 micron and 90nm is conductor materials. Electromigration is directly influenced by this, which is why industry has made process improvements to both reduce the effects of EM through metallurgical improvements, and development of design rules to mitigate EM. The former increases the activation energy required to start degradation from ~0.6eV to ~0.75eV. However, even with design rules, i.e. Black's law, the latter can only forestall EM for a finite period of time by ensuring properly laid out geometries of traces and interconnects on die. It is unknown at this point in time whether or not any more improvements will be made to conductor metals (Al, Al + Cu) beyond what has already been done. The overall trend of electromigration is lower reliability and lifetime trending which shows reduction in lifetimes as a result of feature scaling. However, it can be considered a constant additive to failure rate because the trend is two plateaus (three or more if the material changes again). The failure rate constituent from EM will likely be the same for future nodes.
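The effect of the activation energy change can be illustrated with the temperature term of Black's equation, MTTF = A · J^(-n) · exp(Ea / kT). The sketch below compares the thermal acceleration factor between a stress temperature and the 65°C use environment for the two activation energies quoted above. It is an illustration only: the 125°C stress temperature is an assumed value, the current-density term is taken to be identical at both conditions, and the function names are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin."""
    return temp_c + 273.15


def thermal_acceleration_factor(ea_ev, use_temp_c, stress_temp_c):
    """Ratio of MTTF at use temperature to MTTF at stress temperature,
    taken from the exp(Ea / kT) term of Black's equation with the
    current-density term held equal at both conditions."""
    t_use = celsius_to_kelvin(use_temp_c)
    t_stress = celsius_to_kelvin(stress_temp_c)
    exponent = (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_use - 1.0 / t_stress)
    return math.exp(exponent)


USE_TEMP_C = 65.0      # use environment quoted in the text
STRESS_TEMP_C = 125.0  # assumed stress temperature, for illustration only

af_older_metal = thermal_acceleration_factor(0.60, USE_TEMP_C, STRESS_TEMP_C)
af_newer_metal = thermal_acceleration_factor(0.75, USE_TEMP_C, STRESS_TEMP_C)
```

With these assumed temperatures the factor is roughly 22 for 0.6 eV and roughly 48 for 0.75 eV, so the same stress-test lifetime extrapolates to a little more than twice the field lifetime for the higher activation energy.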
Although small compared to EM and TDDB at these nodes, hot carrier injection (HCI) and negative bias temperature instability (NBTI) contributions to failure rates increase as features are scaled down. Hot carrier injection will be almost negligible at high temperatures, i.e. 65°C operating environment. Although 65nm and 45nm process data are not currently included in this calculator, the projected contribution to failure rate of both of these failure mechanisms will surely increase and exceed those of EM and TDDB (which are also constant, as mentioned above).
The inverted trend of TDDB has to do with voltage tolerances of each component type. Above 5nm oxide thickness (0.25 micron and 0.35 micron nodes), the influence of TDDB is directly related to the electric field on the gate and in turn the voltage on it. The industry accepted reliability models at these nodes are different from that at 0.18 micron, 0.13 micron and 90nm. TDDB is trending toward a time independent mechanism and will induce random failures instead of acting as a wearout function. Reasons for this include all possible failure sites being sufficiently large within the bounds of the electric field to cause instantaneous failure. The effects of TDDB from 0.18 micron down to likely the ~32nm node will be a plateau just like EM. It is driven by voltage rather than magnitude of the electric field on the gate oxide. Therefore, when considering "old technology" as 180, 130 and even 90nm compared to the high performance 65 and 45nm, the effects of both EM and TDDB will be the same. For trending purposes, these contributions could be subtracted out altogether. This would result in increasing failure rate trends for all analyzed device types as feature sizes are scaled down.
The simulation tool used for this research is a web-based application based on recent PoF circuit reliability prediction methodologies that were developed by the University of Maryland (UMD), in cooperation with AVSI, for 130 nm and 90 nm devices. The two methods developed are referred to by the acronyms FaRBS (Failure-Rate-Based SPICE) and MaCRO (Maryland Circuit Reliability-Oriented). FaRBS is a reliability prediction process that uses accelerated test data and PoF based die-level failure mechanism models to calculate the failure rate of integrated circuit components during their useful lifetime. As its name implies, it uses mathematical techniques to determine the failure rate of an integrated circuit. MaCRO contains SPICE (Simulation Program with Integrated Circuit Emphasis) analyses using several different commercial applications, wearout models, system reliability models, lifetime qualification, and reliability and performance tradeoffs in order to achieve system and device reliability trends, prediction and analysis. The simulation tool implements two simplified approaches to compute reliabilities:
These approaches are used to determine each failure mechanism's contribution to overall device failure. The ITB approach makes two assumptions:
Conversely, DTB utilizes back-end SPICE simulation to determine these contributions based on transistor behavior and circuit function. Using these mechanism weighting factors, sub-circuit cell counts, and transistor quantities, an overall component failure rate is calculated.
The software assumes that all the parameters for these models are technology node dependent. Although many different intermediate process technologies can be identified for devices under analysis, only major nodes are used. Nodes considered major nodes of CMOS processes on the International Technology Roadmap for Semiconductors (ITRS) reflect a trend of 70% scaling every 2-3 years and fall within the projections of Moore's Law. It is assumed that the technology qualification (process qualification) has been performed and at least one screening process has been applied before a device is packaged. This reliability prediction covers the steady-state random failures and wearout portions of the bathtub curve.
Each failure mechanism described above has a failure rate, $\lambda_i$, driven by a combination of temperature, voltage, current, and frequency. Parametric degradation of each type affects the on-die circuitry in its own unique way; therefore, the relative acceleration of each one must be defined and averaged for the applied condition. The failure rate contribution of each can be normalized by taking into account the effect of the weighted percentage of that failure rate. We ignore interactions between failure mechanisms for practical reasons, although more rigorous studies of potential interactions could be made in the future. For the four mechanisms of EM, HCI, NBTI and TDDB, the normalized failure rates can be defined as $\lambda_{EM}$, $\lambda_{HCI}$, $\lambda_{NBTI}$ and $\lambda_{TDDB}$ respectively. In order to achieve more accuracy in the overall failure rate estimation, it is useful to split the IC into equivalent function sub-circuits and refer to it as a system of functional group cells, for example: 1 bit of SRAM, 1 bit of DRAM, one stage of a ring oscillator, and select modules within analog-to-digital converter (ADC) circuitry. For each functional group type, the failure rate can be defined as a weighted summation of each failure rate type multiplied by a normalization constant for the specific failure mechanism:

$$\lambda_{1F} = \sum_{i} K_{i,F}\,\lambda_i \tag{1}$$
Where $\lambda_{1F}$ is the failure rate of one unit of functional group $F$, $K_{i,F}$ is a constant defined by the weight percentage of functional group $F$ as it affects the $i$th failure mechanism, and $\lambda_i$ is the normalized failure rate of the $i$th failure mechanism. For example, the failure rate of electromigration affecting a DRAM group would be $K_{EM,DRAM} \times \lambda_{EM}$, where $K_{EM,DRAM}$ is a constant defining the weight percentage that DRAM has on the normalized electromigration failure rate. The overall DRAM failure rate per functional group, $\lambda_{1DRAM}$, is:

$$\lambda_{1DRAM} = K_{EM,DRAM}\,\lambda_{EM} + K_{HCI,DRAM}\,\lambda_{HCI} + K_{NBTI,DRAM}\,\lambda_{NBTI} + K_{TDDB,DRAM}\,\lambda_{TDDB} \tag{2}$$
Where $K_{EM,DRAM}$ is a constant defined by the weight percentage that DRAM has on EM, $\lambda_{EM}$ is the normalized failure rate of EM, $K_{HCI,DRAM}$ is a constant defined by the weight percentage that DRAM has on HCI, $\lambda_{HCI}$ is the normalized failure rate of HCI, $K_{NBTI,DRAM}$ is a constant defined by the weight percentage that DRAM has on NBTI, $\lambda_{NBTI}$ is the normalized failure rate of NBTI, $K_{TDDB,DRAM}$ is a constant defined by the weight percentage that DRAM has on TDDB and $\lambda_{TDDB}$ is the normalized failure rate of TDDB. Accounting for the probability that a specific functional group is operationally active at the instant when failure occurs modifies Equation (1):

$$\lambda_F = P_F\,\lambda_{1F} \tag{3}$$
Where $\lambda_F$ is the failure rate of a functional group as the cause of the potential failure of the device under analysis and $P_F$ is the probability that the functional group was operational during failure. The total failure rate of a component, $\lambda_T$, can be defined as the summation, over all functional group types, of the number of units of each functional group multiplied by the failure rate of that functional group type:

$$\lambda_T = N_F \sum_{F=1}^{N} A_F\,\lambda_F \tag{4}$$
Where $\lambda_T$ is the failure rate of the component under analysis, $N_F$ is the total number of functional groups in the component, $N$ is the number of functional group types and $A_F$ is the ratio of the number of units of the $F$th functional group type to the total number of functional groups that exist in the component under analysis. The prediction process is demonstrated in Figure 4.
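The weighted-sum calculation described above can be sketched in a few lines. This is a minimal illustration rather than the tool's implementation: the weighting constants, normalized mechanism rates, activity probabilities, and unit counts are all hypothetical, and the component total is computed as the unit count of each functional group type times its failure rate, following the verbal description in the text.

```python
MECHANISMS = ("EM", "HCI", "NBTI", "TDDB")


def unit_group_rate(weights, mechanism_rates):
    """Failure rate of one unit of a functional group: the sum over the
    four mechanisms of (weighting constant K) * (normalized rate)."""
    return sum(weights[m] * mechanism_rates[m] for m in MECHANISMS)


def component_rate(groups, mechanism_rates):
    """Total component failure rate: for each functional group type,
    unit count * activity probability * per-unit rate, summed over types."""
    total = 0.0
    for group in groups:
        per_unit = unit_group_rate(group["weights"], mechanism_rates)
        active = group["p_active"] * per_unit
        total += group["count"] * active
    return total


# Hypothetical normalized mechanism failure rates (arbitrary units).
rates = {"EM": 4e-6, "HCI": 1e-6, "NBTI": 2e-6, "TDDB": 3e-6}

# Hypothetical functional groups; the weights are assumed to sum to 1.
groups = [
    {"name": "DRAM bit", "count": 1_000_000, "p_active": 0.5,
     "weights": {"EM": 0.4, "HCI": 0.1, "NBTI": 0.2, "TDDB": 0.3}},
    {"name": "control logic", "count": 10_000, "p_active": 1.0,
     "weights": {"EM": 0.25, "HCI": 0.25, "NBTI": 0.25, "TDDB": 0.25}},
]

dram_unit_rate = unit_group_rate(groups[0]["weights"], rates)
total_rate = component_rate(groups, rates)
```

Keeping the per-unit rate, the activity probability, and the unit count as separate factors mirrors the structure of the text, so that a change in operating conditions (which alters the mechanism rates) can be evaluated without re-describing the component's complexity.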
An assembly, the output from a system reliability assessment application, and/or the Bill of Materials (BOM) for an assembly is examined for complex integrated circuits that could be analyzed with the Integrated Circuit Lifetime Prediction calculator. The current limitations of the software are number of functional group types and technology node data beyond 90nm. Thermal characteristics are researched and electrical parameters, commonly found on the component's datasheet, are identified for the calculations. Component complexity and electrical characteristics are extracted from corresponding component documentation for use in the calculator. Thermal parameters for field conditions are acquired through prototyping, direct thermal measurements, and simulations.
Preliminary analysis of the device uses a process that divides an integrated circuit into smaller functional blocks to apply acceleration factors at the most basic level. Equivalent function sub-circuits are used as part of the calculator to organize the complexity of the integrated circuit being analyzed into functional group cells, i.e. one (1) bit of DRAM. As an example, the functional group block diagram for National Semiconductor's 12-bit ADC component, ADC124S021, is shown below. It contains a multiplexer group, track and hold function, control logic, and 12-bit analog-to-digital converter.
The standard procedure for integrated circuit analysis extrapolates from high temperature operating life (HTOL) test conditions:
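The HTOL ambient temperature was calculated for each component (except when supplied by the manufacturer). Thermal information was obtained from the datasheet and/or thermal characteristic documentation and each manufacturer's website. Ambient temperature is calculated from junction temperature, power dissipation, and junction-to-air thermal resistance using (5):

$$T_A = T_J - P_D\,\theta_{JA} \tag{5}$$

where $T_A$ is the ambient temperature, $T_J$ is the junction temperature, $P_D$ is the power dissipation and $\theta_{JA}$ is the junction-to-air thermal resistance.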
Junction-to-air thermal resistance was obtained either from a component's datasheet or from thermal characteristic databases for package type and size, e.g. the Texas Instruments or NXP Semiconductors websites. The ambient temperature calculation for an example component is shown in (6).
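The ambient temperature calculation follows the standard junction-to-ambient thermal relation, in which junction temperature equals ambient temperature plus power dissipation times junction-to-air thermal resistance. A minimal sketch is given below; the function name and the input values are hypothetical and do not correspond to the example component referenced above.

```python
def ambient_temperature_c(junction_temp_c, power_w, theta_ja_c_per_w):
    """Ambient temperature (deg C) from junction temperature (deg C),
    power dissipation (W) and junction-to-air thermal resistance (deg C/W)."""
    if power_w < 0 or theta_ja_c_per_w < 0:
        raise ValueError("power and thermal resistance must be non-negative")
    return junction_temp_c - power_w * theta_ja_c_per_w


# Hypothetical inputs: 125 deg C junction, 0.5 W dissipation, 40 deg C/W package.
t_ambient = ambient_temperature_c(125.0, 0.5, 40.0)
```

For these assumed inputs the self-heating term is 20°C, which places the ambient temperature at 105°C.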
Inputs on the calculator are the test parameters and results from the standard JEDEC accelerated test and information pertaining to the integrated circuit:
An extensive field study was conducted in order to demonstrate the simulation tool and verify its prediction capabilities. Reliability predictions were performed based on field failures of DRAM, microcontrollers and microprocessors, as shown in Table 1.
The field data were extracted from a Motorola database which encompasses shipments and customers’ claims. Unique identifiers of each product and failure enabled the detailed statistical field analysis. The ICs were assembled on boards belonging to a family of communication products shipped during 2002-2009. Component complexity and electrical characteristics were extracted from corresponding component documentation.
As the shipping and failures are recorded continuously, several reliability measurements can be performed. The first one is the monthly failure rate, which is expressed as:
The second one is the cumulative failure rate:
If the cumulative failure rate is approximately constant, the exponential distribution may be used for times to failure. Thus, a rough estimate of the failure rate may be obtained by dividing the total number of failures by the cumulative working months:
In order to estimate the part failure rate in FIT (failures per billion device hours), the mission profile must be considered:
The microcontroller (MC68HC908SR12CFA) is used here to illustrate the process for acquiring environmental information and determining its failure rate based on field data. A similar process was performed for the other four ICs. A total of 96 microcontrollers were replaced during a cumulative total of 595,412 working months. Figure 6 shows the failed ICs and cumulative months vs. operating time. The monthly and cumulative failure rates are displayed in Figure 7.
The monthly failure rate plot is rather noisy, while the cumulative plot allows visualization of the overall trend. The cumulative failure rate exhibits a steady state failure rate of approximately 0.02% per month. We can then calculate a rough estimate of the failure rate, λ, and the mean time to failure, MTTF:
Converting the MTTF from months into hours requires estimation of the user mission profile. Since the microcontroller operates 24 hours a day (730 hours per month), we can roughly estimate the card MTTF as 4,527,612 hours which corresponds to 220 FIT.
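The rough estimate for the microcontroller can be reproduced with the short Python sketch below. It uses only the figures quoted in the text (96 failures, 595,412 cumulative working months, 730 operating hours per month) and assumes a constant failure rate; the function and variable names are illustrative and do not come from the simulation tool.

```python
HOURS_PER_MONTH = 730  # 24 h/day operation, as stated for the microcontroller


def rough_failure_rate(failures, working_months, hours_per_month=HOURS_PER_MONTH):
    """Constant-rate estimate from total failures and cumulative working months."""
    lam_per_month = failures / working_months    # failures per device-month
    mttf_months = working_months / failures      # mean time to failure, months
    mttf_hours = mttf_months * hours_per_month   # apply the mission profile
    fit = 1e9 / mttf_hours                       # failures per 1e9 device hours
    return lam_per_month, mttf_months, mttf_hours, fit


lam, mttf_months, mttf_hours, fit = rough_failure_rate(96, 595_412)
print(f"lambda = {lam:.4%} per month")
print(f"MTTF   = {mttf_months:,.0f} months = {mttf_hours:,.0f} hours")
print(f"rate   = {fit:.1f} FIT")
```

Changing `hours_per_month` is how a different mission profile, such as a 16 hours/day duty cycle, would enter the same calculation.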
Each failed IC is assigned to a single product with a unique serial number. Linking the shipment data and the failure data by serial number enabled the following Weibull analysis. Figure 8 presents a Weibull probability plot of the time to failure with a 95% confidence interval (the time scale is days, not months). The start time for the failed and censored data is taken to be the shipment date. The failure date is taken to be the date on which the customer issued a claim to the depot. The sensitivity of the results to those assumptions was analyzed and found to be minor over the long term for large populations.
Excellent correlation to the exponential distribution was obtained, with β=1.0266. A null hypothesis of β=1 was tested in order to establish the justification for use of the exponential distribution. A Bonferroni test at confidence level of 95% provided lower and upper limits for β of 0.8758 and 1.203 respectively, with a p-value of 0.746. Thus the exponential distribution can be assumed, as shown in the following probability plot (Figure 9):
Using the exponential distribution, the MTTF is 141,824 days. The lower and upper confidence limits are 119,072 and 168,923 days, respectively. Assuming 24 hours per day, the MTTF is equal to:
This corresponds to 293 FIT with lower and upper confidence limits of 247 and 350 FIT respectively. This is slightly above the rough estimate of 220 FIT. Weibull analyses for all the analyzed ICs showed similar justification for the use of the exponential distribution.
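As a quick check of the conversion from the Weibull-plot MTTF (in days) to FIT, the sketch below repeats the arithmetic in Python. The helper name is an assumption made for illustration. Note that the longest MTTF limit maps to the lowest failure rate, and vice versa.

```python
HOURS_PER_DAY = 24  # microcontroller is assumed to be stressed around the clock


def fit_from_mttf_days(mttf_days, hours_per_day=HOURS_PER_DAY):
    """Convert an MTTF in days to FIT (failures per 1e9 device hours)."""
    return 1e9 / (mttf_days * hours_per_day)


# MTTF point estimate and confidence limits quoted in the text, in days.
mttf_days, mttf_lower_days, mttf_upper_days = 141_824, 119_072, 168_923

fit_point = fit_from_mttf_days(mttf_days)
fit_lower = fit_from_mttf_days(mttf_upper_days)  # longest MTTF -> lowest rate
fit_upper = fit_from_mttf_days(mttf_lower_days)  # shortest MTTF -> highest rate

print(f"{fit_point:.1f} FIT (limits {fit_lower:.1f} to {fit_upper:.1f} FIT)")
```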
The reliability calculations are based on the time domain of the host computer. Except for the microcontroller, which is stressed 24 hours a day, we assume that memory parts and the processor are partly stressed depending on the user profile. A conservative assumption is that a regular user will stress the parts two shifts/day, i.e. 16 hours/day. The predicted failure rates were calculated using the methodology described in the section The Simulation Tool. Field and test condition inputs were extracted from the components' datasheets. These calculator inputs are shown in Table 2. The functional group distribution of each IC was found in each component's description as provided on the datasheets. Table 3 shows the field failure rates, as obtained using Eq. (8), and the corresponding results of the predictions. Figure 10 shows the comparison of the field failure rates and the prediction results, along with the 95% confidence intervals obtained by the Weibull analysis.
It should be noted that the DRAM failure rates presented in Table 3 and Figure 10 refer to critical faults which forced the user to replace the part. They do not reflect specific rates of different kinds of errors (correctable or non-correctable data errors or single event upsets from radiation) but rather a complete part failure rate.
JEDEC standards define a family of accelerated life tests, of which the one most commonly used for estimating component failure rate is HTOL (JEDEC Standard number 47D, “Stress-Test-Driven Qualification of Integrated Circuits”). It consists of stressing 77 pieces per qualification lot for an extended time, usually 1,000 hours, at an accelerated voltage and temperature (typically +125°C). The stated purpose of HTOL testing is to simulate device operation at elevated temperatures and higher than nominal operating voltages to provide sufficient acceleration to simulate many years of operation at ambient temperatures (typically +55°C). The data obtained from the HTOL test are traditionally translated to a lower temperature by using the Arrhenius temperature acceleration model. In a standard HTOL where 77 parts per lot are taken from 3 different lots (231 parts in total) and tested for 1,000 hours at +125°C, the acceleration factor, AF, calculated using the Arrhenius model would be 78 [assuming: 1) Ea = 0.7 eV, 2) the ambient temperature is +55°C, and 3) temperatures refer to junction]. The equivalent field time is ~18 million device hours (231 parts × 1,000 hours × 78). In case of zero failures in test, the upper limit for the failure rate at 60% confidence would be 51 FIT. Compared with the field failure rates of several hundred FIT reported above, it is clear that the failure rate predicted from HTOL is misleading.
Activation energy is the parameter used to express the degree of acceleration related to temperature. Each failure mechanism has its own activation energy value (JEDEC Publication No. 122B). However, it is traditional to use an activation energy of 0.7 eV, as this is generally assumed to be the average activation energy for failure mechanisms that occur during the useful life of a device. This useful life lies beyond the early stage of infant mortality failures (defect driven failures). Industry widely uses this value of 0.7 eV in the following two cases:
The goal of HTOL is to gain the maximum possible acceleration and so accumulate the maximum equivalent field time with zero failures. Assuming a higher activation energy serves this goal, but it also reduces the failure rate upper limit. For example, assuming an activation energy of 1.0 eV instead of 0.7 eV raises the acceleration factor to 504 instead of 78 (about 6.5 times higher). On the other hand, the failure rate upper limit drops from 51 FIT to only 8 FIT, which is even more overoptimistic; see Table 4.
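The HTOL figures quoted above can be approximately reproduced with the Python sketch below. It assumes the usual Arrhenius form AF = exp[(Ea/k)(1/T_use - 1/T_stress)] and the standard zero-failure chi-square bound, which for zero failures reduces to -ln(1 - confidence) divided by the equivalent device hours. The function names are illustrative, and the results can differ from the quoted 78, 504, 51 and 8 by a small amount depending on how Boltzmann's constant and the Celsius-to-Kelvin offset are rounded.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K (rounded)


def arrhenius_af(ea_ev, t_use_c, t_stress_c):
    """Arrhenius temperature acceleration factor between use and stress."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_stress_k))


def zero_failure_fit_upper(equivalent_device_hours, confidence):
    """Upper failure rate limit in FIT when no failures were observed.

    Uses chi-square(confidence, 2 dof) / 2 = -ln(1 - confidence).
    """
    return -math.log(1.0 - confidence) / equivalent_device_hours * 1e9


parts, test_hours = 3 * 77, 1000  # three lots of 77 parts, 1,000 h each

for ea in (0.7, 1.0):
    af = arrhenius_af(ea, t_use_c=55.0, t_stress_c=125.0)
    equivalent_hours = parts * test_hours * af
    fit_upper = zero_failure_fit_upper(equivalent_hours, confidence=0.60)
    print(f"Ea = {ea:.1f} eV: AF = {af:.0f}, "
          f"equivalent time = {equivalent_hours / 1e6:.0f} M hours, "
          f"upper limit = {fit_upper:.0f} FIT")
```

The sketch makes the dependence explicit: with zero failures the reported FIT value is fixed entirely by the assumed activation energy, sample size, test time and confidence level, not by any observed failure behavior.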
The validation study has shown strong correlation between the field failure rates and the rates obtained by the prediction tool. The results in Figure 10 clearly demonstrate the accuracy and repeatability of the multi-mechanism model to predict the field performance of complex integrated circuits.
The simulated estimates lie well within the confidence intervals except for the Intel processor, where a small deviation of 60 FIT was observed. The small deviation between the rough estimates and the point estimates obtained from the statistical plots justifies the use of the exponential distribution. For memories, an average failure rate of 720 FIT was observed, with an average deviation of 10% between the field and simulated failure rates. The average interval of the field failure rate (upper limit minus lower limit) is 280 FIT. Considering that the node technology of the 512MB DRAM is quite similar to that of the 1GB DRAM (100 nm and 110 nm, respectively), both parts exhibit the same failure rate of 0.8 FIT per 1 MB. In contrast, the 256MB DRAM, with 689 FIT, does not follow this projection, which would have led to a failure rate of 205 FIT. This is explained by the lower accelerated test ambient temperature to which the 256MB DRAM is exposed, relative to the other two memories.
Nevertheless, components whose predicted failure rate is relatively large compared to similar device types, e.g. the 1GB DRAM, might be categorized as more sensitive to electro-thermal tolerances. They will be subjected to greater stresses at the peripheries of these sensitive operating ranges. Components with large operating ranges are typically operated at an average nominal value. Therefore, small fluctuations away from the mean of these larger ranges will not excessively stress the components. A graphical depiction of this is shown in Figure 11.
The microcontroller and the processor experienced lower failure rates than the memories. Their average failure rate is 220 FIT, with an interval of 120 FIT.
In comparison to the HTOL prediction, it can be seen that standard HTOL testing should not be used for failure rate prediction, as it produces estimates that are too low. This is due to two limitations for standard HTOL: the multi-mechanism and zero failure limitations.
The multi-mechanism limitation: Ideally, yet unrealistically, a complete lifetime distribution of a component would be generated under a fixed set of accelerated loading conditions, from which inferences about the reliability of the component under a different set of loading conditions could be made confidently using a proven acceleration model. One accelerated test, such as HTOL, cannot stimulate all the major failure mechanisms (e.g. HCI), and the acceleration factor obtained for some of them is negligible. Under the assumption of multiple failure mechanisms, each will be accelerated differently depending on the physics of that mechanism. If an HTOL test is performed at an arbitrary voltage and temperature, with acceleration based only on a single failure mechanism, then only that mechanism will be reasonably accelerated. In that instance, which is generally true for most devices, the reported FIT rate (especially one based on zero failures) will be meaningless with respect to the other failure mechanisms.
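The effect of extrapolating with a single acceleration factor can be illustrated with a small Python sketch. Everything numerical here is a labeled assumption: two hypothetical mechanisms with equal field failure rates, one with an activation energy of 0.7 eV and one that is essentially temperature-independent (0.0 eV), combined with a simple sum-of-failure-rates model. It is not the model implemented in the simulation tool.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K (rounded)


def arrhenius_af(ea_ev, t_use_c=55.0, t_stress_c=125.0):
    """Arrhenius temperature acceleration factor between use and stress."""
    t_use_k = t_use_c + 273.15
    t_stress_k = t_stress_c + 273.15
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * (1.0 / t_use_k - 1.0 / t_stress_k))


# Hypothetical mechanisms: name -> (activation energy in eV, field rate in FIT).
mechanisms = {
    "mechanism_a": (0.7, 100.0),  # well accelerated by temperature
    "mechanism_b": (0.0, 100.0),  # negligible temperature acceleration
}

# True field failure rate under the sum-of-failure-rates assumption.
true_field_fit = sum(rate for _, rate in mechanisms.values())

# Rate seen under HTOL stress: each mechanism is scaled by its own AF.
stress_fit = sum(rate * arrhenius_af(ea) for ea, rate in mechanisms.values())

# Conventional extrapolation back to the field with a single 0.7 eV AF.
estimated_field_fit = stress_fit / arrhenius_af(0.7)

print(f"true field rate:      {true_field_fit:.0f} FIT")
print(f"single-AF estimate:   {estimated_field_fit:.0f} FIT")
```

Under these assumptions the single-factor extrapolation recovers little more than the contribution of the well-accelerated mechanism, so the weakly accelerated mechanism is almost invisible in the reported rate.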
The zero failure limitation: The fact that HTOL is a zero-failure test limits the statistical confidence of the predicted failure rate. Zero failures in HTOL is a poor indicator of the expected failure rate. To obtain statistical confidence, failures must be observed.
This work was performed by DfR Solutions and initially funded by Aero Engine Controls, Boeing, General Electric, the National Aeronautics and Space Administration, the Department of Defense, and the Federal Aviation Administration in cooperation with the Aerospace Vehicle Systems Institute (AVSI) for the 0.13 µm and 90 nm technology nodes. DfR is now working to extend the capability of the tool into smaller technology nodes, including 65 nm, 45 nm, and 32 nm. Several commercial organizations have indicated a willingness to assist with the development and validation of 45 nm technology through IC test components and acquisition of field failure data. Continued development would incorporate this information and would expand into functional groups relevant for analog and processor based (e.g. DSP and FPGA) integrated circuits.
IC components are finding their way into every major electronics application, across various industries and product categories. For the semiconductor industry, the major focus has been to keep up with Moore’s law and deliver on power, performance, area and cost (PPAC).
Reliability models based on physics-of-failure mechanisms have been developed for dynamic random access memories (DRAM), microcontrollers and microprocessors using a new software tool. Field data from a large fleet of mobile communications products that were deployed over a period of 8 years were analyzed to validate the tool’s accuracy. A strong correlation of 80% is demonstrated between measured and predicted values.