Root-Cause Analysis (RCA) of HALT Failures: Case Study

Cheryl Tulkoff, Greg Caswell, Nathan Blattau, and Craig Hillman

Introduction

A manufacturer of industrial controls equipment requested a review of, and commentary on, the performance of one of their new power supply products subjected to a classic Highly Accelerated Life Testing (HALT) activity. The HALT consisted of:

  • Cold Step Stress
  • Hot Step Stress
  • Rapid Thermal Transition
  • Vibration Step Stress
  • Combined Thermal and Vibration

This document reviews each failure mode, the point in the HALT process at which it occurred, and its root cause. It also assesses the relevance of each failure mode to the field environment and suggests possible corrective actions. To provide realistic and effective guidance, this report assumes an operating temperature range of 5°C to 50°C (semi-controlled industrial location) and a storage temperature range of -40°C to 65°C.

Cold Step Stress

The results from cold step stress testing are listed below:

  • At -30°C: Refresh rate of the quarter video graphics array (QVGA) TFT liquid crystal display (LCD) slowed. Refresh speed returned to normal when rechecked within the operational range of 5°C to 50°C.
    • First recoverable failure
  • At -40°C: LCD refresh stopped and the display went dark. Refresh speed returned to normal when rechecked within the operational range of 5°C to 50°C.
  • At -40°C: LCD failed to illuminate on one unit. This failure was non-recoverable.
    • First destruct failure
  • At -80°C: Units, not including the LCDs, stopped functioning due to excessive voltage drop. Operation was restored when the temperature returned to -70°C.

Cold Step Stress Root-Cause Analysis

The limited temperature range of liquid crystal displays (LCDs) is well known. The expected operating range of LCDs is typically -20°C to 70°C, so the operational performance of the LCD was appropriate. However, the storage performance of the LCD is potentially problematic because a non-recoverable failure occurred at -40°C, the lower bound of the defined storage range of -40°C to 65°C. The recoverable failure of the rest of the unit does not appear to be an issue, as the failure point is approximately 85°C (5°C vs. -80°C) below the lower limit of the expected operating temperature range.

Cold Step Stress Corrective Action

Several material or design limitations of modern LCD architectures could induce failure at low temperatures. The manufacturer specified the storage range of this LCD down to -40°C, and the product storage range also extends down to -40°C, so there is no margin between the storage specification and the destruct limit.
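The cold-side margins discussed above can be tabulated with a few lines of arithmetic. The following is a minimal sketch; the function name `cold_margin` is hypothetical, and the -40°C destruct failure is treated as the storage destruct limit, as the text above does.

```python
# Margin between a specified lower limit and the limit observed in HALT.
# A positive margin means the observed limit lies below (beyond) the spec.

def cold_margin(spec_low_c: float, observed_limit_c: float) -> float:
    """Return the cold-side margin in degrees C."""
    return spec_low_c - observed_limit_c

# Temperatures taken from the cold step stress results in this report
lcd_operating = cold_margin(spec_low_c=5, observed_limit_c=-30)   # first recoverable failure
lcd_storage = cold_margin(spec_low_c=-40, observed_limit_c=-40)   # first destruct failure
unit_operating = cold_margin(spec_low_c=5, observed_limit_c=-80)  # units stopped functioning

print(lcd_operating, lcd_storage, unit_operating)  # 35 0 85
```

The zero in the middle is the concern: the LCD destruct limit coincides with the storage specification.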

Hot Step Stress

The results from hot step stress testing are listed below:

  • At +110°C: One unit shut down. Unit did not recover.
    • First destruct failure

Hot Step Stress Root-Cause Analysis

Failure analysis identified the failure site as a compact, regulated, single-output DC/DC converter in a dual in-line package (DIP). Electrical characterization of the device found no output. No additional component-level failure analysis was authorized.

The manufacturer specifies the operating range of the DC/DC converter as -25°C to 71°C ambient and -25°C to 90°C for the case. Thermal measurements showed the DC/DC converter operating at approximately 7°C above ambient conditions.

As seen in the cold step stress results, there is significant margin on the cold side beyond specifications (-25°C vs. -80°C). There appears to be less margin on the hot side (either 39°C or 27°C, depending on whether the ambient or case rating is used), but this is not uncommon for component performance. There is some concern about the lack of a recoverable failure (e.g., voltage drop) before the destruct failure. It is DfR’s experience that a typical margin between recoverable and permanent failures is approximately 20°C for a significant population of components.
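The two hot-side margin figures follow from the converter ratings and the measured temperature rise. A minimal sketch of that arithmetic, assuming the measured 7°C rise above ambient can be read as the case temperature rise (the names below are hypothetical):

```python
FAILURE_AMBIENT_C = 110    # chamber temperature at the first destruct failure
RISE_ABOVE_AMBIENT_C = 7   # measured converter rise above ambient (approximate)
SPEC_AMBIENT_MAX_C = 71    # manufacturer's ambient rating
SPEC_CASE_MAX_C = 90       # manufacturer's case rating

def hot_margin(observed_c: float, spec_max_c: float) -> float:
    """Return the hot-side margin in degrees C (observed limit minus spec)."""
    return observed_c - spec_max_c

# Margin against the ambient rating
ambient_margin = hot_margin(FAILURE_AMBIENT_C, SPEC_AMBIENT_MAX_C)

# Margin against the case rating, with the case assumed to sit 7 C above ambient
case_margin = hot_margin(FAILURE_AMBIENT_C + RISE_ABOVE_AMBIENT_C, SPEC_CASE_MAX_C)

print(ambient_margin, case_margin)  # 39 27
```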

There are two potential options to assess the risk of this failure. The first option is to assume the DC/DC Converter did have a ‘recoverable’ failure (that is, it drifted out of specification) at some lower temperature, but the surrounding circuit was sufficiently robust to continue operating. This could be a problem as it could indicate little to no margin on the DC/DC Converter performance.

The second option is that there was no ‘recoverable’ failure. This is not uncommon in complex power devices, as they are typically designed very aggressively with regard to size and performance. An example of a failure mechanism that can explain this failure mode is the rupturing of an electrolytic capacitor rated to 85°C.

Hot Step Stress Corrective Action

In both scenarios, the failure temperature, whether around 90°C or at 110°C, is at least 40°C above the operating specification of 50°C. There are two potential corrective actions considering the temperature difference between the operating specification and the failure temperature. The first is to do nothing because sufficient margin was present. The second is to consider the DC/DC converter the weakest component and to determine if this may indicate a potential future risk which would require additional actions (supplier audits, construction analysis, tracking of performance parameters using SPC, etc.).
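The "around 90°C" figure can be read as the observed 110°C destruct temperature less the typical 20°C gap between recoverable and permanent failures. A minimal sketch of that reasoning, under the assumption that the 20°C rule of thumb applies to this converter (the names are hypothetical):

```python
DESTRUCT_C = 110                  # observed destruct temperature
RECOVERABLE_TO_DESTRUCT_C = 20    # typical gap quoted from DfR's experience
OPERATING_MAX_C = 50              # upper operating specification of the product

# Scenario 1: an unobserved recoverable failure occurred below the destruct limit
implied_recoverable_c = DESTRUCT_C - RECOVERABLE_TO_DESTRUCT_C

def margin_above_operating(failure_c: float) -> float:
    """Return the margin in degrees C between a failure temperature and the operating spec."""
    return failure_c - OPERATING_MAX_C

print(implied_recoverable_c)                          # 90
print(margin_above_operating(implied_recoverable_c))  # 40
print(margin_above_operating(DESTRUCT_C))             # 60
```

Either way, the margin above the 50°C operating specification is 40°C or more.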

Rapid Thermal Transition

The results from rapid thermal transitions are listed below:

  • One unit had a sticky relay (K2). This behavior was intermittent.

Thermal Transition Root-Cause Analysis

Intermittent failures should be taken very seriously whenever detected during product qualification. “Sticky” relays are often an indication of micro-welding, potentially due to timing issues or excessive current. In some manner, rapid thermal transitions may have aggravated the component or the circuit sufficiently to trigger this event, potentially indicating insufficient margin or robustness.

Thermal Transition Corrective Action

It was recommended that the customer begin assessing this failure by reviewing field returns and customer complaints for similar behavior in similar products. Additional review of the circuit or the component may also be appropriate.

Vibration Step Stress Testing

The results from vibration step stress testing are listed below:

  • At 15 Grms: Ground screw loosened
  • At 40 Grms: Horizontal lines noted in LCD
  • At 40 Grms: Power supply failure (no 160VDC indication).

Vibration Root-Cause Analysis

The LCD was not available for root-cause analysis. Visual inspection of the power supply identified broken ceramic bleeder resistors that had separated from their solder joints.

Vibration Corrective Action

One of the most important questions in assessing the results of a HALT test is determining its relevancy. Since the operating environment of the product is not expected to see any vibration, the vibration step stress test is, to some extent, assessing the robustness of the design to shipping and transportation loads. Therefore, an appropriate assessment must be based upon an understanding of the actual loads seen during shipping and transportation.

Three Power Spectral Density (PSD) profiles for shipping/transportation are shown in Figure 1 and Table 1. The first is an actual vibration profile experienced during transportation, while the second and third are profiles for testing for vibration loads experienced during transportation.
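The Grms value of a PSD profile is the square root of the area under the PSD curve. The sketch below computes it from a list of breakpoints, assuming the breakpoints are joined by straight lines on log-log axes, which is the usual convention for test profiles. The function name and the breakpoint values are hypothetical and are not taken from Table 1.

```python
import math

def grms_from_psd(breakpoints):
    """Return Grms for a PSD given as (frequency_hz, psd_g2_per_hz) pairs.

    Pairs must be in ascending frequency order. Each segment is treated as a
    straight line on log-log axes, i.e. PSD(f) = p1 * (f / f1) ** n.
    """
    area = 0.0
    for (f1, p1), (f2, p2) in zip(breakpoints, breakpoints[1:]):
        n = math.log(p2 / p1) / math.log(f2 / f1)  # log-log slope of the segment
        if math.isclose(n, -1.0):
            # Special case: the integral of 1/f is logarithmic
            area += p1 * f1 * math.log(f2 / f1)
        else:
            area += p1 / f1 ** n * (f2 ** (n + 1) - f1 ** (n + 1)) / (n + 1)
    return math.sqrt(area)

# Hypothetical flat profile: 0.01 g^2/Hz from 10 Hz to 110 Hz (area = 1.0 g^2)
flat_profile = [(10.0, 0.01), (110.0, 0.01)]
print(round(grms_from_psd(flat_profile), 3))  # 1.0
```

Grms alone says nothing about where in frequency the energy sits, so two profiles with the same Grms can load a product very differently.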

The vibration loads applied during HALT have a higher Grms than the loads experienced during transportation or during transportation testing. MIL-STD-810G provides a time compression equation that relates two different vibration profiles,

$$\frac{T_2}{T_1} = \left(\frac{S_1}{S_2}\right)^{m} \qquad (1)$$

where T1 and T2 are the test times, S1 and S2 are the corresponding severities in Grms, and m is a value based on the slope of the S-N curve. MIL-STD-810G indicates that a value of m = 7 has been historically used for random environments.
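To give a feel for how strongly the exponent compresses time, the relation can be evaluated numerically. This is a minimal sketch that takes the relation as T2/T1 = (S1/S2)^m, following the definitions above. The function name and the input values are hypothetical and do not come from this test program.

```python
def equivalent_duration(t1_hours: float, s1_grms: float, s2_grms: float, m: float = 7.0) -> float:
    """Return T2, the time at severity S2 equivalent to T1 at severity S1.

    Uses T2 = T1 * (S1 / S2) ** m. Illustrative arithmetic only: the relation
    presumes both profiles have a similar spectral shape.
    """
    return t1_hours * (s1_grms / s2_grms) ** m

# Hypothetical example: 1 hour at 2 Grms expressed as time at 1 Grms
print(equivalent_duration(t1_hours=1.0, s1_grms=2.0, s2_grms=1.0))  # 128.0
```

With m = 7, doubling the severity shortens the equivalent test time by a factor of 128, so small errors in the assumed severity translate into large errors in equivalent time.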

The limitation of this time compression equation is that it assumes the two profiles have similar spectra. The spectra from transportation and HALT could hardly be more different.

[Figure 1 and Table 1: PSD profiles for shipping/transportation]

Vibration experienced during road transportation occurs primarily at low frequencies, whereas HALT vibration is concentrated at much higher frequencies (on the order of 1000 Hz and above). A comparison of the two spectra is shown in Figure 2. The two conditions are therefore not comparable.

[Figure 2: Comparison of transportation and HALT vibration spectra]

Since the relevance of the HALT vibration failures cannot be quantitatively assessed, appropriate corrective actions are determined by weighing prior experience against the cost of implementing each corrective action.

Based on prior experience, 15 Grms is a relatively low loading condition for a failure such as the loosened ground screw to initiate under HALT test conditions. The appropriate corrective action, such as use of a split ring washer, a lock washer, or a threadlocking adhesive, is also a small additional expense. Therefore, this corrective action should be implemented.

[Figure 3]

The failure of the display unit at 40 Grms is believed to be at the material limit of LCDs and is not considered to be a concern. The dislodging of the bleeder resistors at the same vibration load is slightly more complex. The resistors have high standoffs (most likely for thermal reasons) and a large mass, and are mounted relatively close together. In this situation, it comes down to cost-effectiveness. Bleeder resistors with more robust standoffs are available at the same price. Examples of this standoff can be seen in the accompanying image and at www.ohmite.com/cgibin/showpage.cgi?product=tvw_tvm_series

[Figure 4]

An appropriate corrective action is to select this packaging style in place of the axial standoffs currently being used.

Combined Environment Testing

The results from combined vibration and temperature cycling testing are listed below:

  • At +80°C/18 Grms: PCB mounting hardware (screws and standoff) came off and caused an electrical short.
  • At -60°C/30 Grms: Bleeder resistor was observed to have separated from the printed board.

Combined Environment Root-Cause Analysis

The two failures observed under combined testing are the same types of failure observed during vibration step-stress testing. This is in line with DfR’s experience that combined environment testing does not typically induce failure modes different from those found in the first four test conditions.

Combined Environment Corrective Action

None