Space-bound systems use 65nm radiation-hardened FPGA technologies (Xilinx Virtex 5QV) that are nearing end-of-life. Rather than redevelop these systems using the successor FPGAs at 40nm, which offer only a limited improvement in performance, the industry finds it necessary to skip this generation and start performing viability analyses on the 28nm FPGAs instead.
Although these FPGAs are said to be unprecedented in performance, their state-of-the-art 3D packages (see Figure 1) and 28nm feature sizes lack the empirical test data necessary for designers to make critical reliability decisions on whether this new technology is a suitable replacement for the Virtex 5 FPGAs. DfR Solutions will investigate the reliability of the Xilinx -7 Series 28nm products in this qualification activity.
While Xilinx offers three FPGA platforms (Artix, Kintex, and Virtex), it can be assumed¹ that each platform uses the same fabrication technology and core IP block layout and differs predominantly in the quantity of each feature set, i.e. the amount of block RAM or the number of logic cell slices available for programming. These similarities favor a two-temperature stress test that will subject the selected components to conditions suitable for studying semiconductor reliability.
Xilinx, like other FPGA or Programmable Logic Device (PLD) manufacturers, designs their devices to operate and electrically compensate themselves in manners similar to those found in system level prognostics. FPGAs are atypical integrated circuits in that design features exist in them to mitigate the effects of known semiconductor degradation mechanisms: specifically Dielectric Breakdown (DB), Electromigration (EM), Negative Bias Temperature Instability (NBTI) and Hot Carrier Injection (HCI). One such example is a component-level program routine built into the FPGA’s non-user space that toggles CMOS pairs. This has been shown to recover degradation effects of NBTI in some PMOS transistors. In other cases, proprietary design verification rules provide extra margin so that devices are less susceptible to materials breakdown or transistor timing issues resulting from geometric layout imperfections.
Xilinx provides some guidance for its customers on the effects of these mechanisms within their Power Estimator spreadsheet calculator for the “-7” product series. This calculator reports trade-off results with regard to the environment of the system, device package selection, and other life-limiting parameters such as power dissipation. Discussions with Xilinx indicate that while they have a model in place to extrapolate time to failure (TTF) for Hot Carrier Injection, they have not performed adequate device testing, design enhancements or scientific modeling to guarantee customer lifetime expectations. Rather, they suggest: “The chip designers can choose to adjust these mechanisms to meet their design goals. For example, a high-performance microprocessor might, by choice, accept a wear-out time of seven years in order to meet its performance requirements. These effects can be reduced by applying lower voltage and temperature stresses, by using thicker oxide, or by designing the circuits to function within the range of threshold shifts that are expected to occur over 20 years of operation.”
A test group of two (2) Xilinx Artix-7 FPGAs was subjected to each of the two temperature extremes. To drive nucleation of known failure mechanisms, it is necessary to operate the components at minimum and maximum junction temperatures. The hot-side, or maximum operating temperature, test is meant to drive the diffusion-related mechanisms, Dielectric Breakdown and Electromigration. Cold-side testing targets Hot Carrier Injection, which manifests at temperatures colder than those used in High Temperature Operating Life (HTOL) testing.
Two additional samples were needed to identify these temperatures using a step-stress test similar to those found in HALT². These samples are necessary to explore the thermal limitations of the Artix-7 platform, as operating outside datasheet ratings and operating conditions may cause irreversible damage to the product. The product specification is provided in Figure 2.
AC701 evaluation boards for the Artix-7 (which include the XC7A200T-2FBG676C FPGA) were used for testing. Each board has a PCI Express form factor, which allows interaction with and programming of the card while it is installed in a compatible desktop computer. Depending on which computer test system is used, and barring unforeseen software limitations, it is likely that one system can utilize multiple evaluation boards using the PCI, UART or serial bus interfaces. The cards, while under test, operate independently from the computer.
Because FPGAs need to be programmed, common programming syntax was used to stress specific on-die circuitry. The best way to do this is by creating an iterative program routine to retrieve data patterns from memory, perform an operation on them, and later store them back in memory. Related applications of this example are rooted in calculating cryptography keys or generating prime number series. At various steps, the data in memory can be compared to a lookup table (LUT) resident either in the read-only memory on the evaluation card or on the computer, if file system access is permitted by the evaluation board. These program steps exercise not only the memory cells but also specific memory registers found in memory controllers, charge pumps (when writing to flash memory cells) and logic cells when dictated by the program. Maximum size memory addresses were used to fully utilize the FPGA features. DfR Solutions structured programming to allow for an assessment of a range of FPGA operations and their effect on end-of-life behavior.
Program routines were duplicated and run concurrently to ensure sufficient sets of logic cells, blocks or slices within the FPGA were being used and the device was running at maximum capability (voltages, clock speeds and resource utilization). The focus was on creating program routines that subjected the FPGA to stresses, not features residing on the evaluation board.
The 7-Series component classification from Xilinx shows that this FPGA, XC7A200T-2FBG676C, has a commercial grade temperature range:
It is worth noting that the grading distinction most likely comes from a wafer level test which sorts the individual die by speed and an undisclosed performance parameter that has temperature dependence. Therefore, considering variability, guard banding and other process margins, the die has an optimal temperature range between extended and industrial.
The card’s layout (Figure 4) permits minimum modification to its thermal stack up. A chiller and water block style heat sink (Figure 5) were a cost effective solution to expose the FPGA itself and not the test computer to cold-side temperatures. Minimal changes to the evaluation card were necessary, i.e. adding an extension cable to the LCD display. Hot-side temperatures were achieved by treating the chamber environment as an elephant chamber with an overwhelming volume of controlled ambient air temperature.
Both hot side and cold side testing ran concurrently for 3,000 hours (approximately four months). This duration should be sufficient to identify degradation behaviors associated with the three failure mechanisms (DB, EM and HCI). Upon catastrophic failure, some degree of failure analysis should be performed but is outside of the scope of this work. When a reasonable amount of degradation has taken place without failure, confirming the degradation behavior should be done on at least one sample (assuming more than one is exhibiting the behavior). That sample should be power cycled and functionally checked at room temperature before recording its time to failure (TTF). A procedure to perform a functional check is provided in the Failure Criterion section of the Experimental Procedure.
An iterative program routine was created in VHDL to stress the RAM, register and DSP structures on die. The program generates a signal every 100 iterations, and the routine has been benchmarked at 2840 ms per generated signal. A script was compiled on the monitoring computer to listen for the signal from the AC701 card and record timestamps. Anomalous electrical behavior while the DUTs are at temperature, such as clock speed changes or program malfunctions, directly affects signal generation and therefore indicates functional failure.
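The monitoring side of this arrangement can be sketched in a few lines of Python. This is only an illustration, not the script used in the study: the function name `record_signal_times` is hypothetical, the marker byte `$` follows the character signal described later in this report, and an in-memory byte stream stands in for the serial port so the sketch runs without hardware.

```python
import io
import time


def record_signal_times(stream, marker=b"$", clock_ms=None, max_signals=None):
    """Read bytes from `stream`; return one millisecond timestamp per marker byte."""
    if clock_ms is None:
        # Monotonic clock in whole milliseconds (assumed resolution of the log).
        clock_ms = lambda: time.monotonic_ns() // 1_000_000
    stamps_ms = []
    while max_signals is None or len(stamps_ms) < max_signals:
        byte = stream.read(1)
        if not byte:  # end of stream (a real port would report a read timeout)
            break
        if byte == marker:
            stamps_ms.append(clock_ms())
    return stamps_ms


# Stand-in for the serial port: three markers separated by other bytes.
fake_port = io.BytesIO(b"$\r\n$\r\n$")
fake_ticks = iter([0, 2840, 5680])  # hypothetical clock readings in ms
stamps = record_signal_times(fake_port, clock_ms=lambda: next(fake_ticks))
print(stamps)
```

With a real card, `stream` would be whatever object the serial library exposes for byte-wise reads; only the timestamp list is passed on to the later analysis.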
In order to fit the DSP layout design and drivers for input/output (LCD, LED and buses) within the FPGA fabric, some resources were used less than expected. Attempts were made towards full utilization by adding in more RAM addressing and Logic comparisons, but the designs would not compile in the Xilinx development software.
The FPGA was subjected to a maximum and minimum operating temperature while running the VHDL program. These operating limits were defined by stepping up and down, respectively, the temperature within the chamber until the signal generation from the FPGA stopped. Much like the process in defining destructive HALT limits, the temperature extreme was reduced in either case to add a 10°C safety factor for each test.
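The margin step described above amounts to backing each step-stress limit off toward room temperature. A minimal sketch follows, assuming a helper named `derated_setpoint` (hypothetical); the hot-side value is the limit reported for the step-stressed card, while the cold-side value is a placeholder chosen only for illustration.

```python
def derated_setpoint(limit_c, margin_c=10.0, hot_side=True):
    """Back a step-stress limit off toward room temperature by margin_c degrees C."""
    return limit_c - margin_c if hot_side else limit_c + margin_c


hot_setpoint = derated_setpoint(149.0)                   # hot-side limit from the step stress test
cold_setpoint = derated_setpoint(-50.0, hot_side=False)  # hypothetical cold-side limit
print(hot_setpoint, cold_setpoint)
```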
A data logger recorded the die temperature in real time. Temperature was initially ramped at 5°C/min. At 90°C, the ramp rate was decreased to 1°C/min. The destructive limit of the FPGA was identified as 149°C at limited power dissipation (Figure 7).
The cold side setup was constructed using machined components and modifications to the evaluation card to fit it within the water block stack up. Figure 8 shows the evaluation card mounted on its new heat sink. Thermocouples were placed in the thermal compound of the heat sink stack up, on the secondary side of the AC701 card (behind the FPGA), and on an auxiliary water block. Temperature was decreased at a rate of 5°C/min.
To guarantee that the component was exposed to a low enough temperature, a liquid nitrogen chamber was built around the water block setup (Figure 10). The measured die temperature during test was -35°C.
The thermal stack up was defined using Figure 11 - Figure 13. The heat sink, die and card were assessed for optimal heat transfer and leakage paths to determine whether additional measures were necessary to control the temperature exposure during the life test. The junction temperatures for the High Temperature Operating Life (HTOL) and Low Temperature Operating Life (LTOL) tests were calculated using the thermal resistance values at each interface and the power dissipation estimated by the Xilinx development software.
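For a single series heat path, the junction temperature calculation described above reduces to the reference temperature plus the dissipated power multiplied by the summed interface resistances. The sketch below assumes that simple series model; the function name and every numeric value are hypothetical and are not taken from the figures or from the Xilinx power estimate.

```python
def junction_temperature(reference_c, power_w, resistances_c_per_w):
    """T_j = T_ref + P * sum(theta_i) for thermal resistances in series."""
    return reference_c + power_w * sum(resistances_c_per_w)


# Hypothetical stack-up: die-to-case, thermal compound, heat sink-to-ambient (degrees C per W).
stack = [0.5, 1.5, 3.0]
tj_hot = junction_temperature(reference_c=100.0, power_w=2.0, resistances_c_per_w=stack)
print(tj_hot)
```

Parallel leakage paths, which the assessment above also considers, would need a resistance network rather than a plain sum.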
In the event of an FPGA not sending the desired signal, a step-by-step procedure was followed to identify device failure versus system related anomalies. This procedure is defined in Figure 14.
From the VHDL program, we know that the nominal time it takes for a signal to be generated is 2840 ms. This time interval was used to assess the health of the FPGA as it is solely dependent on signal generation. A 1% margin (±28.4 ms) was used as a first-pass filter against the data. The output was then parsed against the nominal 2840 ms to sense speed up (compensation) and slow down (degradation) of the clock signature through the program.
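The first-pass filter can be illustrated as follows. This is a sketch of the idea rather than the spreadsheet logic actually used: the function name, the event labels and the sample timestamps are assumptions, while the 2840 ms nominal interval and the 1% margin come from the text.

```python
NOMINAL_MS = 2840   # benchmarked time per generated signal
MARGIN = 0.01       # 1% first-pass margin, about 28.4 ms


def classify_intervals(stamps_ms, nominal_ms=NOMINAL_MS, margin=MARGIN):
    """Return (timestamp, interval, label) for every interval outside the margin."""
    tolerance = nominal_ms * margin
    events = []
    for previous, current in zip(stamps_ms, stamps_ms[1:]):
        interval = current - previous
        if interval > nominal_ms + tolerance:
            events.append((current, interval, "slow_down"))
        elif interval < nominal_ms - tolerance:
            events.append((current, interval, "speed_up"))
    return events


# Hypothetical receive timestamps in ms: one late signal, then one early signal.
sample = [0, 2840, 5740, 8520, 11360]
flagged = classify_intervals(sample)
print(flagged)
```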
A quick descriptor on signals is necessary to understand the parse. The connected interfaces are FPGA, bus and Laptop:
The signal interfaces touched in this analysis are the FPGA’s connection to the bus and the computer’s hardware and software connections to that same bus. It is assumed that the interfaces within the computer are faster than the bus, such that they do not introduce any latency in the time between received signals. The FPGA connects to the bus through its signal generator as programmed in the code. The UART bus is defined within code and is given a baud, or modulation, rate. The baud is synonymous with “symbols per second” and was set to envelop the millisecond resolution used in the data parse. The baud can be thought of as the number of available seats on an overly efficient train, where a new train is always guaranteed to be ready for boarding directly behind its predecessor, so that no patrons are left stranded on the platform for longer than it takes to fill the train.
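Whether a given baud envelops a millisecond resolution can be checked with a short calculation. The report does not state the baud or the frame format, so the sketch below assumes a common 10-bit UART frame (one start bit, eight data bits, one stop bit) and compares two standard rates purely as examples.

```python
def char_time_ms(baud, bits_per_frame=10):
    """Time to transmit one character in ms, assuming one bit per symbol."""
    return bits_per_frame / baud * 1000.0


def fits_resolution(baud, resolution_ms=1.0, bits_per_frame=10):
    """True if a whole character is transmitted within one resolution step."""
    return char_time_ms(baud, bits_per_frame) < resolution_ms


for rate in (9600, 115200):  # example rates, not taken from the test setup
    print(rate, round(char_time_ms(rate), 4), fits_resolution(rate))
```

Under these assumptions a single character takes slightly longer than 1 ms at 9600 baud and well under 0.1 ms at 115200 baud, so only the faster rate sits below the resolution of the data parse.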
Figure 16 shows a screenshot of the parsing spreadsheet. At times, the output happened before the baud, causing a “speed up” in the time interval (standing room on the train). At other times, the output happened after the baud, causing a “slow down” in the time interval (missed train, boarded subsequent train). These events tended to occur as a “slow down” immediately followed by a “speed up”. Such paired events were identified in the data but later censored from the analysis as they were attributed to the baud. Events that did not have a subsequent pair were assessed further.
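The censoring rule can be expressed as a single pass over the flagged events. The sketch below is one reading of that rule, assuming that a pair means a slow down on one interval immediately followed by a speed up on the next; the function name, labels and sample data are hypothetical.

```python
def censor_pairs(events):
    """Drop each slow_down that is directly followed by a speed_up on the next interval.

    `events` is a list of (interval_index, label) tuples sorted by interval_index.
    """
    unpaired = []
    i = 0
    while i < len(events):
        index, label = events[i]
        is_pair = (
            label == "slow_down"
            and i + 1 < len(events)
            and events[i + 1] == (index + 1, "speed_up")
        )
        if is_pair:
            i += 2  # attributed to the baud: skip both events
        else:
            unpaired.append(events[i])
            i += 1
    return unpaired


# Hypothetical flagged events: intervals 3 and 4 form a pair, the rest stand alone.
flagged_events = [(3, "slow_down"), (4, "speed_up"), (10, "slow_down"),
                  (20, "speed_up"), (21, "slow_down")]
remaining = censor_pairs(flagged_events)
print(remaining)
```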
Any event that did not have a subsequent event pair is provided in Table 4 - Table 7 to show when the event occurred in elapsed milliseconds. The same data is reported in Table 8 and Table 9 as relative elapsed time in hours at which these events were parsed. Table 9 shows the times to failure of the hot side test devices: 2842 hours and 48 hours.
System D failed at 48 hours of testing. The hot chamber was set up using the limits defined during our step stress test. Because of the component grading of the evaluation cards, we had assumed that all the other cards would behave similarly to the step-stressed cards. They did not. 125°C was selected to provide a safe margin below the 140°C destruction limit (which includes a ~10°C HALT margin). Although the stress test card had a working ambient temperature of 149°C, the cards under test lost contact with the recording equipment after a few days. When the temperature was lowered, System C’s card responded around 105°C. The test condition was reinitialized to 100°C.
The anomalous events can be assessed from different analytical standpoints. While no failure analysis was performed on the failed FPGAs, it is clear that failure took place within the FPGA as it would not reinitialize or re-enumerate as a programmable device within the VHDL programming computer.
If we focus on the time domains of the systems within the test setup (laptop, bus and FPGA), we can make some general conclusions. The signal generation from the FPGA is based on its clock, which comes from an oscillator on the evaluation board and/or an oscillation circuit (e.g. a ring oscillator or phase locked loop). Because the spurious signals did not become progressively worse on a consistent basis or increase in frequency before failure, we can assume that the clock within the FPGA is not operating out of specification.
If the timestamp of the clock signal was compared to the timestamp within the laptop or if bidirectional communication was present, then our test system would be truly asynchronous as the laptop is not dependent on the FPGA system and there is no feedback loop in place to confirm receipt of a signal or an acknowledgement that the system is behaving nominally. The randomization of events would then be attributed to convoluted signals between the two.
Our test system can rather be compared to a lighthouse along a trade route. As long as the lighthouse bulb emits light bright enough to be visible to a passing ship, it is considered to be doing its job nominally. In the event of a bulb failure, the lighthouse fails to emit light. In either of these cases, a passing ship’s captain will be looking at the horizon for that light, but neither he nor the ship’s heading will affect the lighthouse light. The FPGA, independent of its surroundings much like the lighthouse, generates a simple character signal ($) after it completes some mathematics. The laptop acts as the ship’s captain, constantly on the lookout for the light. When “light is seen”, it merely records the event on an arbitrary timeline. Independent of the signal generation, the duration of time between events is controlled by the FPGA, not the laptop.
It is unlikely that metastability within this otherwise asynchronous system caused the anomalous events recorded in this test.
By definition, we were able to record anomalous behavior from an otherwise stable system. Each random event was an outlier to the expected behavior and existed throughout the test’s duration. If the events were front loaded, then the instability would be attributed to some type of settling-in while taking place during the infant mortality region of the useful life. Infant mortality behavior could be screened out through an environmental stress screen similar to this test. If the events were back loaded, then the instability would be attributed to wear out of the FPGA materials and its electrical design.
Time to failure was recorded for both of the hot side test systems. It is unclear whether failure would occur on the cold side test systems given a longer test duration, as no definitive trending behavior was identified in the systems other than a relative rate of events (slope of events per time interval, Figure 17). Without failure analysis, we do not know what element within the FPGA failed.
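A relative rate of events can be estimated as the least-squares slope of the cumulative event count against elapsed time. The sketch below shows that calculation on made-up event times; it is not the analysis behind Figure 17, and the function name and data are hypothetical.

```python
def event_rate_per_hour(event_hours):
    """Least-squares slope of cumulative event count versus elapsed hours."""
    n = len(event_hours)
    if n < 2:
        raise ValueError("need at least two events to fit a slope")
    counts = list(range(1, n + 1))  # cumulative count at each event
    mean_x = sum(event_hours) / n
    mean_y = sum(counts) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(event_hours, counts))
    sxx = sum((x - mean_x) ** 2 for x in event_hours)
    return sxy / sxx


# Hypothetical unpaired events at evenly spaced elapsed hours.
hours = [100.0, 200.0, 300.0, 400.0]
rate = event_rate_per_hour(hours)
print(rate)
```

A roughly constant slope over the whole test, as opposed to a slope that rises toward the end, is the kind of distinction the front-loaded versus back-loaded discussion above relies on.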
We can conclude that things do in fact seem to be slowing down on the FPGA. While these events were corrected on subsequent signal generation iterations by the FPGA, their very nature should raise questions as to whether critical mission capabilities could be affected by such an event, even in a system with triple modular redundancy (TMR). We can speculate that real degradation is occurring, but also that compensation and/or some type of error correction is taking place, since the FPGA recovers back to normal (the laptop receives the expected signal after the 2840 ms interval). This behavior is similar to the accumulation of defects that takes place in dielectric breakdown, but rather than a breakdown event causing a failure, we are seeing a breakdown event followed by recovery. Moreover, the negative bias temperature instability (NBTI) mechanism causes degradation with similar behavior, as seen in previous IC wearout work performed at DfR Solutions³. NBTI occurs in the gate oxide region and along its interface, where electrons and holes become trapped and break Si-H interface bonds. NBTI differs from other failure mechanisms because it has a recovery process. This can reduce the rate of degradation to the point where the amount of NBTI recovery equals the amount of degradation. When CMOS pairs are toggled, NBTI can be forced to recover by re-passivating those bonds. If we are truly seeing NBTI degradation, then an additional assessment of its effects on a more complex programming routine is necessary to determine how detrimental the slowdown effect is.
The research study by DfR Solutions was made possible by funding from the NASA Electronic Parts and Packaging (NEPP) Program and NASA Jet Propulsion Laboratory (JPL).