While the primary focus on environmental challenges in electronics has been the requirements derived from the RoHS and REACH legislation, several other ‘greening’ trends have serious implications for the design, manufacturing, and qualification of electronic components, products, and systems. The most consequential is the concept described as Free Air Cooling, or Air Economization, for the ubiquitous data centers that power the Internet.
While servers within data centers have become progressively cheaper and more powerful (thanks to Moore’s Law and low-cost manufacturing), they continue to output a significant amount of heat. The energy costs to remove this heat from the server racks have continued to rise to the point that the cost of the electricity required for cooling exceeds the purchase price of the server after only four years. And utility rates will continue to rise for the foreseeable future.
Combine the profit motive with the desire of companies such as Google, Facebook, and Microsoft to reduce their carbon footprint, and you have a significant push to find ways to cool these immense data centers with less energy. The solution, according to the industry, is Free Air Cooling. Free Air Cooling bypasses the traditional air conditioner by pulling in cold air from the outside, cycling it through the server racks, and flushing it back outside.
The advantages of Free Air Cooling seem obvious and analogous to opening the windows of your home on a nice spring day. Several recent reports, including one by McKinsey & Company, strongly encourage all new data centers to adopt this cooling strategy, and most of the major players in the data center supply chain and utilization, such as Sun Microsystems, are moving forward with implementation [1]. A calculator offered by Green Grid estimates annual savings from $20,000 to $200,000 (Intel believes it could be as high as $2 million). With the obvious monetary and publicity benefits, what’s not to like?
Despite the benefits, there is no such thing as a free lunch. Several challenges are involved when attempting to ensure adequate reliability in this semi-controlled environment. They include temperature variations, elevated temperature, humidity variations, and exposure to corrosive gases.
During a proof of concept study performed by Intel, the average diurnal temperature variation ranged from 24F to 31F (13C to 17C) [2]. Over a seven-year period (the nominal lifetime for a server product), this would result in an additional 2555 temperature cycles. If these cycles are a worst-case 20C, this would be equivalent to an additional 100 cycles of 0 to 100C product qualification testing [3].
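The equivalence above can be roughly reproduced with a Coffin-Manson style scaling, in which the damage per thermal cycle grows as a power of the temperature swing. The sketch below is an illustration only: the exponent of 2 is an assumption, chosen because it recovers a figure close to the 100 cycles quoted above, and is not a value taken from the cited reference. The function name is hypothetical.

```python
def equivalent_test_cycles(field_cycles, field_delta_t_c, test_delta_t_c, exponent=2.0):
    """Convert field thermal cycles into an equivalent number of test cycles.

    Uses a Coffin-Manson style power law: damage per cycle ~ (delta T) ** exponent.
    The default exponent of 2.0 is an assumption, not a value from the source.
    """
    if field_delta_t_c <= 0 or test_delta_t_c <= 0:
        raise ValueError("temperature swings must be positive")
    return field_cycles * (field_delta_t_c / test_delta_t_c) ** exponent


# One diurnal cycle per day over a seven-year server lifetime
field_cycles = 7 * 365  # 2555

# Worst-case 20C daily swing, compared against 0 to 100C qualification cycling
print(round(equivalent_test_cycles(field_cycles, 20.0, 100.0)))  # prints 102
```

Solder joint and package fatigue exponents vary with material and geometry, so the equivalent cycle count is sensitive to this one parameter and should be treated as an order-of-magnitude estimate.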
In addition to Free Air Cooling, several data centers have become much more aggressive in turning off servers when not in use (often initiated through automated power strips [http://www.cpscom.com/gprod/cps.htm]). One enterprise OEM has recently indicated that they expect the number of power on / power off cycles to increase from 30 to over 1000 over the lifetime of the product.
All these additional thermal cycles could be very detrimental to long-term reliability, especially given the continued trend towards less robust semiconductor packaging (QFN, LGA, stacked die, etc.) and the tendency for some server OEMs to skip temperature cycling during product qualification due to the extended time requirements.
Server manufacturers typically provide some latitude when specifying the maximum ambient temperature and minimum air flow. While these limits are currently based on operational considerations, they will increasingly have to account for the influence of elevated temperature on long-term reliability. As detailed in a recent IEEE Spectrum article, the ability of current and future generations of integrated circuits to operate reliably over the desired lifetime is increasingly being called into question.
These concerns are likely to increase with the introduction of more aggressive data center temperatures (interesting side note: several RBOCs have already initiated a similar activity by moving equipment from the central office to remote locations and increasing maximum temperature requirements). In response to these risks, DfR has developed a reliability prediction tool for integrated circuits using known wearout mechanisms. This ‘virtual qualification’ process will be an important part of overall product qualification if Free Air Cooling increases in popularity.
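The prediction tool itself is not described here. As a generic illustration of why elevated temperature matters for wearout, the sketch below uses the standard Arrhenius relationship to estimate how much faster a thermally activated mechanism proceeds at a higher temperature. The 0.7 eV activation energy and the 25C and 35C temperatures are assumptions for illustration, the function name is hypothetical, and air temperature is used as a stand-in for the actual device temperature.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration(temp_low_c, temp_high_c, activation_energy_ev):
    """Ratio of the wearout rate at temp_high_c to the rate at temp_low_c."""
    t_low_k = temp_low_c + 273.15
    t_high_k = temp_high_c + 273.15
    exponent = activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t_low_k - 1.0 / t_high_k)
    return math.exp(exponent)


# Hypothetical case: temperature raised from 25C to 35C,
# with an assumed activation energy of 0.7 eV
factor = arrhenius_acceleration(25.0, 35.0, 0.7)
print(f"wearout rate multiplier: {factor:.2f}")
```

Each wearout mechanism has its own activation energy, and some depend on voltage or current as well as temperature, so a single multiplier like this is only a first-pass screen.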
Conditioned air is not only cooler; it is also drier. Typical humidity levels in data centers are often maintained between 40 and 60%RH. This controlled environment provides very effective protection against a number of corrosive failure mechanisms, such as electrochemical migration and conductive anodic filament (CAF) formation. The introduction of free air from the outside could potentially result in significant swings in ambient humidity. As shown in the chart on the right, a relatively dry climate such as western Oregon can still introduce humidity levels as high as 80%RH.
There are several ways to resolve this potential issue. The first is to assess the relative humidity at the board level. With board temperatures, due to power dissipation, up to 20C higher than ambient, the relative humidity at the board can plummet to single digits even when the humidity outside is elevated. If there is still some risk, design rules (such as minimum spacings and maximum electric fields) and manufacturing processes (such as cleaning) can provide additional robustness against humidity-driven failure mechanisms. If neither of these mitigations is available, the OEM will need to specify a maximum allowable humidity.
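The board-level assessment can be sketched with the Magnus approximation for saturation vapor pressure, assuming the amount of water vapor in the air stays the same as the air warms near the board. The inlet condition in the example is hypothetical, and the function names are illustrative rather than part of any established tool.

```python
import math


def saturation_vapor_pressure_hpa(temp_c):
    """Saturation vapor pressure over water in hPa (Magnus approximation)."""
    return 6.1094 * math.exp(17.625 * temp_c / (temp_c + 243.04))


def board_relative_humidity(ambient_rh_pct, ambient_temp_c, board_rise_c):
    """Relative humidity at a surface warmer than the surrounding air.

    Assumes the water vapor content of the air is unchanged, so only the
    saturation pressure changes with temperature.
    """
    vapor_pressure = ambient_rh_pct / 100.0 * saturation_vapor_pressure_hpa(ambient_temp_c)
    board_temp_c = ambient_temp_c + board_rise_c
    return 100.0 * vapor_pressure / saturation_vapor_pressure_hpa(board_temp_c)


# Hypothetical inlet condition: 25C air at 80%RH, board running 20C above ambient
print(f"{board_relative_humidity(80.0, 25.0, 20.0):.0f}%RH at the board")
```

The result depends strongly on the inlet temperature and the actual temperature rise, and the protection disappears whenever the equipment is powered down and the board cools to ambient.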
As seen on the map to the right, limiting free air to only those days with humidity levels below 60% greatly constrains the geographic area and the number of hours available for free air cooling.
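For a specific site, the effect of such a humidity limit can be estimated from hourly weather records. The sketch below counts the fraction of hours that fall under a chosen limit. The readings are made up for illustration and are not measured data, and the function name is hypothetical.

```python
def free_cooling_fraction(hourly_rh_pct, rh_limit_pct=60.0):
    """Fraction of hours whose outside relative humidity is below the limit."""
    if not hourly_rh_pct:
        raise ValueError("need at least one humidity reading")
    usable = sum(1 for rh in hourly_rh_pct if rh < rh_limit_pct)
    return usable / len(hourly_rh_pct)


# Made-up readings for illustration only, not measured data
sample_readings = [85, 82, 78, 70, 61, 55, 48, 45, 47, 52, 59, 66]
print(f"{free_cooling_fraction(sample_readings):.0%} of sampled hours are usable")
```

A real assessment would use at least a full year of hourly data and would also apply the temperature limit, since an hour is only usable when both conditions are met.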
With the new air from the outside also come various reactive gases. These gases attack metals on the circuit boards and components and can cause failures. Humidity in combination with the gases plays an important role. At 60% relative humidity, a layer of moisture 2-4 molecules thick will form on most surfaces. This layer of moisture absorbs the reactive gases from the air, and its pH can drop to as low as 2 (highly acidic). The protective oxides on the metals can then be compromised, exposing fresh metal to corrosive attack.
At 80% RH, the moisture layer is 5-20 molecules thick and electrolytes move freely on the surface, further accelerating the corrosion. The table to the right shows typical levels of various pollutants in the outside air in the United States [1]. The values can be many times higher if the data center is located near an industrial site. Furthermore, pollutant levels in other parts of the world, such as China, India, and large cities in Latin America, can be many times higher.
The recent transition of products to Pb-free has resulted in a significant rise in electronic product failures from creep corrosion [2,3]. Such failures can occur in as little as 30 days and particularly plague electronics located near industries with high sulfur levels. These include water treatment plants, petrochemical facilities, fertilizer companies, mining, rubber manufacturing, and so forth.
The dominant protective surface finish on printed circuit boards was formerly SnPb HASL (hot air solder leveling). Most companies replaced this finish with immersion silver or organic solderability preservative (OSP). These finishes provide much less protection and, in the case of silver, can actually accelerate creep corrosion. As server and storage products transition to Pb-free (RoHS mandates that all such products must be Pb-free by 2014), these high-value critical systems will become more susceptible to corrosion. In combination with the free air flow trend, one can expect a much higher failure rate from corrosion in data centers in the future.
Even though Intel’s initial study found minimal differences in failure rates between servers exposed to conditioned air and free air, the time period (10 months) was likely insufficient for truly understanding the potential risks. The ramifications of insufficient preparation for this change in customer environment are significant. Design flaws, as opposed to quality defects, can induce enormous failure rates (>50%), as every server within a particular data center, or even every data center with the same server, would be susceptible to the same wearout mechanism. Given the increasing dependence on data centers for real-time communication and data storage (like all of DfR’s sales information), even short-term access issues could be catastrophic for many organizations.
DfR represents that a reasonable effort has been made to ensure the accuracy and reliability of the information within this report. However, DfR Solutions makes no warranty, both express and implied, concerning the content of this report, including, but not limited to the existence of any latent or patent defects, merchantability, and/or fitness for a particular use. DfR will not be liable for loss of use, revenue, profit, or any special, incidental, or consequential damages arising out of, connected with, or resulting from, the information presented within this report.
The information contained in this document is considered to be the property of DfR Solutions. Dissemination of this information, in whole or in part, without the prior written authorization of DfR Solutions, is strictly prohibited.