The heat is on

Editorial Type: Technology Focus Date: 2022-09-29 Tags: Storage, HDD, Data Centre, Hard Drives, Hardware, Management, Toshiba
Rainer W. Kaese of Toshiba Electronics Europe explains the importance of HDD temperature control

The internal temperature of a hard drive is an important factor not only in its correct functioning but also in the reliability and lifetime of a storage system. The mechanical components of the hard disk drive, especially the fluid-dynamic bearing of the platter-stack spindle, wear out faster at higher temperatures because the bearing oil escapes at a faster rate.

So, controlling the temperature of hard drives in storage systems is necessary to ensure optimal functionality and high reliability.

HOW TO MEASURE HDD TEMPERATURE
Modern hard drives have a built-in temperature sensor. Its value is exposed as a S.M.A.R.T. attribute and can be read by tools within the operating system or by the management tools of the host bus adapter or RAID controller.

The Open-E JovianDSS GUI provides access to the Broadcom and Microchip-Adaptec management tools:

  • In the case of Microchip-Adaptec: :8443, login with aac, password raid
  • In the case of Broadcom: :9000, login with root, password admin
Expanding the controller, enclosure, and port views lists the connected hard disk drives and displays their temperatures.

On the operating system level, S.M.A.R.T. values can be read out with "smartmontools", a common free, open-source tool available for Windows and Linux.
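As a minimal sketch of such a readout, the following Python snippet calls smartctl (the command-line tool shipped with smartmontools) with JSON output and extracts the current temperature. It assumes smartmontools 7.0 or later (which introduced JSON output), sufficient privileges to query the drive, and a drive that reports a temperature; the function names are illustrative only.

```python
import json
import subprocess
from typing import Optional


def parse_temperature(smartctl_json: str) -> Optional[int]:
    """Extract the current temperature in °C from `smartctl --json` output.

    Returns None if the output is empty, is not valid JSON, or carries
    no temperature reading.
    """
    try:
        data = json.loads(smartctl_json)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    temperature = data.get("temperature")
    if isinstance(temperature, dict):
        return temperature.get("current")
    return None


def read_temperature(device: str) -> Optional[int]:
    """Query one drive via smartctl and return its temperature in °C."""
    # smartctl's exit status is a bit mask, so a non-zero code is not
    # treated as fatal here; only the printed JSON is inspected.
    result = subprocess.run(
        ["smartctl", "--json", "-A", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_temperature(result.stdout)
```

A call such as read_temperature("/dev/sda") (a hypothetical device path) would return the temperature as an integer, or None if no reading is available.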

HOW HOT IS TOO HOT?
The manufacturer specifies the correct functionality of a hard disk drive within a certain temperature range. For Enterprise HDDs, controlled data centre-type cooling is assumed, so they are usually specified to operate from 5 to 60° C, at a maximum ambient temperature of 55° C. NAS drives are specified from 5 to 65° C and surveillance-specific models from 0 to 70° C (because surveillance systems may operate in less stable environments).

TEMPERATURE & RELIABILITY
The average hard drive temperature has a direct impact on its reliability. The reliability of a hard drive, measured in Mean Time To Failure (MTTF), will only be achieved if the average hard drive temperature stays below 40° C. 'Average' here means that periods with more than 40° C will need to be compensated by periods with less than 40° C. In data centre environments with active cooling, the user should ensure that the temperature of 40° C is never exceeded.
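To illustrate how warmer and cooler periods offset each other, here is a small sketch of a time-weighted average. The sample day (8 hours at 44° C, 16 hours at 37° C) is hypothetical, as are the names used; only the 40° C limit comes from the text.

```python
from typing import Iterable, Tuple

MTTF_AVERAGE_LIMIT_C = 40.0  # average-temperature condition from the text


def time_weighted_average(periods: Iterable[Tuple[float, float]]) -> float:
    """Average temperature over (duration_hours, temperature_c) periods."""
    periods = list(periods)
    total_hours = sum(hours for hours, _ in periods)
    if total_hours <= 0:
        raise ValueError("total duration must be positive")
    return sum(hours * temp for hours, temp in periods) / total_hours


# Hypothetical day: 8 h under load at 44 °C, 16 h of lighter use at 37 °C.
day = [(8, 44.0), (16, 37.0)]
average = time_weighted_average(day)
within_limit = average <= MTTF_AVERAGE_LIMIT_C
```

In this example the eight warm hours are more than compensated by the sixteen cooler ones, so the daily average stays just under the limit.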

A typical enterprise hard drive is rated with an MTTF of 2 million hours. MTTF can be translated into an "annualised failure rate" (AFR) using the following formula, where 8760 is the number of hours in a year:

  AFR = 1 - exp(-8760 h / MTTF)

This exponential formula accounts for the fact that drives which have already failed no longer count towards the population when calculating the failure rate for the remaining drives. With low failure rates, the formula can be simplified as follows:

  AFR ≈ 8760 h / MTTF

An MTTF of 2 million hours would result in an AFR of 0.438 %. So, out of 1000 drives in operation, 4-5 would be expected to fail during a year of operation.
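The conversion can be checked with a few lines of Python. This sketch assumes the standard constant-failure-rate model, AFR = 1 - exp(-8760 / MTTF), and its low-rate simplification AFR ≈ 8760 / MTTF, with 8760 being the number of hours in a year; the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def afr_exact(mttf_hours: float) -> float:
    """AFR under the exponential (constant failure rate) model."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


def afr_approx(mttf_hours: float) -> float:
    """Simplified AFR, valid when the failure rate is low."""
    return HOURS_PER_YEAR / mttf_hours


mttf = 2_000_000  # 2 million hours, as in the text
expected_failures_per_1000 = 1000 * afr_approx(mttf)
```

For an MTTF of 2 million hours the two variants differ only from the third significant digit onwards, which is why the simplified form is adequate at enterprise-drive failure rates.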

But this MTTF/AFR commitment of the HDD manufacturer is restricted to certain conditions, which are:

  • Valid only within the warranty period (typically 5 years),
  • Valid only when not exceeding the rated workload (less than 550 TB/year of combined reading and writing),
  • Valid only when not exceeding the maximum power on hours per period (only for non-24/7 models),
  • Valid only when not exceeding the average temperature of 40° C.
If any of these conditions are violated, the hard drive will not necessarily fail straight away. In particular, the workload limit is not meant in the same way as SSD write endurance (where a violation would always result in a failure of a component). For a hard drive operating beyond these limits, the AFR would slowly increase, depending on the level of the violation.

The same applies when a hard drive is used beyond its 5 years of warranted service lifetime: mass failure should not necessarily be expected, but over time, yearly failure rates above the level of 4-5 drives out of 1000 could be experienced.

After exceeding an average temperature of 40° C, the AFR of 0.438% increases. As a rough indication, each 5° C over 40° C could increase the failure rate by around 30%. At 55° C continual or sustained hard drive temperature, the failure rate is expected to double.
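The rule of thumb can be turned into a rough estimator. The sketch below assumes that the 30% increase compounds for every 5° C step above 40° C; the text does not say whether the increase is compounding or linear, so this is an interpretation, and the function name is illustrative. Under this assumption the factor at 55° C is about 2.2, which matches the stated "expected to double" only approximately.

```python
BASE_AFR = 0.00438        # 0.438 % at an average of 40 °C or below (from the text)
REFERENCE_TEMP_C = 40.0   # average-temperature limit (from the text)
STEP_C = 5.0              # step size of the rule of thumb (from the text)
INCREASE_PER_STEP = 0.30  # "around 30 %" per step (from the text)


def estimated_afr(avg_temp_c: float, base_afr: float = BASE_AFR) -> float:
    """Rough AFR estimate for a given average drive temperature.

    Assumption: the increase compounds per 5 °C step above 40 °C.
    """
    excess = max(0.0, avg_temp_c - REFERENCE_TEMP_C)
    return base_afr * (1.0 + INCREASE_PER_STEP) ** (excess / STEP_C)


factor_at_55 = estimated_afr(55.0) / BASE_AFR
```

This is an indication only, not a manufacturer model: it is useful for comparing thermal scenarios, not for predicting failures of individual drives.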

Keeping the average hard drive temperature below 40° C shouldn't be a problem in any well-designed thermal system with appropriate airflow. However, a system without proper airflow and a set of fans will not meet the thermal requirements of enterprise drives in continuous 24/7 operation. In systems operated at room temperature, high room temperatures of more than 30° C may result in hard drive temperatures of over 40° C, but these could be offset by periods of lower temperatures, as previously explained.

As for installation in data centres, thermally well-designed servers and JBODs keep the hard drive at a maximum of 10-15° C over air inlet temperature. So, air inlet temperatures lower than 20° C will support proper hard drive operations, even if the drive is stacked in a rear position of a large 4U top-loader JBOD or server.

Where the HDD temperature stays more than 15° C above the ambient/air-inlet temperature for a sustained period, something is fundamentally wrong in the thermal design of the system. You should check the airflow/fan operation and analyse the system for potential blockages of the airflow.
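The check described here can be sketched as a simple comparison of drive and inlet temperature. The 15° C threshold is taken from the text; the function names and the sample readings are hypothetical, and a real check would look at sustained readings rather than a single sample.

```python
MAX_EXPECTED_DELTA_C = 15.0  # upper end of the 10-15 °C range from the text


def thermal_delta(drive_temp_c: float, inlet_temp_c: float) -> float:
    """Temperature rise of the drive over the air inlet, in °C."""
    return drive_temp_c - inlet_temp_c


def airflow_needs_checking(drive_temp_c: float, inlet_temp_c: float) -> bool:
    """True if the drive runs more than 15 °C above the inlet temperature."""
    return thermal_delta(drive_temp_c, inlet_temp_c) > MAX_EXPECTED_DELTA_C


# Hypothetical readings: inlet at 20 °C, one drive at 33 °C, another at 41 °C.
ok_drive = airflow_needs_checking(33.0, 20.0)
hot_drive = airflow_needs_checking(41.0, 20.0)
```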

OBSERVE AND REACT
For correct functioning and the highest possible reliability, it is important to observe the temperature of the hard disk drive in operation. Maximum temperatures of 60° C and more should be avoided at all costs, and the average temperature should ideally not exceed 40° C.
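As an illustration of the two limits, the following sketch evaluates a series of temperature readings against them. It assumes the readings were taken at equal intervals, so a plain mean is a fair average; the finding codes and the function name are made up for this example.

```python
from statistics import mean
from typing import List, Sequence

MAX_TEMP_C = 60.0       # "60 °C and more should be avoided" (from the text)
AVERAGE_LIMIT_C = 40.0  # average-temperature limit (from the text)


def assess(readings_c: Sequence[float]) -> List[str]:
    """Return finding codes for equally spaced temperature readings."""
    if not readings_c:
        raise ValueError("at least one reading is required")
    findings = []
    if max(readings_c) >= MAX_TEMP_C:
        findings.append("MAX_EXCEEDED")
    if mean(readings_c) > AVERAGE_LIMIT_C:
        findings.append("AVERAGE_EXCEEDED")
    return findings
```

An empty result means both limits were respected over the observed period; any finding is a prompt to check cooling and airflow.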

More info: toshiba.semicon-storage.com/eu/storage