Revealing the hidden costs of HPC

Editorial Type: Feature | Date: 2020-06-01 | Tags: Storage, Research, HPC, Strategy, Management, Data Centre, Panasas
New research into buyers of storage in HPC environments reveals that TCO is now considered almost as crucial as performance itself

Total cost of ownership (TCO) now rivals performance as a top criterion for purchasing high-performance computing (HPC) storage systems, according to an independent study published by Hyperion Research.

While performance still ranked first (57%), TCO tied with purchase price at 37% as the second most important consideration cited by users. This points to an important shift, as HPC storage buyers have historically given less weight to ongoing operating costs, particularly the millions of dollars lost to downtime. Almost half of the surveyed respondents experience storage system failures once a month or more, with some outages lasting as long as a week. A single day of downtime can cost from US$100,000 to more than US$1 million.

The report, commissioned by Panasas, surveyed data centre planners and managers, storage system managers, purchasing decision-makers and key influencers, as well as users of HPC storage systems. Hyperion surveyed organisations with annual revenues from less than US$5 million to more than US$10 billion.

“A clear implication of this study is that to compete effectively, storage vendors need to deliver value far beyond the initial purchase price,” said Steve Conway, senior advisor, HPC market dynamics at Hyperion Research. “They must pay attention to the full range of buyer considerations, including reliability, cost of management, responsive support and uninterrupted application user productivity.”

Among the salient points covered by the research findings were the following:

Growth Drivers. The largest factors driving the growth of HPC storage capacity were iterative simulation workloads and new workloads such as AI and other Big Data jobs (see figure 1).

Growth Inhibitors. The most often-named challenge for HPC storage operations was recruiting and hiring qualified staff, followed by the time and cost needed to tune and optimise the storage systems.

Total Cost of Ownership. Deliberately presented without a definition, TCO emerged as the second-most-important of all purchasing criteria for the surveyed group of HPC storage buyers, tied with "price" and trailing only "performance." Though the definitions respondents had in mind presumably differed in some particulars, as a group these buyers endorsed the importance of TCO.

Downtime. Almost half of the surveyed sites experience storage system failures once a month or more frequently (see figure 2). Downtimes range from less than one day to more than a week. Supporting high productivity for users of HPC servers (scientists, researchers, analysts and engineering staff) is of paramount importance to data centre managers and other senior officials at HPC sites.

In some industries, a day of downtime can cost the organisation more than US$1 million in lost revenue. Lack of storage system resiliency in the face of failures and changing requirements has been an ongoing issue for some file systems. Resolving customer problems quickly is particularly challenging when there are multiple layers in the customer support chain.

Satisfaction and Loyalty. Although a large majority (82%) of respondents were relatively satisfied with their current HPC storage vendors, a substantial minority said they are likely to switch storage vendors the next time they upgrade their primary HPC system. The implication here is that a fair number of HPC storage buyers are scrutinising vendors for competencies as well as price.

Storage Source. HPC buyers as a group have grown sophisticated enough about storage to pay more attention to the product than to who sells it to them. The study showed that most buyers sometimes purchase storage together with the HPC system it will support and at other times buy it separately. Many buyers don't care whether the storage system is sold by a dedicated storage vendor or a system vendor intermediary. It's the product and the support staff that count most.

“HPC storage buyers have come to expect downtime as the norm in HPC storage, trading off the lowest cost of acquisition for the inevitable headaches and lost productivity caused by system downtime,” commented Faye Pairman, president and CEO at Panasas. “As a result, HPC storage vendors skimp on the development expenses associated with reliability, manageability and support; something we don’t do at Panasas. With the release of PanFS 8, we go beyond delivering the lowest cost of ownership that we are known for by offering our high-performance file system on commodity hardware to provide the lowest cost of acquisition as well – making the buying decision easy.”

TCO defined

TCO is a term used in various ways in the HPC community, so it was deliberately presented to respondents of this study without a definition. This had the advantage of enabling the respondents to apply their own definitions. When they did, TCO emerged as one of the top purchasing criteria of the surveyed sites, tied in importance with "price" and second only to the "performance" of the HPC storage systems under consideration.

HPC storage systems have become significantly more important in the current era of digital transformation and high-performance data analysis, including AI methods such as machine learning and deep learning. To meet emerging requirements for what the U.S. Department of Energy calls "extreme heterogeneity" (the convergence of simulation and analytics, traditional and enterprise environments, and interoperation with cloud infrastructures), HPC storage systems, like other parts of the HPC ecosystem, have in many cases become more complex and more challenging to manage.

As the study shows, HPC storage systems are subject to downtimes that can increase costs while lowering productivity, and finding qualified job candidates to help manage HPC storage systems can be a major challenge. These trends are likely to continue.

In addition to the rising importance of TCO, the survey findings also challenge the accepted HPC storage narrative that cost-effective performance necessitates complexity and unreliability. Consider the following key findings:

Recruiting and hiring qualified staff, followed by the time and cost needed to tune and optimise the storage systems, were the two most frequently named challenges for HPC storage operations. These findings go hand in hand with high levels of downtime.

More than three-quarters of respondents experienced reduced productivity in the past year due to storage issues. One in eight sites experienced this more than 10 times in the past 12 months.

Asked how long it takes to recover from a storage system failure, 40% of HPC sites said they typically need more than two days to restore their storage system to full functionality.
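To see how figures like these feed into TCO, a back-of-the-envelope calculation can help. The sketch below is illustrative only: the purchase prices, operating costs and the second option's failure rate are hypothetical, and combining a monthly failure rate, a two-day recovery and a US$100,000 daily cost in a single site is an assumption, not a result from the report. The names `StorageOption`, `annual_downtime_cost` and `tco` are likewise invented for this example.

```python
from dataclasses import dataclass


@dataclass
class StorageOption:
    """Hypothetical inputs for a simple TCO comparison (not from the report)."""
    name: str
    purchase_price: float         # US$, one-off acquisition cost
    annual_operating_cost: float  # US$ per year: staff, tuning, support
    failures_per_year: float      # storage failures that cause downtime
    days_down_per_failure: float  # average recovery time in days
    cost_per_day_down: float      # US$ lost per day of downtime


def annual_downtime_cost(opt: StorageOption) -> float:
    # Failures per year x days lost per failure x cost of each lost day.
    return opt.failures_per_year * opt.days_down_per_failure * opt.cost_per_day_down


def tco(opt: StorageOption, years: int) -> float:
    # Purchase price plus recurring operating and downtime costs over the period.
    return opt.purchase_price + years * (
        opt.annual_operating_cost + annual_downtime_cost(opt)
    )


# Assumed scenario: monthly failures, two days to recover, US$100,000 per day.
cheap = StorageOption("lower purchase price", 1_000_000, 200_000, 12, 2, 100_000)
# Assumed scenario: dearer system that fails twice a year and recovers in a day.
resilient = StorageOption("higher purchase price", 1_500_000, 150_000, 2, 1, 100_000)

for opt in (cheap, resilient):
    print(
        f"{opt.name}: downtime per year US${annual_downtime_cost(opt):,.0f}, "
        f"5-year TCO US${tco(opt, 5):,.0f}"
    )
```

Under these assumed inputs, downtime alone costs the cheaper system US$2.4 million a year, and its five-year TCO comes to US$14 million against US$3.25 million for the dearer one. Real comparisons would need each site's own failure rates and cost per day.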

With all of these factors in mind, the research concludes by advising HPC sites to evaluate a wide range of HPC storage vendors before making a purchase decision. There are important differences in the vendors' products, strategies and support, so a wider search could pay large TCO dividends.

The full report, “The Importance of TCO for HPC Storage Buyers”, is available for download.
