Is Your Storage Highly Available, Or Simply Fault-Tolerant? – Part 1

The storage landscape has seen significant advancements in recent years. At the time of this writing we have witnessed the announcement of 6TB drives, PCIe SSDs that exceed several hundred thousand IOPS from a single device, storage devices that plug into the DIMM slots of a server, and the term ‘petabyte’ fast becoming a word used at the evening dinner table. But one critical aspect of storage must be addressed regardless of the type of storage media deployed, regardless of how fast the storage is, and regardless of how it is connected: data availability. Ask yourself, “How much of my infrastructure would be necessary if all the data disappeared?” I think most would say, “None of it.” In Part 1 of this series we will explore what high-availability is and how it differs from fault-tolerance. In Part 2 we will explore how true high-availability is achieved and what additional benefits are realized through a highly-available infrastructure.

DATA: The Most Important Asset
The bottom line with any data processing system is the data. Without the data, what is the point of having the infrastructure in the first place? In a typical enterprise environment, a great deal of effort, time, and money is spent ensuring that applications and users always have access to the data. If the data is that important, then why do most still rely on a single frame of storage hardware to house the most important asset in the data center? Is it possible that many have mistaken fault-tolerance for high-availability?

The concept of fault-tolerance involves a system’s ability to survive expected failures while maintaining data accessibility. The most common storage hardware failures are controller failures (for hardware or software reasons), hard drive failures (for hardware or software reasons), and power supply failures (for hardware or facility-related reasons). Most hardware vendors offer fault-tolerance in the form of component-level redundancy. This redundancy mitigates failures within the frame, reducing the odds of data loss and/or loss of data accessibility when a simple failure occurs. But what happens when the failure is more severe, such as multiple drive failures within a RAID set, multiple controller failures due to a bug in the firmware/software, or a power loss affecting the entire storage frame (a data-center-wide outage)? Not only could the data become inaccessible, it could be damaged as a result of the failure.
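To put rough numbers on the difference, the standard availability estimate is A = MTBF / (MTBF + MTTR). The sketch below uses purely illustrative figures (not vendor data) to show how the math changes when the same data lives on two independent subsystems, either of which can serve it, versus a single fault-tolerant frame:

```python
# Rough availability arithmetic. The MTBF/MTTR numbers below are
# illustrative assumptions, not measurements from any real product.

def availability(mtbf_hours, mttr_hours):
    """Fraction of time a subsystem is up: A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical single storage frame: one outage every 50,000 hours,
# 8 hours to restore service.
single = availability(50_000, 8)

# Two independent, autonomous subsystems each holding a full copy of
# the data: both must be down at the same moment for data to become
# unavailable, so the combined unavailability is the product.
redundant = 1 - (1 - single) ** 2

print(f"single frame:   {single}")
print(f"two subsystems: {redundant}")
```

Under these assumptions a single frame is down roughly 0.016% of the time, while the redundant pair is unavailable only when both fail concurrently, shrinking that figure by several orders of magnitude. Note that the multiplication is only valid if the failures really are independent, which is exactly what the autonomy, separation, asymmetry, and diversity principles below are meant to guarantee.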

Avoiding The ‘That-Shouldn’t-Have-Happened’ Scenario
High-availability improves upon the principles of fault-tolerance by extending component-level redundancy with data-level redundancy. By achieving data-level redundancy, you are protecting against a much larger potential disaster and mitigating the risk to data integrity from external or unforeseen factors.

High-availability can be viewed as the big brother of fault-tolerance and can be summarized in these six storage principles:

End-to-End Redundancy – The storage system in its entirety has no single point of failure up and down the entire storage stack. This is inclusive of all component-level and data-level redundancies.

Subsystem Autonomy – The underlying systems which have the responsibility of storing the data have no dependency on one another. These systems are completely unaware of each other and have no direct connection whatsoever.

Subsystem Separation – The underlying systems which have the responsibility of storing the data are physically separated from one another in different rooms, buildings, or campuses to mitigate against a localized environmental failure or disaster.

Subsystem Asymmetry – The underlying systems which have the responsibility of storing the data are of different makes and models. This ensures that any inherent design flaws that may exist in one subsystem are unlikely to appear in the other, eliminating the risk of concurrent failures due to a common design flaw.

Subsystem Diversity – The underlying systems which have the responsibility of storing the data are powered by different power systems, cooled by different air handlers, and connected by different switches and routers.

Polylithic Design – The underlying systems which have the responsibility of storing the data consist of multiple smaller storage frames instead of a single large storage frame. This reduces the risk of a single failure affecting the entire storage infrastructure, as it could if all the data were contained within one frame.
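The data-level redundancy these principles describe can be sketched as synchronous mirroring: every write is committed to two autonomous subsystems before it is acknowledged, and a read can be served by whichever subsystem survives. The toy class below uses plain dictionaries to stand in for independent storage frames; the class and method names are mine, purely for illustration:

```python
class MirroredStore:
    """Toy sketch of data-level redundancy: writes are committed to
    two independent replicas before being acknowledged."""

    def __init__(self):
        # Two autonomous 'subsystems' -- in a real deployment these
        # would be separate frames from different vendors, in
        # different rooms, on different power and network paths.
        self.replica_a = {}
        self.replica_b = {}

    def write(self, key, value):
        # Synchronous mirror: the write lands on both subsystems
        # before this call returns.
        self.replica_a[key] = value
        self.replica_b[key] = value

    def read(self, key):
        # Either subsystem can serve the read; fall back to the
        # survivor if one side has lost the data.
        if key in self.replica_a:
            return self.replica_a[key]
        return self.replica_b[key]

    def fail_subsystem_a(self):
        # Simulate losing an entire frame, not just a component.
        self.replica_a = {}


store = MirroredStore()
store.write("invoice-42", "payload")
store.fail_subsystem_a()         # an entire frame is lost...
print(store.read("invoice-42"))  # ...yet the data is still served
```

Real implementations push this mirroring into the hypervisor, volume manager, or replication layer rather than the application, but the invariant is the same: no acknowledgment until the data exists in two places that cannot fail together.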

There are very clear differences between fault-tolerance and high-availability. Fault-tolerance should be the absolute minimum deployment model in any infrastructure, but for business- and mission-critical applications, true high-availability is a must. Check out Part 2, where we will explore how to architect and deploy highly-available storage solutions… It is easier than you may think.
