Disk Is Dead? Says Who?

INTRODUCTION

Before we begin, I want to state right up front that I am not anti-flash, nor am I anti-hardware. I work for DataCore Software, which has spent nearly two decades mastering the ability to exploit hardware capabilities for the sole purpose of driving storage I/O. Our software needs hardware, and hardware needs software to instruct it to do something useful. However, over the last year I have read a lot of commentary about how disk (a.k.a. HDDs, or magnetic media) is dead. Colorful metaphors such as “spinning rust” are used to describe the apparent death of the HDD market, but is this really the case?

According to a report from TrendFocus, the number of drives shipped in 2015 declined by 16.9% (to 469 million units); however, the amount of capacity shipped increased by more than 30% (to 538 exabytes, or 538,000 petabytes, or 538,000,000 terabytes). In other words, a lot of HDD capacity.

Please note, however, that this is NEW capacity added to the industry on top of the already mind-blowing amount of existing capacity in the field today (estimated at over 10 zettabytes, or 10,000 exabytes, or, well, you get the idea). Eric Brewer, VP of Infrastructure at Google, recently said,

“YouTube users are uploading one petabyte every day, and at current growth rates they should be uploading 10 petabytes per day by 2021.”

The capacity trend certainly doesn’t show signs of slowing, which is why new and improved ways of increasing HDD density are emerging (such as helium-filled drives, HAMR, and SMR). With these new manufacturing techniques, HDD capacities are expected to reach 20TB+ by 2020.

So, I wouldn’t exactly say disk (HDD) is dead, at least from a capacity-demand perspective, but it does raise some interesting questions about the ecosystem of drive technology. Perhaps the conclusion that disk is dead is based on drive performance. There is no doubt a battle is raging in the industry: on one side we have HDD, on the other SSD (flash). Both have advantages and disadvantages, but must we choose between one or the other? Is it all or nothing?

MOVE TO FLASH NOW OR THE SKY WILL FALL

In addition to the commentary about disk being dead, I have seen an equal amount of commentary about how the industry needs to adopt all-flash tomorrow or the world will come to an end (slight exaggeration perhaps). This is simply an impossible proposition. According to a past Gartner report,

“it will be physically impossible to manufacture a sufficient number of SSDs to replace the existing HDD install base and produce enough to cater for the extra storage growth.”

Even displacing 20% of the forecasted growth is a near impossibility. And I will take this one step further: not only is it impossible, it is completely unnecessary. None of this implies HDD and SSD cannot coexist in peace; they certainly can. What is needed is exactly what Gartner said in the same report,

“ensure that your choice of system and management software will allow for seamless integration and intelligent tiering of data among disparate devices.”

The reason Gartner made this statement is that they know only a small percentage of an organization’s data footprint benefits from residing on high-performance media.

THE SOLUTION TO THE PROBLEM IS SOFTWARE

One of the many things DataCore accomplishes with the hardware it manages is optimizing the placement of data across storage devices with varying performance characteristics. This feature is known as auto-tiering, and DataCore does it automatically across any storage vendor or device type, whether flash- or disk-based.

Over the last six years, DataCore has proven with its auto-tiering capability that only 3-5% of the data within most organizations benefits from high-performance disk (the percentage is even less when you understand how DataCore’s Parallel I/O and cache work, but we will touch on this later). Put another way, 95% of an organization’s I/O demand occurs within 3-5% of the data footprint.

While the 3-5% data range doesn’t radically change from day to day, the data contained within that range does. The job of DataCore’s auto-tiering engine is to ensure the right data is on the right disk at the right time in order to deliver the right performance level at the lowest cost. No need to wait, schedule, or perform any manual steps. By the way, the full name of DataCore’s auto-tiering feature is: fully automated, sub-LUN, real-time, read and write-aware, heterogeneous auto-tiering. Not exactly a marketing-friendly name, but there it is.
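To make the mechanics concrete, below is a minimal Python sketch of the general principle behind heat-based, sub-LUN tiering. It is purely an illustration of the concept, not DataCore’s implementation; the tier names, the 128MB chunk granularity, the 5% promotion threshold, and the manual rebalance trigger are all invented for the example.

    from collections import defaultdict

    # Hypothetical tiers, fastest first. A real engine tracks far more state
    # (read/write mix, recency decay, tier capacity, migration cost, etc.).
    TIERS = ["flash", "15k_sas", "nearline_sata"]

    class TieringEngine:
        def __init__(self, chunk_size_mb=128):
            self.chunk_size_mb = chunk_size_mb                 # sub-LUN granularity
            self.heat = defaultdict(int)                       # access count per chunk
            self.placement = defaultdict(lambda: TIERS[-1])    # new data starts low

        def record_io(self, chunk_id):
            """Called on every read or write to a chunk; builds the heat map."""
            self.heat[chunk_id] += 1

        def rebalance(self):
            """Periodically promote the hottest ~5% of chunks; demote the rest."""
            ranked = sorted(self.heat, key=self.heat.get, reverse=True)
            hot = set(ranked[: max(1, len(ranked) * 5 // 100)])
            for chunk in ranked:
                target = TIERS[0] if chunk in hot else TIERS[-1]
                if self.placement[chunk] != target:
                    print(f"migrating chunk {chunk} -> {target}")
                    self.placement[chunk] = target

    engine = TieringEngine()
    for _ in range(1000):
        engine.record_io(chunk_id=42)      # a heavily accessed chunk
    engine.record_io(chunk_id=7)           # a chunk touched once
    engine.rebalance()                     # chunk 42 is promoted; chunk 7 stays put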

WAIT A SECOND, I THOUGHT THIS WAS ABOUT DISK, NOT FLASH

While DataCore can use flash technologies like any other disk, it doesn’t require them. To prove the point, I will show you a very simple test I performed to demonstrate the impact just a little bit of software can have on the overall performance of a system. If you need a more comprehensive analysis of DataCore’s performance, please see the Storage Performance Council’s website.

In this test I have a single 2U Dell PowerEdge R730 server. This server has two H730P RAID controllers installed. One RAID controller has five 15k drives attached to it forming a RAID-0 disk group (read and write cache enabled). This RAID-0 volume is presented to Windows and is designated as the R: drive.

The other RAID controller is running in HBA mode (non-RAID mode) with another set of five 15k drives attached to it (no cache enabled). These five drives reside in a DataCore disk pool. A single virtual disk is created from this pool matching the size of the RAID-0 volume coming from the other RAID controller. This virtual disk is presented to Windows and is designated as the S: drive.

[Image: The first set of physical disks forming the RAID-0 volume, as seen in the OpenManage Server Administrator interface]

 

[Image: The second set of physical disks and the disk pool, as seen from within the DataCore Management Console]

 

[Image: The logical volumes R: and S: as seen by the Windows operating system]

DRIVERS, START YOUR ENGINES

I am going to run an I/O generator tool from Microsoft called DiskSpd (the successor to SQLIO) against these two volumes simultaneously and compare the results using Windows Performance Monitor. The test parameters for each run are identical: 8K block size, 100% random, 80% read / 20% write, 10 concurrent threads, and 8 outstanding I/Os against a 10GB test file.

[Image: DiskSpd test parameters for each logical volume]

The first command (line 2 in the screenshot) runs against the RAID-0 disk (R:) and the second command (line 5) runs against the DataCore virtual disk (S:). In addition to there being no cache enabled on the HBA connecting the physical disks in the DataCore pool, the DataCore virtual disk also has its write cache disabled (i.e., write-through enabled). Only DataCore read cache is enabled here.
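For reference, the two runs could be reproduced with something like the following. The DiskSpd parameters mirror the profile listed above; the 60-second duration and the test file names are my own assumptions (they are not stated here), so treat this as a sketch rather than the exact commands used.

    import subprocess

    def diskspd_args(target_file):
        """DiskSpd arguments matching the test profile described above."""
        return [
            "diskspd.exe",
            "-b8K",   # 8K block size
            "-r",     # 100% random access
            "-w20",   # 20% writes (therefore 80% reads)
            "-t10",   # 10 concurrent threads
            "-o8",    # 8 outstanding I/Os per thread
            "-c10G",  # 10GB test file, created if it does not exist
            "-d60",   # assumed duration: 60 seconds
            target_file,
        ]

    # Launch both runs at the same time, one per volume, then collect the output.
    targets = [r"R:\iotest.dat", r"S:\iotest.dat"]   # file names are assumptions
    procs = [subprocess.Popen(diskspd_args(t), stdout=subprocess.PIPE, text=True)
             for t in targets]
    for p in procs:
        print(p.communicate()[0])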

[Image: Write cache disabled on the DataCore virtual disk]

[Image: Performance view of the RAID-0 disk]

 

[Image: Performance view of the DataCore virtual disk]

As you can see from the performance monitor view, the disk being presented from DataCore is accepting over 26x more I/O per second on average (@146k IOps) than the disk from the RAID controller (@5.4k IOps) for the exact same test. How is this possible?

This is made possible by DataCore’s read cache and the many I/O optimization techniques DataCore uses to accelerate storage I/O throughout the entire stack. For much more detail on these mechanisms, please see my article on Parallel Storage.

In addition to Parallel I/O processing, I am using another nifty feature called Random Write Accelerator. This feature eliminates the seek time associated with random writes (operations which cause a lot of actuator movement on an HDD). DataCore doesn’t communicate with the underlying disks the same way the application would directly. By the time the I/O reaches the disks in the pool, the I/O pattern is much more orderly and therefore more optimally received by the disks.
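The following minimal Python sketch shows the general idea of turning random writes into sequential ones with an append-only log and an in-memory index. It illustrates the log-structured technique in principle only; it is not DataCore’s Random Write Accelerator, and the block size and structure are invented for the example.

    class SequentializingWriter:
        """Accepts writes to arbitrary logical addresses but lays them down
        sequentially in an append-only log, tracking the mapping in RAM."""

        def __init__(self, log_path, block_size=4096):
            self.log = open(log_path, "wb")
            self.block_size = block_size
            self.next_offset = 0        # always append at the tail of the log
            self.index = {}             # logical block number -> offset in the log

        def write(self, lbn, data):
            assert len(data) == self.block_size
            self.log.write(data)                # sequential: no seek, no head movement
            self.index[lbn] = self.next_offset  # remember where the latest copy lives
            self.next_offset += self.block_size

        def read(self, lbn):
            self.log.flush()
            with open(self.log.name, "rb") as f:
                f.seek(self.index[lbn])
                return f.read(self.block_size)

    w = SequentializingWriter("datalog.bin")
    w.write(981_223, b"\x00" * 4096)   # "random" logical addresses...
    w.write(17, b"\x01" * 4096)        # ...still land back-to-back in the log
    assert w.read(17) == b"\x01" * 4096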

So now, as any good engineer would do, I’m going to turn it up a notch and see what this single set of five so-called “dead” physical disks can do. I will now test using five 50GB virtual disks. Remember, these virtual disks come from a DataCore disk pool which contains five 15k non-RAIDed disks. Let’s see what happens.

[Image: DiskSpd test parameters for five DataCore virtual disks]

The commands on lines 8-12 are running against the five DataCore virtual disks. Below are the results of the testing.

[Image: Performance view of the five DataCore virtual disks]

Note, nothing has changed at the physical disk layer. The change is simply an increase in the number of virtual disks reading from and writing to the disk pool, which in turn increases the degree of parallelism in the system. This test shows that, for the same physical disks, we have achieved greater than a 63x performance increase on average (@344k IOps) with bursts well over 400k IOps. The test is throwing 70-80,000 write I/Os per second at physical disks which are only rated to deliver about 900 random writes per second combined. This is made possible by sequentializing the random writes before they reach the physical disks, thereby eliminating most of the actuator movement on the HDDs. Without adding any flash to the system, the software has effectively returned greater-than-flash-like performance with only five 15k disks in use.
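A quick back-of-the-envelope check of the figures quoted above (all inputs come from the text; the ~180 random writes per second per drive is simply the 900-combined figure divided across five drives):

    baseline_iops  = 5_400      # RAID-0 volume in the first test
    single_vd_iops = 146_000    # one DataCore virtual disk, same physical drives
    five_vd_iops   = 344_000    # five DataCore virtual disks, same physical drives

    print(single_vd_iops / baseline_iops)   # ~27x   (the "over 26x" figure)
    print(five_vd_iops / baseline_iops)     # ~63.7x (the "greater than 63x" figure)

    # Write pressure versus what the spindles are rated for:
    avg_write_iops      = 0.20 * five_vd_iops   # 20% of the workload is writes -> ~68,800/s
    rated_random_writes = 5 * 180               # five 15k drives at ~180 random writes/s each
    print(avg_write_iops / rated_random_writes) # ~76x the drives' rated random-write capability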

One more important note: this demonstration is certainly not representative of the most you can get out of a DataCore configuration. On the latest SPC-1 run, where DataCore set the world record for all-out performance, DataCore reached 5.12 million SPC-1 IOps with only two engines (and the CPUs on those engines were only 50% utilized).

CONCLUSION

There are two things happening in the storage industry which have caused a lot of confusion. The first is a lack of awareness of the distinction between I/O parallelization and device parallelization. DataCore has definitively proven its I/O parallelization technique is superior in performance, cost, and efficiency. Flash is a form of device parallelization and can only improve system performance to a point. Device parallelization without I/O parallelization will not take us where the industry is demanding we go (see my article on Parallel Storage).

The second is a narrative being pushed on the industry which says “disk is dead” (likely due to my first concluding point). The demonstration above proves spinning disk is very much alive. Someone may argue I’m using a flash-type device in the form of RAM to serve as cache. Yes, RAM is a solid-state device (a device electronic in nature), but it is not exotic, it has superior performance characteristics, and organizations already have tons of it sitting in very powerful multiprocessor servers within their infrastructures right now. They simply need the right software to unlock its power.

Insert DataCore’s software layer between the disk and the application and immediately unbind the application from traditional storage hardware limitations.

Parallel Application Meets Parallel Storage

INTRODUCTION

A shift in the computer industry has occurred. Did you notice it? It wasn’t a shift that happened yesterday or even the day before, but rather 11 years ago. The year was 2005, and Moore’s Law as we knew it took a deviation from the path it had been traveling for over 35 years. Up until that point, improved processor performance came mainly from frequency scaling, but when core speeds reached ~3.8GHz, pushing beyond that barrier quickly became cost prohibitive due to the physics involved (factors such as core current, voltage, heat dissipation, structural integrity of the transistors, etc.). Thus, processor manufacturers (and Moore’s Law) were forced to take a different path. This was the dawning of the massive symmetrical multiprocessing era (or what we refer to today as ‘multicore’).

The shift to symmetrical multiprocessing (SMP) architectures required a specialized skill set in parallel programming in order to fully realize the performance increase across the numerous processor resources. It was no longer enough to simply rely on frequency scaling to improve application response times and throughput. Interestingly, today, more than a decade later, a severe gap persists in our ability to harness the power of multicore, mainly due to either a lack of understanding of parallel programming or the inherent difficulty of porting a well-established application framework to a parallel programming construct. Perhaps virtualization is also responsible for some of the gap, since the entire concept of virtualization (specifically compute virtualization) is to create many independent virtual machines, each of which can run the same application simultaneously and independently. Within this framework, the demand for parallelism at the application level may have diminished, since the parallelism is handled by the abstraction layer and scheduler within the compute hypervisor (and is no longer as necessary for the application developer; I’m just speculating here). So, while databases and hypervisors are largely rooted in parallelism, there is one massive area that still suffers from a lack of parallelism, and that is storage.

THE PARALLEL STORAGE REVOLUTION BEGINS

In 1998, DataCore Software began work on a framework specifically intended for driving storage I/O. This framework would become known as a storage hypervisor. At the time, the best multiprocessor systems that were commercially available were multi-socket single-core systems (2 or 4 sockets per server). From 1998 to 2005, DataCore perfected the method of harnessing the full potential of common x86 SMP architectures with the sole purpose of driving high-performance storage I/O. For the first time, the storage industry had a portable software-based storage controller technology that was not coupled to a proprietary hardware frame.

In 2005, when multicore processors arrived in the x86 market, an intersection formed between multicore processing and increasingly parallel applications such as VMware’s hypervisor and parallel database engines like Microsoft SQL Server and Oracle. Enterprise applications slowly became more and more parallel, while, surprisingly, the storage subsystems that supported them remained largely serial.

MEANWHILE, IN SERIAL-LAND

The serial nature of storage subsystems did not go unnoticed, at least by storage manufacturers. It was well understood that, given the rate of increase in processor density coupled with wider adoption of virtualization technologies (which drove much higher I/O demand per system), a change was needed at the storage layer to keep up with increased workloads.

In order to overcome the obvious serial limitation in storage I/O processing, the industry had to make a decision to go parallel. At the time, the path of least resistance was to simply make disks faster, or, taken from another perspective, to make solid-state disks (which by 2005 had been around in some form for over 30 years) more affordable and denser.

As it turns out, the path of least resistance was chosen, either because alternative methods of storage I/O parallelization were unrealized or because the storage industry was unwilling to completely recode its already highly complex storage subsystems. The chosen technique, referred to as [hardware] device parallelization, is now used by every major storage vendor in the industry. The only problem is, it doesn’t fundamentally address the real problem of storage performance, which is latency.

Chris Mellor from The Register wrote recently in an article, “The entire recent investment in developing all-flash arrays could have been avoided simply by parallelizing server IO and populating the servers with SSDs.”

TODAY’S STORAGE SYSTEMS HAVE A FATAL FLAW

There is one fatal flaw in modern storage subsystem design, and it is this: today’s architectures still deal with I/O the old way, by pushing the problem down to the physical disk layer. The issue is that the disk layer is both the furthest point from the application generating the I/O demand and, simultaneously, the slowest component in the entire storage stack (yes, including flash).

In order to achieve any significant performance improvement from the application’s perspective, a large number of physical disks must be introduced into the system, either in the form of HDDs or SSDs. (An SSD is a good example of singular device parallelization because it represents a multiple of HDDs in a single package. SSDs are not without their own limitations, however. While they do not suffer from mechanical latencies like HDDs, they do suffer from a phenomenon known as write amplification.)

A NOT-SO-NEW APPROACH TO PARALLELIZATION

Another approach to dealing with the problem of I/O is to flip the problem on its head, in a manner of speaking. Rather than dealing with the I/O at the furthest point from the application and with the slowest components, as device parallelization attempts to do, let’s entertain the possibility of addressing the I/O as soon as it is encountered and with the fastest components in the stack. Specifically, let’s use the abundance of processors and RAM in today’s modern server architectures to get the storage subsystem out of the way of the application. This is precisely what DataCore’s intention was in 1998, and with the emergence of multicore processors in 2005, the timing could not have been better.
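As a rough structural illustration of that idea, here is a minimal Python sketch in which a pool of workers (one per core) drains incoming I/O requests from per-core queues and satisfies reads from a RAM cache whenever possible, touching the slow device only on a miss. It is a conceptual sketch, not DataCore’s engine, and Python’s GIL means it shows the structure rather than the actual performance.

    import threading, queue

    ram_cache = {}                  # block address -> data, served at memory speed
    cache_lock = threading.Lock()

    def slow_device_read(addr):
        # Placeholder for the millisecond-scale trip down to the physical disk layer.
        return b"data-for-%d" % addr

    def io_worker(requests):
        """One worker per CPU core; each drains its own queue independently."""
        while True:
            addr = requests.get()
            if addr is None:
                break
            with cache_lock:
                hit = ram_cache.get(addr)
            if hit is None:                      # miss: pay the device latency once...
                data = slow_device_read(addr)
                with cache_lock:
                    ram_cache[addr] = data       # ...then serve it from RAM next time

    queues = [queue.Queue() for _ in range(8)]   # e.g. eight cores
    workers = [threading.Thread(target=io_worker, args=(q,)) for q in queues]
    for w in workers:
        w.start()

    for addr in range(10_000):                   # incoming I/O fanned out across cores
        queues[addr % len(queues)].put(addr)
    for q in queues:
        q.put(None)                              # shut the workers down
    for w in workers:
        w.join()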

Let’s take a look at a depiction of what this looks like in theory:

[Image: parallelIO – a depiction of I/O parallelization vs. device parallelization]

The contributory improvement in storage performance per device using the device parallelization technique simply cannot compare to that of the I/O parallelization technique. Simply put, the parallelization the industry is attempting to use to solve the storage I/O bottleneck is being applied at the wrong layer. I will prove this with a real-world comparison.

[Image: perftable – SPC-1 performance comparison]

In its latest showing of storage performance superiority, DataCore posted a world-record-obliterating 5.12 million SPC-1 IOps while simultaneously achieving one of the lowest $/IO figures ever seen ($0.10 per IO), beaten on the $/IO measurement only by another DataCore configuration. Comparatively, the DataCore IOps result was faster than the previous #1 and #2 test runs, from Huawei and Hitachi, COMBINED! For a combined price of $4.37 million (the cost of the Huawei and Hitachi systems) and four racks of hardware (the size of both test configurations), you still can’t get the performance that DataCore achieved with only 14U of hardware (1/3rd of one rack) at a cost of $506,525.24.

[Image: latency2]

Put another way, DataCore cost nearly 1/9th as much, occupied roughly 1/12th the space, and delivered less than 1/3rd the response time of Huawei and Hitachi combined. If you try to explain this in terms of traditional storage or device parallelization techniques, you cannot get there. In fact, the only conclusion you can reach using that framework is that it is impossible, and you would be correct. But it is not impossible when you understand the technique DataCore uses. This technique is referred to as I/O parallelization.

MORE THAN SIMPLY CACHE

Some have argued recently that it is simply the use of RAM as cache that allowed DataCore to achieve such massive performance numbers. Well, if that were true, then anyone should be able to reproduce DataCore’s numbers tomorrow, because it is not as if we have a RAM shortage in the industry. By the way, the amount of RAM cache in the Hitachi and Huawei systems combined was twice the amount DataCore used in its test run.

What allowed DataCore to achieve such impressive numbers is a convergence of several factors:

  • CPU power is abundant and continues to increase 20% annually
  • RAM is abundant, cheap, and doesn’t suffer from performance degradation like flash does
  • Inter-NUMA performance within SMP architectures has approached near-uniform shared memory access speeds
  • DataCore exploits the capabilities of modern CPU and RAM architectures to dramatically improve storage performance
  • DataCore runs in a non-interrupt non-blocking state which is optimal for storage I/O processing
  • DataCore runs in a real-time micro-kernel providing the determinism necessary to match the urgent demands of processing storage I/O
  • DataCore deploys anti-queuing techniques in order to avoid queuing delay when processing storage I/O
  • DataCore combines all these factors across the multitude of processors in parallel

CONCLUSION

So what does this mean? What does this mean for me and my applications?

First, it means that we now live in an era that has parallel processing occurring at both the application layer and the storage layer. Second, it means that applications are now free to process at top performance because the storage system is out of the way. And finally, it means that the practice of spending more and more money on larger and larger environments in order to achieve high performance is abolished.

Applications are now unlocked and the technology is now within reach of everyone, let’s go do something amazing with it!

A Match Made in Silicon

I was reminiscing the other day about the old MS-DOS days. I remember being fascinated by the concept of using a RAM disk to make “stuff” run faster. Granted, I was only 10 years old, and while I didn’t understand the intricacies of how this was being accomplished at the time, I understood enough to know that when I put “stuff” into the RAM disk, it ran much faster than my 80MB Conner hard drive. If the RAM disk had been only slightly faster it wouldn’t have been that interesting, but it was amazingly faster.

By the mid-90’s, many commercial applications, specifically databases, began treating RAM more and more like a disk rather than simply a high-speed working space for the application. Today there are many well-known in-memory database (IMDB) systems, most notably from Microsoft (SQL Server/Hekaton), Oracle (Database In-Memory), and SAP (HANA).

In 1998, DataCore Software set out, among many other things, to use RAM as a general-purpose caching layer made accessible via software that could be installed on any x86-based system for any application. With the introduction of Intel multi-core processors in 2005, the software evolved further to exploit the additional cores in parallel. Processors and RAM were getting faster and more abundant, which meant a much higher potential for tapping into the power of parallelism.

Now let’s fast forward to more recent times…

FORT LAUDERDALE, Fla., June 15, 2016 – Following a scorching run of world records, DataCore Software today rocketed past the old guard of high-performance storage systems to achieve a remarkable 5.1 million (5,120,098.98) SPC-1 IOPS™ on the industry’s most respected head-to-head comparison — the Storage Performance Council’s SPC-1™ benchmark. This new result places DataCore number one on the SPC-1 list of Top Ten by Performance. To put the accomplishment into perspective, the independently-audited SPC-1 Result for the DataCore™ Parallel Server software confirms the product as faster than the previous top two leaders combined.

The benefits of using RAM as cache cannot be denied. It worked very well in the beginning as RAM disks. It worked extremely well for IMDBs. Today, DataCore Software is the world-record holder for the fastest block storage system ever tested by the Storage Performance Council, not simply because of the use of RAM as cache, but more specifically because of the software mechanism used to turn the RAM into cache. If it was simply a matter of using RAM as cache, then any storage vendor should be able to reproduce what DataCore produced at the same or better price point on the SPC-1, tomorrow. I wouldn’t recommend holding your breath on that one.

In essence, what DataCore has done is create the world’s fastest in-memory “everything” storage engine (i.e. file data, object data, virtual machines, AND databases). Modern Intel x86-64 based architectures combined with the fastest RAM is truly a match made in silicon… a match only made possible and held together by the most efficient and most powerful storage software ever developed.

DataCore Breaks With Traditional Storage Thinking… 18 years ago.

INTRODUCTION

When you look across history, there are plenty of examples of ideas originally thought to be crazy that later turned out to be breakthroughs: Tesla’s wireless energy transfer system (mainstream adoption: 120 years later), Faraday’s electric generator (mainstream adoption: 50 years later), and Bell’s telephone (mainstream adoption: 26 years later).

Eighteen years ago, DataCore Software broke with the traditional ways of approaching storage, specifically the handling of I/O. It took nearly 14 years after DataCore’s founding for the industry to adopt what is today called Software-defined Storage (aka SDS). Although today, 18 years later, there are many competitors in the field, none has disrupted the traditional way of thinking like DataCore has. DataCore attacks the problem of I/O from the completely opposite direction, and with staggering results.

[BEGIN SOAPBOX #1]
I don’t particularly like the term ‘Software-defined Storage’, mainly because it implies that prior to SDS, hardware alone was used to drive storage. In reality, hardware devoid of any sort of instruction set, whether firmware or software, is useless. Likewise, software without hardware to run on is just as useless. You need both hardware and software to do something useful. I like our CEO’s founding statement about DataCore’s vision better:

…Creating an enduring and dynamic ‘Software-driven Storage’ architecture liberating storage from static hardware-based limitations
– George Teixeira, CEO, 1998

Substituting ‘Software-driven’ for ‘Software-defined’ implies that storage hardware and software work together, each in its own way, to achieve the end goal. As we will see next, the industry calls it “Software-defined Storage”, but it is still very much hardware-driven.
[END SOAPBOX #1]

FIRST: WHAT IS THE PROBLEM?

The problem is I/O. What is I/O? I/O is input and output, or the movement of data through a system. The I/O takes the form of either a read or a write operation. In other words, the system is either retrieving information (reads) or committing changes to information (writes). The rate at which reads and writes occur is referred to as IOps (I/Os per second).

[BEGIN SOAPBOX #2]
I could spend the rest of the day writing about what IOps are and what they are not, but I will keep it short. The IOps value by itself is completely meaningless. Without understanding whether the I/O is a read or a write, performed in a sequential or random pattern, what size the I/O is (referred to as block size), and the latency of the I/O, the value of IOps means literally nothing. At the very least, assuming the exact same test conditions, you could get a relative performance comparison between two systems using the IOps value. But this is almost never the case. The only exception is an industry standardized and audited benchmark called the SPC-1 from the Storage Performance Council. This is truly the only meaningful and consistent comparison of IOps you are likely to find. NOTE: The SPC-1 represents an intensive OLTP workload, similar to that of a database.
[END SOAPBOX #2]
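To put a number on why an IOps figure means little on its own, here is a small Python example. It shows how the same IOps translate into very different throughput depending on block size, and how IOps relate to latency through the number of outstanding I/Os (Little’s Law). The figures are illustrative, not benchmark data.

    def throughput_mb_s(iops, block_size_kb):
        return iops * block_size_kb / 1024

    # "100,000 IOps" is a very different workload at 4K than at 64K:
    print(throughput_mb_s(100_000, 4))    # ~390 MB/s
    print(throughput_mb_s(100_000, 64))   # ~6,250 MB/s

    # Little's Law applied to storage: IOps = outstanding I/Os / latency (in seconds).
    def iops_from_latency(outstanding_ios, latency_ms):
        return outstanding_ios / (latency_ms / 1000)

    print(iops_from_latency(80, 1.0))     # 80 I/Os in flight at 1 ms  -> 80,000 IOps
    print(iops_from_latency(80, 0.1))     # the same queue at 100 µs   -> 800,000 IOps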

SECOND: WHAT IS THE SOLUTION?

When you boil this whole problem down to the ground floor, when you get to the core of the issue, when you finally peel back all the layers of the onion, you are ultimately left with two fundamental solutions to the problem of I/O. You can either:

#1: Throw hardware at the problem. This includes the amount of hardware as well as the type of hardware (i.e. faster and more expensive disks such as SSD and Flash). This approach is called Hardware Parallelization.

#2: Throw software at the problem. Not just any software mind you, software that is super intelligent and is able to completely harness and fully exploit the power of the underlying hardware. And not just the disk hardware, but the CPU and memory resources. This approach is called I/O Parallelization.

Approach #1 attacks the I/O problem by pushing the problem down to the disk, to the slowest components in the entire system stack, furthest from the application. As you will see in a moment, this is not only ‘not efficient’, it is a tremendous waste of expensive resources.

Approach #2 attacks the I/O problem as soon as the I/O is encountered, at the fastest layer in the entire system stack (CPU and memory), closest to the application. As you will see in a moment, this is not only ‘extremely efficient’, but the end result is nothing short of an enigma, achieving something that in traditional terms is impossible (but remember, DataCore broke with tradition 18 years ago).

WHAT DOES APPROACH #1 LOOK LIKE?

From a physical perspective, hardware parallelization tends to look like this:

[Image: Hitachi VSP G1000 test configuration]

First off, I want to make very clear that neither I nor DataCore is at war with our friends at Hitachi or any other hardware vendor. DataCore is software; we need hardware. This system (Hitachi VSP G1000) was chosen simply because it achieved similar performance (IOps and latency) and price/performance levels to what DataCore achieved on the SPC-1 benchmark. This system came in at $2,003,803, achieving $0.96 per SPC-1 I/O, or 2,004,941 IOps.

WHAT DOES APPROACH #2 LOOK LIKE?

From a physical perspective, I/O parallelization looks like this:

[Image: DataCore Parallel Server test configuration]

This system (DataCore Parallel Server running on a single Lenovo x3650 2U server with a 2U, 24-bay storage array attached) came in at $136,759, achieving $0.09 per SPC-1 I/O, or 1,510,090 IOps.

HOW DO THE LATENCIES COMPARE?

The first obvious distinctions between the two are cost and physical size. Let’s take a look at a third, less obvious distinction, one that is critical to your applications and users, even more so than IOps… latency.

[Image: Latency over the SPC-1 ramp phases (10% to 100%)]

As you can see from the latency graph taken over each ramp phase of the benchmark (10% through 100%), Hitachi (like every other storage system tested by the SPC-1) falls over what I refer to as the interrupt cliff. What should really stand out to you, however, is the flat line representing DataCore’s latency curve. Since DataCore is a real-time, non-interrupt-based, parallel I/O engine, you will not see the typical latency curves you see with other storage systems. Interestingly, this marks the first time in history that a storage system has achieved sub-100-microsecond response times at 100% load on the SPC-1.

THE BOTTOM LINE

DataCore, like Tesla, Faraday, and Bell, landed way ahead of its time in the industry. And DataCore, just like those early pioneers who faced the naysayers and the scrutiny of those around them, in the end prevailed and proved to the world that its way was the best way. The results don’t lie: the fastest storage platform with the lowest latency at the lowest cost in existence today, 10,000+ worldwide customers, 30,000+ worldwide deployments, 10th-generation proven technology, and over 18 years of development.

The bottom line is this: If you want a platform with rich enterprise features that delivers outstanding performance while saving lots of space and money, then DataCore is the answer. Otherwise, choose a traditional storage platform.

For more information about DataCore and its 10th-generation award-winning storage software, check out this three-minute video:

DataCore SANsymphony

or the DataCore website.

DataCore drops SPC-1 bombshell delivering 5.1 Million IOps

The Fort Lauderdale boys have struck again, with a record-breaking run of 5 million IOPS, and maybe killed off every other SPC-1 benchmark contender’s hopes for a year or more.

DataCore Software, headquartered in Fort Lauderdale, Florida, has scored 5,120,098.98 SPC-1 IOPS with a simple 2-node Lenovo server set-up, taking the record from Huawei’s 3 million-plus scoring OceanStor 18800 V3 array, which needs multiple racks and costs $2.37m. The DataCore Parallel Server configuration costs $505,525.24, almost a fifth of Huawei’s cost.

It is astonishing how SPC-1 results have rocketed in the past few years, as Huawei and Hitachi/HPE and Kaminario have sent results above the 1 million IOPS mark.

What seemed ground-breaking at first is now viewed as ordinary; a million SPC-1 IOPS? Okay, move on. Five million, though, is more than the previous top two results combined and comes from just a pair of hybrid flash/disk servers, not a super-charged all-flash array.

Find full article here: http://www.theregister.co.uk/2016/06/15/datacore_drops_spc1_bombshell/

Isaac Newton, SPC-1, and The Real World

Definitions

The SPC-1 is an industry recognized storage performance benchmark developed by the Storage Performance Council to objectively and consistently measure storage system performance under real-world high-intensity workloads (principally, OLTP database workloads).

Introduction

Over the last four years, the storage industry has transformed at an amazing rate. It seemed almost every other week, another software-defined storage startup emerged. On the surface this appears great, right? Lots of competition, lots of choice, etc. However, with all of this also comes lots of confusion and disappointment. What is actually new with all of these developments? Are there truly pioneers out there taking us into new and uncharted territory? Let’s go exploring!

Isaac Newton and the SPC-1

Wait, what? How in the world does Isaac Newton relate to the SPC-1? As you may know, Newton co-invented calculus and discovered the laws of motion and gravity. He is unquestionably one of the most notable scientists in history. Without him, we wouldn’t have much of the modern world we enjoy today. While some appreciate what he accomplished technically, most people do not go around citing the intricacies of Newtonian mechanics. However, we do appreciate the results of his discoveries: cars, planes, space shuttles, satellites sent to other planets, and many other amazing things. So, while the underlying mechanics are necessary to operate in the modern world, the details are generally reserved for academia. Such is the case with the SPC-1.

This article has one simple objective: to draw the parallel between what the SPC-1 demonstrates and its implications in the real world. Similar to Newtonian mechanics, most people do not walk around citing SPC-1 results. However, just as with Newton, the results have real-world implications, specifically for the information technology world; a world which we are all deeply connected to in one way or another.

What Does The SPC-1 Show Us and… “So What?”

The SPC-1 analyzes all-out performance and price/performance for a given storage configuration. While not showcased, latency analysis is also included within the full disclosure report for each benchmark run. The importance of latency will become apparent later in this article. But in the end, who doesn’t want performance, right?

One question that usually jumps out after reviewing the SPC-1 results is, “So what?” Well, as it turns out, that is precisely what I am trying to answer here. On the surface there is basic vendor performance comparison: the higher the IOps, the better the all-out performance; the lower the $/IO, the more cost-efficient the system. What happens when a vendor is able to achieve top performance numbers and top price/performance numbers on the same benchmark run? Now that would be interesting.

Generally speaking, you will not find the same vendor system in the top 10 for both categories simultaneously, mainly because the two categories fall at opposite ends of the spectrum. Typically, the higher the IOps produced, the more expensive the system; conversely, the lower the $/IO, the lower the total overall performance.

So, hypothetically speaking, what would it mean if a vendor were to construct a single storage system that landed in both categories? First, it would mean that the system is both really fast and really efficient (one could argue that it is really fast because it is really efficient). Second, it would raise certain questions about how storage systems are constructed. In other words, it would be like having a Bugatti Veyron with a top speed of 268 mph for the price of a Toyota Camry. It wouldn’t just be interesting; it would change the entire industry.

If your next response is, “But I don’t need millions of IOps,” you would be missing the point completely. OK, so you don’t need millions of IOps, but you get them anyway. What you need to realize is that you don’t need as many systems to achieve your goal in the infrastructure. In other words, why buy 10 of something when 2 will do the job?

What I am driving toward here is this: imagine how much more performance you could get for every dollar spent, imagine how much more application and storage consolidation you could get while simultaneously reducing the number of systems, imagine how much more you could save on operational expenses with less hardware, imagine running hundreds of enterprise virtual machines with true data and service high-availability in an N+1 configuration while simultaneously serving enterprise storage services to the rest of the network. Oh, the possibilities.

Below are examples of one type of convergence you can achieve with a system such as this. The server models shown below are used for illustration purposes; they could be Lenovo, Dell, Cisco, or any multi-core x86-based system available in the market today. While traditional SAN, converged, and hyper-converged models are also easily achievable and have been available for many years, the model shown below represents a hybrid-converged model. It provides the highest level of internal application consolidation while simultaneously presenting enterprise storage services externally to the rest of the infrastructure. Without DataCore SANsymphony-V, this level of workload consolidation wouldn’t be possible.

[Image: Hybrid-converged model with Hyper-V]

[Image: Hybrid-converged model with VMware]

So, Does This System Actually Exist?

As it turns out, this isn’t theoretical; it is very real, and has been for many years now. DataCore’s SANsymphony-V software is what makes this possible. DataCore’s approach to performance begins and ends with software. This is completely opposite to all other vendors, who try to solve the performance problem by throwing more expensive hardware at it. And this is precisely why, for the first time (from what I can tell), a vendor (specifically DataCore) landed in both top-10 categories (performance and price/performance) simultaneously with the same test system.

And What About This Matter of Latency?

There still tends to be a lot of talk about IOps. As I have been saying for years now, IOps is a meaningless number unless you have other pieces of information regarding the test conditions such as % read, % write, % random, % sequential, and block size. Then, even with this information, it only becomes useful when comparing systems that have been tested with the same set of conditions. In the marketing world, this is never the case. Every storage vendor touts some sort of performance achievement, but the numbers are incomparable to other systems because the test conditions are different. This is why the SPC-1 is so significant. It is a consistent application of test conditions for all systems making objective comparison possible.

One thing that is not talked about enough, however, is latency, and specifically the latency across the entire workload range. Latency is what will define the application performance and user experience in the end.

In general, when comparing systems, IOps are inversely proportional to latency (response time). In other words, the higher the IOps, the lower the latency tends to be, and vice versa. Note, this is not always the case, because there are some systems that deliver decent IOps but terrible latency (primarily due to large queue depths and/or queuing issues).
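To make the queue-depth caveat concrete, here is a tiny Python illustration: two systems can post the same headline IOps while giving applications very different response times, because the implied latency depends on how many I/Os must be kept in flight to reach that number (illustrative values only).

    def implied_latency_ms(iops, outstanding_ios):
        """Average response time implied by an IOps figure at a given queue depth."""
        return outstanding_ios / iops * 1000

    # The same 500,000 IOps headline number...
    print(implied_latency_ms(500_000, 32))     # ...with 32 I/Os in flight    -> 0.064 ms
    print(implied_latency_ms(500_000, 2048))   # ...with 2,048 I/Os in flight -> 4.1 ms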

DataCore SANsymphony-V not only set the world record for the lowest price/performance ($/IO) figure and landed in both top-10 categories with the same test system, it also set a new world record for the lowest latency ever recorded by the SPC-1… sub-100 microseconds! Interestingly, the most impressive part, which you could miss if you are not paying attention, is that it achieved this world-record latency at 100% workload. This is simply staggering! Granted, you may not run at an all-out 100% workload intensity, but that just means your latency will be that much lower under normal conditions. The analogy here is the same Bugatti Veyron mentioned earlier running at top speed while towing 10 tractor-trailers behind it.

Below is a throughput/latency curve comparing DataCore SANsymphony-V to the previous fastest response time on the SPC-1 benchmark (the fastest I could find in the top 10, at least). Notice how flat the latency curve is for DataCore. This is indicative of how efficient DataCore’s engine is. Not only did DataCore SANsymphony-V post better than 7x lower latency (at 100% workload) than Hitachi, it also drove an additional 900,000 SPC-1 I/Os per second. And finally, it achieved this result at 1/13th the cost of the previous record holder!

[Image: Throughput/latency curve comparing DataCore SANsymphony-V to the previous fastest SPC-1 response time]

How was this accomplished? Simply put, it is baked into the foundation of how DataCore moves I/O through the system: in a non-interrupt, real-time, parallel fashion. In other words, DataCore doesn’t just “not get in the way”, it actually removes the barriers that normally exist.

Conclusion

Hopefully by now you can see the answer to the “so what” question. These SPC-1 results go well beyond just a storage discussion. This directly impacts the way applications are delivered. You can now achieve what was once impossible. Is it virtual desktops you are after? Imagine running 10x more with less hardware without sacrificing performance. Is it mailboxes you are after? Imagine running 20x more with less hardware without sacrificing performance. Is it database performance you are after? Imagine running on the fastest storage system on the planet (not my words, the SPC-1’s findings) with the lowest latency and doing it at a cost that is untouchable by other solutions (hardware and software-defined alike). So while the SPC-1 is rooted in storage performance, the effect this has on the rest of the ecosystem is beyond just interesting… it is revolutionary!

References

Storage Performance Council Website
SPC-1 Top Ten List
DataCore Parallel IO Website

Inline vs. Post-Process Deduplication: Good vs. Bad?

INTRODUCTION

Over the last several months I have spoken with many clients interested in deduplication. There is good reason for this interest, but one aspect of deduplication always gets more attention. The question of whether a solution performs deduplication via “inline” or “post-process” is always of significant interest. The prevailing mindset in the industry, it would seem, is that inline is superior to post-process. Let’s pull back the covers to see if there is any real truth to it. To ensure we are on the same page, let’s define these terms before proceeding.

CONCEPT REVIEW

Deduplication effectively gives you some percentage of usable storage capacity above the native capacity (although the gain is highly variable based on the data types involved). You can either look at it as a given amount of data consuming less space, or, normalized, as an increase in the total effective usable storage space. In other words, if you reduce your data to 1/2 of its original size, then you have effectively doubled (2x) your usable storage capacity.
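In simple arithmetic terms (an illustration only, not a vendor-specific calculation):

    def effective_capacity(raw_tb, space_savings):
        """space_savings is the fraction removed by deduplication, e.g. 0.5 means
        the data now occupies half its original size."""
        return raw_tb / (1 - space_savings)

    print(effective_capacity(100, 0.50))   # 50% savings -> 200 TB effective (2x)
    print(effective_capacity(100, 0.80))   # 80% savings -> 500 TB effective (5x)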

Inline refers to deduplicating ingress data before it is written to the destination device. Post-process refers to deduplicating ingress data after it has been written to the destination device.
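Here is a minimal Python sketch of the difference, using simple content hashing as the duplicate-detection mechanism. Real products use far more sophisticated chunking, fingerprinting, and metadata handling; the only point of the sketch is where in the write path the deduplication work happens.

    import hashlib

    store = {}          # fingerprint -> unique data block (the "destination device")
    volume = []         # logical volume: an ordered list of fingerprints

    def write_inline(block):
        """Inline: fingerprint and deduplicate BEFORE the block is committed."""
        fp = hashlib.sha256(block).hexdigest()
        if fp not in store:          # extra work sits in the write path...
            store[fp] = block
        volume.append(fp)            # ...but duplicate blocks never hit the device

    raw_volume = []     # post-process: blocks land on the device as-is first

    def write_postprocess(block):
        """Post-process: commit immediately, deduplicate later."""
        raw_volume.append(block)     # fast acknowledgment back to the application

    def dedupe_pass():
        """Run later (e.g. when the system is idle) to reclaim the duplicates."""
        for i, block in enumerate(raw_volume):
            fp = hashlib.sha256(block).hexdigest()
            store.setdefault(fp, block)
            raw_volume[i] = fp       # replace the full copy with a reference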

INLINE ANALYSIS

First, let’s look at why a vendor would choose one method over the other. Take all-flash vendors, for example, which always use inline deduplication. Without some sort of data reduction, the economics of all-flash systems are not nearly as attractive. Besides the need to reduce the $/GB of all-flash (which makes a lot of sense in this case), there is another issue that deduplication must address. This issue is related to an inherent disadvantage that all-flash solutions suffer from: write amplification.

Flash blocks that already contain data must be erased before they can be rewritten, and any existing valid data has to be moved before the erase. This ultimately causes many more reads and writes to occur for a single ingress write operation, increasing response time and wear. This is where inline deduplication comes in. The best way to reduce write amplification (which cannot be totally eliminated) is to reduce the amount of ingress data to be written. For all-flash systems there simply is no other choice but to use inline.
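To see why one small overwrite can multiply into much more physical work on flash, here is a deliberately simplified Python model of overwriting a single page inside a nearly full erase block. The geometry and the naive in-place strategy are illustrative only; real controllers use flash translation layers that soften, but do not eliminate, this effect.

    PAGES_PER_BLOCK = 128          # illustrative: one erase block holds 128 pages

    def overwrite_one_page(valid_pages_in_block):
        """Naive in-place overwrite of one page: the block's valid data must be
        read out, the whole block erased, and everything written back."""
        reads  = valid_pages_in_block          # copy out the still-valid pages
        erases = 1                             # erase the entire block
        writes = valid_pages_in_block + 1      # rewrite them plus the new page
        return reads, erases, writes

    r, e, w = overwrite_one_page(valid_pages_in_block=100)
    print(f"1 logical write -> {r} page reads, {e} block erase, {w} page writes")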

Not surprisingly, however, there are costs involved. Placing another intensive operation in the I/O path before committing the data to the disk slows overall performance. This processing overhead, coupled with the reality that write amplification cannot be completely eliminated, leads to unpredictable performance characteristics, especially as the total amount of valid data on the system grows (which increases the metadata that needs to be tracked).

POST-PROCESS ANALYSIS

With systems that utilize post-process (mainly non-all-flash-array systems), the performance impact is almost entirely eliminated. I say “almost” because the deduplication process needs to happen at some point, and it does generate some additional load (albeit a small amount). I say “small” because the impact of the eventual deduplication is mitigated by monitoring overall system activity to determine the best time to perform the operation, thus minimizing contention. Interestingly, the net data reduction is at least as good as, if not better than, inline deduplication. Most importantly, the write-commit response time seen by the application is not impacted, since the data is committed immediately with no intermediate operation standing in the way. This ensures the user and application experience is not negatively impacted when the write is initially generated.

The tradeoff is that capacity consumption is slightly higher for a period of time, until the deduplication process kicks in. In today’s world, where most shops have tens or hundreds of unused terabytes, this is increasingly a non-issue.

CONCLUSION AND RECOMMENDATIONS

It should be apparent by now that it is not really an issue of “Good vs. Bad”. It is more a matter of necessity on the part of the vendor. But, if we were to consider which method has the least amount of negative impact on overall system operation, post-process would seem to have the upper hand.

On a related note, one thing I would highly recommend being careful of is promises about the actual data reduction ratio. Anyone saying they are going to reduce your data footprint by a specific amount without first knowing what the data consists of is lying to you. The only guaranteed data reduction method I know of is one that gives you 100% data reduction, and it’s called FORMAT! (Kidding of course, please do not attempt this at home.)

Below is an example of Microsoft’s deduplication and compression ratios based on common file types:

Scenario                   Content                                       Typical Space Savings
User documents             Documents, photos, music, videos              30-50%
Deployment shares          Software binaries, cab files, symbols files   70-80%
Virtualization libraries   Virtual hard disk files                       80-95%
General file share         All of the above                              50-60%

Great candidates for deduplication:
Folder redirection servers, virtualization depot or provisioning library, software deployment shares, SQL Server and Exchange Server backup volumes, VDI VHDs, and virtualized backup VHDs

Should be evaluated based on content:
Line-of-business servers, static content providers, web servers, high-performance computing (HPC)

Not good candidates for deduplication:
Virtualization hosts (running workloads other than VDI or virtualized backup), WSUS, servers running SQL Server or Exchange Server, files approaching or larger than 1 TB in size

** Random writes are detrimental to the performance and the lifespan of flash devices. Look for systems that are able to sequentialize I/O which will help to reduce the write-amplification effect.

** There are tools available that will estimate your data reduction savings prior to implementation. Microsoft includes one with Windows Server 2012 if the deduplication services are installed (DDPEval.exe).