A shift in the computer industry has occurred. Did you notice it? It wasn’t a shift that happened yesterday or even the day before, but rather 11 years ago. The year was 2005, and Moore’s Law as we knew it deviated from the path it had traveled for over 35 years. Until that point, improved processor performance came mainly from frequency scaling, but when core speeds reached ~3.8GHz, pushing beyond that barrier quickly became cost prohibitive due to the physics involved (core current, voltage, heat dissipation, the structural integrity of the transistors, and so on). Processor manufacturers (and Moore’s Law) were forced onto a different path. This was the dawn of the massive symmetric multiprocessing era (or what we refer to today as ‘multicore’).
The shift to symmetric multiprocessing (SMP) architectures required a specialized skill set in parallel programming in order to fully realize the performance increase across the numerous processor resources. It was no longer enough to rely on frequency scaling to improve application response times and throughput. Interestingly, today, more than a decade later, a severe gap persists in our ability to harness the power of multicore, owing either to a lack of understanding of parallel programming or to the inherent difficulty of porting a well-established application framework to a parallel programming model. Perhaps virtualization is also responsible for some of the gap, since the entire concept of virtualization (specifically compute virtualization) is to create many independent virtual machines, each of which can run the same application simultaneously and independently. Within this framework, the demand for parallelism at the application level may have diminished, since the parallelism is handled by the abstraction layer and scheduler within the compute hypervisor (and is no longer as necessary for the application developer; I’m just speculating here). So, while databases and hypervisors are largely rooted in parallelism, one massive area still suffers from a lack of parallelism, and that is storage.
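The gap described above can be made concrete with Amdahl’s law, the textbook result (not from this article) that bounds the speedup extra cores can deliver when part of a program remains serial:

```python
# Hedged illustration (standard Amdahl's law, not from the article):
# adding cores barely helps code that is largely serial.

def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Theoretical speedup on `cores` cores when only `parallel_fraction`
    of the work can run in parallel."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / cores)

# A program that is 50% serial barely benefits from 32 cores...
print(round(amdahl_speedup(0.50, 32), 2))   # 1.94
# ...while a 95%-parallel program scales far better.
print(round(amdahl_speedup(0.95, 32), 2))   # 12.55
```

This is why frequency scaling’s end forced a new skill set: without restructuring code to raise the parallel fraction, the extra cores sit idle.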
THE PARALLEL STORAGE REVOLUTION BEGINS
In 1998, DataCore Software began work on a framework specifically intended for driving storage I/O. This framework would become known as a storage hypervisor. At the time, the best multiprocessor systems that were commercially available were multi-socket single-core systems (2 or 4 sockets per server). From 1998 to 2005, DataCore perfected the method of harnessing the full potential of common x86 SMP architectures with the sole purpose of driving high-performance storage I/O. For the first time, the storage industry had a portable software-based storage controller technology that was not coupled to a proprietary hardware frame.
In 2005, when multicore processors arrived in the x86 market, an intersection formed between multicore processing and increasingly parallel applications such as VMware’s hypervisor and parallel database engines such as Microsoft SQL Server and Oracle. Enterprise applications slowly became more and more parallel, while, surprisingly, the storage subsystems that supported these applications remained largely serial.
MEANWHILE, IN SERIAL-LAND
The serial nature of storage subsystems did not go unnoticed, at least by storage manufacturers. It was well understood that, given the rate of increase in processor density coupled with wider adoption of virtualization technologies (which drove much higher I/O demand per system), a change was needed at the storage layer to keep up with growing workloads.
In order to overcome the obvious serial limitation in storage I/O processing, the industry had to go parallel. At the time, the path of least resistance was simply to make disks faster or, viewed from another perspective, to make solid state disks (which by 2005 had existed in some form for over 30 years) more affordable and denser.
As it turns out, the path of least resistance was chosen, either because alternative methods of storage I/O parallelization went unrealized or because the storage industry was unwilling to completely recode its already highly complex storage subsystems. The chosen technique, referred to as [hardware] device parallelization, is now used by every major storage vendor in the industry. The only problem with it is that it does not address the fundamental problem of storage performance: latency.
Chris Mellor from The Register wrote recently in an article, “The entire recent investment in developing all-flash arrays could have been avoided simply by parallelizing server IO and populating the servers with SSDs.”
TODAY’S STORAGE SYSTEMS HAVE A FATAL FLAW
There is one fatal flaw in modern storage subsystem design, and it is this: today’s architectures still deal with I/O the old way, by pushing the problem down to the physical disk layer. The issue is that the disk layer is both the furthest point from the application generating the I/O demand and, simultaneously, the slowest component in the entire storage stack (yes, including flash).
In order to achieve any significant performance improvement from the application’s perspective, a large number of physical disks must be introduced into the system, either as HDDs or SSDs. An SSD is a good example of singular device parallelization, because it represents a multiple of HDDs in a single package. SSDs are not without their own limitations, however: while they do not suffer from mechanical latencies the way HDDs do, they do suffer from a phenomenon known as write amplification.
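Write amplification has a standard definition (not spelled out in the article): the ratio of data physically written to flash versus data the host actually asked to write. A factor above 1.0 means extra wear on the cells and reduced effective throughput. A minimal sketch:

```python
# Hedged sketch using the standard definition of SSD write amplification.
# Values below are purely illustrative, not measurements.

def write_amplification(host_bytes_written: float, flash_bytes_written: float) -> float:
    """Ratio of physical flash writes to logical host writes (>= 1.0 in practice)."""
    return flash_bytes_written / host_bytes_written

# e.g. the host writes 100 GB, but garbage collection and page rewrites
# cause 250 GB of physical flash writes:
print(write_amplification(100, 250))  # 2.5
```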
A NOT-SO-NEW APPROACH TO PARALLELIZATION
Another approach to dealing with the problem of I/O is to flip the problem on its head, in a manner of speaking. Rather than handling the I/O at the furthest point from the application and with the slowest components, as device parallelization attempts to do, let’s entertain the possibility of addressing the I/O as soon as it is encountered and with the fastest components in the stack. Specifically, let’s use the abundance of processors and RAM in today’s modern server architectures to get the storage subsystem out of the way of the application. This is precisely what DataCore’s intention was in 1998, and with the emergence of multicore processors in 2005, the timing could not have been better.
Let’s take a look at a depiction of what this looks like in theory:
The per-device improvement in storage performance from the device parallelization technique simply cannot compare to that of the I/O parallelization technique. Simply put, the parallelization the industry is using to attack the storage I/O bottleneck is being applied at the wrong layer. I will demonstrate this with a real-world comparison.
In its latest showing of storage performance superiority, DataCore posted a world-record-obliterating 5.12 million SPC-1 IOPS while simultaneously achieving one of the lowest $/IO figures ever seen ($0.10 per IO), beaten on the $/IO measurement only by another DataCore configuration. Comparatively, the DataCore IOPS result was faster than the previous #1 and #2 test runs from Huawei and Hitachi, COMBINED! For a combined price of $4.37 million (the cost of the Huawei and Hitachi systems) and four racks of hardware (the size of both test configurations), you still can’t get the performance that DataCore achieved with only 14U of hardware (one third of one rack) at a cost of $506,525.24.
Put another way, DataCore came in at nearly 1/9th the cost and 1/12th the size, and delivered better than 1/3rd the response time of Huawei and Hitachi combined. If you try to explain this in terms of traditional storage or device parallelization techniques, you cannot get there. In fact, the only conclusion you can reach using those techniques is that it is impossible, and you would be correct. But it is not impossible when you understand the technique DataCore uses. This technique is referred to as I/O parallelization.
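The cost and size ratios above can be checked with simple arithmetic (the 42U-per-rack figure is my assumption, based on the standard full-height rack; the dollar figures are from the article):

```python
# Sanity-checking the article's ratios. 42U per rack is an assumed
# standard rack height, not a number from the article.
datacore_cost = 506_525.24
combined_cost = 4_370_000.00        # Huawei + Hitachi, per the article
datacore_size_u = 14
combined_size_u = 4 * 42            # four full racks

print(round(combined_cost / datacore_cost, 1))   # 8.6 -> "nearly 1/9th the cost"
print(round(combined_size_u / datacore_size_u))  # 12  -> "1/12th the size"
```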
MORE THAN SIMPLY CACHE
Some have argued recently that it is simply the use of RAM as cache that allowed DataCore to achieve such massive performance numbers. Well, if that were true, then anyone should be able to reproduce DataCore’s numbers tomorrow, because it is not as if we have a RAM shortage in the industry. By the way, the amount of RAM cache in the Hitachi and Huawei systems combined was twice the amount DataCore used in its test run.
What allowed DataCore to achieve such impressive numbers is a convergence of several factors:
- CPU power is abundant and continues to increase 20% annually
- RAM is abundant, cheap, and doesn’t suffer from performance degradation like flash does
- Inter-NUMA performance within SMP architectures has approached near-uniform shared-memory access speeds
- DataCore exploits the capabilities of modern CPU and RAM architectures to dramatically improve storage performance
- DataCore runs in a non-interrupt, non-blocking state, which is optimal for storage I/O processing
- DataCore runs in a real-time micro-kernel, providing the determinism necessary to meet the urgent demands of processing storage I/O
- DataCore deploys anti-queuing techniques in order to avoid queuing delay when processing storage I/O
- DataCore combines all these factors across the multitude of processors in parallel
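The anti-queuing point in the list above can be illustrated with textbook M/M/1 queueing theory (my own illustration of queuing delay in general, not DataCore’s actual internals): response time blows up as a single serial path saturates, while spreading the same load across parallel workers keeps each queue short.

```python
# Hedged illustration using the M/M/1 result W = 1 / (mu - lambda),
# i.e. mean time in system for a single-server queue with Poisson
# arrivals. This is generic queueing theory, not DataCore's algorithm.

def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean time an I/O spends in an M/M/1 queue (waiting + service), in seconds."""
    assert arrival_rate < service_rate, "queue is unstable"
    return 1.0 / (service_rate - arrival_rate)

service_rate = 100_000      # IOs/sec one worker can service (illustrative)
arrival_rate = 95_000       # offered load, near saturation

# One nearly saturated serial path: long queueing delay.
single = mm1_response_time(arrival_rate, service_rate)

# The same load split evenly across 8 parallel workers.
parallel = mm1_response_time(arrival_rate / 8, service_rate)

print(f"serial:   {single * 1e6:.0f} us per IO")    # 200 us
print(f"parallel: {parallel * 1e6:.1f} us per IO")  # 11.3 us
```

The absolute numbers are made up, but the shape of the result is general: queuing delay grows non-linearly near saturation, which is why processing I/O across many processors in parallel pays off far more than the raw core count suggests.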
So what does this mean for me and my applications?
First, it means that we now live in an era of parallel processing at both the application layer and the storage layer. Second, it means that applications are free to run at top performance because the storage system is out of the way. And finally, it means the need to spend ever more money building larger and larger environments to achieve high performance is gone.
Applications are now unlocked and the technology is within reach of everyone. Let’s go do something amazing with it!