From time to time the topic of split-brain comes up among storage techies. The discussion very quickly turns into splitting hairs rather than splitting brains, but nonetheless, the apparent difficulties it presents can be addressed very effectively and practically.
This article doesn’t go into every conceivable combination of events that could occur in an environment; instead, I will break down the common denominators needed for split-brain to occur. The goal here is to outline which elements have to exist for split-brain to occur, in order to understand how to avoid one. Let’s press on.
What is Split-Brain, technically speaking?
Split-brain, in a storage context, is:
an out-of-lockstep condition that can occur when the two members of an active-active storage system (e.g. Vol1′ and Vol1″) become completely severed (or partitioned) from one another during concurrent diverse write operations to the presented volume (i.e. Vol1).
* The green line represents in-band communications (usually mirror traffic) and the orange line represents out-of-band communications (usually storage management). Most enterprise storage systems can use either of these for communicating volume state information. The solid red line represents the primary preferred path to Vol1 for Host Server 1. The solid blue line represents the primary preferred path to Vol1 for Host Server 2.
It is important to point out that it takes two to tango. The storage system alone cannot create this scenario; it requires action on the part of the host server(s)/application and several other conditions aligning perfectly. Let’s take the definition of split-brain and break out each condition individually:
- Using an active-active storage system (Tango Partner #1)
  - Present if you value your data and want high availability for that data
- Using a host-based cluster or clustered file system (Tango Partner #2)
  - Present if you are using VMware ESX with VMFS or another clustered file system
  - This is the BRAIN that actually gets split
- Writing to the same volume from multiple sources (concurrent writes)
  - Present if you use VMFS or another clustered file system
- Writing to different storage controllers at the same time (diverse operations)
  - Present if you want to leverage all available controller resources and I/O channels (e.g. a round-robin load-balancing scheme), or if you have multiple hosts using different preferred controllers
- The host servers and storage controllers becoming fully partitioned
  - Present if you are running all inter-controller communications over a single non-redundant link and that link is severed
Split-brain cannot occur unless all five conditions are present at the same time and the brain (in this case the application-level intelligence, or cluster) becomes split. Let’s take a look at each of these in a little more detail.
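The divergence described above can be sketched in a few lines. This is a hypothetical toy model, not any vendor's implementation: two controllers each hold a copy of Vol1, mirror traffic keeps them in lockstep, and once the link is severed, concurrent diverse writes to the same block leave the two copies disagreeing.

```python
# Toy model of an active-active mirror (illustration only).
class Controller:
    def __init__(self, name):
        self.name = name
        self.blocks = {}          # block address -> data

class Mirror:
    def __init__(self):
        self.a = Controller("Vol1'")
        self.b = Controller("Vol1''")
        self.partitioned = False  # condition #5: inter-controller link severed

    def write(self, target, addr, data):
        target.blocks[addr] = data
        if not self.partitioned:
            # in-band mirror traffic keeps the peer copy in lockstep
            peer = self.b if target is self.a else self.a
            peer.blocks[addr] = data

m = Mirror()
m.write(m.a, 100, "host1-data")    # link up: replicated to both copies
assert m.a.blocks == m.b.blocks

m.partitioned = True               # the single link is severed
m.write(m.a, 200, "host1-data")    # conditions #3/#4: concurrent diverse
m.write(m.b, 200, "host2-data")    # writes land on different controllers
print(m.a.blocks[200] != m.b.blocks[200])  # True: the copies have diverged
```

Remove any one of the five conditions (for example, keep `partitioned` False, or route both writes to the same controller) and the copies stay consistent.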
#1: The Active-Active Storage System
Who doesn’t want a system that provides an extremely high level of protection for their data? Data is everything; no business can do without it. So why not use a system that provides true high availability (not just fault tolerance, or simply component-level redundancy)? Rather than repeating myself about the details of active-active (or synchronous mirroring), please check out a previous post where I go into great detail about how this mechanism works.
#2: The Cluster and the Cluster File System
I started using VMware as an example, so let’s continue, since it is very widely used in the industry. VMware ESX uses a special type of file system (VMFS) which allows simultaneous read/write access from many different sources (or host servers). The datastore (a raw volume formatted with VMFS) is a clustered-file-system logical volume in which the key locking mechanism (a function that prevents writing to the same place at the same time from multiple sources, which would result in data corruption) applies to individual files rather than the entire volume. A single virtual machine disk file (VMDK) can’t be written to by more than one virtual machine at the same time (or corruption would certainly occur). However, multiple virtual machines writing to their own VMDK files on the same VMFS datastore can, and certainly does, happen concurrently.
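The key point, file-level rather than volume-level locking, can be sketched like this. This is a simplified illustration only; real VMFS uses on-disk lock records and SCSI reservations, not an in-memory dictionary. Because the lock protects an individual VMDK rather than the whole datastore, many hosts can write to different files on the same volume at once.

```python
# Sketch of file-level locking on a shared datastore (hypothetical model).
class Datastore:
    def __init__(self):
        self.locks = {}   # vmdk path -> owning host

    def acquire(self, host, vmdk):
        owner = self.locks.get(vmdk)
        if owner is not None and owner != host:
            return False          # another host already owns this VMDK
        self.locks[vmdk] = host   # lock granted per file, not per volume
        return True

ds = Datastore()
print(ds.acquire("esx1", "vm-a.vmdk"))   # True
print(ds.acquire("esx2", "vm-b.vmdk"))   # True  (different file, same volume)
print(ds.acquire("esx2", "vm-a.vmdk"))   # False (file already locked by esx1)
```

If the lock instead covered the whole volume, only one host could write to the datastore at a time, which is exactly what VMFS avoids.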
#3 and #4: Concurrent Diverse Writes To The Same Volume
This is not against the law; in fact, it is a powerful capability because it allows the use of twice the resources to get the job done (i.e. double the cache, double the channel bandwidth, double the back-end disk throughput, etc.). This is perfectly acceptable when you are certain you have the proper redundancy in place. But in cases where you are more susceptible to the dangers of split-brain, such as with stretch or metro clusters, you will want to avoid concurrent diverse writes to the same volume. Stretch and metro clusters usually involve WAN links, which tend to become the single point of failure. By the way, any single point of failure in an active-active configuration (regardless of the distance between the sites) is nothing short of a severe architectural design failure (my opinion, of course).
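For a concrete picture of how diverse writes arise in practice, here is a sketch of a round-robin multipath policy. This is hypothetical simplified logic (real policies, such as VMware's Round Robin path selection, typically rotate after a configurable number of I/Os, not after every one), but it shows why each host naturally spreads writes across both controllers.

```python
# Sketch of a round-robin path selection policy (hypothetical).
from itertools import cycle

paths = cycle(["controller-A", "controller-B"])

# Each I/O is issued down the next path in rotation, so consecutive
# writes to the same volume land on different controllers: exactly
# the "diverse operations" condition described above.
issued = [next(paths) for _ in range(4)]
print(issued)  # ['controller-A', 'controller-B', 'controller-A', 'controller-B']
```

Pinning all I/O to one preferred controller per volume removes the diversity (condition #4) at the cost of leaving half the resources idle.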
#5: Controller Partitioning
Before we dive into various scenarios, one of which is controller partitioning, it is important to note that the likely reason you deployed active-active mirrors in the first place was to ensure data survivability when faced with the worst case scenario.
If your deployment model has fully redundant and fully diverse communication channels between the sites, then the risk of experiencing a partition event is very low. In this case you can choose to run full cluster mode (where all hosts are a part of the same cluster across both sites) or split cluster mode (where each site is a separate cluster) based on your requirements.
If however your deployment model is not fully redundant and not fully diverse, running in full cluster mode would not be advisable, but there is still hope. You can run in split cluster mode whereby a cluster exists at each site. This model allows for all host resources to be active at both sites while maintaining full site-level data redundancy.
Other methods exist, such as using a witness node to attempt to settle which side has control, but of course this isn’t without its own issues; it also requires additional connectivity (which you could have used to add redundancy in the first place), complexity, and cost. The risk of partitioning in a fully redundant and diverse architecture is certainly lower than that of a non-redundant, non-diverse architecture with a witness. In fact, it is much better to architect correctly from the start than to add more moving parts and complexity to an architecture that started out with a major deficiency.
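The witness idea reduces to a simple majority vote. The sketch below shows the hypothetical tie-break logic (not any specific product's algorithm): on a partition, each side counts the votes it can still reach, and only the side holding a majority keeps serving the volume, while the minority side suspends I/O rather than diverging.

```python
# Sketch of witness-based quorum logic (hypothetical illustration).
def survives(reachable_votes, total_votes=3):
    """A side keeps serving only if it can reach a strict majority.

    total_votes = site A + site B + witness node.
    """
    return reachable_votes > total_votes // 2

# Partition scenario: site A can still reach the witness (2 of 3 votes),
# while site B is fully isolated (1 of 3 votes).
print(survives(2))   # True  -> site A continues serving Vol1
print(survives(1))   # False -> site B suspends I/O instead of diverging
```

Note the trade-off the paragraph above describes: the witness prevents divergence only as long as its own links survive, which is why it adds connectivity requirements rather than removing them.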
The major takeaway from this is that if you are going to spend the time and money to deploy an active-active solution across sites, then it would also be reasonable to ensure there is diverse and redundant communication between the sites. If you cannot guarantee this, then it would be wise to use a split-cluster deployment model. If you have deployed redundant and diverse communication channels, then it is up to you whether or not to use a full-cluster or split-cluster model.
Remember, it is the application that has the brain, not the storage, and therefore it is the application that must be dealt with to prevent this condition in the end.