I don’t want to be overly critical of what happened here. I certainly understand that ‘stuff happens’ (you may have seen this expressed another way), but this is a perfect example of a gross single point of failure. I believe it goes without saying, but I will say it here: power is everything. It doesn’t do much good in a data center environment to have redundant storage systems, redundant switches, redundant power supplies, redundant servers, redundant upstream providers, etc., if there is only a single power source for the entire facility. Where is the due diligence here?
I did find the following statement somewhat humorous and disappointing at the same time:
“Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators.”
Really? I don’t know all the details involved here since I am not intimately familiar with the power systems at this particular facility, but generators do not seamlessly restore power once utility power is lost. Most multi-megawatt generators take 20 to 40 seconds to restore full power, and that assumes the transfer switch is automatic. The only way this could happen ‘seamlessly’, in theory, is if the generators were already running and the transfer switch was already in bypass mode (which the article doesn’t state and is not likely anyway). I also think Amazon should be greatly concerned if this provider (assuming that Amazon doesn’t own the facility) is supplying utility power directly to the systems in the data center without some sort of intermediate system (such as an enterprise UPS system).
Not only is it bad practice to feed very sensitive electronic equipment directly from utility power (which has a tendency to fluctuate radically), it also sets the tenants in this facility up for disappointment when the power is lost. The better way to provide power to these systems is to have an enterprise UPS inline with utility power, with an automatic transfer switch to generator power in the event that utility power is lost. In this case, all systems are technically running off of UPS power by default, which provides clean and interruption-free power. Since servers will typically not survive even a second of power loss, the UPS provides the power buffer necessary to keep the systems running while the generators come up to full capacity.
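To make the timing argument concrete, here is a minimal sketch of the gap the UPS has to cover. It is a deliberately simplified model, not a description of this facility: utility power fails at t = 0, the generator is assumed to reach full capacity after a fixed delay, and the UPS is assumed to hold the load for a fixed runtime. The function names and the 300-second UPS runtime are hypothetical, chosen only for illustration; the 40-second generator delay is the upper end of the range mentioned above.

```python
def power_source_at(t_s, generator_start_s, ups_runtime_s):
    """Return which source carries the load t_s seconds after utility power fails.

    Simplified model (assumptions, not facility data):
      - utility power fails at t = 0
      - the generator is at full capacity after generator_start_s seconds
      - the UPS can carry the load for ups_runtime_s seconds (0 means no UPS)
    """
    if t_s < 0:
        return "utility"
    if t_s >= generator_start_s:
        # Generator is up and the automatic transfer switch has moved the load.
        return "generator"
    if t_s < ups_runtime_s:
        # The UPS bridges the gap while the generator spins up.
        return "ups"
    return "none"


def outage_seconds(generator_start_s, ups_runtime_s):
    """Seconds during which nothing is powering the load."""
    return max(0.0, generator_start_s - ups_runtime_s)


if __name__ == "__main__":
    generator_start_s = 40   # upper end of the 20 to 40 second range
    hypothetical_ups_s = 300  # illustrative UPS runtime, an assumption

    for label, ups_s in (("no UPS", 0), ("with UPS", hypothetical_ups_s)):
        gap = outage_seconds(generator_start_s, ups_s)
        source = power_source_at(1, generator_start_s, ups_s)
        print(f"{label}: source at t=1s is {source!r}, unpowered gap {gap:.0f}s")
```

Under these assumptions, the case without a UPS leaves the load unpowered for the entire generator start-up window, which is far longer than the second or so a server can tolerate. Any UPS runtime longer than the generator start-up delay closes that gap entirely.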
No system is perfect or truly provides 100% guarantees against any possible scenario, but this appears to be a major deficiency in a critical area that I would have expected Amazon to discover during due diligence (again, assuming that Amazon doesn’t own the facility). It appears from the article that services could be down for up to 48 hours.