In February 2017, Amazon Web Services (AWS) experienced a catastrophic four-hour outage. Companies such as Netflix, Salesforce, Adobe and even Amazon itself were significantly affected, and firms that rely on AWS services lost an estimated $310 million (£249 million).
So what was the cause of the outage? Not a hardware failure, a major power interruption or a freak malfunction, but simply an accidental typo in a routine operation that led to the unintentional shutdown of “a larger set of servers … than intended”.
Investigations have shown that the number one cause of downtime is human error. According to the Uptime Institute, 70% of the problems that data centres experience are the result of human error and poor management rather than hardware failure, natural disasters or cybercrime.
(Read Amazon’s apology, explanation and reassurance regarding corrective action at https://aws.amazon.com/message/41926/)
At DSM we recognise the importance of working to documented procedures which are maintained on an ongoing basis as part of the company culture. No work on any systems is undertaken without taking appropriate backups and performing risk assessments to ensure everything can be rolled back in the event of an unexpected outcome. Change control logs are updated so that, if there is some unacceptable impact that is not immediately recognised, it can be quickly identified and rectified.
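As a rough illustration of the sequence described above (back up, make the change, check the result, roll back if needed, and record the outcome), a minimal Python sketch might look like the following. The function name `apply_change`, the `servers.conf` file and the capacity threshold are all hypothetical, chosen only for the example; this is not DSM's actual tooling.

```python
import shutil
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def apply_change(target: Path, change, check, log: list, description: str) -> bool:
    """Back up `target`, apply `change`, and roll back if `check` fails.

    Hypothetical helper for illustration only.
    """
    backup = target.with_name(target.name + ".bak")
    shutil.copy2(target, backup)          # take a backup before touching anything
    try:
        change(target)                    # the routine operation itself
        ok = bool(check(target))          # is the outcome acceptable?
    except Exception:
        ok = False                        # any failure counts as unacceptable
    if not ok:
        shutil.copy2(backup, target)      # restore the known-good state
    log.append({                          # change control log entry
        "time": datetime.now(timezone.utc).isoformat(),
        "target": str(target),
        "description": description,
        "outcome": "applied" if ok else "rolled back",
    })
    return ok


# Example: a typo sets capacity to 0 instead of 9; the check rejects it
# and the original file is restored.
with tempfile.TemporaryDirectory() as tmp:
    config = Path(tmp) / "servers.conf"   # hypothetical config file
    config.write_text("capacity=10\n")
    change_log = []
    apply_change(
        config,
        change=lambda p: p.write_text("capacity=0\n"),
        check=lambda p: int(p.read_text().split("=")[1]) >= 5,
        log=change_log,
        description="reduce capacity by one server",
    )
```

The check is deliberately supplied by the caller: deciding what counts as an unacceptable outcome is part of the risk assessment, not something the tool can guess.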