One of the perks of working for a small non-profit is that we can simply decide to do things that large companies never would. This week, Sage Bionetworks is completely closed, with a week-long summer vacation as our bonus for hitting last year’s company-wide objectives. Of course, leave it to Mr. Murphy to intrude on what seemed like a great morale-booster.
As our week off approached and people started heading out to begin their vacations early, we got a first warning that shutting down completely is probably not an option if you’re operating an online service, even an early-stage beta currently supporting a handful of users. The week before our break we had a brief service outage caused by an expired security certificate on the Crowd server backing our authentication services. Although we corrected the problem quickly, the incident convinced us that pushing out a new release with only minor bug fixes right before the break wasn’t worth the risk. We decided to leave everything alone and resume work once everyone was back in town.
Unfortunately, that first incident was only a precursor to the main event. Friday night, Amazon suffered a power outage at their Virginia data center, which also took down a number of much higher-profile sites like Netflix, Heroku, and Instagram. Saturday morning I woke to the realization that Synapse was completely non-functional. It took the better part of the day for us to trace the problem to our AWS RDS service being down, and to recover the live system from a backup taken shortly before the event.
So, does this incident call into question the decision to use cloud services, and Amazon in particular, as the foundation of our own service? What’s the alternative? In my previous job my team built a service on a stack of servers we selected ourselves and installed in a local colo facility. Several months in, after a couple of power outages had taken us down, we determined that this facility did not in fact have redundant power wired to our cage as they had promised when we moved in. Instead, they had plugged us straight into Seattle City Light. Stung by this, we found a new, more reputable hosting partner, located in a quality facility that also served the local ABC news affiliate. They had a big bank of UPS devices, multiple redundant connections to the power grid, and the ability to run the site indefinitely off backup generators in an emergency. Still, at that facility we had an outage caused by an electrical fire that cut power to our servers for a couple of days. At Sage, I could buy physical servers and put them in the data center right next to my office at the Hutch. That might give me a warm, fuzzy feeling of safety, but I think that’s mostly the psychological tendency to imagine that bad things only happen to other people. I doubt those servers would really be any safer than rented ones sitting in an AWS facility in Virginia.
So, I think the lesson here is not that cloud services are unreliable and you should build everything yourself. Rather, it’s a reminder not to believe the cloud marketing hype too strongly, and to remember that cloud services are a leaky abstraction over the fact that you are programming someone else’s data center. Most of the time that abstraction is a useful simplification which allows you to work more efficiently, but occasionally, like last Saturday, the simplification breaks down. However, what I took away from responding to the crisis wasn’t a desire to move elsewhere; it was a set of improvements to the way we configure and maintain our service that could greatly improve our ability to respond to events like this in the future. There’s no magic here, just a set of engineering tasks that we need to triage against other work, like adding new features or improving documentation, to provide the best experience for our users. For example:
- When building a service on top of a set of AWS services, it’s best to wire components together through CNAMEs instead of directly to the public name of each service. We had done this with our app and web tiers running on Elastic Beanstalk, but not with RDS and our CloudSearch instances. Consequently, we had to change configuration on the production server instead of just swapping in new services, slowing our response and increasing risk.
- We need better monitoring of our services. Both more extensive use of CloudWatch metrics on key components and an external periodic smoke test of the application would have surfaced the problem more quickly.
- We need better automation for provisioning new environments. Some Synapse components require a manual setup process, documented on a wiki. Like most wikis we try to keep ours up to date, but things occasionally get missed. An automated build script checked into version control is self-documenting, and it stays accurate as long as you continuously use it to create new environments.
- We should think hard about the number of distinct components comprising our service. Some, like the Crowd server, offer only marginal value at the cost of making our overall environment more complex to administer and troubleshoot. We need to continuously watch our environment and make sure we are engineering for reliability and ease of maintenance as we expand functionality.
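To make the first point concrete, here is a minimal sketch of CNAME indirection using the Route 53 API via boto3. Every name in it is a hypothetical example: the zone ID, the internal record name, and the RDS endpoint are not our real values.

```python
# Sketch: point an internal CNAME at an AWS-managed endpoint so that
# swapping in a replacement service is a DNS change, not an application
# config change. All names, zone IDs, and endpoints are hypothetical.

def cname_change_batch(record_name, target, ttl=300):
    """Build a Route 53 change batch that UPSERTs a CNAME record."""
    return {
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": record_name,
                "Type": "CNAME",
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }]
    }

# App servers connect to db.internal.example.org; after restoring RDS
# from a backup, repointing the CNAME swaps in the new instance without
# touching production configuration:
batch = cname_change_batch(
    "db.internal.example.org.",
    "restored-db.abc123.us-east-1.rds.amazonaws.com",
)

# With boto3 the batch would be submitted roughly like:
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="Z_EXAMPLE", ChangeBatch=batch)
```

A short TTL on the record keeps the window during which clients cache the old endpoint small, at the cost of a few more DNS lookups.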
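The external smoke test from the monitoring point can be very small. This is a sketch, not our actual check; the URL is a placeholder, and the HTTP opener is injectable so the logic can be exercised without a live endpoint.

```python
import urllib.request

# Sketch of an external periodic smoke test: hit the service from
# outside the data center and report pass/fail. The URL below is a
# hypothetical example, not a real Synapse endpoint.

def smoke_test(url, expected_status=200, timeout=10,
               opener=urllib.request.urlopen):
    """Return True if the service answers with the expected status.

    `opener` is injectable so the check can be tested with a stub;
    any exception (DNS failure, timeout, refused connection) counts
    as a failed check rather than crashing the monitor.
    """
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status == expected_status
    except Exception:
        return False

# Example: smoke_test("https://synapse.example.org/health")
```

Run from cron on a host outside AWS and wired to an alert on failure, even a check this crude would have told us Saturday morning, not Saturday afternoon, that something was wrong.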
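And for the provisioning point, the idea is that the environment definition lives in version control as data, with a small driver turning it into AWS API calls. The stack description, names, and instance sizes below are all made up for illustration.

```python
# Sketch of scripted (rather than wiki-documented) provisioning: the
# environment description is checked-in data, and a driver translates
# it into API calls. Everything here is a hypothetical example.

STACK = {
    "db": {
        "engine": "mysql",
        "instance_class": "db.m1.large",
        "multi_az": True,
    },
}

def rds_create_params(env_name, db_cfg):
    """Translate the checked-in description into the keyword arguments
    a boto3 create_db_instance call takes (a subset, for brevity)."""
    return {
        "DBInstanceIdentifier": "%s-db" % env_name,
        "Engine": db_cfg["engine"],
        "DBInstanceClass": db_cfg["instance_class"],
        "MultiAZ": db_cfg["multi_az"],
    }

params = rds_create_params("staging", STACK["db"])
# boto3.client("rds").create_db_instance(AllocatedStorage=100,
#     MasterUsername=..., MasterUserPassword=..., **params)
# would then build the instance.
```

Because rerunning the script against a fresh account rebuilds the same environment, the script itself is the documentation, and unlike a wiki page it can’t silently drift out of date without breaking.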
However, in the end, I still think the flexibility and development speed we get from Amazon’s building blocks are worth the cost. More days like last Saturday could change my mind about Amazon, but then I’d be shopping for a more reliable vendor, not trying to bring more low-level IT maintenance in house.