A lesson from ma.gnolia

February 7, 2009 at 5:03 pm | Posted in cloud computing, Software as a Service, Software Startup | 1 Comment

As the veil of silence surrounds the catastrophic data loss at ma.gnolia, it gives us all time to think about our approach to protecting our users and customers. In the Internet arena, there are dozens of moving parts that no single company has complete control over. There are hosting providers with power sources, data stores, network routers and firewalls, there are backbone providers, there are big switches in the sky, there are pipes under the sea, there are gremlins on treadmills... well, you get the idea; the list goes on and on. And as if there weren't enough variables already, we throw in cloud computing with its instant-on, instant-off dual personality.

With all of these moving parts and potential for disaster, it is a tremendous feat that more catastrophic data loss doesn't occur. Maybe it's luck, or maybe many companies are doing it right. This failure at ma.gnolia is a reminder that, as providers of Internet-based services, it is our burden to minimize the impact of failures or we shouldn't be in business.

Since I am a proponent of cloud computing, I want to focus on backups in the cloud for this article. The cloud is a tricky monster. It is ephemeral and unrelenting when it has a hiccup. But the power of cloud computing is too tempting not to leverage it. The fact that we can launch server after server for a dime an hour is downright amazing. Small companies like mioworks.com don't have to raise a million dollars just to get started. We can create an account at Amazon Web Services and within an hour have a full data center up and operational for next to nothing. But when you choose to use the cloud for your solution, you must be diligent in the way that you handle your applications and your data. Simple backups are not enough anymore, and testing must be done on a frequent basis.

I have to admit that I have come close to the brink of disaster with cloud computing, but through a stroke of luck was able to recover. The scenario that bit us started out with outages at Amazon Web Services. Due to automated shutdowns of servers, we lost our database. Poof, it was gone in an instant. All the alerts and alarms went off and our recovery procedures kicked into action. We all thought it would be ten minutes until the blip was over and operations would be back to normal. Well, that wasn't the case. During the restore we found out that one of our system administrators had made a few changes to "improve" things. According to the change logs, he successfully completed the backup solution, and it was tested and working properly. But when the catastrophic failure occurred, the primary backup system didn't restore the database image as planned. Instead we had to take a different route and replay every single transaction that had occurred against our production database since it was created.
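
If you've never had to do a replay like that, here is a rough sketch of the idea, purely for illustration. It assumes a made-up transactions.log file containing one SQL statement per line and uses Python's built-in sqlite3 module; our actual stack was different, so treat this as the pattern rather than our script.

```python
import sqlite3

# Hypothetical example: replay an append-only log of SQL statements
# into a fresh database. Assumes "transactions.log" holds one complete
# SQL statement per line, in the order they were originally executed.
def replay_transaction_log(log_path: str, db_path: str) -> int:
    conn = sqlite3.connect(db_path)
    replayed = 0
    try:
        with open(log_path, "r", encoding="utf-8") as log:
            for line in log:
                statement = line.strip()
                if not statement:
                    continue
                conn.execute(statement)
                replayed += 1
        conn.commit()
    finally:
        conn.close()
    return replayed

if __name__ == "__main__":
    count = replay_transaction_log("transactions.log", "restored.db")
    print(f"Replayed {count} statements into restored.db")
```

Replaying from the very first transaction works, but it is painfully slow, which is exactly why it should be your last resort rather than your primary plan.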

We were lucky that we had this data squirreled away in a cloud data store as a backup to our backup. Yes, it cost a hundred bucks a month to maintain this archive, but it was well worth it. The end result was a return to normal service for our customers, but unfortunately for some of them it was only after a three-day outage. Ouch.
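
The "backup to your backup" piece is simple in concept: ship a copy of the transaction log to a completely separate cloud data store on a schedule. Here's a small sketch of that pattern using the boto3 library and a made-up bucket name; back then we were on Amazon's earlier tooling, so this is illustrative only.

```python
import datetime
import boto3

# Hypothetical illustration: copy the local transaction log into a
# separate cloud data store (S3 here) so the archive survives the
# loss of the primary database *and* the primary backup system.
# The bucket name and file path are made up for this sketch.
BUCKET = "example-secondary-archive"

def archive_transaction_log(log_path: str = "transactions.log") -> str:
    s3 = boto3.client("s3")
    # Key by date so every day's log is kept as its own object.
    key = f"tx-logs/{datetime.date.today().isoformat()}/transactions.log"
    s3.upload_file(log_path, BUCKET, key)
    return key

if __name__ == "__main__":
    print("Archived to", archive_transaction_log())
```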

My error as the leader of the brigade was that I didn't demand a full-scale test of the disaster recovery plan on an ongoing basis. As I ponder the entire episode, it's one of those moments when you say to yourself, "you know better." Yet it still went untested and unproven until it was exercised and failed.

So here is my advice for everyone who uses cloud computing. Don't go running scared and return to buying servers and disc arrays. Spend a few extra weeks and a few extra dollars on disaster recovery planning, implementation and testing. Implement competing backup solutions so that you have a backup to your backup. Go ahead and get fancy with your up-to-the-minute, on-the-fly backup scheme, but also implement the daily/hourly workhorse backup. Make sure that your team has the backup and the restore fully figured out, and make them prove it to you. Don't take their word for it. I'm SERIOUS here. I know we all like to trust our techies, but sometimes they do make mistakes.
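
As a concrete example of the workhorse backup, here is a small sketch you could hang off cron: dump the database and push the dump to object storage under a timestamped key. It assumes PostgreSQL's pg_dump and the boto3 library, and the bucket and database names are made up; substitute whatever your stack actually runs.

```python
import datetime
import subprocess
import boto3

# Hypothetical workhorse backup: dump the database on a schedule
# (cron or a systemd timer) and upload the dump with a timestamped
# key, so there is always a recent full image to restore from.
# Bucket, database name, and paths are made up for this sketch.
BUCKET = "example-db-backups"
DB_NAME = "production"

def run_backup() -> str:
    stamp = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%SZ")
    dump_path = f"/tmp/{DB_NAME}-{stamp}.dump"

    # Full custom-format dump; fails loudly if pg_dump errors out.
    subprocess.run(
        ["pg_dump", "--format=custom", "--file", dump_path, DB_NAME],
        check=True,
    )

    key = f"daily/{DB_NAME}-{stamp}.dump"
    boto3.client("s3").upload_file(dump_path, BUCKET, key)
    return key

if __name__ == "__main__":
    print("Uploaded backup as", run_backup())
```

The point isn't the specific tools; it's that the boring, scheduled full dump exists alongside whatever fancy scheme you build on top of it.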

So, physically terminate the servers out from under them and watch them go to work. Or, the friendlier approach is to have them launch a new set of servers in the cloud and replicate the environment from backups. Have them do this on a bi-weekly basis. They will get good at recovering the system, and in the event it happens for real it won't be a scramble. With this extra workload the team may be unhappy at first, but it's much better than suffering a fate like ma.gnolia's.
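
And here is what the drill itself can look like, sketched in the same spirit: run it on a freshly launched instance, pull the newest dump, restore it into a scratch database, and prove the data is really there. The bucket, database names, and the sanity-check table are all made up for this example.

```python
import subprocess
import boto3

# Hypothetical restore drill: pull the newest dump from the backup
# bucket, restore it into a scratch database, and run a basic sanity
# check. Run it on a freshly launched server so the drill also proves
# the environment can be rebuilt from nothing.
# Bucket, database names, and the sanity-check table are made up.
BUCKET = "example-db-backups"
SCRATCH_DB = "restore_drill"

def newest_backup_key() -> str:
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix="daily/")
    objects = resp.get("Contents", [])
    if not objects:
        raise RuntimeError("No backups found; the drill just failed.")
    return max(objects, key=lambda obj: obj["LastModified"])["Key"]

def run_drill() -> None:
    key = newest_backup_key()
    dump_path = "/tmp/drill.dump"
    boto3.client("s3").download_file(BUCKET, key, dump_path)

    # Recreate the scratch database and restore into it.
    subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
    subprocess.run(["createdb", SCRATCH_DB], check=True)
    subprocess.run(["pg_restore", "--dbname", SCRATCH_DB, dump_path], check=True)

    # Sanity check: the restored data should actually contain rows.
    result = subprocess.run(
        ["psql", "--dbname", SCRATCH_DB, "--tuples-only",
         "--command", "SELECT count(*) FROM customers;"],
        check=True, capture_output=True, text=True,
    )
    print(f"Restored {key}; customers table has {result.stdout.strip()} rows")

if __name__ == "__main__":
    run_drill()
```

If this script (or your equivalent of it) can't finish cleanly every couple of weeks, you've just learned about your failure on your own schedule instead of your customers'.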

1 Comment »


  1. I’d also hate the lesson from Magnolia to be anti-cloud. I think the cloud will help people not only have backups, but have backups in multiple places. That means we’re doing collectively better with backups now, not worse.

    You’re right that the most important part is recovering a backup. It’s like the old Seinfeld line about rental car reservations. It’s easy to take it. The hard part is making sure it does what it’s supposed to do.

