Clouds, lightning and media thunder

June 12, 2009 at 4:11 pm | Posted in Software Startup | 2 Comments

As reported by CNET News ("Lightning took down Amazon Cloud") and confirmed by Amazon, there was a small outage in the Amazon EC2 infrastructure. The media decided this was a great reason to run another hype story under a genuinely misleading headline. The Amazon Cloud (which is a rather large infrastructure) did not go down. The truth of the matter is that a small number of servers in a single location were impacted by a lightning strike that disabled a power distribution unit.

From this media thunder I now have the opportunity to evangelize cloud computing. Let's start with the premise that you have a web application running on an Amazon EC2 instance and that you use Amazon EBS to store your database. Finally, let's suppose that your instance was on one of those ill-fated servers. So how should this all play out? Let's take a look.

Since you have external monitoring of your application, you are notified immediately when the lightning strike takes out your Amazon EC2 instance. At that moment you spring into action. You first check the AWS health status page to find out what is going on and see that the problem is isolated to a single availability zone. So now you know you need to bring up a server elsewhere.

So you issue the command to Amazon for a new instance in another availability zone, launched from your pre-configured Amazon Machine Image (AMI), which is basically an exact copy of the server that went down. Within five or so minutes the new server is alive and well, fully configured and running your application. You run a pre-configured script that makes a few adjustments around security groups and data store connections. You then re-associate your Amazon Elastic IP with the new server and voila, you are back in business.
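For readers who like to see the moving parts, here is a minimal sketch of those failover steps using boto3, the current AWS SDK for Python. Treat it as an illustration rather than the exact commands we run; the AMI ID, availability zone, security group and Elastic IP are placeholders you would replace with your own values.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# 1. Launch a fresh instance from the pre-configured machine image
#    in a healthy availability zone.
response = ec2.run_instances(
    ImageId="ami-12345678",                        # your pre-built AMI (placeholder)
    InstanceType="m1.small",
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "us-east-1b"},  # a zone other than the failed one
    SecurityGroupIds=["sg-0abc1234"],              # placeholder security group
)
instance_id = response["Instances"][0]["InstanceId"]

# 2. Wait until the new instance is up and running.
ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])

# 3. Re-point the Elastic IP at the replacement instance so traffic follows it
#    without any DNS change (for a VPC address you would pass AllocationId
#    instead of PublicIp).
ec2.associate_address(InstanceId=instance_id, PublicIp="203.0.113.10")  # placeholder EIP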

Compare that to the same outage in a data center that isn't using virtualization. When a server goes down, a major process ensues. The data center team has to physically locate the server and perform diagnostics (hours?). If they can't fix the problem, they either assign a new server (if one is available) or install a new one (days?). Once the server is in place, the software must be installed (hours?), and then finally a new IP address is assigned and DNS changes must propagate to the world to make the server accessible (hours?).

Now each data center may have different procedures, but you get my point. When you have a virtual infrastructure and you actually plan your architecture properly, hiccups are just part of everyday business. With the Amazon cloud, applications like mioworks.com can rely on the overall strength of the data center and compete on a level playing field – or maybe they even have an advantage because of the cloud's flexibility.

Monitoring applications in the cloud

March 10, 2009 at 3:22 am | Posted in cloud computing, Software as a Service, Software Startup | 3 Comments

Cloud computing is a rather powerful tool that allows even the smallest of businesses to provide an enterprise-class environment for web applications. In a nutshell, the cloud is nothing more than the ability to rent computing services on demand from a third-party provider. At MioWorks.com we use Amazon Web Services, but there are several other services out there for you to explore.

Mastering the cloud takes a bit of work, a dash of experience and an openness to learn from others.  But once you do master it, the benefits are tremendous.  You’ll never have to order another server or rent a rack in a data center.  You’ll be able to fluidly control your environment by increasing and decreasing the services you need on the fly, saving time and money.

This power, flexibility and potential demand that you pay attention to the details. You must anticipate that the cloud can have hiccups and that, as quickly as a server comes to life, that server can disappear. In previous blog posts I've already talked about the importance of backups and recovery drills, but let's take a step back. Today let's talk about monitoring and how important it is to your survival.

OK, I'll bite: why is monitoring so important?

Let me sum this up in a single sentence: monitoring can be the difference between "whew, that was close" and "holy s$%t, we are down". I lied – I need another sentence… Monitoring can also be the difference between a five-minute outage and a five-hour outage.

What to monitor

Every web-based application environment in the cloud is a jigsaw puzzle. At the core you have your virtual hardware, followed by your operating system. Each of your servers is then configured differently depending on its specific duty: you may have application servers, web servers, search servers, database servers, and the list goes on. Each of these servers needs to be monitored from several points of view – both internally and externally.

Internal Monitoring

The big question isn't "Is the server running?" – it should be "Are the server and all of its pieces running correctly?" Each virtual server in your setup is a maze of processes, files, directories and file systems. At any given time a hiccup can occur within this delicate environment that will eventually disrupt the end user's ability to use your service. In our environment we use Monit and Munin (two open-source tools) on the inside to provide us with critical monitoring, recovery and trending capabilities.

Monit provides system monitoring and error recovery for our Unix systems. In our environment at MioWorks.com we have configured Monit to watch dozens of potential failure points. Monit can start a process if it is not running and can kill or restart a process if it consumes too many resources. Monit can also be configured as a lightweight intrusion detection system by watching for changes in files, directories and file systems. By spending a little time learning and using Monit, your system administrator gains a great tool to keep a constant eye on all the pieces of the puzzle.
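Monit's rules live in its own configuration file, so to keep the examples in this post in one language, here is only a rough Python sketch of the kind of check-and-restart rule it automates for you. The pid file, start command and check interval are placeholders, and a real deployment would simply express this as a few lines of monit configuration.

import os
import subprocess
import time

PIDFILE = "/var/run/myapp.pid"               # placeholder pid file
START_CMD = ["/etc/init.d/myapp", "start"]   # placeholder start command


def process_alive(pidfile):
    """Return True if the pid recorded in `pidfile` refers to a running process."""
    try:
        pid = int(open(pidfile).read().strip())
        os.kill(pid, 0)                      # signal 0 only tests for existence
        return True
    except (OSError, ValueError):
        return False


while True:
    if not process_alive(PIDFILE):
        # Roughly equivalent to monit's "if does not exist then start" rule.
        subprocess.run(START_CMD, check=False)
    time.sleep(60)                           # check once a minute, like monit's cycle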

In addition to the direct monitoring and error recovery system, we also like to see the bigger picture. We use Munin to aggregate information across our server pool. Munin provides a graphical view that allows your team to quickly see what's different from yesterday. You can quickly determine your resource utilization and plan any increase in capacity in advance.
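Munin gathers these numbers through its own plugins and graphs them for you; purely as an illustration of what gets sampled on each node, here is a small Python sketch that records load average and disk usage to a trend file. The output path is a placeholder.

import csv
import os
import shutil
import time

SAMPLE_FILE = "/var/log/metrics/trend.csv"   # placeholder trend file

load1, load5, load15 = os.getloadavg()       # 1/5/15-minute load averages
disk = shutil.disk_usage("/")                # total/used/free bytes on the root volume

with open(SAMPLE_FILE, "a", newline="") as f:
    csv.writer(f).writerow([
        int(time.time()),
        round(load1, 2),
        round(disk.used / disk.total, 3),    # fraction of disk in use
    ])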

From the outside

Keeping track of all the pieces inside the cloud is very important, but you also need to know how your environment is performing from the perspective of the outside world. There are more external monitoring services out there than I can count, but I'll tell you who we use. Our favorite at the moment is monitis.com. We like them because, starting at just $10/month, you get on-demand fault and performance monitoring for your environment. This external watchdog keeps everyone informed if and when the cloud is having issues. It also provides us with important statistics on response time and application performance that we use to decide how to adjust our infrastructure.
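A hosted service does the probing from many locations and handles the alerting for you, but conceptually each check is simple. Here is a minimal sketch of such an external probe: fetch the public URL from outside your network, record the response time, and flag failures or slowness. The URL and thresholds are placeholders.

import time
import urllib.request

URL = "https://www.example.com/health"   # placeholder public endpoint
TIMEOUT = 10                             # seconds before the request is abandoned
SLOW_THRESHOLD = 2.0                     # seconds before we call the response degraded

start = time.time()
try:
    with urllib.request.urlopen(URL, timeout=TIMEOUT) as resp:
        elapsed = time.time() - start
        if resp.status != 200 or elapsed > SLOW_THRESHOLD:
            print(f"DEGRADED: status={resp.status} time={elapsed:.2f}s")
        else:
            print(f"OK: {elapsed:.2f}s")
except Exception as exc:
    print(f"DOWN: {exc}")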

Continuous improvement

Your monitoring program must become a living, breathing element of your systems administration. As new problems arise or potential problems are identified, the monitoring system must be adjusted to stay proactive. The good news is that the more you tune your monitoring and error recovery system, the less you'll be surprised in the future. It takes discipline to post-mortem each problem and determine how to detect it proactively the next time, and that discipline will distinguish your application in the frenzy of the cloud.

Real world results of a good monitoring program

In the real world your monitoring system can be the difference between keeping your systems alive and thriving or having unhappy customers and missed SLAs. It can help you pinpoint exactly what went wrong and reduce the time it takes for the first responders to identify and solve the issue. There are lots of solutions in the marketplace, including commercial and open-source alternatives. It may seem overwhelming at first, but once you start the process and improve little by little, you'll be amazed at the positive impact your monitoring program will have on your environment's stability and your ability to get some sleep.

A lesson from ma.gnolia

February 7, 2009 at 5:03 pm | Posted in cloud computing, Software as a Service, Software Startup | 1 Comment

As a veil of silence surrounds the catastrophic data loss at ma.gnolia, it gives us all time to think about our approach to protecting our users and customers. In the Internet arena there are dozens of moving parts that no single company has complete control over. There are hosting providers with power sources, data stores, network routers and firewalls; there are backbone providers; there are big switches in the sky; there are pipes under the sea; there are gremlins on treadmills – well, you get the idea, the list goes on and on. And as if there weren't enough variables already, we throw in cloud computing with its instant-on, instant-off dual personality.

With all of these moving parts and potential for disaster, it is a tremendous feat that more catastrophic data loss doesn’t occur.  Maybe it’s luck or maybe many companies are doing it right.  This failure at ma.gnolia is a reminder that as providers of Internet based services it is our burden to minimize the impact of failures or we shouldn’t be in business.

Since I am a proponent of cloud computing, I want to focus on backups in the cloud for this article. The cloud is a tricky monster: it is ephemeral and unrelenting when it has a hiccup. But the power of cloud computing is too tempting not to leverage. The fact that we can launch server after server for a dime an hour is downright amazing. Small companies like mioworks.com don't have to raise a million dollars just to get started. We can create an account at Amazon Web Services and within an hour have a full data center up and operational for next to nothing. But when you choose to use the cloud for your solution, you must be diligent in the way you handle your applications and your data. Simple backups are not enough anymore, and testing must be done on a frequent basis.

I have to admit that I have come close to the brink of disaster with cloud computing, but through a stroke of luck I was able to recover. The scenario that bit us started with outages at Amazon's web services. Due to automated shutdowns of servers, we lost our database. Poof, it was gone in an instant. All the alerts and alarms went off and our recovery procedures kicked into action. We all thought it would be ten minutes until the blip was over and operations would be back to normal. Well, that wasn't the case. During the restore we found out that one of our system administrators had made a few changes to "improve" things. According to the change logs, he had successfully completed the backup solution, and it was tested and working properly. But when the catastrophic failure occurred, the primary backup system didn't restore the database image as planned. Instead we had to take a different route and replay every single transaction that had occurred against our production database since it was created.

We were lucky that we had this data squirreled away in a cloud data store as a backup to our backup. Yes, it cost a hundred bucks a month to maintain this archive, but it was well worth it. The end result was a return to normal service for our customers, but unfortunately for some of them it came only after a three-day outage. Ouch.

My error as the leader of the brigade was that I didn't demand a full-scale test of the disaster recovery plan on an ongoing basis. As I ponder the entire episode, it's one of those moments when you say to yourself, "you know better." Yet it still went untested and unproven until it was exercised and failed.

So here is my advice for everyone who uses cloud computing. Don't go running scared and return to buying servers and disk arrays. Spend a few extra weeks and a few extra dollars on disaster recovery planning, implementation and testing. Implement competing backup solutions so that you have a backup to your backup. Go ahead and get fancy with your up-to-the-minute, on-the-fly backup scheme, but also implement the daily/hourly workhorse backup. Make sure that your team has both the backup and the restore fully figured out, and make them prove it to you. Don't take their word for it. I'm SERIOUS here. I know we all like to trust our techies, but sometimes they do make mistakes.
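To make the "workhorse backup" idea concrete, here is a hedged Python sketch of a daily dump pushed to S3 so a copy lives somewhere other than the server that could disappear. The database name, bucket and paths are placeholders, and the dump command assumes MySQL – swap in pg_dump or your own tool as appropriate.

import datetime
import subprocess

import boto3

DB_NAME = "production"                       # placeholder database name
BUCKET = "my-backup-bucket"                  # placeholder S3 bucket
stamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M")
dump_path = f"/tmp/{DB_NAME}-{stamp}.sql.gz"

# 1. Dump and compress the database.
with open(dump_path, "wb") as out:
    dump = subprocess.Popen(["mysqldump", DB_NAME], stdout=subprocess.PIPE)
    subprocess.run(["gzip", "-c"], stdin=dump.stdout, stdout=out, check=True)
    dump.wait()

# 2. Ship the dump off-server.
boto3.client("s3").upload_file(dump_path, BUCKET, f"db/{DB_NAME}-{stamp}.sql.gz")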

So, physically terminate the servers on them and watch them go to work. Or the friendlier approach is to have them launch a new set of servers in the cloud and replicate the environment from backups.  Have them do this on a bi-weekly basis.  They will get good at recovering the system and in the event it happens for real it won’t be a scramble.  With this extra workload the team may be unhappy at first, but it’s much better than suffering a fate like ma.gnolia.
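And to show what that drill might look like on the restore side, here is a sketch that pulls the most recent dump from S3, loads it into a throwaway database and runs a simple sanity check. The bucket, key prefix, database and table names are all placeholders for illustration.

import subprocess

import boto3

BUCKET = "my-backup-bucket"                  # placeholder S3 bucket
s3 = boto3.client("s3")

# 1. Find the newest dump under the backup prefix.
objects = s3.list_objects_v2(Bucket=BUCKET, Prefix="db/")["Contents"]
latest = max(objects, key=lambda o: o["LastModified"])["Key"]
s3.download_file(BUCKET, latest, "/tmp/drill.sql.gz")

# 2. Restore it into a scratch database that is safe to throw away.
subprocess.run("gunzip -c /tmp/drill.sql.gz | mysql restore_drill",
               shell=True, check=True)

# 3. Minimal sanity check: the restored database should actually contain data.
result = subprocess.run(
    ["mysql", "-N", "-e", "SELECT COUNT(*) FROM users", "restore_drill"],
    capture_output=True, text=True, check=True)
print("restored rows in users table:", result.stdout.strip())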
