Clouds, lightning and media thunder

June 12, 2009 at 4:11 pm | Posted in Software Startup | 2 Comments

As reported by CNET News (“Lightning took down Amazon Cloud”) and confirmed by Amazon, there was a small outage in the Amazon EC2 infrastructure.  The media decided that this was a great reason to run another hype story with a really misleading headline.  The Amazon Cloud (which is a rather large infrastructure) did not go down.  The truth of the matter is that a small number of servers in a single location were affected by a lightning strike that disabled a power distribution unit.

This media thunder gives me an opportunity to evangelize cloud computing.  Let’s start with the premise that you have a web application running on an Amazon EC2 instance and you use Amazon EBS to store your database.  Finally, let us suppose that your instance was on one of those ill-fated servers.  So how should this all play out?  Let’s take a look.

Since you have external monitoring of your application, you are notified immediately when the lightning strike takes out your Amazon EC2 instance.  You spring into action.  First you check the AWS health status to find out what is going on, and you see that the problem is isolated to a single availability zone.  So now you know you need to bring up a server elsewhere.

So you issue the command to Amazon to give you an instance in another availability zone, launched from your pre-configured Amazon Machine Image.  This image is basically an exact copy of the server that went down.  Within five or so minutes the new server is alive and well, fully configured and running your application.  You run a pre-configured script that makes a few configuration changes around security groups and data store connections.  You then re-configure your Amazon Elastic IP to point to the new server and voila, you are back in business.
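
The recovery steps above can be sketched as a short script.  This is only a minimal sketch and not the AWS API: `CloudClient`, its three methods, `fail_over` and `FakeCloud` are hypothetical names made up for illustration, and a real script would call your provider's own tooling instead.

```python
from typing import List, Protocol, Tuple


class CloudClient(Protocol):
    """Hypothetical provider interface (an assumption, NOT the real AWS API)."""

    def launch_instance(self, image_id: str, zone: str) -> str: ...
    def run_script(self, instance_id: str, script: str) -> None: ...
    def associate_ip(self, elastic_ip: str, instance_id: str) -> None: ...


def fail_over(client: CloudClient, image_id: str, healthy_zone: str,
              elastic_ip: str, setup_script: str) -> str:
    """Bring up a replacement server and move traffic to it."""
    # 1. Launch a copy of the failed server in a zone that is still healthy.
    instance_id = client.launch_instance(image_id, healthy_zone)
    # 2. Apply the post-launch changes (security groups, data store connections).
    client.run_script(instance_id, setup_script)
    # 3. Repoint the public address last, once the new server is ready.
    client.associate_ip(elastic_ip, instance_id)
    return instance_id


class FakeCloud:
    """In-memory stand-in that records calls, for rehearsing the runbook."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, ...]] = []

    def launch_instance(self, image_id: str, zone: str) -> str:
        self.calls.append(("launch", image_id, zone))
        return "instance-1"

    def run_script(self, instance_id: str, script: str) -> None:
        self.calls.append(("script", instance_id, script))

    def associate_ip(self, elastic_ip: str, instance_id: str) -> None:
        self.calls.append(("ip", elastic_ip, instance_id))
```

Calling `fail_over(FakeCloud(), ...)` with placeholder values lets you rehearse the order of operations without touching real servers.  Moving the address last is a deliberate choice here, so visitors are only sent to the new server after it has been configured.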

Compare that to the same outage in a data center that isn’t using virtualization.  When a server goes down, a major process ensues.  The data center team has to physically locate the server and perform diagnostics (hours?).  If they can’t fix the problem, then they either assign a new server (if one is available) or they install a new server (days?).  Once the server is installed, the software must be installed (hours?), and then finally a new IP address is assigned and DNS changes must be propagated to the world to make the server accessible (hours?).

Now each data center may have different procedures, but you get my point.  When you have a virtual infrastructure and you actually plan your architecture properly, hiccups are just part of everyday business.  With the Amazon cloud, applications like mioworks.com can rely on the overall strength of the data center and compete on a level playing field, or maybe even have an advantage because of the cloud’s flexibility.

Monitoring applications in the cloud

March 10, 2009 at 3:22 am | Posted in cloud computing, Software as a Service, Software Startup | 3 Comments

Cloud computing is a rather powerful tool that allows even the smallest of businesses to provide an enterprise-class environment for web applications.  In a nutshell, the cloud is nothing more than the ability to rent computer services on demand from a third-party provider.  At MioWorks.com we use Amazon Web Services, but there are several other services out there for you to explore.

Mastering the cloud takes a bit of work, a dash of experience and an openness to learn from others.  But once you do master it, the benefits are tremendous.  You’ll never have to order another server or rent a rack in a data center.  You’ll be able to fluidly control your environment by increasing and decreasing the services you need on the fly, saving time and money.

This power, flexibility and potential demand that you pay attention to the details.  You must anticipate that the cloud can have hiccups and that a server can disappear as quickly as it came to life.  In previous blog posts I’ve already talked about the importance of backups and recovery drills, but let’s take a step back.  Today let’s talk about monitoring and how important it is to your survival.

Ok, I’ll bite: why is monitoring so important?

Let me sum this up in a single sentence: Monitoring can be the difference between “whew that was close” and “holy s$%t we are down”.  I lied –  I need another sentence…  Monitoring can also be the difference between a five minute outage and a five hour outage.

What to monitor

Every web based application environment in the cloud is a jigsaw puzzle of pieces.  At the core you have your virtual hardware followed by your operating system.  Each of your servers is then configured differently depending on its specific duty.  You may have application servers, web servers, search servers, database servers and the list goes on.  Each of these servers needs to be monitored from several points of view – both internally and externally.

Internal Monitoring

The big question isn’t “Is the server running?”  It should be “Are the server and all of its pieces running correctly?”  Each virtual server in your setup is a maze of processes, files, directories and file systems.  At any given time a hiccup can occur within this delicate environment that will eventually disrupt the end user’s ability to use your service.  In our environment we use monit and munin (two open source tools) on the inside to provide us with critical monitoring, recovery & trending capabilities.

Monit provides systems monitoring and error recovery for our Unix systems.  In our environment at MioWorks.com we have configured monit to watch dozens of potential failure points.  Monit can start a process if it is not running and can kill/restart a process if it takes too many resources.  Monit can also serve as a basic intrusion detection system by watching for changes in files, directories and file systems.  By spending a little time learning and using Monit, your system administrator gains a great tool for keeping a constant eye on all the pieces of the puzzle.
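
To make those rules concrete, here is a minimal sketch of the two ideas in Python: a start-or-restart rule for a process and a checksum test for file changes.  It is not Monit's configuration syntax or its implementation, and every name in it (`supervise`, `file_checksum`, `has_changed`, the callables and the memory limit) is a hypothetical placeholder that you would wire to your own process checks.

```python
import hashlib
from typing import Callable


def supervise(is_running: Callable[[], bool],
              memory_mb: Callable[[], float],
              start: Callable[[], None],
              stop: Callable[[], None],
              memory_limit_mb: float) -> str:
    """Run one check cycle for a process and report the action taken."""
    if not is_running():
        start()  # process is missing: start it
        return "started"
    if memory_mb() > memory_limit_mb:
        stop()   # process is using too much memory: restart it
        start()
        return "restarted"
    return "ok"


def file_checksum(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_changed(path: str, known_checksum: str) -> bool:
    """True if the file no longer matches the checksum recorded earlier."""
    return file_checksum(path) != known_checksum
```

A scheduler would call `supervise` every cycle for each watched process, and compare each watched file against a checksum recorded when the file was known to be good.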

In addition to the direct monitoring and error recovery system, we also like to see the bigger picture.  We use Munin to aggregate information across our server pool.  Munin provides a graphical view that allows your team to quickly see what’s different from yesterday.  You can quickly determine your resource utilization and plan any increase in capacity in ADVANCE.
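
Planning ahead from a trend can also be put into numbers.  The sketch below is not how Munin works (Munin draws graphs for you to read); it is a hypothetical helper, `days_until_full`, that fits a straight line through daily usage samples and estimates how many days remain before a capacity limit is reached.  The sample values and the one-sample-per-day spacing are assumptions for illustration.

```python
from typing import Optional, Sequence


def days_until_full(daily_usage: Sequence[float],
                    capacity: float) -> Optional[float]:
    """Estimate days until usage reaches capacity, from a linear trend.

    daily_usage holds one sample per day, oldest first.
    Returns None when usage is flat or shrinking.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_day = (n - 1) / 2
    mean_use = sum(daily_usage) / n
    # Least-squares slope: growth in usage per day.
    covariance = sum((day - mean_day) * (use - mean_use)
                     for day, use in enumerate(daily_usage))
    variance = sum((day - mean_day) ** 2 for day in range(n))
    slope = covariance / variance
    if slope <= 0:
        return None
    # Never report a negative number of days if the limit is already passed.
    return max(0.0, (capacity - daily_usage[-1]) / slope)
```

For example, a disk that grew from 50 GB to 56 GB over four daily samples at a steady 2 GB per day has about 22 days left before it reaches 100 GB.  A straight line is a deliberately simple model, so treat the result as a prompt to look at the graphs and not as a forecast.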

From the outside

Keeping track of all the pieces inside the cloud is very important, but you also need to know how your environment in the cloud is performing from the point of view of the outside world.  There are more external monitoring services out there than I can count, but I’ll tell you who we use.  Our favorite at the moment is monitis.com.  We like them because starting at just $10/month you get on-demand fault & performance monitoring for your environment.  This external watchdog system helps to keep everyone informed if/when the cloud is having issues.  It also provides us with important statistics on response time and application performance that we use to determine how to adjust our infrastructure.
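
At its simplest, an external check requests a page from outside your environment and records whether it answered and how long it took.  The following is a minimal sketch of that idea using only the Python standard library.  It is not how monitis.com or any other service is implemented, and `probe`, `ProbeResult` and the `opener` parameter are hypothetical names chosen for illustration.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProbeResult:
    ok: bool                # True for a 2xx or 3xx answer
    status: Optional[int]   # HTTP status code, or None if unreachable
    seconds: float          # elapsed time for the attempt


def probe(url: str, timeout: float = 10.0,
          opener: Callable = urllib.request.urlopen) -> ProbeResult:
    """Fetch a URL once and report availability and response time."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # server answered, but with an error status
    except OSError:
        # Covers connection failures, DNS errors and timeouts.
        return ProbeResult(False, None, time.monotonic() - started)
    elapsed = time.monotonic() - started
    return ProbeResult(200 <= status < 400, status, elapsed)
```

Run from a machine outside your cloud on a schedule, `probe("https://your-app.example/")` (a placeholder address) gives you both an up/down signal and a response-time series.  The `opener` argument exists only so the function can be exercised without a network connection.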

Continuous improvement

Your monitoring program must become a living, breathing element of your systems administration.  As new problems arise or potential problems are identified, the monitoring system must be adjusted to catch them proactively.  The good news is that the more you adjust your monitoring and error recovery system, the less you’ll be surprised in the future.  It takes discipline to conduct a post-mortem on each problem and determine how to detect it proactively in the future.  And this discipline will distinguish your application in the frenzy of the cloud.

Real world results of a good monitoring program

In the real world your monitoring system can be the difference between keeping your systems alive and thriving OR having unhappy customers and missed SLAs.  It can help you pinpoint exactly what went wrong and reduce the time it takes for the first responders to identify and solve the issue.  There are lots of solutions in the marketplace, including commercial and open source alternatives.  It may seem overwhelming at first, but once you start the process and improve little by little, you’ll be amazed at the positive impact your monitoring program will have on your environment’s stability and your ability to get some sleep.
