Why Does the Sad Passing of Michael Jackson Prove the Need for Cloud and Dynamic Compute?
By Chris Knowles, published Friday, June 26, 2009
This has been one heck of a sad week for the
entertainment industry. Ed McMahon, Farrah Fawcett and Michael Jackson all
passed away. Whenever significant cultural events like this occur, there is an
explosion of communication as people try to find out what happened and discuss
it with their peers. In the past, this would have been limited to talking with
your neighbors, family, and friends, either in person or over a traditional
POTS line. Fast forward to the 21st century and we now have real-time,
bidirectional communication between virtually anyone, anywhere in the world.
When you have an unpredictable event like the death of a
societal icon, or the launch of a new service with the potential for
extremely rapid adoption (or, at the very least, high traffic driven by
curiosity alone), it is very difficult, if not practically impossible, to
anticipate the real-world resources needed to meet the inbound demand. The
chart from Keynote Systems shows this very clearly, illustrating the impact of
this event on the availability and performance of news websites.
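Chart: availability and performance of news websites during the Michael Jackson news surge, as measured by Keynote Systems. Image from: http://www.datacenterknowledge.com/archives/2009/06/25/michael-jackson-news-slows-web-sites/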
TMZ.com was the first news
outlet to break the story of Michael Jackson's death, and consequently its
site collapsed under the unexpected workload. It's hard to fault the IT
team responsible for delivering TMZ's services. After all, no one knew MJ was
going to pass away yesterday.
So where am I going with all of this? To the cloud, of
course! If there ever was a real-world example of a service that a cloud
solution could have helped deliver, one hit by transient, high-intensity
workloads that arrive without warning, this is it. Even a properly architected,
high-volume application or service that is designed to absorb large increases
in transient load has a finite capacity. Now, what if TMZ.com had been able to
automatically spin up cloud resources and shunt the new traffic over to the
cloud during the media frenzy? It would have kept performance and availability
at their normal service levels right through the peak of traffic. (For the
shunting, I'm a big fan of F5 gear for application delivery networking, or
ADN.) Now, they could have done this manually, I suppose. When they saw the
traffic coming, they could have provisioned some AWS instances, gotten their
site and content up and running, and started routing traffic to them through a
change to their load balancers. That'll work, but it's manual, and it's going
to take them time to get it all implemented. In fact, by the time they're set
up, their end users may have already hit a dead site and gone to one of their
competitors. So what to do? Automate!
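For the curious, here is roughly what the "shunting" decision boils down to. This is a minimal sketch of spillover routing, assuming a hypothetical setup with one private pool and one cloud pool, each able to serve a fixed number of concurrent requests; the `Pool`, `route` and `finish` names are made up for illustration. It is not how F5 gear or any particular load balancer is actually configured; real devices base the decision on their own health monitors and load-balancing methods.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Pool:
    """A hypothetical server pool that can serve a fixed number of concurrent requests."""
    name: str
    capacity: int
    active: int = 0

    def has_room(self) -> bool:
        return self.active < self.capacity


def route(primary: Pool, overflow: Pool) -> Optional[str]:
    """Prefer the primary (private) pool; spill to the overflow (cloud) pool only when it is full."""
    for pool in (primary, overflow):
        if pool.has_room():
            pool.active += 1  # the request is now in flight on this pool
            return pool.name
    return None  # both pools saturated: the visitor hits a dead site


def finish(pool: Pool) -> None:
    """Mark one request on the pool as complete, freeing a slot."""
    pool.active = max(0, pool.active - 1)


datacenter = Pool("datacenter", capacity=3)
cloud = Pool("cloud", capacity=2)
print([route(datacenter, cloud) for _ in range(6)])
# ['datacenter', 'datacenter', 'datacenter', 'cloud', 'cloud', None]
```

Requests land on the private pool first and only spill to the cloud once it is full; when both are saturated the request fails, which is the dead-site experience described above.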
Sounds easy, but we all know IT automation is complex,
costly, and out of scope for most SMBs and some enterprises. Or is it? This
entire scenario could have been easily automated with readily available,
cost-effective solutions already on the market. In this case, up.time 5 (our
systems management solution) has a full bi-directional integration with VMware
vCenter Orchestrator. If you are a VMware shop, you get Orchestrator for free
with vCenter Server, and if you are not familiar with it, it is well worth a
look. Essentially, Orchestrator is a policy-based workflow automation tool that
you can use to build automated scenarios to perform, well, pretty much
anything. Orchestrator uses plug-ins that give it the know-how to interact
directly with specific vendor technologies. up.time is the first systems
management solution to integrate deeply with Orchestrator and provide this
type of functionality.
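To make the plug-in and policy idea a bit more concrete, here is a toy sketch of a policy-driven workflow engine in which each plug-in contributes named actions for one technology. This is not Orchestrator's API (Orchestrator has its own workflow designer and plug-in SDK); the `register`, `run_workflow` and `evaluate` functions and the `cloud`, `loadbalancer` and `monitoring` plug-ins are hypothetical stand-ins that only return strings describing what a real action would do.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical plug-in registry: plug-in name -> {action name -> function}.
Action = Callable[..., str]
PLUGINS: Dict[str, Dict[str, Action]] = {}


def register(plugin: str, action: str) -> Callable[[Action], Action]:
    """Decorator that records a function as an action provided by a plug-in."""
    def wrap(fn: Action) -> Action:
        PLUGINS.setdefault(plugin, {})[action] = fn
        return fn
    return wrap


# Stand-in actions; a real plug-in would call the vendor's own API here.
@register("cloud", "provision")
def provision(count: int) -> str:
    return f"provisioned {count} cloud instance(s)"


@register("loadbalancer", "add_members")
def add_members(pool: str) -> str:
    return f"added cloud instances to pool {pool}"


@register("monitoring", "watch")
def watch(target: str) -> str:
    return f"monitoring {target}"


# A workflow is an ordered list of (plug-in, action, keyword arguments).
Step = Tuple[str, str, dict]
# A policy pairs a condition on current metrics with the workflow to run.
Policy = Tuple[Callable[[dict], bool], List[Step]]


def run_workflow(steps: List[Step]) -> List[str]:
    """Execute each step in order via the plug-in that knows that technology."""
    return [PLUGINS[plugin][action](**kwargs) for plugin, action, kwargs in steps]


def evaluate(policies: List[Policy], metrics: dict) -> List[str]:
    """Run the workflow of every policy whose condition matches the metrics."""
    log: List[str] = []
    for condition, steps in policies:
        if condition(metrics):
            log.extend(run_workflow(steps))
    return log


burst_out: List[Step] = [
    ("cloud", "provision", {"count": 2}),
    ("loadbalancer", "add_members", {"pool": "www"}),
    ("monitoring", "watch", {"target": "cloud instances"}),
]
policies: List[Policy] = [(lambda m: m["response_ms"] > 2000, burst_out)]
print(evaluate(policies, {"response_ms": 2500}))
```

The engine itself knows nothing about clouds or load balancers; all technology-specific knowledge lives in the registered plug-in actions, which is the separation the plug-in concept is meant to provide.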
So how does this play into the TMZ.com cloud scenario? Well,
it goes something like this:
- End-user experience for the website is monitored at its logical service address (www.mynewssite.com) using the HTTP service or WATM monitor.
- When the end-user experience begins to suffer, or servers start to indicate they are becoming overloaded, a workflow is triggered automatically to head off any end-user incidents caused by insufficient resources.
- With the automation set up, the cloud is used temporarily to handle all excess load while the website is running hot. The automation in this case covers not only provisioning the additional virtual resources but also monitoring the newly spun-up capacity and applications. We're now sending traffic to our AWS cloud without anyone having had to do anything beyond the initial Orchestrator configuration.
- It gets even better. So, what happens as the traffic slowly decreases back to normal levels? Monitoring of the newly created resources detects that the cloud is no longer needed and that the private infrastructure can handle the load again. It triggers automated workflows to decommission those cloud resources, saving money by using the cloud only when needed and avoiding any sprawl issues.
- Lastly, the IT manager and system administrators have been receiving e-mails and alerts letting them know these automated actions were happening, so they can sit back and watch as their IT infrastructure adapts to handle whatever traffic comes their way.
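The monitor, burst, release and notify loop described above can be sketched as a small controller. This is a minimal sketch, assuming a hypothetical policy: burst to the cloud when the measured response time crosses a threshold, and release the cloud only after the total request rate has stayed comfortably within private capacity for several consecutive checks (hysteresis, so it doesn't flap back and forth). The thresholds, the `BurstController` class and the alert list standing in for e-mails are all assumptions for illustration, not up.time or Orchestrator settings.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BurstController:
    """Hypothetical burst-to-cloud policy with hysteresis on the way back down."""
    private_capacity_rps: float    # assumed: requests/s the private data center serves comfortably
    slow_ms: float = 2000.0        # assumed: response time at which end-user experience suffers
    headroom: float = 0.8          # release the cloud only below 80% of private capacity
    calm_checks_needed: int = 3    # consecutive calm samples required before decommissioning
    cloud_active: bool = False
    calm_checks: int = 0
    alerts: List[str] = field(default_factory=list)  # stand-in for e-mails to the admins

    def observe(self, response_ms: float, total_rps: float) -> None:
        """Feed one monitoring sample: site response time and total request rate."""
        if not self.cloud_active:
            if response_ms >= self.slow_ms:
                self.cloud_active = True
                self.calm_checks = 0
                self.alerts.append(f"scale out: {response_ms:.0f} ms at {total_rps:.0f} req/s")
            return
        # With the cloud absorbing overflow, response times look healthy anyway,
        # so judge whether the private pool alone could carry the current load.
        if total_rps <= self.headroom * self.private_capacity_rps:
            self.calm_checks += 1
            if self.calm_checks >= self.calm_checks_needed:
                self.cloud_active = False
                self.calm_checks = 0
                self.alerts.append(f"scale in: {total_rps:.0f} req/s fits private capacity")
        else:
            self.calm_checks = 0  # load rose again: restart the calm streak


ctrl = BurstController(private_capacity_rps=1000)
# (response time in ms, total requests per second) from successive checks
for response_ms, total_rps in [(300, 600), (2500, 4000), (400, 3500), (350, 900),
                               (300, 700), (320, 750), (310, 700)]:
    ctrl.observe(response_ms, total_rps)
print(ctrl.alerts)
# ['scale out: 2500 ms at 4000 req/s', 'scale in: 700 req/s fits private capacity']
```

The scale-in decision deliberately looks at request rate rather than response time: once the cloud is carrying the overflow, the site responds quickly even though the private side alone could not yet cope.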
Pretty cool stuff. Hey, did we just move cloud from
"buzzword" to real business value!? I think so. So, with a little up-front
configuration, you can implement 'Automated Incident Avoidance' to keep your
services running when they face unforeseen transient workloads. And the best
part is that this is only one example out of literally hundreds (dare I say
thousands) of ways you can automate your infrastructure management to ensure
you are operating at the highest possible level of efficiency, from both a
technology and a resource standpoint.
Chris Knowles is a Solutions Architect at uptime software,
the makers of up.time.