Five Critical Recovery Flaws Your Last DR Test Missed
By Doron Pinhas
Published: Friday, August 08 2008
This special feature for our Spotlight on Security section is a white paper
written by Continuity Software CTO Doron Pinhas. It is being presented in its
entirety, without edits.
Introduction
Businesses today are spending millions of dollars to develop and maintain disaster recovery (DR)
infrastructures that will ensure business continuity. But despite such huge
investments of time and resources, most IT professionals are still not completely confident in their ability
to recover in an emergency. With industry analysts citing DR failure
rates of at least 60 percent, there's good reason to be concerned.
Realists that they are, most IT managers understand that the complexity and
scale of today's infrastructures, high change rates, the number of stakeholders
tied to the change management process, and DR testing costs make recovery
exceedingly difficult even in the best of circumstances. But the limitations of
traditional DR testing are putting IT organizations at an even greater
disadvantage. At a time when businesses are under more pressure than ever to
ensure continuity and minimize data loss, IT organizations have no way to
accurately measure whether their DR plans will actually work when they are
needed.
This paper explores the reasons why periodic DR testing and manual auditing are
not enough to ensure DR readiness. It takes a closer look at the challenges of
traditional DR testing and explains how and why most tests will miss the
serious data protection gaps and recovery vulnerabilities that are lurking in
most environments.
In addition, the paper examines how automated DR testing and monitoring, a new approach to DR
management, is helping companies around the world make up for the shortcomings
of traditional DR testing. These solutions provide companies with the ability
to reduce the cost and operational disruptions caused by traditional testing
methods while delivering a consistent, up-to-date view of the environment.
Automation enables vulnerabilities to be detected and resolved immediately to
ensure the highest level of DR readiness and business continuity.
The Failure of Disaster Recovery Testing
The Theory
A DR test should demonstrate how well business operations can be transferred to
a remote facility to get the organization back online within a specified
recovery time objective (RTO) and recovery point objective (RPO).
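As a rough illustration of what measuring a test against these two objectives could look like, the sketch below compares timestamps recorded during a test with an RTO and an RPO. It assumes the common reading of the terms (RTO as the maximum acceptable downtime, RPO as the maximum acceptable window of lost data). The class name, field names and all figures are hypothetical and are not taken from any particular DR product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class DrTestResult:
    """Timestamps captured during one DR test (hypothetical field names)."""
    outage_start: datetime            # production declared down
    service_restored: datetime        # service usable again at the DR site
    last_recoverable_write: datetime  # newest data present after recovery


def meets_objectives(result: DrTestResult, rto: timedelta, rpo: timedelta):
    """Return (rto_met, rpo_met) for a single test run."""
    downtime = result.service_restored - result.outage_start
    data_loss_window = result.outage_start - result.last_recoverable_write
    return downtime <= rto, data_loss_window <= rpo


# Illustrative numbers only: 3.5 hours of downtime, 10 minutes of lost data.
result = DrTestResult(
    outage_start=datetime(2008, 8, 8, 9, 0),
    service_restored=datetime(2008, 8, 8, 12, 30),
    last_recoverable_write=datetime(2008, 8, 8, 8, 50),
)
rto_met, rpo_met = meets_objectives(
    result, rto=timedelta(hours=4), rpo=timedelta(minutes=5)
)
print(f"RTO met: {rto_met}, RPO met: {rpo_met}")
```

With these made-up figures the test would satisfy a four-hour RTO but miss a five-minute RPO, which is the kind of partial result a pass/fail summary can hide.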
A good DR test requires considerable advance planning, along with a sizable
investment in time and resources. Large numbers of people in the IT
organization need to be involved. Network and storage resource mappings must be
reconfigured not just once but twice, first for the test and then again to
restore normal operations. And to simulate a real disaster, which is the only
way to truly determine how well the DR strategy works, mission-critical
applications or the whole production environment must be taken down during the
test, a step that most businesses are loath to take.
When a test doesn't work, the team must locate and fix the problems and then
repeat the process.
The Reality
DR tests are difficult, costly and complicated. Most companies run lean IT
organizations that just don't have the time or resources to execute full,
by-the-book DR tests. Plus, simulating a disaster can be dangerous: upon
completion of a test, IT professionals often hold their breath, hoping that
production will be easily resumed. With such concerns and limitations, it's no
wonder the scope of DR tests is minimized. Shortcuts include:
- Testing just a few key portions of the infrastructure, rather than testing
  the full DR environment. Companies may, for example, test very few business
  services and postpone the rest to a future test.
- Keeping storage/database/application management servers and/or domain/name
  servers or file servers online while performing the test.
- Conducting orderly system shutdowns to protect production systems, rather
  than simulating the abrupt cessation of operations that would occur in a
  disaster.
- Testing failover servers but not applications.
- Testing applications but not simulating the actual load the application must
  bear following a full site recovery.
- Neglecting to test for dependencies, data inconsistencies and mapping errors
  that may exist between SAN devices and hosts, or any of the other errors that
  can cause a recovery to fail. This is important because most applications
  operate within a federated architecture that includes complex
  interrelationships between databases, applications, middleware, flat files
  and so forth. To ensure successful recovery and data consistency, a DR test
  should ensure that all components in the federated architecture can be
  recovered, or restarted, to the same point in time, while ensuring
  write-order fidelity. However, most businesses do not do this.
In the end, these businesses are left with test results that are at best
incomplete and at worst worthless.
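The last shortcut listed above concerns recovering every component of a federated architecture to the same point in time. As a minimal sketch of how such a check might be expressed, the example below takes the recovery point reported by each component after a test and flags any component that came back to a later point than the oldest one. The component names and timestamps are invented for illustration, and the check is deliberately narrow: it compares recovery points only and does not verify write-order fidelity.

```python
from datetime import datetime, timedelta


def recovery_point_skew(recovery_points):
    """Return the gap between the oldest and newest recovery points."""
    times = list(recovery_points.values())
    return max(times) - min(times)


def components_ahead_of_baseline(recovery_points, tolerance=timedelta(0)):
    """Names of components recovered to a later point than the oldest one."""
    baseline = min(recovery_points.values())
    return sorted(
        name for name, point in recovery_points.items()
        if point - baseline > tolerance
    )


# Hypothetical recovery points reported by each component after a test.
points = {
    "orders_db": datetime(2008, 8, 8, 8, 55, 0),
    "billing_app": datetime(2008, 8, 8, 8, 55, 0),
    "message_queue": datetime(2008, 8, 8, 8, 58, 30),
    "flat_files": datetime(2008, 8, 8, 8, 40, 0),
}
print(recovery_point_skew(points))
print(components_ahead_of_baseline(points))
```

Any non-zero skew in a report like this suggests that the recovered components may not describe the same moment in the business's history, which is the inconsistency a partial test is unlikely to surface.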