Configuration Changes: Adding Fuel to the Fire
To complicate matters further, no one can test daily, so a DR test evaluates a company's recoverability only at the instant the test is run. The moment a change is made to the infrastructure, the test results are thrown into question because there is no easy way to assess what impact that change may have on any other aspect of the environment.
Today's large datacenters are incredibly complex and often include hundreds of applications running on thousands of servers with multiple operating systems and databases. Clusters such as HP ServiceGuard, Veritas Cluster (VCS), and Microsoft Cluster Server (MSCS) require complex resource configuration. In addition, GeoClusters store data in different geographically dispersed locations.
With such a setup, configuration changes are a fact of life. However, any small configuration change - such as the addition of a new volume/database file or reconfiguration of replication processes - can create a gap between the production and the DR environments. Even the smallest gap can cause a recovery operation to fail.
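The kind of gap described above can be illustrated with a minimal sketch that compares the volumes a production database uses against the volumes that are actually replicated. The function name, the volume names, and the idea of keeping both inventories as simple lists are assumptions made for illustration; a real environment would gather this data from its own storage and host tools.

```python
def find_dr_gaps(production_volumes, replicated_volumes):
    """Compare production volumes with replicated volumes.

    Both arguments are hypothetical inventories (lists of volume names).
    Returns volumes that production uses but that are not replicated,
    and replicas that no longer correspond to a production volume.
    """
    prod = set(production_volumes)
    repl = set(replicated_volumes)
    return {
        "unreplicated": sorted(prod - repl),
        "orphaned_replicas": sorted(repl - prod),
    }


# Hypothetical example: a new database file was added on a new volume,
# but replication was never extended to cover it.
production = ["vol01", "vol02", "vol03_new_datafile"]
replicated = ["vol01", "vol02"]

gaps = find_dr_gaps(production, replicated)
print(gaps)
```

In this sketch, production keeps running normally even though one volume is unprotected, which is why such a gap can stay hidden until a recovery is attempted.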
Adding to the challenge is that these errors may not impact normal operations. And, since companies are minimizing the scope of their DR test, there's a good chance the errors won't be caught when a test is run. Instead, they will remain undetected - until an emergency strikes and recovery is derailed.
A Closer Look at What a DR Test Can Miss
Even when a test is conducted according to standard best practices, the number of gaps and errors it can miss is significant enough to pose a serious risk to the business. Given that most organizations have neither the time nor the resources to perform complete DR tests, the level of risk becomes downright frightening.
These risks fall into two categories:
Data Protection Risks. Application data, metadata and data links can be jeopardized by gaps in replication, setup, sequence of procedures, accessibility, mapping, zoning and more. Maintaining the completeness of the data and the consistency of its internal structure is a critical, but difficult, task. Direct impact: data loss and a potential RPO violation if the data is either irrecoverable or recoverable only to a point in time that violates the required RPO.
Availability Risks. Standby hosts, DR servers and cluster members may be unable to fulfill their role because of erroneous configuration, incorrect mapping of replicated storage to standby hosts, standby host configuration errors, and other issues. Direct impact: extended recovery time and potential RTO violation.
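The two direct impacts above come down to simple arithmetic: the data loss window is compared against the RPO, and the downtime against the RTO. The following is a minimal sketch of that comparison; the function name, the timestamps, and the 15-minute and 4-hour objectives are all hypothetical values chosen for illustration.

```python
from datetime import datetime, timedelta


def check_objectives(failure_time, last_consistent_copy, service_restored, rpo, rto):
    """Compare an actual (or simulated) recovery against RPO and RTO.

    data_loss_window: how far back the last usable copy of the data is.
    downtime: how long it took to bring the service back.
    """
    data_loss_window = failure_time - last_consistent_copy
    downtime = service_restored - failure_time
    return {
        "data_loss_window": data_loss_window,
        "downtime": downtime,
        "rpo_violated": data_loss_window > rpo,
        "rto_violated": downtime > rto,
    }


# Hypothetical scenario: the replica is unusable, so recovery falls back
# to a backup taken ten hours before the failure.
result = check_objectives(
    failure_time=datetime(2024, 1, 10, 12, 0),
    last_consistent_copy=datetime(2024, 1, 10, 2, 0),
    service_restored=datetime(2024, 1, 10, 20, 0),
    rpo=timedelta(minutes=15),
    rto=timedelta(hours=4),
)
print(result["rpo_violated"], result["rto_violated"])
```

Here a ten-hour data loss window is measured against a 15-minute RPO and eight hours of downtime against a 4-hour RTO, so both objectives are violated.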
Hidden Optimization Opportunities
A side benefit of running a DR test is finding optimization opportunities, including:
Underutilized Resources. Deployment gaps often result in excessive allocation of storage resources and inefficient use of Storage Area Network (SAN) resources.
Best Practices. Best-practice violations, or the inability to conform to established best practices, are frequently discovered.
Let's take a closer look at just five of the more common errors that often go undetected. We'll explore why they occur, why a DR test fails to catch them, and how they can impact operations.
Replication Inconsistencies (Different RDF Groups)
Risk: Data loss and increased time to recover
How does it happen?
This is a common gap found in large EMC SRDF/S and SRDF/A environments where multiple RDF groups are needed. It occurs most often when storage volumes from different RDF groups are provisioned to the host and used by the same database. The provisioning tools do not alert or prevent this configuration. Each RDF group is associated with different replication adapters and potentially different network infrastructures. Rolling disaster scenarios can result in corrupted replicas at the disaster recovery site.
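Because the provisioning tools do not flag this configuration, it has to be checked for separately. The sketch below shows one way such a check might look, assuming a hypothetical inventory of (database, volume, RDF group) records has already been collected; the function name, record layout, and sample values are illustrative and do not correspond to any vendor tool's output.

```python
from collections import defaultdict


def databases_spanning_rdf_groups(volume_map):
    """Return databases whose volumes belong to more than one RDF group.

    volume_map is a hypothetical inventory: an iterable of
    (database, volume, rdf_group) tuples.
    """
    groups_by_db = defaultdict(set)
    for database, _volume, rdf_group in volume_map:
        groups_by_db[database].add(rdf_group)
    return {
        db: sorted(groups)
        for db, groups in groups_by_db.items()
        if len(groups) > 1
    }


# Hypothetical inventory: the "orders" database uses volumes from two groups.
inventory = [
    ("orders", "vol_0A1", "rdf_group_10"),
    ("orders", "vol_0A2", "rdf_group_10"),
    ("orders", "vol_0B7", "rdf_group_20"),
    ("billing", "vol_0C3", "rdf_group_20"),
]

at_risk = databases_spanning_rdf_groups(inventory)
print(at_risk)
```

Any database reported by this check depends on more than one replication path, which is the precondition for the corruption described in this section.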
What is its impact?
A rolling disaster scenario is characterized by the gradual failure of hardware and network, as opposed to abrupt and immediate cessation. Most real-life disasters are rolling (for example, fire, flood, virus attacks and computer crime). In a rolling disaster, network components will not fail at exactly the same time, resulting in one RDF group being out of sync with the other RDF group. This will irreversibly corrupt the database at the disaster recovery site. Data will need to be restored from a recent backup, increasing both the RTO and the RPO.
Why does the DR test miss this?
When a company conducts an orderly shutdown of applications, databases and hosts, it leaves data in a consistent state. Gradual/rolling disasters that bring systems or network elements down one by one are extremely difficult to emulate in a DR test.
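The difference between a rolling failure and an orderly shutdown can be shown with a deliberately simplified toy model. It assumes that a database issues ordered writes to volumes in two RDF groups and that each group's replication link simply stops at a given time. The function names, group names, and timings are hypothetical, and the model does not attempt to reproduce how SRDF actually replicates data.

```python
def replica_after_failure(writes, link_failure_time):
    """Return the writes that reached the DR site.

    writes: list of (time, rdf_group, description) tuples.
    link_failure_time: dict mapping rdf_group to the time its link stops.
    A write is replicated only if it happens before its group's link fails.
    """
    return [w for w in writes if w[0] < link_failure_time[w[1]]]


def has_write_order_gap(writes, replicated):
    """True if the replica holds a write but is missing an earlier one."""
    replicated_set = set(replicated)
    missing_seen = False
    for write in sorted(writes):
        if write not in replicated_set:
            missing_seen = True
        elif missing_seen:
            return True
    return False


writes = [
    (1, "group_A", "log record for txn 1"),
    (2, "group_B", "data block for txn 1"),
    (3, "group_A", "log record for txn 2"),
    (4, "group_B", "data block for txn 2"),
]

# Rolling disaster: group_A's link fails before group_B's.
rolling = replica_after_failure(writes, {"group_A": 3, "group_B": 5})
# Orderly shutdown, as in a DR test: both links stop after all writes.
orderly = replica_after_failure(writes, {"group_A": 5, "group_B": 5})

print(has_write_order_gap(writes, rolling))
print(has_write_order_gap(writes, orderly))
```

In the rolling case the replica contains the data block for the second transaction without its log record, so the write order the database relies on is broken. In the orderly case every write arrives, which is why a planned test does not expose the problem.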
Note: Many companies actually experience this problem but incorrectly assume it is the result of some network abnormality. However, unless the issue is properly diagnosed and corrected, it will reoccur.