Achieving Application-Consistent Recovery for Virtualized Environments

By Eric Burgener
Monday, March 1st 2010
Advanced

When data recovery must be performed, administrators generally weigh the “speed” advantages of application-consistent recovery against the “data loss” advantages of recovering from the most recent crash-consistent point.  To create an application-consistent point, an application will flush all data in logs out to data files to enable fast, reliable recovery.  Crash-consistent points, on the other hand, still enable reliable recovery but support a slower restoration of application services since all data in logs must first be dealt with when an application is restarted.  A common approach is to recover from the most recent application-consistent point in time first to get an application service back up and running as quickly as possible. Dealing with the data loss is often a secondary issue.  In the Exchange world, for example, this would allow users to start sending and receiving e-mail again very quickly, but they may not have access to all their most recent e-mails until the administrator has completed recovery operations for all the Exchange updates since the last VSS snapshot point.
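The two-phase approach described above can be sketched in a few lines of Python. This is a minimal illustration, not a real backup product's API: the `RecoveryPoint` class, the `plan_recovery` function, and the numeric timestamps are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RecoveryPoint:
    """Hypothetical record describing one available recovery point."""
    timestamp: float       # seconds since an arbitrary epoch
    app_consistent: bool   # True if the application flushed its logs first


def plan_recovery(points, failure_time):
    """Pick the point to restart from and the point to catch up to.

    Returns (restart_from, latest): the most recent application-consistent
    point (fast restart) and the most recent point of any kind (least data
    loss once the remaining updates have been recovered).
    """
    usable = [p for p in points if p.timestamp <= failure_time]
    app_points = [p for p in usable if p.app_consistent]
    if not app_points:
        raise ValueError("no application-consistent point precedes the failure")
    restart_from = max(app_points, key=lambda p: p.timestamp)
    latest = max(usable, key=lambda p: p.timestamp)
    return restart_from, latest


# Assumed example data: alternating application- and crash-consistent points.
points = [
    RecoveryPoint(100, True),
    RecoveryPoint(200, False),
    RecoveryPoint(300, True),
    RecoveryPoint(400, False),
]
restart_from, latest = plan_recovery(points, failure_time=450)
print(restart_from.timestamp, latest.timestamp)  # 300 400
```

In this sketch the service would be restarted from the point at 300, and the administrator would then work forward toward the point at 400 to recover the remaining updates.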

These recovery requirements apply in both physical and virtual machine environments, but legacy data protection technology can make them more challenging in virtualized environments.  This article explores recovery considerations for virtual machines and identifies a combination of newer data protection technologies that can meet recovery requirements.

Virtual Machines Present New Recovery Challenges

Most physical machines, particularly Windows and Linux servers, run at low utilization rates, typically in the 20% to 35% range.  Re-provisioning physical servers to accommodate growth can be very disruptive, so administrators leave significant headroom at initial configuration to minimize how often it is needed.  Virtual servers are easier to provision, so administrators are willing to incur re-provisioning costs in high-growth environments to obtain the economic advantages (lower costs, lower power and cooling, less floor space) associated with server consolidation projects that achieve high virtual-server-to-physical-server ratios.

Virtual machines generally run at much higher utilization rates than physical servers, often approaching 85% or greater.  With heavily loaded virtual machines, there is less headroom available for administrative activities like data protection.  Conventional backup agents and host-based replication often generate too much load to be used in virtual environments.  This is why server virtualization vendors offer hypervisor-level APIs for backup purposes:  these APIs provide a way to create disk-based snapshots for recovery while putting very little overhead on individual virtual machines.

But just generating snapshots isn’t enough.  Hypervisor-level APIs create crash-consistent snapshots, but they do not create application-consistent snapshots because they have no mechanism for getting an application to flush its logs out to data files before the snapshot is taken.  What this means is that when data recovery is required, it will not be optimized for speed.  As noted earlier, it takes longer to recover data from crash-consistent recovery points than from application-consistent recovery points, and it also may involve more manual interaction, depending on the nature of the problem.
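The difference between the two kinds of snapshot can be illustrated with a toy model. The sketch below is an assumption-laden simplification: `ToyApp`, its in-memory log, and the two snapshot functions are hypothetical and stand in for a real application and a real snapshot mechanism, which are far more involved.

```python
import copy


class ToyApp:
    """Hypothetical application that logs writes before applying them."""

    def __init__(self):
        self.data = {}   # stands in for the application's data files
        self.log = []    # stands in for its transaction log

    def write(self, key, value):
        # Writes land in the log first and reach the data files only on flush.
        self.log.append((key, value))

    def flush(self):
        # Apply every logged write to the data files, then empty the log.
        for key, value in self.log:
            self.data[key] = value
        self.log.clear()


def crash_consistent_snapshot(app):
    # Capture state exactly as it is, pending log entries included.
    return {"data": copy.deepcopy(app.data), "log": copy.deepcopy(app.log)}


def application_consistent_snapshot(app):
    # Ask the application to flush its log first, then capture state.
    app.flush()
    return {"data": copy.deepcopy(app.data), "log": copy.deepcopy(app.log)}


def entries_to_replay(snapshot):
    # Work the application must do at restart before it can serve requests.
    return len(snapshot["log"])


app = ToyApp()
for i in range(3):
    app.write(f"msg{i}", f"body{i}")

crash_snap = crash_consistent_snapshot(app)
app_snap = application_consistent_snapshot(app)
print(entries_to_replay(crash_snap))  # 3
print(entries_to_replay(app_snap))    # 0
```

Both snapshots in this model lead to the same data once the log is processed; the application-consistent one simply has no log left to process at restart, which is where the speed advantage comes from.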

The other issue with snapshot backups is data loss.  Data loss on recovery is driven by the frequency of hypervisor-level snapshots.  When a recovery is required, the most recent snapshot is generally the best option, and that option inevitably leads to data loss.  Because snapshots are taken at intervals rather than continuously, the most recent snapshot rarely reflects the most recent data state:  every update made between the last snapshot and the failure is missing from it.
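A short worked example makes the relationship between snapshot frequency and data loss concrete. The schedule, the failure time, and the `data_loss_window` function below are assumptions made up for illustration; they are not taken from any particular product.

```python
from datetime import datetime, timedelta


def data_loss_window(snapshot_times, failure_time):
    """Return the time between the last snapshot before a failure and the failure."""
    earlier = [t for t in snapshot_times if t <= failure_time]
    if not earlier:
        raise ValueError("no snapshot precedes the failure")
    return failure_time - max(earlier)


# Assumed schedule: one snapshot every 4 hours, starting at midnight.
start = datetime(2010, 3, 1, 0, 0)
snapshots = [start + timedelta(hours=4 * i) for i in range(6)]

# Assumed failure at 11:30; the last snapshot before it was taken at 08:00.
failure = datetime(2010, 3, 1, 11, 30)
print(data_loss_window(snapshots, failure))  # 3:30:00
```

With a fixed schedule like this one, the worst case approaches the full snapshot interval (here, just under four hours), so shortening the interval is the only way to shrink the exposure when snapshots are the sole recovery source.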