Considerations for High Availability Designs Used for Disaster Recovery

Written by Lee Clemmer on November 3, 2009

With more focus being placed on rapid recovery times for disaster recovery (DR) operations, much of the design, strategy, and practice work done for DR in the past has shifted toward the high availability (HA) concept. For many businesses, an “always on, 24/7/365” concept is key, so a recovery time of 48 hours is simply too long, and losing an entire week of data would be catastrophic, a disaster in its own right. So availability is now king. How do we achieve it? See my article on Virtualization, Replication, Storage and High Availability for introductory concepts on replication, how it increases storage requirements, and the general ideas behind clusters.

Many of you here come from a Microsoft Exchange, and therefore a Windows Server, environment. While much has changed in the capabilities of Windows Server clustering, especially in the Exchange area, many of the core concepts are the same regardless of the latest features and options. For example, block-level replication across drives on a SAN, using a solution such as EMC’s SRDF/CE option, is specifically designed to assist in replicating Windows databases such as SQL Server and Exchange, but the block-level replication works in essentially the same manner as DRBD does on Linux.

Generic SQL Geo-Cluster Architecture

Clustering is conceptually the same regardless of the platform or systems as well. Although that might seem to be heresy to those who are irrationally tied to one platform or the other, it’s true. It’s even more true when dealing with the considerations for multi-site clusters or geo-clusters. Round trip times and network latency limits tied to the speed of light for geographically distant systems can’t be ignored, regardless of the platform or application. Clustering solutions also have to define fail-over and fail-back procedures, and the theory behind most of these solutions is the same. Nodes in a cluster communicate via a heartbeat, and there is often a tie-breaker or “witness” node present to help validate that the primary node in the cluster has really failed. For multi-site or geo-clusters, this is especially important both in the design stage and in understanding the possible failure modes. If network communication is down between sites, but not to and from clients at a site, multi-site clusters may fail over and present a “split brain” situation where each site believes it is the active one and that the other is down.
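To make the tie-breaker idea concrete, here is a minimal sketch of a majority-vote check, assuming a three-vote cluster (one vote per site plus one witness vote). The names `NodeView` and `has_quorum` are made up for illustration and do not correspond to any particular clustering product; real cluster software is considerably more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeView:
    """What one site's node can currently count on (hypothetical model)."""
    name: str
    sees_peer: bool           # heartbeat from the other site is arriving
    holds_witness_vote: bool  # the witness has granted its single vote to this node


def has_quorum(view: NodeView, total_votes: int = 3) -> bool:
    """Return True only if this node holds a strict majority of the votes."""
    votes = 1  # a node always counts its own vote
    votes += 1 if view.sees_peer else 0
    votes += 1 if view.holds_witness_vote else 0
    return votes > total_votes // 2


# The inter-site link is down, and the witness has sided with the primary.
primary = NodeView("site-a", sees_peer=False, holds_witness_vote=True)
secondary = NodeView("site-b", sees_peer=False, holds_witness_vote=False)

print(primary.name, has_quorum(primary))      # site-a True
print(secondary.name, has_quorum(secondary))  # site-b False
```

Because the witness gives its vote to only one side, at most one site can reach a majority when the inter-site link is cut. Without such a rule, each isolated site would be free to declare itself active, which is the split-brain case described above.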

Does the likelihood of a network outage mean that we must change our expected recovery time to be greater than the acceptable down-time for the network listed in our network SLA? Probably. This is a key question. How long must communication between sites be down before the secondary site decides that the primary site is really down and takes over as active? Do you believe that having alternate paths for the heartbeat connection will solve this? Could that create an even greater problem? Let’s look at it:
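Before looking at the paths themselves, the timing question can be framed with a little arithmetic. The sketch below is only an illustration: the heartbeat interval, missed-beat threshold, tolerated outage, and fail-over duration are all assumed numbers, not values from any product or SLA.

```python
def detection_time(heartbeat_interval_s: float, missed_beats_allowed: int) -> float:
    """Seconds of heartbeat silence before the peer site is declared down."""
    return heartbeat_interval_s * missed_beats_allowed


def rides_out_outage(detection_s: float, tolerated_outage_s: float) -> bool:
    """True if a network outage of the tolerated length will NOT trigger fail-over."""
    return detection_s > tolerated_outage_s


def worst_case_recovery(detection_s: float, failover_s: float) -> float:
    """Time from a real site failure until the secondary is serving clients."""
    return detection_s + failover_s


# Assumed figures: the network SLA tolerates outages of up to 60 seconds,
# and bringing services up on the secondary takes about 120 seconds.
tolerated_outage_s = 60.0
failover_s = 120.0

aggressive = detection_time(heartbeat_interval_s=1.0, missed_beats_allowed=5)
patient = detection_time(heartbeat_interval_s=1.0, missed_beats_allowed=90)

print(rides_out_outage(aggressive, tolerated_outage_s))  # False
print(rides_out_outage(patient, tolerated_outage_s))     # True
print(worst_case_recovery(patient, failover_s))          # 210.0
```

Under these assumptions, a heartbeat timeout short enough to give a fast recovery also fails over on a network blip the SLA considers acceptable, while a timeout long enough to ride out that blip pushes the recovery time past the SLA's outage window.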

Multi-path Communication for Multi-site Clusters

The servers will likely have a subnet-spanning (cross-site VLAN) solution over which their heartbeat network interfaces communicate. This network path therefore includes distinct network adapters (NICs), cabling, and possibly separate switching, and it may take a different route to and from the remote site. If the sites communicate via a traditional WAN link, but clients connect between sites or to each site via separate Internet-facing routers or VPN concentrators, the client path to the remote site and its server(s) in the cluster may be very different. Consider, to begin with, that client communication with the active node(s) at the primary site may fail while the separate network path for heartbeat and quorum information stays up, leaving the cluster in a state where it is healthy but unreachable.

If the cluster fails over because heartbeat communications fail while clients can still reach the primary site’s active servers, very strange problems can arise. Depending on how DNS is configured, and on how the cluster’s IP address is managed, clients might be directed to the secondary site based on the interruption of communications on the heartbeat network, when in fact the primary site is still active. Depending on the SAN or replication solution, one of the sites will hold the writable copy of the data, while the other is only being replicated to. The load balancing or DNS management needs to align with which cluster site is active. If the heartbeat network goes down and the cluster fails over to the secondary site, but clients are still directed to the primary site by a load balancer or DNS, that site likely won’t have access to the disk volumes, since the SAN will have failed over to the secondary. If the replication solution still allows write access, the data between sites will become inconsistent: the cluster will think the secondary site is active, yet data has been written to the primary. Granted, if things are set up correctly this should not happen. But it can. Be warned.
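The alignment requirement can be expressed as a simple consistency check across three views of "which site is active". This is a hypothetical monitoring sketch: `SiteState` and `find_mismatches` are illustrative names, and in practice each field would have to be populated from your own cluster, SAN, and DNS or load-balancer tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class SiteState:
    """Three independent opinions about which site is active (hypothetical)."""
    cluster_active: str    # site the cluster software considers active
    storage_writable: str  # site whose SAN volumes are currently writable
    client_target: str     # site that DNS or the load balancer sends clients to


def find_mismatches(state: SiteState) -> List[str]:
    """Return a description of every disagreement with the cluster's view."""
    problems = []
    if state.client_target != state.cluster_active:
        problems.append(
            f"clients are sent to {state.client_target}, "
            f"but the cluster is active at {state.cluster_active}"
        )
    if state.storage_writable != state.cluster_active:
        problems.append(
            f"storage is writable at {state.storage_writable}, "
            f"but the cluster is active at {state.cluster_active}"
        )
    return problems


# The scenario from the text: cluster and SAN failed over, DNS did not follow.
after_failover = SiteState(
    cluster_active="secondary",
    storage_writable="secondary",
    client_target="primary",
)

for problem in find_mismatches(after_failover):
    print("WARNING:", problem)
```

A check like this does not prevent the mismatch; it only makes it visible, which is often the first thing missing when clients report that a "healthy" cluster is not answering.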
