Troubleshooting DAG Failovers and Other Random Drops

This latest tip comes courtesy of the Exchange Team at Microsoft. I’d like to think that somewhere deep in the Microsoft internal archives they have a file called “The Case of the Sleeping NIC.” Troubleshooting DAG failovers and other random drops on Exchange or other applications running on Windows may come down to finding, and waking up, a sleepy NIC.

Problem:

Database Availability Groups seemingly fail over at random from one member server to another. This can also affect other servers, which appear to go offline at random intervals or for no apparent reason. Console access remains up and functional.

Cause:

Sleeping NICs. No, seriously, NICs on servers going “to sleep” as a result of power-saving settings can cause DAGs to fail over. They can also cause other application failovers for clusters, or just plain failures for standalone systems. Many NICs have a power-saving option that appears in the GUI on the Power Management tab as “Allow the computer to turn off this device to save power.” It makes perfect sense for this option to be enabled on laptops, and even on desktops, but on servers? Whether it makes sense or not, the option appears to be enabled frequently on servers, where it causes random-seeming drops in connectivity, DAG and cluster failovers, and other interruptions. When a server is completely idle, as in the middle of the night after backups are done, there are no updates to deploy, and users are all asleep, the operating system can shut down the NIC to save power. That in turn leads to failovers and outages that seem to have no clear cause.

Resolution:

There are a few different ways you can fix this issue if it is happening to you. Frankly, you may want to fix it proactively now, before it does happen to you. There’s really no good reason for a server’s NIC to go to sleep, any more than for the server’s operating system to go to sleep. Here are three ways to fix this.

PowerShell to the rescue

You can download a PowerShell script from TechNet called DisableNetworkAdapterPnPCapabilities that will take care of this for you. Consider combining it with Get-Content on a file that lists your servers and a ForEach loop to apply it to all of your servers at once. The script is available at http://gallery.technet.microsoft.com/scriptcenter/Disable-turn-off-this-f74e9e4a.

Manual intervention

You can run a “Microsoft Fix it” from http://support.microsoft.com/kb/2740020 to fix individual systems, or you can set the registry key yourself by following these steps, also from the KB above:

  1. Click Start, click Run, type regedit in the Open box, and then click OK.
  2. Locate and then click the following registry subkey:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002bE10318}\DeviceNumber

Note: DeviceNumber is the network adapter number. If a single network adapter is installed on the computer, the DeviceNumber is 0001.

  3. Click PnPCapabilities.
  4. On the Edit menu, click Modify.
  5. In the Value data box, type 24, and then click OK.

Note: By default, a value of 0 indicates that power management of the network adapter is enabled. A value of 24 prevents Windows from turning off the network adapter and from letting the network adapter wake the computer from standby.

  6. On the File menu, click Exit.

See http://support.microsoft.com/kb/2740020 for more about the options available to you when manipulating this key.
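
If you would rather not click through regedit on every box, the manual steps above can be captured as a .reg file. What follows is only a minimal sketch in Python, not anything from the KB: the helper name build_reg_file is made up for illustration, and it assumes the KB's value of 24 is decimal (0x18 in the file's hex notation). Check the adapter subkeys on your own servers before importing anything.

```python
# Sketch: build .reg text that sets PnPCapabilities on chosen adapter subkeys.
# Assumptions (not from the article): the KB's "24" is decimal (0x18), and
# build_reg_file is a hypothetical helper name used only for illustration.

NET_CLASS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class"
    r"\{4D36E972-E325-11CE-BFC1-08002bE10318}"
)
DISABLE_POWER_MGMT = 24  # value given in the KB steps, assumed decimal


def build_reg_file(device_numbers, value=DISABLE_POWER_MGMT):
    """Return .reg file text that sets PnPCapabilities on each adapter subkey."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for number in device_numbers:
        # Adapter subkeys are four-digit names such as 0001.
        if not (len(number) == 4 and number.isdigit()):
            raise ValueError(f"expected a 4-digit subkey such as '0001', got {number!r}")
        lines.append(f"[{NET_CLASS_KEY}\\{number}]")
        # .reg files write DWORD values as eight hex digits.
        lines.append(f'"PnPCapabilities"=dword:{value:08x}')
        lines.append("")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_reg_file(["0001"]))
```

Save the printed text to a file and import it on a test server first. Like any change made under this key, it may not take effect until the adapter or the server restarts.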

Group Policy settings

You can use a GPO to configure power settings for your systems. Create a PowerManagement GPO and link it to each OU that contains servers in your environment. While there are a ton of power management settings in Computer Configuration | Policies | Administrative Templates | System | Power Management, none of them apply to network interfaces. You will have to use your GPO to push a registry key, such as the one detailed above. Configure your power management settings on a model server, then see http://technet.microsoft.com/en-us/library/cc753092.aspx for the steps to take that registry entry and create a GPO that will push the same out to all the other servers you want to configure.
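
Whichever way you push the key, it is worth confirming which adapters actually picked it up. Here is a rough Python sketch that reads the text produced by running reg query against the network class key with /s /v PnPCapabilities and lists the adapters that are not set to 24. The sample output is illustrative rather than captured from a real server, the function name is hypothetical, and 24 is again assumed to be decimal (reg query prints it as 0x18).

```python
# Sketch: flag adapters whose PnPCapabilities is not 24 in `reg query` output.
# Assumptions (not from the article): SAMPLE is illustrative text, "24" is
# decimal (0x18), and adapters_needing_fix is a hypothetical helper name.

EXPECTED = 24

SAMPLE = r"""
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002bE10318}\0001
    PnPCapabilities    REG_DWORD    0x18

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Class\{4D36E972-E325-11CE-BFC1-08002bE10318}\0002
    PnPCapabilities    REG_DWORD    0x0
"""


def adapters_needing_fix(reg_query_output, expected=EXPECTED):
    """Return {adapter subkey: current value} for adapters not set to `expected`."""
    needs_fix = {}
    current_key = None
    for line in reg_query_output.splitlines():
        stripped = line.strip()
        if stripped.upper().startswith("HKEY_"):
            # Remember the last path component, e.g. "0001".
            current_key = stripped.rsplit("\\", 1)[-1]
            continue
        parts = stripped.split()
        if current_key and len(parts) == 3 and parts[0].lower() == "pnpcapabilities":
            value = int(parts[2], 16)  # data is printed as hex, e.g. 0x18
            if value != expected:
                needs_fix[current_key] = value
    return needs_fix


if __name__ == "__main__":
    print(adapters_needing_fix(SAMPLE))
```

Adapters that have no PnPCapabilities value at all will not show up in output filtered this way, so treat an empty result as a prompt to look at the subkeys directly, not as proof that everything is fixed.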

Since you cannot provide your servers with NoDoz or a daily dose of Red Bull, the best thing you can do if you think sleepy NICs are causing you problems is to make sure they stay awake. If you are seeing random drops and failovers, check your NIC Power Management settings. The NIC is probably going to sleep. Fix that, and you will probably resolve the bigger issue.

 

Written by Casper Manes

I currently work as a Senior Messaging Consultant for one of the premier consulting firms in the world. I cut my teeth on Exchange 5.0 and have worked with every version of Microsoft’s awesome email package since then, as well as MHS, Sendmail, and MailEnable systems. I've written dozens of articles on behalf of my past employers, their partners, and others, and I finally decided to embrace blogging and social media, so please follow me on Twitter @caspermanes if you enjoy my posts.

1 Comment

  1. Dei · December 31, 2013

    I had to Google in order to know what failovers meant. Doing research for a college project highlighting the importance of Microsoft Exchange. Our group does not really deal with troubleshooting issues, but I got curious with the term failovers. I found out from Microsoft that failovers really mean that there’s something wrong or there is a particular failure in the databases. I’m quite interested in this issue, so if it’s all right with you, can you please come up with a separate post that explains failovers in full? It’s got no bearing on our project, but I’d love to learn more about it!
