6 Causes of Email Downtime
Written by Mike Rede on June 29, 2010
Every company attempts to minimize server downtime, as any outage means loss of productivity, potential loss of data and, more importantly, loss of revenue.
An estimated forty-two percent of businesses experienced database corruption in 2007. The risk of database corruption is cause for great concern in the data center, particularly for email administrators, who are responsible for protecting email content and for providing near-continuous availability of email communications.
Without email communications, companies can experience the same loss of productivity, data and revenue that is associated with database server downtime. Near-continuous operation of email servers is necessary to maintain a company's reputation with its customers and its competitive edge in the marketplace.
With these thoughts in mind, here are some of the most common reasons for email server downtime.
-
Server Patches
A recommended practice is to install patches and fixes on test servers before putting a new patch into production. This way you can avoid the downtime caused by patches that have not been fully tested or that are incompatible with existing code on the system. Sometimes the patches themselves are corrupted, for example because they were linked against a corrupted library when the patch was built. Testing patches in a virtual environment is another way to avoid potential downtime.
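The test-first practice described above can be sketched as a simple gate: apply the patch to the test servers, run a health check on each, and only mark the patch as ready for production if every test host passes. This is a minimal illustration, not a deployment tool. The names `staged_rollout`, `apply_patch` and `health_check` are hypothetical placeholders for whatever patching and monitoring tooling you actually use.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class RolloutResult:
    promoted: bool            # True only if every test host passed
    failed_hosts: List[str]   # hosts where the patch or health check failed


def staged_rollout(
    test_hosts: Iterable[str],
    apply_patch: Callable[[str], None],   # hypothetical: installs the patch on one host
    health_check: Callable[[str], bool],  # hypothetical: True if the host is healthy
) -> RolloutResult:
    """Apply a patch to test hosts and report whether it looks safe to promote."""
    tested = 0
    failed: List[str] = []
    for host in test_hosts:
        tested += 1
        try:
            apply_patch(host)
        except Exception:
            # A patch that fails to install counts as a failed test.
            failed.append(host)
            continue
        if not health_check(host):
            failed.append(host)
    # Never promote a patch that was not tested anywhere.
    return RolloutResult(promoted=tested > 0 and not failed, failed_hosts=failed)


if __name__ == "__main__":
    result = staged_rollout(
        ["test-mail-1", "test-mail-2"],          # assumed host names
        apply_patch=lambda host: None,            # stub: pretend the install succeeded
        health_check=lambda host: host != "test-mail-2",  # stub: one host is unhealthy
    )
    print(result)
```

Refusing to promote when no test host was exercised is a deliberate choice here: an empty test run says nothing about the patch.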
-
Dynamic Workloads
Companies that run multiple environments on an enterprise-class server can experience unexpected or unplanned changes in resource allocation, which can lead to hung or downed systems. Sometimes a workload change is caused by a large file transfer between servers that, if left unattended, can crash a system. Microsoft Windows servers have been known to issue messages such as “Drive 0 not found: Serial ATA, SATA – 0” after crashing during large file transfers.
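One common way to keep a large transfer from monopolizing a shared server is to copy in chunks and pause between them, so the average throughput stays near a chosen cap. The sketch below illustrates that general idea only; it is not something the article prescribes, and the function name `throttled_copy` and its parameters are assumptions made for the example.

```python
import io
import time
from typing import BinaryIO, Callable


def throttled_copy(
    src: BinaryIO,
    dst: BinaryIO,
    max_bytes_per_sec: int,
    chunk_size: int = 64 * 1024,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Copy src to dst in chunks, pausing after each chunk to cap the average rate.

    Returns the number of bytes copied.
    """
    if max_bytes_per_sec <= 0 or chunk_size <= 0:
        raise ValueError("max_bytes_per_sec and chunk_size must be positive")
    copied = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        dst.write(chunk)
        copied += len(chunk)
        # Pause for as long as this chunk would take at the capped rate.
        sleep(len(chunk) / max_bytes_per_sec)
    return copied


if __name__ == "__main__":
    # In-memory demo with a recording stand-in for time.sleep, so it runs instantly.
    pauses = []
    source = io.BytesIO(b"x" * 10)
    target = io.BytesIO()
    total = throttled_copy(source, target, max_bytes_per_sec=2,
                           chunk_size=4, sleep=pauses.append)
    print(total, pauses)
```

The `sleep` parameter is injectable so the pacing logic can be exercised without real delays. With real files you would pass objects opened in binary mode and leave `sleep` at its default.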
-
Database Corruption
As already mentioned, database corruption can lead to server downtime. Problems with the underlying storage devices, such as a faulty drive controller or a failing RAID array, can also crash an email server. This is one of the reasons why storage virtualization has become an important consideration in the enterprise: uptime of data is now as important as uptime of servers. Sometimes a server goes down because of writes to error logs that are themselves corrupted within the database.
-
Directory Problems
Sometimes a server will crash due to problems with Active Directory. Worse yet, administrators can face situations where not only their email servers but all servers are down because of a domain-wide Active Directory failure. Reboots are a last resort for fixing such problems, but when everything else fails they can be the only solution. When something like this happens, it takes a concerted effort by all administrators involved to bring every system back online, hopefully within your recovery time objectives. I have seen four-tiered environments take hours upon hours to come back up far enough for end users to log in again and resume normal business activities.
-
Viruses
Some viruses are written with the express purpose of crashing as many systems and servers as possible. Some use denial-of-service strategies hidden in unopened email attachments, while others rely on unintentional holes in the fabric of the enterprise that malicious hackers on the internet can exploit. One such example is the denial-of-service vulnerability in how the Microsoft Server Message Block (SMB) client handles custom SMB responses. Hackers can exploit this vulnerability without authenticating by sending a custom SMB response to a client-initiated SMB request. If the attack succeeds, the server could stop responding and would need a complete system restart before it could resume normal business operations.
-
Configuration Errors
Sometimes changes are made to configuration settings that lead to unintended email server downtime. If the changes affect WAN link settings, they can cause WAN link failures with undesirable effects on the email servers. For example, Exchange servers in different geographies can become unavailable and begin to report delivery errors. An error message might indicate that the recipients could not be reached and that “A configuration error in the e-mail system caused the message to bounce between two servers or to be forwarded between two recipients. Contact your administrator.”
This last problem is more perplexing because an administrator would expect a downed, crashed, or unavailable system to be unable to respond with error messages at all. The normal expectation would be that messages bound for the unavailable system would simply be queued and then resent later, once the server had returned to normal operation or the WAN link configuration settings had been corrected. Well, that just leaves more fun for the troubleshooters among us.
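The two behaviors contrasted above, queueing for a later retry versus bouncing a message that loops between servers, can be illustrated with a small sketch. It assumes a sender that holds messages while the link is down and rejects any message that has already passed through too many servers. The class name `OutboundQueue`, the `MAX_HOPS` value and the `send` callback are all hypothetical; this does not describe how Exchange is implemented.

```python
from collections import deque
from typing import Callable, Deque, List

MAX_HOPS = 5  # assumed loop-detection limit, chosen only for illustration


class OutboundQueue:
    """Hold messages while the remote server is unreachable; bounce looping ones."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send                     # hypothetical transport callback
        self._pending: Deque[str] = deque()
        self.bounced: List[str] = []

    def submit(self, message: str, hops: int = 0) -> None:
        # A message relayed too many times is probably caught in a loop
        # between servers, so bounce it instead of queueing it again.
        if hops >= MAX_HOPS:
            self.bounced.append(message)
            return
        self._pending.append(message)

    def flush(self) -> int:
        """Try each pending message once; keep the ones that fail. Returns sent count."""
        delivered = 0
        for _ in range(len(self._pending)):
            message = self._pending.popleft()
            try:
                self._send(message)
                delivered += 1
            except ConnectionError:
                # Link is still down: put the message back for a later retry.
                self._pending.append(message)
        return delivered

    def __len__(self) -> int:
        return len(self._pending)


if __name__ == "__main__":
    link = {"up": False}

    def send(message: str) -> None:
        if not link["up"]:
            raise ConnectionError("WAN link down")

    queue = OutboundQueue(send)
    queue.submit("status report")
    queue.flush()            # link down: the message stays queued
    link["up"] = True
    queue.flush()            # link restored: the message is delivered
    print(len(queue), queue.bounced)
```

Seen this way, the bounce error quoted above suggests that the servers were still reachable enough to pass the message back and forth, which is a routing loop rather than a plain outage.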



September 10th, 2010 at 10:02 pm
One incredibly horrible office experience was when management forced us to run a patch on our servers without fully testing it. We were strongly against the idea, but forces beyond our control felt that the initial tests had gone well enough for the patch to be ready for official use. As soon as we ran it on our main servers, everything just began to fold up. Looking to patch? I agree. Run it on test servers first.
September 10th, 2010 at 10:06 pm
These points are pretty helpful, thanks for this. We’ve actually been running into configuration errors with our WAN setup in the office. It was a complete mess trying to get everything up and running again. I exaggerate of course, but the hour or so when the e-mail was down caused a pretty big uproar in and around the office. For a time, it felt like an eternity as we worked on getting our WAN up and running again.
September 10th, 2010 at 10:10 pm
Would it be possible to see an article on how to actively guard against these concerns? Some points were tackled in passing, while some were not covered at all. I know that isn’t the point of this current post, but it would be extremely helpful to see what methods you guys suggest for preventing each scenario. I’ll understand if it can’t be consolidated in a single post. I wouldn’t mind seeing separate articles on each respective scenario. Just a request, but great article!