Exchange Server SLAs, and Why You Need One

Written by Paul Cunningham on May 13, 2010

The worst possible time to define your uptime and availability requirements for an Exchange environment is when that environment is unavailable.  No email administrator wants to hear “We need this working within 2 hours” when they are looking at a dead server that is going to take all night to recover.

Uptime and availability should be defined within an SLA, or Service Level Agreement.  An Exchange Server SLA should exist in all organizations, even those that provide their own internal IT services.  The SLA is an agreement between the IT supplier or IT department and the rest of the business, and it clearly defines how much downtime of the Exchange environment is acceptable.

Why Are SLAs So Important?

The existence of an SLA supports many facets of the design and operation of the Exchange Server environment.

Budget – When a business defines their service level requirements they are making a commitment to providing the funds necessary to deliver those service levels.  An SLA is one of the best pieces of leverage the IT department has to secure those funds and implement an appropriate Exchange system.  Without the backing of an SLA the IT department may struggle to get approval for Enterprise server licensing, multiple servers for clustering, and other high availability components.

Server and Network Design – Exchange Server environments are designed to meet defined SLAs.  Certain uptime expectations can only be met with the right server design.  A business that is willing to go a day without email would not need the same infrastructure deployed as a bank that can’t go more than 15 minutes without email.  Clustering, redundancy, and site-to-site failover are all design points that would be included or excluded based on the SLA.
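To make an uptime expectation concrete, it can help to translate an availability percentage into the amount of downtime it actually permits. The sketch below is only an illustration: the function name, the 30-day measurement period, and the 99.9% figure are assumptions for the example, not values from any particular SLA.

```python
def allowed_downtime_minutes(availability_percent: float, period_days: float = 30) -> float:
    """Return the downtime budget, in minutes, for a given availability target.

    availability_percent: target such as 99.9 (hypothetical example value)
    period_days: length of the measurement period (30 days is an assumption)
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    # The budget is whatever fraction of the period is NOT covered by the target.
    return total_minutes * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

A target that leaves only a few minutes of downtime per month generally cannot be met by a single server, which is why the figure agreed in the SLA drives the design.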

Third Party Warranty – In very resilient environments, such as those with clustered servers, this is less of an issue.  But for an environment with SLAs that depend on single points of failure, the right warranty response times need to be in place for the SLAs to be met.  A 4 hour return to service target will not always work if it is paired with a 4 hour vendor response time, because the vendor meets their target simply by showing up on site within 4 hours.  Once they have spent time fixing or replacing failed components, the IT team may still have to work through software and data recovery processes.
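The arithmetic behind that mismatch can be sketched as a simple timeline check. The function name and the repair and recovery durations used here are hypothetical; only the 4 hour target and 4 hour vendor response come from the example above.

```python
def return_to_service_check(target_hours, vendor_response_hours,
                            hardware_repair_hours, recovery_hours):
    """Compare a worst-case outage timeline against a return to service target.

    Returns (target_met, total_hours). All durations are estimates supplied
    by the caller; the repair and recovery figures are assumptions.
    """
    # Worst case: the vendor arrives at the very end of their response window,
    # then repairs hardware, then IT performs software and data recovery.
    total_hours = vendor_response_hours + hardware_repair_hours + recovery_hours
    return total_hours <= target_hours, total_hours


if __name__ == "__main__":
    met, total = return_to_service_check(
        target_hours=4,
        vendor_response_hours=4,   # from the example in the text
        hardware_repair_hours=1,   # assumed for illustration
        recovery_hours=2,          # assumed for illustration
    )
    print(f"Worst-case outage: {total} hours, target met: {met}")
```

With these illustrative numbers the worst case is 7 hours against a 4 hour target, so the warranty would need a much shorter response time, or the design would need to remove the single point of failure.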

Backups – The backup system will be heavily influenced by the SLAs that are in place.  If the backup system cannot restore all of the required data within the SLA timeframe then of course the SLA cannot be guaranteed.
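A rough way to test this is to divide the amount of data to be restored by the throughput the backup system can sustain during a restore. This is a minimal sketch under simplifying assumptions: the function names are hypothetical, the restore rate is treated as constant, and time spent locating media or replaying logs is ignored.

```python
def estimated_restore_hours(data_gb: float, restore_rate_gb_per_hour: float) -> float:
    """Estimate restore time, assuming a constant restore rate (a simplification)."""
    if restore_rate_gb_per_hour <= 0:
        raise ValueError("restore_rate_gb_per_hour must be positive")
    return data_gb / restore_rate_gb_per_hour


def restore_fits_sla(data_gb: float, restore_rate_gb_per_hour: float,
                     sla_hours: float) -> bool:
    """Return True if the estimated restore time is within the SLA timeframe."""
    return estimated_restore_hours(data_gb, restore_rate_gb_per_hour) <= sla_hours


if __name__ == "__main__":
    # Illustrative figures only: 600 GB of mailbox data restored at 100 GB/hour.
    hours = estimated_restore_hours(600, 100)
    print(f"Estimated restore time: {hours:.1f} hours")
    print("Fits a 4 hour SLA:", restore_fits_sla(600, 100, 4))
```

If the estimate does not fit, the options are to speed up the restore, split the data into smaller databases that can be restored independently, or renegotiate the SLA.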

Staffing – The SLA will define the service levels for different times of day, and this will impact staffing levels.  If 8×5 support is all that is required, that calls for a different staffing level and roster than 24×7 support would.

It all starts with the SLA.  Sometimes an organization has trouble defining their requirements before an actual outage occurs.  For those without any SLA at the moment my suggestions would be:

  1. Analyze your current infrastructure and make an estimate as to how long a recovery would take under a variety of failure scenarios (e.g. single mailbox, single database, single server)
  2. Identify the business processes that email supports and is involved in
  3. Survey a sample of staff from various departments and teams, ensuring that each tier of employee is well represented in the survey

From that exercise you will gain an understanding of your business needs, technical capabilities, and the gaps that exist between them, and you can then begin work to formalise them as SLAs and implement changes in the environment to close those gaps.
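The comparison at the heart of that exercise can be sketched in a few lines. Everything in this example is hypothetical: the `Scenario` structure, the hour figures, and the idea of using a single tolerated outage per scenario are assumptions chosen to illustrate the gap analysis, not measured data.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    estimated_recovery_hours: float  # from the infrastructure analysis (step 1)
    tolerated_outage_hours: float    # from the business review and survey (steps 2 and 3)


def find_gaps(scenarios):
    """Return (name, shortfall_hours) for scenarios where recovery takes longer
    than the business can tolerate, largest shortfall first."""
    gaps = []
    for s in scenarios:
        shortfall = s.estimated_recovery_hours - s.tolerated_outage_hours
        if shortfall > 0:
            gaps.append((s.name, shortfall))
    return sorted(gaps, key=lambda gap: gap[1], reverse=True)


if __name__ == "__main__":
    # Illustrative numbers only.
    scenarios = [
        Scenario("single mailbox", 1.0, 4.0),
        Scenario("single database", 6.0, 4.0),
        Scenario("single server", 12.0, 4.0),
    ]
    for name, shortfall in find_gaps(scenarios):
        print(f"{name}: recovery exceeds tolerance by {shortfall:.1f} hours")
```

Sorting by shortfall puts the largest gaps first, which gives a starting point for deciding where to invest in the environment and where to adjust the expectation written into the SLA.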
