Thursday, December 2, 2010

Disaster Recovery Is More Than Just Technology Part 2: The Alphabet Soup

In my previous blog post, I talked about high availability and disaster recovery (HADR) and how more than just the underlying technology keeps the entire strategy intact. In this blog post, I’ll describe a few acronyms, sometimes called buzzwords, that come up in most HADR projects and implementations (I know I use them a lot when answering questions about HADR). These acronyms fall under the second P in my PPT for HADR: PROCESS. Every HADR project team should be able to define these terms well before purchasing the hardware, software and technologies they intend to use. Let’s get going.

Recovery Point Objective (RPO). Simply put, RPO answers the question, “How much data can we afford to lose?” Every HADR project should determine the acceptable amount of data loss, which is usually measured in units of time. For example, suppose a highly critical application runs 24x7 and the stakeholders have defined the RPO to be one (1) hour. You are running regular log backups (for SQL Server) or redo log backups (for Oracle) every hour starting at 12:30AM (I’m pretty sure the Oracle guys would jump on me for using this as an example), and the database that stores the application’s data crashes at 5:45AM. The last backup ran at 5:30AM, 15 minutes before the crash, so you lose 15 minutes of data, which is acceptable given an RPO of one (1) hour. In this example, the hourly backup schedule was not chosen by guesswork; it follows from the RPO, although I see a lot of DBAs simply use hourly backups as a standard. If your schedule was a guess, it’s time to define your RPO, determine the amount of acceptable data loss and review your backup strategy.
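To make the arithmetic in this example concrete, here is a minimal Python sketch. It assumes each backup completes at its scheduled time and is restorable, which a real environment would have to verify; the function name and the calendar date are placeholders I picked for illustration.

```python
from datetime import datetime, timedelta


def last_backup_before(crash, first_backup, interval):
    """Return the most recent scheduled backup time at or before the crash.

    Assumption: every scheduled backup finished on time and is usable.
    """
    if crash < first_backup:
        raise ValueError("crash occurred before the first backup ran")
    completed_intervals = (crash - first_backup) // interval
    return first_backup + completed_intervals * interval


day = datetime(2010, 12, 2)                     # placeholder date
first_backup = day.replace(hour=0, minute=30)   # backups start at 12:30AM
crash = day.replace(hour=5, minute=45)          # database crashes at 5:45AM
interval = timedelta(hours=1)                   # hourly log backups
rpo = timedelta(hours=1)                        # stakeholder-defined RPO

last_backup = last_backup_before(crash, first_backup, interval)
data_loss = crash - last_backup

print("Last backup:", last_backup.strftime("%I:%M%p"))
print("Data loss window:", data_loss)
print("Within RPO:", data_loss <= rpo)
```

Note that with hourly backups the worst case is a crash just before the next backup runs, which is almost a full hour of data loss, so the backup interval itself should not exceed the RPO.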

Recovery Time Objective (RTO). RTO answers the question, “When is my application coming back online after a disruption?” Like RPO, RTO is measured in units of time. Looking back at the earlier example, if the stakeholders have defined the RTO to be two (2) hours, then the database, the application and whatever else is needed to use the application should be back online by 7:45AM, two hours after the 5:45AM crash.
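The RTO deadline is simple date arithmetic. The sketch below, with hypothetical helper names and a placeholder date, computes the deadline for the 5:45AM crash and checks an actual restore time against it. The 7:30AM restore time is a made-up value for illustration.

```python
from datetime import datetime, timedelta


def rto_deadline(outage_start, rto):
    """Latest time by which service must be back online."""
    return outage_start + rto


def meets_rto(outage_start, restored_at, rto):
    """True if the service came back on or before the RTO deadline."""
    return restored_at <= rto_deadline(outage_start, rto)


crash = datetime(2010, 12, 2, 5, 45)   # placeholder date, 5:45AM crash
rto = timedelta(hours=2)               # stakeholder-defined RTO

deadline = rto_deadline(crash, rto)
print("Back online by:", deadline.strftime("%I:%M%p"))

# Hypothetical restore time, used only to exercise the check.
restored_at = datetime(2010, 12, 2, 7, 30)
print("RTO met:", meets_rto(crash, restored_at, rto))
```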

Service Level Agreement (SLA). As Wikipedia defines it, an SLA is the part of a service contract where the level of service is formally defined. In my experience, it is commonly agreed upon between a customer and a service provider. You might be thinking that if you have an internal IT management team, chances are that you won’t have to deal with SLAs. However, bear in mind that for a computer application to be online, it relies on hardware, which needs to be covered by a vendor warranty with associated service levels should the hardware need to be serviced or replaced during a disaster. It also relies on a medium through which it can be accessed, either the Internet or your local network, which likewise needs to be covered by vendor or in-house service level agreements. This is a very important item in your HADR project, as anything external to your team will definitely affect your RPO and RTO. For example, if your highly critical application has been restored within two hours and the data loss was less than an hour, you may think you have met your RPO and RTO. But if the Internet connection that allows your users to access the application is still not restored after two hours, forget about achieving them: from the point of view of your users, the application is still not accessible. That is why, when you’re dealing with vendors or service providers, you should make sure that your agreed-upon SLAs meet your RPO and RTO.
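One way to picture this dependency: users only see the application as recovered once the slowest component it depends on is back. Here is a rough Python sketch under that assumption. The component names and restore durations are hypothetical numbers I made up to mirror the example, not measurements.

```python
from datetime import timedelta


def user_visible_recovery(restore_times):
    """Recovery time as users experience it: the slowest dependency wins."""
    return max(restore_times.values())


rto = timedelta(hours=2)

# Hypothetical restore durations for each dependency of the application.
restore_times = {
    "database and application": timedelta(hours=1, minutes=50),
    "internet link (vendor SLA)": timedelta(hours=3),
}

recovery = user_visible_recovery(restore_times)
blocking = max(restore_times, key=restore_times.get)

print("User-visible recovery time:", recovery)
print("RTO met:", recovery <= rto)
print("Slowest dependency:", blocking)
```

In this made-up case the team restores the database inside the two-hour RTO, yet the RTO is still missed because the vendor’s link takes three hours. That is the argument for negotiating SLAs that are at least as tight as your own RTO.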

I’ve only scratched the surface of the different components that make up the PROCESS part of an HADR project. What I find surprising is that whenever I ask customers looking for an HADR solution about their RPO/RTO/SLA values, they immediately respond with “I want zero downtime and zero data loss for my application.” They simply think that their application deserves an RPO and RTO value of zero (0). What they don’t realize quite yet is that as RPO and RTO approach zero (to borrow the language of limits from calculus), the cost rises sharply. And when we start talking about costs, customers start re-evaluating their HADR strategies the way they are supposed to. This is where I really like the discussions to go, because customers then look at each application and its corresponding database differently and categorize them accordingly, from not-so-critical to highly critical. They start crunching numbers to determine how critical an application really is and whether it merits a near-zero RPO and RTO. Take for instance an e-commerce site that generates an average of 50 transactions per minute (which is a relatively low volume these days) at US$10 per transaction. That is equivalent to US$500 per minute. Losing an hour’s worth of transactions due to downtime or data loss would mean US$30,000 (US$500 x 60 minutes). Protecting US$30,000 worth of transactions per hour should be enough to justify having an HADR solution in place. Your strategy should also consider whether the transactions only come in between 5:00AM and 9:00PM, as you wouldn’t want to invest a lot in a solution that has nothing to protect outside those hours.
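The number crunching for the e-commerce example can be sketched in a few lines of Python. The function names are mine, and the sketch assumes the average rate of 50 transactions per minute holds evenly across the 5:00AM to 9:00PM window, which real traffic rarely does. The 8:30PM to 9:30PM outage at the end is a hypothetical case.

```python
def revenue_at_risk(tx_per_minute, value_per_tx, outage_minutes):
    """Revenue exposed during an outage, assuming a steady transaction rate."""
    return tx_per_minute * value_per_tx * outage_minutes


def exposed_minutes(outage_start, outage_end, opens=5 * 60, closes=21 * 60):
    """Minutes of an outage that overlap the active window.

    All arguments are minutes since midnight on the same day.
    """
    return max(0, min(outage_end, closes) - max(outage_start, opens))


per_minute = revenue_at_risk(50, 10, 1)
per_hour = revenue_at_risk(50, 10, 60)
print("Per minute: US$", per_minute)
print("Per hour:   US$", per_hour)

# Hypothetical outage from 8:30PM to 9:30PM: only the part before 9:00PM counts.
overlap = exposed_minutes(20 * 60 + 30, 21 * 60 + 30)
late_outage_cost = revenue_at_risk(50, 10, overlap)
print("Exposed minutes:", overlap)
print("Late outage:  US$", late_outage_cost)
```

The same one-hour outage costs half as much when it straddles closing time, which is why the hours during which transactions actually arrive belong in the cost discussion.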

So when you plan your next HADR project, think about these concepts and define your RPO/RTO/SLA. Doing so will keep your perspective right and help you plan accordingly. In my next blog post, I will be talking about high availability implementations, with more examples and how they should address your RPO/RTO/SLA. Stay tuned. Plus, if you’re in the Washington DC area this weekend, catch my presentation on SQL Server Disaster Recovery Techniques at SQL Saturday #61.