Microsoft technologies and what I do for fun: I wish this was available years back - SQL Server 2005 Always On Technologies Part 2

One of the most underestimated feature of SQL Server is the FULL RECOVERY model in the database options. Database recovery models determine how much data loss is acceptable in case of a failure and what types of backup and restore functions are allowed. This has been an option which you can trace back from the earlier versions of SQL Server. We will not be going in detail with the different recovery models but it would be best to know that if you have a mission-critical database, it should be configured to use the FULL recovery model. A full description of the different recovery models for SQL Server 2000 (whch is also applicable to 2005) can be found in this MSDN article. Let's look at the same scenario we had in the first part of this series. The [Order Details] table, for example, in the Northwind database can be dropped successfully any time even if referential integrity is enforced in the database (try running the DROP TABLE [Order Details] comand and see what happens). Now, as I have mentioned, a mission-critical database should be configured with FULL recovery model combined with regular backups, either full database backups or a combination of full with differential and/or transaction log backups. So, I am assumming that you have at least a full database backup on a regular basis.

Contain the disaster

The first thing that you need to do when a user accidentally drops a table (or any object) in your database is to contain the disaster. You can do this by setting the database option to only be accessed by the dbo. You can do this either thru SQL Server Management Studio or running the ALTER DATABASE command

ALTER DATABASE Northwind
SET RESTRICTED_USER
WITH ROLLBACK IMMEDIATE

The WITH ROLLBACK IMMEDIATE option in the ALTER DATABASE command specifies to rollback all active transactions immediately while setting the access mode to RESTRICTED_USER, which in this case is the dbo. This will prevent any other user from connecting to the database, of course, with the goal of getting the database back online with the least amount of downtime.

Backup the transaction log

Since you have configured your database to be in FULL recovery model, you can backup what we call the "tail of the log" or the active portion of the transaction log. This will ive you the option to do a point-in-time restoration, the time before the DROP TABLE statement was executed. Since the database is technically offline at this point in time, you do not need to issue a WITH NO_TRUNCATE option in your BACKUP LOG command

BACKUP LOG Northwind
TO DISK = N'D:\NorthwindLOG.TRN'
WITH NAME = N'Transaction Log Backup', STATS = 10
GO

You will need both your full database backup and this transaction log backup to recover the dropped table.

Restore the database in RESTRICTED_USER mode from the full database backup and the transaction log backup
Since you have already got a full database and tail-of-the-log backup, you can now restore your database to a point before the table was dropped. This is going to be a bit tricky as you do not know exactly when the table was dropped (even if the developer called you up when the event happened, it would probably take minutes to log on to the server, open up your favorite query tool - be it Query Analyzer or Management Studio - and go thru steps 1 and 2 above). To make it a bit faster, just restore to a known good point-in-time just to get the database back online the soonest time possible and worry about trying to find when the table was dropped later.
USE master
GO

RESTORE DATABASE Northwind
FROM DISK = N'D:\NorthwindFull.bak'
WITH NORECOVERY, RESTRICTED_USER
GO

RESTORE LOG Northwind
FROM DISK = N'D:\NorthwindLOG.TRN'
WITH RESTRICTED_USER,
-- use a "known good" point in time
STOPAT = '2007-09-12 10:13:03.420', RECOVERY
GO
We are assumming here that the [Order Details] table was dropped before September 12, 2007 10:14AM, that's why we are taking 10:13:03.420 as a point-in-time restore point. Remember that if we restore to a point where the table has already been dropped, we have to start the restore process again. But we are not done yet. We don't know how many transactions have been committed between 10:13:03.420 and 10:14:00.000. On highly transactional databases, say averaging around 500 transactions per second or higher, you come up with around 28,500 transactions within that span of time. If a single transaction is about $0.05, you've lost about $1,425 within that span of time (and now you know the reason why database engineers and administrators get paid well, especially in financial institutions). Notice that the database was restored with the RESTRICTED_USER option. This means that the database is still inaccessible from external users. This is because we are not done yet.

Create a snapshot of the restored database and create a gap to store the recovered records later

We create a snapshot of the restored database which we will use for analyzing the records that need to be restored after getting our database back online. Remember, we only restored the database from a good point in time but probably way back before the table was dropped. That small difference is what we need to restore once the database is online. To create the database snapshot, we use the CREATE DATABASE command
CREATE DATABASE Northwind_RestorePointSnapshot
ON
( NAME = N'Northwind',
FILENAME = N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\Northwind_RestorePontSnapshot.mdf_snap')
AS SNAPSHOT OF [Northwind]
The snapshot should be in the same instance as the reference database. I will not be digging deeper into database snapshots in this entry but will probably spend some time in the future discussing database snapshots in SQL Server 2005. To create a gap in the [Order Details] table to store the recovered records later, we need to execute a DBCC CHECKIDENT command on the [Order Details] table. This is assumming we have an IDENTITY column defined in the table, which is usually the case (although the [Order Details] table in the Northwind database does not have an identity column, let's assume it does. Or better yet, add an identity column. I added one just for this purpose named OrdDetID).
DBCC CHECKIDENT ('[Order Details]') -- we will get the number of rows we have
GO
USE NorthwindGO
-- we will use a value big enough to insert the recovered records later
DBCC CHECKIDENT ('[Order Details]', RESEED, 4200000)
GO
The RESEED option specifies that the current identity value should be corrected. Let's say we get a value of 4000000 in the first DBCC CHECKIDENT command, we can create a gap of about 200000 to insert the recovered records later on (which explains the value of 4200000). We don't really care how big the gap is, what's more important is we have enough space to insert the records and we can get the database back online. The next DBCC CHECKIDENT command is to be executed on the original Northwind database as this is where we need to create the gap. Once we're done with this, we can now get the database back online.

Get the database back online

By now, we are ready to get the database back online. We can run the ALTER DATABASE command or use SQL Server Management Studio to do this.
USE master
GO

ALTER DATABASE Northwind
SET MULTI_USER
Now that the database is back online, we can start investigating the damaged data.

Restore the database with a different name for investigation
Since the production database is up and running, we can now go back and spend all the time we have to investigate the gap between what is and what should be in the production. Remember that we only restored the production database to a good known point in time. We have to identify the records that are still in the tail-of-the log than needs to be brought back to the production database. To restore the database backup to a different name,
USE master

RESTORE DATABASE [Northwind_Investigate]
FROM DISK = N'D:\Investigate\NorthwindBackup.bak'
WITH MOVE N'SalesDBData' TO N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\NorthwindData_Investigate.mdf',
MOVE N'SalesDBLog' TO N'C:\Program Files\Microsoft SQL Server\MSSQL.1\MSSQL\DATA\NorthwindLog_Investigate.ldf',
STANDBY = N'D:\Investigate\Northwind_UNDO.bak',
STOPAT = '2007-09-12 10:13:03.420', STATS
GO

RESTORE LOG [Northwind_Investigate]
FROM DISK = N'D:\Investigate\NorthwindBackup.TRN'
WITH STANDBY = N'D:\Investigate\Northwind_UNDO.bak',
STOPAT ='2007-09-12 10:13:03.420', STATS
GO

This is exactly what we did in Step #2 except that we are now restoring on a different database. We do the RESTORE LOG command until we come to a point-in-time before the table was dropped. This is a tricky one as it requires patience and attention to detail. If you just went beyond the point-in-time where the table was dropped, you will have to start over again. Play around with the value in the STOPAT option, moving forward either on a per second, per minute or per 5-minute intervals. You decide on the value. The ultimate test to find out whether or not the table has already been dropped is to execute a SELECT COUNT(*) FROM [Order Details] command. If this returned an error stating that the object does not exist, then you have restored to a point-in-time after the table was dropped and you have to repeat the process. Once you know you're done, you're ready to extract those records and push them to the production database.

Identify the records missing in the production database
Since we have created a gap in the production database using the RESEED option, we can identify how many records are in the investigate database but not in the production. We'll use the value we specified in Step #4, 4200000

--identify the number of records in production database before RESEED
SELECT OrdDetID
FROM Northwind.dbo.[OrderDetails] WHERE OrdDetID<4200000
--identify the number of records in the investigate database which are not in production
--if we get 4150000 from the first query, we use this value in the WHERE clause
SELECT OrdDetID
FROM Northwind_Investigate.dbo.[OrderDetails] WHERE OrdDetID>4150000

We'll get an idea on the number of records missing in the production database using this aproach. There are a lot of ways to push these records back in the production database and we'll look at some of them

Push the records back from the investigate to the production database
There are a lot of ways to do this. Let's look at some of them. One way is to do an INSERT..SELECT command from the investigate database to the production database. A sample query to do this is shown below.

--recover from missing rows
SET IDENTITY_INSERT Northwind.dbo.[Order Details] ON
GO

INSERT Northwind.dbo.[Order Details]
SELECT *
FROM [Northwind_Investigate].dbo.[Order Details] AS R
WHERE R.OrdDetID > 4150000
GO

The problem with this approach is that you are assuming only INSERT statements were executed between the gap. In reality, we don't really know. It could have been a couple of INSERTS, UPDATES and DELETES simultaneously occurring within this span of time which limits your assumption in using this approach. A more appropriate approach is to use the tablediff.exe utility. This utility is used to compare the data in two tables for non-convergence, and is particularly useful for troubleshooting non-convergence in a replication topology. Now you might be thinking otherwise as we do not have a replicated environment here. Well, we could think of it as one so we can use this utility. It can generate the corresponding INSERT, UPDATE and DELETE statements to execute to make replicated databases in sync. A full description of this utility is available in MSDN. Now, there are some limitations here. We don't know exactly how many tables were affected in a specific disaster. In this particular scenario, we only dealt with a single table being dropped by a user. What if there were more than one table which I believe all of the database I know of have. Since the tablediff utility generates the SQL statements only for a single table, we can create a dynamic SQL statement which will query the sysobjects table (or the sys.objects table in SQL Server 2005) for the names of the tables and generate a tablediff command with all of these parameters derived from extracting the names of the tables. This will be an appropriate approach to synchronize the investigate database and the production database. There is a GUI tool created by the guys from SQLTeam.com for those who use this tool often.

You have seen how to take disaster recovery (sometimes known as data recovery) in another level in this article. As I always emphasize, the process is more important than the technology that's why I highlighted the process in detail. The technology portion can be extracted from just about anywhere. Having a disaster recovery process is very important. Document your process and have a regular DR drill exercise to prepare you and your team when disaster strikes. In the next article in this series, we will look at other disaster recovery options available in SQL Server 2005

Microsoft technologies and what I do for fun

Monday, October 1, 2007

I wish this was available years back - SQL Server 2005 Always On Technologies Part 2

No comments:

FEEDJIT Live Traffic Feed

FEEDJIT Live Traffic Map