Have you ever experienced a power outage at work?  One that lasted several days?  One that affected the entire state?

Hey man, can you give me a jump?

Once you have, I imagine you will think of disaster recovery in a different way.  Most of the time, SQL Server performs perfectly, without any intervention from you.  However, to get the most out of it, you will need to take a few proactive steps.

I have seen power outages that lasted only a few minutes or even several days.  This could be caused by something as simple as someone tripping over a power cable, or a frozen pipe in a power plant a state away.

If they last long enough, your UPS could run out of power.  Or, if you are using a back-up generator, it may require refueling.

No matter how it happens, when SQL Server loses power, you have the possibility of corruption occurring.  You want to have a basic strategy in place to detect and deal with this.

Data Center

If you work in a large office, you may not normally have Data Center access.  However, when all hands are on deck for disaster recovery, you may spend significant time in there.  They tend to be cold, so bring a jacket or sweater.  Keep an extra one handy in your office so that when you need it, you will have it on-hand.

Did your servers come up before the network or switches?  If so, you may have to use a crash cart and the console to log-in to the servers.  This involves plugging a mouse and keyboard directly into the server racks and logging in.  Hopefully, the servers are clearly labeled so you will know which one is which.  It is also handy to have a rolling cart, a.k.a. Crash Cart, to hold the mouse and keyboard.  A rolling chair is nice, too.  Spending hours on end crouched down or bent over at weird angles can be hard on your body.

SAN Considerations

Did you servers come back online before the SAN did?  If so, you may need to start SQL Services manually.  Once the SAN is back online, you can use SQL Server Configuration Manager to start SQL Server.  Right-click on your instance and select Start.  Depending on your situation, you may need to start some of the other services, as well.  For example, SQL Server Agent (for your backup jobs) and SQL Server Browser (for your named instances).


SQL Server Configuration Manager

It is possible for write caching to come back to haunt you.  While it does provide a performance boost, when you are facing a disaster recovery scenario, it may not seems like a good idea anymore.

If you do not have a battery backup for your write cache, you may have data corruption and inconsistency in the event of a power failure.  If you do have battery backup, how long does it last, how often are you checking and changing the batteries?  These are not questions you want to consider when your manager is asking you about recovery time.

If you would like to know more about the storage systems underneath SQL Server, I highly recommend reading some of the stuff Wesley Brown (blog | twitter) has put out.

Starting SQL Server in Single-User Mode

Sometimes you need to start SQL Server in single-user mode to do some advanced troubleshooting.  You can do this by modifying the Startup Parameters in SQL Server Configuration Manager.

Right-click on your instance and select Properties.  Go to the Advanced tab, and look at the entry for Startup Parameters.  Place the following code at the beginning of the line: -m;

Be sure to include the semicolon to separate it from the rest of the commands.

Properties, Advanced Tab

Dedicated Administrator Connection

The Dedicated Administrator Connection is a system backdoor that allows you to connect to SQL Server, even when it becomes unresponsive to client connections.  This is a good one to file away in case you ever need it.

To connect, you must be logged onto the server that is hosting SQL Server.  You are not allowed to make a Dedicated Administrator Connection across the network.  When you open SQL Server Management Studio, open a new Query Window and specify ADMIN: in front of the server name.

Dedicated Administrator Connection

Note: if there is already a DAC in use, your connection will fail.

Preventative Maintenance

Backups

Be sure to use WITH CHECKSUM in your backup commands.  You are taking backups, aren’t you?  As a general rule, I like to take FULL backups once per week, DIFFerential backups daily, and LOG backups every five minutes.  You will need to consider your business requirements and devise a backup strategy that satisfies your needs.

BACKUP WITH CHECKSUM

When writing out a backup, the WITH CHECKSUM option makes SQL Server verify each page before it is written to the backup file.  If an error is detected, the backup will cease.  If you want the BACKUP to continue after an error is found, use the CONTINUE_AFTER_ERROR option.

PAGE_VERIFY CHECKSUM

When the PAGE_VERIFY_OPTION option is turned on, when each page is written out to disk a checksum is calculated and stored with the page header.  Then, when the page is subsequently read from disk, the checksum is calculated and compared to the value stored with the page header.  If the values are not the same, an error is logged to both the SQL Server error log and the Windows Application Event Log.  Look for Error Message 824.  This indicates a problem with your IO system.

One common misconception about this option, is that is automatically checks all of your pages.  This is incorrect.  The check is only made when a page is written to disk and then read back off of it.  Once you turn this on, it may take some time until all of your pages go through this cycle.

Schedule DBCC CHECKDB Regularly

Either use SQL Agent Jobs or the Maintenance Plan Wizard to run DBCC CHECKDB at least once per week on all of your databases.

If you have large databases, or 24/7 activity, it may not be feasible to run this on your Production systems.  If this sounds like your situation, there is a way out.  Do you have DEV and QA systems?  Do they get refreshed from Production regularly?

Run DBCC CHECKDB on your DEV and QA systems each time your refresh them from Production.  If you detect an error, raise the red flag immediately, and prepare to run DBCC CHECKDB on your production system.

Some DBA Commands To Know

DBCC CHECKDB

DBCC CHECKDB performs a variety of physical and logical checks on every object in your database.

Full version, this can take a long time depending on the size of your database.


dbcc checkdb(YOURDBNAME)
with all_errormsgs, no_infomsgs;
go

Some quicker versions, which don’t take as long:


dbcc checkdb(YOURDBNAME, noindex);
go


dbcc checkdb(YOURDBNAME)
with physical_only;
go

Repairing Problems

If DBCC CHECKDB detects some errors, it will make some recommendations at the end of its messages.

However, Microsoft recommends that you first try to restore from a good backup.  This way, you can try to compare the data from the backup and the corrupted database and decide which copy of the data to keep.  Most likely, this will be a long and painful process, but it is preferable to blindly losing production data.

If this is not an option, then you can try using the repair options listed by DBCC CHECKDB.  Note: This should be considered the last resort.

Before running a repair command, be sure to make a FULL backup.


backup database YOURDATABASE
to disk = 'c:\YOURDATABASE.bak'
with format, checksum, stats=1;
go

You must set the database into single-user mode.


alter database YOURDATABASE
set single_user;
go

Depending on the options you choose, you may need to set the database into emergency mode, as well.


alter database YOURDATABASE
set emergency;
go

Then, you can try some of the repair options.  Try this one first, it can fix some problems with your indexes.  This is not able to fix issues with FILESTREAM data.


dbcc checkdb(YOURDBNAME, repair_rebuild);
go

If that does not work, then you can try this one next.  You may experience some data loss with this one.  This should be your last resort.


dbcc checkdb(YOURDBNAME, repair_allow_data_loss);
go

Warning: this option will result in data loss.  Try this only after all other options have been exhausted.

Afterwards, set your database back to multi-user or normal, and take another FULL backup.


alter database YOURDATABASE
set multi_user;
go

alter database YOURDATABASE
set online;
go

backup database YOURDATABASE
to disk = 'c:\YOURDATABASE.bak'
with format, checksum, stats=1;
go

Be sure to take a FULL backup.

For more information about the DBCC CHECKDB command, please refer to the maser, Paul Randal (blog | twitter).

Summary

A power failure can be a stressful thing to deal with.  Everyone wants the systems back online as soon as possible.  However, it is important to remain calm and be sure to run some basic consistency checks on the databases before allowing users back online.

Help yourself sleep better at night.  Be sure to take regular BACKUPs, use CHECKSUM, and run DBCC CHECKDB weekly.

Advertisements