February | 2011 | Do It Right or Do It Over

Customer asks:

I hope you remember the discussion we had back in October 2010 around the need for backups for Exchange. Well, I feel like we are running in circles around Exchange 2010 backup and recovery. I felt relatively comfortable with the idea until recently when we started making some further changes to the environment in response to the growth of the log files. We are attempting to switch to circular logging because Avamar doesn’t reset the archive bit and thus the logs never get truncated and grow until the volume is full. Allegedly having 3 copies of the db’s is supported/recommended using circular logging but it has raised some questions:

* If we have 2 db servers running on vm’s in the Cincinnati data center, is it advisable to use the lag copy for the 3rd db server in Lexington or does this violate the “rule of 3”?

* Is it better to have all 3 in the same DAG with no lag at all vs. having the lag copy?

* In what scenario would having the lag copy be beneficial?

* If we enable circular logging on the active db server, do all of the other servers mirror that setting? Is this Circular Logging Continuous Replication (CLCR)?

* If the passive copies don’t pick up the change in log file settings, is it possible/advisable to have an alternate log mechanism on the other copies?

* Now that we have increased our capacity in Avamar, what would a reasonable DR backup plan look like? Currently we are using weekly.

* How do SAN snapshots fill in the gaps between backups and log files? We use Dell/Equallogic storage today and while the auto-snapshot feature sounds nice, in reality the snapshots are huge and thus we can only keep one snapshot per db. I’m sure you can comment on how much better EMC’s tools are and I actually welcome that discussion.

Ok… Here’s what I wrote back:

First, thank you for your message. I have escalated your issue with the Avamar client. Not truncating logs at the end of a successful backup is errant behavior, and we need to find out why this is not working for you.

Second… I’ll take your questions in a Q&A format:

Q * If we have 2 db servers running on vm’s in the Cincinnati data center, is it advisable to use the lag copy for the 3rd db server in Lexington or does this violate the “rule of 3”?

A * A lagged copy violates the “rule of three.” With three current copies, each copy becomes a block-for-block voting member in repairing torn pages (which are likely if you are running on SATA direct-attached or other non-intelligent storage devices). A lagged copy will not be counted as a voting member.

Q * Is it better to have all 3 in the same DAG with no lag at all vs. having the lag copy?

A * If you are running on an intelligent array (like ALL of EMC’s arrays, which perform advanced CRC tracking and scanning), you will not need Exchange to vote on page repairs, because there won’t be any! That means a lagged copy can be used to go backwards in time. Technically, you would need two “near” copies and one “far” lagged copy to satisfy your temporal repair issues. More on this below.

Q * In what scenario would having the lag copy be beneficial?

A * A corruption event that affects the Active copies of the database. You could use the lagged copy with its unplayed logs to isolate the corruption within an unplayed log and prevent the corruption from infecting the only unaffected copy you have left… This is why a backup becomes important…
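To make the lagged-copy idea concrete, here is a minimal Python sketch of how a replay lag leaves a window of unplayed logs that can be cut off before a corruption event. It is only a model of the concept, not Exchange's actual replay logic; the `LogFile` type, the function names, the one-log-per-hour rate, the 24-hour lag, and the corruption time are all assumptions made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class LogFile:
    generation: int      # log sequence number (hypothetical field)
    created: datetime    # when the active copy closed this log


def split_logs(logs, now, replay_lag):
    """Split copied logs into (replayed, pending) for a copy that delays replay."""
    cutoff = now - replay_lag
    replayed = [log for log in logs if log.created <= cutoff]
    pending = [log for log in logs if log.created > cutoff]
    return replayed, pending


def logs_safe_to_replay(pending, corruption_time):
    """Keep only pending logs written before the corruption event."""
    return [log for log in pending if log.created < corruption_time]


# Assumed scenario: one log per hour for 48 hours, a 24-hour replay lag,
# and corruption introduced 6 hours before "now".
now = datetime(2011, 2, 1, 12, 0)
logs = [LogFile(g, now - timedelta(hours=48 - g)) for g in range(1, 49)]

replayed, pending = split_logs(logs, now, timedelta(hours=24))
safe = logs_safe_to_replay(pending, now - timedelta(hours=6))

print(f"already replayed: {len(replayed)} logs")
print(f"still unplayed:   {len(pending)} logs")
print(f"safe to replay up to generation {safe[-1].generation}")
```

The point of the model is that the corruption can only be isolated if it still sits in the unplayed set, so the lag has to be longer than the time it takes you to notice the problem.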

Q * If we enable circular logging on the active db server, do all of the other servers mirror that setting? Is this Circular Logging Continuous Replication (CLCR)?

A * In a word, YES. See http://technet.microsoft.com/en-us/library/dd876874.aspx#FleMaiPro

Q * If the passive copies don’t pick up the change in log file settings, is it possible/advisable to have an alternate log mechanism on the other copies?

A * No.

Q * Now that we have increased our capacity in Avamar, what would a reasonable DR backup plan look like? Currently we are using weekly.

A * Excellent Question!! Microsoft’s Exchange Product group has begun to intermix the notions of HA, Backup/Recovery, and DR. In a perfect world, you would have five copies of each database (sounds like SAP eh?) Two at the primary data center (PDC), three at the backup data center (BDC), plus one more lagged copy at the BDC. In that architecture, you can use Exchange Native Backup, but that doesn’t answer your question.

Assuming that we can get Avamar to truncate your logs, a DR plan would look like this: Use a two-site DAG with two local copies and one remote copy, all Active copies. You will place a database activation block on the remote databases to prevent accidental routing of mail to the remote databases. Now… Install the Avamar client on the remote mailbox server and back it up every night (full only!!). Keep the backups for 14 days.

Also, at the PDC, configure Reserve LUN Pool space for one of the mailbox servers. It does not matter if the server is Active or Passive. Configure Replication Manager to create VSS snapshots of all the databases on that server. Execute snaps once per day. Never back them up. Expire the snapshots after eight days.

This will provide:

* Local HA in real time via DAG replication: the ability to perform patch management every week with no downtime.

* Local BC at less than 24-hour RPO. This will allow you to recover your database in the strange event that a rolling corruption event makes it to the remote data center.

* Remote failover for DR/BC and Site Maintenance.

* Remote BC/DR in the event of a smoking hole during a corruption event.
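The retention numbers in the plan above can be sketched as a small Python model that lists which recovery points still exist on a given day. This is only an illustration: it assumes exactly one recovery point per day and that a point is dropped once it reaches its retention age, and the function names and the example date are hypothetical rather than anything Avamar or Replication Manager exposes.

```python
from datetime import date, timedelta

BACKUP_RETENTION_DAYS = 14    # nightly full backups at the remote site
SNAPSHOT_RETENTION_DAYS = 8   # daily VSS snapshots at the primary site


def recovery_points(today, retention_days):
    """Dates of the daily recovery points still retained on `today`."""
    return [today - timedelta(days=age) for age in range(retention_days)]


def restore_source(today, wanted):
    """Pick the fastest source that still holds the wanted day, if any."""
    if wanted in recovery_points(today, SNAPSHOT_RETENTION_DAYS):
        return "snapshot"
    if wanted in recovery_points(today, BACKUP_RETENTION_DAYS):
        return "backup"
    return None


today = date(2011, 2, 28)  # assumed example date
oldest_snapshot = recovery_points(today, SNAPSHOT_RETENTION_DAYS)[-1]
oldest_backup = recovery_points(today, BACKUP_RETENTION_DAYS)[-1]

print("oldest snapshot:", oldest_snapshot)
print("oldest backup:  ", oldest_backup)
print("restore 2011-02-25 from:", restore_source(today, date(2011, 2, 25)))
print("restore 2011-02-17 from:", restore_source(today, date(2011, 2, 17)))
```

Preferring the snapshot when both sources cover the requested day reflects the idea that the local snapshot is the quicker path, with the remote backup covering the older part of the window.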

Q * How do SAN snapshots fill in the gaps between backups and log files? We use Dell/Equallogic storage today and while the auto-snapshot feature sounds nice, in reality the snapshots are huge and thus we can only keep one snapshot per db. I’m sure you can comment on how much better EMC’s tools are and I actually welcome that discussion.

A * Hmmm… Yes. EMC’s snapshots on CX4 and VMAX need only enough space to hold the deltas, not the entire LUN. Array-based snapshots do not actually fill a gap between log files and backups; however, they provide an additional layer of recoverability that closes a temporal gap associated with recovery from backup. For example, I can recover a 1TB database from a snapshot in about 4 minutes. I can recover a 1TB database from disk (B2D) in about 14 hours!!
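As a rough check on those two recovery times, the Python sketch below converts them into effective throughput. It assumes 1TB means 10^12 bytes and treats both figures as simple averages over the whole restore; the function name is hypothetical.

```python
def effective_throughput_mb_s(size_bytes, seconds):
    """Average rate, in MB/s, needed to restore size_bytes in the given time."""
    return size_bytes / seconds / 1e6


TB = 10**12  # assumption: decimal terabyte

snapshot_seconds = 4 * 60      # about 4 minutes from a snapshot
b2d_seconds = 14 * 3600        # about 14 hours from backup-to-disk

snapshot_rate = effective_throughput_mb_s(TB, snapshot_seconds)
b2d_rate = effective_throughput_mb_s(TB, b2d_seconds)
speedup = b2d_seconds / snapshot_seconds

print(f"snapshot restore: {snapshot_rate:.0f} MB/s equivalent")
print(f"B2D restore:      {b2d_rate:.1f} MB/s")
print(f"snapshot is about {speedup:.0f}x faster")
```

The snapshot figure is best read as an equivalent rate and not as data actually moved; presumably the restore is that fast because only the changed blocks have to be dealt with, which fits the delta-only snapshot space described above.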

I hope this helps!!


Protocol isn't everything — but it's worth considering.


Should I use a Lagged Copy in my DAG?