Should I use a Lagged Copy in my DAG?

Customer asks:

I hope you remember the discussion we had back in October 2010 around the need for backups for Exchange.  Well, I feel like we are running in circles around Exchange 2010 backup and recovery.  I felt relatively comfortable with the idea until recently, when we started making some further changes to the environment in response to the growth of the log files.  We are attempting to switch to circular logging because Avamar doesn’t reset the archive bit, and thus the logs never get truncated and grow until the volume is full.  Allegedly, having 3 copies of the db’s is supported/recommended when using circular logging, but it has raised some questions:
*   If we have 2 db servers running on vm’s in the Cincinnati data center, is it advisable to use the lag copy for the 3rd db server in Lexington or does this violate the “rule of 3”?
*   Is it better to have all 3 in the same DAG with no lag at all vs. having the lag copy?
*   In what scenario would having the lag copy be beneficial?
*   If we enable circular logging on the active db server, do all of the other servers mirror that setting?  Is this Circular Logging Continuous Replication (CLCR)?
*   If the passive copies don’t pick up the change in log file settings, is it possible/advisable to have an alternate log mechanism on the other copies?
*   Now that we have increased our capacity in Avamar, what would a reasonable DR backup plan look like?  Currently we are using weekly.
*   How do SAN snapshots fill in the gaps between backups and log files?  We use Dell/Equallogic storage today and while the auto-snapshot feature sounds nice, in reality the snapshots are huge and thus we can only keep one snapshot per db.  I’m sure you can comment on how much better EMC’s tools are and I actually welcome that discussion.

Ok… Here’s what I wrote back:

First, thank you for your message.  I have escalated your issue with the Avamar client.  Not truncating logs at the end of a successful backup is errant behavior; we need to find out why this is not working for you.
Second… I’ll take your questions in a Q&A format:
Q *   If we have 2 db servers running on vm’s in the Cincinnati data center, is it advisable to use the lag copy for the 3rd db server in Lexington or does this violate the “rule of 3”?
A *  Yes, a lagged copy violates the “rule of three”: the three copies act as block-for-block voting members when repairing torn pages (a likely scenario if you are running on SATA direct-attached or other non-intelligent storage devices), and a lagged copy will not be counted as a voting member.
Q  *   Is it better to have all 3 in the same DAG with no lag at all vs. having the lag copy?
A * If you are running on an intelligent array (like ALL of EMC’s arrays, which perform advanced CRC tracking and scanning), you will not need Exchange to vote on page repairs (there won’t be any!), so a lagged copy can be used to go backwards in time.  Technically, you would need two “near” copies and one “far” lagged copy to satisfy your temporal repair needs.  More on this below.
Q  *   In what scenario would having the lag copy be beneficial?
A * A corruption event that affects the active copies of the database.  You could use the lagged copy, with its unplayed logs, to isolate the corruption within an unplayed log file and prevent it from infecting the only unaffected copy you have left…  This is why a backup becomes important…
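For reference, a lagged copy is just a passive copy added with a replay lag.  A minimal Exchange Management Shell sketch is below; the database and server names are placeholders, and the 3-day lag is only an example (pick a window long enough for you to detect corruption before the logs replay).

  # Add a passive copy on the Lexington server with a 3-day replay lag
  # (database and server names below are examples only)
  Add-MailboxDatabaseCopy -Identity "DB01" -MailboxServer "LEX-MBX1" `
    -ReplayLagTime "3.00:00:00" -TruncationLagTime "0.00:00:00" -ActivationPreference 3

  # Check copy health and watch the replay queue grow
  Get-MailboxDatabaseCopyStatus -Identity "DB01\LEX-MBX1"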
Q  *   If we enable circular logging on the active db server, do all of the other servers mirror that setting?  Is this Circular Logging Continuous Replication (CLCR)?
A *   In a word, YES.  See http://technet.microsoft.com/en-us/library/dd876874.aspx#FleMaiPro
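To be explicit about why: circular logging is a property of the database object itself, not of any one server, so every copy honors it.  A minimal sketch (example database name; when the database has passive copies this becomes continuous replication circular logging, and the change may require a dismount/remount to take effect):

  # Enable circular logging on the database; the setting applies to all copies
  Set-MailboxDatabase -Identity "DB01" -CircularLoggingEnabled $true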
Q *   If the passive copies don’t pick up the change in log file settings, is it possible/advisable to have an alternate log mechanism on the other copies?
A *  No.  Circular logging is a property of the database itself, so you cannot run a different logging scheme on individual copies.
Q  *   Now that we have increased our capacity in Avamar, what would a reasonable DR backup plan look like?  Currently we are using weekly.
A *  Excellent question!!  Microsoft’s Exchange product group has begun to intermix the notions of HA, backup/recovery, and DR.  In a perfect world, you would have six copies of each database (sounds like SAP, eh?): two at the primary data center (PDC), three at the backup data center (BDC), plus one more lagged copy at the BDC.  In that architecture you can use Exchange Native Backup, but that doesn’t answer your question.  Assuming that we can get Avamar to truncate your logs, a DR plan would look like this: use a two-site DAG with two local copies and one remote copy, all of them standard (non-lagged) copies.  You will place a database activation block on the remote databases to prevent mail from being accidentally routed to the remote copies (a minimal command sketch for the activation block follows the list below).  Now… install the Avamar client on the remote mailbox server and back it up every night (full only!!).  Keep the backups for 14 days.  Also, at the PDC, configure Reserved LUN Pool space for one of the mailbox servers; it does not matter whether that server hosts active or passive copies.  Configure Replication Manager to create VSS snapshots of all the databases on that server.  Execute snaps once per day.  Never back them up.  Expire the snapshots after eight days.  This will provide:
* Local HA in real-time via DAG replication, including the ability to perform patch management every week with no downtime,
* Local BC at less than a 24-hour RPO.  This will allow you to recover your database in the strange event that a rolling corruption makes it to the remote data center.
* Remote failover for DR/BC and site maintenance.
* Remote BC/DR in the event of a smoking hole during a corruption event.
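For the activation-block piece of the plan above, here is a minimal Management Shell sketch.  The server and database names are placeholders, and the per-copy variant assumes Exchange 2010 SP1:

  # Block automatic activation of every copy hosted on the remote (BDC) server
  Set-MailboxServer -Identity "LEX-MBX1" -DatabaseCopyAutoActivationPolicy Blocked

  # Or block activation of a single copy while leaving replication running (SP1)
  Suspend-MailboxDatabaseCopy -Identity "DB01\LEX-MBX1" -ActivationOnly -Confirm:$false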
Q *   How do SAN snapshots fill in the gaps between backups and log files?  We use Dell/Equallogic storage today and while the auto-snapshot feature sounds nice, in reality the snapshots are huge and thus we can only keep one snapshot per db.  I’m sure you can comment on how much better EMC’s tools are and I actually welcome that discussion.
A *  Hmmm… Yes.  EMC’s snapshots on CX4 and VMAX require only enough space to hold the deltas, not a copy of the entire LUN.  Array-based snapshots do not actually fill a gap between log files and backups; however, they provide an additional layer of recoverability that closes the temporal gap associated with recovery from backup.  For example, I can recover a 1TB database from a snapshot in about four minutes.  I can recover a 1TB database from DISK (B2D) in about 14 hours!!
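To put rough numbers on that temporal gap, here is a quick back-of-envelope calculation.  The 20MB/s figure is only an assumed effective B2D restore rate (your Avamar numbers will differ), while a snapshot restore is a pointer-based rollback that completes in minutes regardless of database size:

  # Illustrative restore-time math for a 1TB database (assumed throughput)
  $sizeMB      = 1TB / 1MB        # 1,048,576 MB
  $b2dRateMBps = 20               # assumed effective backup-to-disk restore rate
  "{0:N1} hours to restore from B2D" -f ($sizeMB / $b2dRateMBps / 3600)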
I hope this helps!!

Can I boot my Hyper-V Guests from SAN?

Yes, you can (and should) boot your Hyper-V children/guests from your shared storage array.  Here’s how:

Create a Hyper-V cluster of at least two Hyper-V Parent machines using a third File Share Witness (I used a share off my DC).  After you create your Failover Cluster, zone and mask storage to both nodes of your new cluster.  Use the Failover Cluster Manager to add the newly zoned storage as a cluster resource, then convert the volume to a Cluster Shared Volume.  This is far easier than I make it sound.
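If you would rather script those steps than click through them, a rough equivalent on Windows Server 2008 R2 looks like this (cluster, node, share, and disk names are placeholders):

  Import-Module FailoverClusters

  # Build a two-node cluster and point the quorum at a file share witness
  New-Cluster -Name "HVCLUS1" -Node "HV1","HV2"
  Set-ClusterQuorum -NodeAndFileShareMajority "\\DC1\FSW"

  # Add the newly zoned/masked LUN, then convert it to a Cluster Shared Volume
  Get-ClusterAvailableDisk | Add-ClusterDisk
  Add-ClusterSharedVolume -Name "Cluster Disk 1"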

Now, use the Hyper-V Manager to edit your default properties so that virtual machines and virtual machine definition files are stored on the newly created CSV.  Next, use the Failover Cluster Manager to create a new Virtual Machine, and make sure you are storing it on the CSV.  Voilà.  Done.  You are now booting your VM from SAN.
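In Windows Server 2008 R2 the VM creation itself is a GUI (or WMI) task, but making an existing VM highly available can also be scripted.  The VM name below is a placeholder, and the VM’s files are assumed to already live on the CSV (C:\ClusterStorage\Volume1 by default):

  # Turn an existing Hyper-V VM into a clustered (highly available) resource
  Add-ClusterVirtualMachineRole -VirtualMachine "MyGuestVM"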

The data partitions can now be stored in the same CSV (NOT recommended!!!), stored on a second or third CSV, stored on a SCSI passthrough volume, or stored via an iSCSI initiator; all on the SAN, in separate storage pools, RAID groups, etc.

A Primer for Laptop SSD Buyers (draft)

This blog entry attempts to assist the mobile user as he/she begins to search for an SSD replacement for his/her existing rotational drive.

Components:

Every drive has its own controller, MLC NAND, and firmware features.  Every manufacturer puts each drive line together using these building blocks.  Each aspect of the controller, the NAND, and the firmware has profound effects on a drive’s performance, longevity, and interoperability with the operating system supporting it.

The controllers are manufactured by SandForce, Samsung, Indilinx (Barefoot & Amigos), and JMicron.  Each controller manufacturer may have several models of controllers to pick from.  The most popular consumer controllers at the moment are: Indilinx IDX110M00-FC “Barefoot”, Intel PC29AS21AA0, JMicron JMF612, Toshiba T6UG1XBG, Samsung S3C29RBB01-YK40, Marvell 88SS8014-BHP2, SandForce SF-1200/1500, and now the Marvell 88SS9174-BJP2 SATA-III SSD controller.  Samsung & Indilinx seem to have the best reputation for stutter-free performance.  Samsung’s controller is in its second version.  Indilinx has been around the longest and is described as “stutter free,” with integrated cache and “smooth operation.”  SandForce has introduced “DuraClass Technology” in its controllers (the SF-1500, and the lower-cost, lower-performance SF-1200) and firmware set to offer an entirely new way to commit data into the NAND flash array (more on this below).  It is so revolutionary that it virtually eliminates the need for TRIM.

The Multi-Level Cell (MLC) NAND comes from JMicron, Samsung, & Intel.  Each NAND manufacturer, of course, has various MLC sets of varying speeds and densities.

The firmware can contain wear-leveling code in addition to support for TRIM (a standardized way for the OS to tell the drive which blocks are no longer in use; if the OS supports TRIM, the drive does not have to guess which cells can be reclaimed for writing).  Other features include garbage collection (the process of moving data from “used” cells into other cells to make incoming writes more efficient), power management, TRIM management, data-protection algorithms for reducing data loss in the event of power failure, data management to reduce maximum write latency, and “full drive” data management features to increase performance as the drive reaches capacity.

Examples of controllers and which drives use them:

Samsung S3C29RBB01-YK40 (second generation Samsung controller):
Samsung PB22-J
Corsair Performance Series (seems to actually BE the Samsung PB22-J)

SandForce SF-1200 or SF-1500:
A-Data S599 Series
Corsair Force Series
Mushkin Callisto Series
OCZ Vertex LE Series
OCZ Vertex 2 Series
OCZ Agility Series
RunCore Pro-V Series
Super Talent TeraDrive Series
Unigen
AMP SATAsphere Series
IBM
Viking

Indilinx Barefoot:
Nearly every “mainstream” second-generation drive uses the Barefoot
OCZ Vertex Series
Crucial M225
Corsair Nova (V) and Reactor

Here is a matrix of the OCZ drives, the speeds, and controllers they use: http://www.ocztechnology.com/res/manuals/OCZ_%20SSDs_quick_compare_5.pdf
 

What to Look for in a drive:

Drives should be of the appropriate size for only those things that you want to run quickly.  For example, you may not want to store your 90GB music files on an SSD, but you will want to store your Operating System and paging file on an SSD.  You may want to store your multi-megabyte jpeg and raw photo files on SSD depending on what you are doing with them, i.e. editing, tagging, etc.

Drives should have an algorithm to avoid premature aging.  Some drives have background garbage collection (BGC) routines.  While BGC does increase the overall write performance of your drive over time, it also shortens its life by over-using the cells as it relocates data.  TRIM is an OS-enabled partnership with the drive that lets the OS tell the controller which blocks are no longer in use, so the controller always knows which cells are “best” to write to at any moment.  TRIM promises to extend the life of drives by avoiding cell over-use and premature cell aging.  TRIM is loaded with faults, however, and is really intended for workstations and laptops.  TRIM does nothing to solve the long-term performance and durability of drives installed in servers and under RAID controllers.  To solve the issues of long-term performance stability and drive durability, other controller manufacturers have approached the issue from a completely new direction; see Write Amplification below.
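On Windows 7 you can verify whether the OS is actually issuing TRIM with a one-line check (run it from an elevated prompt):

  fsutil behavior query DisableDeleteNotify
  # DisableDeleteNotify = 0  ->  TRIM commands are being sent to the SSD
  # DisableDeleteNotify = 1  ->  TRIM is disabled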

Speed is certainly a concern for every SSD.  Several controllers directly address performance in several ways: 1) throughput in IOs/s, 2) bandwidth in MB/s, and 3) sustained performance over time.  Every controller manufacturer has new controllers (since mid-2009) that bring read and write performance into the 200MB/s range.  A small number of controller and Single-Level Cell (SLC) combinations can bring performance over 300MB/s.
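The two ratings are related by IO size, which is worth remembering when comparing spec sheets.  A quick conversion, with purely illustrative numbers:

  # Bandwidth (MB/s) = IOPS x IO size
  $iops = 50000
  $ioKB = 4
  "{0:N0} MB/s at {1}KB IOs" -f ($iops * $ioKB / 1024), $ioKB    # ~195 MB/s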

Firmware support is crucial (no pun intended).  Firmware updates allow the manufacturers to add features like TRIM, power management, and ECC recovery algorithms.  Almost every drive manufacturer has allowed for firmware updates to their drives.  The downside is that nearly every vendor currently dictates a complete re-initialization of the drive (total loss of every bit of data on the drive), so a complete reload is necessary after the firmware is installed.  Manufacturers such as Crucial are working to avoid this inconvenience in future firmware releases.

Price is always a concern.  SandForce, for example, has released a new version of its amazing SF-1500 controller and firmware set in an attempt to compete with lower-cost controllers from the likes of Indilinx.  The new SF-1200 controller offers reduced performance compared to the SF-1500, but still offers SandForce’s DuraClass write-leveling technology at a substantially lower cost.

A look at Specific Drives:

Crucial, Corsair, Intel, and OCZ have all released updated firmware sets; some drives were able to receive the new firmware, while other, previous-generation drives could not implement all of the new feature sets.

Below is an example of firmware updates from Crucial.  The Crucial M225 and RealSSD C300 got firmware updates this year (January and May, respectively).  Both drives have added support for TRIM.  You might notice that the M225 has also added/modified its “wear leveling algorithm”.  These algorithms go above and beyond what TRIM provides.

RealSSD C300 (Marvell SATA-III controller)

Release Date: 5/20/2010
Change Log:
Improved Power Consumption
Improved TRIM performance
Enabled the Drive Activity Pin (Pin 11)
Improved Robustness due to unexpected power loss
Improved data management to reduce maximum write latency
Improved Performance of SSD as it fills up with data
Improved Data Integrity

M225 (Indilinx Barefoot controller)
Release Date: 1/21/2010
Change Log:
Fixed issue that sometimes causes firmware download problem
Fixed issue that could cause 256GB to be corrupted
Eliminated performance degradation over time with Wiper with 1819 FW
Fixed issue where the power cycle count was incorrectly being reported with 1819 FW
Fixed issue where some SATA 1 hosts weren’t correctly identifying the hardware
Fixed issue found in simulation (not in the field) where the free block count was incorrectly being reported
Fixed issue with remaining life not being properly displayed on SMART information
Added support for additional NAND manufacturers and capacities
Made further improvements to wear leveling algorithm

Corsair Performance Series (P256)
Has a Samsung controller and Samsung-provided firmware

The “Force” series of drives from Corsair are based on the SandForce SF-1200; they were introduced in May of 2010, so this technology is really new and revolutionary. http://www.sandforce.com/index.php?id=19&parentId=2
OCZ also uses the SandForce controllers (the Vertex Limited Edition line uses the higher-end SF-1500 controller, while the Vertex 2 and Agility 2 lines use the less expensive SF-1200 controller).

Corsair has a great blog entry describing “write amplification”: http://blog.corsair.com/?p=3044

The SandForce controller mentioned above brings DuraClass Technology to market…  Here’s an excerpt from Kevin Conley’s blog at Corsair:

“SandForce demonstrated that through its innovative DuraClass technology, Write Amplification factors below 1 could actually be achieved. Not only that but also without the use of a large (and expensive) external Data Cache. As noted in some other blogs, this data intelligence utilizes data-dependent compression techniques coupled with other “secret sauce” algorithms to reduce the amount of data to write in the first place, in some cases quite significantly. The SandForce SSD processor then manages the programming of data using very efficient Page Management algorithms that prevent the need for Garbage Collection down the road. The net result of this is a Write Amplification much lower than other SSD controllers achieve, and thus the screaming fast write performance demonstrated by the Corsair Force Series solid-state drives. This is an even more amazing feat when considering these SSDs use MLC memory but compete with enterprise-class solutions utilizing much faster SLC memory.”
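As a concrete definition, write amplification is the ratio of data the controller actually writes to NAND to the data the host asked it to write, so a factor below 1 means less is committed to flash than was sent.  A toy calculation, with made-up numbers:

  # Write amplification = bytes written to NAND / bytes written by the host
  $hostGB = 100
  $nandGB = 65      # hypothetical figure after DuraClass-style compression
  "Write amplification factor: {0:N2}" -f ($nandGB / $hostGB)    # 0.65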

Drives I would buy (based on price, speed, and durability/longevity):
OCZ Vertex 2 (SF-1200) – up to 50,000 4k IOPS
OCZ Vertex LE (SF-1500 LE)
OCZ Agility 2 (SF-1200) – up to 10,000 4k IOPS
Mushkin Callisto (SF-1200)
Corsair Force (SF-1200)



.vhd, direct iSCSI, and SCSI Passthrough

Many of my peers have debated the three basic storage device connectivity options for Hyper-V for many months. After much debate, I decided to jot down some ideas to directly address concerns regarding SCSI passthrough vs. iSCSI in-guest initiator access vs. VHD. I approach the issues from two vantage points, then make some broad generalizations, conclusions, and offer my sage wisdom 😉
  1. Device management
  2. Capacity limitations
  3. Recommendations


Device management:

  • SCSI-passthrough devices are drives presented to the parent partition — they are assigned to a specific child VM; the child VM then “owns” the disk resource. The issues that come from this architecture have to do with the “protection” of the device. Because not ALL SCSI instructions are passed into the child (by default), array-based management techniques cannot be used. Along comes EMC Replication Manager.  Thanks to the vigilant work of the EMC RM team, they have discovered the Windows Registry Entry for filtering SCSI commands and provided instructions for turning SCSI filtering off for the LUNs you need to snap and clone.  This is big news because Windows Server 2008 used to break SAN-based tools. For example, prior to this update you could not snap/clone the array’s LUNs because the array could not effectively communicate with the child VM. Now, array-based replication technologies CAN still be used. In addition to clones and snaps, the SCSI-passthrough device can be failed-over to a surviving Hyper-V node — either locally for High Availability or remotely for Disaster Recovery. Both RecoverPoint and MirrorView support Cluster Enabled automated failover.
  • …and now the rest of the story: Both Fibre Channel and iSCSI arrays can present storage devices to a Hyper-V parent; however, differences in total bandwidth ultimately divide these two technologies. iSCSI depends on two techniques for increasing bandwidth past the 1Gbps (roughly 60MB/s effective) connection speed of a single pathway: 1.) iSCSI Multiple Connections per Session (MCS) and 2.) NIC teaming. Most iSCSI targets (arrays) are limited to 4 iSCSI pathways per controller. When MCS or NIC teaming is used, the maximum bandwidth the parent can bring to its child VMs is 240MB/s. That is a non-trivial amount, but 240MB/s is a four-NIC total for the entire HV node, not just the HV child! On the other hand (not the Left Hand…), Fibre Channel arrays and HBAs are equipped with dual 8Gbps interfaces; each interface can produce a whopping 720MB/s of sustained bandwidth when copying large-block IO. In fact, 8Gbps interfaces can carry over 660MB/s when carrying 64KB IOs and slightly less as IO sizes drop to 8KB and below. When using Hyper-V with EMC CLARiiON arrays, EMC PowerPath software provides advanced pathway management and “fuses” the two 8Gbps links together, bringing more than 1,400MB/s to the parent and child VMs (the quick arithmetic sketch after this list shows the math). In addition, because FC uses a purpose-built lossless network, there is never competition for the network, switch backplane, or CPU.
  • iSCSI in-guest initiator presents the “data” volume to child VMs via in-parent networking out to an external storage device — CLARiiON, Windows Storage Server, NAS device, etc. iSCSI in-guest device mapping is Hyper-V’s “expected” pathway for data volume presentation to virtual machines — it truly offers the richest “features” from a storage perspective — Array-based clones and snaps can be taken with ease, for example. With iSCSI devices, there are no management limitations for Replication Manager: snaps and clones can be directly managed by the RM server/array. Devices can be copied and/or mounted to backup VMs, presented to Test/Dev VMs, and replicated to DR sites for remote backup.
  • …and now, the rest of the story — an iSCSI in-guest initiator must use the CPU of the parent in order to packetize/depacketize the data from the IP stream (or use the dedicated resources of a physical TCP Offloading NIC placed in the HV host) — this additional overhead is usually not noticed, except when performing high IO operations such as backups/restores/data loads/data dumps — keep in mind that Jumbo frames must be passed from the storage array, through the network layer, into each guest. Furthermore, each guest/child must use 4 or more virtual NICs to obtain iSCSI bandwidth near the 240MB/s target. The CPU cycles an in-guest initiator can consume are often 3-10% of the child’s CPU usage — the more child VMs, the more parent CPU will be devoted to packetizing data.
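The bandwidth comparison above is simple multiplication; a quick sketch using the per-path figures quoted in this post:

  # iSCSI: four 1Gbps paths at the ~60MB/s effective rate used above
  $iscsiMBps = 4 * 60      # 240 MB/s for the entire Hyper-V node
  # Fibre Channel: two 8Gbps HBA ports at ~720MB/s each, fused by PowerPath
  $fcMBps = 2 * 720        # ~1,440 MB/s to the parent and its children
  "iSCSI: {0} MB/s   FC: {1} MB/s" -f $iscsiMBps, $fcMBps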

Capacity limitations:

  • VHDs have a well-known limit of 2TB; iSCSI and SCSI-passthrough devices are not limited to 2TB and can be formatted for 16TB or more depending on the file system chosen. Beyond Hyper-V’s three basic VM connectivity types, there is the concept of the Cluster Shared Volume (CSV). Multiple CSVs can be deployed, but their primary goal for Hyper-V is to store virtual machines, not child VM data. CSVs can be formatted with GPT and allowed to grow to 16TB.
  • …and now, the rest of the story: Of course, in-guest iSCSI and SCSI passthrough are exclusive of CSVs. VHDs can sit on a CSV, but CSVs cannot present “block storage” to a child. Using a CSV implies that nothing on it will be more than 2TB in size. Furthermore, at more than 2TB, recovery becomes more important than the size of the volume. Recovering a >2TB device at 240MB/s, for example, will take as little as 2.9 hours and usually as much as 8.3 hours, depending greatly on the number of threads the restoration process can run (a quick calculation follows below). >2TB restorations can take more than 24 hours if threading cannot be maximized. To address capacity issues related to file-serving environments, a Boston-based company called Sanbolic has released a file system alternative to Microsoft’s CSV called Melio 2010. Melio is purpose-built to address clustered storage presented to Hyper-V servers that serve files. Melio provides multi-node locking, QoS, and enterprise reporting. http://www.sanbolic.com/Hyper-V.htm Melio is amazing technology, but honestly does nothing to “fix” the 2TB limit of VHDs.
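For the recovery-time point, a quick back-of-envelope calculation.  The rates are assumed effective restore rates (well-threaded vs. poorly threaded) and roughly reproduce the 2.9- and 8.3-hour figures above:

  # Hours to restore a 2TB device at an assumed effective restore rate
  $sizeMB = 2TB / 1MB
  foreach ($rateMBps in 200, 70) {
      "{0,3} MB/s -> {1:N1} hours" -f $rateMBps, ($sizeMB / $rateMBps / 3600)
  }
  # 200 MB/s -> ~2.9 hours; 70 MB/s -> ~8.3 hours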


Conclusion/Recommendations

  • iSCSI in-guest initiators should be used where cloning and snapping of data volumes is paramount to the operations of the VM under consideration. SQL Server and SharePoint are two primary examples.
  • FC-connected SCSI devices should be used when high bandwidth applications are being considered.
  • Discrete array-based LUNs should always be presented for all valuable application data. Array-based LUNs allow cluster failover of discrete VMs with their data as well as array-based replication options.
  • CSVs should be used for “general purpose” storage of Virtual Machine boot drives and configuration files.
  • Sanbolic Melio FS 2010 should be considered for highly versatile clustered shared storage.