.vhd, direct iSCSI, and SCSI Passthrough

Many of my peers have debated the three basic storage device connectivity options for Hyper-V for many months. After much debate, I decided to jot down some ideas to directly address concerns regarding SCSI-passthrough vs. iSCSI in-guest initiator access vs. VHD. I approach the issues from two vantage points, then make some broad generalizations, draw conclusions, and offer my sage wisdom 😉
  1. Device management
  2. Capacity limitations
  3. Recommendations


Device management:

  • SCSI-passthrough devices are drives presented to the parent partition and then assigned to a specific child VM; the child VM then “owns” the disk resource. The issues that come from this architecture have to do with the “protection” of the device. Because not ALL SCSI instructions are passed into the child (by default), array-based management techniques cannot be used out of the box. Along comes EMC Replication Manager. Thanks to its vigilant work, the EMC RM team discovered the Windows Registry entry that controls SCSI command filtering and provided instructions for turning filtering off for the LUNs you need to snap and clone. This is big news because Windows Server 2008 used to break SAN-based tools. For example, prior to this update you could not snap/clone the array’s LUNs because the array could not effectively communicate with the child VM. Now, array-based replication technologies CAN be used. In addition to clones and snaps, the SCSI-passthrough device can be failed over to a surviving Hyper-V node, either locally for High Availability or remotely for Disaster Recovery. Both RecoverPoint and MirrorView support Cluster Enabled automated failover.
  • …and now the rest of the story: both Fibre Channel and iSCSI arrays can present storage devices to a Hyper-V parent; however, differences in total bandwidth ultimately divide these two technologies. iSCSI depends on two techniques for increasing bandwidth past the 1Gbps (60MB/s) connection speed of a single pathway: 1.) iSCSI Multiple Connections per Session (MCS) and 2.) NIC-teaming. Most iSCSI targets (arrays) are limited to 4 iSCSI pathways per controller. When MCS or NIC-teaming is used, the maximum bandwidth the parent can bring to its child VMs is 240MB/s (4 × 60MB/s). That is a non-trivial amount, but 240MB/s is a four-NIC total for the entire HV node, not just one HV child! On the other hand (not the Left Hand…), Fibre Channel arrays and HBAs are equipped with dual 8Gbps interfaces; each interface can produce a whopping 720MB/s of sustained bandwidth when copying large-block IO. In fact, 8Gbps interfaces can carry over 660MB/s when carrying 64KB IOs and slightly less as IO sizes drop to 8KB and below. When using Hyper-V with EMC CLARiiON arrays, EMC PowerPath software provides advanced pathway management and “fuses” the two 8Gbps links together, bringing more than 1400MB/s to the parent and child VMs. In addition, because FC uses a purpose-built lossless network, there is never competition for the network, switch backplane, or CPU.
  • iSCSI in-guest initiator presents the “data” volume to child VMs via in-parent networking out to an external storage device — CLARiiON, Windows Storage Server, NAS device, etc. iSCSI in-guest device mapping is Hyper-V’s “expected” pathway for data volume presentation to virtual machines — it truly offers the richest “features” from a storage perspective — Array-based clones and snaps can be taken with ease, for example. With iSCSI devices, there are no management limitations for Replication Manager: snaps and clones can be directly managed by the RM server/array. Devices can be copied and/or mounted to backup VMs, presented to Test/Dev VMs, and replicated to DR sites for remote backup.
  • …and now, the rest of the story: an iSCSI in-guest initiator must use the CPU of the parent to packetize/depacketize the data from the IP stream (or use the dedicated resources of a physical TCP offload NIC placed in the HV host). This additional overhead is usually not noticed, except when performing high-IO operations such as backups, restores, data loads, and data dumps. Keep in mind that jumbo frames must be passed from the storage array, through the network layer, into each guest. Furthermore, each guest/child must use 4 or more virtual NICs to obtain iSCSI bandwidth near the 240MB/s target. The CPU cycles an in-guest initiator consumes are often 3-10% of the child’s CPU usage; the more child VMs, the more parent CPU will be devoted to packetizing data.
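
The bandwidth comparison above is simple multiplication, and a minimal Python sketch makes the totals explicit. The per-path figures (60MB/s per iSCSI path, 720MB/s per 8Gbps FC port) are the ones quoted in this post, not measurements. The function name is only for illustration, and the model assumes perfect scaling across paths, which real links rarely deliver.

```python
def aggregate_mb_s(per_path_mb_s: float, paths: int) -> float:
    """Ideal aggregate bandwidth, assuming perfect scaling across paths."""
    return per_path_mb_s * paths

# Per-path figures quoted in this post (assumptions, not measurements).
iscsi_total = aggregate_mb_s(60, 4)   # four 1Gbps iSCSI paths via MCS or NIC-teaming
fc_total = aggregate_mb_s(720, 2)     # dual 8Gbps FC ports fused by multipathing

print(f"iSCSI: {iscsi_total:.0f} MB/s for the whole node")
print(f"FC:    {fc_total:.0f} MB/s")
print(f"FC advantage: {fc_total / iscsi_total:.0f}x")
```

Remember that the iSCSI total is shared by every child VM on the node, so the per-VM share shrinks as density grows.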

Capacity limitations:

  • VHDs have a well-known limit of 2TB. iSCSI and SCSI-passthrough devices are not limited to 2TB and can be formatted for 16TB or more, depending on the file system chosen. Beyond Hyper-V’s three basic VM connectivity types, there is the concept of the Cluster Shared Volume (CSV). Multiple CSVs can be deployed, but their primary goal in Hyper-V is to store virtual machines, not child VM data. CSVs can be formatted with GPT and allowed to grow to 16TB.
  • …and now, the rest of the story: of course, in-guest iSCSI and SCSI passthrough are exclusive of CSVs. VHDs can sit on a CSV, but CSVs cannot present “block storage” to a child. Using a CSV implies that nothing on it will be more than 2TB in size. Furthermore, at more than 2TB, recovery becomes more important than the size of the volume. Recovering a >2TB device at 240MB/s, for example, will take as little as 2.9 hours and usually as much as 8.3 hours, depending greatly on the number of threads the restoration process can run. >2TB restorations can take more than 24 hours if threading cannot be maximized. To address capacity issues related to file serving environments, a Boston-based company called Sanbolic has released a file system alternative to Microsoft’s CSV called Melio 2010. Melio is purpose-built to address clustered storage presented to Hyper-V servers that serve files. Melio is multi-locking and provides QoS and enterprise reporting. http://www.sanbolic.com/Hyper-V.htm Melio is amazing technology, but honestly does nothing to “fix” the 2TB limit of VHDs.
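
To get a feel for how restore time scales with volume size and sustained throughput, here is a rough Python sketch. It assumes decimal units (1TB = 1,000,000MB) and a perfectly sustained transfer rate; the 2.5TB example size is my own illustration of a ">2TB" device, and the 60MB/s rate is the single-path figure used earlier in this post. Real restores add catalog, verification, and threading overhead on top of this ideal figure.

```python
def restore_hours(capacity_tb: float, throughput_mb_s: float) -> float:
    """Hours to move capacity_tb (decimal TB) at a sustained rate in MB/s."""
    megabytes = capacity_tb * 1_000_000
    return megabytes / throughput_mb_s / 3600

# Illustrative sizes and rates (assumptions, not measurements).
for tb in (2.0, 2.5):
    for rate in (240, 60):
        print(f"{tb} TB at {rate} MB/s: {restore_hours(tb, rate):.1f} h")
```

The point is less the exact numbers than the shape: if the restore cannot keep all four paths busy, the time grows in direct proportion to the lost throughput.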


Conclusion/Recommendations

  • iSCSI in-guest initiators should be used where cloning and snapping of data volumes is paramount to the operations of the VM under consideration. SQL Server and SharePoint are two primary examples.
  • FC-connected SCSI devices should be used when high bandwidth applications are being considered.
  • Discrete array-based LUNs should always be presented for all valuable application data. Array-based LUNs allow cluster failover of discrete VMs with their data as well as array-based replication options.
  • CSVs should be used for “general purpose” storage of Virtual Machine boot drives and configuration files.
  • Sanbolic Melio FS 2010 should be considered for highly versatile clustered shared storage.

Virtual Machine Licensing

It is a widely known “fact” that the Datacenter Edition of Windows Server 2008 allows an “unlimited” number of Virtual Machines to be installed and run on the machine for which WS2k8DCE was licensed. When you buy Datacenter Edition, you pay for each physical processor (each socket filled with a CPU) currently installed in the server hardware. For example, if you have a Dell 1950 III, you might have two processors with four cores on each processor. In this case, you need to pay for, of course, two processors. WS2k8DC has an MSRP of $2999/proc, so the OS will cost you $5998.
The good news is that you can now install any number of Windows Servers on that Dell 1950 III as virtual machines (running on the Hyper-V layer). So… if I were a clever admin, I’d put 2TB of PC2-5300 into the box and install a gajillion servers on it.
…and now the rest of the story:
First of all, I can’t put 2TB of RAM into the server. But let’s just say I could. If I want support from Microsoft, I can only place 384 child machines on a Datacenter server.
But wait, there’s more… If I like Live Migration and Failover Clustering, I can buy another 1950 and “join” it to my first DC in an MNS cluster (let’s forget about the network, the storage, etc. for now). I would need another license of WS2k8DC at $5998. My OS license cost is now $11,996 (list) and I can run as many machines as I can shake a stick at… or can I?!
The part of the story that you won’t hear too often is the support limitations regarding virtual machine densities for stand-alone servers versus clustered servers. Net/Net: clustered servers are only supported with fewer than 65 virtual machines (that is, at most 64) per cluster node. In a 16-server cluster, you can run 1024 machines. That’s it, period. So, let’s go back to the CFO to pay for this… 16 nodes of Datacenter on Dell 1950 IIIs will set me back $95,968. I can run 1024 servers, so each server costs me $93.72. Amazing, really amazing. And let’s not forget: this INCLUDES the cost of the hypervisors and rudimentary management software. Individually licensed servers would have cost me $999 each, for a total of $1,022,976. To be fair, no one would buy servers that way; with Open, Select, AE, etc., the per-server cost wouldn’t be $999, it would be about half that. However, it’s easy to see a difference between $93.72 and $500!
Let’s also consider the “flexibility” of the cost model: if I were willing to pay $500 for each server license, what is the minimum number of servers I can deploy and still justify the cost of WS2k8DC licenses in my 16-node cluster? Simple: divide $95,968 by $500 and see that if we implement fewer than 192 virtual machines (as few as 12 servers per node), we need to start dropping nodes to make the math work.
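
For anyone who wants to rerun the licensing math with their own numbers, here is a small Python sketch of the arithmetic above. The list price, socket count, node count, and 64-VM-per-node limit come from this post; the $500 per-server figure is the same rough volume-discount assumption used above, and the constant names are only illustrative.

```python
import math

PRICE_PER_PROC = 2999      # WS2k8DC list price per processor, from this post
PROCS_PER_NODE = 2         # dual-socket Dell 1950 III
NODES = 16
MAX_VMS_PER_NODE = 64      # clustered support limit cited in this post
PER_SERVER_LICENSE = 500   # assumed volume-discounted price per server

cluster_cost = PRICE_PER_PROC * PROCS_PER_NODE * NODES
max_vms = NODES * MAX_VMS_PER_NODE
cost_per_vm = cluster_cost / max_vms

# Smallest VM count at which Datacenter is no more expensive than per-server licenses.
break_even_vms = math.ceil(cluster_cost / PER_SERVER_LICENSE)

print(f"Cluster licensing: ${cluster_cost:,}")
print(f"Cost per VM at full density: ${cost_per_vm:.2f}")
print(f"Break-even: {break_even_vms} VMs ({break_even_vms / NODES:.0f} per node)")
```

Changing NODES or PER_SERVER_LICENSE shows quickly how sensitive the break-even point is to cluster size and to whatever discount you actually negotiate.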