A Primer for Laptop SSD Buyers (draft)

This blog entry attempts to assist mobile users as they begin the search for an SSD replacement for their existing rotational drives.

Components:

Every drive has its own controller, MLC NAND, and firmware features.  Every manufacturer puts each drive line together using these building blocks.  Each of these aspects (the controller, the NAND, and the firmware) has a profound effect on a drive’s performance, longevity, and interoperability with the operating system supporting it.

The controllers are manufactured by SandForce, Samsung, Indilinx (Barefoot & Amigos), and JMicron.  Each controller manufacturer may have several models of controllers to pick from.  The most popular consumer controllers at the moment are: Indilinx IDX110M00-FC “Barefoot”, Intel PC29AS21AA0, JMicron JMF612, Toshiba T6UG1XBG, Samsung S3C29RBB01-YK40, Marvell 88SS8014-BHP2, SandForce SF-1200/1500, and now the Marvell 88SS9174-BJP2 SATA-III SSD controller.  Samsung and Indilinx seem to have the best reputation for stutter-free performance.  Samsung’s controller is in its second version.  Indilinx has been around the longest and is known for stutter-free, “smooth” operation with an integrated cache.  SandForce has introduced “DuraClass Technology” in its controllers (the SF-1500, and the lower-cost, lower-performance SF-1200) and firmware set to offer an entirely new way to commit data to the NAND flash array (more on this below).  It is so revolutionary that it virtually eliminates the need for TRIM.

The Multi-Level Cell (MLC) NAND comes from Micron, Samsung, and Intel.  Each NAND manufacturer, of course, has various MLC parts of varying speeds and densities.

The firmware can contain wear-leveling code in addition to support for TRIM (a standardized way for the OS to “tell” the drive which cells no longer hold valid data and are therefore available for writing; if the OS supports TRIM, the drive no longer has to guess which cells are stale).  Other features include garbage collection (the process of consolidating data from partially “used” cells into other cells to make incoming writes more efficient), power management, TRIM management, data-protection algorithms for reducing data loss in the event of power failure, data management to reduce maximum write latency, and “full drive” data-management features to maintain performance as the drive reaches capacity.
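As a toy illustration of the garbage-collection idea described above (the block geometry here is made up, and real firmware is far more sophisticated):

```python
# Toy model of SSD garbage collection (illustrative only).  NAND is erased
# in whole blocks, so to reclaim a block that still holds some valid pages,
# the controller must first copy those pages elsewhere -- these extra
# internal copies are the hidden cost of garbage collection.

PAGES_PER_BLOCK = 4  # hypothetical geometry

def collect(block):
    """Return (valid_pages_to_relocate, pages_freed) for one block.

    block is a list of page states: 'valid', 'stale' (overwritten or
    TRIMmed data), or 'free'.
    """
    valid = [p for p in block if p == 'valid']
    # The whole block is erased, so every page becomes free afterwards.
    return len(valid), PAGES_PER_BLOCK

# A block where the host has overwritten 3 of 4 pages: GC must relocate
# 1 valid page to reclaim 4 erased pages.
relocated, freed = collect(['valid', 'stale', 'stale', 'stale'])
print(relocated, freed)  # 1 4
```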

Examples of controllers and which drives use them:

Samsung S3C29RBB01-YK40 (second generation Samsung controller):
Samsung PB22-J
Corsair Performance Series (seems to actually BE the Samsung PB22-J)

SandForce SF-1200 or SF-1500:
A-Data S599 Series
Corsair Force Series
Mushkin Callisto Series
OCZ Vertex LE Series
OCZ Vertex 2 Series
OCZ Agility Series
RunCore Pro-V Series
Super Talent TeraDrive Series
Unigen
AMP SATAsphere Series
IBM
Viking

Indilinx Barefoot:
Nearly every “mainstream” second-generation drive uses the Barefoot
OCZ Vertex Series
Crucial M225
Corsair Nova (V) and Reactor

Here is a matrix of the OCZ drives, the speeds, and controllers they use: http://www.ocztechnology.com/res/manuals/OCZ_%20SSDs_quick_compare_5.pdf
 

What to Look for in a drive:

Drives should be sized appropriately for only those things that you want to run quickly.  For example, you may not want to store 90GB of music files on an SSD, but you will want to store your operating system and paging file on one.  You may want to store your multi-megabyte JPEG and raw photo files on an SSD depending on what you are doing with them, i.e. editing, tagging, etc.

Drives should have an algorithm to avoid premature aging.  Some drives have background garbage collection (BGC) routines.  While BGC does increase the overall write performance of your drive over time, it also shortens the drive’s life by over-using the cells as it relocates data.  TRIM is an OS-enabled partnership with the drive that lets the drive know which cells are “best” to write to at any moment.  TRIM promises to extend the life of drives by avoiding cell over-use and premature cell aging.  TRIM has its faults, however, and is really intended for workstations and laptops: it does nothing to solve the long-term performance and durability of drives installed in servers and under RAID controllers.  To solve the issues of long-term performance stability and drive durability, other controller manufacturers have approached the issue from a completely new direction — see Write Amplification below.
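To see why relocation traffic matters for longevity, here is a rough back-of-the-envelope endurance model; the capacity, P/E-cycle, and write-amplification numbers below are hypothetical, not specs for any drive discussed here:

```python
# Back-of-the-envelope SSD endurance estimate (hypothetical numbers).
# Every extra internal copy made by background garbage collection counts
# against the cells' limited program/erase (P/E) budget.

def total_host_writes_tb(capacity_gb, pe_cycles, write_amplification):
    """Total host data (TB) writable before the P/E budget is exhausted."""
    raw_writes_gb = capacity_gb * pe_cycles  # what the NAND itself can absorb
    return raw_writes_gb / write_amplification / 1000

# 128GB of MLC rated for ~5,000 P/E cycles:
print(total_host_writes_tb(128, 5000, 1.0))  # ideal (WA = 1): 640.0 TB
print(total_host_writes_tb(128, 5000, 4.0))  # aggressive GC (WA = 4): 160.0 TB
```

The same NAND lasts a quarter as long when the controller writes four bytes internally for every host byte, which is the trade-off BGC makes for speed.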

Speed is certainly a concern for every SSD.  Several controllers directly address performance in several ways: 1) throughput in IOPS, 2) bandwidth in MB/s, and 3) the notion of sustained performance over time.  Every controller manufacturer has new controllers (since mid-2009) that bring read and write performance into the 200MB/s range.  A small number of controller and Single-Level Cell (SLC) combinations can push performance over 300MB/s.
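IOPS and MB/s describe the same drive from two angles, linked by the I/O size; a quick sketch of the conversion (the example figures are illustrative, not vendor specs):

```python
# Converting between IOPS and MB/s.  The two headline numbers are tied
# together by the size of each I/O: small-block workloads are quoted in
# IOPS, large-block sequential workloads in MB/s.

def iops_to_mb_per_s(iops, io_size_kb):
    return iops * io_size_kb / 1024

def mb_per_s_to_iops(mb_per_s, io_size_kb):
    return mb_per_s * 1024 / io_size_kb

# A drive quoted at 50,000 4KB IOPS is moving ~195MB/s of small-block data:
print(round(iops_to_mb_per_s(50_000, 4), 1))  # 195.3
# A 200MB/s sequential figure at 64KB I/Os implies far fewer operations:
print(int(mb_per_s_to_iops(200, 64)))         # 3200
```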

Firmware support is crucial (no pun intended).  Firmware updates allow the manufacturers to add features like TRIM, power management, and ECC recovery algorithms.  Almost every drive manufacturer has allowed for firmware updates to their drives.  The downside is that nearly every vendor currently dictates a complete re-initialization of the drive — total loss of every bit of data on it — so a complete reload is necessary after the firmware is installed.  Manufacturers such as Crucial are working to avoid this inconvenience in future firmware releases.

Price is always a concern.  SandForce, for example, has released a new version of its amazing SF-1500 controller and firmware set in an attempt to compete with lower-cost controllers from the likes of Indilinx.  The new SF-1200 controller offers reduced performance compared to the SF-1500, but still offers SandForce’s DuraClass technology at a substantially lower cost.

A look at Specific Drives:

Crucial, Corsair, Intel, and OCZ have all released new firmware sets; some drives were able to receive the new firmware, while other, previous-generation drives could not implement all of the new features.

Below is an example of firmware updates from Crucial.  The Crucial M225 and RealSSD C300 got firmware updates this year (January and May respectively).  Both drives have added support for TRIM.  You might notice that the M225 has added/modified its “wear leveling algorithm”.  These algorithms go above and beyond what TRIM provides.

RealSSD C300 (Marvell SATA-III controller)

Release Date: 5/20/2010
Change Log:
Improved Power Consumption
Improved TRIM performance
Enabled the Drive Activity Pin (Pin 11)
Improved Robustness due to unexpected power loss
Improved data management to reduce maximum write latency
Improved Performance of SSD as it fills up with data
Improved Data Integrity

M225 (Indilinx Barefoot controller)
Release Date: 1/21/2010
Change Log:
Fixed issue that sometimes causes firmware download problem
Fixed issue that could cause 256GB to be corrupted
Eliminated performance degradation over time with Wiper with 1819 FW
Fixed issue where the power cycle count was incorrectly being reported with 1819 FW
Fixed issue where some SATA 1 hosts weren’t correctly identifying the hardware
Fixed issue found in simulation (not in the field) where the free block count was incorrectly being reported
Fixed issue with remaining life not being properly displayed on SMART information
Added support for additional NAND manufacturers and capacities
Made further improvements to wear leveling algorithm

Corsair Performance Series (P256)
Has a Samsung controller and Samsung-provided firmware

The “Force” series of drives from Corsair is based on the SandForce SF-1200; the drives were introduced in May of 2010, so this technology is really new and revolutionary. http://www.sandforce.com/index.php?id=19&parentId=2
OCZ also uses the SandForce controllers (the Vertex Limited Edition line uses the higher-end SF-1500 controller, while the Vertex 2, and Agility 2 lines use the less expensive SF-1200 controller).

Corsair has a great blog entry describing “write amplification”: http://blog.corsair.com/?p=3044

The SandForce controller mentioned above brings DuraClass Technology to market… Here’s an excerpt from Kevin Conley’s blog at Corsair:

SandForce demonstrated that through its innovative DuraClass technology, Write Amplification factors below 1 could actually be achieved.  Not only that, but also without the use of a large (and expensive) external data cache.  As noted in some other blogs, this data intelligence utilizes data-dependent compression techniques coupled with other “secret sauce” algorithms to reduce the amount of data to write in the first place, in some cases quite significantly.  The SandForce SSD processor then manages the programming of data using very efficient Page Management algorithms that prevent the need for Garbage Collection down the road.  The net result of this is a Write Amplification much lower than other SSD controllers achieve, and thus the screaming fast write performance demonstrated by the Corsair Force Series solid-state drives.  This is an even more amazing feat when considering these SSDs use MLC memory but compete with enterprise-class solutions utilizing much faster SLC memory.
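The write-amplification arithmetic behind that excerpt can be sketched in a few lines; the compression ratios and garbage-collection overhead used here are hypothetical stand-ins, since DuraClass’s actual algorithms are proprietary:

```python
# Sketch of why on-the-fly compression can push write amplification (WA)
# below 1.  WA = NAND bytes actually programmed / host bytes written.
# If the controller compresses data before programming it, the numerator
# shrinks; with compressible data it can shrink below the host byte count.
# All figures below are hypothetical.

def write_amplification(host_bytes, compression_ratio, gc_overhead=1.1):
    """compression_ratio: stored size / original size (0.5 means 2:1).
    gc_overhead: multiplier for extra internal block-management writes."""
    nand_bytes = host_bytes * compression_ratio * gc_overhead
    return nand_bytes / host_bytes

# Incompressible data: only the management overhead remains, WA > 1.
print(round(write_amplification(1_000_000, 1.0), 2))  # 1.1
# Data that compresses 2:1: WA drops well below 1.
print(round(write_amplification(1_000_000, 0.5), 2))  # 0.55
```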

Drives I would buy (based on price, speed, and durability/longevity):
OCZ Vertex 2 (SF-1200) – up to 50,000 4k IOPS
OCZ Vertex LE (SF-1500 LE)
OCZ Agility 2 (SF-1200) – up to 10,000 4k IOPS
Mushkin Callisto (SF-1200)
Corsair Force (SF-1200)



.vhd, direct iSCSI, and SCSI Passthrough

Many of my peers have debated the three basic storage device connectivity options for Hyper-V for many months. After much debate, I decided to jot down some ideas to directly address concerns regarding SCSI-passthrough vs. iSCSI in-guest initiator access vs. VHD. I approach the issues from two vantage points (device management and capacity limitations), then make some broad generalizations, conclusions, and offer my sage wisdom 😉
  1. Device management
  2. Capacity limitations
  3. Recommendations


Device management:

  • SCSI-passthrough devices are drives presented to the parent partition — they are assigned to a specific child VM; the child VM then “owns” the disk resource. The issues that come from this architecture have to do with the “protection” of the device. Because not ALL SCSI instructions are passed into the child (by default), array-based management techniques cannot be used. Along comes EMC Replication Manager: thanks to vigilant work, the EMC RM team discovered the Windows registry entry for filtering SCSI commands and provided instructions for turning SCSI filtering off for the LUNs you need to snap and clone.  This is big news because Windows Server 2008 used to break SAN-based tools — prior to this update you could not snap/clone the array’s LUNs because the array could not effectively communicate with the child VM. Now, array-based replication technologies CAN still be used. In addition to clones and snaps, the SCSI-passthrough device can be failed over to a surviving Hyper-V node — either locally for High Availability or remotely for Disaster Recovery. Both RecoverPoint and MirrorView support cluster-enabled automated failover.
  • …and now the rest of the story — Both Fibre Channel and iSCSI arrays can present storage devices to a Hyper-V parent, however differences in total bandwidth ultimately divide these two technologies. iSCSI is dependent on two techniques for increasing bandwidth past the 1Gbps (60MB/s effective) connection speed of a single pathway: 1) iSCSI Multiple Connections per Session (MCS) and 2) NIC teaming. Most iSCSI targets (arrays) are limited to 4 iSCSI pathways per controller. When MCS or NIC teaming is used, the maximum bandwidth the parent can bring to its child VMs is 240MB/s — a non-trivial amount, but that 240MB/s is a four-NIC total for the entire HV node — not just the HV child! On the other hand (not the Left Hand…), Fibre Channel arrays and HBAs are equipped with dual 8Gbps interfaces — each interface can produce a whopping 720MB/s of sustained bandwidth when copying large-block IO. In fact, 8Gbps interfaces can carry over 660MB/s when carrying 64KB IOs and slightly less as IO sizes drop to 8KB and below. When using Hyper-V with EMC CLARiiON arrays, EMC PowerPath software provides advanced pathway management and “fuses” the two 8Gbps links together — bringing more than 1400MB/s to the parent and child VMs. In addition, because FC uses a purpose-built lossless network, there is never competition for the network, switch backplane, or CPU.
  • iSCSI in-guest initiator presents the “data” volume to child VMs via in-parent networking out to an external storage device — CLARiiON, Windows Storage Server, NAS device, etc. iSCSI in-guest device mapping is Hyper-V’s “expected” pathway for data volume presentation to virtual machines — it truly offers the richest “features” from a storage perspective — Array-based clones and snaps can be taken with ease, for example. With iSCSI devices, there are no management limitations for Replication Manager: snaps and clones can be directly managed by the RM server/array. Devices can be copied and/or mounted to backup VMs, presented to Test/Dev VMs, and replicated to DR sites for remote backup.
  • …and now, the rest of the story — an iSCSI in-guest initiator must use the CPU of the parent in order to packetize/depacketize the data from the IP stream (or use the dedicated resources of a physical TCP Offloading NIC placed in the HV host) — this additional overhead is usually not noticed, except when performing high IO operations such as backups/restores/data loads/data dumps — keep in mind that Jumbo frames must be passed from the storage array, through the network layer, into each guest. Furthermore, each guest/child must use 4 or more virtual NICs to obtain iSCSI bandwidth near the 240MB/s target. The CPU cycles an in-guest initiator can consume are often 3-10% of the child’s CPU usage — the more child VMs, the more parent CPU will be devoted to packetizing data.
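For quick comparison, the per-link figures used in the bullets above can be rolled up as follows (these are this discussion’s working numbers, not benchmark results):

```python
# Rough aggregate-bandwidth comparison using the per-link figures from the
# discussion above: ~60MB/s effective per GbE iSCSI path, ~720MB/s per
# 8Gbps FC port.  Illustrative arithmetic, not measurements.

def aggregate_mb_per_s(links, per_link_mb_per_s):
    return links * per_link_mb_per_s

iscsi = aggregate_mb_per_s(4, 60)   # 4 teamed GbE paths (typical target limit)
fc = aggregate_mb_per_s(2, 720)     # dual 8Gbps HBA ports, fused by multipathing
print(iscsi, fc)       # 240 1440
print(fc // iscsi)     # 6 -- FC delivers 6x the aggregate bandwidth here
```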

Capacity limitations:

  • VHDs have a well-known limit of 2TB. iSCSI and SCSI-passthrough devices are not limited to 2TB and can be formatted for 16TB or more depending on the file system chosen. Beyond Hyper-V’s three basic VM connectivity types, there is the concept of the Cluster Shared Volume (CSV). Multiple CSVs can be deployed, but their primary goal for Hyper-V is to store virtual machines, not child VM data. CSVs can be formatted with GPT and allowed to grow to 16TB.
  • …and now, the rest of the story — Of course, in-guest iSCSI and SCSI passthrough are exclusive of CSVs. VHDs can sit on a CSV, but CSVs cannot present “block storage” to a child. Using a CSV implies that nothing on it will be more than 2TB in size. Furthermore… at more than 2TB, recovery becomes more important than the size of the volume. Recovering a >2TB device at 240MB/s, for example, will take as little as 2.9 hours and usually as much as 8.3 hours — depending greatly on the number of threads the restoration process can run. >2TB restorations can take more than 24 hours if threading cannot be maximized. To address capacity issues related to file-serving environments, a Boston-based company called Sanbolic has released a file system alternative to Microsoft’s CSV called Melio 2010. Melio is purpose-built to address clustered storage presented to Hyper-V servers that serve files. Melio provides multi-node locking, QoS, and enterprise reporting. http://www.sanbolic.com/Hyper-V.htm Melio is amazing technology, but honestly does nothing to “fix” the 2TB limit of VHDs.
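The restore-time figures above follow from simple arithmetic; the 200MB/s and 70MB/s effective throughputs used below are assumed values that reproduce the 2.9- and 8.3-hour estimates quoted in the bullet:

```python
# Restore-time arithmetic for large volumes.  Real restores depend heavily
# on how many threads the restoration process can run, so the effective
# throughputs here (200MB/s and 70MB/s) are illustrative assumptions.

def restore_hours(volume_tb, throughput_mb_per_s):
    mb = volume_tb * 1024 * 1024          # TB -> MB
    return mb / throughput_mb_per_s / 3600

print(round(restore_hours(2.0, 200), 1))  # 2.9 hours, well-threaded
print(round(restore_hours(2.0, 70), 1))   # 8.3 hours, poorly threaded
```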


Conclusion/Recommendations

  • iSCSI in-guest initiators should be used where cloning and snapping of data volumes is paramount to the operations of the VM under consideration. SQL Server and SharePoint are two primary examples.
  • FC-connected SCSI devices should be used when high bandwidth applications are being considered.
  • Discrete array-based LUNs should always be presented for all valuable application data. Array-based LUNs allow cluster failover of discrete VMs with their data as well as array-based replication options.
  • CSVs should be used for “general purpose” storage of Virtual Machine boot drives and configuration files.
  • Sanbolic Melio FS 2010 should be considered for highly versatile clustered shared storage.

Virtual Machine Licensing

It is a widely known “fact” that the Datacenter Edition of Windows Server 2008 allows an “unlimited” number of virtual machines to be installed and run on the machine for which WS2k8DCE was licensed. When you buy Datacenter Edition, you pay for each physical processor (socket filled with a CPU) currently running in the server hardware. For example, if you have a Dell 1950 III, you might have two processors with four cores each. In this case, you need to pay for, of course, two processors. WS2k8DC has an MSRP of $2999/proc — so the OS will cost you $5998.
The good news is that you can now install any number of Windows Servers on that Dell 1950 III as virtual machines (running on the Hyper-V layer). So… if I were a clever admin, I’d put 2TB of PC2-5300 into the box and install a gajillion servers on it.
…and now the rest of the story:
First of all, I can’t put 2TB of RAM into the server. But let’s just say I could. If I want support from Microsoft, I can only place 384 child machines on a Datacenter server.
but wait, there’s more… if I like Live Migration and Failover Clustering, I can buy another 1950 and “join” it to my first DC in an MNS cluster (let’s forget about the network, the storage, etc. for now). I would need another license of WS2k8DC at $5998. My OS license cost is now $11,996 (list) and I can run as many machines as I can shake a stick at… or can I?!
The part of the story that you won’t hear too often is the support limitation regarding virtual machine densities for stand-alone servers versus clustered servers. Net/net: clustered servers are only supported with at most 64 virtual machines per cluster node. In a 16-server cluster, you can run 1024 machines — that’s it, period. So — let’s go back to the CFO to pay for this… 16 nodes of Datacenter on Dell 1950 IIIs will set me back $95,968. I can run 1024 servers — each server costs me $93.72. Amazing — really amazing. And let’s not forget — this INCLUDES the cost of the hypervisors and rudimentary management software. It’s really amazing. — Individually licensed servers would have cost me $999 each — for a total of $1,022,976. To be fair, no one would buy servers that way — with Open, Select, AE, etc., each of the 1024 servers wouldn’t cost $999; it would be closer to half that. However, it’s easy to see a difference between $93.72 and $500!
Let’s also consider the “flexibility” of the cost model: if I were willing to pay $500 for each server license, what is the minimum number of virtual machines I can deploy and still justify the cost of WS2k8DC licenses in my 16-node cluster? Simple — divide $95,968 by $500 and see that if we implement fewer than 192 virtual machines (as few as 12 per node), we need to start dropping nodes to make the math work.
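The licensing math above, collected in one place (2010 list prices as quoted, and the 64-VM-per-clustered-node support limit as described):

```python
# Windows Server 2008 Datacenter licensing math for a 16-node Hyper-V
# cluster, using the list prices and support limits quoted above.

DC_LICENSE_PER_PROC = 2999   # MSRP per physical processor
PROCS_PER_NODE = 2           # dual-socket Dell 1950 III
NODES = 16
MAX_VMS_PER_CLUSTERED_NODE = 64
INDIVIDUAL_LICENSE = 500     # assumed volume-discounted per-server price

cluster_license_cost = DC_LICENSE_PER_PROC * PROCS_PER_NODE * NODES
max_vms = MAX_VMS_PER_CLUSTERED_NODE * NODES

print(cluster_license_cost)                       # 95968
print(round(cluster_license_cost / max_vms, 2))   # 93.72 per VM at full density

# Break-even against individually licensed servers at ~$500 each:
print(cluster_license_cost // INDIVIDUAL_LICENSE) # 191 -> need ~192 VMs
```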