July | 2010 | Do It Right or Do It Over

Microsoft and its Parallel Data Warehouse (Madison) — a product derived from Microsoft’s purchase of DATAllegro — seems to be struggling to convince anyone that they actually understands what data warehousing really is.

As an example, Madison has no workload manager — a key element to data warehouses that allows the business to define “roles” that have access to the resources of the DW. Without a workload manager, all users/roles share the access to the valuable DW without prejudice.

In a mature (read “true”) data warehouse, “roles” are defined that allow the business to state, for example, all VIPs can run queries any time, but they cannot consume more than x resources on the system, they can only return x number of rows, they can only process x queries at one time, etc. Likewise, there is an adhoc group that might be governed more stringently and an analytics group that has higher privileges because the DW Admins can trust that the queries that the analytics group submits will not bring the DW to its knees.

Microsoft’s implementation of PDW 1.0 allows none of these controls. Instead they govern the entire system to 32 simultaneous queries. Each query can consume (theoretically) 1/32 of the Compute nodes’ resources. So there is no means to escalate, nor a means to sublimate queries. This may seem like nit-picking, but what DW Admins will find themselves doing (more than reading novels or starting ETL jobs) is killing run-away queries so that the analytics groups can run the queries the business needs.

Another example of Microsoft’s oversimplification of the DW space, PDW utilizes a “landing zone” server to stage data during ingestation. The landing zone server has been doubted for many years as a major bottleneck. And with good reason; a typical ETL datastream device like an Informatica server is a single device that outputs to Named Pipes. Many ETL implementations rely on Informatica’s engine to pump data at multi-Gbps speeds. Since most ETLs depend on Ethernet, this has dictated multiple Infiniband connections to obtain the throughput many organizations need.

In tests widely confirmed by Microsoft, the landing zone was able to pump “hundreds of GBs per hour”. For comparison purposes, a typical Windows 2008 File Server can pump about 275-300GB/hour — coincidence? Here’s the rub, the entire array of PDW servers uses Infiniband to link them together. It uses another fabric, Fibre Channel, to link the servers to storage. The system is flush with bandwidth — what is doesn’t have is a mechanism to avoid OS-level bottlenecks.

On the other hand, a purpose-built ETL server — like the ones used at eBay — written in lightweight C++ and designed to avoid the OS as it pumps hundreds of rows a second, can produce upwards of 4TBs/hour. A difference of about 16x. Not 2x, not 4x; 16x!

In typical “Microsoft fashion”, The SQL Server product group is attempting to show the enterprise that it has the world on a string by leveraging well-known infrastructure to solve even more complex business problems. Who could blame them? They are simply reacting to the same market indicators that Teradata, Oracle, and Netezza are seeing. Oracle brought RAC to the world many years ago and they are STILL struggling to master the DW dragon. We should applaud the efforts of Microsoft. Let’s all stand up and clap. Ok, that was fun. Now let’s take a deep breath and realize that Microsoft has years to go before they can acquire all of the technology they will need to simply enter the MPP/SN space of the truly massive data warehouse world. PDW is version 1.0. In 2015, PDW 3.0 will ship and… Thanks for reading.

Do It Right or Do It Over

Protocol isn't everything — but it's worth considering.

Monthly Archives: July 2010

Microsoft makes Massively Parallel Process Database available for the masses