September 11th was a bad day, and on Sept 12th, things got worse.

It is September 11th, 2014 today. A day that affected a huge number of people. I’m not going to tell you that I was affected more than the next guy, because I just wasn’t. I can tell you HOW I was affected though. Side-stepping for a moment the issues of human suffering, loss of life, and the sacrifice that literally thousands of Americans made on that day and the following WEEKS, I’d like to focus, if you’ll allow, on the IT administrators who were affected on September 11th, 2001.

Back on September 11, 2001, we didn’t have robust systems or technologies that allowed us to protect our data centers. Actually, that’s not true. We had them, and they were very, very expensive. Windows Clustering existed, sure. Was it widespread in 2001? Not exactly. Did it work? Well… sure. But did it work to protect the systems and the data in a way that mattered on Sept 11, 2001? Largely, no. So, the point is that on September 11, 2001, most of the Microsoft Windows-based systems, including Exchange Server and SQL Server, were not replicated or protected by any high availability techniques.

Since 2001, and over the past ten years, many IT suppliers have worked tirelessly to create and proliferate data protection systems. EMC, in particular, has continued to innovate, perhaps more than any other IT supplier. Back in 2005, I moved from Microsoft to EMC… Speaking for EMC, we saw our data protection portfolio erupt with new entries since 9/11. During the 2000s, we developed point solutions for file systems, for backup targets, for hosts, for arrays. Some arrays had multiple choices for replication and data protection. At one point in 2009, we had more than 19 different ways to replicate data. Seriously. Nineteen. But back in 2001, most Microsoft customers were using nothing to protect their systems other than backups to TAPE. These tapes were placed in a large steel box and carried off-site every day to a storage facility several miles from the originating tape drives.

So why was Sept 12th worse than Sept 11th? It was on Sept 12th that the magnitude of the loss actually registered with the survivors. For the souls lost on Sept 11th, it was the first day experiencing the afterworld, a place filled with none of their living friends or family. For those still on Earth, it was also the first day. It was the first day without their mom, or their dad, or their son, or their daughter. For the IT folks, who were not only dealing with personal loss, there was now the loss of their systems to deal with as well. This wasn’t necessarily a sad thing; it was, however, a challenging thing.

I was a Technical Account Manager at Microsoft back in 2001.  My only account was a global financial services company located, ironically, across the street from the Twin Towers in the World Financial Center (WFC) on the West Side Highway.  Most of this firm’s financial systems were based on IBM mainframes — zSeries and the like.  The storage infrastructure was all EMC Symmetrix and replicated with EMC’s Symmetrix Remote Data Facility (SRDF) to another capable data center on Staten Island.

The systems that were not protected were the global email infrastructure and dozens of SQL Servers, because their Messaging Bridgehead was located in WFC South while the redundant servers were in WFC North. Not a great plan, in hindsight. The reason that Sept 12th was worse (for those systems) is that at 10:00pm on Sept 11th, all power was cut to the entire downtown electrical grid. The entire southern tip of Manhattan went black. All the uninterruptible power systems were depleted by midnight. All computers were not only down, they could not be brought back up.

All access to WFC had been blocked due to the recovery efforts.  People are, in fact, more important than computers after all.  Somehow we have not forgotten that.  All of the backup tapes were at off-site facilities, but gaining access to them was just as difficult as getting to the servers because the retrieval queues were days deep.  It would take at least five days to retrieve the tapes, and another several days to weed through and restore them.

The “bosses” could not wait.  We set up “camp” at the data center on the other side of New Jersey — I shouldn’t indicate exactly where in an attempt to preserve the privacy of this financial firm.  Let’s just say for now that it was “near Philly”.  This data center had some spare servers and enough SAN storage to allow a complete re-installation of the worldwide email infrastructure.  This effort was expected to take several days. In the end, it required nearly six weeks of 24×7 involvement of more than a dozen people.

So, I went to the data center “near Philly” with several people from the firm’s IT department and began to rebuild their Exchange infrastructure from scratch. All tallied, we invested nearly 13,000 hours of labor during a five-week period. Every day, it was our pleasure to do the work. We were delighted to be helping in some way. With the loss of more than 3,000 lives, we felt privileged to be working even if we weren’t doing anything to help those who had fallen. The work was not, however, fun.

The effort came to a head for me about five weeks into a run of 140-hour (not 40-hour, 140-hour) work weeks; that is to say, I had logged 700 hours of work in five weeks. I was so tired that I could no longer make reasonable decisions. I knew I was affected on October 14th, following Mass on Sunday morning. I was with my now-wife in the church’s parking lot. There was a flea market of sorts raising money for the parish. I was wandering around the parking lot looking at cups and saucers when my mobile phone rang. I think it was a Motorola clamshell device (the StarTAC, I think). It was the Microsoft Global Account Rep for this financial firm. He asked me for “one more favor”. His name was Rob, a great guy. He said, “Jim, there’s this Director who deleted a mail message”.

Keep in mind, this was the new system that we had just built, and we were still trying to get remote systems in Japan, Ireland, India, etc. connected and receiving mail. We were still completely overworked and under water. So this director had deleted a mail message from his mailbox, which was being protected by nightly tape backups managed by an overworked engineer who was manually swapping tapes during a frantic rebuilding process. To make matters worse, the director wasn’t sure when he had deleted it. This meant that we would need to take a valued engineer off his tasks and have him rifle through backup tapes, restoring several tapes to recovery servers in an attempt to find a message that might NOT have been captured in a backup job.

Long story short, I lost control. I didn’t become violent. I just started sobbing. You can imagine the impact this had on my fellow parishioners as they were also looking at lamps and shoes and knick-knacks. I can remember saying to Rob, “This guy wants us to derail this recovery effort because he can’t figure out how to NOT destructively delete a single email message?! There are 5,000 people missing! He wants us to recover an email message?! He should be focused on recovering PEOPLE! Has he gone mad?!” So there I was, on a perfectly sunny Sunday morning, in a church parking lot, at a fundraising flea market, sobbing and yelling into my cell phone. Not pretty. I took some mandatory PTO later that day.

My point in documenting any of this is partly cathartic, partly to share my experiences, but mostly to provide those who were similarly affected a voice, a brother in arms, a common link. I have spoken to dozens of people about 9/11 over the past 13 years. Each one has a story that starts with “I remember exactly where I was…” It’s our brain’s natural way of cataloging information: spatial pinning. It’s the brain’s way of helping us recall an event with as many details as possible.

Unlike those inside the towers, my view was from outside the towers. I had an awesome and dreadful view of the towers from the window of my apartment on 42nd Street. My apartment was on the 50th floor of a new multi-tower apartment complex on the West Side Highway. My view, through massive plate windows that opened, was of downtown. I had a direct line-of-sight view of the two towers. They filled most of the window. I saw the second plane hit. I saw the towers fall. First one, then some time later, the second. The feeling of helplessness is still devastating. The feeling that something needed to be done, but couldn’t be done, is something that still causes me to stare into the air and ponder.

For the Fire Fighters and Rescue personnel and those who ran into the towers in an attempt to simply DO SOMETHING, my heart is yours.

I made some very good friends during those weeks.  We are all damaged by the experience.  Even if we had learned anything good during the process, we cannot take any joy in the learning because it was among so much BAD.  We have not given ourselves permission, even now, to enjoy the lessons we learned.  We, instead, hold them in a vault in our souls — a place we reserve for the memories of those who gave their lives that day.

Every year, around August 15th, I begin to relive the experience.  For the weeks leading up to September 11th, I notice the number eleven everywhere.  It seems that every time I look at the clock, it’s eleven minutes past the hour.  And on the eve of September 11th, I wonder what I should do to commemorate the day.  Just like my fateful day in the church parking lot, I feel lost, alone, frustrated, but mostly just sad.  Just very, very sad.

So on this day, please take a moment to think about those who lost lives, those who lost others, and those who worked tirelessly to support the companies who were affected by the events of September 11th, 2001.

God bless you.

EMC and Microsoft – Friend or Foe?

An incredible summary from my good friend, Craig Allen


You may think that EMC and Microsoft are bitter competitors, and you may be right because in some areas they do fiercely compete. But more often than not, the two companies work together to solve customer problems with joint solutions from both sides. Even with competing products, overlapping functionality, and differences in approach, it is the customer who ultimately drives the decision. That decision is rarely a one-sided architecture. Most often we see a blend of technologies from both EMC and Microsoft that come together to solve business issues and initiatives.

Competitive

Microsoft has competitive technologies, guidance, and solutions in several areas. EMC has a different direction and its own set of federation partners.

Such examples are:

  1. Windows Azure public cloud services – These services compete directly against VMware vCHS public cloud services.
  • Microsoft’s Azure started off as PaaS (Platform as a Service) with providing custom .NET application hosting and SQL…


Upgrade to iOS 5 Fails but the fix is Simple

When you plug in your iPhone to receive the iOS 5 update, you may see a message claiming that the restore process cannot be completed. The remedy is simple. When you plug in the phone (before or after the upgrade), DO NOT follow the prompts for upgrading. Instead, reject the upgrade (Not Now) and allow iTunes to sync your phone. Allow it to back up, update apps, download new music, etc., just the same way you normally would complete a sync. After that happens, unplug the phone from the USB cable and plug it in again. This time, follow the upgrade wizard. Voila!

Cheers.

Should I use a Lagged Copy in my DAG?

Customer asks:

I hope you remember the discussion we had back in October 2010 around the need for backups for Exchange.  Well, I feel like we are running in circles around Exchange 2010 backup and recovery.  I felt relatively comfortable with the idea until recently when we started making some further changes to the environment in response to the growth of the log files.  We are attempting to switch to circular logging because Avamar doesn’t reset the archive bit and thus the logs never get truncated and grow until the volume is full.  Allegedly having 3 copies of the db’s is supported/recommended using circular logging but it has raised some questions:
*   If we have 2 db servers running on vm’s in the Cincinnati data center, is it advisable to use the lag copy for the 3rd db server in Lexington or does this violate the “rule of 3″?
*   Is it better to have all 3 in the same DAG with no lag at all vs. having the lag copy?
*   In what scenario would having the lag copy be beneficial?
*   If we enable circular logging on the active db server, do all of the other servers mirror that setting?  Is this Circular Logging Continuous Replication (CLCR)?
*   If the passive copies don’t pick up the change in log file settings, is it possible/advisable to have an alternate log mechanism on the other copies?
*   Now that we have increased our capacity in Avamar, what would a reasonable DR backup plan look like?  Currently we are using weekly.
*   How do SAN snapshots fill in the gaps between backups and log files?  We use Dell/Equallogic storage today and while the auto-snapshot feature sounds nice, in reality the snapshots are huge and thus we can only keep one snapshot per db.  I’m sure you can comment on how much better EMC’s tools are and I actually welcome that discussion.

Ok… Here’s what I wrote back:

First, thank you for your message.  I have escalated your issue with the Avamar client.  Not truncating logs at the end of a successful backup is errant behavior; we need to find out why this is not working for you.
Second… I’ll take your questions in a Q&A format:
Q *   If we have 2 db servers running on vm’s in the Cincinnati data center, is it advisable to use the lag copy for the 3rd db server in Lexington or does this violate the “rule of 3″?
A *  A lagged copy violates the “rule of three” — the three copies become block-for-block voting members in repairing torn pages (likely if running on SATA Direct Attached/non-intelligent storage devices).  A lagged copy will not be counted as a voting member.
Q  *   Is it better to have all 3 in the same DAG with no lag at all vs. having the lag copy?
A * If you are running on an intelligent array (like ALL of EMC’s arrays, which perform advanced CRC tracking and scanning), you will not need Exchange to vote on page repairs (there won’t be any!), so a lagged copy can be used to go backwards in time.  Technically, you would need two “near” copies and one “far” lagged copy to satisfy your temporal repair issues.  More on this below.
Q  *   In what scenario would having the lag copy be beneficial?
A * A corruption event that affects the Active copies of the database.  You could use the lagged copy with its unplayed logs to isolate the corruption within an unplayed log and prevent the corruption from infecting the only unaffected copy you have left…  This is why a backup becomes important…
Q  *   If we enable circular logging on the active db server, do all of the other servers mirror that setting?  Is this Circular Logging Continuous Replication (CLCR)?
A *   In a word, YES.  See http://technet.microsoft.com/en-us/library/dd876874.aspx#FleMaiPro
Q *   If the passive copies don’t pick up the change in log file settings, is it possible/advisable to have an alternate log mechanism on the other copies?
A *  no
Q  *   Now that we have increased our capacity in Avamar, what would a reasonable DR backup plan look like?  Currently we are using weekly.
A *  Excellent Question!!  Microsoft’s Exchange Product group has begun to intermix the notions of HA, Backup/Recovery, and DR.  In a perfect world, you would have five copies of each database (sounds like SAP eh?) Two at the primary data center (PDC), three at the backup data center (BDC), plus one more lagged copy at the BDC.  In that architecture, you can use Exchange Native Backup, but that doesn’t answer your question.  Assuming that we can get Avamar to truncate your logs, a DR plan would look like this: Use a two-site DAG with two local copies and one remote copy — all Active copies.  You will place a database activation block on the remote databases to prevent accidental routing of mail to the remote databases.  Now… Install the Avamar client on the remote mailbox server and back it up every night (full only!!).  Keep the backups for 14-days.  Also, at the PDC, configure Reserve LUN Pool space for one of the mailbox servers.  It does not matter if the server is Active or Passive.  Configure Replication Manager to create VSS snapshots of all the databases on that server.  Execute snaps once per day.  Never back them up.  Expire the snapshots after eight days.  This will provide:
*Local HA in real-time via DAG-Replication — the ability to perform patch management every week with no downtime,
*Local BC at less than 24-hour RPO.  This will allow you to recover your database in the strange event that a rolling corruption event makes it to the remote data center.
*Remote failover for DR/BC and Site Maintenance.
*Remote BC/DR in the event of a smoking hole during a corruption event.
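
To make the two retention windows in that plan concrete, here is a minimal Python sketch. It assumes exactly one recovery point per day from each mechanism and treats "keep for 14 days" and "expire after eight days" as rolling windows; the function names are hypothetical and nothing here calls Avamar or Replication Manager.

```python
from datetime import date, timedelta

# Retention figures taken from the plan above.
BACKUP_RETENTION_DAYS = 14    # nightly full backups of the remote copy
SNAPSHOT_RETENTION_DAYS = 8   # daily VSS snapshots at the primary site


def retained_points(today: date, retention_days: int) -> list[date]:
    """Return the daily recovery points still on hand, newest first.

    Assumption: one point per day, and a point is gone once it is
    `retention_days` old.
    """
    return [today - timedelta(days=age) for age in range(retention_days)]


def recovery_options(today: date, target: date) -> list[str]:
    """List which mechanisms still hold a copy taken on `target`."""
    options = []
    if target in retained_points(today, SNAPSHOT_RETENTION_DAYS):
        options.append("local snapshot")
    if target in retained_points(today, BACKUP_RETENTION_DAYS):
        options.append("remote backup")
    return options


if __name__ == "__main__":
    today = date(2011, 3, 15)
    for days_back in (1, 7, 8, 13, 14):
        target = today - timedelta(days=days_back)
        print(days_back, "days back:", recovery_options(today, target) or "nothing")
```

Under these assumptions, anything from the last week can come back from a local snapshot, anything from the second week has to come from the remote backup, and anything older is gone.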
Q *   How do SAN snapshots fill in the gaps between backups and log files?  We use Dell/Equallogic storage today and while the auto-snapshot feature sounds nice, in reality the snapshots are huge and thus we can only keep one snapshot per db.  I’m sure you can comment on how much better EMC’s tools are and I actually welcome that discussion.
A *  Hmmm… Yes.  EMC’s snapshots on CX4 and VMAX require only the snapshot space needed to hold the deltas, not the entire LUN.  Array-based snapshots do not actually fill a gap between log files and backups; however, they provide an additional layer of recoverability that closes a temporal gap associated with recovery from backup.  For example, I can recover a 1TB database from a snapshot in about 4 minutes.  I can recover a 1TB database from DISK (B2D) in about 14 hours!!
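
For a sense of scale, the two recovery times quoted above can be turned into effective rates. This is only back-of-the-envelope arithmetic in Python, assuming 1TB means 1,000,000 MB; a snapshot restore is not a straight data copy, so its "rate" is an equivalent figure rather than real throughput.

```python
def effective_rate_mb_per_s(size_tb: float, minutes: float) -> float:
    """Effective restore rate in MB/s (assumes 1 TB = 1,000,000 MB)."""
    return size_tb * 1_000_000 / (minutes * 60)


SNAPSHOT_MINUTES = 4        # figure quoted in the answer above
BACKUP_MINUTES = 14 * 60    # 14 hours, also quoted above

snapshot_rate = effective_rate_mb_per_s(1, SNAPSHOT_MINUTES)
backup_rate = effective_rate_mb_per_s(1, BACKUP_MINUTES)
speedup = BACKUP_MINUTES / SNAPSHOT_MINUTES

if __name__ == "__main__":
    print(f"restore from disk backup: {backup_rate:.1f} MB/s")
    print(f"restore from snapshot:    {snapshot_rate:.1f} MB/s equivalent")
    print(f"time ratio:               {speedup:.0f}x")
```

With those two numbers, the snapshot path is about 210 times faster than restoring from disk.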
I hope this helps!!

Can I boot my Hyper-V Guests from SAN?

Yes, you can (and should) boot your Hyper-V children/guests from your shared storage array.  Here’s how:

Create a Hyper-V cluster of at least two Hyper-V Parent machines using a third File Share Witness (I used a share off my DC).  After you create your Failover Cluster, zone and mask storage to both nodes of your new cluster.  Use the Failover Cluster Manager to add the newly zoned storage as a cluster resource, then convert the volume to a Cluster Shared Volume.  This is far easier than I make it sound.

Now, use the Hyper-V Manager to edit your default properties to store machines and virtual machine definition files on the newly created CSV.  Now, use the Failover Cluster Manager to create a new Virtual Machine, and make sure you are storing it on the CSV.  Voila.  Done.  You are now booting your VM from SAN.

The data partitions can now be stored in the same CSV (NOT recommended!!!), stored on a second or third VM, stored on a SCSI Passthrough volume, or stored via iSCSI initiator (all on the SAN) in separate storage pools, RAID Groups, etc.

Ok, but what are you DOING?!

Dell is buying 3PAR.  Oracle has Sun and Exadata.  EMC now has Greenplum.  Cisco sells telephony and servers.  IBM is selling SATA drives to the enterprise as XIV and is reselling NTAP, but its storage architect is out on the loose, again.  Who did I leave out?  Oh, HP…  The only one NOT making moves is Microsoft?!  Actually that’s not completely true… Microsoft continues to “innograte” — the act of innovating by integrating acquired technology into your existing products’ evolution.  DATAllegro, Opalis, and so on…

I keep thinking about the famous economic principle that states “you will ultimately be undone by your past”.  If EMC could just shed its dependency on Dell’s relationships with all those purchasing agents…  If Cisco could just lose the MDS and the Catalyst… if IBM could just forget about the XIV and buy NTAP already!…  And HP… Oh HP…

It has become painfully clear that the order of the components in the OSI model, in fact, serves as a roadmap.  Never forget that the Application is always at the top.  Whatever the application needs, the application gets.  If you don’t sell an application, don’t expect to tell anyone what to do.  Ever.  If your application is actually a utility for another application, don’t forget that fact.  Your “utility” is NOT the application.

Oracle has Java, yes.  Java is a development tool, not an application, but how about Siebel, PeopleSoft, JDEdwards?  Dell has … nothing.  HP has … nothing.  IBM has DB2 and Lotus.  Cisco has Unity — yes Voicemail is an application.  EMC has … hmmm VMware? VMware is actually an infrastructure tool — it’s like a server hardware manufacturer that lets you use whatever server vendor you want.  EMC also has Documentum — a utility that is configured as an application.  Microsoft, on the other hand, has all the applications you can shake a stick at.  If Microsoft says their application needs 100-spinning dancers to run, guess what you’re buying?

The way to true technology marketplace leadership is through applications that people actually use.  People love Microsoft Office.  They love their iPhones, their Androids, their Blackberries (often seen as tools… but they’re not; an iPhone is a collection of applications. Remember… “there’s an APP for that!!!”).  People also love web-based applications like Facebook, Salesforce, gmail, and LinkedIn.  It seems to me… that HP, IBM, Dell, and EMC would do well to think about what people are using to “run their lives” and follow those markets.

Does it matter that Dell has 3PAR?  Does it matter that EMC has DataDomain?  Does it matter that Oracle has Sun and VirtualIron and BEA, and Java?  I don’t think so, but Oracle DOES have Siebel, PeopleSoft, and JDEdwards: real APPLICATIONS.  So at the end of the day, I think it becomes pretty clear who will be pulling and who will be pushing…  Take a look at what people are DOING.  The truth will show you the way:

  • Email — Google, Microsoft, Yahoo, Apple (MobileMe).  All Cloud-based email systems (and Microsoft even has a version of email that runs inside your firewall <g>)
  • Banking — a scattered field with many banks offering their own applications, plus Quicken
  • Social Networking — Facebook is dominating, but LinkedIn, MySpace, etc. continue to stay afloat
  • Media Sharing — Flickr, SnapFish, PhanFare
  • Media Consumption — Netflix, Pandora, Amazon’s Kindle, iTunes — all retailers with massive followings
  • Spreadsheets and Documents — Microsoft OWNS this space with trickles from OpenOffice, and iWork

So, just some idle advice from the sidelines for Dell, HP, even EMC — look at what people are doing; and go DO that!

Microsoft makes Massively Parallel Processing Database available for the masses

Microsoft, with its Parallel Data Warehouse (Madison), a product derived from its purchase of DATAllegro, seems to be struggling to convince anyone that it actually understands what data warehousing really is.

As an example, Madison has no workload manager — a key element to data warehouses that allows the business to define “roles” that have access to the resources of the DW. Without a workload manager, all users/roles share the access to the valuable DW without prejudice.

In a mature (read “true”) data warehouse, “roles” are defined that allow the business to state, for example, all VIPs can run queries any time, but they cannot consume more than x resources on the system, they can only return x number of rows, they can only process x queries at one time, etc. Likewise, there is an adhoc group that might be governed more stringently and an analytics group that has higher privileges because the DW Admins can trust that the queries that the analytics group submits will not bring the DW to its knees.
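
The role-based governance described above can be sketched in a few lines of Python. This is an illustration of the idea only, not any vendor's workload manager: the role names, the limits, and the `WorkloadManager` class are all hypothetical, and the per-role resource cap is left out to keep the sketch short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoleLimits:
    max_concurrent_queries: int  # queries the role may run at one time
    max_rows_returned: int       # largest result set the role may request


# Hypothetical roles and numbers, chosen only to illustrate the idea.
LIMITS = {
    "vip": RoleLimits(max_concurrent_queries=4, max_rows_returned=100_000),
    "adhoc": RoleLimits(max_concurrent_queries=2, max_rows_returned=10_000),
    "analytics": RoleLimits(max_concurrent_queries=8, max_rows_returned=5_000_000),
}


class WorkloadManager:
    """Admits or rejects queries according to per-role limits."""

    def __init__(self, limits: dict[str, RoleLimits]) -> None:
        self.limits = limits
        self.running = {role: 0 for role in limits}

    def admit(self, role: str, estimated_rows: int) -> bool:
        """Return True and reserve a slot if the query fits the role's limits."""
        limit = self.limits[role]
        if self.running[role] >= limit.max_concurrent_queries:
            return False
        if estimated_rows > limit.max_rows_returned:
            return False
        self.running[role] += 1
        return True

    def finish(self, role: str) -> None:
        """Release the slot held by a completed query."""
        self.running[role] = max(0, self.running[role] - 1)


if __name__ == "__main__":
    wm = WorkloadManager(LIMITS)
    print(wm.admit("adhoc", 500))        # first ad hoc query gets a slot
    print(wm.admit("adhoc", 500))        # second one too
    print(wm.admit("adhoc", 500))        # third is refused: role is at its cap
    print(wm.admit("analytics", 2_000_000))
```

The point of the design is that a refusal for one role never costs another role its slots, which is exactly what a single system-wide cap cannot express.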

Microsoft’s implementation of PDW 1.0 allows none of these controls. Instead they govern the entire system to 32 simultaneous queries. Each query can consume (theoretically) 1/32 of the Compute nodes’ resources. So there is no means to escalate, nor a means to sublimate queries. This may seem like nit-picking, but what DW Admins will find themselves doing (more than reading novels or starting ETL jobs) is killing run-away queries so that the analytics groups can run the queries the business needs.

Another example of Microsoft’s oversimplification of the DW space: PDW utilizes a “landing zone” server to stage data during ingestion. The landing zone server has been doubted for many years as a major bottleneck. And with good reason; a typical ETL datastream device like an Informatica server is a single device that outputs to Named Pipes. Many ETL implementations rely on Informatica’s engine to pump data at multi-Gbps speeds. Since most ETLs depend on Ethernet, this has dictated multiple InfiniBand connections to obtain the throughput many organizations need.

In tests widely confirmed by Microsoft, the landing zone was able to pump “hundreds of GBs per hour”. For comparison purposes, a typical Windows 2008 File Server can pump about 275-300GB/hour. Coincidence? Here’s the rub: the entire array of PDW servers uses InfiniBand to link them together. It uses another fabric, Fibre Channel, to link the servers to storage. The system is flush with bandwidth; what it doesn’t have is a mechanism to avoid OS-level bottlenecks.

On the other hand, a purpose-built ETL server — like the ones used at eBay — written in lightweight C++ and designed to avoid the OS as it pumps hundreds of rows a second, can produce upwards of 4TBs/hour. A difference of about 16x. Not 2x, not 4x; 16x!
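
The size of that gap depends on which baseline you pick, so here is the arithmetic as a short Python sketch. It assumes 4TB/hour means 4,000 GB/hour; the 275 and 300 GB/hour baselines are the file-server figures quoted earlier, and 250 GB/hour is an assumed baseline added only to show where a 16x ratio comes from.

```python
def speedup(fast_gb_per_hour: float, slow_gb_per_hour: float) -> float:
    """Ratio between two ingest rates expressed in GB/hour."""
    return fast_gb_per_hour / slow_gb_per_hour


PURPOSE_BUILT_GB_PER_HOUR = 4_000  # "upwards of 4TB/hour", taking 1 TB = 1,000 GB

if __name__ == "__main__":
    for baseline in (250, 275, 300):
        ratio = speedup(PURPOSE_BUILT_GB_PER_HOUR, baseline)
        print(f"vs {baseline} GB/hour: {ratio:.1f}x")
```

Against the 275-300 GB/hour range the ratio works out to roughly 13x to 15x, and it reaches 16x against a 250 GB/hour baseline. Either way it is an order of magnitude, not a rounding error.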

In typical “Microsoft fashion”, the SQL Server product group is attempting to show the enterprise that it has the world on a string by leveraging well-known infrastructure to solve even more complex business problems. Who could blame them? They are simply reacting to the same market indicators that Teradata, Oracle, and Netezza are seeing. Oracle brought RAC to the world many years ago and they are STILL struggling to master the DW dragon. We should applaud the efforts of Microsoft. Let’s all stand up and clap. Ok, that was fun. Now let’s take a deep breath and realize that Microsoft has years to go before they can acquire all of the technology they will need to simply enter the MPP/SN space of the truly massive data warehouse world. PDW is version 1.0. In 2015, PDW 3.0 will ship and… Thanks for reading.