Tuesday 13 May 2008

What VMware Site Recovery Manager isn't...

Straight up front - this is not a cynical post. My main point is NOT that SRM has some kind of product or design flaw. Plenty of people will write about what SRM does offer, so I thought I'd balance that a little and help people keep sight of the fact that it is not a panacea (not that it purports to be, but the marketing hyperbole is hardly going to point out why you need additional BCP / DR products). Personally I consider SRM a necessity, for the simple reason that keeping those BCP / DR VMs offline saves a fortune in the system administration overheads of having them online, and those overheads are easily the biggest chunk of TCO. Enough of the disclaimers, on to the meat of the post!

When you think about why you would invoke a DR plan in the virtual world, it pretty much boils down to two things:

1) Catastrophes, like an entire datacenter or array outage.

2) Configuration errors that can't be recovered from within the application's recovery time objective (RTO).

Point 1 is obviously what SRM is designed to address; it is called Site Recovery Manager, after all.

Point 2, however, is not what SRM can or should be used for, and one would certainly hope that configuration errors, like an OS or application patch that breaks something or a change request gone wrong, are much more probable than catastrophes.
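To make the split between the two points concrete, here is a minimal sketch of the decision in Python. It is only an illustration of the reasoning above: the `Incident` fields, the `Scope` values and the returned labels are all made-up names, not anything SRM exposes.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    SITE = "site"            # whole datacenter or array is out (point 1)
    SINGLE_VM = "single_vm"  # one VM broken by a patch or change (point 2)


@dataclass
class Incident:
    scope: Scope
    estimated_fix_minutes: int  # best guess at repairing the VM in place
    rto_minutes: int            # recovery time objective of the application


def choose_response(incident: Incident) -> str:
    """Pick a recovery path; the labels are hypothetical, not product features."""
    if incident.scope is Scope.SITE:
        # Point 1: the case a site-level failover tool is built for.
        return "site failover"
    if incident.estimated_fix_minutes <= incident.rto_minutes:
        # The error can be repaired within the RTO, so no DR plan is invoked.
        return "fix in place"
    # Point 2: a single VM must be recovered from a snapshot, image or replica.
    return "single-VM recovery"


if __name__ == "__main__":
    print(choose_response(Incident(Scope.SITE, 0, 60)))
    print(choose_response(Incident(Scope.SINGLE_VM, 240, 60)))
```

The last branch is the one this post is about: it needs some per-VM recovery mechanism, whatever else is in place for the first branch.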

Of course there are a number of ways you can address point 2. Snapshots can go some way towards it, but that can be very difficult in large enterprises, where the VMware admins may not hear about application changes in time to take the snap beforehand. You could schedule regular snaps and merges, effectively keeping VMs continuously in a snapshotted state, but I seem to recall that VMFS uses SCSI reservations for metadata updates, such as extending a snapshot file as it gets written to. If you've got 20 VMs on a LUN that simultaneously kick off a virus scan, which writes to a log as well as reading the entire filesystem, that could mean a lot of reservations contending for the same LUN.

Regular image-level VCB backups could be used to similar effect, but you probably don't want to run SRM and also take images of the entire virtual infrastructure. And as there's not really an elegant interface to track and manage specific VM image backups via VCB (at least not that I have seen), there's definitely room for the third-party tools that offer scheduled / asynchronous replication of an online production VM to an offline DR partner. If anything, SRM makes the need for those products more obvious.
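For a rough feel of the 20-VM scenario, here is a back-of-envelope sketch in Python. It assumes each extension of a snapshot file costs one metadata update, and the 200 MB of writes per VM and the 16 MB grow increment are placeholder numbers I picked for illustration, not measurements or documented values; only the VM count comes from the example above.

```python
import math


def grow_events(vm_count: int, written_mb_per_vm: float, grow_increment_mb: float) -> int:
    """Estimate how many times snapshot files get extended across a LUN.

    Assumption: a snapshot file grows in fixed-size steps, and every step
    triggers one metadata update on the shared volume.
    """
    per_vm = math.ceil(written_mb_per_vm / grow_increment_mb)
    return vm_count * per_vm


if __name__ == "__main__":
    # 20 VMs is from the example; the other two values are assumed.
    events = grow_events(vm_count=20, written_mb_per_vm=200, grow_increment_mb=16)
    print(f"Estimated snapshot extensions on the LUN: {events}")
```

The absolute number matters less than the way it scales: it grows linearly with the number of snapshotted VMs sharing the LUN, which is why keeping everything permanently snapshotted deserves some thought.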

So it's probably worth keeping the above in mind if you're coming up with a business case for putting SRM into your environment... include the need for a single-VM recovery solution as well if you don't have one already, to save getting caught out and then having to explain why that DR application you spent all that money on actually isn't the be all and end all.