Sunday 18 January 2009

The Myth of Infrastructure Contention

Back... caught u lookin' for the same thing. It's a new thing, check out this... oh no wait. It ain't a new thing, it's just another Sunday Arvo Architecture And Philosophy post. This time I'm going to focus on a long-time thorn in many of our sides - the myth of infrastructure resource contention.

This ugly beast rears its head in many ways, but in particular when consolidating workloads or changing hardware standards (such as standardising on blade). Of course, the people raising these arguments are often server admins with little or no knowledge of the storage and network architecture in their environments, or consultants who have either never worked in large environments or likewise don't know the storage and network architectures of the environment they've come into. Which is not their fault - due to the necessary delineation of responsibility in any enterprise, they just don't get exposure to the big picture. And again, I should say from the outset that I'm talking ENTERPRISE, people! Seriously, if I cop shit from one more person who claims to know better based on their 20 "strong" ESX infrastructure or home fucking lab, I am going to break out the shuriken. YES, THE ENTERPRISE IS DIFFERENT. If you have never worked in a large environment, you can probably stop reading right now (unless you want to work in a large environment, in which case you should pay close attention). Can you tell how much these baseless concerns get to me? Now where was I...

OK, a few more disclaimers. In this post I will try to stay as generic as possible to allow for broader applicability, and focus on single paths and simplistic components for the sake of clarity. Yes, I know about Layer 3 switches and other convergent network devices and topologies, but they don't help to clarify things for those who may not know of such things. Additionally, the diagrams below are a weird kind of mish-mash of various things I've seen in my time in several large enterprises, and I suck at Visio. Again, I have labelled things for clarity more than accuracy, and chopped stuff down in the name of broad applicability. Keep that in mind before you write to me saying I've got it all wrong.

IP Networks
So let's tackle the big one first: IP networks. Before virtualisation, your network may have looked something like this:

Does that surprise you? If it does, go ask one of your network team to draw out what a typical server class network looks like, from border to server. I bet I'm not far off. Go and do it now, I'll wait for you to get back.
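
And while you're waiting, here's the same penny-drop in back-of-the-envelope form. A minimal Python sketch - every port count and link speed in it is an assumption for illustration, so plug in whatever your network guy actually draws:

```python
# Back-of-the-envelope oversubscription at a typical pre-virtualisation
# access layer. All figures below are assumptions for illustration --
# substitute the real counts from your own network team's diagram.

servers = 100            # physical boxes, each with a GbE access port
edge_port_gbps = 1.0     # GbE to every endpoint, as the server guys demanded
uplinks = 2              # active access -> distribution uplinks
uplink_gbps = 10.0       # 10GbE each

edge_aggregate = servers * edge_port_gbps    # 100 Gb/s of "provisioned" edge
uplink_aggregate = uplinks * uplink_gbps     # 20 Gb/s actually leaving the switch

print(f"Edge aggregate:   {edge_aggregate:.0f} Gb/s")
print(f"Uplink aggregate: {uplink_aggregate:.0f} Gb/s")
print(f"Oversubscription: {edge_aggregate / uplink_aggregate:.0f}:1")
# 5:1 before a single hypervisor has entered the building
```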

OK, enlightened now? And in fact if you are, the penny has probably already dropped. But in case it hasn't, let's see what happens when the virtualisation train comes rolling in, and your friendly architecture and engineering team propose putting those 100 physicals into a blade chassis. It is precisely at this point that most operations staff, without a view of the big picture, start screaming bloody murder. "You idiot designers, how the hell do you think you can connect 100 boxes with only 4Gb of (active) links!!! @$%#@%# no way I'm letting that into production you @#%$%!!!". However, when we virtualise those 100 physical boxes and throw them all into a blade chassis, our diagram becomes:

OK, _now_ the penny has definitely dropped (or you shouldn't have administrative access to production systems). IT DOESN'T MATTER WHAT IS BELOW THE ACCESS LAYER. Because a single hop away (or two if you're lucky), all that bandwidth is concentrated by an order of magnitude. The network guys have known this all along. They probably laughed at the server guys' demands for GbE to the endpoints, knowing that in the grand scheme of things it would make fuck all difference in 90% of cases. But they humoured us anyway. And lucky for them they did, because the average network guy's mantra of "the core needs to be a multiple of the edge" needs to be tossed out on its arse, for different reasons. But that's another post :-).
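
To put numbers on it, here's a rough sketch of where the traffic actually goes. The per-server traffic figure is completely made up for illustration - pull real numbers from your own monitoring before you start screaming bloody murder either way:

```python
# The concentration was always there, one hop up. Traffic figures are
# illustrative assumptions -- measure your own environment.

servers = 100
avg_per_server_mbps = 20.0    # sustained average per box (assumption)

demand_gbps = servers * avg_per_server_mbps / 1000   # same before and after

dist_uplinks_gbps = 2 * 10.0   # what the 100 physicals already funnelled into
chassis_uplinks_gbps = 4.0     # the "only 4Gb!!!" of active blade uplinks

print(f"Actual demand:          {demand_gbps:.1f} Gb/s")
print(f"Before, at the uplinks: {demand_gbps / dist_uplinks_gbps:.0%} utilised")
print(f"After, at the chassis:  {demand_gbps / chassis_uplinks_gbps:.0%} utilised")
# The traffic didn't change; only the amount of idle copper did.
```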

Fibre Channel Networks
I know, I know, I really don't need to be as blatant about it this time, because you know I'm going to follow the exact same logic with storage. But just to drive the point home, here again we have our before-virtualisation infrastructure:

And again, after sticking everything onto a blade chassis:

I don't think the above needs any further explanation.
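
OK, maybe one small bit. Here's the storage version of the same arithmetic - fan-in to the array front-end ports, which is where SAN designers have always concentrated bandwidth deliberately. Port counts and speeds below are assumptions; your array will differ:

```python
# Fan-in on the SAN: the array front-end was always the real constraint,
# not the number of HBAs. All port counts below are assumptions.

hosts = 100
hba_gbps = 4.0            # one active 4Gb FC HBA per host (assumption)
array_ports = 8           # storage front-end ports (assumption)
array_port_gbps = 4.0

provisioned = hosts * hba_gbps
array_bw = array_ports * array_port_gbps

print(f"Host-side provisioned: {provisioned:.0f} Gb/s")
print(f"Array front-end:       {array_bw:.0f} Gb/s")
print(f"Fan-in ratio:          {provisioned / array_bw:.1f}:1")
# ~12:1, and the array copes fine, because hosts don't drive anywhere
# near line rate. Consolidating HBAs into blade I/O modules changes
# nothing the array hasn't already seen.
```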

I'm sure there are a million variations out there that may give rise to what some may think are legitimate arguments. You may have a dedicated backup network; it may even be non-routed. To which I would ask: what is the backup server connected at? What are you backing up to? What's the overall throughput of that backup system? The point is, there will _always_ be concentration of bandwidth on the backend, be it networking or storage, and your physical boxes don't use anywhere near the amount of bandwidth that you think they do. You may get the odd outlier, sure. Just stick it on its own box, but still put ESX underneath it - even without the added benefits of SAN and cluster membership, from an administrative perspective you still get many advantages of virtualising the OS (remember: enterprise. We don't pay on a per-host basis, so the additional cost of ESX doesn't factor in for enterprises like it would for smaller shops).
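
And for the dedicated-backup-network crowd, the same sketch one more time. Every component speed here is an assumption (the tape figure is roughly LTO-4 native); swap in your own kit:

```python
# The backup variation: whatever the client edge looks like, throughput
# is bounded by the backend. Component speeds are assumptions.

clients = 100
client_link_gbps = 1.0            # dedicated GbE backup NICs, say

media_server_nic_gbps = 10.0      # the backup server's pipe
tape_drives = 4
tape_drive_mb_s = 120.0           # roughly LTO-4 native speed

tape_gbps = tape_drives * tape_drive_mb_s * 8 / 1000   # ~3.8 Gb/s
ceiling = min(media_server_nic_gbps, tape_gbps)

print(f"Client edge (aggregate): {clients * client_link_gbps:.0f} Gb/s")
print(f"Media server NIC:        {media_server_nic_gbps:.0f} Gb/s")
print(f"Tape throughput:         {tape_gbps:.1f} Gb/s")
print(f"Effective ceiling:       {ceiling:.1f} Gb/s")
# 100 Gb/s of edge funnels into under 4 Gb/s of backend either way --
# consolidating the clients onto blades doesn't move the bottleneck.
```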

OK, time to wrap this one up. Your environment may vary from the diagrams above, but you _will_ have concentration points like those above, somewhere. That being the case, if you don't have network or storage bandwidth problems before virtualisation, don't think that you will have them afterwards just because you massively cut the aggregate endpoint connectivity.