Sunday, 18 May 2008

How a stateless ESXi infrastructure might work

Yep, it's Sunday afternoon, and thus time for another installment of Sunday Afternoon Architecture and Philosophy! Advance Warning: Get your reading specs, this post is a big 'un.

I've mused several times on the whole stateless thing, especially with regard to ESXi. Today I'm going to take it a bit further in the hope that someone out there from VMware may actually be reading (besides JT from Communities :-).

Previously I've shown how you can PXE boot ESXi. While completely unsupported, it at least lends itself to some interesting possibilities, as with ESXi, VMware are uniquely positioned to offer such a capability. The Xen hypervisor may be 50,000 lines of code but it's useless without that bloated Dom0 sitting on top of it. Check out the video (if you can be bothered to register; are they so desperate for sales leads that you need to register to watch a video???) of the XenServer "embedded" product - it still requires going through what is essentially a full Linux install, except instead of reading from a CD and installing to a hard drive it's all on a flash device attached to the mainboard. But I digress...

So let's start at the top, and take a stroll through how you might string this ESXi stateless infrastructure together in your everyday enterprise. And I'll say upfront, I'm a Microsoft guy so a lot of the options in here are Microsoft centric. In my defense however, just about every enterprise runs Active Directory and it's easy to leverage some peripheral Windows technologies for what we want to achieve.

First up, the TFTP server. RIS (or WDS) is not entirely necessary for what we want to do - a simple ol' TFTP server will do, even the one you can freely install from a Windows CD. In this example we'll use good ol' pxelinux, so our bootfilename will be 'pxelinux.0' and that file will be in the root of the TFTP server. The directory structure of the TFTP root could be something like the following:

In the TFTP root pictured above I have 3 directories, each named after an ESXi build. The 'default' file in the pxelinux.cfg directory presents a menu so I can select which kernel to boot. I could also have a file in the pxelinux.cfg directory named after the GUID of the client, which would allow me to specify which kernel to boot for a particular client.
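To make the per-client file idea a bit more concrete, here's a rough Python sketch of the order in which pxelinux looks for a config file: client UUID (GUID) first, then the MAC address prefixed with the Ethernet hardware type, then the IP address in hex with one digit dropped at a time, and finally 'default'. The function name and the sample UUID, MAC and IP are made up for illustration.

```python
import ipaddress


def pxelinux_config_candidates(client_uuid, mac, ip):
    """Return pxelinux.cfg file names in the order pxelinux tries them."""
    names = []
    if client_uuid:
        # The UUID/GUID is looked up in lowercase, dash-separated form.
        names.append(client_uuid.lower())
    # "01" is the ARP hardware type for Ethernet, followed by the MAC.
    names.append("01-" + mac.lower().replace(":", "-"))
    # The IP address as 8 uppercase hex digits, shortened one digit at a time.
    ip_hex = "%08X" % int(ipaddress.IPv4Address(ip))
    for length in range(8, 0, -1):
        names.append(ip_hex[:length])
    names.append("default")
    return ["pxelinux.cfg/" + name for name in names]


if __name__ == "__main__":
    # Sample values only; substitute the details of a real client.
    for name in pxelinux_config_candidates(
            "B8945908-D6A6-41A9-611D-74A6AB80B83D",
            "88:99:AA:BB:CC:DD",
            "192.0.2.91"):
        print(name)
```

So dropping a file named after a box's GUID into pxelinux.cfg pins that box to a build, while everything else falls through to the 'default' menu.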

If you already have RIS / WDS in your environment, things are a little less clunky... you can simply create a machine account in AD, enter the GUID of the box when prompted and then set the 'netbootMachineFilePath' attribute on the computer object to the file on the RIS box that you want to boot.
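If you'd rather script that attribute change than click through the GUI, one option is to generate an LDIF 'modify' record and import it with a tool such as ldifde. The Python below is only a sketch: the DN and the boot file path are hypothetical placeholders, and it assumes both values are plain ASCII so no base64 encoding is needed.

```python
def netboot_ldif(computer_dn, boot_file_path):
    """Build an LDIF modify record that sets netbootMachineFilePath."""
    return "\n".join([
        "dn: " + computer_dn,
        "changetype: modify",
        "replace: netbootMachineFilePath",
        "netbootMachineFilePath: " + boot_file_path,
        "-",   # terminates the list of changes for this attribute
        "",    # blank line ends the record
    ])


if __name__ == "__main__":
    # Hypothetical DN and path, shown only to illustrate the record layout.
    print(netboot_ldif(
        "CN=esx01,OU=ESXi Hosts,DC=example,DC=com",
        r"ris01.example.com\pxelinux.0"))
```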

Onto DHCP. Options 66 (TFTP server hostname) and 67 (bootfile name) need to be configured for the relevant scope. DHCP reservations for the ESXi boxen could also be considered a pre-requisite. The ESXi startup scripts do a nice job of picking that up and handling it accordingly.
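For the curious, here's roughly what those two options look like on the wire. DHCP options are encoded as a one-byte code, a one-byte length and then the value, so a small Python sketch can build them by hand. The server name 'tftp01' is a placeholder; in practice you'd just set these in your DHCP server's scope options rather than craft bytes yourself.

```python
def encode_dhcp_option(code, value):
    """Encode one DHCP option as code, length, value (all in bytes)."""
    data = value.encode("ascii")
    if not 0 < code < 255:
        raise ValueError("option code must be between 1 and 254")
    if len(data) > 255:
        raise ValueError("option value too long for a single option")
    return bytes([code, len(data)]) + data


def pxe_boot_options(tftp_server, bootfile):
    """Return option 66 (TFTP server name) followed by 67 (bootfile name)."""
    return encode_dhcp_option(66, tftp_server) + encode_dhcp_option(67, bootfile)


if __name__ == "__main__":
    # 'tftp01' is a made-up host name; 'pxelinux.0' is the bootfile used above.
    print(pxe_boot_options("tftp01", "pxelinux.0").hex())
```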

So all this stuff is possible today (albeit unsupported). If ESXi doesn't have a filesystem for scratch space, it simply uses an additional 512MB of RAM for its scratch - hardly a big overhead in comparison to the flexibility PXE gives you. Booting off an embedded USB device is cool, but having a single centralised image is way cooler. As you can see, there's nothing stopping you from keeping multiple build versions on the TFTP server, making rollbacks a snap. With this in place, you are halfway to a stateless infrastructure. New ESXi boxes can be provisioned almost as fast as they can be booted.
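As a sketch of how the multiple-builds idea might be scripted, the Python below generates a pxelinux 'default' menu with one entry per build directory, so a rollback is just regenerating the file with a different default. The build names, the assumption that each build directory holds its own mboot.c32, and the kernel and module file names are all placeholders; check the contents of your own build before using anything like this.

```python
def build_pxelinux_menu(builds, default_build,
                        kernel="vmkernel.gz", modules=("binmod.tgz",),
                        timeout_tenths=100):
    """Return the text of a pxelinux 'default' file with one label per build.

    kernel and modules are assumed file names inside each build directory.
    """
    if default_build not in builds:
        raise ValueError("default build %r is not on the TFTP server" % default_build)
    lines = [
        "DEFAULT " + default_build,
        "PROMPT 1",
        "TIMEOUT %d" % timeout_tenths,  # pxelinux counts in tenths of a second
        "",
    ]
    for build in builds:
        # mboot.c32 takes the kernel first, then modules separated by ' --- '.
        parts = [build + "/" + kernel] + [build + "/" + m for m in modules]
        lines += [
            "LABEL " + build,
            "  KERNEL " + build + "/mboot.c32",
            "  APPEND " + " --- ".join(parts),
            "",
        ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical build directory names; rolling back means changing the default.
    print(build_pxelinux_menu(["build-A", "build-B", "build-C"], "build-B"))
```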

After booting, they need to be configured though... and that's where we move onto theory...

The biggest roadblock by far in making this truly stateless is the lack of state management. There's no reason why VirtualCenter couldn't perform this function. But there's other stuff that would need to change too in order to support it. For example, something like the following might enable a fully functioning stateless infrastructure:

1) Move the VirtualCenter configuration store to Lightweight Directory Services (what used to be called ADAM), allowing VirtualCenter to become a federated, multi-master application like Active Directory. The VMware Desktop Manager team are already aware that lightweight directory services make a _much_ better configuration store than SQL Server does. SQL Server would still be needed for performance data, but the recommendation for enterprises these days is to have SQL Server on a separate host anyway.

2) Enhance VirtualCenter so that you can define configurations on a cluster-wide basis. VirtualCenter would then just have to track which hosts belonged to what cluster. XenServer kind of works this way currently - as soon as you join a XenServer host to a cluster, the configurations from the other hosts are replicated to it so you don't have to do anything further on the host in order to start moving workloads onto it. This is probably the only thing XenServer does _way_ better than VI3 currently. Let's be honest - in the enterprise, the atomic unit of computing resource is the cluster these days, not the individual host. Additionally, configuration information could be further defined at a resource pool or vmfolder level.

3) Use SRV records to allow clients to locate VirtualCenter hosts (ie the Virtual Infrastructure Management service). Modify the startup process of ESXi so that it sends out a query for this SRV record every time it boots.

4) Regardless of which VirtualCenter the ESXi box hits, since VirtualCenter would be federated, it could tell the ESXi box which VirtualCenter host is closest to it. The ESXi box would then connect to this closest VC and ask for configuration information.
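To illustrate the cluster-level configuration idea from the list above, here's a purely hypothetical Python sketch. Nothing like this exists in VirtualCenter as far as this post is concerned; the class, method and setting names are all invented. The point is simply that the store only needs to know the cluster's settings and which hosts belong to which cluster, and a freshly booted host can then ask for its configuration.

```python
class ConfigStore:
    """Toy stand-in for a management server's configuration store."""

    def __init__(self):
        self._cluster_settings = {}  # cluster name -> settings dict
        self._membership = {}        # host id -> cluster name

    def define_cluster(self, name, settings):
        """Record the configuration shared by every host in a cluster."""
        self._cluster_settings[name] = dict(settings)

    def assign_host(self, host_id, cluster):
        """Track which cluster a host belongs to."""
        if cluster not in self._cluster_settings:
            raise KeyError("unknown cluster: " + cluster)
        self._membership[host_id] = cluster

    def config_for(self, host_id):
        """Return a copy of the settings a booting host should apply."""
        cluster = self._membership[host_id]
        return dict(self._cluster_settings[cluster])


if __name__ == "__main__":
    store = ConfigStore()
    # Invented cluster name, host id and setting keys, for illustration only.
    store.define_cluster("prod-cluster-01", {
        "ntp_servers": ["ntp.example.com"],
        "portgroups": ["VM Network", "VMotion"],
    })
    store.assign_host("esx01", "prod-cluster-01")
    print(store.config_for("esx01"))
```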

By now all the Windows people reading this are thinking "Hmmm, something about that sounds all too familiar". And they'd be right - Windows domains work almost exactly in this way.

SRV records are used to allow clients to locate kerberos and LDAP services, ie Domain Controllers. The closest Domain Controller to the client is identified during the logon process (or from cache), and the client authenticates to this Domain Controller and pulls down configuration information (ie user profile and homedrive paths, group membership information for the user and machine accounts, Group Policy, logon scripts etc). This information is then applied during the logon process, resulting in the user receiving a fully configured environment by the time they log on.
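The SRV part of that process is easy enough to sketch. A client that has fetched the SRV records for a service (for Domain Controllers, a name like _ldap._tcp.dc._msdcs.<domain>) tries the lowest priority value first and spreads load across equal-priority targets according to weight, as described in RFC 2782. The Python below does only that ordering step on a hard-coded list, since the standard library has no SRV resolver; the host names are made up, and it leaves out the site-awareness that finds the "closest" Domain Controller.

```python
import random
from collections import namedtuple

SrvRecord = namedtuple("SrvRecord", "priority weight port target")


def order_srv_records(records, rng=random):
    """Order SRV records: lowest priority first, weighted random within a priority."""
    by_priority = {}
    for record in records:
        by_priority.setdefault(record.priority, []).append(record)

    ordered = []
    for priority in sorted(by_priority):
        # Zero-weight records go first so they only win when the draw is 0.
        pool = sorted(by_priority[priority], key=lambda r: r.weight)
        while pool:
            total = sum(r.weight for r in pool)
            pick = rng.randint(0, total)
            running = 0
            for record in pool:
                running += record.weight
                if running >= pick:
                    ordered.append(record)
                    pool.remove(record)
                    break
    return ordered


if __name__ == "__main__":
    # Made-up records standing in for the answer to an SRV query.
    answers = [
        SrvRecord(20, 0, 389, "dc3.example.com"),
        SrvRecord(10, 60, 389, "dc1.example.com"),
        SrvRecord(10, 40, 389, "dc2.example.com"),
    ]
    for record in order_srv_records(answers):
        print(record.priority, record.weight, record.target)
```

A booting ESXi box doing the equivalent lookup for a VirtualCenter service would follow the same pattern, just with a different (currently non-existent) service name.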

I haven't had enough of a chance to run SCVMM 2008 and Hyper-V through their paces to see if they operate in this manner. If they don't, VMware can consider themselves lucky and would do well to get this functionality into the management layer ASAP (even if it means releasing yet another product with "Manager" in the title :-).

If Microsoft have implemented this kind of functionality however, VMware needs to take notice and respond quickly. Given that the management layer will become more and more important as virtualisation moves into hardware, VMware can't afford to slip on this front.

Congratulations if you made it this far. Hopefully you've enjoyed reading and as always for this kind of post, comments are open!


Duncan said...

someone should create a plugin for this! awesome!

Paul said...

You may want to visit, management software for estates of stateless pxe-booted machines, and dealing with the state-mgmt stuff you mention. While the underlying technology stack largely pre-dates the prevalence of hypervisors, as you note in your blog, the two go together nicely and they are complementary. This positions one to manage compute infrastructure from the perspective of centralized image-libraries and subscriber-schedules time-sharing a unified estate, rather than the traditional archipelago of stateful-DAS.