I manage a fairly large colocated file server for a small business that uses the site as a geographic dump location for video content that gets encoded and delivered to users on the fly through a web interface. The content is all high-definition material, stored unencoded in Transport Stream media format. They keep the content in this format so that they can dynamically serve an array of different encodings depending on the client’s preference. They have a desktop application that their clients can use to connect to the content delivery server and specify the encoding types, aspect ratios, and resolutions that they want. Anybody who has ever worked with transport stream media knows how large these files can get with high-definition content… some of the files are 100GB in size. The content is typically advertising material for clients and needs to be delivered quickly to users across the globe. This is a pretty hefty requirement for a small business, so to manage it they decided to get a souped-up colocated server in Europe that essentially mirrors the production content server here in the US. This gives their three or so clients in Europe near-real-time availability of content, and with a bonded dual gigabit connection, delivery also occurs in near real time. The idea is really a genius, homebrewed, poor man’s proprietary CDN, and I wish that I could take credit for having thought of it.
Without money to rent SAN space at the datacenter, the network engineers for this company decided to stack the HP server full of 2TB hard drives and service all of the content delivery from local disks. This works great, but because it is a rented server, maintenance on the box is next to impossible. And with the requirement of high uptime SLAs (indeed, a moment of downtime can break a small business), we ran into the issue of OS patches and downtime windows. I solved this problem in a relatively creative way, but it involved a little bit of trickery on my part to get it all to work.
My idea was to turn the box into an ESXi server and create four virtual machines on top of the 22TB array (RAID-5; the raw size is 24TB). Three of the VMs would serve as the content delivery servers with shared “physical” access to about 22TB of disk space attached from the primary datastore. In theory, this is a great idea… I was going to cluster the servers and have them use GFS2 for file system locking and clustering, and the fourth VM would be the forward-facing server running an LVS load balancer to balance clients between the three boxes. When we needed a maintenance window for anything, we would just pull one box at a time from the load balancer, round-robin style, perform the software updates, and reactivate it from Piranha. I was on to something good here…
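The rolling maintenance idea can be sketched in a few lines of Python. This is only an illustration of the pull, patch, reactivate loop, not how Piranha or LVS actually does it: the node names and the `update` callback are hypothetical stand-ins.

```python
from collections import deque


def rolling_maintenance(nodes, update):
    """Patch nodes one at a time so the others keep serving traffic.

    `nodes` and `update` are hypothetical: the names stand in for the
    three content delivery VMs and whatever patch routine is used.
    """
    pool = deque(nodes)              # nodes currently behind the load balancer
    history = []
    for _ in range(len(nodes)):
        node = pool.popleft()        # pull one node out of rotation
        history.append((node, list(pool)))  # remember who kept serving
        update(node)                 # patch it while it gets no traffic
        pool.append(node)            # put it back into rotation
    return history


updated = []
history = rolling_maintenance(["cds1", "cds2", "cds3"], updated.append)
for node, serving in history:
    print(f"updating {node}; still serving: {serving}")
```

With three nodes, two are always left in the pool while the third is being patched, which is the whole point of the exercise.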
Then I ran into a real deal-breaker: ESX/i’s VMFS file system does not support logical volumes larger than 2TB minus 512 bytes. When I realized this, it completely blew my mind. In fact, I couldn’t even get the hypervisor to register the logical volume that was larger than 2TB, so any kind of access to the disk was out of the question. I tried many different things, including carving the array into smaller, individual logical volumes, but nothing I did could get a logical volume larger than 2TB to appear. I needed the RAID backend for redundancy, but the HP Smart Array P212 does not let you carve out small logical volumes from a disk array, and with each disk being 2TB in size, the only option that I could really see was to create a logical volume on each disk and a datastore on each of those 2TB logical volumes. I would then have to create a thick-provisioned virtual hard disk on each datastore, maximizing its size, and present them all to each of the three VMs, where I could stripe them into a volume group with LVM. This idea sucks for a number of reasons, not the least of which is that I would lose the backend RAID array, making a single disk failure catastrophic.
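To put numbers on the limit, here is a small Python sketch. It assumes the VMFS limit is measured in binary units (2 TiB minus 512 bytes), that the drives are sold in decimal terabytes, and that the array is twelve 2TB disks in RAID-5; the function names are my own.

```python
TB = 10**12                  # decimal terabyte, as printed on the drive label
TIB = 2**40                  # binary tebibyte (assumed unit of the VMFS limit)
VMFS_MAX_LUN_BYTES = 2 * TIB - 512


def raid5_usable_bytes(disk_count, disk_bytes):
    """RAID-5 gives up one disk's worth of space to parity."""
    return (disk_count - 1) * disk_bytes


def fits_vmfs(volume_bytes):
    """True if a single logical volume is within the VMFS size limit."""
    return volume_bytes <= VMFS_MAX_LUN_BYTES


array_bytes = raid5_usable_bytes(12, 2 * TB)
print("usable RAID-5 space in TB:", array_bytes // TB)
print("whole array fits in one VMFS volume:", fits_vmfs(array_bytes))
print("single 2TB disk fits in one VMFS volume:", fits_vmfs(2 * TB))
```

Under those assumptions the full array is roughly ten times over the limit, while a single disk squeaks in underneath it, which is why the one-datastore-per-disk layout was the only thing the hypervisor would accept.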
So I went back to the drawing board, put my elite and advanced Systems Engineer (I am an Engineer after all, aren’t I?) abilities to work, and hit the Google machine. After hours, and hours, and hours, and hours, and hours of googling around, my conclusion was that nobody on planet Earth had figured this problem out. It simply is not possible. So I had to be a little bit creative with this decision. The idea was simply not worth scrapping because of all of the added benefits that we would get out of a virtualized environment. The owners of the company really like the VMware Infrastructure Client for monitoring the server usage, and I really like being able to do OS updates again! The obvious question comes to the forefront: why not just use Xen or KVM or something simple like that? The simple answer is management eye candy. The owners of the company, and indeed I too, really enjoy the professional and simplistic interface of the VIC; it’s unmatched, in my opinion. A more aggressive answer is that VMware is an immensely mature product, there is a phenomenal amount of community involvement, and management is simple and can be performed by just about anybody (no offense, VCPs). In my experience, Xen and KVM are harder to manage and lack that simple management interface. The “which hypervisor is better” debate is ruthless and never-ending, and I don’t want to go into it. VMware is what management wanted, ’nuff said.
Not being a native VMware guy, I rarely peruse the Virtual Appliance Marketplace, but on this day I felt it necessary. VMware has excellent community involvement, and I figured that I would find my solution in there. Though it didn’t immediately jump off of the page and smack me in the face, I did, in fact, find the answer that I was looking for in the form of a VMware appliance.
OpenFiler is an extremely robust and mature product put together by an amazingly knowledgeable and cooperative team of software developers and engineers, and they provide it free of charge to the community. They have even gone as far as to package it into a 200MB-compressed downloadable VMDK, with all settings preconfigured, so that you can just attach your storage, fire the box up, and be up and running. The idea of a virtualized SAN had never even occurred to me, but after reading more and more about OpenFiler, I knew that it was the solution I was looking for.
I honestly and openly admit that I don’t know a great deal about RAID configurations outside of 0, 1, 10, 5, 50, and 6. Things like stride length, stripe width, and chunk size were all foreign terms to me before I started this adventure. Aligning stripe widths with the chunk sizes of the array and the block sizes of the file system all seemed a bit out of my league, but after reading for days and performing several tests, I soon came to realize the importance of these things. Luckily, I didn’t have to tune any of it by hand: OpenFiler knows what’s best for you, and choosing the defaults for everything when creating the volume group really made the setup easy.
I wish that I had some screenshots to post here on the setup of OpenFiler, but I will do my best to convey the process in words. Let me first begin by explaining what the goal is and end result:
* Take 24TB of physical disks and create a RAID-5 volume
* Take new RAID-5 volume and present to each of the three content delivery VMs over iSCSI
* Layer LVM and GFS2 on top of the volume and cluster it.
For starters, I came across this guide that proved to be an excellent resource for getting the iSCSI part of OpenFiler set up. The GUI for OpenFiler made this process much more pleasant than the alternative of having to set this all up by hand on a dedicated Linux box. Trust me, in this case easier is way better.
Let me take a quick step back and describe how I have the network configured. It was actually a really simple process, and I think that it is easily understood… vmnic0 has the routable IP address that we will be using as our outside interface. vSwitch0 (VM Network) has vmnic0 as its physical interface. vSwitch1 is a virtual switch with no physical interfaces, simulating a 10GbE connection to each of the connected guests. I have a Virtual Machine Port Group configured on vSwitch1 labeled “Internal”. This Port Group represents the internal network that I constructed as part of the grand Virtual Infrastructure. Each of the three content delivery VMs has its NIC configured to be part of the Internal Port Group. They have statically assigned IP addresses on the 10.1.1.0/24 network: 10.1.1.10, 10.1.1.11, and 10.1.1.12. The fourth VM (LVS is what I called it), which will eventually serve as our load balancer between the three boxes, has two network interfaces. One is attached to the VM Network Port Group on vSwitch0; the other is a member of the Internal Port Group. For security purposes, the management interface for the ESXi box is also a member of the 10.1.1.0/24 network (I’ll cover securing ESXi in a later post), and the LVS VM is configured to assume the public IP address of vmnic0 on its eth0 interface. LVS’s eth1 interface is connected to the Internal Port Group and configured with IP address 10.1.1.1. IP forwarding was enabled on the LVS server, and NAT and MASQUERADE iptables rules were set up to allow the three content delivery VMs to use the LVS server as their primary gateway.
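As a sanity check on the address plan, here is a short Python sketch using the standard `ipaddress` module. The addresses are the ones given above; the host labels (`lvs-eth1`, `content1`, and so on) are hypothetical names of my own, and the addresses of the ESXi management interface and the OpenFiler VM are left out because I have not listed them.

```python
import ipaddress

INTERNAL_NET = ipaddress.ip_network("10.1.1.0/24")
GATEWAY = ipaddress.ip_address("10.1.1.1")   # eth1 on the LVS VM

# Host labels are hypothetical; the addresses come from the description above.
HOSTS = {
    "lvs-eth1": "10.1.1.1",
    "content1": "10.1.1.10",
    "content2": "10.1.1.11",
    "content3": "10.1.1.12",
}


def validate(hosts, network):
    """Check that every address is inside the network and used only once."""
    seen = set()
    for name, addr in hosts.items():
        ip = ipaddress.ip_address(addr)
        if ip not in network:
            raise ValueError(f"{name} ({ip}) is outside {network}")
        if ip in seen:
            raise ValueError(f"{name} reuses address {ip}")
        seen.add(ip)
    return sorted(seen)


for ip in validate(HOSTS, INTERNAL_NET):
    role = "gateway" if ip == GATEWAY else "guest"
    print(ip, role)
```

A check like this is cheap to run whenever a new guest is added to the Internal Port Group, since a duplicate or off-subnet static address is an easy mistake to make.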
For simplicity, the OpenFiler server was also configured on the “Internal” Port Group with the LVS server as its primary gateway.
The bare metal:
Prelude: For the purposes of my setup, I am working with twelve 2TB SAS disks physically and directly connected to the server through the RAID controller. Using the RAID configuration BIOS, I created 12 independent 2TB (actually 1.86 when it all worked out) logical volumes, one per disk. I then booted up the hypervisor, connected with VIC, and created a 2TB datastore on each of the 12 independent disks. I labeled the datastores “DEV_SDA”, “DEV_SDB”, “DEV_SDC”, etc. to match how their device names would appear to the OpenFiler server (/dev/sda, /dev/sdb, /dev/sdc, etc.). This would allow me to easily identify a failed drive if one should go out (disk 1 = /dev/sda = “DEV_SDA” datastore, and so forth).
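The naming scheme is easy to generate rather than type out. A minimal Python sketch, assuming the disks enumerate in order as /dev/sda through /dev/sdl (the function name is my own):

```python
import string


def datastore_map(disk_count):
    """Map physical disk number -> Linux device name -> datastore label."""
    if not 1 <= disk_count <= 26:
        raise ValueError("this simple scheme only covers sda through sdz")
    mapping = []
    for index in range(disk_count):
        letter = string.ascii_lowercase[index]
        mapping.append((index + 1, f"/dev/sd{letter}", f"DEV_SD{letter.upper()}"))
    return mapping


for disk, device, datastore in datastore_map(12):
    print(f"disk {disk:2d} = {device} = {datastore}")
```

Keeping the table handy means a failed /dev/sdX reported by the storage VM can be traced straight back to a datastore and a physical slot.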
At this point I have successfully mapped my 22TB (after all was said and done) RAID to each of the three content delivery VMs as a single volume. Tailing /var/log/messages showed the addition of the new disk, and performing an “fdisk -l” also showed the disk there. From here it is simple enough: create the cluster, perform a pvcreate on the device, create a volume group, create the logical volume on top of the volume group, format it for GFS2, and mount it on each of the nodes. I have previously written a guide on how to perform the cluster setup and configuration using GNBD (somewhat of a precursor to iSCSI), and you can follow that guide for setting up your cluster and GFS2 shares.
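The LVM and GFS2 steps can be laid out as a dry run in Python. This is a sketch under assumptions, not my exact commands: the device path, volume group, logical volume, cluster, and file system names are all hypothetical, the cluster itself (and clustered LVM locking) must already be configured, and the script only prints the commands instead of running them.

```python
import shlex


def gfs2_setup_commands(device, vg, lv, cluster, fs_name, journals, mount_point):
    """Build the command list for a shared GFS2 volume (nothing is executed)."""
    lv_path = f"/dev/{vg}/{lv}"
    return [
        ["pvcreate", device],                          # mark the iSCSI disk for LVM
        ["vgcreate", vg, device],                      # volume group on that disk
        ["lvcreate", "-l", "100%FREE", "-n", lv, vg],  # one LV using all the space
        # lock_dlm is the cluster lock protocol; one journal per mounting node
        ["mkfs.gfs2", "-p", "lock_dlm", "-t", f"{cluster}:{fs_name}",
         "-j", str(journals), lv_path],
        ["mount", "-t", "gfs2", lv_path, mount_point],  # repeat on every node
    ]


# All names below are placeholders for illustration only.
commands = gfs2_setup_commands(
    device="/dev/sdb", vg="vg_content", lv="lv_content",
    cluster="contentcluster", fs_name="content", journals=3,
    mount_point="/mnt/content",
)
for cmd in commands:
    print(" ".join(shlex.quote(part) for part in cmd))
```

The create and format steps run once from a single node; only the mount is repeated on each of the three content delivery VMs, which is why the journal count is set to three.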
This is a fairly complex means to an end, but the scalability and manageability of the infrastructure are irreplaceable. Indeed, my contract with this company has ended, and I was easily able to hand off the management to their existing Network Engineer (less technically inclined) with no worries or issues. I check in with them periodically and they have nothing but praise for the setup.
Feel free to email me with any questions……… email@example.com