As I mentioned previously, when I started with AgileZen back in April, the app was hosted on a single powerful machine (Prometheus). While this setup handled our load well enough, it was becoming apparent that we would soon outgrow that machine, so it was time to decide how we wanted to scale. Certainly we could have just moved to a more powerful machine, but for real internet scale and growth you need to scale horizontally, and the first step was to figure out how to get a private cluster running.
Originally our plan was to set up a VMware vSphere cluster on SoftLayer. The only real reason we were considering this was that our parent company, Rally, hosts their software on VMware, which meant we could take advantage of their knowledge of that platform. In retrospect, I’m pretty happy we didn’t go this way. We would have gotten some fault tolerance in that our VMs would be redundant, allowing fun things such as auto-deployment, but there would still only be 2 or 3 physical machines behind all the VMs. As such, the reliability isn’t quite the same as being backed by all of AWS.
There is also the complexity of vSphere. As AZ’s primary devops guy I had the pleasure of going through all the pain of figuring out how to set it up, including the wacky config necessary to provision VMs on a redundant SAN, and all the licensing BS that goes along with it (it is not at all clear exactly what you need to license in order to use tools such as vMotion). Honestly, the answers are all pretty freely available on the internet, but the reality is that, the way they have licensing set up, you probably won’t know what questions to ask in order to find the answers you need. Nate and I ended up spending at least 2 weeks working with SoftLayer trying to get our vSphere cluster up and running, and all the while we kept running into gotchas that meant it was going to be a lot more expensive than we thought. Finally at some point we said “screw it” and decided to give AWS a shot.
Enter AWS. AWS is not at all simple either, frankly. However, there is a lot of very good documentation and the pricing is extremely clear-cut*. It took me weeks to figure out vSphere; it only took me a few days to get all of our machines set up and running in EC2**.
Now, vSphere has some advantages over AWS, sure. The main one that comes to mind for me is that with vSphere you can snapshot or make a template from a VM while the VM is running, and you cannot do this with EC2. Truth be told, the CPU hit to do so in vSphere is large enough that you really shouldn’t do it anyway, so even that is of little real value. There are also some oddities with AWS that caused me some heartburn. For example, with vSphere I was able to set up a virtual IP for failover between a couple of nginx proxy machines in front of the IIS front-end machines. EC2 doesn’t have VIPs, so I had to set up an Elastic Load Balancer instead. Seems straightforward, right? Except Amazon won’t give you, or allow you to set up, a static IP for use with the ELB. The way they have it set up, you can only CNAME your domain to the ELB. But wait! You can’t CNAME your domain’s apex record; you can only give it an A record, which requires an IP address. Amazon’s only solution for this is to host your DNS with them (Route 53), in which case they abstract away the A records behind an “A (Alias)” record. Oh, and guess what? There is no web interface for Route 53; you have to use the web service for it directly. If you find yourself in this position, check out R53 Fox.
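To make the apex workaround a bit more concrete, here is a minimal sketch of what an “A (Alias)” change for the zone apex looks like when you talk to the Route 53 web service yourself. This is not what we actually ran; it is an illustration in Python, and the domain, ELB hostname, and zone IDs are all made-up placeholders. It only builds the request payload, and the comment at the bottom assumes you would submit it with boto3.

```python
import json


def apex_alias_change(domain, elb_dns_name, elb_hosted_zone_id):
    """Build a Route 53 change batch that aliases a zone apex to an ELB.

    An alias record is type "A" but carries an AliasTarget instead of
    an IP address and a TTL, which is how it sidesteps the
    "no CNAME at the apex" rule.
    """
    return {
        "Comment": "Point the zone apex at the load balancer",
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": domain,
                    "Type": "A",
                    "AliasTarget": {
                        # The ELB's own canonical hosted zone ID,
                        # not the ID of your domain's zone.
                        "HostedZoneId": elb_hosted_zone_id,
                        "DNSName": elb_dns_name,
                        "EvaluateTargetHealth": False,
                    },
                },
            }
        ],
    }


# Hypothetical values, for illustration only.
batch = apex_alias_change(
    domain="example.com.",
    elb_dns_name="my-elb-1234567890.us-east-1.elb.amazonaws.com.",
    elb_hosted_zone_id="ZELBEXAMPLE",
)
print(json.dumps(batch, indent=2))

# Assuming boto3 is installed and credentials are configured, the batch
# could be submitted like this (ZMYZONEEXAMPLE is a placeholder):
#   boto3.client("route53").change_resource_record_sets(
#       HostedZoneId="ZMYZONEEXAMPLE", ChangeBatch=batch)
```

Note that the record has no IP address and no TTL anywhere in it; Route 53 resolves the alias to whatever addresses the ELB currently has, which is why this only works when Amazon hosts your DNS.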
* At least on the surface. What I wasn’t expecting was exactly how much EBS storage we would use. Thankfully EBS is extremely cheap so it didn’t inflate our bill all that much beyond my projections.
** At this point we were looking at one powerful front end connecting to a few replicated SQL servers, plus an ejabberd machine for notifications. I’ll talk about architecture changes next, promise.
