It happened again. An EC2 instance with an EBS root volume just died. I’m pretty sick of it and can’t wait to migrate everything away from EC2. It’s been a shitty seven days or so for our infrastructure. This was our primary MongoDB server for Postmark, running on an m2.4xlarge instance. Thankfully we use replica sets. My main concern was that the Mongo data was on four attached EBS volumes in RAID10. I managed to recover it after a bit of searching. Here is the process, mostly for my own notes.
Since we launched Beanstalk we’ve hosted at Media Temple, Softlayer, Engine Yard and Rackspace. Along the way we’ve gone from relying on others to plan our architecture to controlling everything on our own. We’ve been at Rackspace for almost two years, starting as a Managed customer and migrating to what they call Managed Colo, where they provide the servers and network and we do the rest.
In the past six months we’ve been doing some research on alternative hosting providers. We’ve experimented with AWS, which is where some of our staging and deployment servers reside, and with Softlayer, which offers some amazing prices and customization options. In looking at Rackspace, Softlayer and EC2, here is what I have found.
I learned this the hard way. We had MongoDB on top of software RAID0 EBS volumes. The latency on EBS was so bad that it would sometimes take hours to access data. It took a while to figure out (and some guidance from 10gen). Since we moved to RAID10, Mongo has been performing extremely well. Next step: get the hell away from AWS and on to some real hardware with SSDs.
I’ve been doing some more testing since my last post. It turns out that since Softlayer’s network is shared, the throughput is about 60% of the nominal 10Gbit. Still better than 1Gbit for sure. I ran some more tests using the same tar extraction of the linux kernel, but this time with more SSDs for the log devices. We are using the Micron drives, which are rated at 45K write IOPS each.
- No SSD: 3 min 23 seconds
- Single Micron SSD mirror: 1 min 30 seconds
- Two Micron SSD mirrors: 1 min 15 seconds
- Single Intel X25-E SSD: 1 min 30 seconds
- Single Micron SSD: 1 min 16 seconds
- Four Micron SSD no mirroring: 56 seconds
Of course, the last one is not recommended, because if a log device fails you are in trouble. For now I think I might just stick with a single SSD mirror and add more if write demand grows. It’s nice to see that the Micron SSDs do in fact outperform the Intel.
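To put the timings above in perspective, here is a small Python sketch that converts them to seconds and computes the speedup of each log device setup relative to the run without an SSD. The timings are copied from the list; the function and variable names are my own and purely illustrative.

```python
# Timings copied from the list above, as (minutes, seconds).
RESULTS = {
    "No SSD": (3, 23),
    "Single Micron SSD mirror": (1, 30),
    "Two Micron SSD mirrors": (1, 15),
    "Single Intel X25-E SSD": (1, 30),
    "Single Micron SSD": (1, 16),
    "Four Micron SSD no mirroring": (0, 56),
}


def to_seconds(minutes, seconds):
    """Convert a (minutes, seconds) timing to total seconds."""
    return minutes * 60 + seconds


def speedups(results, baseline="No SSD"):
    """Return how many times faster each setup is than the baseline run."""
    base = to_seconds(*results[baseline])
    return {name: base / to_seconds(*t) for name, t in results.items()}


if __name__ == "__main__":
    for name, factor in speedups(RESULTS).items():
        print(f"{name}: {to_seconds(*RESULTS[name])} s, {factor:.2f}x")
```

Looking at the ratios rather than the raw times makes it easier to judge whether a second mirror is worth the extra drives.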
I’ve been playing around with Nexenta on some new servers at Softlayer. Why Softlayer? Well, by hosting in their new Dal05 data center I can easily get Micron SSD and 10Gbit-E (public and private) for a great price. Since I use ZFS and NFS, both of these options can go a long way for performance.
Installing Nexenta was pretty easy. I downloaded the latest ISO from Nexentastor and booted it from IPMI using Softlayer’s lockbox. As for the hardware, I am running it on the 56xx processor, 48GB RAM, Adaptec 5805Z controllers, SSDs for the boot, ZIL, and L2ARC and a bunch of 2TB disks for storage.
As I said, installing was the easy part. The real trick is to run through some performance tests to see how it performs. On our current servers at Rackspace we are not using SSDs for the ZIL, which hurts us when it comes to write performance over NFS. The reason is how NFS commits writes: it effectively calls fsync(), so every write has to reach stable storage before it is acknowledged. A more detailed description about the ZIL can be found here and here. A good test, for me at least, is to simply extract a large set of files to an NFS share and time how long it takes. I grabbed a tar of the linux kernel and went to work.
Before I started running the tests, I wanted to make sure the settings were correct. When it comes to ZFS, this can be a pain. Not only do you have a lot of options in ZFS, but you also need to figure out how ZFS should interact with your storage. For instance, if you are using a RAID controller with write cache and a BBU, you might need to adjust ZFS properties. Here are some of the settings I used and the results of the tests.
The basic test
My control for the test was to extract the linux kernel from local disk onto an NFS share. Before doing this, I also tested what the performance was like when extracting to local disk (an SSD mirror) and to our existing system that does not have SSDs for ZIL.
- Extract tar to local disk: 5.5 seconds
- Extract tar to NFS on old server: 6 min 26 seconds
- Extract tar to NFS on new server: Almost useless (read below)
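To get a feel for why synchronous commits dominate a test like this, here is a minimal Python sketch that writes a batch of small files twice: once buffered, and once with fsync() after every file. It is not the tar test from this post, just a rough stand-in; the function names, file count and file size are my own assumptions, and by default it writes to a local temporary directory rather than an NFS share.

```python
import os
import tempfile
import time


def write_small_files(target_dir, count=200, size=4096, sync=False):
    """Write `count` small files into target_dir and return elapsed seconds.

    With sync=True each file is fsync()ed before moving on, which roughly
    imitates a client that waits for stable storage on every file.
    """
    payload = b"x" * size
    start = time.perf_counter()
    for i in range(count):
        path = os.path.join(target_dir, f"file_{i:05d}.dat")
        with open(path, "wb") as fh:
            fh.write(payload)
            if sync:
                fh.flush()
                os.fsync(fh.fileno())
    return time.perf_counter() - start


def compare(target_root=None):
    """Time buffered vs. synced writes under target_root (a temp dir if None)."""
    results = {}
    for label, sync in (("buffered", False), ("synced", True)):
        with tempfile.TemporaryDirectory(dir=target_root) as tmp:
            results[label] = write_small_files(tmp, sync=sync)
    return results


if __name__ == "__main__":
    for label, seconds in compare().items():
        print(f"{label}: {seconds:.3f} s")
```

Passing a directory on a mounted share as target_root (a hypothetical path such as /mnt/nfs) should show the same effect over the network, where a fast log device has the most room to help.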
Disabling cache flushing
When you first set up Nexenta, it gives an option to disable cache flushing for your pools. Usually ZFS is used with JBOD and no RAID controllers. Since we are using the Adaptec controllers, we already have write cache and flushing on the hardware, along with a battery backup (and UPS) in case of power loss. In some cases the controller can ignore the default cache flush requests from ZFS, but in our case it does not. When I tried to extract the tar over NFS it was so slow I had to kill the process. The easy fix is to set zfs_nocacheflush to yes in the appliance preferences.
- Extracted tar to NFS: 54 seconds
Turning off the controller cache
A lot of times you will see recommendations to turn off the write cache on the controller if you use ZFS. ZFS was designed to work with cheaper hardware and to manage data integrity in the file system itself. I figured it would be worth testing this, so I went into the controller settings, disabled write and read caching for all of the disks and turned on cache flushing in ZFS again. I then ran the tests again.
- Extracted tar to NFS: 2 min 30 seconds
I was actually surprised to see this. I figured that with ZFS handling everything the speed would be about the same. From what I have read the Adaptec Controller is pretty good, so it seems worth it to use the NVRAM write cache of the hardware.
Disabling ZIL (don’t)
If you have ever read the Evil Tuning Guide you know that one option for great NFS performance is to disable the ZIL. Essentially this will bypass the data protection that ZFS provides in exchange for better performance. It allows the NFS client to write files without acknowledgment that they have been written to stable storage. I wanted to test this out of curiosity to see what the difference was.
- Extracted tar to NFS: 30 sec
That is by far the fastest I can get it over NFS, but not worth it at all for our system. Data integrity and protection are actually more important than uptime or performance, so I’m not willing to take any risks. Instead, we can rely on the Micron SSDs for log devices.
Conclusion
For now I am going to settle on having a very fast network combined with disabling cache flushing in ZFS. While it is nowhere near the write speed of local disk, it’s pretty good considering the test is extracting thousands of small files. When it comes to NFS and shared file systems the overhead is mostly in the protocol. For us, problems really happen when we have spikes in activity or heavy load, so these faster disks combined with a better network should really help. I’ll have to run some load tests next.
Filed under nfs zfs softlayer nexenta solaris
I did a quick test on a new Dell 100GB SSD drive on one of our storage servers. I had to fake the RAM in bonnie because the drive is smaller than our RAM (72GB).
bonnie++ -d /ssd/bonnie -s 44g -r 22g -n 0 -m ssd -f -b -u root
Using uid:0, gid:0.
Writing intelligently...done
Rewriting...done
Reading intelligently...done
start 'em...done...done...done...
Version 1.03        ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
ssd             44G           181336 30 106424 29           251624 26   6480 24
ssd,44G,,,181336,30,106424,29,,,251624,26,6479.9,24,,,,,,,,,,,,,
In short, 181MB/sec block write, 251MB/sec block read and 6480 random seeks per second. This seems right on par with the Intel SSDs.
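The comma-separated summary line that bonnie++ prints is easier to work with than the table when comparing several runs. Here is a minimal Python sketch that parses it. The field names are my own labels, and the column order is inferred by matching the CSV values against the table above for version 1.03; other bonnie++ versions may use a different layout. The MB/sec conversion divides by 1000, which matches the rounding used in this post.

```python
import csv
import io

# Labels for the first 14 fields of the bonnie++ 1.03 summary line.
# Hypothetical names; order inferred from the table in this post.
FIELDS = [
    "name", "size",
    "putc_kps", "putc_cpu",
    "write_kps", "write_cpu",
    "rewrite_kps", "rewrite_cpu",
    "getc_kps", "getc_cpu",
    "read_kps", "read_cpu",
    "seeks_per_sec", "seeks_cpu",
]

# Summary line copied from the run above.
LINE = "ssd,44G,,,181336,30,106424,29,,,251624,26,6479.9,24,,,,,,,,,,,,,"


def parse_bonnie_csv(line):
    """Parse one summary line; skipped tests (empty fields) become None."""
    row = next(csv.reader(io.StringIO(line)))
    record = dict(zip(FIELDS, row))
    out = {"name": record["name"], "size": record["size"]}
    for key in FIELDS[2:]:
        out[key] = float(record[key]) if record[key] else None
    return out


if __name__ == "__main__":
    result = parse_bonnie_csv(LINE)
    print(f"write:   {result['write_kps'] / 1000:.0f} MB/sec")
    print(f"rewrite: {result['rewrite_kps'] / 1000:.0f} MB/sec")
    print(f"read:    {result['read_kps'] / 1000:.0f} MB/sec")
    print(f"seeks:   {result['seeks_per_sec']:.0f} /sec")
```

The per-character fields come back as None here because the -f flag skips those tests.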
I thought I was going to talk about interesting stuff until I just discovered this blog. The company AnandTech decided to go with ZFS for their storage and blogged the entire way about performance and architecture. Really thorough and lots of great info:
http://www.zfsbuild.com/
Filed under zfs
We’ve been thinking about Amazon EC2 for our apps. We love using Rackspace and having full control, but in many cases sales is slow and servers are overpriced (or oddly priced I should say). With Beanstalk, everything is about I/O. We rely on it for SVN and Git, backups of terabytes of data and over 6000 deployment checkouts per day. It has to be fast.
In our Rackspace environment we use Nexenta, which is basically some software built on top of Solaris to make it a storage appliance. Right now it is a set of two Dell R710 servers (8 cores, 72GB RAM each) with a DELL MD3000 + MD1000 direct attached storage device, holding up to 45 x 300GB SAS drives. It’s a beast and serves us well. We use NFS to share the storage across our machines on a dedicated LAN.
Moving to EC2 would mean either local storage with block level replication or EBS. So I set out to test a bunch of scenarios. Keep in mind, this is not really my area of expertise, so when it comes to the test plan it might not be perfect. For me, it was a good overview and experiment.
We’ve found NFS4 to perform much better with lots of small files compared to NFS3. Which is strange, because most people I ask (experts) say that NFS3 is usually preferred unless you need rigid ACLs. Whatever, all I know is that we get better performance from NFS4. The thing is, NFS4 documentation sucks, so I want to cover some of the steps that I used to get it working on EC2.
I am using an OpenSolaris instance with EBS. If you know ZFS, you know that setting up an NFS server is as easy as zfs set sharenfs=on myfs and you are done. With that, you can easily connect to the server from an NFS client using v3. Getting NFS4 to work is a bit more complicated.
NFS4 uses domains to map users and groups between client and server. If you mount NFS4 from a client whose domain does not match the server’s, you will run into permission issues. To fix this, you need to do a few things. We use Debian or Ubuntu, so I am only familiar with the commands there.
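As a rough illustration of the domain issue (not the steps from this post), here is a small Python sketch that reads the ID mapping domain from a Linux client’s idmapd.conf and compares it with the domain the server is expected to use. It assumes the client keeps the setting under a Domain key in the [General] section of /etc/idmapd.conf, which is the usual layout on Debian and Ubuntu; the function names and the example.com domain are hypothetical.

```python
import configparser


def idmap_domain(conf_text):
    """Return the Domain value from idmapd.conf text, or None if unset."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # Option lookup is case-insensitive; section names are not.
    return parser.get("General", "Domain", fallback=None)


def domains_match(conf_text, server_domain):
    """True if the client's configured domain equals the server's domain."""
    client_domain = idmap_domain(conf_text)
    if client_domain is None:
        return False
    return client_domain.strip().lower() == server_domain.strip().lower()


if __name__ == "__main__":
    # Hypothetical client configuration, for illustration only.
    sample = "[General]\nVerbosity = 0\nDomain = example.com\n"
    print(idmap_domain(sample))
    print(domains_match(sample, "example.com"))
```

When the two sides disagree, a typical symptom is files showing up as owned by nobody on the client.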
Filed under NFS EC2
I’ve been researching which load balancing solution we might use for Postmark once we complete the move into EC2. There are a ton of options like Nginx, HAProxy, ldirectord, AWS’s Elastic LB and even DNS round robin. Initially I was thinking of using DNS RR since we use Dynect for DNS. They have a nice tool for distributing load and doing health checks. If you set a low TTL you can have minimal downtime with a very simple (and fast) LB approach. The problem is their minimum TTL is 30 seconds, which is way too long considering the number of requests we handle. That would mean 30 seconds of lost emails, not good.
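To make the TTL problem concrete, here is a back-of-the-envelope Python sketch of how many requests could hit a dead backend before cached DNS answers expire. It assumes traffic is spread evenly across the round robin records and that the failed record is pulled immediately, so the TTL is the whole exposure window. The request rate and backend count are made-up numbers, not Postmark’s real traffic.

```python
def requests_at_risk(requests_per_second, ttl_seconds, backends, failed=1):
    """Estimate requests sent to failed backends during one TTL window.

    Assumes an even spread across `backends` round robin records and that
    clients keep using a cached answer for up to `ttl_seconds`.
    """
    if backends <= 0:
        raise ValueError("backends must be positive")
    if not 0 <= failed <= backends:
        raise ValueError("failed must be between 0 and backends")
    share = failed / backends
    return requests_per_second * ttl_seconds * share


if __name__ == "__main__":
    # Hypothetical load: 100 requests/sec across 2 backends, 30 second TTL.
    print(requests_at_risk(100, 30, 2))
```

A load balancer with its own health checks shrinks that window to the check interval instead of the DNS TTL.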
I found some nice articles/threads about load balancing as well.
The Rightscale benchmark is really nice. I think we might try ELB for now. In Dynect, we could have either DNS Failover or Round Robin with Failover to point to multiple ELBs in different zones. Eventually I want to use the Real Time Traffic Management from Dynect so we can route people to the closest zone.
Filed under AWS load balancers