While considering different options for a database server, I decided to do some digging into Amazon Web Services (AWS) as an alternative to dedicated servers from an ISP. I was most curious about the I/O of the Elastic Block Storage (EBS) on the Elastic Compute Cloud (EC2). What I tested was a number of different file systems (EXT3, JFS, XFS, ReiserFS) as single block devices, and then some different software RAID configurations leveraging JFS. The tests were run using Bonnie++.
The configuration was vanilla; no special tuning was done, just the default values assigned by the tools. I used Fedora Core 9 as the OS from the default Amazon AMI and used "yum install" to acquire the necessary utilities (more on that below). I expect that with further tuning, some increases in performance can still be obtained. I used the small instance for cost reasons, which includes "moderate" I/O performance. Running on a large or extra-large standard instance should perform even better with "high" I/O performance. You can get all the instance details from Amazon.
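For reference, installing and running the benchmark might look like this (a minimal sketch, assuming the bonnie++ package is available in your configured yum repositories and that the volume under test is mounted at /vol):

```
# install the benchmark utility
yum install -y bonnie++

# run against the mounted EBS volume; -d is the test directory, -u the user to run as
bonnie++ -d /vol -u root
```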
First I wanted to determine what the EBS devices would compare to in the physical world. I ran Bonnie against a few entry-level boxes provided by a number of ISPs and found the performance roughly matched a locally attached SATA or SCSI drive when formatted with EXT3. I also found that JFS, XFS and ReiserFS performed slightly better than EXT3 in most tests except block writes.
The Numbers
Again, let me reiterate that these numbers may not be accurately reflected in your production environment. Amazon states that small instances have "moderate" I/O availability. Presumably, if you're running this for a production DB, you'll want to consider a large or extra-large instance for the memory, so you should see slightly better performance from your configuration. Also note that the drives I allocated were rather small (to keep testing costs low), so you may experience different results with larger capacities.
Note: The figures in the table below are in KB/sec.
Size (Filesystem) | Output Per Char | Output Block | Output Re-write | Input Per Char | Input Block |
---|---|---|---|---|---|
4×5Gb RAID5 (JFS) | 22,349 | 58,672 | 39,149 | 25,332 | 84,863 |
4×5Gb RAID0 (JFS) | 24,271 | 99,152 | 43,053 | 26,086 | 96,320 |
10Gb (XFS) | 20,944 | 43,897 | 24,386 | 25,029 | 65,710 |
10Gb (ReiserFS) | 22,864 | 57,248 | 17,880 | 21,716 | 44,554 |
10Gb (JFS) | 23,905 | 47,868 | 21,725 | 24,585 | 55,688 |
10Gb (EXT3) | 22,986 | 57,840 | 22,100 | 24,317 | 48,502 |
As expected, RAID 0 does best with read/write speed, and RAID 5 does very well on reads (input block) as well. For InnoDB, the re-write and block read (input)/write (output) operations are the most critical values; higher numbers are better. To better understand what the test is doing, be sure to read the description of each field in the Bonnie++ documentation.
Making Devices
The process for making a device is simple. There are many tutorials on how to make this persistent, and you can certainly build this into your own AMI when you're done; this is not a tutorial on how to do that. To get a volume up and running you'll follow these basic steps (a command-line sketch follows the list):
- Determine what you want to create – capacity, filesystem type etc.
- Allocate EBS storage
- Attach the EBS storage to your EC2 instance
- If using RAID, create the array from the attached volumes
- Format the filesystem
- Create the mount point on the instance filesystem
- Mount the storage
- Add any necessary entries to mount storage at boot time.
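Roughly, the command-line version of those steps looks like this (a minimal sketch, assuming the EC2 API tools are installed, a 10GB volume in us-east-1a, the device name /dev/sdh, a /vol mount point and JFS; substitute your own IDs, zone, size and filesystem):

```
# allocate EBS storage (returns a volume ID such as vol-xxxxxxxx)
ec2-create-volume --size 10 --availability-zone us-east-1a

# attach the volume to your running instance as /dev/sdh
ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdh

# format it, create a mount point and mount it
mkfs.jfs -q /dev/sdh
mkdir -p /vol
mount /dev/sdh /vol

# optionally add an fstab entry so it mounts at boot once the volume is attached
echo "/dev/sdh /vol jfs noatime 0 0" >> /etc/fstab
```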
Single Disk Images
Remember, the speed and efficiency of the single EBS device is roughly comparable to a modern SATA or SCSI drive. Use of a different filesystem (other than EXT3) can increase different aspects of drive performance, just as it would with a physical hard drive. This isn’t a comparison of the pros and cons of different engines, but simply providing my findings during testing.
JFS | yum install jfsutils |
XFS | yum install xfsprogs |
ReiserFS | yum install reiserfs-utils |
I didn't test any other filesystems such as ZFS, because I've read that some of them are unstable on Linux, and since I'll be running production on Linux the extra testing time seemed unnecessary. I am interested in other alternatives that could increase performance, so if you have any to share I'd love to hear about them.
You can quickly get a volume setup with the following:
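A minimal sketch (assuming the volume shows up as /dev/sdh and a mount point of /vol; pick the mkfs that matches the filesystem you installed above):

```
# format with the filesystem of your choice
mkfs.jfs -q /dev/sdh        # JFS
# mkfs.xfs /dev/sdh         # XFS
# mkfs.reiserfs /dev/sdh    # ReiserFS
# mkfs.ext3 /dev/sdh        # EXT3

# create the mount point and mount the volume
mkdir -p /vol
mount /dev/sdh /vol
```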
Next time you mount the volume, you won’t need to use “mkfs” because the drive is already formatted.
RAID
The default AMI already includes support for RAID, but if you need to add it to your yum-enabled system, it's "yum install mdadm". On the Fedora Core 9 test rig I was using, RAID 0, 1, 5 and 6 were supported; YMMV.
To create a 4 disk RAID 0 volume, it’s simply:
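For example (assuming the four volumes are attached as /dev/sdh through /dev/sdk; adjust the device names to match your setup):

```
# create a 4-disk striped (RAID 0) array as /dev/md0, then format and mount /dev/md0 as usual
mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/sdh /dev/sdi /dev/sdj /dev/sdk
```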
To create a 4 disk RAID 5 volume instead, it’s simply:
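And similarly, with the same assumed device names:

```
# same four devices, but one volume's worth of capacity goes to parity
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sdh /dev/sdi /dev/sdj /dev/sdk
```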
This example assumes you have 4 EBS volumes attached to the system. AWS shows 7 possible mount points (/dev/sdf through /dev/sdl) in the web console; however, the documentation states you can use devices through /dev/sdp, which is 11 EBS volumes in addition to the non-persistent storage. That would be a theoretical maximum of 10TB of RAID 5 or 11TB of RAID 0 storage!
Also note: the instance has to re-attach its EBS volumes after a reboot, so you need to reassemble the RAID array once the volumes are attached again; after the attach commands, add something like:
mdadm --assemble --verbose /dev/md0 /dev/sdh /dev/sdi /dev/sdj
For details, see mdadm --help or mdadm --assemble --help.
Note: when creating a volume from a snapshot, the size must match the original volume exactly; otherwise mdadm will not find the superblock when assembling the array:
Checking whether a superblock exists:
mdadm: looking for devices for /dev/md1
mdadm: no RAID superblock on /dev/sdi
mdadm: /dev/sdi has no superblock - assembly aborted
You can inspect an individual device or the assembled array with:
mdadm --examine /dev/sdi
mdadm --detail /dev/md0
[root@domU-12-31-39-00-74-01 /]# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Wed Aug 12 11:31:45 2009
     Raid Level : raid5
     Array Size : 220200768 (210.00 GiB 225.49 GB)
  Used Dev Size : 73400256 (70.00 GiB 75.16 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1
    Persistence : Superblock is persistent
    Update Time : Wed Aug 12 13:05:45 2009
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1
         Layout : left-symmetric
     Chunk Size : 64K
 Rebuild Status : 98% complete
           UUID : f640874b:6fbc1d8d:c629de9f:223f899b
         Events : 0.10

    Number   Major   Minor   RaidDevice State
       0       8      176        0      active sync   /dev/sdl
       1       8      192        1      active sync   /dev/sdm
       2       8      208        2      active sync   /dev/sdn
       4       8      224        3      spare rebuilding   /dev/sdo
[root@domU-12-31-39-00-74-01 /]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdo[3] sdn[2] sdm[1] sdl[0]
220200768 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
Checking in on things…
- cat /proc/mdstat is a great way to check in on the RAID volume. If you run it directly after creating a mirrored or striped array, you'll also be able to see the scrubbing process and how far along it is.
- mount -l shows the currently mounted devices and any options specified.
- df (disk free) provides a nice list of device mounts and their total, available and used space.
Conclusion
It's clear from the numbers that software RAID offers a clear performance advantage over a single EBS volume. Since with EBS you pay per GB, not per disk, it's certainly cost effective to create a robust RAID volume. The question that remains is how careful you need to be with your data. RAID 0 offered blisteringly fast performance but, like a traditional array, without redundancy. You can always set it up as RAID 5, RAID 6 or RAID 10, but this of course sacrifices more raw space to handle the redundancy.
Since the volumes on EBS are theoretically invincible, it may be okay to run unprotected by a mirror or parity drive; however, I haven't found anyone who would recommend this in production. If anyone knows of a good reason to ignore the safety of RAID 10, RAID 6 or RAID 5, I'd love to hear the reasoning.
I am also curious whether these drives maintain a consistent throughput over the full capacity of the disk, or whether they slow down as the drive fills like a traditional drive. I did not test this; it remains open for another test (and subsequent blog post). Should anyone run this against a 100GB+ drive and figure it out, please let me know.
Fine Print – The Costs
Storage starts at $0.10/GB-month, which is reasonable and is prorated for only the time you use it. A 1TB RAID 0 volume made of 10×100GB volumes would only cost $1,200 per year. Good luck getting performance/dollar for 1TB like that from any SAN solution at a typical ISP. There are, however, some hidden costs in the I/O that you'll need to pay attention to. Each time you read or write a block to disk, there's an incremental cost. The pricing is $0.10 per million I/O requests, which seems cheap, but just running the simple tests I ran with Bonnie++ consumed almost 2 million requests in less than 3 hours of instance time. If you have a high number of reads or writes, which you likely do if you're considering reading this, you'll need to factor these costs in.
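For reference, the back-of-the-envelope math using the prices quoted above (the request count is the rough figure from my Bonnie++ runs):

```
# storage: 10 x 100GB = 1,000GB
#   1,000GB x $0.10/GB-month = $100/month, roughly $1,200/year
# I/O: 2,000,000 requests x ($0.10 / 1,000,000) = $0.20
```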
The total AWS cost for running these tests was $0.71 of which $0.19 were storage related. The balance was the machine instances and bandwidth.
Resources
A sample of extended iostat output (iostat -x) for reference; the fields are explained below:

Device:  rrqm/s  wrqm/s     r/s    w/s   rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb        0.00    0.00    0.00   0.00     0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdc        0.00    0.00    0.00   0.00     0.00    0.00      0.00      0.00   0.00   0.00   0.00
sda1       0.00    0.00    1.00   0.00     8.00    0.00      8.00      0.00   2.00   2.00   0.20
sdg        1.00    0.00  160.00  13.00  1608.00   74.00      9.72      2.52  14.81   5.62  97.30
rrqm/s and wrqm/s: read (write) requests merged per second. These numbers indicate the number of requests destined for the device that were able to be merged before submission to the device. Requests can be merged if they are contiguous. These numbers are not super relevant to diagnosing performance issues.
r/s and w/s: reads (writes) per second. This is the number of reads and writes completed (not submitted) per second during the reporting period. Looking at these numbers tells you the rate at which EBS is servicing your I/O requests, but if they drop, it doesn't really tell you much. They could drop because EBS is having problems, or they could drop because your application is submitting few requests.
rsec/s and wsec/s: read (written) sectors per second. This is the number of 512 byte sectors read or written per second during the reporting period. Dividing this number by reads/writes per second gives you the average request size.
avgrq-sz: average request size. This number (combined for reads and writes), or the equivalent number for reads and writes computed as described above gives you an idea of how random your I/O is. In general if this number is below 16 (16 * 512 bytes = 8KB), you are doing extremely random I/O. The max you should ever see is 256, as 128KB is the maximum I/O request size by default for Linux. If this number is low (<50), you are going to be IOPS limited. If it's high (>100), you are likely to be bandwidth limited.
avgqu-sz: average queue size. This indicates how many requests are queued waiting to be serviced. The maximum number this can be is found in /sys/block/
await: average wait. The average amount of time the requests that were completed during this period waited from when they entered the queue to when they were serviced. This number is a combination of the queue length and the average service time. It's usually more revealing to look at them separately.
svctm: service time. The average amount of time the requests that were completed during this period waited from when they were submitted to the device to when they were serviced.
util: device utilization. I believe that it's the percentage of the reporting period in which the queue was not empty. I rarely use this number for diagnostic purposes, as the other numbers tell the story.
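As a rough worked example against the sdg row in the sample output above (the 5-second interval is just an arbitrary choice for watching the numbers over time):

```
# collect extended stats every 5 seconds
iostat -x 5

# sdg: rsec/s = 1608, r/s = 160
#   1608 / 160 ~= 10 sectors per read, i.e. 10 x 512 bytes ~= 5KB per request
# iostat reports avgrq-sz of 9.72 for the combined read/write figure, which is
# below 16 (8KB), so this is a very random, IOPS-bound workload
```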
So where does that leave us? Whenever you reduce the number of TPS in your application, you're (almost certainly) going to reduce the number of IOPS in your EBS volume. The question always is, "which is cause and which is effect?" Is the application slowing down because the EBS volume is slowing down, or is the EBS volume doing less work because the application is presenting less work for it to do?
To figure out the answer, you have to pick apart the numbers a little. To first order, the most important number is the svctm. In general, this number should be below 100ms, and it's usually much below. For read-dominated work loads, I would expect to usually see this number in the 10-20ms range and for write-dominated work loads, it could be as low as single-digits. However, EBS is a shared resource; under periods of high load as I wrote above, it could be operating normally with service times in the 100ms range.
If the svctm looks good but your application still is running slower than you expect, the other numbers can help diagnose why. If avgqu-sz gets big (>30), your application is submitting more requests per second than the volume can handle. The solution here is to stripe across multiple EBS volumes using LVM or RAID-0.
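If you go the LVM route, a minimal sketch might look like this (assuming four EBS volumes attached as /dev/sdh through /dev/sdk, a 64KB stripe size, and hypothetical volume group and logical volume names; adjust to taste):

```
# initialize the volumes for LVM and group them
pvcreate /dev/sdh /dev/sdi /dev/sdj /dev/sdk
vgcreate ebsvg /dev/sdh /dev/sdi /dev/sdj /dev/sdk

# -i 4 stripes across all four physical volumes, -I 64 sets a 64KB stripe size
lvcreate -i 4 -I 64 -l 100%FREE -n ebslv ebsvg

# format and mount like any other block device
mkfs.jfs -q /dev/ebsvg/ebslv
```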
If svctm looks good and avgqu-sz is low, either there's something else wrong with your application (e.g. it's spinning CPU somewhere) or your expectations are unreasonable. What I mean by the latter is that reads take 10-20ms and if you're doing a lot of reads (especially if they are serial reads), it's going to take a while to get all that data.
Looking at the numbers above, you look like you're running right at the edge of what a single volume can be expected to do (svctm low and avgqu-sz of 14), and you're likely to experience slow-downs during periods of contention. I would suggest striping your workload over several volumes to increase the number of parallel operations you can have in flight, which should reduce your queue length and increase your overall throughput.