While considering different options for a database server, I decided to do some digging into Amazon Web Services (AWS) as an alternative to dedicated servers from an ISP. I was most curious about the I/O of the Elastic Block Storage (EBS) on the Elastic Compute Cloud (EC2). What I tested was a number of different file systems (EXT3, JFS, XFS, ReiserFS) as single block devices, and then some different software RAID configurations leveraging JFS. The tests were run using Bonnie++.
The configuration was vanilla; no special tuning was done, just the default values assigned by the tools. I used Fedora Core 9 as the OS from the default Amazon AMI and used "yum install" to acquire the necessary utilities (more on that below). I expect that with further tuning, some increases in performance can still be obtained. I used the small instance for cost reasons, which includes "moderate" I/O performance. Running on a large or extra-large standard instance should perform even better with "high" I/O performance. You can get all the instance details from Amazon.
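For reference, installing and running the benchmark might look like this (a minimal sketch, assuming the bonnie++ package is available in your configured yum repositories and that the volume under test is mounted at /vol):

```
# install the benchmark utility
yum install -y bonnie++

# run against the mounted EBS volume; -d is the test directory, -u the user to run as
bonnie++ -d /vol -u root
```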
First I wanted to determine what the EBS devices would compare to in the physical world. I ran Bonnie against a few entry-level boxes provided by a number of ISPs and found the performance roughly matched a locally attached SATA or SCSI drive when formatted with EXT3. I also found that JFS, XFS and ReiserFS performed slightly better than EXT3 in most tests except block writes.
The Numbers
Again, let me reiterate that these numbers may not be accurately reflected in your production environment. Amazon states that small instances have "moderate" I/O availability. Presumably, if you're running this for a production DB, you'll want to consider a large or extra-large instance for the memory, so you should see slightly better performance from your configuration. Also note that the drives I allocated were rather small (to keep testing costs low), so you may experience different results with larger capacities.
Note: The figures in the table below are in KB/sec.
Size (Filesystem) | Output Per Char | Output Block | Output Re-write | Input Per Char | Input Block |
---|---|---|---|---|---|
4×5Gb RAID5 (JFS) | 22,349 | 58,672 | 39,149 | 25,332 | 84,863 |
4×5Gb RAID0 (JFS) | 24,271 | 99,152 | 43,053 | 26,086 | 96,320 |
10Gb (XFS) | 20,944 | 43,897 | 24,386 | 25,029 | 65,710 |
10Gb (ReiserFS) | 22,864 | 57,248 | 17,880 | 21,716 | 44,554 |
10Gb (JFS) | 23,905 | 47,868 | 21,725 | 24,585 | 55,688 |
10Gb (EXT3) | 22,986 | 57,840 | 22,100 | 24,317 | 48,502 |
As expected, RAID 0 does best with read/write speed, and RAID 5 does very well on reads (input block) as well. For InnoDB, the re-write and block read (input)/write (output) operations are the most critical values; higher numbers are better. To better understand what the test is doing, be sure to read the description of each field in the Bonnie++ documentation.
Making Devices
The process for making a device is simple. There are many tutorials on how to make this persistent, and you can certainly build this into your own AMI when you're done; this is not a tutorial on how to do that. To get a volume up and running you'll follow these basic steps (a command-line sketch follows the list):
- Determine what you want to create – capacity, filesystem type etc.
- Allocate EBS storage
- Attach the EBS storage to your EC2 instance
- If using RAID, create the array from the attached volumes
- Format the filesystem
- Create the mount point on the instance filesystem
- Mount the storage
- Add any necessary entries to mount storage at boot time.
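Roughly, the command-line version of those steps looks like this (a minimal sketch, assuming the EC2 API tools are installed, a 10GB volume in us-east-1a, the device name /dev/sdh, a /vol mount point and JFS; substitute your own IDs, zone, size and filesystem):

```
# allocate EBS storage (returns a volume ID such as vol-xxxxxxxx)
ec2-create-volume --size 10 --availability-zone us-east-1a

# attach the volume to your running instance as /dev/sdh
ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdh

# format it, create a mount point and mount it
mkfs.jfs -q /dev/sdh
mkdir -p /vol
mount /dev/sdh /vol

# optionally add an fstab entry so it mounts at boot once the volume is attached
echo "/dev/sdh /vol jfs noatime 0 0" >> /etc/fstab
```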
Single Disk Images
Remember, the speed and efficiency of the single EBS device is roughly comparable to a modern SATA or SCSI drive. Use of a different filesystem (other than EXT3) can increase different aspects of drive performance, just as it would with a physical hard drive. This isn’t a comparison of the pros and cons of different engines, but simply providing my findings during testing.
JFS | yum install jfsutils |
XFS | yum install xfsprogs |
ReiserFS | yum install reiserfs-utils |
I didn't test any other filesystems such as ZFS, because I've read that some of them are unstable on Linux, and since I'll be running production on Linux the extra testing time seemed unnecessary. I am interested in other alternatives that could increase performance, so if you have any to share I'd love to hear about them.
You can quickly get a volume setup with the following:
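A minimal sketch (assuming the volume shows up as /dev/sdh and a mount point of /vol; pick the mkfs that matches the filesystem you installed above):

```
# format with the filesystem of your choice
mkfs.jfs -q /dev/sdh        # JFS
# mkfs.xfs /dev/sdh         # XFS
# mkfs.reiserfs /dev/sdh    # ReiserFS
# mkfs.ext3 /dev/sdh        # EXT3

# create the mount point and mount the volume
mkdir -p /vol
mount /dev/sdh /vol
```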
Next time you mount the volume, you won’t need to use “mkfs” because the drive is already formatted.
RAID
The default AMI already includes support for RAID, but if you need to add it to your yum-enabled system, it's "yum install mdadm". On the Fedora Core 9 test rig I was using, RAID 0, 1, 5 and 6 were supported; YMMV.
To create a 4 disk RAID 0 volume, it’s simply:
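For example (assuming the four volumes are attached as /dev/sdh through /dev/sdk; adjust the device names to match your setup):

```
# create a 4-disk striped (RAID 0) array as /dev/md0, then format and mount /dev/md0 as usual
mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/sdh /dev/sdi /dev/sdj /dev/sdk
```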
To create a 4 disk RAID 5 volume instead, it’s simply:
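And similarly, with the same assumed device names:

```
# same four devices, but one volume's worth of capacity goes to parity
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sdh /dev/sdi /dev/sdj /dev/sdk
```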
This example assumes you have 4 EBS volumes attached to the system. AWS shows 7 possible mount points (/dev/sdf through /dev/sdl) in the web console; however, the documentation states you can use devices through /dev/sdp, which is 11 EBS volumes in addition to the non-persistent storage. That would be a theoretical maximum of 10TB of RAID 5 or 11TB of RAID 0 storage!
Also note: the instance has to re-attach its EBS volumes after a reboot, so you need to reassemble the RAID array once the volumes are attached again; after the attach commands, add something like:
mdadm --assemble --verbose /dev/md0 /dev/sdh /dev/sdi /dev/sdj
For details, see mdadm --help or mdadm --assemble --help.
Note: when creating a volume from a snapshot, the size must match the original volume exactly; otherwise mdadm will not find the superblock when assembling the array:
Checking whether a superblock exists:
mdadm: looking for devices for /dev/md1
mdadm: no RAID superblock on /dev/sdi
mdadm: /dev/sdi has no superblock - assembly aborted
You can inspect an individual device or the assembled array with:
mdadm --examine /dev/sdi
mdadm --detail /dev/md0
[root@domU-12-31-39-00-74-01 /]# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Wed Aug 12 11:31:45 2009
     Raid Level : raid5
     Array Size : 220200768 (210.00 GiB 225.49 GB)
  Used Dev Size : 73400256 (70.00 GiB 75.16 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1
    Persistence : Superblock is persistent
    Update Time : Wed Aug 12 13:05:45 2009
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1
         Layout : left-symmetric
     Chunk Size : 64K
 Rebuild Status : 98% complete
           UUID : f640874b:6fbc1d8d:c629de9f:223f899b
         Events : 0.10

    Number   Major   Minor   RaidDevice State
       0       8      176        0      active sync   /dev/sdl
       1       8      192        1      active sync   /dev/sdm
       2       8      208        2      active sync   /dev/sdn
       4       8      224        3      spare rebuilding   /dev/sdo
[root@domU-12-31-39-00-74-01 /]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 sdo[3] sdn[2] sdm[1] sdl[0]
220200768 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
unused devices: <none>
Checking in on things…
- cat /proc/mdstat is a great way to check in on the RAID volume. If you run it directly after creating a mirrored or striped array, you'll also be able to see the scrubbing process and how far along it is.
- mount -l shows the currently mounted devices and any options specified.
- df (disk free) provides a nice list of device mounts and their total, available and used space.
Conclusion
It's clear from the numbers that software RAID offers a clear performance advantage over a single EBS volume. Since with EBS you pay per GB, not per disk, it's certainly cost effective to create a robust RAID volume. The question that remains is how careful you need to be with your data. RAID 0 offered blisteringly fast performance but, like a traditional array, without redundancy. You can always set it up as RAID 5, RAID 6 or RAID 10, but this of course sacrifices more raw space to handle the redundancy.
Since the volumes on EBS are theoretically invincible, it may be okay to run unprotected by a mirror or parity drive; however, I haven't found anyone who would recommend this in production. If anyone knows of a good reason to ignore the safety of RAID 10, RAID 6 or RAID 5, I'd love to hear the reasoning.
I am also curious whether these drives maintain a consistent throughput over the full capacity of the disk, or whether they slow down as the drive fills like a traditional drive. I did not test this; it remains open for another test (and subsequent blog post). Should anyone run this against a 100GB+ drive and figure it out, please let me know.
Fine Print – The Costs
Storage starts at $0.10/GB-month, which is reasonable and is prorated for only the time you use it. A 1TB RAID 0 volume made of 10×100GB volumes would only cost $1,200 per year. Good luck getting performance/dollar for 1TB like that from any SAN solution at a typical ISP. There are, however, some hidden costs in the I/O that you'll need to pay attention to. Each time you read or write a block to disk, there's an incremental cost. The pricing is $0.10 per million I/O requests, which seems cheap, but just running the simple tests I ran with Bonnie++ consumed almost 2 million requests in less than 3 hours of instance time. If you have a high number of reads or writes, which you likely do if you're considering reading this, you'll need to factor these costs in.
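For reference, the back-of-the-envelope math using the prices quoted above (the request count is the rough figure from my Bonnie++ runs):

```
# storage: 10 x 100GB = 1,000GB
#   1,000GB x $0.10/GB-month = $100/month, roughly $1,200/year
# I/O: 2,000,000 requests x ($0.10 / 1,000,000) = $0.20
```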
The total AWS cost for running these tests was $0.71 of which $0.19 were storage related. The balance was the machine instances and bandwidth.
Resources
A sample of extended iostat output (iostat -x) for reference; the fields are explained below:

Device:  rrqm/s  wrqm/s     r/s    w/s   rsec/s  wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sdb        0.00    0.00    0.00   0.00     0.00    0.00      0.00      0.00   0.00   0.00   0.00
sdc        0.00    0.00    0.00   0.00     0.00    0.00      0.00      0.00   0.00   0.00   0.00
sda1       0.00    0.00    1.00   0.00     8.00    0.00      8.00      0.00   2.00   2.00   0.20
sdg        1.00    0.00  160.00  13.00  1608.00   74.00      9.72      2.52  14.81   5.62  97.30
rrqm/s and wrqm/s: read (write) requests merged per second. These numbers indicate the number of requests destined for the device that were able to be merged before submission to the device. Requests can be merged if they are contiguous. These numbers are not super relevant to diagnosing performance issues.
r/s and w/s: reads (writes) per second. This is the number of reads and writes completed (not submitted) per second during the reporting period. Looking at these numbers tells you the rate at which EBS is servicing your I/O requests, but if they drop, it doesn't really tell you much. They could drop because EBS is having problems, or they could drop because your application is submitting few requests.
rsec/s and wsec/s: read (written) sectors per second. This is the number of 512 byte sectors read or written per second during the reporting period. Dividing this number by reads/writes per second gives you the average request size.
avgrq-sz: average request size. This number (combined for reads and writes), or the equivalent number for reads and writes computed as described above gives you an idea of how random your I/O is. In general if this number is below 16 (16 * 512 bytes = 8KB), you are doing extremely random I/O. The max you should ever see is 256, as 128KB is the maximum I/O request size by default for Linux. If this number is low (<50), you are going to be IOPS limited. If it's high (>100), you are likely to be bandwidth limited.
avgqu-sz: average queue size. This indicates how many requests are queued waiting to be serviced. The maximum number this can be is found in /sys/block/
await: average wait. The average amount of time the requests that were completed during this period waited from when they entered the queue to when they were serviced. This number is a combination of the queue length and the average service time. It's usually more revealing to look at them separately.
svctm: service time. The average amount of time the requests that were completed during this period waited from when they were submitted to the device to when they were serviced.
util: device utilization. I believe that it's the percentage of the reporting period in which the queue was not empty. I rarely use this number for diagnostic purposes, as the other numbers tell the story.
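As a rough worked example against the sdg row in the sample output above (the 5-second interval is just an arbitrary choice for watching the numbers over time):

```
# collect extended stats every 5 seconds
iostat -x 5

# sdg: rsec/s = 1608, r/s = 160
#   1608 / 160 ~= 10 sectors per read, i.e. 10 x 512 bytes ~= 5KB per request
# iostat reports avgrq-sz of 9.72 for the combined read/write figure, which is
# below 16 (8KB), so this is a very random, IOPS-bound workload
```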
So where does that leave us? Whenever you reduce the number of TPS in your application, you're (almost certainly) going to reduce the number of IOPS in your EBS volume. The question always is, "which is cause and which is effect?" Is the application slowing down because the EBS volume is slowing down, or is the EBS volume doing less work because the application is presenting less work for it to do?
To figure out the answer, you have to pick apart the numbers a little. To first order, the most important number is the svctm. In general, this number should be below 100ms, and it's usually much below. For read-dominated work loads, I would expect to usually see this number in the 10-20ms range and for write-dominated work loads, it could be as low as single-digits. However, EBS is a shared resource; under periods of high load as I wrote above, it could be operating normally with service times in the 100ms range.
If the svctm looks good but your application still is running slower than you expect, the other numbers can help diagnose why. If avgqu-sz gets big (>30), your application is submitting more requests per second than the volume can handle. The solution here is to stripe across multiple EBS volumes using LVM or RAID-0.
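If you go the LVM route, a minimal sketch might look like this (assuming four EBS volumes attached as /dev/sdh through /dev/sdk, a 64KB stripe size, and hypothetical volume group and logical volume names; adjust to taste):

```
# initialize the volumes for LVM and group them
pvcreate /dev/sdh /dev/sdi /dev/sdj /dev/sdk
vgcreate ebsvg /dev/sdh /dev/sdi /dev/sdj /dev/sdk

# -i 4 stripes across all four physical volumes, -I 64 sets a 64KB stripe size
lvcreate -i 4 -I 64 -l 100%FREE -n ebslv ebsvg

# format and mount like any other block device
mkfs.jfs -q /dev/ebsvg/ebslv
```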
If svctm looks good and avgqu-sz is low, either there's something else wrong with your application (e.g. it's spinning CPU somewhere) or your expectations are unreasonable. What I mean by the latter is that reads take 10-20ms and if you're doing a lot of reads (especially if they are serial reads), it's going to take a while to get all that data.
Looking at the numbers above, you look like you're running right at the edge of what a single volume can be expected to do (svctm low and avgqu-sz of 14), and you're likely to experience slow-downs during periods of contention. I would suggest striping your workload over several volumes to increase the number of parallel operations you can have in flight, which should reduce your queue length and increase your overall throughput.