Category: System Operations

2009-08-07 09:45:41

A note up front: on EC2 I recommend building RAID 0 rather than RAID 5, because RAID 5 gives up one disk's worth of capacity to parity, and the parity I/O becomes a bottleneck for writes across the whole array.

February 27th, 2009 by Erik

While considering different options for a database server, I decided to do some digging into Amazon Web Services (AWS) as an alternative to dedicated servers from an ISP. I was most curious about the I/O of Elastic Block Storage (EBS) on the Elastic Compute Cloud (EC2). What I tested was a number of different file systems (EXT3, JFS, XFS, ReiserFS) as single block devices, and then some different software RAID configurations leveraging JFS. The tests were run using Bonnie++.

The configuration was vanilla; no special tuning was done, just the default values assigned by the tools. I used Fedora Core 9 as the OS from the default Amazon AMI and used "yum install" to acquire the necessary utilities (more on that below). I expect that with further tuning, some additional performance can still be obtained. I used the small instance for cost reasons, which includes "moderate" I/O performance. Running on a large or extra-large standard instance, with "high" I/O performance, should do even better. You can get all the instance details from Amazon.
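For reference, a minimal Bonnie++ invocation against a mounted volume looks something like the following (a sketch; the post does not record the exact flags used, but -d points at the test directory and -u sets the user to run as when started as root):

 SH
bonnie++ -d /vol1 -u root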

First I wanted to determine what the EBS devices would compare to in the physical world. I ran Bonnie against a few entry-level boxes provided by a number of ISPs and found the performance roughly matched a locally attached SATA or SCSI drive when formatted with EXT3. I also found that JFS, XFS and ReiserFS performed slightly better than EXT3 in most tests except block writes.

The Numbers

Again, let me reiterate that some numbers may not be accurately reflected in your production environment. Amazon states that small instances have "moderate" I/O availability. Presumably, if you're running this for a production DB you'll want to consider a large or extra-large instance for the memory, and so you should see slightly better performance from your configuration. Also note that the drives I allocated were rather small (to keep testing costs low), so you may experience different results with larger capacities.

Note: the values in the table below are in KB, not bytes as the original chart was titled.

Bonnie Disk Performance on EC2

Size (Filesystem)  | Output Per Char | Output Block | Output Re-write | Input Per Char | Input Block
4×5Gb RAID5 (JFS)  | 22,349          | 58,672       | 39,149          | 25,332         | 84,863
4×5Gb RAID0 (JFS)  | 24,271          | 99,152       | 43,053          | 26,086         | 96,320
10Gb (XFS)         | 20,944          | 43,897       | 24,386          | 25,029         | 65,710
10Gb (ReiserFS)    | 22,864          | 57,248       | 17,880          | 21,716         | 44,554
10Gb (JFS)         | 23,905          | 47,868       | 21,725          | 24,585         | 55,688
10Gb (EXT3)        | 22,986          | 57,840       | 22,100          | 24,317         | 48,502

As expected, RAID 0 does best on read/write speed, and RAID 5 does very well on reads (input block) as well. For InnoDB, the re-write and block read (input)/write (output) operations are the most critical values; higher numbers are better. To better understand what the test is doing, be sure to read the description of each field in the Bonnie documentation.

Making Devices

The process for making a device is simple. There are many tutorials on how to make this persistent, and you can certainly build this into your own AMI when you're done – this is not a tutorial on how to do that. To get a volume up and running you'll follow these basic steps:

  1. Determine what you want to create – capacity, filesystem type, etc.
  2. Allocate the EBS storage.
  3. Attach the EBS storage to your EC2 instance (steps 2 and 3 are sketched below).
  4. If using RAID, create the volume.
  5. Format the filesystem.
  6. Create the mount point on the instance filesystem.
  7. Mount the storage.
  8. Add any necessary entries to mount the storage at boot time.
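As a rough sketch of steps 2 and 3 using the EC2 API tools of that era (the volume ID, instance ID, size and availability zone below are placeholders, not values from the original post):

 SH
# allocate a 10GB EBS volume in the instance's availability zone
ec2-create-volume --size 10 --availability-zone us-east-1a
# attach the returned volume to the instance as /dev/sdf
ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdf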

Single Disk Images

Remember, the speed and efficiency of a single EBS device is roughly comparable to a modern SATA or SCSI drive. Use of a different filesystem (other than EXT3) can increase different aspects of drive performance, just as it would with a physical hard drive. This isn't a comparison of the pros and cons of the different filesystems; I'm simply providing my findings from testing.

JFS: yum install jfsutils
XFS: yum install xfsprogs
ReiserFS: yum install reiserfs-utils

I didn't test any other filesystems such as ZFS, because I've read that some other filesystems are unstable on Linux, and since I'll be running production on Linux the extra time for those tests seemed unnecessary. I am interested in other alternatives that could increase performance, so if you have any to share I'd love to hear about them.

You can quickly get a volume set up with the following:

 SH
# format the attached EBS device with an ext3 filesystem
mkfs -t ext3 /dev/sdf
# create a mount point and mount the volume
mkdir /vol1
mount /dev/sdf /vol1

Next time you mount the volume, you won’t need to use “mkfs” because the drive is already formatted.
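To cover step 8 above, here is a minimal sketch of a boot-time mount entry (assuming the device is re-attached as /dev/sdf before the mount runs):

 SH
# append an /etc/fstab entry so /vol1 mounts automatically at boot
echo "/dev/sdf  /vol1  ext3  defaults  0 0" >> /etc/fstab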

RAID

The default AMI already includes support for RAID, but if you need to add it to your yum-enabled system, it's "yum install mdadm". On the Fedora Core 9 test rig I was using, RAID 0, 1, 5 and 6 were supported; YMMV.

To create a 4 disk RAID 0 volume, it’s simply:

 SH
# build a striped (RAID 0) array across 4 attached EBS devices
mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/sdf /dev/sdg /dev/sdh /dev/sdi
# format and mount the array
mkfs -t ext3 /dev/md0
mkdir /raid
mount /dev/md0 /raid

To create a 4 disk RAID 5 volume instead, it’s simply:

 SH
# build a RAID 5 array (striping with distributed parity) across the same 4 devices
mdadm --create --verbose /dev/md0 --level=5 --raid-devices=4 /dev/sdf /dev/sdg /dev/sdh /dev/sdi
# format and mount the array
mkfs -t ext3 /dev/md0
mkdir /raid
mount /dev/md0 /raid

This example assumes you have 4 EBS volumes attached to the system. AWS shows 7 possible mount points (/dev/sdf – /dev/sdl) in the web console; however, the documentation states you can use devices through /dev/sdp, which is 11 EBS volumes in addition to the non-persistent storage. At the 1TB maximum volume size, that's a theoretical maximum of 10TB of RAID 5 or 11TB of RAID 0 storage!


Another note: because an instance has to re-attach its volumes after a reboot, you need to re-assemble the RAID array once the volumes are attached; that is, after the commands that attach the volumes, add something like:

mdadm --assemble --verbose /dev/md0   /dev/sdh /dev/sdi /dev/sdj

For the details, see mdadm --help or mdadm --assemble --help.
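A common way to make re-assembly simpler (my addition, not from the original posts) is to record the array definition in /etc/mdadm.conf so you don't have to list every member device:

 SH
# capture the running array's definition for later re-assembly
mdadm --detail --scan >> /etc/mdadm.conf
# after a reboot (once the volumes are re-attached), assemble from the config
mdadm --assemble --scan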

Note: when creating a volume from a snapshot, the capacity must be exactly the same as the original volume; otherwise mdadm will not find the RAID superblock when assembling the array:

mdadm: looking for devices for /dev/md1
mdadm: no RAID superblock on /dev/sdi
mdadm: /dev/sdi has no superblock - assembly aborted

To check whether a superblock exists:

mdadm --examine /dev/sdi

mdadm --detail /dev/md0

[root@domU-12-31-39-00-74-01 /]# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Wed Aug 12 11:31:45 2009
     Raid Level : raid5
     Array Size : 220200768 (210.00 GiB 225.49 GB)
  Used Dev Size : 73400256 (70.00 GiB 75.16 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Wed Aug 12 13:05:45 2009
          State : clean, degraded, recovering
 Active Devices : 3
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 98% complete

           UUID : f640874b:6fbc1d8d:c629de9f:223f899b
         Events : 0.10

    Number   Major   Minor   RaidDevice State
       0       8      176        0      active sync   /dev/sdl
       1       8      192        1      active sync   /dev/sdm
       2       8      208        2      active sync   /dev/sdn
       4       8      224        3      spare rebuilding   /dev/sdo
[root@domU-12-31-39-00-74-01 /]#

Also, while the array is resyncing:

[root@domU-12-31-39-00-74-01 /]# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md1 : active raid5 sdo[4] sdn[2] sdm[1] sdl[0]
      220200768 blocks level 5, 64k chunk, algorithm 2 [4/3] [UUU_]
      [===================>.]  recovery = 97.2% (71405740/73400256) finish=3.1min speed=10627K/sec
      
unused devices: <none>

After the resync completes:

[root@domU-12-31-39-00-74-01 /]# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md1 : active raid5 sdo[3] sdn[2] sdm[1] sdl[0]
      220200768 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]
      
unused devices: <none>

Checking in on things…

  • cat /proc/mdstat is a great way to check in on the RAID volume. If you run it directly after creating a mirroring or striping array, you'll also be able to see the scrubbing process and how far along it is.
  • mount -l shows the currently mounted devices and any options specified.
  • df (disk free) provides a nice list of device mounts and their total, available and used space.

Conclusion

It's clear from the numbers that software RAID offers a clear performance advantage over a single EBS volume. Since with EBS you pay per GB, not per disk, it's certainly cost-effective to create a robust RAID volume. The question that remains is how careful you need to be with your data: RAID 0 offered blistering fast performance but, like a traditional array, no redundancy. You can always set it up as RAID 5, RAID 6 or RAID 10, but this of course sacrifices more raw space to redundancy.

Since the volumes on EBS are theoretically invincible, it may be okay to run unprotected by a mirror or parity drive; however, I haven't found anyone who would recommend this in production. If anyone knows of a good reason to ignore the safety of RAID 10 or RAID 6 or RAID 5, I'd love to hear the reasoning.

I am also curious whether these drives maintain a consistent throughput over the full capacity of the disk, or whether they slow down as the drive fills like a traditional drive. I did not test this; it remains open for another test (and subsequent blog post). Should anyone run Bonnie against a 100Gb+ drive and figure that out, please let me know.
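One crude way to probe this (my sketch, not from the post) is to read from the raw device at different offsets with dd and compare the reported throughput:

 SH
# read 1GB from the start of the device
dd if=/dev/sdf of=/dev/null bs=1M count=1024
# read 1GB starting about 90GB in (assumes a 100GB volume; skip counts 1MB blocks)
dd if=/dev/sdf of=/dev/null bs=1M count=1024 skip=92160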

Fine Print – The Costs

Storage starts at a reasonable $0.10/GB-month and is prorated for only the time you use it. A 1Tb RAID 0 volume made of 10×100Gb volumes would only cost $1,200 per year. Good luck getting performance/dollar for 1Tb like that from any SAN solution at a typical ISP. There are, however, some hidden costs in the I/O that you'll need to pay attention to: each time you read or write a block to disk, there's an incremental cost. The pricing is $0.10 per million I/O requests, which seems cheap, but just running the simple tests I ran with Bonnie++ consumed almost 2 million requests in less than 3 hours of instance time. If you have a high number of reads or writes, which you likely do if you're considering this, you'll need to factor those costs in.
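To spell the arithmetic out: 1,000GB at $0.10/GB-month is $100 a month, or $1,200 a year, and the benchmark's roughly 2 million I/O requests at $0.10 per million come to about $0.20.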

The total AWS cost for running these tests was $0.71 of which $0.19 were storage related. The balance was the machine instances and bandwidth.



On snapshotting a RAID array: you need to snapshot each member EBS volume individually. If something like a database is running on the array, you need to make sure the EBS volumes are captured in a consistent state. For the details, google: ebs raid snapshot
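One sketch of how to get a consistent set of snapshots (my addition; it assumes an XFS filesystem, since xfs_freeze can quiesce writes, and the volume IDs are placeholders):

 SH
# quiesce the filesystem so every member volume sees the same state
xfs_freeze -f /raid
# snapshot each EBS volume in the array (placeholder IDs)
ec2-create-snapshot vol-aaaaaaaa
ec2-create-snapshot vol-bbbbbbbb
ec2-create-snapshot vol-cccccccc
ec2-create-snapshot vol-dddddddd
# thaw the filesystem and resume writes
xfs_freeze -u /raid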





On the problem of high iowait, I strongly recommend the following post:

The iostat trace that you showed me doesn't show any significant problems with EBS.  Here's some explanation of the numbers:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00   0.00   0.00
sda1              0.00     0.00    1.00    0.00     8.00     0.00     8.00     0.00    2.00   2.00   0.20
sdg               1.00     0.00  160.00   13.00  1608.00    74.00     9.72     2.52   14.81   5.62  97.30
From the left, the columns are as follows:

rrqm/s and wrqm/s: read (write) requests merged per second.  These numbers indicate the number of requests destined for the device that were able to be merged before submission to the device.  Requests can be merged if they are contiguous.  These numbers are not super relevant to diagnosing performance issues. 

r/s and w/s: reads (writes) per second.  This is the number of reads and writes completed (not submitted) per second during the reporting period.  Looking at these numbers tells you the rate at which EBS is servicing your I/O requests, but if they drop, it doesn't really tell you much.  They could drop because EBS is having problems, or they could drop because your application is submitting few requests.

rsec/s and wsec/s: read (written) sectors per second.  This is the number of 512 byte sectors read or written per second during the reporting period.  Dividing this number by reads/writes per second gives you the average request size. 
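For example, on the sdg line above: (1608.00 + 74.00) sectors/s divided by (160.00 + 13.00) requests/s is about 9.72 sectors, i.e. roughly 4.9KB per request, which matches the avgrq-sz column.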

avgrq-sz: average request size.  This number (combined for reads and writes), or the equivalent number for reads and writes computed as described above gives you an idea of how random your I/O is.  In general if this number is below 16 (16 * 512 bytes = 8KB), you are doing extremely random I/O.  The max you should ever see is 256, as 128KB is the maximum I/O request size by default for Linux.  If this number is low (<50), you are going to be IOPS limited.  If it's high (>100), you are likely to be bandwidth limited. 

avgqu-sz: average queue size.  This indicates how many requests are queued waiting to be serviced.  The maximum number this can be is found in /sys/block/<device>/queue/nr_requests.  By default, the max is 128.  If you are seeing numbers approaching this level, it means that your application is making requests faster than EBS can service them.  If it's low, EBS is keeping up with the incoming requests.  This isn't the whole story, though.
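For instance, to read that limit for the sdg device from the trace above:

 SH
cat /sys/block/sdg/queue/nr_requests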

await: average wait.  The average amount of time the requests that were completed during this period waited from when they entered the queue to when they were serviced.  This number is a combination of the queue length and the average service time.  It's usually more revealing to look at them separately. 

svctm: service time.  The average amount of time the requests that were completed during this period waited from when they were submitted to the device to when they were serviced. 

%util: device utilization.  I believe that it's the percentage of the reporting period in which the queue was not empty.  I rarely use this number for diagnostic purposes, as the other numbers tell the story.

So where does that leave us?  Whenever you reduce the number of TPS in your application, you're (almost certainly) going to reduce the number of IOPS in your EBS volume.  The question always is, "which is cause and which is effect?"  Is the application slowing down because the EBS volume is slowing down, or is the EBS volume doing less work because the application is presenting less work for it to do? 

To figure out the answer, you have to pick apart the numbers a little.  To first order, the most important number is the svctm.  In general, this number should be below 100ms, and it's usually much below.  For read-dominated work loads, I would expect to usually see this number in the 10-20ms range and for write-dominated work loads, it could be as low as single-digits.  However, EBS is a shared resource; under periods of high load as I wrote above, it could be operating normally with service times in the 100ms range. 

If the svctm looks good but your application still is running slower than you expect, the other numbers can help diagnose why.  If avgqu-sz gets big (>30), your application is submitting more requests per second than the volume can handle.  The solution here is to stripe across multiple EBS volumes using LVM or RAID-0. 
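For the LVM route, a minimal striping sketch (my addition; the device names, device count and stripe size are placeholders):

 SH
# register the EBS devices as LVM physical volumes and group them
pvcreate /dev/sdf /dev/sdg /dev/sdh /dev/sdi
vgcreate ebs_vg /dev/sdf /dev/sdg /dev/sdh /dev/sdi
# carve a logical volume striped across all 4 devices with a 64KB stripe size
lvcreate --stripes 4 --stripesize 64 --extents 100%FREE --name ebs_lv ebs_vg
mkfs -t ext3 /dev/ebs_vg/ebs_lv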

If svctm looks good and avgqu-sz is low, either there's something else wrong with your application (e.g. it's spinning CPU somewhere) or your expectations are unreasonable.  What I mean by the latter is that reads take 10-20ms and if you're doing a lot of reads (especially if they are serial reads), it's going to take a while to get all that data. 

Looking at the numbers above, it looks like you're running right at the edge of what a single volume can be expected to do (svctm low and avgqu-sz of 14), and you're likely to experience slow-downs during periods of contention.  I would suggest striping your workload over several volumes to increase the number of parallel operations you can have in flight, which should reduce your queue length and increase your overall throughput.









Chinaunix user, 2009-10-12 11:53:05

Thanks a lot; this article made things much clearer. Brother, do you work in Beijing? My QQ is 573966; I hope we can talk more.

Chinaunix user, 2009-10-10 17:53:18

Thanks for the reply, brother. I still have a question: wouldn't that involve comparing the current filesystem state against the previous snapshot? Also, do you know what storage format the snapshots are kept in?

Chinaunix user, 2009-10-09 17:54:29

Brother, you've clearly studied Amazon's systems in depth. Do you know how EBS incremental snapshots are implemented?