2013-07-26 15:19:09
Recently, I built a FreeBSD server with a ZFS file system using six 2TB hard disks. If you would like to learn what I did or understand more about ZFS, you can read the story here: .
Many people run into the same problem with their ZFS systems: the speed is slow! Reading and writing files takes far longer than it should. In this article, I am going to show you some tips for improving the speed of your ZFS file system.
Traditionally, we are told that a file/data server does not need a powerful computer. That is not true for ZFS. ZFS is more than a file system: it uses a lot of resources to improve input/output performance, such as compressing data on the fly. For example, suppose you need to write a 1GB file. Without compression, the system writes the entire 1GB to disk. With compression enabled, the CPU compresses the data first and then writes it to disk. Since the compressed file is smaller, it takes less time to write, which results in a higher write speed. The same idea applies to reading: ZFS can cache files in memory, which results in a higher read speed.
That's why a 64-bit CPU and a larger amount of memory are recommended. I recommend at least a quad-core CPU with 4GB of memory (I personally use an i7 920 @ 2.67GHz with 10GB).
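A quick way to check what hardware FreeBSD sees (these are standard sysctl OIDs, so this should work on any FreeBSD box):

#CPU model, number of cores, and installed physical memory (in bytes)
sysctl hw.model hw.ncpu hw.physmem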
Please make sure that the RAM modules run at the same frequency/speed. If you have modules with different speeds, try to group the modules with the same speed in the same channel, e.g., Channel 1 and Channel 2: 1000 MHz, Channel 3 and Channel 4: 800 MHz.
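If you are not sure what speed each module runs at, dmidecode can tell you without opening the case. This assumes you have installed dmidecode from ports (sysutils/dmidecode); the grep pattern is just a convenience to trim the output:

#Show the slot (locator) and speed of each installed memory module
sudo dmidecode --type memory | grep -E "Locator|Speed"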
Let's do a test. Suppose I am going to create a 10GB file filled with zeros. Let's see how long it takes to write it to disk:
#CPU: i7 920 (8 cores) + 8GB Memory + FreeBSD 8.2 64-bit
#time dd if=/dev/zero of=./file.out bs=1M count=10k
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 6.138918 secs (1749073364 bytes/sec)

real    0m6.140s
user    0m0.023s
sys     0m3.658s
That's 1.6GB/s! Why is it so fast? Because it is an all-zero file: after compression, a 10GB file of zeros may shrink to just a few bytes. Since compression performance depends heavily on the CPU, that's why a fast CPU matters.
Now, let’s do the same thing on a not-so-fast CPU:
#CPU: AMD 4600 (2 cores) + 5GB Memory + FreeBSD 8.2 64-bit
#time dd if=/dev/zero of=./file.out bs=1M count=10k
10240+0 records in
10240+0 records out
10737418240 bytes transferred in 23.672373 secs (453584362 bytes/sec)

real    0m23.675s
user    0m0.091s
sys     0m22.409s
That's only about 433MB/s. See the difference?
Many people complain about ZFS stability issues, such as kernel panics, random reboots, or crashes when copying large files (> 2GB) at full speed. These usually have something to do with the boot loader settings. By default, ZFS will not work smoothly without tweaking the system parameters. Even though FreeBSD claims that no tweaking is necessary on a 64-bit system, my FreeBSD server crashed very often when writing large files to the pool. After many rounds of trial and error, I figured out a few formulas. You can tweak your boot loader (/boot/loader.conf) using the following parameters. Note that I only tested the following on FreeBSD; please let me know whether these tweaks work on other operating systems.
Warning: Make sure that you save a copy before doing anything to the boot loader. Also, if you experience anything unusual, please remove your changes and go back to the original settings.
#Assuming 8GB of memory
#If RAM = 4GB, set the value to 512M
#If RAM = 8GB, set the value to 1024M
vfs.zfs.arc_min="1024M"

#RAM x 0.5 - 512 MB
vfs.zfs.arc_max="3584M"

#RAM x 2
vm.kmem_size_max="16G"

#RAM x 1.5
vm.kmem_size="12G"

#The following were copied from the FreeBSD ZFS Tuning Guide

# Disable ZFS prefetching
# Increases overall speed of ZFS, but when disk flushing/writes occur,
# the system is less responsive (due to extreme disk I/O).
# NOTE: Systems with 4 GB of RAM or more have prefetch enabled by default.
vfs.zfs.prefetch_disable="1"

# Decrease ZFS txg timeout value from 30 (default) to 5 seconds. This
# should increase throughput and decrease the "bursty" stalls that
# happen during immense I/O with ZFS.
# Default in FreeBSD since ZFS v28.
vfs.zfs.txg.timeout="5"

# Increase number of vnodes; we've seen vfs.numvnodes reach 115,000
# at times. Default max is a little over 200,000. Playing it safe...
# If numvnodes reaches maxvnodes, performance substantially decreases.
kern.maxvnodes=250000

# Set TXG write limit to a lower threshold. This helps "level out"
# the throughput rate (see "zpool iostat"). A value of 256MB works well
# for systems with 4 GB of RAM, while 1 GB works well for us w/ 8 GB on
# disks which have 64 MB cache.
# NOTE: in v27 or below, this tunable is called 'vfs.zfs.txg.write_limit_override'.
vfs.zfs.write_limit_override=1073741824
Don't forget to reboot your system after making any changes. After switching to the new settings, the write speed improved from 60MB/s to 80MB/s, and it sometimes even goes above 110MB/s. That's at least a 33% improvement!
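After the reboot, you can verify which values the kernel actually picked up by querying the corresponding sysctls (a quick sanity check on a FreeBSD box with the ZFS module loaded; the last OID shows the current ARC size):

#ARC limits currently in effect
sysctl vfs.zfs.arc_min vfs.zfs.arc_max
#Kernel memory settings
sysctl vm.kmem_size vm.kmem_size_max
#Current ARC size in bytes
sysctl kstat.zfs.misc.arcstats.size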
By the way, if you find that the system still crashes often, the problem could be an unclean file system.
After a system crash, some file links may be broken (e.g., the system sees the file entry but is unable to locate the file). Usually FreeBSD will automatically run fsck after the crash; however, it will not fix the problem for you. In fact, there is no way to clean up the file system while the system is running (because the partition is mounted). The only way to clean up the file system is to enter single user mode (a reboot is required).
After you enter single user mode, make sure that each partition is clean. For example, here is my df output:
Filesystem     Size    Used   Avail  Capacity  Mounted on
/dev/ad8s1a    989M    418M    491M     46%    /
devfs          1.0k    1.0k      0B    100%    /dev
/dev/ad8s1e    989M     23M    887M      3%    /tmp
/dev/ad8s1f    159G     11G    134G      8%    /usr
/dev/ad8s1d     15G    1.9G     12G     13%    /var
Try running the following commands:
fsck -y -f /dev/ad8s1a
fsck -y -f /dev/ad8s1d
fsck -y -f /dev/ad8s1e
fsck -y -f /dev/ad8s1f
These commands will clean up the affected file systems. The -f flag forces the check, and -y answers yes to all repair prompts.
After the clean-up is done, type reboot and let the system boot into normal mode.
A lot of people do not realize the importance of using identical hardware. Mixing disks of different models/manufacturers can carry a performance penalty. For example, if you mix a slower disk (e.g., 5900 rpm) and a faster disk (7200 rpm) in the same virtual device (vdev), the overall speed will be limited by the slowest disk. Different hard drives may also have different sector sizes. For example, Western Digital released drives with 4k sectors, while most drives use 512-byte sectors. Mixing hard drives with different sector sizes can carry a performance penalty too. Here is a quick way to check the model of your hard drives:
sudo dmesg | grep ad
In my case, I have the following:
ad10: 1907729MB Hitachi HDS722020ALA330 JKAOA20N at ata5-master UDMA100 SATA 3Gb/s
ad11: 1907729MB Seagate ST32000542AS CC34 at ata5-slave UDMA100 SATA 3Gb/s
ad12: 1907729MB WDC WD20EARS-00MVWB0 51.0AB51 at ata6-master UDMA100 SATA 3Gb/s
ad13: 1907729MB Hitachi HDS5C3020ALA632 ML6OA580 at ata6-slave UDMA100 SATA 3Gb/s
ad14: 1907729MB WDC WD20EARS-00MVWB0 50.0AB50 at ata7-master UDMA100 SATA 3Gb/s
ad16: 1907729MB WDC WD20EARS-00MVWB0 51.0AB51 at ata8-master UDMA100 SATA 3Gb/s
Notice that my Western Digital hard drives with 4k sectors all have model numbers ending in EARS.
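You can also ask FreeBSD directly what sector size a drive reports, using diskinfo (ad12 below is just one of the devices from my listing above). Be aware that many early 4k drives, the EARS series included, still report a 512-byte logical sector size here, which is why checking the model number remains the more reliable test:

#Print detailed information about the drive, including the reported sector size
diskinfo -v /dev/ad12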
If you don’t have enough budget to replace all disks with the same specifications, try to group the disks with similar specifications in the same vdev.
ZFS supports compressing the data on the fly. This is a nice feature that improves I/O speed, but only if you have a fast CPU (such as a quad core or better). If your CPU is not fast enough, I don't recommend turning on compression, because the benefit from reducing the file size is smaller than the time spent on the CPU calculation. The compression algorithm also plays an important role here. ZFS supports two compression algorithms, LZJB and GZIP. I personally use LZJB because it gives a better balance between compression ratio and performance. You can also use GZIP and specify your own compression level (i.e., gzip-N). FYI, I tried gzip-9 (the maximum compression level available) and found that the overall performance got worse on my i7 with 8GB of memory.
There is no single right answer here because it all depends on what files you store. Different kinds of files (large files, small files, already-compressed files such as Xvid movies) call for different compression settings.
If you cannot decide, just go with lzjb. You can't go wrong with it:
sudo zfs set compression=lzjb mypool
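Once compression has been on for a while, you can check how much it is actually saving you via the compressratio property (mypool is a placeholder for your own pool or dataset name):

#Show the current algorithm and the achieved compression ratio
zfs get compression,compressratio mypool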
Updated: Use lz4 if your ZFS supports it. It performs better than lzjb. It is available via ZFS pool feature flags. As of June 17, 2013, it is available in FreeBSD 8.4, but not in FreeBSD 9.1.
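A quick way to check whether your system and pool support lz4 (this assumes a pool that has already been upgraded to feature flags; mypool is a placeholder):

#List the features supported by the installed ZFS version (look for lz4_compress)
zpool upgrade -v
#Check whether the pool itself has the lz4_compress feature
zpool get all mypool | grep lz4
#If it does, switch the compression algorithm
sudo zfs set compression=lz4 mypool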
By default, ZFS enables a lot of settings for data integrity, such as checksums. If you don't care about the additional data security, you can disable some of them. You can use the following command to view your ZFS settings.
sudo zfs get all
FYI, I usually disable the following:
#Depending on how important your data is:
#Very important: Leave the checksum option on.
#Less important: Set your pool type to RAIDZ. The parity should give you basic data protection. You can turn off the checksum feature.
#Not important: Just use the default pool type (i.e., striping) and turn off the checksum.
sudo zfs set checksum=off myzpool

#I don't need ZFS to update the access time when reading a file
sudo zfs set atime=off myzpool
Other suggestions:
sudo zfs set primarycache=metadata myzpool
sudo zfs set recordsize=16k myzpool
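To confirm that the new settings are in effect, you can query just those properties instead of wading through the full zfs get all output (myzpool is a placeholder):

#Verify the properties changed above
zfs get checksum,atime,primarycache,recordsize myzpool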
For more info, go to:
man zfs
By default, ZFS will not update the file system itself even if a newer version is available on the system. For example, I created a ZFS file system on FreeBSD 8.1 with ZFS version 14. After upgrading to FreeBSD 8.2 (which supports ZFS version 15), my ZFS file system was still on version 14. I needed to upgrade it manually using the following commands:
sudo zfs upgrade my_pool
sudo zpool upgrade my_pool
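To see whether anything actually needs upgrading, you can run both commands without arguments first; each simply lists any file system or pool that is not at the latest version supported by the installed ZFS:

#List file systems that are not running the latest ZFS version
zfs upgrade
#List pools that are not running the latest pool version
zpool upgrade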
Suppose you have a very fast SSD. You can use it as a log or cache device for your ZFS pool.
To improve the reading performance:
sudo zpool add 'zpool name' cache 'ssd device name'
To improve the writing performance:
sudo zpool add 'zpool name' log 'ssd device name'
Before ZFS v19 (FreeBSD 8.3+/9.0+), it was impossible to remove a log device without losing data. I highly recommend adding the log drives as a mirror, i.e.,
sudo zpool add zpool_name log mirror /dev/log_drive1 /dev/log_drive2
Now you may ask a question: how about using a RAM disk as a log/cache device? First, ZFS already uses your system memory for I/O, so you don't need to set up a dedicated RAM disk yourself. Also, using a RAM disk as a log (write) device is a bad idea: when something goes wrong, such as a power failure, you will lose the data that was being written.
Did you know that ZFS works faster on a multiple-device pool than on a single-device pool, even when both have the same total storage size?
If you need performance, go with mirror, not RAIDZ. When ZFS stores data in a mirror pool, it simply stores the whole file on each device. When it reads the file back, it can fetch a partial copy from each device and combine them at the end. Since this happens in parallel, it speeds up the read.
On the other hand, RAIDZ works a bit differently. Suppose there are N devices in your RAIDZ pool. When data is written to a RAIDZ pool, ZFS needs to divide it into N-1 parts first, calculate the parity, and write all of it across the N devices. When reading the data back, ZFS reads from N-1 devices first, makes sure the result is okay (otherwise it reads the data again from the remaining device), and combines the pieces. This extra work adds overhead, which is why RAIDZ is always slower than mirror.
So this is the ideal setup, given that you have enough budget: a pool of striped mirrors.
Command to create a mirror zpool.
sudo zpool create zpool_name mirror /dev/hd1 /dev/hd2 mirror /dev/hd3 /dev/hd4 mirror /dev/hd5 /dev/hd6
Here is another model: stripe only, very fast but with no data security.
Command to create a stripe-only zpool.
sudo zpool create zpool_name /dev/hd1 /dev/hd2 /dev/hd3 /dev/hd4 /dev/hd5 /dev/hd6
Here is the most popular model: RAIDZ, not so fast, but with okay data security.
Command to create a RAIDZ zpool.
sudo zpool create zpool_name raidz /dev/hd1 /dev/hd2 /dev/hd3 /dev/hd4 /dev/hd5 /dev/hd6
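If you can spare one more disk's worth of capacity, raidz2 uses the same syntax and survives two simultaneous device failures (shown here only as a variation; the rest of this article assumes plain RAIDZ):

sudo zpool create zpool_name raidz2 /dev/hd1 /dev/hd2 /dev/hd3 /dev/hd4 /dev/hd5 /dev/hd6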
One of the important tricks to improve ZFS performance is to keep the free space evenly distributed across all devices.
You can check it using the following command:
zpool iostat -v
The free space is shown in the second column (available capacity):
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     3.23T  1.41T      0      3  49.1K   439K
  ad4        647G   281G      0      0  5.79K  49.2K
  ad8        647G   281G      0      0  5.79K  49.6K
  ad10       647G   281G      0      0  5.82K  49.6K
  ad16       647G   281G      0      0  5.82K  49.6K
  ad18       647G   281G      0      0  5.77K  49.5K
When ZFS writes new data to replace old data, it first writes the new data into free space, then moves the file pointer from the old blocks to the new ones. So even if there is a power failure while the data is being written, no data is lost, because the file pointer still points to the old copy. That's why ZFS does not need fsck (file system check).
In order to keep performance at a good level, we need to make sure that free space is available on every device in the pool. Otherwise, ZFS can only write the data to some of the devices instead of all of them. In other words, the more devices ZFS can write to, the better the performance.
Technically, if the structure of a zpool has never been modified, you should not need to worry about the free space distribution, because ZFS takes care of that for you automatically. However, when you add a new device to an existing pool, that is a different story, e.g.,
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     3.88T  2.33T      0      3  49.1K   439K
  ad4        647G   281G      0      0  5.79K  49.2K
  ad8        647G   281G      0      0  5.79K  49.6K
  ad10       647G   281G      0      0  5.82K  49.6K
  ad16       647G   281G      0      0  5.82K  49.6K
  ad18       647G   281G      0      0  5.77K  49.5K
  ad20          0   928G      0      0  5.77K  49.5K
In this example, I added a 1TB hard drive (ad20) to my existing pool, which gives about 928GB of free space. Let's say I add a 6GB file; the free space will then look something like this:
               capacity     operations    bandwidth
pool         used  avail   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
storage     4.48T  1.73T      0      3  49.1K   439K
  ad4        648G   280G      0      0  5.79K  49.2K
  ad8        648G   280G      0      0  5.79K  49.6K
  ad10       648G   280G      0      0  5.82K  49.6K
  ad16       648G   280G      0      0  5.82K  49.6K
  ad18       648G   280G      0      0  5.77K  49.5K
  ad20         1G   927G      0      0  5.77K  49.5K
In other words, ZFS still divides my 6GB file into six roughly equal pieces and writes one piece to each device. Eventually, ZFS will use up the free space on the older devices, and it will only be able to write new data to the new device (ad20), which will decrease the performance. Unfortunately, there is no way to redistribute the data and free space evenly without destroying the pool, i.e.:
1. Back up your data
2. Destroy the pool
3. Rebuild the pool
4. Put the data back
Depending on how much data you have, it can take 2 to 3 days to copy 5TB of data from one server to another over a gigabit network. You don't want to use scp to do it, because you will need to redo everything if the process is interrupted. In my case, I use rsync:
(One single line)
#Run this command on the production server:
rsync -avzr --delete-before backup_server:/path_to_zpool_in_backup_server /path_to_zpool_in_production_server
Of course, netcat is a faster way if you don’t care about the security. (scp / rsync will encrypt the data during transfer).
ZFS comes with a very cool feature: it allows you to save multiple copies of the same data in the same pool, which adds an additional layer of data security. However, I don't recommend using this feature for backup purposes, because it adds more work when writing data to the disks, and I don't think it is a good way to secure the data. I prefer to set up a mirror on a different server (master-slave). The chance of two machines failing at the same time is much smaller than the chance of one machine failing, so the data is safer in this setup.
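For reference, this behavior is controlled by the copies property, shown here only for completeness; as explained above, I leave it at the default of 1 and mirror to a second server instead (myzpool is a placeholder):

#Ask ZFS to store two copies of every block in this dataset (default is 1)
sudo zfs set copies=2 myzpool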
Here is how I synchronize two machines together:
Create a script in the slave machine: getContentFromMaster.sh
(One single line)
rsync -avzr -e ssh --delete-before master:/path/to/zpool/in/master/machine /path/to/zpool/in/slave/machine
And run this script from a cron job, i.e.,
/etc/crontab
@daily root /path/to/getContentFromMaster.sh
Now, you may ask a question: should I go with stripe-only ZFS (i.e., striping only; no mirror, RAIDZ, or RAIDZ2) when I set up my pool? Yes and no. ZFS allows you to mix hard drives of any size in one single pool. Unlike RAID{0,1,5,10} and concatenation, the drives can be any size and no disk space is lost, i.e., you can combine 1TB, 2TB, and 3TB drives into one single pool while still enjoying data striping (total usable space = 6TB). It is fast (because there is no overhead such as parity) and simple. The only downside is that the entire pool stops working if even one device fails.
Let's come back to the question: should we use simple striping in a production environment? I prefer not to. Stripe-only ZFS spreads all data across all vdevs. If each vdev is simply a single hard drive and one of them fails, there is NO WAY to get the original data back. If something goes wrong on the master machine, the only option is to destroy and rebuild the pool and restore the data from the backup. (This process can take hours to days if you have a large amount of data, say 6TB.) Therefore, I strongly recommend using at least RAIDZ in a production environment. If one device fails, the pool keeps working and no data is lost. Simply replace the bad hard drive with a good one and everything is good to go.
To minimize downtime when something goes wrong, go with at least RAIDZ in a production environment (ideally, RAIDZ or a striped mirror).
For the backup machine, I think using simple striping is completely fine.
Here is how to build a pool with simple striping, i.e., no parity, no mirroring, nothing:
zpool create mypool /dev/dev1 /dev/dev2 /dev/dev3
And here is how to monitor the health
zpool status
Some websites suggest using the following command instead:
zpool status -x
Don't believe it! This command can return "all pools are healthy" even when a device has failed in a RAIDZ pool. In other words, the fact that your data is healthy does not mean that every device in your pool is healthy. So go with "zpool status" at all times.
FYI, it can easily take a few days to copy 10TB of data from one machine to another over a gigabit network. If you need to restore a large amount of data through the network, use rsync, not scp. I found that scp sometimes fails in the middle of a transfer; rsync allows me to resume it at any time.
So what’s the main difference between rsync and ZFS send? What’s the advantage of one over the other?
rsync is a file-level synchronization tool. It simply goes through the source, finds out which files have been changed, and copies the corresponding files to the destination.
ZFS send does something similar. First, it takes a snapshot of the ZFS file system:
zfs snapshot mypool/vdev@20120417
After that, you can generate a file that contains the pool and data information, copy it to the new server, and restore it there:
#Method 1: Generate a file first
zfs send mypool/vdev@20120417 > myZFSfile
scp myZFSfile backupServer:~/
zfs receive mypool/vdev@20120417 < ~/myZFSfile
Or you can do everything in one single command line:
#Method 2: Do everything over the pipe (one command)
zfs send pool/vdev@20120417 | ssh backupServer zfs receive pool/vdev@20120417
In general, the preparation time for ZFS send is much shorter than for rsync, because ZFS already knows which files have been modified. Unlike rsync, a file-level tool, ZFS send does not need to walk the entire pool to find that information. In terms of transfer speed, the two are similar.
So why do I prefer rsync over ZFS send (in either form)? Because the latter is not practical! With method #1, the obvious issue is storage space, since it requires generating a file that contains your entire pool's data. For example, suppose your pool is 10TB and you have 8TB of data (i.e., 2TB of free space). If you go with method #1, you will need another 8TB of free space just to store the file. In other words, you would need to keep at least 50% of the pool free at all times, which is a very expensive way to run ZFS.
What about method #2? It does not have the storage problem, because it copies everything over the pipe. However, what happens if the process is interrupted? That is a common occurrence, due to heavy network traffic, heavy disk I/O, and so on. In the worst case, you will need to redo everything, say, copying 8TB over the network again.
rsync has neither problem. It uses relatively little space for temporary storage, and if the rsync process is interrupted, you can easily resume it without copying everything again.
Deduplication (dedup) is a space-saving technology. It simply stores one copy of your file instead of storing multiple copies. For example, suppose you have ten identical folders with the same files. If the dedup is enabled, ZFS only stores one copy instead of multiple copies. Notice that dedup is not the same as compression.
The idea of dedup is very simple. ZFS maintains an index of your data. Before writing any incoming file to the pool, it checks whether the storage already has a copy of it. If the file already exists, ZFS skips it. So with dedup enabled, instead of storing 10 identical files, it stores only one copy. Unfortunately, the drawback is that every incoming file has to be checked before any decision is made.
After upgrading my ZFS pool to version 28, I enabled dedup for testing, and found that it caused a huge performance hit: the write speed over the network dropped from 80MB/s to 5MB/s! After disabling the feature, the speed went back up.
sudo zfs set dedup=off your-zpool
In general, dedup is an expensive feature that requires a lot of hardware resources: roughly 5GB of memory per 1TB of storage. For example, if the zpool is 10TB, I would need 50GB of memory (and I only have 12GB). Therefore, think twice before enabling dedup!
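If you are curious how large the dedup table would be for your own data, zdb can simulate it without actually enabling dedup (this walks the whole pool, so it can take a long time; mypool is a placeholder):

#Show the current dedup ratio of each pool
zpool list
#Simulate deduplication and print a histogram of the would-be dedup table (DDT)
sudo zdb -S mypool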
Notice that disabling dedup will not solve all of the performance problems. For example, if you enable dedup and then disable it later, all files stored during that period remain dedup-dependent, even after dedup is disabled. When you need to update these files (e.g., delete them), the system still has to consult the dedup index before processing them, so the performance issue remains for these affected files. New files should be fine. Unfortunately, there is no way to find out which files are affected; the only way is to destroy and rebuild the ZFS pool, which clears the dedup records.
Sometimes, reinstalling your old system from scratch can help improve performance. Recently, I decided to reinstall my FreeBSD box. It was an old box that started out on FreeBSD 6 (released in 2005, about 8 years ago). Although I upgraded the system at every release, it had accumulated a lot of junk and unused files over the years. So I decided to reinstall the system from scratch. After the installation, I could tell that the system was more responsive and stable.
Before you wipe out the system, you can export the ZFS pool using the following command:
sudo zpool export mypool
After the work is done, you can import the data back:
sudo zpool import mypool
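If you forget the exact pool name after the reinstall, running zpool import without any arguments simply lists the pools that are available for import (nothing is changed until you name one):

#List pools that can be imported
sudo zpool import
#Then import the one you want by name
sudo zpool import mypool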
Recently, I found that my overall ZFS system was slow no matter what I did. After some investigation, I noticed that the bottleneck was my RAID card. Here are my suggestions:
1. Connect your disks to the ports with the highest speed. For example, my PCI-e RAID card delivers higher speed than my PCI RAID card. One way to verify the speed is with dmesg, e.g.,
dmesg | grep MB

#Connected via PCI card. Speed is 1.5Gb/s
ad4: 953869MB at ata2-master UDMA100 SATA 1.5Gb/s

#Connected via PCI-e card. Speed is 3.0Gb/s
ad12: 953869MB at ata6-master UDMA100 SATA 3Gb/s
In this case, the overall speed is limited by the slowest link (1.5Gb/s), even though the rest of my disks run at 3Gb/s.
2. Some RAID cards come with advanced features such as RAID, linear RAID, compression, etc. Make sure you disable these features first: you want to minimize the workload of the card and maximize the I/O speed, and enabling these extra features only slows down the overall process. You can disable the settings in the card's BIOS. FYI, most RAID cards in the $100 range are "software RAID", i.e., they use the system CPU to do the work. Personally, I think these fancy features are designed for Windows users; you really don't need any of them in the Unix world.
Enjoy ZFS.
–Derrick