I think everyone can agree that data storage is exploding at a fairly fast, some say alarming, rate. This means administrators are working overtime to keep everything humming so that users never see the hard work going on behind the scenes: quota management, snapshots, backups, replication, disaster recovery preparation, off-site copies of data, restoration of user data that has been erased, monitoring data growth and usage, and a thousand other tasks that keep things running smoothly (picture synchronized swimmers who look graceful above the water while underneath the surface their legs and arms are moving at a furious rate).
Now that I have equated storage experts to synchronized swimmers and probably upset all of them (my apologies), let’s look at a new technology that is trying to make their lives easier while also saving money. This technology is called data deduplication. While it is billed as something new, I hope to show that it’s really an older technology with a new twist that can be used to great effect on many storage systems. Without further ado, let’s examine data deduplication.
Introduction
Data deduplication is, quite simply, removing copies (duplicates) of data and replacing them with pointers to the first (unique) copy of the data. Fundamentally, this technology reduces the total amount of storage required, which can result in many things:
- Saving money (no need to buy additional capacity)
- Reducing the size of backups, snapshots, etc. (saving money, time, etc.)
- Reducing power requirements (less disk, less tape, etc.)
- Reducing network requirements (less data to transmit)
- Saving time
- Making disk-based backups more feasible, since the amount of data to store is reduced
These results are the fundamental reason that data deduplication technology is all the rage at the moment. Who doesn’t like saving money, time, network bandwidth, etc.? But as with everything, the devil is in the details. This article presents the concepts and issues involved in data deduplication.
Deduplication is not really a new technology; it is an outgrowth of compression. Compression searches a single file for repeated binary patterns and replaces duplicates with pointers to the original (unique) piece of data. Data deduplication extends this concept to include deduplication…
- Within files (just like compression)
- Across files
- Across applications
- Across clients
- Over time
A quick illustration of deduplication versus compression: if you have two files that are identical, compression works on each file independently, so you still end up storing two compressed copies. Data deduplication recognizes that the files are duplicates and stores only the first one. In addition, it can also search the first file for duplicate data, further reducing the size of the stored data (à la compression).
A very simple example of data deduplication is shown in Figure 1 below.
Figure 1 - Data Deduplication Example
In this example there are three files. The first file, document1.docx, is a simple Microsoft Word file that is 6MB in size. The second file, document2.docx, is just a copy of the first file with a different file name. Finally, the last file, document_new.docx, is derived from document1.docx with some small changes to the data and is also 6MB in size.
Let’s assume that the data deduplication process divides each file into six pieces (a very small number, for illustrative purposes only). The first file has pieces A, B, C, D, E, and F. The second file, since it’s a copy of the first, has exactly the same pieces. The third file has one changed piece, labeled G, and is also 6MB in size. Without data deduplication, a backup of these files would have to store 18MB of data (6MB × 3). But with data deduplication, only the first file and the new piece G from the third file are backed up, for a total of 7MB of data.
One additional feature that data deduplication offers is that after the backup, the pieces A, B, C, D, E, F, and G are typically stored in a list (sometimes called an index). When new files are backed up, their pieces are compared to the ones that have already been backed up. This is what makes data deduplication over time possible.
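To make this concrete, here is a minimal Python sketch of the idea. The piece size, file contents, and names are invented purely for illustration and are not taken from any real product: each file is split into fixed-size pieces, a piece is stored only the first time it is seen, and every later occurrence becomes a pointer into the index.

```python
# Toy illustration of piece-level deduplication across files.
PIECE_SIZE = 4  # bytes per piece; real systems use much larger blocks (e.g., 4KB to 128KB)

def split_into_pieces(data, piece_size=PIECE_SIZE):
    """Split a byte string into fixed-size pieces."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]

def deduplicate(files):
    """Store each unique piece once; return the index and each file's list of pointers."""
    index = {}    # unique piece -> piece id
    recipes = {}  # file name -> list of piece ids (pointers into the index)
    for name, data in files.items():
        pointers = []
        for piece in split_into_pieces(data):
            if piece not in index:
                index[piece] = len(index)   # first time this piece is seen: store it
            pointers.append(index[piece])   # otherwise just point at the stored copy
        recipes[name] = pointers
    return index, recipes

# Mirror the example in Figure 1: two identical files and one with a single changed piece.
files = {
    "document1.docx":    b"AAAABBBBCCCCDDDDEEEEFFFF",
    "document2.docx":    b"AAAABBBBCCCCDDDDEEEEFFFF",  # exact copy of the first file
    "document_new.docx": b"AAAABBBBCCCCDDDDEEEEGGGG",  # one piece (G) changed
}

index, recipes = deduplicate(files)
raw_size = sum(len(d) for d in files.values())
stored_size = sum(len(p) for p in index)
print(f"raw: {raw_size} bytes, stored: {stored_size} bytes, "
      f"ratio: {raw_size / stored_size:.1f}:1")
# raw: 72 bytes, stored: 28 bytes -- only the unique pieces are kept
```

Running it on the three toy “documents” shows the same effect as the figure: 72 bytes of raw data reduce to 28 bytes of unique pieces, with every duplicate piece replaced by a pointer.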
One of the first questions asked after, “what is data deduplication?” is, “what level of deduplication can I expect?” The specific answer depends upon the details of the situation and the dedup implementation, but EMC is quoting a range of 20:1 to 50:1 over a period of time.
Devilish Details
Data deduplication is not a “standard” in any sense, so all of the implementations are proprietary and each product does things differently. Understanding the fundamental differences is important for determining when and if they fit into your environment. Typically, deduplication technology is used in conjunction with backups, but it is not necessarily limited to that function. With that in mind, let’s examine some of the ways deduplication can be done.
There are really two main types of deduplication with respect to backups: target-based and source-based. The difference is fairly simple. Target-based deduplication dedups the data after it has been transferred across the network for backup. Source-based deduplication dedups the data before it is backed up. This difference is important for understanding the typical ways that deduplication is deployed.
With target-based deduplication, the deduplication is typically done by a device such as a Virtual Tape Library (VTL). When using a VTL, the data is passed to the backup server and then to the VTL, where it is deduped. The data is sent across the network without being deduped, which increases the amount of data transferred, but the target-based approach does allow you to continue to use your existing backup tools and processes.
Alternatively, in a remote backup situation where you communicate over the WAN, network bandwidth is at a premium. If you still want to use target-based deduplication, the VTL is placed near the servers so it can dedup the data before it is sent over the network to the backup server.
The opposite of target-based dedup is source-based deduplication. In this case the deduplication is done by the backup software. The backup software on the clients talks to the backup software on the backup server to dedup the data before it is transmitted. In essence, the client identifies the pieces of each file to be backed up, and the backup software compares them to pieces that have already been backed up. If a duplicate is found, a pointer is created to the unique piece of data that is already stored.
Source-based dedup can greatly reduce the amount of data transmitted over the network, although there is still some traffic between the clients and the backup server to coordinate the deduping. In addition, since the dedup takes place in software, no additional hardware is needed. But you have to use specialized backup software, so you may have to give up your existing backup tools to gain the dedup capability.
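As a rough sketch of how a source-based scheme might work (this is just the concept, not any vendor’s actual protocol, and the class and function names are made up), the client fingerprints each piece with a hash (more on hashing in a moment), asks the backup server which fingerprints it has never seen, and ships only those pieces across the network:

```python
# Conceptual sketch of source-based deduplication: only new pieces cross the network.
import hashlib

class BackupServer:
    """Stands in for the backup server's dedup index and piece store."""
    def __init__(self):
        self.store = {}  # fingerprint -> piece data

    def missing(self, fingerprints):
        """Return the fingerprints the server does not already hold."""
        return [f for f in fingerprints if f not in self.store]

    def receive(self, pieces_by_fingerprint):
        self.store.update(pieces_by_fingerprint)

def backup(client_pieces, server):
    """Send only the pieces whose fingerprints the server has not seen before."""
    fingerprinted = {hashlib.sha1(p).hexdigest(): p for p in client_pieces}
    needed = server.missing(list(fingerprinted))
    server.receive({f: fingerprinted[f] for f in needed})
    return len(needed), len(client_pieces)

server = BackupServer()
monday  = [b"piece-A", b"piece-B", b"piece-C"]
tuesday = [b"piece-A", b"piece-B", b"piece-D"]   # only one piece changed overnight
print(backup(monday, server))   # (3, 3) -> everything is new on the first backup
print(backup(tuesday, server))  # (1, 3) -> only the new piece crosses the network
```

The second backup transmits one piece out of three; the rest become pointers to data the server already holds, which is exactly where the bandwidth savings come from.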
So far the fundamental concepts of deduplication look fairly easy, but many details have been left out. There are many parts of the whole deduplication technology that have to be developed, integrated, and tested for reliability (it is your data, after all). Deduplication companies differentiate themselves by these details. Is the deduplication technology target-based or source-based? What’s the nature of the device and/or software? At what level is the deduplication performed? How are the data pieces compared to find duplicates? And on and on.
Before diving into a discussion about deduplication deployment, let’s talk about dedup algorithms. Recall that deduplication can happen on a file basis, on a block basis (the definition of a block is up to the specific dedup implementation), or even at the bit level. It is extremely inefficient to perform deduplication by taking pieces of data and comparing them, byte for byte, against everything already in an index. To make things easier, dedup algorithms produce a hash of the data piece being deduped using something like MD5 or SHA-1. This hash should be a unique number for that specific piece of data, and it can be easily compared to the hashes stored in the dedup index.
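For example, in Python the standard hashlib module can produce these fingerprints. A minimal sketch (the 128KB piece size is just an assumption for illustration) shows why comparing digests is so much cheaper than comparing the pieces themselves:

```python
# A 128KB piece reduces to a 16- or 20-byte fingerprint, so the dedup index
# only has to store and compare tiny digests rather than the pieces themselves.
import hashlib, os

piece = os.urandom(128 * 1024)                 # one piece of data (random, for illustration)
md5_digest = hashlib.md5(piece).digest()       # 16 bytes
sha1_digest = hashlib.sha1(piece).digest()     # 20 bytes
print(len(piece), len(md5_digest), len(sha1_digest))   # 131072 16 20

index = set()                                  # the dedup index: digests seen so far
if sha1_digest in index:
    print("duplicate piece: store a pointer to the existing copy")
else:
    index.add(sha1_digest)
    print("new piece: store the data and record its digest")
```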
One of the problems with using these hash algorithms is hash collisions. A hash collision is something of a “false positive.” That is, the hash for a piece of data may actually correspond to a different piece of data (i.e., the hash is not unique). Consequently, a piece of data may not be backed up because its hash matches one already stored in the index, even though the data itself is different. Obviously this can lead to data corruption. What dedup companies do is use several hash algorithms, or combinations of them, to make sure a piece of data truly is a duplicate. In addition, some dedup vendors use metadata to help identify and prevent collisions.
Getting an idea of the likelihood of a hash collision requires a little bit of math. The basic conclusion is that the odds of two different pieces of data producing the same 160-bit SHA-1 hash are about 1 in 2^160, which is a huge number. Put another way, if you have 95 EB of data (exabytes; 1 EB is 1,000 PB), you have roughly a 0.00000000000001110223024625156540423631668090820313% chance of getting a false positive in the hash comparison and throwing away a piece of data you should have kept. At that scale it’s not likely you will ever encounter a collision, even over an extended period of time. But never say never (after all, someone once predicted we’d only need 640KB of memory).
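If you want to run the numbers yourself, the usual back-of-the-envelope estimate is the “birthday bound”: with n unique pieces and a b-bit hash, the chance of at least one collision is roughly n(n-1)/2^(b+1). The small sketch below gives a feel for how small that is at the 95 EB scale; the 8KB piece size is purely an assumption, and different piece sizes will shift the result by a few orders of magnitude either way.

```python
# Back-of-the-envelope collision estimate (the "birthday bound"): with n unique
# pieces and a b-bit hash, P(at least one collision) ~= n*(n-1) / 2^(b+1).
# The 95 EB data volume and 8KB piece size are assumptions for illustration only.

def collision_probability(total_bytes, piece_size, hash_bits):
    n = total_bytes // piece_size              # number of pieces to fingerprint
    return n * (n - 1) / 2 ** (hash_bits + 1)  # birthday-bound approximation

p = collision_probability(
    total_bytes=95 * 10**18,   # 95 EB of data
    piece_size=8 * 1024,       # 8KB pieces (an assumption)
    hash_bits=160,             # SHA-1 produces a 160-bit digest
)
print(f"probability of any SHA-1 collision: ~{p:.1e} ({p * 100:.1e}%)")
# Roughly 5e-17 -- vanishingly small, but not literally zero.
```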
Implementation
Choosing one solution over another is a bit of an art and requires careful consideration of your environment and processes. There are a couple of rules of thumb based on the fundamental difference between source-based and target-based deduplication. Source-based dedup approaches are good for situations where network bandwidth is at a premium, such as file systems (you don’t want to transfer an entire file system just to deduplicate it and pass back the results), VMware storage, and remote or branch offices (where bandwidth to a central backup server may be rather limited). Don’t forget that for source-based dedup you will likely have to switch backup tools to get the dedup features.
On the other hand, target-based deduplication works well for SANs, LANs, and possibly databases. The reason for this is that moving the data around the network is not very expensive and you may already have your backup packages chosen and in production.
Finally, recall the vendor claims mentioned earlier: source-based dedup can reportedly achieve a deduplication ratio of around 50:1, and target-based dedup around 20:1. Both levels of dedup are very impressive. There are a number of articles that discuss how to estimate the deduplication ratio you can achieve; a ratio of 20:1 certainly seems possible.
There are many commercial deduplication products. Any list in this article is incomplete and is not meant as a slight toward a particular company. Nevertheless here is a quick list of companies providing deduplication capabilities:
- (owned by EMC)
- has deduplication available in their products
These are a few of the solutions that are available. There are some smaller companies that offer deduplication products as well.
Deduplication and Open-Source
There are not very many (any?) deduplication projects in the open-source world. However, you can use a target-based deduplication device, because it allows you to keep your existing backup software, which could be open-source. It is suggested that you talk to the vendor to make sure they have tested the device with Linux.
The only open-source deduplication project that could be found is called LessFS. It is a FUSE-based file system that has built-in deduplication. It is still early in the development process, but it has demonstrated deduplication capabilities and has incorporated encryption (ah, the beauty of FUSE).
Summary
This has been a fairly short introductory article to deduplication technology. This is one of the hot technologies in storage right now. It holds the promise of saving money because of the reduction in hardware to store the data, as well as a reduction in network bandwidth.
This article is intended to whet your appetite for examining data deduplication and how it might (or might not) be applicable to your environment. Take a look at the various articles on the net - there has been some hype around the technology - and judge for yourself if this is something that might work for you. If you want to try an open-source project, there aren’t very many (any) at all. The only one that could be found is LessFS, a FUSE-based file system that incorporates deduplication. It might be worth investigating, perhaps for secondary storage rather than as your primary file storage.
Jeff Layton is an Enterprise Technologist for HPC at Dell. He can be found lounging around at a nearby Fry’s enjoying the coffee and waiting for sales (but never during working hours).