Category: Servers & Storage

2008-01-18 16:15:09




What is data deduplication?
Deduplication is similar to data compression, but it looks for redundancy in very large sequences of bytes across very large comparison windows. Long (8KB+) sequences are compared to the history of other such sequences, and where possible, the first uniquely stored version of a sequence is referenced rather than stored again. In a storage system, this is all hidden from users and applications, so the whole file remains readable after it has been written.
Why deduplicate data?
Eliminating redundant data can significantly shrink storage requirements and improve bandwidth efficiency. Because primary storage has gotten cheaper over time, enterprises typically store many versions of the same information so that new work can reuse old work. Some operations, such as backup, store extremely redundant information. Deduplication lowers storage costs because fewer disks are needed, and shortens backup/recovery times because there can be far less data to transfer. In the context of backup and other nearline data, we can safely assume there is a great deal of duplicate data: the same data keeps getting stored over and over again, consuming unnecessary storage space (disk or tape), electricity (to power and cool the disk or tape drives), and bandwidth (for replication), creating a chain of cost and resource inefficiencies within the organization.
How does data deduplication work?
Deduplication segments the incoming data stream, uniquely identifies the data segments, and then compares the segments to previously stored data. If an incoming data segment is a duplicate of what has already been stored, the segment is not stored again, but a reference is created to it. If the segment is unique, it is stored on disk.
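To make the mechanism concrete, below is a minimal sketch in Python. It assumes fixed 8 KB segments and SHA-256 fingerprints purely for illustration; real products use their own segmenting and identification schemes, and the in-memory dictionaries stand in for on-disk structures.

```python
import hashlib

SEGMENT_SIZE = 8 * 1024  # 8 KB segments, echoing the "8KB+" sequences described above

class DedupStore:
    """Toy segment store: unique segments are kept once, duplicates become references."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes (stands in for on-disk storage)
        self.files = {}     # file name -> ordered list of fingerprints (the references)

    def write(self, name, data):
        refs = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fingerprint = hashlib.sha256(segment).hexdigest()
            if fingerprint not in self.segments:   # unique segment: store it on "disk"
                self.segments[fingerprint] = segment
            refs.append(fingerprint)               # duplicate or not, record a reference
        self.files[name] = refs

    def read(self, name):
        # The whole file is reconstructed transparently from its references.
        return b"".join(self.segments[fp] for fp in self.files[name])

store = DedupStore()
store.write("backup-week1", b"A" * 32768)
store.write("backup-week2", b"A" * 32768 + b"B" * 8192)   # only the new 8 KB is stored again
assert store.read("backup-week2") == b"A" * 32768 + b"B" * 8192
print(len(store.segments), "unique segments stored")       # 2 unique, although 9 were written
```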

For example, a file or volume that is backed up every week creates a significant amount of duplicate data. Deduplication algorithms analyze the data and store only the compressed, unique changed elements of that file. Under typical backup retention policies on normal enterprise data, this process can provide an average 10-30x or greater reduction in storage capacity requirements. This means companies can store 10TB to 30TB of backup data on 1TB of physical disk capacity, which has huge economic benefits.
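As a rough, back-of-the-envelope illustration of where ratios in that range come from (the retention policy and change rate below are assumptions, not measurements):

```python
# Assumed policy: 1 TB weekly full backups, 13 weeks of retention, ~2% new data per week.
full_size_tb = 1.0
weeks_retained = 13
weekly_change_rate = 0.02

logical_tb = full_size_tb * weeks_retained  # what a plain disk or tape target would hold
physical_tb = full_size_tb + full_size_tb * weekly_change_rate * (weeks_retained - 1)

print(f"logical: {logical_tb:.1f} TB, physical: {physical_tb:.2f} TB, "
      f"reduction: ~{logical_tb / physical_tb:.0f}x")
# -> logical: 13.0 TB, physical: 1.24 TB, reduction: ~10x (before any local compression)
```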
Is deduplication easy to implement?
This is vendor dependent. Data Domain has made it very easy by creating a fast, application-independent storage system (attachable as a file server over Ethernet or as a VTL over Fibre Channel). No client software or other configuration is required. As a result, Data Domain deduplication should be invisible to backup, recovery, and other nearline storage processes. It should easily work with various data movers and workloads, including non-backup data such as e-mail archives, reference data, and engineering revision libraries. More flexibility means more consolidation is possible using less physical infrastructure.
Is SIS (Single Instance Store) a form of deduplication?
Reducing duplicate file copies is a limited form of deduplication sometimes called single instance storage, or SIS. This file-level deduplication is intended to eliminate redundant (duplicate) files on a storage system by saving only a single instance of each file.

If you change the title of a 2 MB Microsoft Word document, SIS retains the first copy of the Word document and stores the entire modified document as well. Any change to a file requires the entire changed file to be stored, so frequently changed files do not benefit from SIS. Data deduplication, which reduces data at the sub-file level, would recognize that only the title had changed - and in effect store only the new title, with pointers to the rest of the document's content segments.
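The contrast can be sketched in a few lines of Python; the 8 KB segment size and SHA-256 fingerprints are illustrative assumptions, and the random "body" simply stands in for document content that does not change:

```python
import hashlib, os

def whole_file_dedup(files):
    """SIS-style: one fingerprint per file, so any edit stores the whole file again."""
    return {hashlib.sha256(data).hexdigest(): data for data in files}

def subfile_dedup(files, segment_size=8 * 1024):
    """Sub-file deduplication: only segments never seen before are stored."""
    store = {}
    for data in files:
        for i in range(0, len(data), segment_size):
            seg = data[i:i + segment_size]
            store.setdefault(hashlib.sha256(seg).hexdigest(), seg)
    return store

body = os.urandom(2_000_000)                          # ~2 MB of unchanged document content
original = b"Old Title".ljust(8 * 1024, b".") + body  # title padded into the first segment
retitled = b"New Title".ljust(8 * 1024, b".") + body  # only the first segment differs

sis = whole_file_dedup([original, retitled])
sub = subfile_dedup([original, retitled])
print(sum(len(v) for v in sis.values()))  # ~4 MB: SIS keeps both complete copies
print(sum(len(v) for v in sub.values()))  # ~2 MB + 8 KB: only the changed segment is added
```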

Generally, Data Domain enables 2x-4x data reduction on an initial full backup, 6x-7x reduction on subsequent file-level incrementals, and 50x-60x reduction on subsequent full backups. SIS offers no benefit on the initial full or on file-level incrementals, so at that level, Data Domain deduplication is 80%-90% more efficient (meaning that much less storage is required) than SIS.

With structured data, the gap is even bigger. Databases change daily and are generally backed up in full daily. SIS offers no benefit here, but Data Domain deduplication can often achieve 50x compression effects on this data.
What data deduplication rates are expected?
First, redundancy will vary by application, frequency of version capture, and retention policy. Significant variables include the rate of data change (fewer changes mean more duplicate data to eliminate), the frequency of backups (more full backups increase the compression effect), the retention period (longer retention means more data to compare against), and the size of the data set (more data means more to deduplicate).

When comparing different approaches, be sure to compare against a common baseline. For example, some backup software offers deduplication, but these packages also use incremental-forever backup policies; for a high-contrast comparison, they then quote their dedupe effect against daily-full-backup policies with very long retention. (Data Domain tends to characterize dedupe behavior under a daily-incremental, weekly-full backup policy with 1-4 months of retention.)

The deduplication technology approach and the granularity of the deduplication process will also affect compression rates. Data reduction techniques typically split each file into segments or chunks; the segment size varies from vendor to vendor. If the segment size is very large, fewer segment matches will occur, resulting in smaller storage savings (lower compression rates). If the segment size is very small, more redundancy can be found in the data.

Vendors also differ on how to split up the data. Some vendors split data into fixed length segments, while others use variable length segments.
  • Fixed-length segments (also called blocks). The main limitation of this approach is that when the data in a file is shifted, for example when adding a slide to a PowerPoint deck, all subsequent blocks in the file shift and are likely to be treated as different from those in the original file, so the compression effect is much less significant. Smaller blocks get better deduplication than large ones, but take more processing to deduplicate.
  • Variable-length segments. A more advanced approach is to anchor variable-length segments based on their interior data patterns (content-defined chunking). This solves the data-shifting problem of the fixed-size block approach; a simplified sketch follows below.
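The following is an illustrative sketch of the anchoring idea in Python. The mask, minimum segment size, and toy hash are assumptions chosen for readability; production systems use stronger rolling fingerprints (such as Rabin fingerprints) and enforce both minimum and maximum segment sizes.

```python
import os

MASK = 0x1FFF      # boundary when the low 13 hash bits are zero: ~8 KB average segments (assumed)
MIN_SEGMENT = 48   # avoid degenerate, tiny segments (assumed)

def chunk(data):
    """Content-defined chunking: segment boundaries depend on local byte patterns,
    not absolute offsets, so inserting data early in a file only disturbs the
    segments near the insertion point instead of shifting every later one."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + byte) & 0xFFFFFFFF        # toy rolling-style hash, not a real Rabin fingerprint
        if i - start + 1 >= MIN_SEGMENT and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks

body = os.urandom(200_000)
before = chunk(body)
after = chunk(b"one inserted slide" + body)       # data shifted by an insertion at the front
shared = len(set(before) & set(after))
print(f"{shared} of {len(before)} segments survive the shift")  # most do; fixed blocks would all change
```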
What is the difference between inline vs. post-process deduplication?
Inline deduplication means the data is deduplicated before it is written to disk. Post-process deduplication analyzes and reduces data after it has been stored on disk.

Inline deduplication is the most efficient and economical method of deduplication. It significantly reduces the raw disk capacity needed in the system, since the full, not-yet-deduplicated data set is never written to disk. If replication is supported as part of the inline deduplication process, inline also optimizes time-to-DR (disaster recovery) far beyond other methods, because the system does not need to absorb the entire data set and then deduplicate it before it can begin replicating to the remote site.

Post-process deduplication technologies wait for the data to land in full on disk before initiating the deduplication process. This approach requires a greater initial capacity overhead than inline solutions. It also increases the lag before deduplication is complete and, by extension, before replication can complete, since it is highly advantageous to replicate only deduplicated (small) data. In practice, it also tends to create significant operational issues, since there are two storage zones, each with its own policies and behaviors to manage. In some cases, because the zone holding the redundant data is the default and more important design for some vendors, the dedupe zone is also much less performant and resilient.
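A toy contrast between the two models, continuing the Python sketches above; the segmenting helper and in-memory structures are illustrative stand-ins, not any vendor's implementation:

```python
import hashlib

SEGMENT_SIZE = 8 * 1024

def segments(data):
    """Yield (fingerprint, segment) pairs for a backup stream."""
    for i in range(0, len(data), SEGMENT_SIZE):
        seg = data[i:i + SEGMENT_SIZE]
        yield hashlib.sha256(seg).hexdigest(), seg

def inline_backup(data, store, replicated):
    # Deduplicate before anything lands on disk: only unique segments consume raw
    # capacity, and each one can be replicated while the backup is still running.
    for fp, seg in segments(data):
        if fp not in store:
            store[fp] = seg
            replicated.append(fp)      # replication of this segment can begin immediately

def post_process_backup(data, staging, store, replicated):
    staging.append(data)               # the full, un-deduplicated backup lands on disk first
    # ...backup window ends; a second pass re-reads the staging area...
    for landed in staging:
        for fp, seg in segments(landed):
            if fp not in store:
                store[fp] = seg
                replicated.append(fp)  # replication, and DR readiness, lag the backup
    staging.clear()                    # staging capacity is released only after this pass
```

In both cases the deduplicated store ends up the same; the difference is when raw capacity is consumed and when replication can begin.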
Is there an advantage to parsing backup formats to deduplicate?
To be application independent and support the broad variety of nearline applications, it is much more straightforward to work independently of application-specific formats. Some vendors go against this trend and are content-dependent. This means they are locked into supporting particular backup products and revisions; they parse those formats and create an internal file system, so that when a new file version comes in, they can compare it to its prior entry in their directory and store only the changes, not unlike a version control system for software development.

This approach sounds promising - it could optimize compression tactics for particular data types, for example - but in practice it has more weaknesses than strengths. First, it is very capital-intensive to develop. Second, it always involves some amount of reverse engineering, and sometimes the format originators are not supportive, so it will never be universal. Third, it makes it hard to find redundancy elsewhere in the originating client space; it only compares versions of files from the same client/file system, and the cross-client redundancy it misses is much larger than any file-type-specific compression optimization can recover. Finally, it is hard to deploy; it can often require additional policy set-up on a per-backup-policy or per-file-type basis. Done right, that is onerous; done wrong, it leaves a lot of redundancy undeduplicated.
How does deduplication improve off-site replication and DR?
The effect deduplication has on replication and Disaster Recovery windows can be profound. To start, deduplication means a lot less data needs transmission to keep the DR site up to date, so much less expensive WAN links may be used.

Second, replication goes much faster because there is less data to send. The length of the deduplication process (beginning to end) depends on many variables, including the deduplication approach, the speed of the architecture, and the DR process. For the shortest time-to-DR, inline deduplication combined with inline replication of the deduplicated data yields the most aggressive and efficient results. In an inline deduplication approach, replication happens during the backup, significantly improving the time by which there is a complete restore point at the DR site - that is, the time to DR readiness.

Typically less than 1% of a full backup consists of new, unique data sequences, and with inline deduplication these can be sent over the WAN as soon as the backup starts. Aggressive cross-site deduplication, when multiple sites replicate to the same destination, adds further value by deduplicating across all backup replication streams and all local backups. Unique deduplicated segments previously transferred by any remote site, or held in local backup, are then used in the deduplication process to further improve network efficiency by reducing the data to be vaulted. In other words, if the destination system already has a data sequence that came from a remote site or a local backup, and that same sequence is created at another remote site, it will be identified as redundant by the Data Domain system before it consumes bandwidth traveling across the network to the destination system. All of the data collected at the destination site can be safely moved off-site to a single location or multiple DR sites.
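A sketch of the fingerprint-negotiation idea behind such replication (an illustration of the general technique, not Data Domain's actual protocol; the class names and sizes are assumptions):

```python
import hashlib

class Destination:
    """Toy shared destination that several remote sites and local backups feed."""
    def __init__(self):
        self.segments = {}
    def missing(self, fingerprints):
        return [fp for fp in fingerprints if fp not in self.segments]
    def store(self, fp, data):
        self.segments[fp] = data

def replicate(local_segments, destination):
    """Send only segments the destination does not already hold, regardless of
    which site or local backup originally contributed them."""
    wanted = destination.missing(local_segments.keys())   # cheap: fingerprints only
    for fp in wanted:
        destination.store(fp, local_segments[fp])          # expensive: actual data over the WAN
    return len(wanted), len(local_segments)

# Two sites back up largely identical data; the second site's transfer is almost all skipped.
hub = Destination()
site_a = {hashlib.sha256(bytes([b]) * 8192).hexdigest(): bytes([b]) * 8192 for b in range(100)}
site_b = dict(list(site_a.items())[:95])                    # 95% overlap with site A
print(replicate(site_a, hub))                               # (100, 100): everything is new
print(replicate(site_b, hub))                               # (0, 95): nothing crosses the WAN
```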
Is deduplication of data safe?
It is very difficult to harden a storage system so that it has the resiliency needed to remain operational through a drive failure or a power failure. Find out what technologies the deduplication solution has to ensure data integrity and protection against system failures. The system should tolerate deletions, cleaning, rebuilding a drive, multiple drive failures, and power failures - all without data loss or corruption. While this is always important in storage, it is an even bigger consideration in data protection with deduplication: there may be 1,000 backup images that rely on one copy of source data, so that source data must remain accessible and maintain a high level of data integrity.

While the need is higher for data integrity in deduplication storage, it also offers new opportunities for data verification. In Data Domain's case, we take full advantage of the small resulting data size to do a complete internal test recovery, end to end, through the file system and to the disk platter, following each backup. There is less data to read than in a normal disk system, so this read-after-write operation is possible.
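The verification idea can be illustrated in a few lines; this is a generic sketch, not Data Domain's internal mechanism, and the store argument is assumed to expose the write()/read() interface of the toy DedupStore sketched earlier:

```python
import hashlib

def backup_with_verification(name, data, store):
    """Write a backup, then immediately read it back through the same path and compare
    checksums - practical as a routine step only because the deduplicated data is small."""
    expected = hashlib.sha256(data).hexdigest()
    store.write(name, data)            # store the backup (e.g. a DedupStore instance)
    recovered = store.read(name)       # end-to-end test recovery of what was just written
    if hashlib.sha256(recovered).hexdigest() != expected:
        raise IOError(f"verification failed for {name}: stored data does not match the source")
    return expected
```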
How will data deduplication affect my backup and restore performance?
Restore access time will be faster than tape, since the data is online and random access. Throughput will vary by vendor. Data deduplication is a resource-intensive process: during writes, it needs to determine whether some new small sequence of data has been stored before, often across hundreds of terabytes of prior data. A simple index of this data is too big to fit in RAM unless it is a very small deployment, so the system needs to seek on disk, and disk seeks are notoriously slow (and not getting better).

The easiest ways to make data deduplication go fast are (1) to be worse at data reduction, e.g. look only for big sequences, so you don't have to perform disk seeks as frequently; and (2) to add more hardware, e.g. so there are more disks across which to spread the load. Both have the unfortunate side effect of raising system price, so it becomes less attractive against tape from a cost perspective. Vendors vary in their approaches.
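One widely used way to cut those disk seeks (shown here as a generic illustration, not a description of any particular vendor's design) is to keep a compact, Bloom-filter-style summary of the fingerprint index in RAM, so that most brand-new segments are recognized as new without touching disk:

```python
import hashlib

class FingerprintIndex:
    """Toy index: an in-RAM summary sits in front of a (slow) on-disk index.
    A 'definitely not seen' answer from the summary avoids a disk seek entirely;
    only a 'possibly seen' answer pays for the expensive on-disk lookup."""

    def __init__(self, bits=1 << 24):
        self.bits = bits
        self.summary = bytearray(bits // 8)  # ~2 MB of RAM standing in for the summary
        self.on_disk = {}                    # stands in for the full index on disk
        self.disk_lookups = 0

    def _positions(self, fingerprint):       # fingerprint is assumed to be bytes, e.g. a SHA-256 digest
        digest = hashlib.sha256(fingerprint).digest()
        return [int.from_bytes(digest[i:i + 4], "big") % self.bits for i in (0, 4, 8)]

    def add(self, fingerprint):
        for p in self._positions(fingerprint):
            self.summary[p // 8] |= 1 << (p % 8)
        self.on_disk[fingerprint] = True

    def seen(self, fingerprint):
        if any(not self.summary[p // 8] & (1 << (p % 8)) for p in self._positions(fingerprint)):
            return False                     # summary says "definitely new": no disk seek needed
        self.disk_lookups += 1               # possible duplicate: pay for one on-disk lookup
        return fingerprint in self.on_disk
```

How large that summary must be, and how often a "possibly seen" answer still forces a disk lookup, is exactly the performance-versus-price trade-off described above.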

Understand:
  • Single stream backup and restore throughput. This is how fast a given file/database can be written, read, or copied to tape for longer-term archiving. The numbers may differ: read speed and write speed may have separate issues. Because of backup windows for critical data, backup throughput is what most people ask about, though restore time is more significant for most SLAs.
  • Aggregate backup/restore throughput per system. With many streams, how fast can a given controller go? This will help gauge the number of controllers/systems needed for your deployment. It is mostly a measure of system management (# systems) and cost - single stream speed is more important for getting the job done.
  • Types of data. For example, will large files, such as databases or Exchange stores, go slower than small files? Some deduplication approaches look for simple tricks to increase average performance, e.g. identifying common whole files. These approaches do not work with structured data, which tends to be large. So the easiest big test of a dedupe system is to see what the dedupe throughput is on big database files day over day. In some cases, it will go slow; in others, it will get poor deduplication (e.g. by using a very large fixed segment).
  • Is the 30th backup different from the 1st? If you back up images and delete them over time, does the performance of the system change? Because deduplication scatters references for new data across the store, do the recovery characteristics of a recent backup (what you will mostly be recovering) change a month or two into deployment compared with the first pilot? In a well-designed deduplication system, restore of a new backup should not change significantly even a year into deployment. Surprisingly, not all vendors offer this behavioral consistency.
Performance in your deployment will depend on many factors, including the backup software and the systems and networks supporting it.
Is deduplication performance determined by the number of disk drives used?
In any storage system, the disk drives are the slowest component. To get greater performance, it is common practice to stripe data across a large number of drives so they work in parallel to handle I/O. If a system uses this method to reach performance requirements, you need to ask what the right balance is between performance and capacity. This is important because the whole point of data deduplication is to reduce the number of disk drives.

In Data Domain's SISL implementation, an inline, CPU-centric approach, very few disk drives are needed, so its deduplication delivers on the expectation of a smaller storage system.
How much "upfront" capacity does deduplication require?
This is not a question for inline deduplication systems, but it is for post-process systems. Post-process methods require additional capacity to temporarily store duplicate backup data. How much disk capacity is needed depends on the size of the backup data sets, how many backup jobs you run on a daily basis, and how long the deduplication technology "holds on" to the capacity before releasing it. Post-process solutions that wait for the backup process to complete before beginning to deduplicate will require larger disk caches than those that start deduplicating during the backup.
What are best practices in choosing a deduplication solution?
  • Ensure ease of integration to existing environment.
  • Get customer references - in your industry.
  • Pilot the product/technology - in your environment.
  • Understand the vendor's roadmap.