What is data deduplication?
Deduplication is similar to data compression, but it
looks for redundancy of very large sequences of bytes across very large
comparison windows. Long (8KB+) sequences are compared to the history of other
such sequences, and where possible, the first uniquely stored version of a
sequence is referenced rather than stored again. In a storage system, this is
all hidden from users and applications, so the whole file is readable after
having been written.
Why deduplicate data?
Eliminating redundant data can significantly
shrink storage requirements and improve bandwidth efficiency. Because primary
storage has gotten cheaper over time, enterprises typically store many versions
of the same information so that new work can re-use old work. Some operations
like backup store extremely redundant information. Deduplication lowers storage
costs since fewer disks are needed, and shortens backup/recovery times since
there can be far less data to transfer. In the context of backup and other
nearline data, we can make a strong supposition that there is a great deal of
duplicate data. The same data keeps getting stored over and over again consuming
a lot of unnecessary storage space (disk or tape), electricity (to power and
cool the disk or tape drives), and bandwidth (for replication), creating a chain
of cost and resource inefficiencies within the organization.
How does data deduplication work?
Deduplication segments the incoming data
stream, uniquely identifies the data segments, and then compares the segments to
previously stored data. If an incoming data segment is a duplicate of what has
already been stored, the segment is not stored again, but a reference is created
to it. If the segment is unique, it is stored on disk.
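As a rough illustration of this flow (a sketch only, not Data Domain's implementation: fixed 8 KB segments and an in-memory dictionary stand in for real segmenting and indexing), the following shows how unique segments are stored once and duplicates become references:

    import hashlib

    SEGMENT_SIZE = 8 * 1024  # illustrative 8 KB segments

    class DedupeStore:
        def __init__(self):
            self.segments = {}   # fingerprint -> segment bytes, stored once
            self.recipes = {}    # backup name -> ordered list of fingerprints

        def write(self, name, data):
            recipe = []
            for i in range(0, len(data), SEGMENT_SIZE):
                segment = data[i:i + SEGMENT_SIZE]
                fp = hashlib.sha256(segment).hexdigest()
                if fp not in self.segments:
                    self.segments[fp] = segment   # unique: store it on "disk"
                recipe.append(fp)                 # duplicate or not: keep a reference
            self.recipes[name] = recipe

        def read(self, name):
            # The whole file is reconstructed transparently from its references.
            return b"".join(self.segments[fp] for fp in self.recipes[name])

Writing the same backup twice stores its segments only once, yet both copies read back in full.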
For example, a
file or volume that is backed up every week creates a significant amount of
duplicate data. Deduplication algorithms analyze the data and can store only the
compressed, unique change elements of that file. This process commonly provides a
10-30x or greater reduction in storage capacity requirements under typical
backup retention policies on normal enterprise data. This means
that companies can store 10 TB to 30 TB of backup data on 1 TB of physical disk
capacity, which has huge economic benefits.
Is deduplication easy to implement?
This is vendor dependent. Data
Domain has made it very easy by creating a fast, application-independent storage
system (attachable as a file server over Ethernet or a VTL over Fibre Channel).
No client software or other configuration is required. As a result, Data Domain
deduplication should be invisible to backup and recovery and other nearline
storage processes. It should easily work with various data movers and workloads,
including non-backup data like e-mail archives, reference data and engineering
revision libraries. More flexibility means more consolidation is possible using
less physical infrastructure.
Is SIS (Single Instance Store) a form of deduplication?
Reducing duplicate
file copies is a limited form of deduplication sometimes called single instance
storage or SIS. This file-level deduplication is intended to eliminate redundant
(duplicate) files on a storage system by saving only a single instance of each
file.
If you change the title of a 2 MB Microsoft Word document,
SIS would retain the first copy of the Word document and store the entire copy
of the modified document. Any change to a file requires that the entire changed
file be stored, so frequently changed files benefit little from SIS. Data
deduplication, which reduces data at the sub-file level, would recognize that only the
title had changed - and in effect only store the new title, with pointers to the
rest of the document's content segments.
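To make the contrast concrete, here is a small, hypothetical sketch (the 2 MB document and fixed 8 KB segments are assumptions for illustration): SIS keys on a whole-file hash, so any edit stores the whole file again, while sub-file deduplication stores only the changed segment.

    import hashlib

    SEGMENT_SIZE = 8 * 1024

    def sis_bytes_stored(old, new):
        # File-level (SIS): store the entire new file unless it is byte-identical.
        return 0 if hashlib.sha256(old).digest() == hashlib.sha256(new).digest() else len(new)

    def segment_bytes_stored(old, new):
        # Sub-file dedup: store only segments whose fingerprints are not already held.
        held = {hashlib.sha256(old[i:i + SEGMENT_SIZE]).digest()
                for i in range(0, len(old), SEGMENT_SIZE)}
        return sum(len(new[i:i + SEGMENT_SIZE])
                   for i in range(0, len(new), SEGMENT_SIZE)
                   if hashlib.sha256(new[i:i + SEGMENT_SIZE]).digest() not in held)

    doc_v1 = b"Old Title".ljust(SEGMENT_SIZE, b".") + b"x" * (2 * 1024 * 1024)
    doc_v2 = b"New Title".ljust(SEGMENT_SIZE, b".") + b"x" * (2 * 1024 * 1024)  # only the title changed

    print(sis_bytes_stored(doc_v1, doc_v2))      # ~2 MB stored all over again
    print(segment_bytes_stored(doc_v1, doc_v2))  # 8192: just the one changed segment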
Generally, Data Domain enables
2x-4x data reduction on an initial full backup, 6x-7x reduction on subsequent
file-level incrementals, and 50x-60x reduction on subsequent full backups. SIS
offers no benefit on the initial full or on file-level incrementals, so at
that level, Data Domain deduplication is 80%-90% more efficient (meaning: that
much less storage required) than SIS.
With structured data, the gap is
even bigger. Databases change daily and are generally backed up in full
daily. SIS offers no benefit here, but Data Domain deduplication can often see
50x compression effects on this data.
What data deduplication rates are expected?
First, redundancy will vary by
application, frequency of version capture and retention policy. Significant
variables include the rate of data change (fewer changes mean more duplicate data
to eliminate), the frequency of backups (more full backups raise the compression
effect), the retention period (longer retention means more data to compare
against), and the size of the data set (more data means more to deduplicate).
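These variables can be combined into a rough back-of-the-envelope model. The sketch below is purely illustrative (the linear change model and the example parameters are assumptions, not Data Domain measurements); it estimates the cumulative compression effect of a retained series of full backups from the data set size, the change rate between fulls, and the retention period.

    def estimated_dedupe_ratio(dataset_gb, change_fraction_per_full, fulls_retained):
        """Crude model: every full backup writes the whole data set, but after the
        first full only the changed fraction is new, unique data to store."""
        logical = dataset_gb * fulls_retained
        physical = dataset_gb + dataset_gb * change_fraction_per_full * (fulls_retained - 1)
        return logical / physical

    # Example: 10 TB data set, 5% change between weekly fulls, 16 fulls retained.
    print(round(estimated_dedupe_ratio(10_000, 0.05, 16), 1))   # roughly 9x, before local compression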
When comparing different approaches, be sure to compare with a common
baseline. For example, some backup software can offer deduplication, but these
packages also run incrementals-forever backup policies; for a high-contrast
comparison, they then measure their dedupe effect against
daily-full-backup policies with very long retention. (Data Domain tends to
characterize dedupe behaviors in a daily-incremental, weekly-full backup policy
with 1-4 months of retention.)
The deduplication technology approach and
granularity of the deduplication process will also affect compression rates.
Data reduction techniques typically split each file into segments or chunks; the
segment size varies from vendor to vendor. If the segment size is very large,
then fewer segment matches will occur, resulting in smaller storage savings
(lower compression rates). If the segment size is very small, the ability to find
more redundancy in the data increases.
Vendors also differ on how to
split up the data. Some vendors split data into fixed length segments, while
others use variable length segments.
- Fixed-length segments (also called blocks). The main limitation of
this approach is that when the data in a file is shifted, for example when
adding a slide to a PowerPoint deck, all subsequent blocks in the file will be
rewritten and are likely to be considered as different from those in the
original file, so the compression effect is less significant. Smaller blocks
get better deduplication than large ones, but require more processing
to deduplicate.
- Variable-length segments. A more advanced approach is to
anchor variable-length segments based on their interior data patterns. This
solves the data-shifting problem of the fixed-size block approach (see the
sketch after this list).
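A minimal sketch of variable-length (content-defined) segmentation, assuming a simple Rabin-Karp-style rolling hash chosen purely for illustration (production systems use more refined rolling hashes and tuning): a boundary is cut wherever the hash of the last few dozen bytes matches a fixed bit pattern, so boundaries depend only on nearby content and re-align downstream of an insertion.

    import hashlib, os

    BASE, MOD = 257, 1 << 32
    WINDOW = 48                       # bytes in the rolling window
    MASK = (1 << 13) - 1              # ~8 KB average segment for this illustration
    MIN_SEG, MAX_SEG = 2 * 1024, 64 * 1024
    OUT = pow(BASE, WINDOW - 1, MOD)  # weight of the byte leaving the window

    def segments(data):
        """Cut a boundary when the rolling hash of the trailing window hits MASK."""
        start, h = 0, 0
        for i, b in enumerate(data):
            if i >= WINDOW:
                h = (h - data[i - WINDOW] * OUT) % MOD  # drop the outgoing byte
            h = (h * BASE + b) % MOD                    # add the incoming byte
            length = i - start + 1
            if length >= MIN_SEG and ((h & MASK) == MASK or length >= MAX_SEG):
                yield data[start:i + 1]
                start = i + 1
        if start < len(data):
            yield data[start:]

    # Inserting bytes near the front shifts everything, yet most segments still match.
    original = os.urandom(256 * 1024)
    shifted = os.urandom(100) + original
    fps = lambda d: {hashlib.sha256(s).hexdigest() for s in segments(d)}
    a, b = fps(original), fps(shifted)
    print(f"{len(a & b)} of {len(b)} segments unchanged despite the shift")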
What is the difference between inline vs. post-process deduplication?
Inline deduplication means the data is deduplicated before it is written to
disk (inline). Post-process deduplication analyzes and reduces data after it has
been stored to disk.
Inline deduplication is the most efficient and
economical method of deduplication. It significantly reduces the
raw disk capacity needed in the system since the full, not-yet-deduplicated data
set is never written to disk. If replication is supported as part of the inline
deduplication process, inline also optimizes time-to-DR (disaster recovery) far
beyond all other methods as the system does not need to wait to absorb the
entire data set and then deduplicate it before it can begin replicating to the
remote site.
Post-process deduplication technologies wait for the data
to land in full on disk before initiating the deduplication process. This
approach requires a greater initial capacity overhead than inline solutions. It
increases the lag time before deduplication is complete, and by extension, when
replication will complete, since it is highly advantageous to replicate only
deduplicated (small) data. In practice, it also appears to create significant
operational issues, since there are two storage zones, each with policies and
behaviors to manage. In some cases, because the non-deduplicated landing zone is
the default and primary design focus for some vendors, the dedupe zone is also
much less performant and resilient.
Is there an advantage to parsing backup formats to deduplicate?
To be
application-independent and support the broad variety of nearline applications,
it is much more straightforward to work independently of application-specific
formats. Some vendors go against this trend and are content-dependent. This
means they are locked into support of particular backup products and revisions;
they parse those formats and create an internal file system, so that when a new
file version comes in, they can compare it to its prior entry in its directory
and store only the changes, not unlike a version control system for software
development.
This approach sounds promising - it could optimize
compression tactics for particular data types, for example - but in practice it
has more weaknesses than strengths. First, it is very capital-intensive to
develop. Second, it always involves some amount of reverse engineering, and
sometimes the format originators are not supportive, so it will never be
universal. Third, it makes it hard to find redundancy across the rest of the
originating client space; it only compares versions of files from the same
client/file system, yet that broader redundancy is much larger than any
file-type compression optimization. Finally, it is hard to deploy; it can often
require additional policy set-up on a per-backup-policy or per-file-type basis.
If done right, it is onerous; if done wrong, it will leave a lot of redundancy
undeduplicated.
How does deduplication improve off-site replication and DR?
The effect
deduplication has on replication and Disaster Recovery windows can be profound.
To start, deduplication means a lot less data needs transmission to keep the DR
site up to date, so far less expensive WAN links may be used.
Second,
replication goes a lot faster because there is less data to send. The length of
the deduplication process (beginning to end) depends on many variables including
the deduplication approach, the speed of the architecture and the DR process.
For the shortest time-to-DR, inline deduplication combined with inline replication
of the deduplicated data yields the most aggressive results. In an
inline deduplication approach, replication happens during the backup,
significantly improving the time by which there is a complete restore point at
the DR site, or improving the time to DR readiness.
Typically, less than
1% of a full backup consists of new, unique data sequences, and these deduplicated
sequences can be sent over a WAN immediately upon the start of the backup. Aggressive
cross-site deduplication, when multiple sites replicate to the same destination,
can add additional value by deduplicating across all backup replication streams
and all local backups. Unique deduplicated segments previously transferred by
any remote site, or held in local backup, are then used in the deduplication
process to further improve network efficiency by reducing the data to be
vaulted. In other words, if the destination system already has a data sequence
that came from a remote site or a local backup and that same sequence is created
at another remote site, it will be identified as redundant by the Data Domain
system before it consumes bandwidth traveling across the network to the
destination system. All of the data collected at the destination site can be
safely moved off-site to a single location or multiple DR sites.
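One simple way to picture this (a hypothetical sketch, not the actual Data Domain replication protocol) is fingerprint-first replication: the source sends only segment fingerprints, the destination answers with the ones it has never seen from any remote site or local backup, and only those few segments cross the WAN.

    import hashlib

    def replicate(backup_segments, destination_index):
        """backup_segments: raw segment bytes for one backup at a remote site.
        destination_index: set of fingerprints already held at the DR destination,
        accumulated from all remote sites and from local backups."""
        fingerprints = [hashlib.sha256(s).hexdigest() for s in backup_segments]

        # Round trip 1: send fingerprints only; destination replies with what is missing.
        missing = set(fp for fp in fingerprints if fp not in destination_index)

        # Round trip 2: send only the missing segments' bytes across the WAN.
        payload = [s for s, fp in zip(backup_segments, fingerprints) if fp in missing]
        destination_index.update(missing)

        wan_bytes = sum(len(s) for s in payload)
        return fingerprints, wan_bytes   # the recipe, plus how little actually traveled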
Is deduplication of data safe?
It's very difficult to harden a storage
system so that it has the resiliency that you need to remain operational through
a drive failure or a power failure. Find out what technologies the deduplication
solution has to ensure data integrity and protection against system failures.
The system should tolerate deletions, cleaning, rebuilding a drive, multiple
drive failures, power failures - all without data loss or corruption. While this
is always important in storage, it is an even bigger consideration in data
protection with deduplication. With deduplication solutions, there may be 1,000
backup images that rely on one copy of source data. Therefore, this source data
needs to remain accessible and maintain a high level of data integrity.
While the need for data integrity is higher in deduplication storage, deduplication
also offers new opportunities for data verification. In Data Domain's case, we
take full advantage of the small resulting data size to do a complete internal
test recovery, end to end, through the file system and to the disk platter,
following each backup. There is less data to read than in a normal disk system,
so this read-after-write operation is possible.
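The mechanics can be pictured with a toy sketch (reusing the DedupeStore sketch from earlier; this only illustrates the verification idea, not Data Domain's internal implementation): write the backup, then immediately read it back end to end and compare checksums, which is practical precisely because the deduplicated data is small.

    import hashlib

    def backup_with_verification(store, name, data):
        """Write a backup, then do a full test recovery and compare checksums."""
        expected = hashlib.sha256(data).hexdigest()
        store.write(name, data)
        recovered = store.read(name)   # read-after-write, end to end
        assert hashlib.sha256(recovered).hexdigest() == expected, "verification failed"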
How will data deduplication affect my backup and restore performance?
Restore access time will be faster than tape, since it is online and random
access. Throughput will vary by vendor. Data deduplication is a
resource-intensive process. During writes, it needs to determine whether some new
small sequence of data has been stored before, often across hundreds of prior
terabytes of data. A simple index of this data is too big to fit in RAM unless
it is a very small deployment. So it needs to seek on disk, and disk seeks are
notoriously slow (and not getting better).
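A quick back-of-the-envelope calculation shows why. The figures below are assumptions picked for illustration (8 KB average segments, about 40 bytes of index per segment for a fingerprint plus location metadata), not any vendor's actual on-disk format:

    def naive_index_gb(stored_tb, avg_segment_kb=8, bytes_per_entry=40):
        """RAM a flat fingerprint index would need for the data already stored."""
        segments = stored_tb * 1024**4 / (avg_segment_kb * 1024)
        return segments * bytes_per_entry / 1024**3

    print(round(naive_index_gb(100)))   # ~500 GB of index just for 100 TB of prior data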
The easiest ways to make data
deduplication go fast are (1) to be worse at data reduction, e.g. look only for
big sequences, so you don't have to perform disk seeks as frequently; and (2) to
add more hardware, e.g. so there are more disks across which to spread the load.
Both have the unfortunate side effect of raising system price, so it becomes
less attractive against tape from a cost perspective. Vendors vary in their
approaches.
Understand:
- Single-stream backup and restore throughput. This is how
fast a given file/database can be written, read, or copied to tape for
longer-term archiving. The numbers may be different: read speed and write speed
may have separate issues. Because of backup windows for critical data, backup
throughput is what most people ask about, though restore time is more
significant for most SLAs.
- Aggregate backup/restore throughput per system. With many
streams, how fast can a given controller go? This will help gauge the number of
controllers/systems needed for your deployment. It is mostly a measure of system
management (number of systems) and cost - single-stream speed is more important for
getting the job done.
- Types of data. For example, will large files, such as
databases or Exchange stores, go slower than small files? Some deduplication
approaches look for simple tricks to increase average performance, e.g.
identifying common whole files. These approaches do not work with structured
data, which tends to be large. So the easiest big test of a dedupe system is to
see what the dedupe throughput is on big database files day over day. In some
cases, it will go slow; in others, it will get poor deduplication (e.g. by using
a very large fixed segment).
- Is the 30th backup different from the 1st? If you back up
images and delete them over time, does the performance of the system change?
Because deduplication stores new documents as many references scattered around the
store, do the recovery characteristics of a recent backup (what you'll
mostly be recovering) change a month or two into deployment vs. the first pilot?
In a well-designed deduplication system, restore of a new backup should not
change significantly a year into deployment. Surprisingly, not all vendors offer
this behavioral consistency.
Performance in your deployment will
depend on many factors, including the backup software and the systems and
networks supporting it.
Is deduplication performance determined by the number of disk drives used?
In any storage system, the disk drives are the slowest component. In order
to get greater performance, it is common practice to stripe data across a large
number of drives so they work in parallel to handle I/O. If the system uses this
method to reach performance requirements, you need to ask what the right balance
between performance and capacity is. This is important since the point of data
deduplication is to reduce the number of disk drives.
In Data Domain's
SISL (Stream-Informed Segment Layout) implementation, an inline, CPU-centric approach, very few disk drives are
needed, so its deduplication delivers on the expectation of a smaller storage
system.
How much "upfront" capacity does deduplication require?
This is not a question for inline deduplication systems, but it is for
post-process systems.
Post-process methods require additional capacity to temporarily store duplicate
backup data. How much disk capacity is needed may depend on the size of the
backup data sets, how many backup jobs you run on a daily basis, and how long
the deduplication technology "holds on" to the capacity before releasing it.
Post-process solutions that wait for the backup process to complete before
beginning to deduplicate will require larger disk caches than those that start
the deduplication process during the backup process.
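As a rough illustration of that landing-zone overhead, the sketch below uses hypothetical numbers; the real requirement depends on the variables listed above.

    def landing_zone_tb(nightly_backup_tb, days_held_before_dedupe):
        """Extra raw capacity a post-process system needs just to hold backups that
        have landed but not yet been deduplicated; inline systems need none of this."""
        return nightly_backup_tb * days_held_before_dedupe

    # Example: 5 TB of nightly backups, deduplicated within 2 days of landing.
    print(landing_zone_tb(5, 2))   # 10 TB of extra raw disk up front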
What are best practices in choosing a deduplication solution?
- Ensure ease of integration to existing environment.
- Get customer references - in your industry.
- Pilot the product/technology - in your environment.
- Understand the vendor's roadmap.