
Deduplication: Why for Backup Storage & Why not for Primary Storage?

johnm1
In the big picture, deduplication is a feature, not a function, of disk technology. Nonetheless, it is a truly disruptive gateway feature because it means you can store far more logical data on disk than you ever could before. The implications for backup data storage are tremendous; the implications for primary storage are less so.

Just for fun, let’s put the Terminator ‘Rise of the Machines’ spin on this topic. In primary storage (data spinning on disks for real-time usage), you have roughly 40-50% duplicate data (at a very detailed level) in data sets that are created by humans (files, emails, etc.), and typically much less duplicate data in data sets that are created by machines. Transaction logs, for instance, are almost entirely unique and are not going to contain much duplicate data. Video files (created by processors and software) are nearly entirely unique, unless you’re videotaping grass growing (again, stupid human behavior). Data within databases is mostly unique and machine generated, but the way humans design and deploy databases creates a large amount of duplicate data by spawning multiple copies of the same database or data sets. And that’s just for primary storage.

Then, when humans use backup software to create copies of data for purposes of data protection, huge amounts of duplicate data are created. Every additional copy you create for backup purposes is mostly a repeat of the one before it, so the amount of duplicate data multiplies with each copy you keep.
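To put a rough number on that, here is a minimal sketch of what happens when you keep a stack of full backups of a slowly changing data set. Everything in it is an assumption made for illustration: fixed-size 4 KB chunks, SHA-256 chunk fingerprints, a 1 MB data set, and the hypothetical helpers `simulate_fulls` and `dedup_ratio`. Real deduplication products chunk and index data in their own ways.

```python
import hashlib
import random

CHUNK = 4096  # assumed fixed chunk size; real products often use variable-size chunks


def rand_bytes(rng, n):
    """Return n pseudo-random bytes from a seeded generator."""
    return rng.getrandbits(8 * n).to_bytes(n, "little")


def simulate_fulls(n_backups, n_chunks=256, changed_per_backup=8, seed=0):
    """Model n full backups of a data set where a few chunks change between runs."""
    rng = random.Random(seed)
    data = bytearray(rand_bytes(rng, n_chunks * CHUNK))
    backups = []
    for _ in range(n_backups):
        backups.append(bytes(data))  # a full backup copies everything
        # Overwrite a handful of chunks in place to model change before the next run.
        for idx in rng.sample(range(n_chunks), changed_per_backup):
            data[idx * CHUNK:(idx + 1) * CHUNK] = rand_bytes(rng, CHUNK)
    return backups


def dedup_ratio(backups):
    """Logical bytes written divided by bytes of unique chunks actually stored."""
    logical = 0
    stored = {}  # chunk fingerprint -> chunk length
    for backup in backups:
        logical += len(backup)
        for i in range(0, len(backup), CHUNK):
            chunk = backup[i:i + CHUNK]
            stored.setdefault(hashlib.sha256(chunk).digest(), len(chunk))
    return logical / sum(stored.values())


if __name__ == "__main__":
    for n in (1, 10, 30):
        print(f"{n} retained fulls -> {dedup_ratio(simulate_fulls(n)):.1f}:1")
```

With these made-up parameters (8 of 256 chunks rewritten between runs, about 3% change), a single backup gets no benefit at all, 10 retained fulls work out to roughly 7.8:1, and 30 retained fulls to roughly 15.7:1. The ratio comes almost entirely from keeping copies, not from the data itself.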

Ironically, if a human applies machine-generated compression or encryption algorithms to data before it reaches the deduplication device, it won’t deduplicate at all because duplicate data patterns are scrambled and unrecognizable to deduplication algorithms. So any way you cut it, humans are sloppy and machines aren’t.
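The encryption half of that claim is easy to see in a small experiment. The sketch below takes two identical copies of the same data, counts how many 4 KB chunks they share, then "encrypts" each copy with a different nonce and counts again. The cipher here is a toy (SHA-256 in counter mode XORed over the data) written only so the example needs nothing beyond the standard library; it is not secure and is not how any particular backup product encrypts. The chunk size and the helper names are likewise assumptions.

```python
import hashlib

CHUNK = 4096  # assumed fixed chunk size


def chunk_hashes(data, size=CHUNK):
    """Set of SHA-256 fingerprints, one per fixed-size chunk."""
    return {hashlib.sha256(data[i:i + size]).digest()
            for i in range(0, len(data), size)}


def shared_chunks(a, b):
    """Number of distinct chunk fingerprints that appear in both inputs."""
    return len(chunk_hashes(a) & chunk_hashes(b))


def toy_encrypt(data, key, nonce):
    """XOR data with a SHA-256 counter keystream. Illustration only, NOT secure."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        keystream = hashlib.sha256(key + nonce + offset.to_bytes(8, "big")).digest()
        out.extend(p ^ k for p, k in zip(data[offset:offset + 32], keystream))
    return bytes(out)


def make_payload(records=20000):
    """Hypothetical human-style file: many short, similar text records."""
    return b"".join(b"record %06d: quarterly report draft\n" % i
                    for i in range(records))


if __name__ == "__main__":
    payload = make_payload()
    copy_a, copy_b = payload, payload  # two identical copies, e.g. two backups
    print("shared chunks, plain:    ", shared_chunks(copy_a, copy_b))
    enc_a = toy_encrypt(copy_a, b"secret", b"nonce-A")
    enc_b = toy_encrypt(copy_b, b"secret", b"nonce-B")
    print("shared chunks, encrypted:", shared_chunks(enc_a, enc_b))
```

In plain form every chunk of the second copy is a duplicate of the first. Once each copy is encrypted under its own nonce, the two outputs look like unrelated random data and the fingerprints stop matching, so there is nothing left for the deduplication engine to find. Compression is not modeled here; it defeats deduplication by a different mechanism.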

So in the land of primary data, the overall amount of duplicate data is roughly 10-15%, and best case 50% if you’re purely running a business on human-generated files and emails. At best, then, you’re looking at a 2:1 deduplication ratio for primary data. You can buy a disk storage device with the deduplication feature and potentially double your logical storage capacity. So if you’re about to buy a NAS device with this feature, go for it, but don’t expect any revolutionary changes to life in the world of storage management.
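The arithmetic behind those figures is simple enough to check. If a fraction d of the data is duplicate, only 1 - d of it has to be stored, so the deduplication ratio is 1 / (1 - d). The function name below is made up for this sketch, and the model assumes duplicates are removed perfectly with no metadata overhead.

```python
def ratio_from_duplicate_fraction(duplicate_fraction):
    """Deduplication ratio (logical : stored) for a given fraction of duplicate data.

    Assumes every duplicate byte is eliminated and ignores index/metadata overhead.
    """
    if not 0 <= duplicate_fraction < 1:
        raise ValueError("duplicate_fraction must be in [0, 1)")
    return 1.0 / (1.0 - duplicate_fraction)


if __name__ == "__main__":
    for d in (0.10, 0.15, 0.50):
        print(f"{d:.0%} duplicate -> {ratio_from_duplicate_fraction(d):.2f}:1")
```

At 10-15% duplicate data the ratio is only about 1.11:1 to 1.18:1, and it takes the best-case 50% to reach 2:1.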

Backup storage is a different story. Depending on data volatility, data types, the backup platform, and backup policies, deduplication ratios ranging from 5:1 to 20:1 are typical in backup storage. That’s a game changer for disk vs. tape, making disk more economically viable for backup storage. Deduplication is also a gateway feature, enabling disruptive changes such as:

* Tape reduction and/or elimination strategies
* Remote replication of deduplicated data
* Enhanced disaster recovery capabilities for backup
* Tape elimination in remote sites
* Tape media encryption avoidance strategies
* Wholesale innovations in backup that have not been possible up until now

That’s why my chips are on the machine (in this case deduplication algorithms run by machines) and backup storage. Next topic, let’s peel back the onion on backup.

-John Merryman, GlassHouse Service Director
