Deduplication and Compression – Exactly the same, but different.
One day many years ago, Lori and I’s oldest son held up two sheets of paper and said “These two things are exactly the same, but different!” Now, he’s a very bright individual, he was just young, and didn’t even get how incongruous the statement was. We, being a fun loving family that likes to tease each other on occasion, we of course have not yet let him live it down. It was honestly more than a decade ago, but all is fair, he doesn’t let Lori live down something funny that she did before he was born. It is all in good fun of course.
Why am I bringing up this family story? Because that phrase does come to mind when you start talking about deduplication and compression. Highly complimentary and very similar, they are pretty much “Exactly the same, but different”. Since these technologies are both used pretty heavily in WAN Optimization, and are growing in use on storage products, this topic intrigued me.
To get this out of the way, at F5, compression is built into the BIG-IP family as a feature of the core BIG-IP LTM product, and deduplication is an added layer implemented over BIG-IP LTM on BIG-IP WAN Optimization Module (WOM). Other vendors have similar but varied (there goes a variant of that phrase again) implementation details.
Before we delve too deeply into this topic though, what caught my attention and started me pondering the whys of this topic was that F5’s deduplication is applied before compression, and it seems that reversing the order changes performance characteristics. I love a good puzzle, and while the fact that one should come before the other was no surprise, I started wanting to know why the order it was, and what the impact of reversing them in processing might be. So I started working to understand the details of implementation for these two technologies. Not understand them from an F5 perspective, though that is certainly where I started, but try to understand how they interact and compliment each other.
While much of this discussion also applies to in-place compression and deduplication such as that used on many storage devices, some of it does not, so assume that I am talking about networking, specifically WAN networking, throughout this blog.
At the very highest level, deduplication and compression are the same thing. They both look for ways to shrink your dataset before passing it along. After that, it gets a bit more complex. If it was really that simple, after all, we wouldn’t call them two different things. Well, okay, we might, IT has a way of having competing standards, product categories, even jobs that we lump together with the same name. But still, they wouldn’t warrant two different names in the same product like F5 does with BIG-IP WOM.
The thing is that compression can do transformations to data to shrink it, and it also looks for small groupings of repetitive byte patterns and replaces them, while deduplication looks for larger groupings of repetitive byte patterns and replaces them.
In the implementation you’ll see on BIG-IP WOM, deduplication looks for larger byte patterns repeated across all streams, while compression applies transformations to the data, and when removing duplication only looks for smaller combinations on a single stream. The net result? The two are very complimentary, but if you run compression before deduplication, it will find a whole collection of small repeating byte patterns and between that and transformations, deduplication will find nothing, making compression work harder and deduplication spin its wheels.
There are other differences – because deduplication deals with large runs of repetitive data (I believe that in BIG-IP the minimum size is over a K), it uses some form of caching to hold patterns that duplicates can match, and the larger the caching, the more strings of bytes you have to compare to. This introduces some fun around where the cache should be stored. In memory is fast, but limited in size, on flash disk is fast and has a greater size, but is expensive, and on disk is slow but has a huge advantage in size. Good deduplication engines can support all three and thus are customizable to what your organization needs and can afford.
Some workloads just won’t benefit from one, but will get a huge benefit from the other. The extremes are good examples of this phenomenon – if you have a lot of in-the-stream repetitive data that is too small for deduplication to pick up, and little or no cross-stream duplication, then deduplication will be of limited use to you, and the act of running through the dedupe engine might actually degrade performance a negligible amount – of course, everything is algorithm dependent, so depending upon your vendor it might degrade performance a large amount also. On the other extreme, if you have a lot of large byte count duplication across streams, but very little within a given stream, deduplication is going to save your day, while compression will, at best, offer you a little benefit.
So yes, they’re exactly the same from the 50,000 foot view, but very very different from the benefits and use cases view. And they’re very complimentary, giving you more bang for the buck.