Friday 22 May 2020

Open source deduplication

At $WORK I have some very expensive Simplivity boxes. When you cut through all the marketing nonsense, each node is a combination of VMware, an HPE Intel server, an SSD storage array, inline block deduplication and data replication. There is some pixie dust sprinkled on top (which doesn't work well at our site) but the components I've listed here work well.

The deduplication is rather important - it gives us a compression ratio of 38:1.
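
To make a ratio like that concrete, here is a minimal sketch of how a block-level deduplication ratio can be measured: split the data into fixed-size blocks, hash each one, and divide the number of logical blocks by the number of distinct ones. The 4 KiB block size, the use of SHA-256 and the function name `dedup_ratio` are assumptions for illustration, not a description of how Simplivity does it.

```python
import hashlib


def dedup_ratio(data: bytes, block_size: int = 4096) -> float:
    """Return logical blocks / unique blocks for fixed-size blocks."""
    if not data:
        return 1.0  # nothing stored, nothing saved
    total = 0
    unique = set()
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        # Identical blocks hash to the same digest, so the set
        # only grows when a block has not been seen before.
        unique.add(hashlib.sha256(block).digest())
        total += 1
    return total / len(unique)


if __name__ == "__main__":
    # 38 identical 4 KiB blocks need only one stored copy.
    sample = b"\x00" * 4096 * 38
    print(f"{dedup_ratio(sample):.0f}:1")  # prints "38:1"
```

Real data never dedupes this cleanly, but lots of VMs built from the same OS image get surprisingly close.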

However these boxes are a bit full. Rather than add more Simplivity nodes, I'm planning on building a Proxmox cluster and moving some of our legacy and dev systems there. I've been running a POC for a couple of months and overall I'm very impressed with Proxmox.

So dedup is nice on Simplivity and works well - but can you do the same thing on Linux?

A bit of research turned up some interesting results.

BTRFS doesn't yet support inline deduplication for production usage, but it does allow for offline dedup.

animal symcbean # apt-get install dduper
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package dduper
animal symcbean # apt-get install btrfs-dedupe
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package btrfs-dedupe
animal symcbean # apt-get install bees
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package bees
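
None of those install from the stock repositories here, but the idea behind offline dedup is simple enough to sketch: scan data that has already been written, hash it block by block, and look for blocks that turn up more than once. The sketch below only reports candidates. A real tool would then ask the filesystem to make those blocks share storage, which this code does not attempt. The function name `find_duplicate_blocks` and the 128 KiB block size are assumptions for illustration and are not taken from any of the tools above.

```python
import hashlib
import os
import sys
from collections import defaultdict


def find_duplicate_blocks(root, block_size=128 * 1024):
    """Map block digest -> [(path, offset), ...] for blocks seen more than once."""
    seen = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as fh:
                offset = 0
                while True:
                    block = fh.read(block_size)
                    if not block:
                        break
                    # Only full blocks are treated as dedup candidates.
                    if len(block) == block_size:
                        digest = hashlib.sha256(block).digest()
                        seen[digest].append((path, offset))
                    offset += len(block)
    return {d: locs for d, locs in seen.items() if len(locs) > 1}


if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    dups = find_duplicate_blocks(root)
    copies = sum(len(locs) - 1 for locs in dups.values())
    print(f"{len(dups)} duplicated blocks, {copies} redundant copies")
```

Run it against a directory of VM images to get a rough feel for how much an offline pass might reclaim before committing to a filesystem.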


There is a project called lessfs which provides inline deduplication and is implemented as a FUSE filesystem. But there are things here which make me a bit uneasy. It's hosted on Sourceforge (so are some of my projects! It used to be a popular place to publish open source). 2009-2013 saw regular updates, then they just seem to have stopped. Similarly, activity on the help and support pages on Sourceforge seems to have stopped in 2013. The project website returns a 403 error. But it seems people are still using it. Could this actually be a finished piece of software that just works?

animal symcbean # apt-get install lessfs
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package lessfs


Also running as a FUSE filesystem is SDFS by OpenDeDup (I'm a bit confused about the product/branding too). This directly connects to cloud backend storage as well as block devices.

animal symcbean # apt-get install sdfs
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package sdfs


The other open source solution I have found is VDO. This runs as a kernel module rather than FUSE. But I'm struggling to find any references to it on any Linux other than Red Hat/Fedora, which is another thing I'm trying to move away from.

animal symcbean # apt-get install vdo kmod-kvdo
Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package vdo
E: Unable to locate package kmod-kvdo



ZFS seems to be flavour of the month for large scale Linux-based virtualization, but it likes a lot of memory for deduplication, is complex to configure and a LOT more complex on top of iSCSI. Although the infrastructure is not huge, it's big enough that we should separate the storage.
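
The memory appetite can be estimated on the back of an envelope. A commonly quoted rule of thumb, treated here as an assumption rather than a measured figure, is roughly 320 bytes of RAM per unique block in the ZFS dedup table. The sketch below turns that into numbers. The function name `ddt_ram_bytes`, the 320-byte entry size, and the two block sizes are illustrative parameters, so plug in your own.

```python
def ddt_ram_bytes(unique_data_bytes, record_size=128 * 1024, bytes_per_entry=320):
    """Estimate dedup table RAM: one entry per unique block.

    bytes_per_entry=320 is a rule-of-thumb assumption, not a guarantee.
    """
    blocks = -(-unique_data_bytes // record_size)  # ceiling division
    return blocks * bytes_per_entry


if __name__ == "__main__":
    TIB = 1024 ** 4
    GIB = 1024 ** 3
    # 1 TiB of unique data in 128 KiB records
    print(f"{ddt_ram_bytes(TIB) / GIB:.1f} GiB")  # prints "2.5 GiB"
    # The same data in 8 KiB blocks (an illustrative small block size)
    print(f"{ddt_ram_bytes(TIB, record_size=8 * 1024) / GIB:.1f} GiB")  # prints "40.0 GiB"
```

The estimate scales inversely with block size, which is why small-block VM storage is where the memory bill gets painful.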

For much the same reasons that I am avoiding Docker and Kubernetes, I don't want to make my software stack too sophisticated. Using a SAN/NAS appliance for storage makes my life a lot simpler.

Currently I'm leaning towards using Synology for storage. In addition to the Simplivity boxes, we have some HP MSAs. These are really nice bits of hardware and not ridiculously expensive - but they do cost enough that they need to be under warranty and that means you need to deal with HPE's support centre. Clearly these guys (in India?) are sub-contracted and have targets to reduce warranty claims. Got a 4-hour response time on your contract? Expect your hardware to get fixed in four hours? Think again. At my previous gig, it took 3 weeks to get a replacement power supply out of them. On the last two big repair exercises at my current work, we were promised that there would be no downtime / "completely transparent". Both resulted in major crashes that took a long time to recover from.  I could go on all day with stories about their support.

But the only thing worse than their support is their software.

Synology are the opposite in just about every way. Their software/user interface is a joy to use. But while their hardware is cheap, it is perhaps a little too cheap. It is cheap enough that you don't need to worry about expensive warranties and support contracts.

But using an appliance means more constraints than just the availability of the software. 

 

Update April 2022 

Recently I've switched to PBS (Proxmox Backup Server) for backing up my Proxmox VMs and containers. This de-duplicates the backups (unlike Simplivity, here the primary image is included in the de-duplication set). Strongly recommended.