FINALLY getting rid of duplicate backup files (with rdfind)

I have backups going back many years and, over the course of time, I kind of lost track of things and ended up with multiple copies of the same files in different backup directories.

Now I want to purchase a very large drive – probably a 4TB one – copy everything to it, and try to merge all the various backups where possible. On the new drive I’m really going to try to simplify things and have directories like /photos /music /video /books /documents etc. to keep this same kind of thing from happening again.
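
Setting up that kind of top-level layout is a quick one-liner – assuming, hypothetically, the new drive ends up mounted at /mnt/new:

mkdir -p /mnt/new/{photos,music,video,books,documents}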

Before doing this I need to clean up the drives and remove duplicate files, of which there are probably many.

For the past couple of days I’ve been investigating how to sensibly deal with this, and I think I may have found the best solution: a cool command-line utility called rdfind.

rdfind takes a list of paths (can be directories or files) and then does its work, comparing everything it finds and optionally deleting the ones which are duplicates.

So, for example, I have multiple drives mounted on my Linux server under /mnt/m, /mnt/s, and /mnt/v. I can then tell rdfind to go to work on them with:

rdfind /mnt/m /mnt/s /mnt/v

But – here’s one thing, and it’s a good thing – I will just quote from the rdfind manual page:

Given two or more equal files, the one with the highest rank is selected to be the original and the rest are duplicates. The rules of ranking are given below, where the rules are executed from start until an original has been found. Given two files A and B which have equal content, the ranking is as follows:

If A was found while scanning an input argument earlier than B, A is higher ranked.

If A was found at a depth lower than B, A is higher ranked (A closer to the root)

If A was found earlier than B, A is higher ranked.

The last rule is needed when two files are found in the same directory (obviously not given in separate arguments, otherwise the first rule applies) and gives the same order between the files as the operating system delivers the files while listing the directory. This is operating system specific behaviour.

In order to do a dry run and test what rdfind would actually do, use the -n true switch (an alias for -dryrun true). Thus:

rdfind -n true -outputname "rdfind-`date +%Y.%m.%d`-1.log" /mnt/v /mnt/s /mnt/m

will run rdfind in dry-run mode and create an output file like rdfind-2014.08.08-1.log. Putting /mnt/v first means that files found there will be ranked more highly if duplicates happen to exist on /mnt/s or /mnt/m, since /mnt/v is an earlier input argument.

Likewise, if duplicates are found within /mnt/v itself then, based on the criteria above, it will rank the ones at a shallower depth (closer to the root of the drive) higher.

What is great about this ranking is that I can basically have rdfind prioritize whatever drive (or even directories) I want just by specifying them in order.

One caveat: be careful not to specify the same directory more than once, which could result in deletion of the originals and hence data loss. rdfind is a powerful tool and should be used with caution. To avoid this possibility I’m just giving rdfind entire drives as arguments, so there is no chance of it happening. If you’re really curious and want to test it, make a few test directories and try a few runs to see how it really works in different situations – something like the sketch below.
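
This is purely hypothetical – the paths and file names are just examples – but a test run like this shows the ranking in action:

mkdir -p /tmp/rdtest/a /tmp/rdtest/b
echo "same content" > /tmp/rdtest/a/file1
cp /tmp/rdtest/a/file1 /tmp/rdtest/b/file2
rdfind -n true -deleteduplicates true /tmp/rdtest/a /tmp/rdtest/b

Since /tmp/rdtest/a is the earlier input argument, file1 should be reported as the original and file2 as the would-be-deleted duplicate; swap the argument order and the roles reverse.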

One other cool thing about rdfind is its efficiency: according to the manual page it only computes a checksum when necessary. There’s also another page, “Rdfind – redundant data find”, which explains the algorithm it uses in more detail and why it is so efficient. One cool thing on that page is the benchmark where you can see how rdfind seriously kicks ass compared to a couple of other alternatives.
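
From what I gather, the trick is that rdfind first eliminates candidates using cheap comparisons – file size, then first and last bytes – and only checksums the files that still collide. Just to illustrate the size-first idea (a rough sketch of the concept, not how rdfind actually implements it), here is how you could find same-size candidates on a drive and checksum only those:

# list every file with its size, sorted so equal sizes end up adjacent
find /mnt/m -type f -printf '%s %p\n' | sort -n > sizes.txt
# keep only the files whose size occurs more than once
awk 'NR==FNR { count[$1]++; next } count[$1] > 1' sizes.txt sizes.txt > candidates.txt
# checksum only those candidates and group identical hashes together
cut -d' ' -f2- candidates.txt | xargs -r -d '\n' md5sum | sort | uniq -w32 --all-repeated=separate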

Finally, since rdfind is likely to run for a long time, I like to run it in a screen session. screen is an extremely useful and important utility exactly for this type of thing: it ensures that a command will keep running even if a console window is closed.

I like to name my screen sessions, which is as simple as specifying -S sessionname when invoking it, e.g. screen -S rdfind

Once the rdfind command with its arguments is invoked, the screen session can be detached via Ctrl-a d and reattached later with screen -r rdfind (see the screen manual page for full info).
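
So the whole workflow looks something like this:

screen -S rdfind        # start a named screen session
# ... run the long rdfind command inside it ...
# press Ctrl-a d to detach; the job keeps running
screen -r rdfind        # reattach later to check on progress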

After removing duplicates and merging everything together there will no doubt be a lot of cleanup work that will need to be done. But this is a big first step.

Here is the output of the actual run of rdfind I did on the three drives using the command:

sudo rdfind -deleteduplicates true -outputname "rdfind-`date +%Y.%m.%d`-1.log" /mnt/v /mnt/s /mnt/m
(Run from within a screen session; I would strongly advise always running rdfind in a screen session for long jobs like this.)

[Screenshot: console output from the rdfind command]

It took a good couple of days for the command to run and complete. During the long phase where it’s finding duplicates by calculating checksums it will eat up a lot of disk bandwidth, so if you’re trying to use the same disks to watch videos or do anything else disk-intensive you might notice some issues. I advise shutting down unnecessary services such as bittorrent servers during this time to speed things up as much as possible.
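
For example, assuming (hypothetically) your torrent client runs as a service named transmission-daemon:

sudo service transmission-daemon stop     # pause it during the rdfind run
sudo service transmission-daemon start    # bring it back afterwards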

Finally, here is the amazing result of the enormous amount of space saved on the three drives:

[Screenshot: disk space on the three drives before and after running rdfind]

Before this I was concerned that if I were to upgrade and consolidate all three drives onto one new 4TB external eSATA drive there would not be enough space. Now, after running rdfind, it is totally viable, and after consolidating the drives I should still have well over 1 TB of free space. Muy muy!
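
If you want to compare the before-and-after numbers on your own drives, plain old df does the job:

df -h /mnt/m /mnt/s /mnt/v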

If you want to see the actual list of the duplicate files deleted by rdfind, in descending order by file size, you can use a command like this on the output file:

cut -d ' ' -f 4,8-40 rdfind-2014.08.13-1.log | sort -rn | less
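
This works because of the column layout of rdfind’s results file – if memory serves, each line is duptype, id, depth, size, device, inode, priority, and then the file name, so field 4 is the size and fields 8 onward catch the path even if it contains spaces. You can confirm the layout from the commented header lines at the top of the log:

head -5 rdfind-2014.08.13-1.log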

And, finally, after rdfind has cleaned things up, it’s a good idea to update the database for the extremely powerful and useful locate command with:

sudo updatedb