lsdupes is a command-line tool for finding and acting on files that have identical content.
$ lsdupes --help
lsdupes v1.2
Copyright 2006-2010 Ben Clifford. BSD-like licence
  -h, -?  --help    show help
  -r      --for-rm  rm friendly output
  -0      --zero    NUL-terminated output
  -i      --inode   inode mode
Find duplicate files and delete all except one:
$ lsdupes --for-rm | sh

This keeps your directory tree tidy. I use it when copying everything off my camera into a new directory, but wanting to de-dupe against copies that I've had before.
Find duplicate files and make them all hardlinks to the same inode:
$ lsdupes --inode | sh

This preserves everything in the same place, but saves the disk space used by duplicate files. Note that because all the files are now backed by the same inode, modifying any one of them may cause changes to appear everywhere (depending on how your editor implements 'save').
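The space saving works because a hardlink adds a second directory entry for an existing inode rather than a second copy of the data. A minimal Python sketch of the replacement step (hypothetical; the real tool emits shell commands rather than doing this itself):

```python
import os

def relink(keep, duplicate):
    """Replace `duplicate` with a hardlink to `keep`'s inode.

    Afterwards both names refer to the same data blocks, so the
    duplicate's content no longer occupies separate disk space.
    """
    os.remove(duplicate)      # drop the duplicate's own inode
    os.link(keep, duplicate)  # new directory entry -> keep's inode
```

After `relink(a, b)`, `os.stat(a).st_ino == os.stat(b).st_ino`, which is exactly the shared-inode situation described above.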
Prerequisites: ghc and Parsec (install with something like: apt-get / yum / port install ghc libghc6-parsec-dev)
$ wget http://www.hawaga.org.uk/ben/tech/lsdupes/lsdupes-1.2.tar.gz
$ tar xzf lsdupes-1.2.tar.gz
$ cd lsdupes-1.2
$ ./configure
$ make
$ make install
Licence: BSD-like
Traditional duplicate detection is done in two passes:
A first pass eliminates files with different sizes from consideration. This is fast, as it requires no access to file content, and it is almost always decisive for many kinds of data (for example, in a tree mostly containing digital camera pictures, only 7 of 3910 files were non-identical but had the same size).
A second pass uses the md5sum of the file content. This is much more expensive per file, as it involves reading the entire content of each file still under consideration; however, it is only performed for the (usually much smaller) set of files that were not de-duped in the first pass.
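The two passes above can be sketched in Python (a standalone illustration of the technique, not lsdupes's actual Haskell implementation):

```python
import hashlib
import os
from collections import defaultdict

def find_duplicates(paths):
    """Return groups of paths whose contents are identical."""
    # Pass 1: bucket files by size. A file with a unique size cannot
    # have a duplicate, so it is dropped without reading any content.
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)

    # Pass 2: for same-size candidates only, bucket by MD5 of the
    # content. This is the expensive step, since it reads whole files.
    groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        by_hash = defaultdict(list)
        for p in same_size:
            with open(p, 'rb') as f:
                by_hash[hashlib.md5(f.read()).hexdigest()].append(p)
        groups.extend(g for g in by_hash.values() if len(g) > 1)
    return groups
```

Note how files that survive pass 1 with the same size can still be separated in pass 2, matching the camera-picture example above.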
The --inode mode is similar, but takes into account the inode of each file, in addition to its filename.
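For instance, two names that already share an inode need no further work, and this can be detected cheaply via stat. A hedged sketch (the stat fields are standard POSIX, but treating this as lsdupes's exact logic is an assumption):

```python
import os

def already_linked(path_a, path_b):
    """True if the two names are already hardlinks to the same file."""
    sa, sb = os.stat(path_a), os.stat(path_b)
    # The same inode number on the same device means the same
    # underlying file, regardless of the names involved.
    return (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
```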
I welcome any feedback (positive or negative or even just to say that you actually downloaded and used it): benc@hawaga.org.uk
md5deep has related functionality, taking hashes of many files in a tree; however, lsdupes is intended specifically for eliminating duplicate files, so it resists taking expensive hashes when a cheaper measurement (the size) says there can be no duplicate for a file.