lsdupes - find duplicate files

about

lsdupes is a command-line tool that finds files with identical content and emits shell commands to act on them: deleting the duplicates, or hardlinking them all to a single inode.

$ lsdupes --help
lsdupes v1.2
Copyright 2006-2010 Ben Clifford. BSD-like licence
  -h, -?  --help    show help
  -r      --for-rm  rm friendly output
  -0      --zero    NUL-terminated output
  -i      --inode   inode mode

examples

Find duplicate files and delete all except one:

$ lsdupes --for-rm | sh
This keeps your directory tree tidy. I use it when copying everything off my camera into a new directory and then de-duping against copies I already had.
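
If you would rather review what is about to be deleted, capture the generated commands in a file first instead of piping straight to sh. This uses only the --for-rm flag shown above (rm-dupes.sh is just an arbitrary filename):

 $ lsdupes --for-rm > rm-dupes.sh
 $ less rm-dupes.sh
 $ sh rm-dupes.sh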

Find duplicate files and make them all hardlinks to the same inode:

$ lsdupes --inode | sh
This preserves everything in the same place, but saves the disk space used by duplicate files. Note that because all the files are now backed by the same inode, modifying any one of them may change all of them (depending on how your editor implements 'save': writing in place updates every name, while writing a new file and renaming it over the old one breaks the link).
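
As a quick illustration of the shared-inode behaviour, in plain shell and independent of lsdupes: redirection writes in place through one name, and the change shows up under the other.

 $ echo hello > a.txt
 $ ln a.txt b.txt
 $ echo changed > a.txt
 $ cat b.txt
 changed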

download

prerequisites: ghc and parsec (type something like: apt-get install ghc libghc6-parsec-dev, or the equivalent yum or port command)

 $ wget http://www.hawaga.org.uk/ben/tech/lsdupes/lsdupes-1.2.tar.gz
 $ tar xzf lsdupes-1.2.tar.gz
 $ cd lsdupes-1.2
 $ ./configure
 $ make
 $ make install

Licence: BSD-like

how?

Duplicate detection follows the traditional two-pass approach:

A first pass eliminates files with different sizes from consideration. This is fast, as it requires no access to file content, and for many kinds of data it removes almost every candidate (for example, in a tree of 3910 files, mostly digital camera pictures, only 7 non-identical files share a size with another file).

A second pass compares the md5sum of file content. This is much more expensive per file, as it involves reading the entire content of each file still under consideration; however, it is only performed for the (usually much smaller) set of files that the first pass did not eliminate.
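
As an illustration of that two-pass structure (lsdupes itself is written in Haskell, hence the ghc prerequisite above, but this sketch is mine and not the real source; it assumes the pureMD5 package for hashing and getFileSize from the directory package):

 import qualified Data.ByteString.Lazy as BL
 import qualified Data.Map.Strict as M
 import Data.Digest.Pure.MD5 (md5)
 import System.Directory (getFileSize)

 -- Group keyed values into buckets that share a key.
 buckets :: Ord k => [(k, v)] -> [[v]]
 buckets = M.elems . M.fromListWith (++) . map (\(k, v) -> (k, [v]))

 -- Pass 1: bucket paths by size; singleton buckets cannot contain
 -- duplicates, so they are dropped without reading any content.
 -- Pass 2: hash only the survivors; buckets still holding more than
 -- one path are sets of identical files.
 findDupes :: [FilePath] -> IO [[FilePath]]
 findDupes paths = do
     sized <- mapM (\p -> do s <- getFileSize p; return (s, p)) paths
     let candidates = filter ((> 1) . length) (buckets sized)
     hashed <- mapM hashBucket candidates
     return (concat hashed)
   where
     hashBucket ps = do
         hs <- mapM (\p -> do c <- BL.readFile p; return (md5 c, p)) ps
         return (filter ((> 1) . length) (buckets hs))

The real tool additionally has to choose which file in each set survives, and emit the rm or ln commands, but size-then-hash is the essential shape.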

The --inode mode is similar but takes into account the inode of each file, in addition to its filename.
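
For reference, per-file inode numbers are available from Haskell's unix package; a one-line helper (the name inodeOf is mine, not part of lsdupes) might look like:

 import System.Posix.Files (getFileStatus, fileID)
 import System.Posix.Types (FileID)

 -- Paths on the same filesystem that report equal FileIDs are
 -- already hardlinks of one another.
 inodeOf :: FilePath -> IO FileID
 inodeOf path = fmap fileID (getFileStatus path)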

feedback

I welcome any feedback (positive or negative, or even just a note to say that you downloaded and used it): benc@hawaga.org.uk

related

md5deep has related functionality, taking hashes of many files in a tree. However, lsdupes is aimed specifically at eliminating duplicate files, so it avoids computing expensive hashes whenever a cheaper measurement (the size) shows that a file can have no duplicate.